
Comprehensive Survey of AIGC Research: Papers, Resources, and Technical Overview

This survey acts as a comprehensive portal that organizes AIGC research across seven domains—text, image, and audio generation, cross‑modal association, text‑guided image and audio synthesis, and supporting resources—detailing seminal models such as GPT, Diffusion, CLIP, DALL·E, Stable Diffusion, MusicLM, and key papers that shaped each field.

Tencent Cloud Developer

This article serves as a comprehensive "portal" document for organizing and guiding researchers through the AIGC (AI-Generated Content) landscape. It covers seven major areas of multimodal AI research:

1. Single Modality: Text Recognition and Generation

Focuses primarily on the GPT family of models, including GPT-1, GPT-2, GPT-3, and InstructGPT. Key papers covered include Efficient Training of Language Models to Fill in the Middle, Text and Code Embeddings by Contrastive Pre-Training, WebGPT, Training Verifiers to Solve Math Word Problems, Codex (Evaluating Large Language Models Trained on Code), and Learning to Summarize from Human Feedback.
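To make the family's core mechanism concrete, here is a minimal autoregressive-generation sketch using Hugging Face's transformers library, with the small open "gpt2" checkpoint standing in for the API-only GPT-3/InstructGPT models; the library choice, prompt, and sampling settings are illustrative assumptions, not details from the survey.

```python
# Minimal autoregressive generation with a small open GPT-family model.
# "gpt2" is a stand-in for the larger, API-only models the survey discusses.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "AI-generated content (AIGC) refers to"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 40 new tokens, one at a time, each conditioned on all
# previously generated tokens -- the core GPT decoding loop.
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```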

2. Single Modality: Image Recognition and Generation

Covers the transition from GANs to diffusion models. Key architectures include ResNet, Sparse Transformers, MoCo v1/v2/v3, ViT (Vision Transformer), MAE (Masked Autoencoders), VAE, VQ-VAE, VQ-VAE-2, VideoGPT, U-Net, DDPM, Improved DDPM, and GLIDE. The article explains that image generation models follow an "image feature extractor + generator" paradigm.
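As a concrete instance of the generator side, the following is a sketch of the DDPM training objective in PyTorch: corrupt a clean image at a random timestep using the closed-form forward process, then regress the injected noise. The `eps_model` argument is a hypothetical placeholder for a noise-prediction network (typically a U-Net); the schedule values follow the original DDPM paper.

```python
# Sketch of the DDPM training objective. Forward process in closed form:
#   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative product, alpha_bar_t

def ddpm_loss(eps_model, x0):
    """One training step; x0 is a batch of clean images (B, C, H, W)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)   # random timestep per sample
    eps = torch.randn_like(x0)                        # the noise to be predicted
    a_bar = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    # Train the network to recover the noise from the noised image.
    return F.mse_loss(eps_model(x_t, t), eps)
```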

3. Single Modality: Audio Recognition and Generation

Highlights Whisper for speech recognition (trained on 680,000 hours of paired speech and text, with strong zero-shot transcription ability) and Jukebox for music generation built on VQ-VAE. Other papers include Conformer, wav2vec, wav2vec 2.0, and SingSong.
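A minimal transcription sketch using OpenAI's open-source whisper package (pip install openai-whisper) is below; the model size and audio path are placeholder choices, not from the article.

```python
# Zero-shot speech-to-text with Whisper's open-source package.
import whisper

model = whisper.load_model("base")       # sizes: tiny / base / small / medium / large
result = model.transcribe("audio.mp3")   # language detection + decoding in one call
print(result["text"])
```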

4. Cross-modal Association

Centers on CLIP's approach: image-text pairing plus contrastive learning. Papers include CLAP (audio-text), ViLT, LSeg, GroupViT, ViLD, GLIP, CLIPasso, CLIP4Clip, ActionCLIP, AudioCLIP, PointCLIP, and research on multimodal neurons.
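The pairing-plus-contrastive-learning recipe reduces to a short symmetric loss. Here is a sketch in PyTorch, assuming batched, already-encoded image and text embeddings (the encoders themselves are omitted); the function name and temperature value are illustrative.

```python
# CLIP-style symmetric contrastive loss: for N paired images and texts,
# the N matching pairs are positives; every other entry in each row and
# column of the similarity matrix is a negative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (N, D) outputs of the two encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) cosine similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```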

5. Cross-modal: Text-guided Image Generation

Covers the evolution from DALL·E (VQ-VAE-2 + GPT) to DALL·E 2 (GLIDE-based, with CLIP guidance) to Stable Diffusion/Latent Diffusion. Other models include NÜWA, ERNIE-ViLG, CogView, CogView2, CogVideo, Imagen, and Imagen Video.
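For a feel of the latent-diffusion pipeline in practice, here is a usage sketch with Hugging Face's diffusers library; the checkpoint name is one of several public Stable Diffusion releases and an assumption on my part, not something the survey specifies.

```python
# Text-to-image with Stable Diffusion via diffusers
# (pip install diffusers transformers accelerate).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Denoising runs in the VAE's latent space (the "latent diffusion" idea),
# conditioned on CLIP text embeddings of the prompt.
image = pipe("an astronaut riding a horse, oil painting").images[0]
image.save("astronaut.png")
```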

6. Cross-modal: Text-guided Audio Generation

Features MusicLM, and also covers AudioLDM, Moûsai, and neural codec language models for text-to-speech (TTS).
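As an illustration of text-guided audio generation, here is a sketch using the AudioLDM pipeline shipped in recent releases of Hugging Face's diffusers; the checkpoint name, prompt, and step count are assumptions for demonstration, not details from the article.

```python
# Text-to-audio with AudioLDM via diffusers' AudioLDMPipeline.
import scipy.io.wavfile
import torch
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16  # assumed public checkpoint
).to("cuda")

audio = pipe(
    "gentle piano melody with soft rain in the background",
    num_inference_steps=50,
    audio_length_in_s=5.0,
).audios[0]

# AudioLDM generates 16 kHz waveforms as NumPy arrays.
scipy.io.wavfile.write("sample.wav", rate=16000, data=audio)
```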

7. Additional Resources

Mentions OpenAI Microscope for visualizing model internals and lucidrains' GitHub repositories for high-quality open-source implementations.

Tags: multimodal AI, computer vision, text-to-image, AIGC, diffusion models, NLP, generative AI, CLIP, GPT, text-to-audio
Written by Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
