Visual Capability as a Fundamental Requirement for AGI and the SEEChat Multimodal Dialogue Model
The article reviews why visual capability is essential for artificial general intelligence, compares the native-multimodal and expert-stitching integration routes, surveys the architectures of KOSMOS-1, PaLM-E, Flamingo, BLIP-2, LLaVA, and MiniGPT-4, and introduces the SEEChat project, which fuses a CLIP vision encoder with ChatGLM-6B via a projection layer, presenting its training pipeline, experimental results, and future directions.
2022 was hailed as the "year of AIGC": AI painting and ChatGPT sparked global interest, and ChatGPT's unprecedented logical and reasoning abilities revived expectations for artificial general intelligence (AGI), prompting many organizations to explore AI-driven productivity.
GPT-4, released on March 15, 2023, extends the text-only GPT-3.5 with support for the visual modality, enabling image understanding paired with natural-language generation. This opens new applications in e-commerce, entertainment, and game design, and demonstrates a prototype of an AGI with visual capability.
Integrating vision into large language models (LLMs) follows two main routes: (1) native multimodal models, designed from the ground up for multimodal data (e.g., MSRA's KOSMOS-1 and Google Robotics' PaLM-E); and (2) expert-stitching models that connect pre-trained vision experts to pre-trained language models via bridge layers (e.g., DeepMind's Flamingo, Salesforce's BLIP-2, LLaVA, and MiniGPT-4).
KOSMOS‑1 uses a CLIP ViT‑L/14 image encoder and a 24‑layer Transformer for multimodal training on text, image‑text pairs, and interleaved image‑text data, concatenating image embeddings with text using the format <s><image>Image Embedding</image>...</s> .
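The interleaved format above can be sketched in a few lines. The following is a toy illustration only: the embedding width, the number of embedding vectors per image, and the stand-in embedders are all invented for the sketch, not KOSMOS-1's real dimensions or encoders.

```python
import numpy as np

EMBED_DIM = 8        # toy width; the real model is far larger
NUM_IMG_TOKENS = 4   # toy count of image embedding vectors per image

def embed_text(tokens):
    """Stand-in text embedder: one deterministic vector per token."""
    return [np.random.default_rng(abs(hash(t)) % 2**32).normal(size=EMBED_DIM)
            for t in tokens]

def embed_image(image):
    """Stand-in vision encoder: NUM_IMG_TOKENS vectors per image."""
    rng = np.random.default_rng(abs(hash(image)) % 2**32)
    return [rng.normal(size=EMBED_DIM) for _ in range(NUM_IMG_TOKENS)]

def build_sequence(segments):
    """Splice text tokens and image embeddings into one stream:
    <s> ... <image> image embeddings </image> ... </s>"""
    seq = embed_text(["<s>"])
    for kind, payload in segments:
        if kind == "text":
            seq += embed_text(payload.split())
        else:  # an image segment, wrapped in boundary tokens
            seq += (embed_text(["<image>"]) + embed_image(payload)
                    + embed_text(["</image>"]))
    seq += embed_text(["</s>"])
    return np.stack(seq)  # (seq_len, EMBED_DIM), fed to the Transformer

seq = build_sequence([("text", "a photo of"), ("image", "cat.jpg")])
# 1 + 3 + 1 + 4 + 1 + 1 = 11 positions in the unified stream
```

The key point the sketch shows is that after splicing, the Transformer sees a single sequence of vectors and cannot distinguish a visual position from a textual one except by the boundary tokens.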
PaLM-E shares a similar architecture but initializes its language component from the PaLM model and injects a robot state vector, marked by a special <emb> token, into the multimodal stream.
Native multimodal approaches achieve high performance when abundant data and compute are available, but they cannot fully reuse advances from single‑modal domains and require massive resources.
Expert-stitching approaches reuse existing vision and language models, sharply reducing training cost. Flamingo freezes both the vision encoder and the language model, adding cross-attention layers for alignment. BLIP-2 connects a frozen CLIP ViT-G/14 vision encoder to a frozen FLAN-T5 language model via a Q-Former bridge, training on only 129M image-text pairs for nine days on 16 A100 GPUs. LLaVA simplifies this further with a single projection layer linking CLIP ViT-L/14 to the Vicuna LLM, using just 595K image-text pairs and 158K instruction-tuning samples; MiniGPT-4 builds on BLIP-2's components and adds a projection layer to Vicuna, training on 5M image-text pairs plus 3.5K instructions.
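The single-projection-layer design used by LLaVA (and, on top of the Q-Former, by MiniGPT-4) amounts to a learned linear map from the vision encoder's output space into the LLM's token-embedding space. A minimal sketch with toy dimensions; real models use, for example, 1024-d CLIP features and the LLM's hidden size:

```python
import numpy as np

VISION_DIM = 6   # toy CLIP feature width (e.g. 1024 in practice)
LLM_DIM = 10     # toy LLM embedding width

rng = np.random.default_rng(42)
# The only trainable bridge in this design: a single linear map W, b.
W = rng.normal(scale=0.02, size=(VISION_DIM, LLM_DIM))
b = np.zeros(LLM_DIM)

def project(vision_features):
    """Map frozen-encoder patch features (n_patches, VISION_DIM)
    into the frozen LLM's embedding space (n_patches, LLM_DIM)."""
    return vision_features @ W + b

patch_features = rng.normal(size=(5, VISION_DIM))  # 5 visual "tokens"
visual_tokens = project(patch_features)
# The LLM consumes these exactly like ordinary token embeddings.
```

Because both experts are frozen, gradient updates touch only `W` and `b`, which is why these models train on orders of magnitude less data than native multimodal models.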
While stitching reduces cost and speeds deployment, shallow-fusion models (BLIP-2, LLaVA, MiniGPT-4) often lack the deep multimodal in-context learning capabilities of native models like KOSMOS-1 and PaLM-E.
The SEEChat project takes the expert-stitching route to fuse visual capability into an existing LLM, ChatGLM-6B. Its architecture (see Figure 6) bridges a CLIP ViT-L/14 vision encoder and ChatGLM-6B via a projection layer.
SEEChat v1.0 is trained in two stages: (1) image-text alignment on the high-quality Chinese Zero dataset (23M samples); (2) instruction alignment on instruction data translated from MiniGPT-4 and LLaVA.
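Under the usual expert-stitching convention, both frozen experts stay fixed in both stages and only the projection layer is updated; what changes between stages is the data. A sketch of that plan, with the module split assumed from the convention above and the dataset strings as illustrative labels, not SEEChat's actual code:

```python
def training_plan(stage):
    """Return (dataset, trainable modules) for a SEEChat-style stage.
    Dataset descriptions follow the text; the module split is the
    common expert-stitching convention, assumed here for illustration."""
    datasets = {
        1: "Zero Chinese image-text pairs (23M) for image-text alignment",
        2: "instruction data translated from MiniGPT-4 and LLaVA",
    }
    # CLIP ViT-L/14 and ChatGLM-6B stay frozen; only the bridge trains.
    trainable = ["projection"]
    return datasets[stage], trainable

data, modules = training_plan(1)
```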
Demonstrations (Figures 7‑9) show SEEChat’s abilities in image‑text dialogue, code generation, and object classification, highlighting strong visual‑language understanding alongside chatGLM’s conversational skills.
Compared with other Chinese multimodal models such as X‑LLM and VisualGLM, SEEChat v1.0 achieves higher image‑captioning relevance scores on the Zero dataset, evaluated with ChineseCLIP to avoid bias from overlapping training data.
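CLIP-based relevance scoring of this kind reduces to cosine similarity between the image embedding and the caption's text embedding, both produced by the same CLIP model. A minimal sketch of the metric; the vectors below are toy stand-ins, not real ChineseCLIP outputs:

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """Cosine similarity between L2-normalized CLIP embeddings;
    higher means the caption matches the image better."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

img = np.array([0.2, 0.9, 0.1])               # toy image embedding
good_caption = np.array([0.25, 0.85, 0.05])   # nearby in embedding space
bad_caption = np.array([0.9, -0.2, 0.4])      # unrelated direction
# A relevant caption scores higher than an irrelevant one:
assert clip_score(img, good_caption) > clip_score(img, bad_caption)
```

Using an independent scorer such as ChineseCLIP, rather than the models' own training encoder, is what avoids the overlap bias mentioned above.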
Future work includes adding object detection, cross‑modal capabilities, open‑vocabulary detection, and transitioning from shallow to deep fusion strategies.
References:
[1] Huang, Shaohan, et al. "Language is not all you need: Aligning perception with language models." arXiv preprint arXiv:2302.14045 (2023).
[2] Driess, Danny, et al. "PaLM-E: An embodied multimodal language model." arXiv preprint arXiv:2303.03378 (2023).
[3] Alayrac, Jean‑Baptiste, et al. "Flamingo: a visual language model for few‑shot learning." Advances in Neural Information Processing Systems 35 (2022): 23716‑23736.
[4] Li, Junnan, et al. "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." arXiv preprint arXiv:2301.12597 (2023).
[5] Liu, Haotian, et al. "Visual instruction tuning." arXiv preprint arXiv:2304.08485 (2023).
[6] Zhu, Deyao, et al. "MiniGPT-4: Enhancing vision-language understanding with advanced large language models." arXiv preprint arXiv:2304.10592 (2023).
[7] Zero, https://zero.so.com/
[8] Chen, Feilong, et al. "X‑LLM: Bootstrapping Advanced Large Language Models by Treating Multi‑Modalities as Foreign Languages." arXiv preprint arXiv:2305.04160 (2023).
[9] VisualGLM, https://github.com/THUDM/VisualGLM-6B
[10] ChineseCLIP, https://github.com/OFA-Sys/Chinese-CLIP
360 Tech Engineering
The official tech channel of 360, building a professional technology-sharing platform for the brand.