MiniGPT-4: Enhancing Vision‑Language Understanding with Large Language Models
This article presents MiniGPT-4, a multimodal system that combines a frozen visual encoder (Q‑Former + ViT) with an open‑source large language model (Vicuna), describes its motivation, training pipeline, demo capabilities, observed limitations, and includes a brief Q&A session.