DeepSeek Releases Janus‑Pro‑7B Multimodal Model, Beats DALL‑E 3 and Stable Diffusion on Benchmarks
DeepSeek's newly released Janus‑Pro‑7B multimodal model, open‑sourced overnight, outperforms DALL‑E 3 and Stable Diffusion on GenEval and DPG‑Bench, showcases a unified autoregressive architecture with a SigLIP‑L visual encoder, and has sparked massive user adoption and market reactions worldwide.
DeepSeek announced the overnight release of Janus‑Pro‑7B, a multimodal large language model that is immediately open‑source. The model builds on DeepSeek‑LLM‑1.5b‑base and DeepSeek‑LLM‑7b‑base, using a unified autoregressive framework that integrates visual and textual processing.
The visual encoder is decoupled as a separate path, employing SigLIP‑L to handle 384×384 images, while a VQ tokenizer from LlamaGen converts images into discrete IDs with a down‑sampling factor of 16. These IDs are flattened, embedded, and concatenated with text tokens before being fed to the LLM.
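The arithmetic behind this pipeline is worth making concrete. A minimal sketch, assuming hypothetical helper names (`image_token_count`, `build_sequence`, and the `image_vocab_offset` shift are illustrative, not DeepSeek's actual API), of how a 384×384 image turns into a discrete-ID sequence concatenated with text tokens:

```python
# Illustrative sketch of the token-sequence construction described above.
# All names and the vocab-offset scheme are assumptions, not DeepSeek's API.

IMAGE_SIZE = 384   # SigLIP-L input resolution
DOWNSAMPLE = 16    # VQ tokenizer down-sampling factor

def image_token_count(image_size: int = IMAGE_SIZE,
                      downsample: int = DOWNSAMPLE) -> int:
    """A 384x384 image becomes a (384/16) x (384/16) = 24x24 grid of codes."""
    side = image_size // downsample
    return side * side  # 576 discrete IDs per image

def build_sequence(image_ids: list[int], text_ids: list[int],
                   image_vocab_offset: int = 100_000) -> list[int]:
    """Flatten the grid of VQ codes, shift them into a separate ID range so
    they cannot collide with text-token IDs, then concatenate with text."""
    shifted = [i + image_vocab_offset for i in image_ids]
    return shifted + text_ids

grid = list(range(image_token_count()))  # stand-in for real VQ codes
seq = build_sequence(grid, [101, 102, 103])
print(image_token_count())  # 576 image tokens per picture
print(len(seq))             # 579 total tokens fed to the LLM
```

The key point the sketch illustrates: one image costs a fixed 576 positions of the LLM's context, regardless of image content.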
Training has been revised in two major ways: Stage I now includes many more steps on ImageNet to better model pixel dependencies, and Stage II abandons ImageNet in favor of dense text‑to‑image data, improving generation quality. In the final supervised fine‑tuning stage, the data mix was adjusted from a 7:3:10 ratio (multimodal:text‑only:text‑image) to 5:1:4, slightly reducing text‑to‑image data to boost multimodal understanding without sacrificing visual generation.
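To see what the ratio change means in practice, the mix can be converted into per-source sampling probabilities. A minimal sketch (the dict keys and `mix_probabilities` helper are illustrative, not from the Janus-Pro paper):

```python
# Illustrative only: turns a data-mix ratio into sampling probabilities.
from fractions import Fraction

def mix_probabilities(ratio: dict[str, int]) -> dict[str, Fraction]:
    """Normalize ratio parts so they sum to 1."""
    total = sum(ratio.values())
    return {name: Fraction(part, total) for name, part in ratio.items()}

old_mix = mix_probabilities({"multimodal": 7, "text_only": 3, "text_to_image": 10})
new_mix = mix_probabilities({"multimodal": 5, "text_only": 1, "text_to_image": 4})

print(old_mix["text_to_image"])  # 1/2 of SFT data was text-to-image
print(new_mix["text_to_image"])  # drops to 2/5 under the new mix
print(new_mix["multimodal"])     # multimodal rises to 1/2
```

Under the new mix, multimodal understanding data grows from 35% to 50% of the SFT batch, while text‑to‑image data shrinks only from 50% to 40%, which matches the stated goal of boosting understanding without sacrificing generation.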
Benchmarks show Janus‑Pro‑7B surpassing DALL‑E 3 and Stable Diffusion on GenEval and DPG‑Bench, achieving performance comparable to state‑of‑the‑art vision‑language models while maintaining strong text generation capabilities.
The release triggered a wave of market activity: Nvidia’s stock fell about 17% in a single day, wiping out roughly $590 billion in market value, while DeepSeek climbed to the top of the U.S. App Store’s free‑app rankings, overtaking ChatGPT and Meta’s Threads. Competing models such as Alibaba’s Qwen2.5‑VL were also updated in the same period.
DeepSeek emphasizes cost‑effective strategies, including model distillation (six distilled variants trained on R1‑generated data) and reinforcement learning without a supervised fine‑tuning stage, achieving scores on AIME 2024 comparable to OpenAI’s o1‑0912. The industry response includes heightened research interest from Meta and OpenAI, and renewed debate over whether massive AI compute investments are necessary.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.