Meituan’s Fully Discrete Multimodal Base (LongCat-Next) Shows All Physical Signals Can Converge to Tokens
LongCat-Next, a 3‑billion‑parameter multimodal model released by Meituan, adopts a pure discrete token‑based architecture (DiNA) and next‑token prediction, outperforming same‑size rivals on OmniDocBench‑EN, CharXivRQ, and matching QwenVL on visual tasks, while avoiding catastrophic forgetting and achieving a SWE‑Bench score of 43.0, as demonstrated through extensive benchmarks, receipt extraction, OCR, audio dialect reasoning, and image generation experiments.
The LongCat‑Next model, open‑sourced by Meituan’s LongCat team, introduces a fully discrete multimodal foundation built on the LongCat‑Flash‑Lite MoE base, which activates only about 3B parameters per token. It follows the plain next‑token prediction (NTP) paradigm, treating code, high‑resolution images, and noisy audio uniformly as sequences of discrete tokens.
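To make the idea of "everything is a token" concrete, here is a minimal sketch of how tokens from several modalities could share one vocabulary and feed a single next‑token objective. The offset scheme, the vocabulary sizes, and the function names are assumptions made for illustration; the source does not describe how LongCat‑Next actually lays out its vocabulary.

```python
# Hypothetical per-modality vocabulary sizes, chosen only for illustration.
VOCAB_SIZES = {"text": 1000, "image": 512, "audio": 256}

def build_offsets(sizes):
    """Give each modality a contiguous id range inside one shared vocabulary."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # start is now the total vocabulary size

def to_shared_ids(segments, offsets, sizes):
    """Flatten (modality, local_ids) segments into one sequence of shared ids."""
    ids = []
    for modality, local_ids in segments:
        for local in local_ids:
            if not 0 <= local < sizes[modality]:
                raise ValueError(f"{modality} id {local} is out of range")
            ids.append(offsets[modality] + local)
    return ids

def next_token_pairs(ids):
    """Build (prefix, target) training pairs for next-token prediction."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]

offsets, total = build_offsets(VOCAB_SIZES)
sequence = to_shared_ids(
    [("text", [5, 7]), ("image", [3]), ("audio", [0])], offsets, VOCAB_SIZES
)
pairs = next_token_pairs(sequence)
```

Once the ids are flattened this way, the training objective no longer needs to know which modality a target came from: every pair is simply a prefix and the id that follows it.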
DiNA Architecture : The model implements Discrete Native Autoregression (DiNA), which unifies the representations of all modalities in a common token space. t‑SNE visualizations show tightly interwoven embeddings across text, audio, and vision, which the authors present as evidence that heterogeneous signals converge in that space.
Vision Tokenization – dNaViT : LongCat‑Next’s novel Discrete‑Native Vision Transformer (dNaViT) converts continuous visual signals into homogeneous discrete tokens. It employs Residual Vector Quantization (RVQ), recursively fitting residuals across codebooks to achieve a 28× compression ratio while preserving high‑frequency details. The tokenized visual features can be processed at arbitrary resolutions, enabling strong performance on complex chart reasoning tasks.
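The recursive residual fitting behind RVQ can be illustrated with a toy example. The sketch below uses small hand‑built 2‑D grid codebooks instead of learned ones, and it makes no claim about dNaViT's real codebook sizes, depth, or training; all names are hypothetical.

```python
def make_grid(step):
    """Toy 2-D codebook: the 9 points of a 3x3 grid with the given spacing."""
    return [[x * step, y * step] for x in (-1, 0, 1) for y in (-1, 0, 1)]

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def rvq_encode(vec, codebooks):
    """Quantize level by level; each level encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from each level used."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Coarse-to-fine codebooks: each level has half the spacing of the previous one.
codebooks = [make_grid(1.0), make_grid(0.5), make_grid(0.25)]
vector = [0.8, 0.3]

indices = rvq_encode(vector, codebooks)
reconstruction = rvq_decode(indices, codebooks)
coarse_only = rvq_decode(indices[:1], codebooks)
```

For this input the three levels pick indices `[7, 5, 0]` and the reconstruction is `[0.75, 0.25]`, while the first level alone gives `[1.0, 0.0]`. Each additional level stores one more small integer and corrects the error left by the previous levels, which is how a stack of codebooks can keep fine detail at a fixed token budget.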
Generation Head – Depth Transformer : Multi‑modal token streams are summed before entering a Depth Transformer, which serves as the multimodal prediction head without adding overhead to the front‑end encoder.
Semantic Alignment Encoder (SAE) : To mitigate semantic loss during discretization, a Semantic Alignment Encoder aligns token representations globally through multi‑task dense learning, ensuring that generated tokens retain recoverable information.
Dual‑Path Detokenization : For decoding, LongCat‑Next separates the process into two tracks. The first track uses a ViT‑based structural pixel decoder to generate low‑resolution anchor maps, preserving global layout. The second track, a Diffusion Refiner, injects ultra‑high‑frequency details, allowing accurate reconstruction of intricate mathematical formulas and OCR‑level text fidelity.
Benchmark Results : On the OmniDocBench‑EN and CharXivRQ leaderboards, LongCat‑Next surpasses the same‑size Qwen3‑Omni‑A3B across all metrics. Its visual understanding matches the specialized QwenVL model, and it attains a SWE‑Bench score of 43.0, indicating strong code‑generation capability.
Practical Evaluations : The hands‑on tests include extracting structured JSON from a supermarket receipt with noisy numeric patterns, verifying that the settlement arithmetic is correct, solving a reasoning task posed in Sichuan‑dialect audio, and generating a spoken bilingual meeting notice with natural prosody. The image generation tests produce a children’s book cover with correctly placed typography and a high‑fidelity, OCR‑friendly rendering of complex charts.
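A receipt extraction like the one described can be checked independently of the model. The sketch below shows one way to verify settlement arithmetic on extracted JSON. The field names (`items`, `quantity`, `unit_price`, `amount`, `total`) and the sample data are invented for illustration and do not come from the source.

```python
import json
from decimal import Decimal

def check_receipt(receipt_json):
    """Return a list of arithmetic inconsistencies found in a receipt.

    Numbers are parsed as Decimal so that prices such as 12.80 are compared
    exactly instead of through binary floating point.
    """
    receipt = json.loads(receipt_json, parse_float=Decimal)
    problems = []
    computed_total = Decimal("0")
    for item in receipt["items"]:
        expected = Decimal(item["quantity"]) * Decimal(item["unit_price"])
        amount = Decimal(item["amount"])
        if expected != amount:
            problems.append(f"{item['name']}: expected {expected}, got {amount}")
        computed_total += amount
    total = Decimal(receipt["total"])
    if computed_total != total:
        problems.append(f"total: expected {computed_total}, got {total}")
    return problems

# Hypothetical model output for a two-line receipt.
GOOD = """{"items": [
  {"name": "milk", "quantity": 2, "unit_price": 3.50, "amount": 7.00},
  {"name": "rice", "quantity": 1, "unit_price": 12.80, "amount": 12.80}],
  "total": 19.80}"""
BAD = GOOD.replace('"total": 19.80', '"total": 19.90')

good_problems = check_receipt(GOOD)
bad_problems = check_receipt(BAD)
```

A check of this kind treats the model as an extractor and keeps the arithmetic in ordinary code. Real receipts may also carry discounts, tax, or per‑line rounding, which this sketch does not handle.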
Conclusion : By converting continuous visual and auditory signals into a unified discrete token space, LongCat‑Next demonstrates that a model of modest size (about 3B activated parameters) can achieve cross‑modal understanding and generation without resorting to large heterogeneous modules. The code, model weights, and full technical report are publicly available, offering a useful reference for researchers working on multimodal fusion and token‑level modeling.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
