How dots.ocr Achieves SOTA Multilingual Document Parsing with a 1.7B VLM
dots.ocr is a 1.7 billion-parameter multilingual document-parsing model that unifies layout detection and content recognition within a single visual-language model, delivering state-of-the-art performance across text, tables, formulas and reading order while remaining efficient and extensible for future multimodal AI research.
Overview
dots.ocr is a powerful multilingual document‑parsing model that unifies layout detection and content recognition within a single visual‑language model (VLM). Despite being built on a 1.7 billion‑parameter “small” model, it achieves state‑of‑the‑art (SOTA) performance on multiple benchmarks, closing the gap with much larger proprietary models.
Key Strengths
Strong performance: On the OmniDocBench benchmark, dots.ocr attains SOTA results for text, tables and reading order, and its formula recognition is comparable to that of much larger models such as Doubao‑1.5 and Gemini 2.5 Pro.
Multilingual support: Parses low‑resource languages well, outperforming existing open‑source solutions in both layout detection and content recognition.
Unified and simple architecture: A single VLM replaces complex multi‑model pipelines; switching tasks only requires changing the input prompt.
Efficiency: Because it is built on a 1.7 B‑parameter VLM, it runs inference faster than many alternatives built on larger VLMs.
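The prompt‑based task switching described above can be pictured with a small sketch. Everything here is illustrative: the task names, the prompt wording, and the `ParseRequest` / `build_request` helpers are hypothetical and are not the prompts or API shipped with dots.ocr. The point is only that the image and the model stay the same while the prompt selects the task.

```python
from dataclasses import dataclass

# Hypothetical prompt texts, for illustration only.
# They are NOT the prompts distributed with dots.ocr.
TASK_PROMPTS = {
    "layout_and_content": (
        "Detect every layout element and output its bounding box, "
        "category, and text in reading order."
    ),
    "layout_only": (
        "Detect every layout element and output only its bounding box "
        "and category."
    ),
    "ocr_only": "Extract the text content of this image.",
}


@dataclass(frozen=True)
class ParseRequest:
    """One model call: a page image plus the prompt that selects the task."""
    image_path: str
    prompt: str


def build_request(image_path: str, task: str) -> ParseRequest:
    """Pair a page image with the prompt for the requested task."""
    if task not in TASK_PROMPTS:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TASK_PROMPTS)}"
        )
    return ParseRequest(image_path=image_path, prompt=TASK_PROMPTS[task])
```

Calling `build_request("page.png", "layout_only")` and `build_request("page.png", "ocr_only")` yields two requests that differ only in their prompt, which is the sense in which a single model covers both detection and recognition.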
Resources
GitHub: https://github.com/rednote-hilab/dots.ocr
HuggingFace: https://huggingface.co/rednote-hilab/dots.ocr
Demo: https://dotsocr.xiaohongshu.com
Pre‑training Pipeline
Stage 1 – Visual encoder pre‑training: A 1.2 B‑parameter visual encoder was trained on a large image‑text dataset.
Stage 2 – Visual encoder continued pre‑training: The encoder adopted the NaViT dynamic‑resolution architecture (inputs of up to 11 M pixels), was trained further with added OCR, video, and grounding data, and was aligned with the Qwen2.5‑1.5B language model to produce dots.vit.
Stage 3 – VLM training: The model was trained on pure OCR data, first with the visual encoder (VE) parameters frozen and then with all parameters fine‑tuned on 1/5 of the token budget, yielding the base OCR model dots.ocr.base.
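The Stage 3 schedule can be sketched as a function of how many tokens have been consumed. This is a minimal sketch under stated assumptions: the source only says the visual encoder is frozen first and that all parameters are then tuned on 1/5 of the token budget, so treating that fifth as the final portion of a single budget is an interpretation, and the function name and arguments are hypothetical.

```python
def visual_encoder_frozen(tokens_seen: int, token_budget: int,
                          full_finetune_share: int = 5) -> bool:
    """Return True while the visual encoder should stay frozen.

    Assumption: the last 1/full_finetune_share of the token budget is the
    phase in which every parameter, including the visual encoder, is trained.
    """
    if token_budget <= 0 or full_finetune_share <= 0:
        raise ValueError("token_budget and full_finetune_share must be positive")
    if tokens_seen < 0:
        raise ValueError("tokens_seen must be non-negative")
    # Integer arithmetic avoids floating-point drift at the phase boundary.
    full_finetune_tokens = token_budget // full_finetune_share
    unfreeze_at = token_budget - full_finetune_tokens
    return tokens_seen < unfreeze_at
```

In a training loop, a check like this would decide at each step whether gradients are enabled for the encoder; with a budget of 1,000 tokens the encoder would stay frozen for the first 800.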
Supervised Fine‑tuning (SFT)
Large, diverse SFT dataset combining human‑annotated, synthetic (tables, formulas, multilingual OCR) and open‑source data.
Iterative data‑flywheel: a 15 k‑sample internal multilingual layout dataset refined through three cycles of bad‑case identification, re‑annotation, and reintegration.
Reading‑order correction that combines ranking by a large model with rule‑based post‑verification.
Quality & robustness: multi‑expert cleaning, distillation, and data augmentations (scaling, rotation, noise).
Multi‑task training via prompt engineering, enabling a single model to perform detection and recognition based on the provided prompt.
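The iterative data flywheel mentioned in this list (bad‑case identification, re‑annotation, reintegration, repeated for three cycles) has a simple loop structure, sketched below. The function and parameter names are hypothetical, and the two callables stand in for steps that in practice would involve model evaluation and human annotators; the source does not describe how they were implemented.

```python
from typing import Callable, Dict, Hashable, List

Sample = Dict[str, object]  # hypothetical record shape, e.g. {"id": ..., "label": ...}


def run_flywheel(
    dataset: List[Sample],
    find_bad_cases: Callable[[List[Sample]], List[Hashable]],
    reannotate: Callable[[Sample], Sample],
    cycles: int = 3,
) -> List[Sample]:
    """Refine a dataset by repeatedly fixing the samples flagged as bad."""
    by_id = {sample["id"]: sample for sample in dataset}
    for _ in range(cycles):
        # 1. Bad-case identification (stand-in for evaluating a trained model).
        bad_ids = find_bad_cases(list(by_id.values()))
        if not bad_ids:
            break  # nothing left to fix, stop early
        # 2. Re-annotation and 3. reintegration into the working set.
        for sample_id in bad_ids:
            by_id[sample_id] = reannotate(by_id[sample_id])
    return list(by_id.values())
```

The dataset size stays constant across cycles; only the annotations of flagged samples change, which matches the description of refining a fixed 15 k‑sample set.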
Limitations & Future Work
Complex document elements such as highly intricate tables, formulas, and images remain challenging.
Failure cases occur when the character‑to‑pixel ratio is extremely high or when the page contains long sequences of special characters; rendering at a higher DPI or trying an alternative prompt can mitigate these failures.
Efficiency bottleneck: despite the 1.7 B‑parameter backbone, processing very large PDFs can be slow.
Future plans include improving table and formula extraction, enhancing generalization, and extending the model to handle image content within documents.
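The higher‑DPI mitigation mentioned above trades off against input size. The sketch below picks the highest rendering DPI whose page image still fits a pixel budget. The 11,000,000 figure is taken from the "up to 11 M‑pixel inputs" in the pre‑training description; whether the same cap applies at inference time is an assumption, and the helper names are hypothetical.

```python
import math

# Assumption: reuse the ~11 M-pixel input size from the pre-training
# description as an inference-time budget.
MAX_PIXELS = 11_000_000


def page_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Pixel count of a page of the given size (in inches) rendered at dpi."""
    return round(width_in * dpi) * round(height_in * dpi)


def highest_dpi_within_budget(width_in: float, height_in: float,
                              max_pixels: int = MAX_PIXELS) -> int:
    """Largest integer DPI at which the rendered page fits in max_pixels."""
    if width_in <= 0 or height_in <= 0 or max_pixels <= 0:
        raise ValueError("page size and pixel budget must be positive")
    # Closed-form starting point, then step down to absorb rounding effects.
    dpi = int(math.sqrt(max_pixels / (width_in * height_in)))
    while dpi > 0 and page_pixels(width_in, height_in, dpi) > max_pixels:
        dpi -= 1
    if dpi == 0:
        raise ValueError("page is too large for the pixel budget at any DPI")
    return dpi
```

For dense pages, rendering near this upper bound gives each character the most pixels the budget allows; for sparse pages a lower DPI is likely sufficient and faster to process.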
Contact
For collaboration on future VLM development, reach out to [email protected].
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
