How dots.ocr Achieves SOTA Multilingual Document Parsing with a 1.7B VLM

dots.ocr is a 1.7 billion-parameter multilingual document-parsing model that unifies layout detection and content recognition within a single visual-language model, delivering state-of-the-art performance across text, tables, formulas and reading order while remaining efficient and extensible for future multimodal AI research.

Xiaohongshu Tech REDtech
Xiaohongshu Tech REDtech
Xiaohongshu Tech REDtech
How dots.ocr Achieves SOTA Multilingual Document Parsing with a 1.7B VLM

Overview

dots.ocr is a powerful multilingual document‑parsing model that unifies layout detection and content recognition within a single visual‑language model (VLM). Despite being built on a 1.7 billion‑parameter “small” model, it achieves state‑of‑the‑art (SOTA) performance on multiple benchmarks, closing the gap with much larger proprietary models.

Key Strengths

Strong performance : On the OmniDocBench benchmark, dots.ocr attains SOTA results for text, tables and reading order, and its formula‑recognition rivals larger models such as Doubao‑1.5 and Gemini 2.5‑pro.

Multilingual support : Excellent parsing ability for low‑resource languages, outperforming existing open‑source solutions in both layout detection and content recognition.

Unified and simple architecture : A single VLM replaces complex multi‑model pipelines; task switching is achieved by changing the input prompt.

Efficiency : Built on a 1.7 B parameter VLM, inference speed surpasses many larger VLM alternatives.

Resources

GitHub: https://github.com/rednote-hilab/dots.ocr

HuggingFace: https://huggingface.co/rednote-hilab/dots.ocr

Demo: https://dotsocr.xiaohongshu.com

Benchmark Visuals

Benchmark overview
Benchmark overview
Multilingual end-to-end performance
Multilingual end-to-end performance

Pre‑training Pipeline

Stage 1 – Visual encoder pre‑training : Trained a 1.2 B‑parameter visual encoder on a large image‑text dataset.

Stage 2 – Visual encoder continued pre‑training : Used NaViT dynamic‑resolution architecture (up to 11 M‑pixel inputs) and added OCR, video, grounding data; aligned with Qwen2.5‑1.5B language model to produce dots.vit.

Stage 3 – VLM training : Trained on pure OCR data, first freezing VE parameters then fine‑tuning all parameters on 1/5 of the token budget, yielding the base OCR model dots.ocr.base.

Supervised Fine‑tuning (SFT)

Large, diverse SFT dataset combining human‑annotated, synthetic (tables, formulas, multilingual OCR) and open‑source data.

Iterative data‑flywheel: a 15 k‑sample internal multilingual layout dataset refined through three cycles of bad‑case identification, re‑annotation, and reintegration.

Reading‑order correction using “large‑model ranking + rule posterior”.

Quality & robustness: multi‑expert cleaning, distillation, and data augmentations (scaling, rotation, noise).

Multi‑task training via prompt engineering, enabling a single model to perform detection and recognition based on the provided prompt.

Limitations & Future Work

Complex document elements such as highly intricate tables, formulas, and images remain challenging.

Failure cases occur with extremely high character‑to‑pixel ratios or long sequences of special characters; higher DPI or alternative prompts can mitigate.

Efficiency bottleneck: despite the 1.7 B‑parameter backbone, processing very large PDFs can be slow.

Future plans include improving table and formula extraction, enhancing generalization, and extending the model to handle image content within documents.

Contact

For collaboration on future VLM development, reach out to [email protected] .

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

AIOCRbenchmarkDocument Parsingmultilingualvisual language model
Xiaohongshu Tech REDtech
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.