Tencent OCR's AI Triumph at ICDAR 2023: Four Championship Wins
At ICDAR 2023, Tencent's OCR team leveraged self‑developed algorithms and large‑model backbones to clinch four official championship titles across the DSText and SVRD tracks, showcasing breakthroughs in dense video text detection, tracking, end‑to‑end recognition, and structured information extraction.
ICDAR 2023 and Tencent OCR Achievements
At ICDAR 2023, the premier global OCR conference, Tencent's OCR team won four official championship titles. This was the team's fourth consecutive ICDAR entry and brings its total to 18 championship titles since 2017.
About ICDAR
ICDAR is the most authoritative academic conference in document image analysis, held biennially, attracting over 8,000 teams from more than 100 countries. The competition uses a blind‑test format with strict data release and submission limits, making it highly challenging.
Competition Tracks and Results
DSText Track (Dense Small‑Text Video Text Recognition)
The DSText competition, co‑hosted by Zhejiang University and others, provided 50 training videos of extremely dense small text, with a text density far exceeding that of other video text datasets. It includes two tasks: video text tracking and video text end‑to‑end recognition. Tencent OCR secured first place in both tasks.
Task 1 – Video Text Tracking: Tencent won the championship with a 12.04% absolute lead over the runner‑up in MOTA.
Task 2 – Video Text End‑to‑End Recognition: Tencent also won, leading the runner‑up by 11.93% in OCR‑MOTA.
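For readers unfamiliar with the metric, MOTA (Multiple Object Tracking Accuracy) is conventionally defined as one minus the ratio of total tracking errors (misses, false positives, identity switches) to the number of ground‑truth objects. The sketch below assumes that standard definition. The function name and the counts are illustrative and are not competition figures. The article does not spell out how OCR‑MOTA is computed, so only plain MOTA is shown.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Standard MOTA: 1 - (FN + FP + IDSW) / GT, summed over all frames."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# Invented counts, for illustration only: 35 errors over 100 ground-truth boxes.
score = mota(false_negatives=20, false_positives=10, id_switches=5,
             num_ground_truth=100)
print(f"MOTA = {score:.2%}")
```

Under this definition, an "absolute lead" is the difference between two teams' MOTA scores in percentage points, not a relative improvement.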
SVRD Track (Structured Information Extraction)
The SVRD competition, organized by Huazhong University of Science and Technology, Baidu, Harbin Institute of Technology and others, featured the richest application scenarios and semantic attributes to date. It comprises four tasks across two tracks (HUST‑CELL and BAIDU‑FEST).
Task 2 – E2E Complex Entity Labeling: Tencent won by a large margin.
Task 4 – E2E Few‑shot Structured Text Extraction: Tencent also secured the championship.
Key Algorithms in the DSText Track
Video Text Detection
Tencent built a top‑down instance‑segmentation detector on large‑model backbones such as InternImage and ViT‑Adapter, enhanced with SyncBN (synchronized batch normalization) and deformable convolutions. GA‑RPN replaced the classic RPN in Cascade Mask R‑CNN, and several feature‑pyramid networks (PAFPN, BiFPN, FPG) were explored. The R‑CNN stage used a five‑stage cascade with customized IoU thresholds and a double‑head design that separates regression from classification.
Additional supervision included a CTC‑based recognition branch and a global semantic segmentation branch. Post‑processing applied Soft Polygon NMS and a test‑time augmentation strategy (multi‑scale, flip, blur) to improve recall and precision.
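The idea behind soft NMS can be sketched briefly: instead of discarding every box that overlaps a higher‑scoring one, it decays that box's score according to the overlap, which helps keep neighbouring detections in dense text. The sketch below is a simplified stand‑in, not the team's implementation. It uses axis‑aligned boxes rather than polygons, a Gaussian decay, and hypothetical parameter values (sigma, score_threshold).

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian soft NMS. Returns (index, rescored) pairs in selection order."""
    candidates = [(score, idx) for idx, score in enumerate(scores)]
    kept = []
    while candidates:
        # Take the highest-scoring remaining box.
        best_pos = max(range(len(candidates)), key=lambda k: candidates[k][0])
        best_score, best_idx = candidates.pop(best_pos)
        if best_score < score_threshold:
            break  # everything left scores even lower
        kept.append((best_idx, best_score))
        # Decay the remaining scores by their overlap with the chosen box.
        candidates = [
            (s * math.exp(-box_iou(boxes[best_idx], boxes[i]) ** 2 / sigma), i)
            for s, i in candidates
        ]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
for idx, score in soft_nms(boxes, scores):
    print(idx, round(score, 3))
```

In this toy input the second box overlaps the first heavily, so its score is reduced and it is ranked last rather than removed outright. A polygon version would swap box_iou for a polygon IoU routine.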
Video Text Tracking
The team introduced a ByteTrack‑based multi‑metric tracking method that combines detection box matching, appearance similarity, text similarity, and neighboring box similarity. The individual scores are normalized and weighted to form a single matching cost. High‑confidence and low‑confidence boxes are matched in separate passes using the Kuhn‑Munkres algorithm. A post‑processing pipeline distinguishes natural from artificial objects and removes low‑confidence tracks.
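A minimal sketch of this association step, under several assumptions: the weights, the confidence threshold, the cost cut‑off, and the dictionary layout of tracks and detections are all hypothetical, and neighboring‑box similarity is left out for brevity. To stay dependency‑free, a brute‑force search over assignments stands in for the Kuhn‑Munkres algorithm. It finds the same minimum‑cost matching on tiny inputs but does not scale.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Hypothetical weights; the article does not state the values used.
WEIGHTS = {"iou": 0.5, "appearance": 0.3, "text": 0.2}


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = (sum(x * x for x in u) ** 0.5) * (sum(y * y for y in v) ** 0.5)
    return dot / norm if norm else 0.0


def match_cost(track, det):
    """Weighted similarity turned into a cost (lower means a better match)."""
    similarity = (
        WEIGHTS["iou"] * box_iou(track["box"], det["box"])
        + WEIGHTS["appearance"] * cosine(track["feat"], det["feat"])
        + WEIGHTS["text"] * SequenceMatcher(None, track["text"], det["text"]).ratio()
    )
    return 1.0 - similarity


def assign(tracks, dets, max_cost=0.7):
    """Minimum-cost one-to-one assignment by brute force (Kuhn-Munkres stand-in)."""
    if not tracks or not dets:
        return []
    cost = [[match_cost(t, d) for d in dets] for t in tracks]
    n_t, n_d = len(tracks), len(dets)
    if n_t <= n_d:
        options = (list(zip(range(n_t), p)) for p in permutations(range(n_d), n_t))
    else:
        options = (list(zip(p, range(n_d))) for p in permutations(range(n_t), n_d))
    best = min(options, key=lambda pairs: sum(cost[t][d] for t, d in pairs))
    return [(t, d) for t, d in best if cost[t][d] <= max_cost]


def associate(tracks, dets, high_conf=0.6):
    """ByteTrack-style two passes: high-confidence boxes first, then the rest."""
    high = [i for i, d in enumerate(dets) if d["score"] >= high_conf]
    low = [i for i, d in enumerate(dets) if d["score"] < high_conf]
    matches = {t: high[d] for t, d in assign(tracks, [dets[i] for i in high])}
    remaining = [t for t in range(len(tracks)) if t not in matches]
    second = assign([tracks[t] for t in remaining], [dets[i] for i in low])
    matches.update({remaining[t]: low[d] for t, d in second})
    return matches  # maps track index -> detection index


tracks = [
    {"box": (0, 0, 10, 10), "feat": (1.0, 0.0), "text": "EXIT"},
    {"box": (20, 0, 30, 10), "feat": (0.0, 1.0), "text": "OPEN"},
]
dets = [
    {"box": (21, 0, 31, 10), "feat": (0.0, 1.0), "text": "OPEN", "score": 0.9},
    {"box": (1, 0, 11, 10), "feat": (1.0, 0.0), "text": "EX1T", "score": 0.3},
]
print(associate(tracks, dets))
```

The second detection has low confidence and a misread character, yet it is still recovered in the second pass because its box, appearance, and most of its text agree with an unmatched track. In practice a polynomial‑time solver would replace the brute‑force search.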
End‑to‑End Text Recognition
A hybrid CTC and 2D‑Attention model based on Multiway‑Transformer was used. The encoder learned multimodal text‑image features, while the decoder incorporated a Global Semantic Reconstruction Module (GLRM) and the PARSeq structure. CTC decoding was refined with a semantic inference model, and confidence‑based fusion produced final text results.
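To make the decoding and fusion steps concrete, the sketch below shows greedy CTC decoding (take the most likely symbol per frame, collapse repeats, drop blanks) followed by a simple confidence‑based choice between two hypotheses. It is an illustration under assumptions: the article does not describe the fusion rule or the semantic inference model, so taking the minimum character probability as the confidence and picking the higher‑confidence string are hypothetical choices, as are all names and values.

```python
BLANK = 0  # index of the CTC blank symbol (assumed to be 0 here)


def ctc_greedy_decode(frame_probs, alphabet):
    """Greedy CTC decoding.

    frame_probs: one probability list per frame; index 0 is the blank and
    index k (k >= 1) corresponds to alphabet[k - 1].
    Returns (text, confidence), where confidence is the lowest probability
    among the emitted characters.
    """
    chars, confidences = [], []
    previous = BLANK
    for probs in frame_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        # Emit only when the symbol is not blank and not a repeat of the
        # previous frame; a blank between two equal symbols keeps both.
        if best != BLANK and best != previous:
            chars.append(alphabet[best - 1])
            confidences.append(probs[best])
        previous = best
    confidence = min(confidences) if confidences else 0.0
    return "".join(chars), confidence


def fuse(ctc_result, attention_result):
    """Pick the (text, confidence) hypothesis with the higher confidence."""
    return max(ctc_result, attention_result, key=lambda result: result[1])


frames = [
    [0.10, 0.80, 0.10],  # 'a'
    [0.10, 0.70, 0.20],  # 'a' again, collapsed into the previous one
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.60, 0.20],  # 'a' after a blank, so it is a new character
    [0.10, 0.20, 0.70],  # 'b'
]
ctc_hypothesis = ctc_greedy_decode(frames, "ab")
attention_hypothesis = ("abb", 0.4)  # made-up output of an attention decoder
print(fuse(ctc_hypothesis, attention_hypothesis))
```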
Key Algorithms in the SVRD Track
Tencent employed multimodal pretrained models such as LayoutLMv3 and StrucTexT to jointly encode text, position, and image features. The models were fine‑tuned on the downstream SER (entity classification) and RE (entity relation) tasks, using label smoothing, OHEM (online hard example mining), and an anti‑noise loss to handle class imbalance. Few‑shot tasks leveraged self‑supervised fine‑tuning based on Task 3 results.
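Two of the training techniques named above can be sketched in a few lines. The example assumes the common formulation of label smoothing (the one‑hot target mixed with a uniform distribution) and the simplest form of OHEM (keep only the highest‑loss samples in a batch). The smoothing factor and keep ratio are hypothetical, and the anti‑noise loss is not covered because the article gives no detail on it.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    The target puts (1 - smoothing) on the true class and spreads
    `smoothing` uniformly over all classes.
    """
    # Numerically stable log-softmax.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]
    off_value = smoothing / len(logits)
    on_value = 1.0 - smoothing + off_value
    return -sum(
        (on_value if i == target else off_value) * lp
        for i, lp in enumerate(log_probs)
    )


def ohem_select(losses, keep_ratio=0.5):
    """Indices of the hardest samples (largest loss), at least one."""
    keep = max(1, int(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:keep]


batch_logits = [[2.0, 0.0], [0.2, 0.1], [-1.0, 3.0], [0.0, 0.5]]
batch_targets = [0, 1, 0, 1]
losses = [smoothed_cross_entropy(l, t) for l, t in zip(batch_logits, batch_targets)]
hard = ohem_select(losses)
# Only the selected samples would contribute to the gradient step.
print(hard, sum(losses[i] for i in hard) / len(hard))
```

Label smoothing penalizes over‑confident predictions, while OHEM focuses updates on samples the model currently gets wrong, which tend to come from rare classes.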
Team Overview
The Tencent OCR team, part of the Data Platform and WeChat Architecture divisions, develops high‑precision, stable text detection and recognition technologies that power hundreds of Tencent services, including advertising, WeChat, QQ, Tencent Cloud, video, and information‑flow products.
Tencent Tech
Tencent's official tech account. Delivering quality technical content to serve developers.