Tencent OCR's AI Triumph at ICDAR 2023: Four Championship Wins

At ICDAR 2023, Tencent's OCR team leveraged self‑developed algorithms and large‑model backbones to clinch four official championship titles across the DSText and SVRD tracks, showcasing breakthroughs in dense video text detection, tracking, end‑to‑end recognition, and structured information extraction.

Tencent Tech

Oct 20, 2023

ICDAR 2023 and Tencent OCR Achievements

At ICDAR 2023, the premier global OCR conference, Tencent's OCR team won four official championship titles. This was the team's fourth consecutive appearance and brings its total to 18 championship titles since 2017, demonstrating world‑class OCR technology.

About ICDAR

ICDAR is the most authoritative academic conference in document image analysis. It is held biennially and attracts over 8,000 teams from more than 100 countries. Its competitions use a blind‑test format with strict limits on data release and submissions, which makes them highly challenging.

Competition Tracks and Results

DSText Track (Dense Small‑Text Video Text Recognition)

The DSText competition, co‑hosted by Zhejiang University and others, provided 50 training videos containing small text at a density far exceeding that of other datasets. It includes two tasks: video text tracking and end‑to‑end video text recognition. Tencent OCR took first place in both.

Task 1 – Video Text Tracking: Tencent won the championship with a 12.04% absolute lead in MOTA over the runner‑up.

Task 2 – Video Text End‑to‑End Recognition: Tencent also won, leading the runner‑up by 11.93% in OCR‑MOTA.

Video Text End-to-End Recognition Certificate
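For readers unfamiliar with the tracking metric, MOTA is conventionally defined as one minus the ratio of all tracking errors (missed objects, false positives, identity switches) to the number of ground‑truth objects. The sketch below applies that standard definition; the error counts are made up for illustration and are not competition figures. It also shows what an "absolute lead" means: a difference in percentage points between two scores.

```python
def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


# Hypothetical error counts for two trackers evaluated on the same 1000 ground-truth boxes.
score_a = mota(false_negatives=200, false_positives=100, id_switches=50, num_gt=1000)
score_b = mota(false_negatives=300, false_positives=150, id_switches=50, num_gt=1000)

# An absolute lead is the difference of the two scores, expressed in percentage points.
lead_points = (score_a - score_b) * 100
print(f"MOTA A={score_a:.2%}, B={score_b:.2%}, absolute lead={lead_points:.2f} points")
```

Because errors are summed before dividing, MOTA can be negative when a tracker produces more errors than there are ground‑truth objects.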

SVRD Track (Structured Information Extraction)

The SVRD competition, organized by Huazhong University of Science and Technology, Baidu, Harbin Institute of Technology and others, featured the richest application scenarios and semantic attributes to date. It comprises four tasks across two tracks (HUST‑CELL and BAIDU‑FEST).

Task 2 – E2E Complex Entity Labeling: Tencent won by a large margin.

Task 4 – E2E Few‑shot Structured Text Extraction: Tencent also took the championship.

Few‑shot Structured Text Extraction Certificate

Key Algorithms in the DSText Track

Video Text Detection

Tencent built a top‑down instance‑segmentation detector on large‑model backbones such as InternImage and ViT‑Adapter, enhanced with synchronized batch normalization (SyncBN) and deformable convolutions. Within Cascade Mask R‑CNN, GA‑RPN replaced the classic RPN, and several feature‑pyramid networks (PAFPN, BiFPN, FPG) were explored. The R‑CNN stage used a five‑stage cascade with customized IoU thresholds and a double‑head design that separates regression from classification.

Additional supervision included a CTC‑based recognition branch and a global semantic segmentation branch. Post‑processing applied Soft Polygon NMS and a test‑time augmentation strategy (multi‑scale, flip, blur) to improve recall and precision.
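To make the post‑processing step concrete, here is a minimal sketch of Soft‑NMS, which lowers the scores of overlapping detections instead of discarding them outright. Several things here are assumptions: the article does not say which decay function the team used, so the Gaussian variant is shown; axis‑aligned box IoU stands in for polygon IoU to keep the example short; and `sigma` and `score_threshold` are illustrative values. A polygon version would pass a polygon‑overlap function as `iou_fn`.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.05, iou_fn=box_iou):
    """Gaussian Soft-NMS.

    Returns (index, decayed_score) pairs in the order the boxes were selected.
    """
    remaining = list(enumerate(scores))
    kept = []
    while remaining:
        # Pick the highest-scoring box that is still in play.
        pos = max(range(len(remaining)), key=lambda k: remaining[k][1])
        best_idx, best_score = remaining.pop(pos)
        if best_score < score_threshold:
            break  # everything left scores even lower
        kept.append((best_idx, best_score))
        # Decay the rest according to their overlap with the selected box.
        remaining = [
            (i, s * math.exp(-(iou_fn(boxes[best_idx], boxes[i]) ** 2) / sigma))
            for i, s in remaining
        ]
    return kept


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
for index, score in soft_nms(boxes, scores):
    print(index, round(score, 3))
```

In this example the second box overlaps the first heavily, so its score is reduced and it is ranked last rather than removed. That behavior is what helps recall on densely packed text, where neighboring instances legitimately overlap.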

Video Text Tracking

The team introduced a ByteTrack‑based multi‑metric tracking method that combines detection box matching, appearance similarity, text similarity, and neighboring box similarity. The individual scores are normalized and weighted to form a single matching cost. High‑confidence and low‑confidence boxes are then associated in separate passes using the Kuhn‑Munkres (Hungarian) algorithm. A post‑processing pipeline distinguishes natural from artificial objects and removes low‑confidence tracks.
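The association logic described above can be sketched as follows. This is an illustration under stated assumptions, not the team's implementation: only two of the four metrics are modeled (box IoU and text similarity), the weights, `high_conf`, and `max_cost` values are hypothetical, and an exhaustive search stands in for Kuhn‑Munkres so the example needs only the standard library. The exhaustive search returns the same minimum‑cost assignment but is practical only for a handful of boxes.

```python
from difflib import SequenceMatcher
from itertools import permutations


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_cost(track, det, w_box=0.6, w_text=0.4):
    """Weighted cost in [0, 1]; both similarities are already normalized to [0, 1]."""
    text_sim = SequenceMatcher(None, track["text"], det["text"]).ratio()
    return 1.0 - (w_box * iou(track["box"], det["box"]) + w_text * text_sim)


def assign(tracks, dets, max_cost=0.7):
    """Minimum-cost one-to-one assignment by exhaustive search (Kuhn-Munkres stand-in).

    Returns (track_index, det_index) pairs whose cost does not exceed max_cost.
    """
    n, m = len(tracks), len(dets)
    if n <= m:
        candidates = (list(zip(range(n), p)) for p in permutations(range(m), n))
    else:
        candidates = (list(zip(p, range(m))) for p in permutations(range(n), m))

    def cost(pair):
        return match_cost(tracks[pair[0]], dets[pair[1]])

    best = min(candidates, key=lambda pairs: sum(cost(p) for p in pairs))
    return [p for p in best if cost(p) <= max_cost]


def associate(tracks, dets, high_conf=0.6):
    """Two-pass association: high-confidence boxes first, then low-confidence ones."""
    high = [d for d in dets if d["score"] >= high_conf]
    low = [d for d in dets if d["score"] < high_conf]
    first = assign(tracks, high)
    matched = {t for t, _ in first}
    rest = [t for i, t in enumerate(tracks) if i not in matched]
    second = assign(rest, low)  # low-confidence boxes only see unmatched tracks
    return ([(tracks[t]["id"], high[d]) for t, d in first]
            + [(rest[t]["id"], low[d]) for t, d in second])


tracks = [
    {"id": 1, "box": (0, 0, 10, 10), "text": "EXIT"},
    {"id": 2, "box": (50, 50, 70, 60), "text": "OPEN"},
]
dets = [
    {"box": (1, 0, 11, 10), "text": "EXIT", "score": 0.9},
    {"box": (51, 50, 71, 60), "text": "0PEN", "score": 0.3},
]
for track_id, det in associate(tracks, dets):
    print(track_id, det["text"], det["score"])
```

The second detection has a low score and a misread character, yet it is still attached to track 2 in the second pass because its box and most of its text agree with the track. Recovering such boxes, rather than dropping them at a score threshold, is the core idea ByteTrack contributes.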

End‑to‑End Text Recognition

A hybrid CTC and 2D‑Attention model based on Multiway‑Transformer was used. The encoder learned multimodal text‑image features, while the decoder incorporated a Global Semantic Reconstruction Module (GLRM) and the PARSeq structure. CTC decoding was refined with a semantic inference model, and confidence‑based fusion produced final text results.
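As a rough illustration of the CTC side of this pipeline, the sketch below performs greedy CTC decoding (take the best class per frame, collapse repeats, drop blanks) and then fuses two candidate transcriptions by confidence. The fusion rule shown, keeping the candidate with the highest confidence, is an assumption, since the article does not specify how the team combined results. The class string, the probabilities, and the use of the minimum character probability as sequence confidence are likewise illustrative choices.

```python
BLANK = 0  # index of the CTC blank symbol in every per-frame distribution


def ctc_greedy_decode(frame_probs, classes):
    """Greedy CTC decoding.

    frame_probs: one probability list per frame, aligned with `classes`.
    classes: string whose character at index BLANK is a placeholder for blank.
    Returns (text, confidence), where confidence is the lowest probability
    among the emitted characters (0.0 if nothing was emitted).
    """
    chars, confs = [], []
    prev = BLANK
    for probs in frame_probs:
        idx = max(range(len(probs)), key=probs.__getitem__)
        # Emit only when the class is not blank and differs from the previous frame.
        if idx != BLANK and idx != prev:
            chars.append(classes[idx])
            confs.append(probs[idx])
        prev = idx
    return "".join(chars), (min(confs) if confs else 0.0)


def fuse_by_confidence(candidates):
    """Pick the (text, confidence) candidate with the highest confidence."""
    return max(candidates, key=lambda c: c[1])


classes = "-cat"  # '-' stands for the blank symbol
frames = [
    [0.10, 0.80, 0.05, 0.05],  # c
    [0.20, 0.70, 0.05, 0.05],  # c again, collapsed
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.05, 0.05, 0.80],  # t
    [0.30, 0.05, 0.05, 0.60],  # t again, collapsed
]
ctc_result = ctc_greedy_decode(frames, classes)
attention_result = ("cot", 0.55)  # hypothetical output of an attention decoder
final_text, final_conf = fuse_by_confidence([ctc_result, attention_result])
print(ctc_result, final_text)
```

The blank frame between repeated characters is what lets CTC represent doubled letters: "aa" must be emitted as `a`, blank, `a`, otherwise the two frames collapse into one character.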

Key Algorithms in the SVRD Track

Tencent employed multimodal pretrained models such as LayoutLMv3 and StrucTexT to jointly encode text, position, and image features. The model was fine‑tuned on two downstream tasks, semantic entity recognition (SER, entity classification) and relation extraction (RE, entity relation), using label smoothing, OHEM, and an anti‑noise loss to cope with class imbalance and noisy labels. The few‑shot tasks used self‑supervised fine‑tuning based on the Task 3 results.
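Two of the training techniques named above are simple enough to sketch directly. The example below shows label‑smoothed cross‑entropy for a single sample and an OHEM‑style reduction that averages only the hardest losses in a batch. The smoothing factor `epsilon` and the `keep_ratio` are hypothetical values, the functions operate on plain Python lists rather than tensors, and the anti‑noise loss is not modeled because the article gives no detail on it.

```python
import math


def smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    The target puts (1 - epsilon) on the true class and spreads epsilon
    uniformly over all classes, including the true one.
    """
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    num_classes = len(logits)
    loss = 0.0
    for i, logit in enumerate(logits):
        weight = epsilon / num_classes + ((1.0 - epsilon) if i == target else 0.0)
        loss -= weight * (logit - log_z)
    return loss


def ohem_mean(losses, keep_ratio=0.5):
    """Online hard example mining: average only the largest losses."""
    keep = max(1, int(len(losses) * keep_ratio))
    hardest = sorted(losses, reverse=True)[:keep]
    return sum(hardest) / keep


plain = smoothed_cross_entropy([2.0, 0.0], target=0, epsilon=0.0)
smoothed = smoothed_cross_entropy([2.0, 0.0], target=0, epsilon=0.1)
batch_loss = ohem_mean([0.1, 0.9, 0.5, 0.3], keep_ratio=0.5)
print(round(plain, 4), round(smoothed, 4), round(batch_loss, 4))
```

Smoothing raises the loss on a confidently correct prediction, which discourages over‑confidence, while OHEM focuses the gradient on rare or difficult entities that would otherwise be swamped by easy majority‑class samples.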

Team Overview

The Tencent OCR team, part of the Data Platform and WeChat Architecture divisions, develops high‑precision, stable text detection and recognition technologies that power hundreds of Tencent services, including advertising, WeChat, QQ, Tencent Cloud, video, and information‑flow products.

