
ShopeeVideo OCR: Multi-language Text Recognition System for E-commerce Video

ShopeeVideo OCR is a multi‑language text‑recognition system for Southeast Asian e‑commerce videos that unifies detection, Transformer‑based recognition, layout analysis, and large‑scale synthetic data generation to handle Indonesian, Filipino, English, Vietnamese, Thai and Chinese scripts, delivering industry‑leading accuracy and winning thirteen ICDAR first‑place awards.

Shopee Tech Team

This article introduces Shopee's multi-language OCR (Optical Character Recognition) technology designed for ShopeeVideo, addressing the challenges of recognizing text in e-commerce videos across Southeast Asian markets.

Background and Challenges: The system supports multiple languages including Indonesian, Filipino, English, Vietnamese, Thai, and Chinese (simplified/traditional), categorized into Latin, Thai, and Chinese script types. Key challenges include: 1) Significant cross-regional business differences requiring a unified algorithm for multiple languages; 2) Limited or missing multilingual annotated data, requiring synthetic data generation; 3) Complex video content necessitating intelligent filtering of important information.
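The grouping of languages into Latin, Thai, and Chinese script types can be illustrated with a coarse Unicode-based check. This is a minimal sketch, not the system's actual logic: the function names `char_script` and `dominant_script` and the choice of code-point ranges are assumptions made for illustration.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Unicode code-point ranges (inclusive) used for a coarse script check.
THAI_RANGE = (0x0E00, 0x0E7F)           # Thai block
CJK_RANGES = [(0x4E00, 0x9FFF),         # CJK Unified Ideographs
              (0x3400, 0x4DBF)]         # CJK Unified Ideographs Extension A


def char_script(ch: str) -> Optional[str]:
    """Return 'latin', 'thai', 'chinese', or None for digits, punctuation, etc."""
    cp = ord(ch)
    if THAI_RANGE[0] <= cp <= THAI_RANGE[1]:
        return "thai"
    if any(lo <= cp <= hi for lo, hi in CJK_RANGES):
        return "chinese"
    # Covers accented letters too, e.g. Vietnamese precomposed characters.
    if ch.isalpha() and "LATIN" in unicodedata.name(ch, ""):
        return "latin"
    return None


def dominant_script(text: str) -> str:
    """Pick the script type that most characters in `text` belong to."""
    counts = Counter(s for s in map(char_script, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else "unknown"
```

For example, `dominant_script("Giảm giá 50%")` falls into the Latin group, which is why Indonesian, Filipino, English, and Vietnamese can share one script type.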

Technical Architecture: The system comprises text detection, text recognition, video layout analysis, and multilingual data generation modules.

Text Detection: Uses an FCN-based instance segmentation approach with a ResNet-50 backbone and an FPN for multi-scale features. Building on the DBNet algorithm, the method adds learning of an EDT (Euclidean distance transform) distance field and a direction field, plus an attention-based character recognition branch, so that the whole model can be trained end to end.
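To make the distance field concrete, the sketch below computes a brute-force Euclidean distance transform over a small binary text mask. It only illustrates what such a regression target could look like; how the detector actually defines, normalizes, or supervises its distance and direction fields is not stated in the article, and the function name `distance_field` is hypothetical.

```python
import math


def distance_field(mask):
    """Brute-force Euclidean distance transform of a binary mask.

    Each text pixel (1) gets its distance to the nearest background pixel (0);
    background pixels get 0.0. Quadratic in pixel count, so only suitable
    for small illustrative masks.
    """
    h, w = len(mask), len(mask[0])
    background = [(y, x) for y in range(h) for x in range(w) if mask[y][x] == 0]
    if not background:
        raise ValueError("mask needs at least one background pixel")
    field = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                field[y][x] = min(math.hypot(y - by, x - bx)
                                  for by, bx in background)
    return field


# A 3x3 text region inside a 5x5 image: values grow toward the region centre.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
field = distance_field(mask)
```

Values that peak at the centre of a text region and fall to zero at its border give the network a smoother target than a hard binary mask, which is one plausible reason for adding such a field.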

Text Recognition: Based on PARSeq (a Transformer-based recognizer), with three improvements: a rewritten data pipeline for flexible data mixing, a hybrid architecture replacing the ViT in the encoder, and an adaptive 2D positional encoding strategy for handling irregular and multi-line text.
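The idea behind a 2D positional encoding can be sketched with the standard fixed sinusoidal scheme, where half of the channels encode the row and half encode the column of each feature-map cell. This is not the adaptive strategy used in the system, whose details the article does not give; the function names and the even split between row and column channels are assumptions.

```python
import math
from typing import List


def sincos_1d(pos: int, dim: int) -> List[float]:
    """Sinusoidal encoding of one position into `dim` values (dim must be even)."""
    out = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out


def pos_encoding_2d(rows: int, cols: int, dim: int) -> List[List[List[float]]]:
    """Return a rows x cols grid of `dim`-length vectors.

    The first dim/2 values depend only on the row index and the last dim/2
    only on the column index, so vertically stacked text lines receive
    distinguishable encodings.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return [[sincos_1d(r, half) + sincos_1d(c, half) for c in range(cols)]
            for r in range(rows)]


pe = pos_encoding_2d(rows=2, cols=3, dim=8)
```

With a purely 1D encoding, two characters at the same horizontal offset on different lines would look identical to the encoder; the row component removes that ambiguity for multi-line text.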

Video Layout Analysis: Classifies text into thematic and non-thematic categories based on position, content, and image clarity. Performs trajectory tracking across frames and merges text within paragraphs.
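Trajectory tracking across frames can be illustrated with a greedy IoU matcher that links each text box to the best-overlapping box in the next frame. This is a simplified sketch under assumptions: the article does not say which tracking method is used, and `iou`, `link_tracks`, and the 0.5 threshold are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tracks(frames, iou_threshold=0.5):
    """Greedily link boxes in consecutive frames into tracks.

    `frames` is a list of per-frame box lists. Returns a list of tracks,
    each a list of (frame_index, box). A track ends as soon as it finds
    no match in the next frame.
    """
    tracks = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for tr in tracks:
            last_t, last_box = tr[-1]
            if last_t != t - 1 or not unmatched:
                continue  # track already ended, or nothing left to match
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= iou_threshold:
                tr.append((t, best))
                unmatched.remove(best)
        # Boxes that matched no existing track start new tracks.
        tracks.extend([(t, b)] for b in unmatched)
    return tracks


frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10)],
    [(2, 0, 12, 10), (50, 50, 60, 60)],
]
tracks = link_tracks(frames)
```

Once boxes are grouped into tracks, the recognized text of one track can be aggregated over time instead of being reported once per frame.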

Multilingual Data Generation: Three methods are used: 1) Multi-language whole-image generation using multi-channel text rendering, which produces samples on the order of millions in 2-3 days; 2) Multi-language word generation with an improved TextRenderer that supports 15 data augmentations; 3) Style transfer using an improved StyleText with character region learning, spectral normalization, and ASPP (atrous spatial pyramid pooling) in the background erasure module.
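The word-generation step can be pictured as sampling a text string and a set of augmentations before anything is rendered. The sketch below only produces such sampling plans and does not render images or call TextRenderer. Everything in it is hypothetical: the tiny `CORPUS`, the augmentation names, and the `sample_specs` function are placeholders, not the system's corpus, its 15 augmentations, or its API.

```python
import random

# Hypothetical miniature corpus, keyed by script type.
CORPUS = {
    "latin": ["diskon", "sale", "giảm giá"],
    "thai": ["ลดราคา", "ส่งฟรี"],
    "chinese": ["特价", "包邮"],
}

# Hypothetical augmentation names; the real renderer's list is not given.
AUGMENTATIONS = ["blur", "noise", "rotate", "perspective"]


def sample_specs(n, seed=0, max_augs=2):
    """Return n sampling plans: script type, text, and augmentations to apply."""
    rng = random.Random(seed)  # seeded so that a dataset can be regenerated
    specs = []
    for _ in range(n):
        script = rng.choice(sorted(CORPUS))
        specs.append({
            "script": script,
            "text": rng.choice(CORPUS[script]),
            "augmentations": rng.sample(AUGMENTATIONS, rng.randint(0, max_augs)),
        })
    return specs


specs = sample_specs(5, seed=1)
```

Separating the sampling plan from rendering keeps generation reproducible and makes it easy to rebalance script types when annotated data for one language is scarce.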

Achievements: The system achieved industry-leading recognition accuracy for Latin, Thai, and Chinese on 3k business datasets. In ICDAR competitions, the team secured 13 first-place finishes across 6 competitions, on benchmarks including IC13, IC15, COCO-Text, MLT-17, MLT-19, ArT-19, and ReCTS-19.

computer vision, deep learning, OCR, text detection, text recognition, Optical Character Recognition, Shopee, data synthesis, Multi-language OCR, video OCR
