
Insights into OCR Technology at iQIYI: Development, Challenges, and Applications

iQIYI’s OCR journey, explained by researcher Harlon, covers the evolution from separate detection and recognition pipelines to end‑to‑end models, key algorithms like CTPN, DB and CRNN, large‑scale simulated training, diverse video‑text applications, and future goals such as mobile deployment and tighter NLP integration.

iQIYI Technical Product Team

With the rising popularity of artificial intelligence, the sub‑field of image recognition, especially OCR (Optical Character Recognition), has attracted increasing attention. Many companies have business needs involving image and document recognition, prompting broad industry exploration of practical methods.

InfoQ invited Harlon, an assistant researcher from iQIYI’s Intelligent Platform Department, for a live interview. He described the evolution of OCR technology, the pain points encountered during iQIYI’s exploration, and technical details of the solutions.

OCR consists of two main steps: (1) text detection – locating text regions in an image, which differs from generic object detection because text lines vary widely in length, aspect ratio, and orientation; (2) text recognition – converting the detected text line images into character strings. Traditional pipelines separate detection and recognition, while modern approaches use end‑to‑end sequence‑to‑sequence networks that jointly perform both tasks, reducing annotation effort and improving performance.
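The traditional two-stage pipeline described above can be sketched as plain function composition: a detector produces text-line boxes, and a recognizer is applied to each cropped line. This is a minimal illustration with hypothetical `detector` and `recognizer` callables standing in for real models such as CTPN or CRNN:

```python
def ocr_pipeline(image, detector, recognizer):
    """Two-stage OCR: detect text-line regions, then recognize each crop.

    image      -- 2D array-like (list of rows)
    detector   -- callable returning a list of (x, y, w, h) line boxes
    recognizer -- callable mapping a cropped line image to a string
    """
    results = []
    for box in detector(image):
        x, y, w, h = box
        # Crop the detected line region out of the image.
        crop = [row[x:x + w] for row in image[y:y + h]]
        results.append((box, recognizer(crop)))
    return results
```

In an end-to-end model, `detector` and `recognizer` would instead share one feature extractor and be trained jointly, as the article notes.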

Text detection has progressed from detecting horizontal lines to handling arbitrary‑oriented text. Representative algorithms include CTPN, EAST, PMTD, and DB. Detection methods are divided into bounding‑box‑based and mask‑based approaches. Bounding‑box methods generate many candidate boxes via anchors and apply NMS, while mask‑based methods use segmentation networks to produce pixel‑level masks that are post‑processed into quadrilaterals.
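The NMS step mentioned for bounding-box methods can be sketched in a few lines: keep candidate boxes in descending score order and drop any box that overlaps an already-kept box beyond an IoU threshold. This is a simple greedy version, not the optimized implementation any particular detector ships with:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: return indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep this box only if it does not overlap any kept box too much.
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

Mask-based methods skip this step: their segmentation output is post-processed (e.g. via connected components) directly into quadrilaterals.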

For text recognition, two dominant families exist: CTC‑based CRNN and attention‑based encoder‑decoder models. CRNN combines CNN for feature extraction with RNN and uses CTC loss, allowing sequence prediction without explicit character segmentation. Attention models align image features with character sequences, focusing on relevant regions and improving accuracy, especially for long or complex texts.
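The key property of CTC, prediction without character segmentation, comes from its decoding rule: take the best class per timestep, collapse consecutive repeats, and remove the blank symbol. A minimal greedy decoder (real systems often use beam search instead) looks like this, assuming blank is class 0 and classes 1..N index into the character set:

```python
def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks.

    logits  -- list of per-timestep score lists, width = 1 + len(charset)
    charset -- string of characters; class k (k >= 1) maps to charset[k - 1]
    """
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], blank
    for idx in best:
        # Emit a character only on a transition to a new non-blank class.
        if idx != blank and idx != prev:
            out.append(charset[idx - 1])
        prev = idx
    return "".join(out)
```

The blank class is what lets CTC represent genuinely repeated characters: "aa" must appear as a-blank-a in the best path, as the test sequence below shows.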

End‑to‑end OCR merges detection and recognition into a single network that shares feature extraction, defining a loss as a weighted sum of detection and recognition errors. This reduces inference cost but increases training difficulty due to the differing nature of the two tasks.
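The weighted-sum loss can be written in one line; the training difficulty the article mentions is usually handled by scheduling the weights. The warm-up ramp below is a hypothetical schedule in the spirit of staged training (as in FOTS, referenced later in the QA), not iQIYI's actual recipe:

```python
def end_to_end_loss(det_loss, rec_loss, step, warmup_steps=1000, rec_weight=1.0):
    """Joint loss = detection loss + ramped, weighted recognition loss.

    The recognition term is phased in over warmup_steps so the shared
    backbone first stabilizes on detection (hypothetical schedule).
    """
    ramp = min(1.0, step / warmup_steps)
    return det_loss + rec_weight * ramp * rec_loss
```

With fixed weights (`ramp = 1`), this reduces to the plain weighted sum described above.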

Beyond pure text extraction, OCR‑based information extraction is needed for scenarios such as invoice processing, where relationships between text blocks must be identified.

Industrial applications of OCR include classic examples like the MNIST‑derived check‑number recognition, early use cases such as license‑plate, document, and bank‑card recognition, and newer, more general scenarios like natural‑scene text detection, online education (photo‑based question search), video subtitle extraction, and intelligent traffic analysis.

Open‑source frameworks for OCR development include PyTorch, TensorFlow, and PaddleOCR. For research, PyTorch and TensorFlow provide flexibility and community resources; for production, PaddleOCR offers a complete toolchain (data simulation, training, testing, deployment) and strong Chinese community support.

iQIYI’s internal OCR usage is extensive. The “smart subtitle analysis” service extracts subtitles from movies and variety shows in real time, feeding the results to downstream NLP modules for tagging and recommendation. Additional services include track‑board recognition, ad‑rights detection, end‑card detection, video‑text OCR for multilingual text, and specialized OCR for IDs, bank cards, and news headlines.

The OCR system at iQIYI has evolved through three stages: (1) Foundation stage (around 2017) – building basic OCR capabilities for subtitle search; (2) Development stage – optimizing speed and resource consumption to handle massive video volumes, achieving a 5‑minute processing time for a 40‑minute video; (3) Optimization stage – expanding coverage to more content types and improving generalization through auxiliary models (language classification, vertical text detection).

Key algorithms and models used include CTPN (horizontal text detection), PMTD (mask‑based detection), DB (differentiable binarization), CRNN (CTC‑based recognition), and attention‑based encoder‑decoder models. Each has strengths and weaknesses, e.g., CTPN excels at horizontal subtitles but may miss some lines; PMTD handles arbitrary orientations but struggles with dense tilted text; DB offers fast, adaptive thresholding.
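DB's "differentiable binarization" replaces the hard step function used to binarize the probability map with a steep sigmoid, B = 1 / (1 + exp(-k(P - T))), so the per-pixel threshold map T can be learned end to end. A minimal per-pixel sketch (the DB paper uses k = 50; real implementations apply this over whole tensors):

```python
import math

def db_binarize(prob, thresh, k=50.0):
    """Differentiable binarization: a steep sigmoid around a learned
    per-pixel threshold, approximating a hard 0/1 cut but with gradients."""
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))
```

Pixels well above the threshold map to ~1, pixels well below to ~0, and the crossover at `prob == thresh` gives exactly 0.5.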

Evaluation metrics: detection is measured by IoU‑based recall and precision; recognition is measured by whole‑line accuracy (exact match). Challenges such as diverse fonts, orientations, languages, and complex backgrounds are mitigated by large‑scale simulated data (including font styles, shadows, outlines) combined with real data, language‑specific detection, and careful handling of tilted text during training.
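The IoU-based detection metrics can be made concrete with a simple greedy matcher: each predicted box matches at most one unmatched ground-truth box at IoU ≥ 0.5, matched pairs count as true positives, and precision/recall follow. This is a generic sketch of the metric, not iQIYI's evaluation code:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def detection_pr(preds, gts, iou_thresh=0.5):
    """Precision and recall with greedy one-to-one matching at an IoU cutoff."""
    matched, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= iou_thresh:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```

Recognition accuracy, by contrast, is a simple exact-match count over whole lines, so no spatial matching is needed.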

Future directions at iQIYI include video‑text recognition and tracking, tighter integration with NLP for error correction, and porting OCR models to mobile devices to reduce backend load and improve user experience.

In the QA section, practical advice is given: end‑to‑end OCR frameworks require staged training to avoid loss oscillation (reference: FOTS); watermarked images benefit from pre‑removal or simulated training data; EAST is fast and orientation‑agnostic but lags behind newer methods; blurry text handling relies on simulated blur data and balanced training; common pitfalls when building OCR from scratch include defining the character set, annotation rules, optimization strategies, and ensuring sufficient training samples.

Tags: computer vision, AI, Deep Learning, OCR, iQIYI, PaddleOCR, text detection, text recognition