OCR Techniques and Solutions for Ctrip Business: Deep Learning Based Text Detection and Recognition
This article presents an overview of computer‑vision based OCR in Ctrip's operations, detailing deep‑learning text detection methods for controlled and uncontrolled scenarios, sequence‑based recognition models, training strategies with synthetic data, and performance results, while discussing current challenges and future improvements.
Author Introduction
Yuan Qiulong is an intern with the Ctrip Big Data AI R&D team, focusing on computer‑vision research and applications; his work during the internship has centered on OCR.
Overview
Computer vision aims to enable machines to "see" by using cameras and algorithms to recognize, track, and analyze objects. At Ctrip, computer‑vision techniques support supplier qualification, product upload, and product display, covering OCR/scene‑text recognition, image quality assessment, intelligent cropping, and object detection.
OCR serves two main purposes in Ctrip: (1) verification, such as checking business licenses and filtering products with sensitive words; and (2) data entry assistance, like automatically extracting license information.
OCR Fundamentals
OCR consists of two stages: text detection and text recognition. Classical detection methods include the Stroke Width Transform (SWT) and Maximally Stable Extremal Regions (MSER); deep‑learning detectors typically combine fully convolutional networks with recurrent networks (FCN+RNN). Recognition approaches fall into character‑based methods, which localize and classify individual characters using hand‑crafted features (e.g., DPM) or CNN‑extracted features, and sequence‑based methods, which transcribe whole text lines via CTC or Seq2Seq decoding.
Technical Solution for Ctrip
The solution follows a two‑stage pipeline: first detect text regions in images, then recognize the detected text.
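The two-stage flow can be sketched as follows. This is a minimal illustration, not Ctrip's code: `detect_text_regions`, `crop`, and `recognize_text` are hypothetical stand-ins for the actual detectors (CTPN/TextSnake) and the CRNN-style recognizer, stubbed with fixed values so the control flow is clear.

```python
# Sketch of the two-stage OCR pipeline: detect text boxes, then
# recognize each cropped region. All three workers are stubs.

from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)

def detect_text_regions(image) -> List[Box]:
    """Stage 1: locate text regions; stubbed with two fixed boxes."""
    return [(10, 20, 200, 32), (10, 60, 180, 30)]

def crop(image, box: Box):
    """Cut one text region out of the full image (stubbed)."""
    return (image, box)

def recognize_text(region) -> str:
    """Stage 2: transcribe one cropped region (stubbed)."""
    _, box = region
    return f"<line at y={box[1]}>"

def ocr(image) -> List[str]:
    """Detect every text box, then recognize each crop in turn."""
    return [recognize_text(crop(image, box)) for box in detect_text_regions(image)]

print(ocr(None))  # two recognized lines, one per detected box
```

In the real system each stub would wrap a trained model; keeping the stages behind simple function boundaries is what lets detection and recognition be trained and swapped independently.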
3.1 Deep‑Learning Based Text Detection
Scenarios are split into controlled (e.g., business licenses) and uncontrolled (e.g., product posters). For controlled scenes, the CTPN model is used; for uncontrolled scenes, TextSnake is adopted. Training follows a coarse‑to‑fine strategy: pre‑training on synthetic data followed by fine‑tuning on a small set of real samples. The CTPN model achieves an F1 score of 89% on license detection, while TextSnake reaches an F1 of 81% on poster detection.
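The coarse-to-fine idea (pre-train on plentiful synthetic data, then fine-tune on few real samples with a smaller learning rate) can be shown on a deliberately tiny stand-in problem. The one-parameter model and the data below are illustrative assumptions, not the detection networks themselves; the point is only the two-phase training schedule.

```python
# Coarse-to-fine training sketch: a one-parameter model y ≈ w * x is
# pre-trained on abundant synthetic data, then fine-tuned on a handful
# of "real" samples whose distribution differs slightly.

import random

def sgd_fit(w: float, data, lr: float, epochs: int) -> float:
    """Minimize squared error of y ≈ w * x by plain SGD."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

random.seed(0)
# Coarse phase data: large synthetic set drawn near the target (true w ≈ 2.0).
synthetic = [(x, 2.0 * x + random.gauss(0, 0.1))
             for x in (random.uniform(-1, 1) for _ in range(500))]
# Fine phase data: a few noiseless "real" samples with true w = 2.3.
real = [(x, 2.3 * x) for x in (-0.5, 0.2, 0.8)]

w = sgd_fit(0.0, synthetic, lr=0.1, epochs=3)   # coarse: pre-train on synthetic
w = sgd_fit(w, real, lr=0.02, epochs=50)        # fine: adapt at a lower learning rate
print(round(w, 2))                              # close to the real-data optimum
```

The fine-tuning phase inherits a good initialization from the synthetic phase, so a small real set is enough to shift the model toward the real distribution without starting from scratch.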
3.2 Sequence‑Based Text Recognition
Two architectures are used: CNN+LSTM+CTC and CNN+LSTM+Seq2Seq (with attention). Both employ a CNN for visual feature extraction and a bidirectional LSTM for contextual modeling; they differ in the transcription layer, which aligns features to characters either via CTC or via an attention‑based Seq2Seq decoder. A combined CTC‑attention model improves convergence speed while maintaining high accuracy. Training again follows the synthetic‑then‑real fine‑tuning approach.
The integrated OCR system (CTPN + recognition model) achieves up to 85% accuracy on full‑field extraction of critical information such as the Unified Social Credit Code, even under challenging conditions like stamps and reflections.
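A cheap way to harden extraction of a field like the Unified Social Credit Code is a format check on the recognizer's output. The sketch below assumes the standard 18-character format (digits plus uppercase letters excluding I, O, S, V, Z, per GB 32100-2015) and validates format only; the standard's mod-31 check digit is omitted for brevity.

```python
# Post-processing sanity filter for a recognized Unified Social Credit
# Code: 18 characters from digits and uppercase letters minus I,O,S,V,Z.

import re

USCC_PATTERN = re.compile(r"^[0-9A-HJ-NP-RTUWXY]{18}$")

def looks_like_uscc(text: str) -> bool:
    """True if the OCR output matches the code's basic format."""
    return bool(USCC_PATTERN.fullmatch(text))

print(looks_like_uscc("91310000MA1FL0000X"))  # a format-valid example string
```

Rejections from a filter like this can be routed back for re-recognition or manual review, which is one practical way to keep full-field accuracy high under stamps and reflections.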
Conclusion
Deep‑learning OCR models rely heavily on large, diverse datasets; synthetic data is crucial for both detection and recognition stages. Ongoing work focuses on generating more realistic synthetic samples and addressing remaining shortcomings in natural‑scene OCR services.
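On the label side, synthetic training data for the recognizer can start from nothing more than randomly generated code strings, which a renderer (drawing text onto backgrounds with varied fonts, blur, and stamp overlays) would turn into image/label pairs. The snippet below is a toy illustration of that first step only; the character set mirrors the credit-code alphabet and the renderer is out of scope.

```python
# Generate random license-style code strings as labels for synthetic
# recognition data; rendering them into images is a separate step.

import random

CHARSET = "0123456789ABCDEFGHJKLMNPQRTUWXY"  # digits + letters minus I,O,S,V,Z

def synth_code(rng: random.Random, length: int = 18) -> str:
    """One random label string of the given length."""
    return "".join(rng.choice(CHARSET) for _ in range(length))

rng = random.Random(42)           # fixed seed for a reproducible corpus
corpus = [synth_code(rng) for _ in range(1000)]
print(corpus[0])
```

Because labels are generated rather than annotated, the corpus size is limited only by rendering time, which is what makes the synthetic-then-real strategy viable for both stages.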
References
[1] Epshtein et al., "Detecting text in natural scenes with stroke width transform," CVPR 2010.
[2] Neumann & Matas, "Real‑time scene text localization and recognition," CVPR 2012.
[3] Tian et al., 2016.
[4] Shi et al., 2013.
[5] Jaderberg et al., 2016.
[6] He et al., 2016.
[7] Shi, Bai, & Yao, 2016.
[8] Lee & Osindero, 2016.
[9] Long et al., 2018.
[10] Kim et al., 2016.
Ctrip Technology
The official Ctrip Technology account: sharing, exchange, and growth.