
Didi's Machine Translation System: Architecture, Techniques, and WMT2020 Competition Experience

Didi's machine translation system combines a Transformer‑big architecture with relative position representations, enlarged feed‑forward networks, iterative back‑translation, knowledge distillation, and domain fine‑tuning, and is optimized with TensorRT for fast inference. It achieved a BLEU score of 36.6 and third place in the WMT2020 Chinese‑to‑English news translation task.

Didi Tech

This article introduces Didi's machine translation (MT) system, which leverages deep learning to translate large volumes of text between languages. It first outlines the background of MT, emphasizing the transition from Statistical Machine Translation (SMT) to Neural Machine Translation (NMT) and the role of large-scale data, GPU acceleration, and advanced linguistic techniques.

1. Background

MT services convert source text into a target language using deep learning models. The evolution of commercial MT has moved from SMT, which relies on phrase‑based statistical models, to NMT, which uses encoder‑decoder neural networks such as the Transformer.

1.1 Statistical Machine Translation (SMT)

SMT employs statistical analysis of bilingual corpora to map source phrases to target phrases and uses language models to select the most probable translation. It was the first widely commercialized MT approach.
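The phrase‑table‑plus‑language‑model idea can be sketched in a few lines. Everything below is invented toy data for illustration; real SMT systems use huge phrase tables, n‑gram language models, and beam‑search decoding rather than this greedy loop.

```python
# Toy illustration of the SMT idea: a phrase table maps source phrases to
# candidate target phrases with translation probabilities, and a language
# model score picks the most fluent combination. All data here is invented.

import math

# Hypothetical phrase table: source phrase -> [(target phrase, P(t|s)), ...]
PHRASE_TABLE = {
    "你好": [("hello", 0.7), ("hi there", 0.3)],
    "世界": [("world", 0.9), ("the world", 0.1)],
}

# Hypothetical unigram language-model log-probabilities for target words.
LM_LOGPROB = {"hello": -1.0, "hi": -2.0, "there": -2.5, "world": -1.2, "the": -0.8}

def lm_score(sentence):
    """Sum of unigram log-probabilities (a stand-in for a real n-gram LM)."""
    return sum(LM_LOGPROB.get(w, -10.0) for w in sentence.split())

def translate(source_phrases):
    """Greedy decoding: for each source phrase, pick the candidate that
    maximizes translation log-prob plus language-model log-prob."""
    output = []
    for phrase in source_phrases:
        best = max(
            PHRASE_TABLE[phrase],
            key=lambda cand: math.log(cand[1]) + lm_score(cand[0]),
        )
        output.append(best[0])
    return " ".join(output)

print(translate(["你好", "世界"]))  # "hello world"
```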

1.2 Neural Machine Translation (NMT)

NMT encodes the source sentence with a deep neural network and decodes it into the target language. Google's GNMT system (Wu et al., 2016) marked the industry‑wide shift to NMT, delivering substantially higher translation quality than SMT.

2. Evaluation Metric (BLEU)

BLEU (Bilingual Evaluation Understudy) measures n‑gram overlap between system output and reference translations; higher scores indicate translations closer to human quality. The standard formulation is BLEU = BP · exp(∑ₙ wₙ log pₙ), where pₙ is the modified (clipped) n‑gram precision, wₙ = 1/N (typically N = 4), and the brevity penalty BP = min(1, exp(1 − r/c)) discounts candidates shorter than the reference (r is the reference length, c the candidate length).
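A minimal sentence‑level version of this computation can be written directly from the formula. Real evaluations (including WMT) use corpus‑level BLEU with standardized tokenization and smoothing, e.g. via sacrebleu, so treat this as illustrative only.

```python
# Minimal sketch of sentence-level BLEU with up to 4-gram precision and the
# brevity penalty, following the formula in the text.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_ngrams.values())
        if total == 0:
            return 0.0
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if matches == 0:
            return 0.0
        log_prec_sum += math.log(matches / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec_sum)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```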

3. Transformer

The Transformer architecture, now the dominant NMT model, consists of a 6‑layer encoder and a 6‑layer decoder. Each encoder layer contains multi‑head self‑attention and a feed‑forward network (FFN); each decoder layer adds a masked multi‑head attention sub‑layer. Didi's implementation uses a "Transformer‑big" configuration (6 encoder & decoder layers, hidden size 1024, FFN size 4096, 16 attention heads).
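The Transformer‑big configuration above can be summarized as a small settings table, with a rough per‑layer weight estimate. The exact parameter counts depend on the implementation (biases, embeddings, layer norm), so the numbers below are ballpark only.

```python
# The "Transformer-big" hyperparameters described above, with a rough
# per-encoder-layer parameter estimate.

TRANSFORMER_BIG = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "d_model": 1024,      # hidden size
    "d_ffn": 4096,        # feed-forward inner size
    "num_heads": 16,
    "dropout": 0.3,       # the rate Didi used with enlarged FFNs
}

def head_dim(cfg):
    # Each attention head operates on d_model / num_heads dimensions.
    assert cfg["d_model"] % cfg["num_heads"] == 0
    return cfg["d_model"] // cfg["num_heads"]

def approx_encoder_layer_params(cfg):
    d, f = cfg["d_model"], cfg["d_ffn"]
    attention = 4 * d * d          # Q, K, V, and output projections
    ffn = 2 * d * f                # two linear maps: d -> f -> d
    return attention + ffn

print(head_dim(TRANSFORMER_BIG))                     # 64
print(approx_encoder_layer_params(TRANSFORMER_BIG))  # ~12.6M weights per layer
```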

3.1 Relative Position Representations

Traditional Transformers use absolute positional embeddings. Shaw et al. introduced relative position representations, which improve convergence and translation quality. Didi incorporated this technique into their models.
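The core of Shaw et al.'s scheme is that each token pair (i, j) is mapped to an embedding indexed by their relative distance, clipped to a maximum of ±k so that one small table covers arbitrarily long sequences. A sketch of that indexing (k = 2 here is an arbitrary illustrative choice):

```python
# Sketch of Shaw et al.'s relative position indexing: the distance j - i
# between two tokens is clipped to [-k, k], so a table of 2k + 1 embeddings
# covers all token pairs regardless of sequence length.

def relative_position_index(i, j, k=16):
    """Map token pair (i, j) to an index into a table of 2k + 1 embeddings."""
    distance = max(-k, min(k, j - i))  # clip to [-k, k]
    return distance + k                # shift into [0, 2k]

# Indices for query position 0 attending over a 5-token sequence with k = 2;
# positions beyond the clip distance share the same embedding.
print([relative_position_index(0, j, k=2) for j in range(5)])  # [2, 3, 4, 4, 4]
```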

3.2 Larger FFN Size

Increasing the FFN dimension (e.g., to 8192 or 15000) boosts model capacity. To mitigate over‑fitting, Didi set a dropout rate of 0.3.

4. Didi Translation Practice

4.1 Data Preparation

High‑quality parallel corpora are essential. Didi collects raw bilingual data, filters it using language‑model and alignment scores, and applies data‑augmentation techniques such as back‑translation. Iterative back‑translation generates synthetic parallel data by alternating source‑to‑target and target‑to‑source models.
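The iterative back‑translation loop can be sketched schematically: a reverse (target→source) model translates monolingual target text into synthetic source sentences, the forward model retrains on real plus synthetic pairs, and the roles alternate. The `train` and `reverse_translate` functions below are stand‑ins for real model training and decoding, used only to show the data flow.

```python
# Schematic of iterative back-translation: monolingual target sentences are
# paired with synthetic sources from a reverse model, and the forward model
# retrains on the combined corpus each round.

def backtranslate(monolingual_target, reverse_translate):
    """Pair each monolingual target sentence with a synthetic source."""
    return [(reverse_translate(t), t) for t in monolingual_target]

def iterative_back_translation(parallel, mono_tgt, train, reverse_translate,
                               rounds=2):
    model = train(parallel)
    for _ in range(rounds):
        synthetic = backtranslate(mono_tgt, reverse_translate)
        # Retrain on real data plus synthetic pairs; in the full recipe the
        # reverse model is also retrained each round with roles swapped.
        model = train(parallel + synthetic)
    return model

# Toy stand-ins just to exercise the loop:
toy_train = lambda data: {"num_pairs": len(data)}
toy_reverse = lambda t: "src(" + t + ")"
model = iterative_back_translation([("s1", "t1")], ["t2", "t3"],
                                   toy_train, toy_reverse)
print(model)  # {'num_pairs': 3}
```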

4.2 Model Training

Training strategies include:

Alternating knowledge distillation: an ensemble of teacher models generates synthetic data for a student model, and the process repeats to improve single‑model performance.

Fine‑tuning: a base model is adapted to specific domains (e.g., international ride‑hailing messages) with a small amount of domain data.

Ensemble: multiple models with different seeds, architectures, and data are combined via probability voting during inference.
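The probability‑voting step of the ensemble can be sketched as averaging the next‑token distributions from each model at every decoding step. The distributions below are invented; real decoders combine full‑vocabulary softmax outputs inside beam search.

```python
# Sketch of ensemble decoding by probability voting: the next-token
# distributions from several models are averaged, and decoding follows
# the averaged distribution.

def ensemble_step(distributions):
    """Average a list of {token: prob} dicts and return the best token."""
    vocab = set().union(*distributions)
    avg = {tok: sum(d.get(tok, 0.0) for d in distributions) / len(distributions)
           for tok in vocab}
    return max(avg, key=avg.get)

model_a = {"cat": 0.6, "dog": 0.4}
model_b = {"cat": 0.3, "dog": 0.7}
model_c = {"cat": 0.5, "dog": 0.5}
# Averages: cat = 1.4/3 ~ 0.467, dog = 1.6/3 ~ 0.533
print(ensemble_step([model_a, model_b, model_c]))  # dog
```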

4.3 Model Prediction and Acceleration

During deployment, model weights are fixed, allowing graph optimizations such as TensorRT fusion and FP16 quantization. Didi's TensorRT‑optimized Transformer runs up to 9× faster than the original TensorFlow implementation.

5. WMT2020 Machine Translation Competition

Didi participated in the WMT2020 News Translation shared task (Chinese→English). Using the Transformer‑big backbone with self‑attention, relative position attention, larger FFN, iterative back‑translation, and alternating knowledge distillation, Didi achieved a BLEU score of 36.6, ranking third overall. The detailed system description is available in an arXiv pre‑print (https://arxiv.org/abs/2010.08185).

References

Wu et al., "Google's neural machine translation system", arXiv:1609.08144, 2016.

Papineni et al., "BLEU: a method for automatic evaluation of machine translation", ACL, 2002.

Vaswani et al., "Attention is all you need", NeurIPS, 2017.

Shaw et al., "Self‑attention with relative position representations", arXiv:1803.02155, 2018.

Zhang et al., "Parallel Corpus Filtering via Pre‑trained Language Models", arXiv:2005.06166, 2020.

Edunov et al., "Understanding back‑translation at scale", arXiv:1808.09381, 2018.

Written by Didi Tech, the official Didi technology account.