Didi's Machine Translation System: Architecture, Techniques, and WMT2020 Competition Experience
This article presents a comprehensive overview of Didi's machine translation platform, covering its evolution from statistical to neural models, the Transformer architecture with relative position and larger FFN, data preparation, training strategies such as back‑translation and knowledge distillation, deployment optimizations with TensorRT, and the team's successful participation in the WMT2020 news translation task.
Introduction – Didi's machine translation service uses deep learning to convert large volumes of text between languages, supporting both international ride‑hailing and driver‑passenger communication. The article outlines the overall framework, principles, and Didi's participation in the WMT2020 competition.
Background
Machine translation (MT) originally relied on Statistical Machine Translation (SMT), which learns phrase‑level translations from bilingual corpora and uses language models to select the best output. Since 2016, Neural Machine Translation (NMT) based on deep neural networks, exemplified by Google's GNMT, has become dominant, offering substantially higher quality.
Evaluation Metric (BLEU)
BLEU (Bilingual Evaluation Understudy) measures n‑gram overlap between system output and reference translations, applying a brevity penalty and geometric averaging. Higher BLEU scores indicate translations closer to human quality.
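The computation can be sketched in Python. This is a minimal sentence-level variant without smoothing, shown only to make the metric concrete; the function names are illustrative, and real evaluations use an established scorer such as sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions, times a brevity penalty (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum((cand & ref).values())      # clipped matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Geometric averaging of the n-gram precisions
    geo = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * geo
```

A perfect match scores 1.0, while a shortened candidate is discounted by the brevity penalty even if every n-gram it contains is correct.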
Transformer Architecture
NMT systems now commonly adopt the Transformer as the encoder‑decoder backbone: a stack of six identical encoder layers (each combining multi‑head self‑attention with a position‑wise feed‑forward network) paired with six decoder layers (masked multi‑head self‑attention, encoder‑decoder attention, and a feed‑forward network).
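The attention computation at the heart of every layer can be sketched with NumPy. This is a single-head toy version (real implementations are multi-head and batched); the causal mask reproduces the decoder's masked attention, which prevents a position from attending to later positions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(QK^T / sqrt(d_k)) V. `mask` entries set to True are
    blocked, which is how the decoder hides future positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Self-attention over a toy sequence of 4 positions, d_model = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
causal = np.triu(np.ones((4, 4), dtype=bool), k=1)  # block the future
out, w = scaled_dot_product_attention(x, x, x, mask=causal)
```

Dropping the mask gives encoder-style self-attention; replacing K and V with encoder outputs gives the decoder's encoder-decoder attention.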
Relative position representations enhance the attention mechanism by incorporating positional relationships, leading to faster convergence and better performance. Larger feed‑forward network (FFN) sizes (e.g., 8,192 or 15,000 dimensions) further improve capacity, with dropout (0.3) mitigating over‑fitting.
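Following Shaw et al. (2018), the relative‑position variant adds learned embeddings for clipped pairwise distances to both the attention logits and the value aggregation:

```latex
% Attention logits with a relative-position term:
e_{ij} = \frac{x_i W^Q \left( x_j W^K + a^K_{ij} \right)^{\top}}{\sqrt{d_z}}
% The embeddings depend only on the clipped relative distance:
a^K_{ij} = w^K_{\mathrm{clip}(j-i,\,k)}, \qquad
\mathrm{clip}(x, k) = \max\!\big(-k, \min(k, x)\big)
% Values receive an analogous shift when aggregated:
z_i = \sum_{j} \alpha_{ij} \left( x_j W^V + a^V_{ij} \right)
```

Because distances beyond the clipping window $k$ share one embedding, the number of extra parameters stays small regardless of sequence length.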
Didi Translation Practice
Data Preparation – Parallel bilingual corpora are essential. Didi filters raw web‑crawled data using language‑model and alignment scores, then augments data via back‑translation and iterative back‑translation, generating high‑quality synthetic pairs.
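A heavily simplified version of such a filtering pass might look as follows. The length-ratio scorer here is a crude stand-in for the language-model and alignment scores the article mentions, and every name and threshold is illustrative, not Didi's actual pipeline.

```python
def length_ratio_score(src, tgt):
    """Proxy for an alignment score: strongly mismatched lengths
    usually signal misaligned sentence pairs."""
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b)

def filter_corpus(pairs, min_ratio=0.5, min_len=3, max_len=100):
    """Keep pairs that pass length bounds and a ratio threshold; a
    real pipeline would score fluency with a language model and
    adequacy with a word-alignment model instead."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        if length_ratio_score(src, tgt) < min_ratio:
            continue
        kept.append((src, tgt))
    return kept
```

Back-translation then runs a target-to-source model over filtered monolingual text to synthesize additional source sides, and iterating the process with improved models sharpens the synthetic data further.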
Model Training – Techniques include alternating knowledge distillation (using ensemble teachers to guide student models), fine‑tuning on domain‑specific data, and diverse ensemble training (different seeds, parameters, Transformer variants, and data subsets).
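The distillation step can be sketched as follows: the student is trained toward the averaged output distribution of an ensemble of teachers at each target position. Function names and shapes are illustrative; a real system computes this loss inside the training graph alongside the usual cross-entropy.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_teacher_probs(teacher_logits_list, T=1.0):
    """'Ensemble teacher': average the output distributions of
    several independently trained teacher models."""
    return np.mean([softmax(l, T) for l in teacher_logits_list], axis=0)

def word_level_kd_loss(student_logits, teacher_probs, T=1.0):
    """Cross-entropy between the teacher distribution and the
    student's, averaged over target positions."""
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)
    return float(-(teacher_probs * log_p_s).sum(axis=-1).mean())
```

The loss is minimized when the student reproduces the teacher distribution exactly, so alternating which models play teacher and student lets each round transfer the ensemble's knowledge into a single deployable model.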
Model Prediction – Deployed models have fixed weights, allowing graph optimizations and low‑precision inference (FP16) via TensorRT, which yields up to a 9× speedup over native TensorFlow.
WMT2020 Machine Translation Competition
The WMT workshop is the premier evaluation campaign for MT. Didi participated in the news translation shared task (Chinese→English), employing a Transformer‑big base with self‑attention, relative positional attention, larger FFN, iterative back‑translation, and alternating knowledge distillation. The system achieved a BLEU score of 36.6, earning third place.
Relevant papers are available on arXiv (e.g., https://arxiv.org/abs/2010.08185) and the references include works on GNMT, BLEU, the original Transformer, relative position representations, parallel‑corpus filtering, and large‑scale back‑translation.
References
Wu et al., "Google's neural machine translation system," arXiv:1609.08144, 2016.
Papineni et al., "BLEU: a method for automatic evaluation of machine translation," ACL, 2002.
Vaswani et al., "Attention is all you need," NeurIPS, 2017.
Shaw et al., "Self‑attention with relative position representations," arXiv:1803.02155, 2018.
Zhang et al., "Parallel Corpus Filtering via Pre‑trained Language Models," arXiv:2005.06166, 2020.
Edunov et al., "Understanding back‑translation at scale," arXiv:1808.09381, 2018.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.