
Industrial-Scale Machine Translation at ByteDance: Applications, Demos, and Research Advances

This article presents ByteDance's industrial machine‑translation platform, describing its global deployment, diverse product demos, underlying sequence‑to‑sequence models, BERT‑enhanced training strategies, Prune‑Tune sparsity techniques, multilingual pre‑training, document translation, and a high‑performance inference engine.

DataFunTalk

Speaker: Dr. Wang Mingxuan, Algorithm Scientist at Bytedance.

Introduction: Machine translation has become a core industrial service, breaking language barriers for billions of users worldwide. ByteDance, with its global products such as TikTok, leverages large‑scale MT to serve users across more than 5,000 language directions.

Background: The talk outlines two perspectives: (1) enterprise‑grade MT services that translate content for global users, and (2) novel algorithms developed for massive MT deployments, including pre‑training, multilingual learning, and multimodal translation.

Demo 1 – XiaomingBot: A cross‑language multimodal chatbot that combines vision, speech, and text technologies.

Demo 2 – Multilingual Media Commentary: Automatic generation of sports commentary and news articles in multiple languages, with a virtual anchor that synchronises lip‑movement across languages.

Office Translation Scenarios: Real‑time IM translation, video‑conference subtitles, and email translation within Lark, enabling seamless multilingual collaboration.

Product‑Level Applications: TikTok subtitle generation, live‑stream translation for global artists, and large‑scale live events reaching millions of viewers.

Machine Translation Fundamentals: MT is a conditional sequence‑generation task modeled as a probability distribution p(target|source). Modern transformers use self‑attention on source and target sides and cross‑attention for alignment.
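The conditional formulation above factorizes autoregressively: p(y|x) = ∏ₜ p(yₜ | y<ₜ, x), where each step conditions on the source (via cross‑attention) and the target prefix (via self‑attention). A minimal sketch of that scoring loop, using a dummy stand‑in for a trained decoder's per‑step distribution:

```python
import math

# MT as conditional sequence generation: p(y | x) = prod_t p(y_t | y_<t, x).
# `step_probs` is a hypothetical stand-in for a trained decoder; a real
# model would compute it with self-attention over `prefix` and
# cross-attention over `source`.

VOCAB = ["<s>", "hello", "world", "</s>"]

def step_probs(source, prefix):
    """Dummy next-token distribution over a tiny vocabulary."""
    target = ["hello", "world", "</s>"]
    nxt = target[len(prefix) - 1] if len(prefix) - 1 < len(target) else "</s>"
    return {tok: (0.7 if tok == nxt else 0.1) for tok in VOCAB}

def sequence_log_prob(source, target):
    """Accumulate log p(y_t | y_<t, x) step by step."""
    prefix = ["<s>"]
    logp = 0.0
    for tok in target:
        probs = step_probs(source, prefix)
        logp += math.log(probs[tok])
        prefix.append(tok)
    return logp

lp = sequence_log_prob(["bonjour", "monde"], ["hello", "world", "</s>"])
```

Beam search and greedy decoding both build on exactly this step‑wise scoring.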

Multimodal MT Goal: Build a universal MT system that can ingest speech, documents, and images, encoding them on the source side and generating text on the target side.

Knowledge Transfer from Texts: To overcome limited parallel data, we explore semi‑supervised and pre‑training methods that exploit sentence‑level, document‑level, and multimodal data.

Maximising BERT for NMT: We address catastrophic forgetting during fine‑tuning by (1) controlling learning rates, (2) freezing BERT initially, then jointly tuning, and (3) using dynamic gates to balance BERT and NMT contributions. These strategies yield ~3% BLEU improvements.
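The dynamic‑gate idea in strategy (3) can be sketched as a learned sigmoid gate that mixes the BERT and NMT encoder states per position. The shapes and gate parameterisation below are illustrative assumptions, not ByteDance's exact formulation:

```python
import numpy as np

# Gated fusion of BERT and NMT encoder outputs (illustrative sketch):
# g = sigmoid([h_bert; h_nmt] W_g + b_g); out = g * h_bert + (1 - g) * h_nmt.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_bert, h_nmt, W_g, b_g):
    """Per-position convex combination of the two representations."""
    g = sigmoid(np.concatenate([h_bert, h_nmt], axis=-1) @ W_g + b_g)
    return g * h_bert + (1.0 - g) * h_nmt

rng = np.random.default_rng(0)
T, d = 5, 8                         # sequence length, hidden size (assumed)
h_bert = rng.normal(size=(T, d))    # frozen/slowly-tuned BERT states
h_nmt = rng.normal(size=(T, d))     # NMT encoder states
W_g = rng.normal(size=(2 * d, d)) * 0.1
b_g = np.zeros(d)
fused = gated_fusion(h_bert, h_nmt, W_g, b_g)
```

Because the gate is learned jointly with the NMT objective, the model can lean on BERT where it helps and fall back to the NMT encoder where it does not.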

Prune‑Tune: Inspired by the lottery ticket hypothesis, we identify sparse subnetworks in BERT/NMT that retain most upstream knowledge while being fine‑tuned on downstream MT tasks, achieving stable and robust gains across data scales.
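The mechanics can be sketched with magnitude pruning: keep the largest‑magnitude weights as a frozen "general‑knowledge" subnetwork, and let only the freed slots receive downstream updates. The sparsity level and toy gradient step are assumptions for demonstration:

```python
import numpy as np

# Prune-Tune sketch: magnitude-prune to find a sparse subnetwork that
# preserves upstream knowledge, freeze it, and fine-tune only the
# remaining (pruned-away) capacity on the downstream task.

def magnitude_mask(W, sparsity=0.5):
    """mask == 1 marks the largest-magnitude weights to keep frozen."""
    k = int(W.size * (1.0 - sparsity))
    thresh = np.sort(np.abs(W), axis=None)[::-1][k - 1]
    return (np.abs(W) >= thresh).astype(W.dtype)

def prune_tune_step(W, mask, grad, lr=0.1):
    """Gradient update applied only outside the frozen subnetwork."""
    return W - lr * grad * (1.0 - mask)

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))
mask = magnitude_mask(W, sparsity=0.5)
W_new = prune_tune_step(W, mask, grad=np.ones_like(W))
```

Since the frozen subnetwork never moves, the general-domain behaviour is retained exactly, which is what makes the gains stable across data scales.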

Multilingual Pre‑Training: We train a universal encoder‑decoder on 32 language pairs and fine‑tune on 48 downstream tasks, including zero‑resource pairs, achieving up to +10 BLEU on low‑resource languages and SOTA results on rich‑resource pairs.
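One standard ingredient that lets a single encoder‑decoder serve many pairs (including zero‑resource directions) is prepending language tokens, so the model learns which direction is requested. The token naming below is illustrative; mRASP‑style systems use a similar convention:

```python
# Language tagging for a universal multilingual encoder-decoder:
# the same parameters serve every direction, selected by the tags.

def tag_pair(src_tokens, tgt_tokens, src_lang, tgt_lang):
    """Prepend language tags so one model handles all directions."""
    src = [f"<{src_lang}>"] + src_tokens
    tgt = [f"<{tgt_lang}>"] + tgt_tokens
    return src, tgt

src, tgt = tag_pair(["bonjour"], ["hello"], "fr", "en")
```

At fine‑tuning time, a zero‑resource pair is requested simply by combining tags the model has seen separately during pre‑training.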

Document MT: Extending the approach to long documents (up to 2,000 characters) with specialized pre‑training and fine‑tuning, delivering high‑quality translations for enterprise documents.

Speech‑to‑Text MT: Decoupled end‑to‑end speech translation pipelines that first supervise acoustic modeling, then semantic encoding, and finally MT decoding, leveraging BERT knowledge for state‑of‑the‑art performance.
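The decoupling described above can be pictured as three individually supervised stages chained into one pipeline. The stage functions here are placeholders for the actual models, shown only to make the staged structure concrete:

```python
# Decoupled speech-to-text MT pipeline sketch (stages are stand-ins).

def acoustic_stage(audio_frames):
    """Stand-in: map audio frames to phoneme-like units."""
    return [f"ph{i}" for i, _ in enumerate(audio_frames)]

def semantic_stage(units):
    """Stand-in: build a semantic representation of the utterance
    (the stage where BERT-derived knowledge would be injected)."""
    return " ".join(units)

def translate_stage(semantic_repr):
    """Stand-in: decode target-language text."""
    return f"<translation of [{semantic_repr}]>"

def speech_to_text_mt(audio_frames):
    return translate_stage(semantic_stage(acoustic_stage(audio_frames)))

out = speech_to_text_mt([0.1, 0.2])
```

Supervising each stage separately before joint training is what distinguishes this from fully end‑to‑end speech translation.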

Imagination‑Based MT: A novel pipeline that generates imagined images from source text, encodes both text and imagined visual features, and translates, achieving comparable results to explicit multimodal models without requiring image inputs.
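The key trick is that the visual feature is predicted from the text rather than read from a paired image, so no image is needed at inference time. A minimal sketch, where the linear "imagination" mapping and all dimensions are illustrative assumptions:

```python
import numpy as np

# Imagination-based MT sketch: a learned mapping "imagines" a visual
# feature vector from the source-text representation; the decoder then
# attends over both text and imagined visual features.

rng = np.random.default_rng(2)
d_text, d_vis = 16, 8               # assumed hidden sizes

def imagine_visual(h_text, W_img):
    """Map pooled text features to an imagined visual feature vector."""
    return np.tanh(h_text.mean(axis=0) @ W_img)

def fuse_for_decoder(h_text, v_img):
    """Append the imagined visual feature as one extra 'token' (zero-padded
    to the text width) for the decoder's cross-attention to attend to."""
    v_tok = np.concatenate([v_img, np.zeros(h_text.shape[1] - v_img.shape[0])])
    return np.vstack([h_text, v_tok])

h_text = rng.normal(size=(5, d_text))        # 5 source-token states
W_img = rng.normal(size=(d_text, d_vis)) * 0.1
v = imagine_visual(h_text, W_img)
memory = fuse_for_decoder(h_text, v)
```

During training, the imagination module would be supervised against real image features from multimodal data; at test time only text is required.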

Efficient Inference – LightSeq: To serve heavy multilingual models, we released a fast decoder that speeds up inference by an order of magnitude over native TensorFlow, enabling real‑time MT for billions of requests.
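Much of LightSeq's speed-up comes from fused CUDA kernels, but one algorithmic idea any fast decoder relies on is incremental state caching: reusing key/value projections from earlier steps instead of recomputing them for the whole prefix at every step. A pure-Python cost-model sketch (the counts model projection computations only and are illustrative):

```python
# Cost of key/value projection work during autoregressive decoding.

def naive_decode_cost(steps):
    """Re-projecting the full prefix at every step: O(T^2) total."""
    return sum(t for t in range(1, steps + 1))

def cached_decode_cost(steps):
    """With cached key/value states, only the newest token is
    projected at each step: O(T) total."""
    return steps

naive = naive_decode_cost(100)    # 5050 projection computations
cached = cached_decode_cost(100)  # 100 projection computations
```

At production batch sizes and sequence lengths, eliminating this redundant work (plus kernel fusion and reduced memory traffic) is what yields order-of-magnitude latency gains.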

Overall, the presentation showcases how ByteDance combines research innovations—BERT‑enhanced NMT, sparsity‑driven fine‑tuning, multilingual pre‑training, multimodal imagination, and ultra‑fast inference—to power its global translation services.

Tags: AI applications, pretraining, BERT, machine translation, multilingual NLP, ByteDance
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
