
Enterprise Applications and Research of Speech Translation

This article reviews recent advances in speech translation, discusses ByteDance's practical deployments, compares cascade and end‑to‑end modeling approaches, introduces improved encoder‑decoder architectures and training strategies, and reports state‑of‑the‑art results on the IWSLT 2021 benchmark.

DataFunSummit

In recent years, end‑to‑end speech translation (ST) has achieved notable progress, yet it is still not ready for large‑scale industrial use because of challenges such as audio processing, hot‑word intervention, error analysis, and alignment issues. The talk is organized around four main topics.

1. Overview of Speech Translation – ST converts spoken language into text in another language, and can also produce spoken output directly. It aims to break language barriers for applications like automatic subtitles on YouTube, real‑time conference interpretation, and travel translation devices.

2. ByteDance Applications – ByteDance leverages its Volcano Translation service internally (e.g., Feishu message, image, and document translation) and externally (multi-language subtitles for user-uploaded videos). Newly launched AR smart-translation glasses provide real-time subtitles, face-to-face translation, and photo translation for travel scenarios.

3. Modeling Methods

Cascade ST – chains an automatic speech recognition (ASR) system with a machine translation (MT) system. Its advantages are access to large ASR and MT corpora and independent optimization of each module; its drawbacks are error propagation from ASR into MT and higher computational cost.
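The cascade design can be sketched as a simple composition of two stages. Both `asr()` and `mt()` below are hypothetical toy stand-ins for real ASR and MT systems, used only to show how an ASR error passes unchanged into MT:

```python
def asr(audio_features):
    """Hypothetical ASR stage: audio features -> source-language text."""
    # A real system would run acoustic and language models here.
    lookup = {(0.1, 0.9): "hello world"}
    return lookup.get(audio_features, "<unk>")

def mt(source_text):
    """Hypothetical MT stage: source-language text -> target-language text."""
    table = {"hello world": "bonjour le monde", "<unk>": "<unk>"}
    return table.get(source_text, "<unk>")

def cascade_st(audio_features):
    # Whatever ASR outputs (including errors) is fed verbatim to MT:
    # this coupling is the source of error propagation.
    return mt(asr(audio_features))

print(cascade_st((0.1, 0.9)))  # bonjour le monde
print(cascade_st((0.5, 0.5)))  # <unk> — an ASR miss becomes an MT miss
```

Each stage can be swapped or retrained independently, which is why the cascade remains attractive in production despite the error coupling.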

End‑to‑End ST – a unified encoder‑decoder model that directly maps audio to target‑language text, typically built on the Transformer architecture. Audio is first transformed into features (e.g., mel‑spectrograms) and down‑sampled with CNNs before being fed to the Transformer encoder, reducing error propagation and simplifying deployment.
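The front-end down-sampling can be illustrated with a toy strided 1-D convolution over mel-spectrogram frames. The averaging filter weights are an assumption for illustration; real models learn these, but the sequence-length arithmetic is the same:

```python
import numpy as np

def conv1d_downsample(x, stride=2, kernel=3):
    """Strided 1-D convolution along time (toy averaging weights),
    shortening the frame sequence before the Transformer encoder."""
    T, d = x.shape
    out_T = (T - kernel) // stride + 1
    w = np.ones((kernel, d)) / kernel  # toy filter; learned in practice
    return np.stack([(x[t * stride : t * stride + kernel] * w).sum(axis=0)
                     for t in range(out_T)])

# 100 mel-spectrogram frames of 80 dims; two stride-2 layers cut the
# time axis roughly 4x, which makes Transformer self-attention cheaper.
mel = np.random.rand(100, 80)
h = conv1d_downsample(conv1d_downsample(mel))
print(mel.shape, "->", h.shape)  # (100, 80) -> (24, 80)
```

Shortening the sequence this way is standard practice because self-attention cost grows quadratically with the number of frames.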

4. Better End‑to‑End Models

LUT (Listen‑Understand‑Translate, AAAI 2021a) – adds a semantic encoder supervised by ASR transcripts and a pretrained BERT model, enabling the acoustic encoder to learn richer semantic representations.
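The extra supervision LUT adds can be sketched as a distillation-style loss: the semantic encoder's output is pushed toward embeddings from a pretrained text model. `bert_embed` below is a hypothetical stand-in for frozen BERT sentence embeddings, and the MSE objective is an illustrative simplification:

```python
import numpy as np

def bert_embed(transcript):
    """Hypothetical stand-in for a frozen BERT sentence embedding."""
    rng = np.random.default_rng(abs(hash(transcript)) % (2**32))
    return rng.normal(size=16)

def semantic_distill_loss(semantic_encoder_out, transcript):
    """MSE between the model's semantic representation of the audio and
    the text-side target derived from the ASR transcript."""
    target = bert_embed(transcript)
    return float(np.mean((semantic_encoder_out - target) ** 2))

# When the semantic encoder reproduces the text embedding exactly,
# the auxiliary loss vanishes.
loss = semantic_distill_loss(bert_embed("hello world"), "hello world")
print(loss)  # 0.0
```

The point is that the acoustic encoder receives a text-grounded training signal in addition to the translation loss, which is what lets it learn richer semantic representations.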

Chimera (ACL 2021) – introduces a shared semantic projection that maps both audio and text into a common space, trained with a contrastive loss to align modalities.
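The modality-alignment objective can be sketched as an InfoNCE-style contrastive loss over a batch of paired audio/text embeddings. The implementation below is a minimal numpy sketch, not Chimera's exact formulation; the temperature value and embedding sizes are assumptions:

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.1):
    """InfoNCE-style loss: pull each (audio, text) pair together in the
    shared space, push apart mismatched pairs within the batch."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Diagonal entries correspond to the true (audio, text) pairs.
    return float(-np.log(np.diag(probs)).mean())

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
# Audio embeddings close to their paired text yield a low loss...
aligned = contrastive_loss(text + 0.01 * rng.normal(size=(4, 16)), text)
# ...while unrelated audio embeddings yield a high one.
misaligned = contrastive_loss(rng.normal(size=(4, 16)), text)
print(aligned, misaligned)
```

Driving this loss down forces audio and text representations of the same utterance into the same region of the shared space, which is precisely the cross-modal alignment Chimera targets.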

COSTT (AAAI 2021b) – a continuous‑generation decoder that first emits an ASR token sequence and then the translation, effectively acting as a bilingual language model.
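The continuous-generation idea amounts to training the decoder on a single concatenated target sequence. The token strings and separator below are hypothetical stand-ins used only to show the sequence layout:

```python
def costt_target(transcript_tokens, translation_tokens, sep="<sep>"):
    """Build the single target sequence a COSTT-style decoder is trained
    on: source transcript first, then a separator, then the translation."""
    return transcript_tokens + [sep] + translation_tokens

target = costt_target(["hello", "world"], ["bonjour", "le", "monde"])
print(target)  # ['hello', 'world', '<sep>', 'bonjour', 'le', 'monde']
```

Because the translation tokens are conditioned on the already-emitted transcript tokens, the decoder effectively behaves as a bilingual language model, as the talk describes.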

Training Strategies – progressive multi‑task learning (XSTNet, InterSpeech 2021) that pre‑trains on large translation corpora and fine‑tunes jointly on ASR, MT, and ST tasks; data augmentation via pseudo‑labeling (forward translation) to generate additional ST training data.
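The forward-translation augmentation step can be sketched as running an MT model over an ASR corpus to mint extra (audio, target-text) pairs. `translate()` below is a hypothetical stand-in for a trained MT system:

```python
def translate(text):
    """Hypothetical MT model used to pseudo-label ASR transcripts."""
    table = {"good morning": "guten Morgen", "thank you": "danke"}
    return table.get(text)  # None when the model cannot translate

def make_pseudo_st_data(asr_corpus):
    """asr_corpus: list of (audio_id, transcript) pairs.
    Returns synthetic (audio_id, target_text) pairs for ST training."""
    pseudo = []
    for audio_id, transcript in asr_corpus:
        target = translate(transcript)
        if target is not None:  # drop pairs the MT model cannot handle
            pseudo.append((audio_id, target))
    return pseudo

corpus = [("utt1", "good morning"), ("utt2", "thank you"), ("utt3", "???")]
print(make_pseudo_st_data(corpus))
```

This converts abundant ASR data into scarce ST training triples, which is exactly the data-scarcity gap the talk identifies for end-to-end models.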

These techniques were evaluated on the IWSLT 2021 speech translation task, achieving a BLEU score of 31.3, surpassing the baseline by about seven points. The system combines a three‑in‑one model (ASR + MT + ST), model distillation, and ensemble methods.

Q&A Highlights

Hot‑word intervention for end‑to‑end ST may borrow techniques from ASR hot‑word handling and code‑switching.

While end‑to‑end systems now slightly outperform cascade systems on benchmarks, cascade remains dominant in production due to data scarcity for end‑to‑end training.

The rapid growth of ST is driven by the explosion of video content, 5G connectivity, and increased compute resources, motivating multimodal research.

Tags: AI, machine translation, end-to-end, cascade model, ByteDance, speech translation, training strategies
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
