Neural Machine Translation: Seq2Seq, Beam Search, BLEU, Attention Mechanisms, and GNMT Improvements
This article explains key concepts of neural machine translation, covering Seq2Seq encoder‑decoder models, beam search strategies, BLEU evaluation, various attention mechanisms, and the enhancements introduced in Google's Neural Machine Translation system to improve speed, OOV handling, and translation quality.
Machine translation is the process of using computers to translate source language sentences into semantically equivalent target language sentences, an important direction in natural language processing (NLP). Machine translation methods can be divided into three categories: rule‑based, statistical, and neural network‑based.
First, some basic concepts in machine translation are introduced.
Seq2Seq model: Most natural language generation tasks are built on the Seq2Seq (encoder-decoder) architecture, which uses two RNNs: an encoder that compresses the input sequence into a fixed-length semantic vector, and a decoder that generates the target sequence from that vector.
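As a rough illustration of this pipeline (a minimal sketch with toy, untrained weights, not a real model), the encoder folds the whole input into one fixed-length vector and the decoder emits tokens conditioned only on that vector:

```python
import math

HIDDEN_DIM = 4   # size of the fixed-length semantic vector (toy choice)

def encode(tokens):
    """Fold the input token ids into a single fixed-length context vector.

    Stands in for the encoder RNN; the update rule and constants are
    arbitrary toy values, not learned parameters.
    """
    h = [0.0] * HIDDEN_DIM
    for tok in tokens:
        h = [math.tanh(0.5 * h[i] + 0.1 * tok + 0.01 * i)
             for i in range(HIDDEN_DIM)]
    return h  # the fixed-length vector C

def decode(context, vocab_size=10, max_len=5, eos=0):
    """Greedily emit tokens from the fixed context vector alone.

    Stands in for the decoder RNN; the readout is a toy deterministic
    function of the hidden state, not a trained softmax layer.
    """
    h = list(context)
    out = []
    for _ in range(max_len):
        tok = int(abs(sum(h)) * 100) % vocab_size  # toy "argmax readout"
        if tok == eos:
            break
        out.append(tok)
        # feed the emitted token back into the state, RNN-style
        h = [math.tanh(v + 0.1 * tok) for v in h]
    return out
```

The point of the sketch is the information flow: everything the decoder knows about the source sentence must pass through the single vector returned by `encode`, which is exactly the bottleneck attention later removes.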
Beam search strategy: During inference, three generation strategies exist: greedy search, beam search, and exhaustive search. Beam search keeps the k most probable partial paths at each time step; with k = 1 it reduces to greedy search, and with k equal to the vocabulary size it becomes exhaustive search.
Beam search is a pruning strategy that does not guarantee a global optimum but often yields better results than greedy while being far more efficient than exhaustive search.
Greedy search stops when a maximum length is reached or an end-of-sentence (EOS) token is generated. In beam search, different paths may emit EOS at different times; such paths are marked as completed, and decoding stops under criteria such as a maximum number of steps or a limit on the number of completed paths.
Because completed beam-search paths have different lengths, their scores (sums of token log-probabilities) are normalized by path length before the highest-scoring path is selected.
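The procedure above can be sketched as follows. This is an illustrative implementation over a toy next-token distribution (the `next_logprobs` callback stands in for a decoder network; the pruning, EOS handling, and length normalization follow the description above):

```python
import math

def beam_search(next_logprobs, beam_size, eos, max_len):
    """Toy beam search.

    `next_logprobs(prefix)` returns {token: log_prob} for the next step,
    standing in for a decoder network's output distribution.
    """
    beams = [([], 0.0)]   # (token prefix, summed log-probability)
    completed = []        # paths that emitted the EOS token
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_logprobs(tuple(prefix)).items():
                if tok == eos:
                    completed.append((prefix, score + lp))
                else:
                    candidates.append((prefix + [tok], score + lp))
        if not candidates:
            beams = []
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]   # prune to the top-k partial paths
    completed.extend(beams)              # force-stop any unfinished paths
    # length-normalize so longer paths are not unfairly penalized
    best = max(completed, key=lambda c: c[1] / max(len(c[0]), 1))
    return best[0]
```

With `beam_size=1` this degenerates to greedy search, since only the single best candidate survives each pruning step.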
BLEU metric: In machine translation, BLEU is commonly used to evaluate model performance by measuring n‑gram precision against reference translations, with a brevity penalty (BP) to penalize overly short outputs.
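A simplified sentence-level version of the metric can be written directly from that definition: clipped n-gram precisions combined by a geometric mean, multiplied by the brevity penalty (the original BLEU is corpus-level; this per-sentence sketch is for illustration only):

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (BP). Simplified from the corpus-level metric."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        if not cand:
            return 0.0
        # clip each n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in references:
            counts = Counter(tuple(ref[i:i + n])
                             for i in range(len(ref) - n + 1))
            for g, c in counts.items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        if clipped == 0:
            return 0.0
        log_prec += (1.0 / max_n) * math.log(clipped / sum(cand.values()))
    # brevity penalty: penalize candidates shorter than the closest reference
    c = len(candidate)
    ref_len = min((len(r) for r in references),
                  key=lambda L: (abs(L - c), L))
    bp = 1.0 if c >= ref_len else math.exp(1.0 - ref_len / c)
    return bp * math.exp(log_prec)
```

A candidate identical to its reference scores 1.0, while a short candidate with correct words is pulled down by the brevity penalty rather than rewarded for high precision.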
Bahdanau et al. (2015) introduced the RNNSearch model, bringing attention mechanisms from computer vision into NLP, addressing the fixed‑length vector bottleneck of encoder‑decoder models.
The fixed-length vector C is a bottleneck: it must compress all source information into a single vector regardless of sentence length, and it gives the decoder no way to weight different source words differently at each decoding step.
Various attention mechanisms differ in how they compute attention weights and vectors, including hard vs. soft attention, global vs. local attention, and dynamic vs. static attention.
Luong et al. (2015) proposed global and local attention; global attention considers all source positions, while local attention focuses on a small window around a predicted position.
Local‑p attention predicts the center position using a learned function, whereas local‑m uses the current decoding step; however, local attention offers limited computational savings for short sentences, so global attention is more commonly used.
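The global variant is also the simplest to write down. The sketch below uses the dot-product score from Luong et al. (one of several scoring functions they propose): score every encoder state against the current decoder state, softmax over all source positions, and take the weighted sum as the context vector:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def global_attention(decoder_state, encoder_states):
    """Luong-style global attention with a dot-product score:
    score(h_t, h_s) = h_t . h_s; the attention weights are a softmax
    over all source positions, and the context vector is the
    weighted sum of encoder states.
    """
    scores = [sum(a * b for a, b in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context
```

Local attention would differ only in restricting `encoder_states` to a window around the predicted (local-p) or current (local-m) position before the softmax.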
Static attention, used in reading‑comprehension tasks, computes attention only within the encoder, unlike dynamic attention which aligns encoder and decoder at each decoding step.
Various innovations exist for computing attention scores, including self‑attention, multi‑head attention, and key‑value attention as used in Transformers.
GNMT: Google’s Neural Machine Translation system (2016) addressed NMT weaknesses such as slow training, OOV handling, and missing translations by using low‑precision 8‑bit arithmetic on TPUs, word‑piece models, and length‑normalized beam search with coverage penalties.
GNMT employs a bidirectional LSTM encoder, stacked LSTM layers with residual connections, and both model and data parallelism (Downpour SGD with Adam/SGD) to accelerate training across multiple GPUs.
Word‑piece models split words to handle OOVs, and the training objective incorporates GLEU for better translation quality.
Beam search in GNMT is further refined with length normalization (lp) and coverage penalty (cp) to discourage missing translations.
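The GNMT rescoring formula from Wu et al. (2016) combines both terms: s(Y, X) = log P(Y|X) / lp(Y) + cp(X; Y), with lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha and cp(X; Y) = beta * sum_i log(min(sum_j p_ij, 1.0)), where p_ij is the attention weight on source word i at target step j. A direct transcription:

```python
import math

def gnmt_score(logprob, length, attention, alpha=0.6, beta=0.2):
    """GNMT beam-search rescoring (Wu et al., 2016):

        s(Y, X) = log P(Y|X) / lp(Y) + cp(X; Y)
        lp(Y)   = (5 + |Y|)^alpha / (5 + 1)^alpha       # length normalization
        cp(X;Y) = beta * sum_i log(min(sum_j p_ij, 1))  # coverage penalty

    `attention[j][i]` is the attention weight on source word i at
    target step j; alpha and beta are the paper's tuning ranges' defaults.
    """
    lp = ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)
    n_src = len(attention[0])
    cp = beta * sum(
        math.log(min(sum(step[i] for step in attention), 1.0))
        for i in range(n_src)
    )
    return logprob / lp + cp
```

A hypothesis whose attention never covers some source word accumulates a large negative coverage penalty, which is exactly how missing translations are discouraged; full coverage makes cp zero and leaves only length-normalized likelihood.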
[1] D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
[2] M.-T. Luong, H. Pham, and C. D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.