Understanding Transformer Architecture for Chinese‑English Translation: A Practical Guide
This article provides a comprehensive, hands‑on explanation of the Transformer model using a Chinese‑to‑English translation example. It covers the macro architecture, input‑output processing, tokenization, embeddings, batch handling, padding masks, positional encoding, parallel computation, teacher forcing, self‑attention, multi‑head attention, and the forward/backward passes.
Macro view of Transformer
The Transformer consists of an encoder (left) and a decoder (right). The encoder extracts features from the source sentence, while the decoder extracts features from the partially generated target sentence and combines them with encoder outputs to predict the next token. GPT can be seen as the decoder part of a Transformer.
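This encoder‑decoder wiring can be sketched with PyTorch's built‑in nn.Transformer module (a minimal illustration with arbitrary sizes, not the article's own code):

```python
import torch
import torch.nn as nn

# A small Transformer: 512-dim model, 8 heads, 2-layer encoder and decoder stacks
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2)

# nn.Transformer defaults to (seq_len, batch, d_model) layout
src = torch.rand(10, 32, 512)  # source: 10 tokens, batch of 32
tgt = torch.rand(20, 32, 512)  # target: 20 tokens, batch of 32

out = model(src, tgt)          # decoder output: one vector per target position
print(out.shape)               # torch.Size([20, 32, 512])
```

The encoder consumes `src` once; the decoder consumes `tgt` together with the encoder's output, which is exactly the division of labor described above.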
Input and Output
For translation, the model receives two inputs: the source token sequence (e.g., Chinese) and the target token sequence (English) prefixed with a <bos> token. The expected output is the target sentence ending with an <eos> token.
Loop 1
Encoder input: 我 爱 00700
Decoder input: <bos>
Output: I
Loop 2
Encoder input: 我 爱 00700
Decoder input: <bos> I
Output: love
Loop 3
Encoder input: 我 爱 00700
Decoder input: <bos> I love
Output: 00700
Loop 4
Encoder input: 我 爱 00700
Decoder input: <bos> I love 00700
Output: <eos>
// <eos> emitted, translation complete

The model actually predicts a probability distribution over the entire vocabulary at each step, not a single word.
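The loop above can be sketched as greedy decoding: pick the most likely token at each step and feed the growing prefix back in until `<eos>` appears. This is a hypothetical sketch (the stub model and index values are made up for illustration):

```python
import torch

def greedy_decode(model, src, bos_idx, eos_idx, max_len=50):
    """Repeatedly feed the growing target prefix back in until <eos> appears."""
    ys = torch.tensor([[bos_idx]])              # decoder input starts as <bos>
    for _ in range(max_len):
        logits = model(src, ys)                 # (1, cur_len, vocab_size)
        next_tok = logits[:, -1].argmax(dim=-1) # most likely next token
        ys = torch.cat([ys, next_tok.unsqueeze(0)], dim=1)
        if next_tok.item() == eos_idx:
            break
    return ys

class StubModel:
    """Toy stand-in that always scores token 3 highest (pretend 3 == <eos>)."""
    def __call__(self, src, ys):
        logits = torch.zeros(1, ys.size(1), 5)
        logits[..., 3] = 1.0
        return logits

out = greedy_decode(StubModel(), src=None, bos_idx=1, eos_idx=3)
print(out.tolist())  # [[1, 3]]
```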
Token & Vocabulary
Tokens are the basic units the model works with. A token may be a whole word, a sub‑word, or even a character. Sub‑word tokenization (e.g., BPE) reduces vocabulary size dramatically, enabling efficient training on large corpora.
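The core of BPE is repeatedly merging the most frequent adjacent symbol pair. A toy sketch of one merge step (a deliberate simplification of real BPE implementations):

```python
from collections import Counter

def bpe_merge_step(corpus):
    """corpus: list of symbol lists. Merge the most frequent adjacent pair once."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return corpus, None
    best = max(pairs, key=pairs.get)            # most frequent adjacent pair
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1]); i += 2   # apply the merge
            else:
                out.append(word[i]); i += 1
        merged.append(out)
    return merged, best

words = [list("play"), list("plays"), list("playing")]
corpus, pair = bpe_merge_step(words)
print(pair)       # ('p', 'l') -- the most frequent adjacent pair here
print(corpus[0])  # ['pl', 'a', 'y']
```

Running this step to a target vocabulary size is how sub‑words like those below emerge.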
Whitespace tokenization, vocabulary size: 6
[play, player, playing, plays, replay, replaying]
# Final BPE vocabulary (target size = 4)
# Ġ marks a token that only appears at the start of a word; when reconstructing
# the sentence you need to know whether two tokens join directly or need a space
[Ġplay, re, ing, er]

Embedding
After tokenization, each token index is mapped to a dense vector via an embedding matrix of shape (vocab_size, embedding_dim) (e.g., 32000 × 512). The embedding vectors capture semantic information and are learned during training.
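A self‑contained version of this lookup using nn.Embedding (the sizes are illustrative, not the article's configuration):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 512
emb = nn.Embedding(vocab_size, d_model)        # weight shape: (vocab_size, d_model)
nn.init.normal_(emb.weight, mean=0, std=0.02)  # same initialization as in the text

x = torch.tensor([[5, 7, 7, 0]])               # a batch of token indices
vectors = emb(x)                               # lookup result: (1, 4, 512)
print(vectors.shape)
# Identical indices map to identical vectors:
print(torch.equal(vectors[0, 1], vectors[0, 2]))  # True
```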
# Initialize the embedding matrix
nn.init.normal_(self.weight, mean=0, std=0.02)
# Basic embedding lookup
embeddings = self.weight[x]

Batch Processing & Padding Mask
Training processes multiple sentences in parallel (batch). Shorter sentences are padded with a special token; a padding mask tells the model which positions are padding. A causal mask prevents the decoder from attending to future tokens.
# Build masks for the source and target sequences
src_padding_mask = (src == self.pad_idx)
trg_padding_mask = (trg == self.pad_idx)
# Causal mask: position i may only attend to positions <= i
seq_len = trg.size(1)
trg_mask = torch.triu(torch.ones((seq_len, seq_len), device=src.device), diagonal=1).bool()
# Apply the mask to the attention scores
if mask is not None:
    scores = scores.masked_fill(mask, -1e9)

Positional Encoding
Since the Transformer has no recurrence, positional encodings are added to embeddings to inject order information. Simple integer encodings (i) can be used for illustration, while the original paper uses sinusoidal functions.
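The sinusoidal encoding from the original paper can be sketched as follows (the standard formulation, not code from this article):

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same argument)."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))         # (d_model/2,)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=50, d_model=512)
print(pe.shape)         # torch.Size([50, 512])
print(pe[0, 0].item())  # sin(0) = 0.0
```

Unlike the integer encoding below, these values stay bounded in [-1, 1] regardless of sequence length.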
猫 (cat)  0
吃 (eats) 1
鱼 (fish) 2

Parallel Computation & Teacher Forcing
During training, the decoder receives the whole target sequence shifted right (teacher forcing), allowing all time steps to be computed in parallel. At inference, tokens are generated one by one because the ground‑truth future tokens are unavailable.
# Teacher forcing: shift the target right by one position
trg_input = trg[:, :-1]   # <bos> + ground-truth tokens except the last
trg_output = trg[:, 1:]   # ground-truth tokens without <bos>
output = model(src, trg_input)

Self‑Attention
Self‑attention computes attention weights by multiplying queries (Q) and keys (K) derived from the same input, scaling, applying softmax, and weighting values (V). This captures relationships between all token pairs.
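The computation just described, softmax(QK^T / sqrt(d_k)) V, can be written as a small standalone function (shapes here are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, -1e9)  # blocked positions get ~0 weight
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return torch.matmul(weights, V), weights

x = torch.rand(1, 4, 8)      # self-attention: Q, K, V all come from the same input
out, w = attention(x, x, x)
print(out.shape, w.shape)    # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])
```

The 4x4 weight matrix is exactly the "relationship between all token pairs": entry (i, j) is how much token i attends to token j.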
Multi‑Head Attention
Multiple attention heads allow the model to attend to information from different representation subspaces. Each head processes a split of the embedding dimension (e.g., 8 heads × 64‑dim each for a 512‑dim model).
# Multi-head attention forward
Q = self.q_linear(query)
K = self.k_linear(key)
V = self.v_linear(value)
# Split the model dimension into num_heads heads of size d_k each
Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
# Scaled dot-product attention within each head
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
    scores = scores.masked_fill(mask, -1e9)
attn = F.softmax(scores, dim=-1)
attn = self.dropout(attn)
out = torch.matmul(attn, V)
# Concatenate the heads and project back to d_model
out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
out = self.out_linear(out)
return out

Forward Pass: Add & Norm, Feed‑Forward
Each sub‑layer is wrapped with a residual connection (Add) and layer normalization (Norm). The feed‑forward network consists of two linear layers with a ReLU activation in between.
# Pre-norm residual connection around self-attention
self.norm1 = nn.LayerNorm(d_model)
x2 = self.norm1(x)                 # normalize first (pre-norm variant)
x = x + self.dropout1(self.self_attn(x2, x2, x2, tgt_mask))
# Feed-forward network: two linear layers with a ReLU in between
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

Back‑Propagation
After the forward pass, the loss is computed (e.g., cross‑entropy); loss.backward() triggers automatic differentiation in PyTorch to compute gradients for every parameter, and an optimizer step then updates those parameters via gradient descent.
# Forward
output = model(src, trg_input)
# Compute loss (predictions and targets flattened over batch and time steps)
loss = criterion(output_flat, trg_output_flat)
# Backward: compute gradients, then update parameters
optimizer.zero_grad()
loss.backward()
optimizer.step()

The article concludes with references to several Chinese tutorials and papers for deeper study.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.