Understanding Transformer Architecture for Chinese‑English Translation: A Practical Guide
This article provides a comprehensive, hands‑on explanation of the Transformer model using a Chinese‑to‑English translation example. It covers the macro architecture, input‑output processing, tokenization, embeddings, batch handling, padding masks, positional encoding, parallel computation, teacher forcing, self‑attention, multi‑head attention, and the forward/backward passes.
Macro view of Transformer
The Transformer consists of an encoder (left) and a decoder (right). The encoder extracts features from the source sentence, while the decoder extracts features from the partially generated target sentence and combines them with encoder outputs to predict the next token. GPT can be seen as the decoder part of a Transformer.
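This encoder‑decoder wiring can be sketched with PyTorch's built‑in nn.Transformer module (a minimal illustration with arbitrary sizes, not the article's own code):

```python
import torch
import torch.nn as nn

# A small Transformer: 512-dim model, 8 heads, 2-layer encoder and decoder stacks
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2)

# nn.Transformer defaults to (seq_len, batch, d_model) layout
src = torch.rand(10, 32, 512)  # source: 10 tokens, batch of 32
tgt = torch.rand(20, 32, 512)  # target: 20 tokens, batch of 32

out = model(src, tgt)          # decoder output: one vector per target position
print(out.shape)               # torch.Size([20, 32, 512])
```

The encoder consumes `src` once; the decoder consumes `tgt` together with the encoder's output, which is exactly the division of labor described above.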
Input and Output
For translation, the model receives two inputs: the source token sequence (e.g., Chinese) and the target token sequence (English) prefixed with a <bos> token. The expected output is the target sentence ending with an <eos> token.
Loop 1
Encoder input: 我 爱 00700
Decoder input: <bos>
Output: I
Loop 2
Encoder input: 我 爱 00700
Decoder input: <bos> I
Output: love
Loop 3
Encoder input: 我 爱 00700
Decoder input: <bos> I love
Output: 00700
Loop 4
Encoder input: 我 爱 00700
Decoder input: <bos> I love 00700
Output: <eos>
// <eos> emitted, translation complete

The model actually predicts a probability distribution over the entire vocabulary at each step, not a single word.
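The loop above can be sketched as greedy decoding: pick the most likely token at each step and feed the growing prefix back in until `<eos>` appears. This is a hypothetical sketch (the stub model and index values are made up for illustration):

```python
import torch

def greedy_decode(model, src, bos_idx, eos_idx, max_len=50):
    """Repeatedly feed the growing target prefix back in until <eos> appears."""
    ys = torch.tensor([[bos_idx]])              # decoder input starts as <bos>
    for _ in range(max_len):
        logits = model(src, ys)                 # (1, cur_len, vocab_size)
        next_tok = logits[:, -1].argmax(dim=-1) # most likely next token
        ys = torch.cat([ys, next_tok.unsqueeze(0)], dim=1)
        if next_tok.item() == eos_idx:
            break
    return ys

class StubModel:
    """Toy stand-in that always scores token 3 highest (pretend 3 == <eos>)."""
    def __call__(self, src, ys):
        logits = torch.zeros(1, ys.size(1), 5)
        logits[..., 3] = 1.0
        return logits

out = greedy_decode(StubModel(), src=None, bos_idx=1, eos_idx=3)
print(out.tolist())  # [[1, 3]]
```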
Token & Vocabulary
Tokens are the basic units the model works with. A token may be a whole word, a sub‑word, or even a character. Sub‑word tokenization (e.g., BPE) reduces vocabulary size dramatically, enabling efficient training on large corpora.
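The core of BPE is repeatedly merging the most frequent adjacent symbol pair. A toy sketch of one merge step (a deliberate simplification of real BPE implementations):

```python
from collections import Counter

def bpe_merge_step(corpus):
    """corpus: list of symbol lists. Merge the most frequent adjacent pair once."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return corpus, None
    best = max(pairs, key=pairs.get)            # most frequent adjacent pair
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1]); i += 2   # apply the merge
            else:
                out.append(word[i]); i += 1
        merged.append(out)
    return merged, best

words = [list("play"), list("plays"), list("playing")]
corpus, pair = bpe_merge_step(words)
print(pair)       # ('p', 'l') -- the most frequent adjacent pair here
print(corpus[0])  # ['pl', 'a', 'y']
```

Running this step to a target vocabulary size is how sub‑words like those below emerge.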
Whitespace tokenization, vocabulary size: 6
[play, player, playing, plays, replay, replaying]
# Final BPE vocabulary (target size = 4)
# Ġ marks a token that only appears at the start of a word; when reconstructing
# the sentence you need to know whether two tokens join directly or need a space
[Ġplay, re, ing, er]

Embedding
After tokenization, each token index is mapped to a dense vector via an embedding matrix of shape (vocab_size, embedding_dim) (e.g., 32000 × 512). The embedding vectors capture semantic information and are learned during training.
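A self‑contained version of this lookup using nn.Embedding (the sizes are illustrative, not the article's configuration):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 512
emb = nn.Embedding(vocab_size, d_model)        # weight shape: (vocab_size, d_model)
nn.init.normal_(emb.weight, mean=0, std=0.02)  # same initialization as in the text

x = torch.tensor([[5, 7, 7, 0]])               # a batch of token indices
vectors = emb(x)                               # lookup result: (1, 4, 512)
print(vectors.shape)
# Identical indices map to identical vectors:
print(torch.equal(vectors[0, 1], vectors[0, 2]))  # True
```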
# Initialize the embedding matrix
nn.init.normal_(self.weight, mean=0, std=0.02)
# Basic embedding lookup
embeddings = self.weight[x]

Batch Processing & Padding Mask
Training processes multiple sentences in parallel (batch). Shorter sentences are padded with a special token; a padding mask tells the model which positions are padding. A causal mask prevents the decoder from attending to future tokens.
# Build masks for the source and target sequences
src_padding_mask = (src == self.pad_idx)
trg_padding_mask = (trg == self.pad_idx)
# Causal mask: position i may only attend to positions <= i
seq_len = trg.size(1)
trg_mask = torch.triu(torch.ones((seq_len, seq_len), device=src.device), diagonal=1).bool()
# Apply the mask to the attention scores
if mask is not None:
    scores = scores.masked_fill(mask, -1e9)

Positional Encoding
Since the Transformer has no recurrence, positional encodings are added to embeddings to inject order information. Simple integer encodings (i) can be used for illustration, while the original paper uses sinusoidal functions.
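The sinusoidal encoding from the original paper can be sketched as follows (the standard formulation, not code from this article):

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same argument)."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))         # (d_model/2,)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=50, d_model=512)
print(pe.shape)         # torch.Size([50, 512])
print(pe[0, 0].item())  # sin(0) = 0.0
```

Unlike the integer encoding below, these values stay bounded in [-1, 1] regardless of sequence length.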
猫 (cat)  0
吃 (eats) 1
鱼 (fish) 2

Parallel Computation & Teacher Forcing
During training, the decoder receives the whole target sequence shifted right (teacher forcing), allowing all time steps to be computed in parallel. At inference, tokens are generated one by one because the ground‑truth future tokens are unavailable.
# Teacher forcing: shift the target right by one position
trg_input = trg[:, :-1]   # <bos> + ground-truth tokens except the last
trg_output = trg[:, 1:]   # ground-truth tokens without <bos>
output = model(src, trg_input)

Self‑Attention
Self‑attention computes attention weights by multiplying queries (Q) and keys (K) derived from the same input, scaling, applying softmax, and weighting values (V). This captures relationships between all token pairs.
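The computation just described, softmax(QK^T / sqrt(d_k)) V, can be written as a small standalone function (shapes here are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, -1e9)  # blocked positions get ~0 weight
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return torch.matmul(weights, V), weights

x = torch.rand(1, 4, 8)      # self-attention: Q, K, V all come from the same input
out, w = attention(x, x, x)
print(out.shape, w.shape)    # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])
```

The 4x4 weight matrix is exactly the "relationship between all token pairs": entry (i, j) is how much token i attends to token j.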
Multi‑Head Attention
Multiple attention heads allow the model to attend to information from different representation subspaces. Each head processes a split of the embedding dimension (e.g., 8 heads × 64‑dim each for a 512‑dim model).
# Multi-head attention forward
Q = self.q_linear(query)
K = self.k_linear(key)
V = self.v_linear(value)
# Split the model dimension into num_heads heads of size d_k each
Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
# Scaled dot-product attention within each head
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
    scores = scores.masked_fill(mask, -1e9)
attn = F.softmax(scores, dim=-1)
attn = self.dropout(attn)
out = torch.matmul(attn, V)
# Concatenate the heads and project back to d_model
out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
out = self.out_linear(out)
return out

Forward Pass: Add & Norm, Feed‑Forward
Each sub‑layer is wrapped with a residual connection (Add) and layer normalization (Norm). The feed‑forward network consists of two linear layers with a ReLU activation in between.
# Pre-norm residual connection around self-attention
self.norm1 = nn.LayerNorm(d_model)
x2 = self.norm1(x)                 # normalize first (pre-norm variant)
x = x + self.dropout1(self.self_attn(x2, x2, x2, tgt_mask))
# Feed-forward network: two linear layers with a ReLU in between
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

Back‑Propagation
After the forward pass, the loss is computed (e.g., cross‑entropy); loss.backward() triggers automatic differentiation in PyTorch to compute gradients for every parameter, and an optimizer step then updates those parameters via gradient descent.
# Forward
output = model(src, trg_input)
# Compute loss (predictions and targets flattened over batch and time steps)
loss = criterion(output_flat, trg_output_flat)
# Backward: compute gradients, then update parameters
optimizer.zero_grad()
loss.backward()
optimizer.step()

The article concludes with references to several Chinese tutorials and papers for deeper study.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.