
Illustrated Transformer: Comprehensive Explanation and Code Implementation

This article provides a step‑by‑step illustrated guide to the Transformer architecture, covering its macro structure, detailed self‑attention mechanisms, multi‑head attention, positional encoding, residual connections, decoder operation, training process, loss functions, and includes complete PyTorch and custom Python code examples.

Sohu Tech Products

Preface

This article is a translation of the well‑known illustrated Transformer guide. It walks through the model from input to output, adding original explanations and simple code for the Self‑Attention and multi‑head attention matrix operations.

1. Macro Understanding of Transformer

Viewed from the outside, the Transformer is a black box that receives a source sentence and outputs its translation. Inside, the architecture consists of an Encoder stack on the left and a Decoder stack on the right, each typically with six identical layers.

Each Encoder layer contains two sub‑layers: a Self‑Attention layer and a Feed‑Forward Neural Network (FFNN). The Decoder layers have an additional Encoder‑Decoder Attention sub‑layer.

2. Detailed Understanding of Transformer

2.1 Transformer Input

Words are first converted to embeddings (512 dimensions in the original paper; the example here uses 4‑dimensional vectors for simplicity). Sentences are padded or truncated to a fixed length.
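As a minimal sketch of this input step (toy vocabulary and sizes are assumptions for illustration, not the article's actual data):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy vocabulary of 10 words, 4-dimensional embeddings
# (matching the article's simplified example; real models use 512).
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# A "sentence" of 3 word indices, padded with index 0 to a fixed length of 5.
sentence = torch.tensor([[2, 5, 7, 0, 0]])  # shape: (batch=1, seq_len=5)
vectors = embedding(sentence)
print(vectors.shape)  # torch.Size([1, 5, 4])
```

The embedding table is a learned lookup: each index selects one row, so the output adds an embedding dimension to the input shape.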

2.2 Encoder

The Encoder receives a list of word vectors, processes them through Self‑Attention, then through the FFNN, and passes the result to the next Encoder layer. Each position flows along its own computational path; the paths interact in the Self‑Attention layer but are independent in the FFNN, which can therefore run in parallel across positions.

3. Self‑Attention Overview

Self‑Attention allows each word to attend to all other words in the sentence, enabling the model to capture dependencies such as pronoun references.

4. Self‑Attention Details

4.1 Compute Query, Key, Value Vectors

For each input word vector, three new vectors are created by multiplying with learned weight matrices W^Q, W^K, and W^V. These vectors are typically lower‑dimensional than the original embedding.

4.2 Compute Attention Scores

For each word, attention scores are the dot products between its Query vector and the Key vectors of all words. The scores are divided by the square root of the Key dimension, passed through Softmax to obtain weights, and the output for that word is the weighted sum of the Value vectors.

5. Matrix Computation of Self‑Attention

All words are stacked into matrix X, then multiplied by weight matrices to obtain Q, K, V matrices. The attention computation is performed with matrix multiplications, enabling parallel computation for all positions.
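The matrix form above can be sketched in a few lines; sizes here are toy values chosen for illustration, and the random X stands in for stacked word embeddings:

```python
import torch

torch.manual_seed(0)

d_k = 4                      # dimension of each Query/Key vector
X = torch.randn(3, 4)        # 3 words stacked as rows, 4-dim embeddings
W_q = torch.randn(4, d_k)    # learned projection matrices (random here)
W_k = torch.randn(4, d_k)
W_v = torch.randn(4, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# softmax(Q K^T / sqrt(d_k)) V, computed for all positions at once
scores = Q @ K.T / d_k ** 0.5
weights = torch.softmax(scores, dim=-1)
Z = weights @ V

print(weights.sum(dim=-1))   # each row of attention weights sums to 1
print(Z.shape)               # torch.Size([3, 4])
```

A single pair of matrix multiplications replaces the per‑word loop, which is what makes the computation parallel across positions.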

6. Multi‑Head Attention

Multiple attention heads (e.g., 8) are created by projecting Q, K, V into separate sub‑spaces, computing attention in each head, and concatenating the results before a final linear projection.

7. Code Implementation of Attention

7.1 PyTorch Implementation

torch.nn.MultiheadAttention(embed_dim, num_heads, dropout=0.0, bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None)

Key arguments include embed_dim (the model's total embedding dimension, split evenly across heads), num_heads (which must divide embed_dim), and optional kdim/vdim when keys and values have different dimensions; masks can be supplied at call time.
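A minimal usage sketch of the built‑in module (shapes are illustrative; by default it expects sequence‑first input):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8)

# Default layout is (seq_len, batch, embed_dim).
x = torch.randn(10, 2, 512)  # 10 tokens, batch of 2

# Self-attention: query, key, and value are all the same tensor.
out, attn_weights = mha(x, x, x)
print(out.shape)           # torch.Size([10, 2, 512])
print(attn_weights.shape)  # torch.Size([2, 10, 10]), averaged over heads
```

Passing the same tensor as query, key, and value gives self‑attention; passing the Decoder state as query and Encoder outputs as key/value gives Encoder‑Decoder attention.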

7.2 Manual Implementation

import torch
import torch.nn as nn

class MultiheadAttention(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout):
        super(MultiheadAttention, self).__init__()
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        # hid_dim must split evenly across the heads
        assert hid_dim % n_heads == 0
        self.w_q = nn.Linear(hid_dim, hid_dim)
        self.w_k = nn.Linear(hid_dim, hid_dim)
        self.w_v = nn.Linear(hid_dim, hid_dim)
        self.fc = nn.Linear(hid_dim, hid_dim)
        self.do = nn.Dropout(dropout)
        # scores are scaled by sqrt(d_k), the per-head dimension
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim // n_heads]))

    def forward(self, query, key, value, mask=None):
        bsz = query.shape[0]
        Q = self.w_q(query)
        K = self.w_k(key)
        V = self.w_v(value)
        # Split into heads: (batch, n_heads, seq_len, head_dim)
        Q = Q.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        K = K.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        V = V.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        # Scaled dot-product scores: (batch, n_heads, seq_len, seq_len)
        attention = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        if mask is not None:
            # Push masked positions to a large negative value before Softmax
            attention = attention.masked_fill(mask == 0, -1e10)
        attention = self.do(torch.softmax(attention, dim=-1))
        x = torch.matmul(attention, V)
        # Merge heads back: (batch, seq_len, hid_dim)
        x = x.permute(0, 2, 1, 3).contiguous()
        x = x.view(bsz, -1, self.n_heads * (self.hid_dim // self.n_heads))
        x = self.fc(x)
        return x

7.3 Key Code Snippet

# Split K, Q, V into multiple heads
Q = Q.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
K = K.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
V = V.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)

8. Positional Encoding

Since the model has no recurrence, sinusoidal positional encodings are added to word embeddings to provide order information. The encoding uses sine for even dimensions and cosine for odd dimensions, allowing extrapolation to longer sequences.
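The sine/cosine scheme described above can be sketched as follows (the function name and toy sizes are illustrative):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    # Frequencies decay geometrically from 1 down to 1/10000 across dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # torch.Size([50, 512])
# Position 0 gives sin(0)=0 on even dims and cos(0)=1 on odd dims.
print(pe[0, :4])  # tensor([0., 1., 0., 1.])
```

The encoding matrix is added element‑wise to the word embeddings before the first Encoder layer; because it is a fixed function of position, it can be computed for sequence lengths never seen in training.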

9. Residual Connections

Each sub‑layer (Self‑Attention and FFNN) is wrapped with a residual connection followed by layer normalization, facilitating gradient flow and stable training.
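A minimal sketch of this wrapping, assuming the post‑norm arrangement of the original paper (the class name and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Wraps any sub-layer as LayerNorm(x + Dropout(sublayer(x)))."""
    def __init__(self, size: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Residual branch keeps a direct path for gradients
        return self.norm(x + self.dropout(sublayer(x)))

block = SublayerConnection(size=512)
x = torch.randn(2, 10, 512)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
out = block(x, ffn)
print(out.shape)  # torch.Size([2, 10, 512])
```

Because the sub‑layer output is added back to its input, both must share the same dimension, which is why every sub‑layer in the Transformer preserves the model dimension.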

10. Decoder

The Decoder mirrors the Encoder but adds a masked Self‑Attention (preventing attention to future positions) and an Encoder‑Decoder Attention that attends to the Encoder outputs.
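The future‑masking step can be sketched with a lower‑triangular mask (a small toy example, not the article's code):

```python
import torch

seq_len = 5
# Lower-triangular matrix: position i may attend only to positions <= i.
mask = torch.tril(torch.ones(seq_len, seq_len))

# Where the mask is 0 (future positions), scores are pushed to -inf
# before Softmax so those positions receive zero attention weight.
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)
print(weights[0])  # the first position attends only to itself
```

The same mask shape is broadcast across batch and heads in practice, so each decoded position never sees tokens it has not yet produced.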

11. Final Linear and Softmax Layers

The Decoder output is projected to the vocabulary size via a linear layer, then a Softmax converts logits to probabilities for word selection.
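This final projection step can be sketched as follows (vocabulary size and shapes are toy assumptions):

```python
import torch
import torch.nn as nn

vocab_size = 10000
d_model = 512

generator = nn.Linear(d_model, vocab_size)  # projects to vocabulary logits

decoder_out = torch.randn(1, 7, d_model)    # batch of 1, 7 decoded positions
logits = generator(decoder_out)
probs = torch.softmax(logits, dim=-1)       # one distribution per position

print(probs.shape)                 # torch.Size([1, 7, 10000])
next_words = probs.argmax(dim=-1)  # greedy choice of the most likely word
print(next_words.shape)            # torch.Size([1, 7])
```

Taking the argmax is greedy decoding; beam search instead keeps several candidate continuations per step and picks the overall most probable sequence.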

12. Training Process

During training, the model’s output distributions are compared to ground‑truth tokens using a loss function (e.g., cross‑entropy). The network is optimized via back‑propagation to minimize this loss.

13. Loss Function

Cross‑entropy (or KL‑divergence) measures the difference between predicted and true probability distributions, guiding the model to produce accurate translations.
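A small worked example of the cross‑entropy loss on toy logits (the numbers are invented for illustration):

```python
import torch
import torch.nn as nn

# Toy vocabulary of 6 tokens; raw logits for 3 predicted positions.
logits = torch.tensor([[2.0, 0.1, 0.1, 0.1, 0.1, 0.1],
                       [0.1, 3.0, 0.1, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.1, 2.5, 0.1, 0.1]])
targets = torch.tensor([0, 1, 3])  # ground-truth token indices

# CrossEntropyLoss applies log-softmax internally, so raw logits go in.
loss = nn.CrossEntropyLoss()(logits, targets)
print(loss.item())  # small, since the largest logit matches each target
```

During training this loss is averaged over all target positions in the batch and minimized by back‑propagation, pushing each predicted distribution toward a spike on the correct word.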

Further Reading

Attention Is All You Need (https://arxiv.org/abs/1706.03762)

Transformer: A Novel Neural Network Architecture for Language Understanding (https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)

Tensor2Tensor announcement (https://ai.googleblog.com/2017/06/accelerating-deep-learning-research.html)

Łukasz Kaiser’s talk (https://www.youtube.com/watch?v=rBCqOTEfxvg)

Tensor2Tensor Jupyter notebook (https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb)

Tensor2Tensor GitHub repository (https://github.com/tensorflow/tensor2tensor)

Tags: deep learning, Transformer, NLP, PyTorch, Self-Attention, Multi-Head Attention
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
