
A Simple Introduction to the Transformer Model

This article provides a comprehensive, beginner-friendly explanation of the Transformer architecture, covering its encoder‑decoder structure, self‑attention, multi‑head attention, positional encoding, residual connections, decoding process, final linear and softmax layers, and training considerations, illustrated with numerous diagrams and code snippets.

Rare Earth Juejin Tech Community

Original article: "The Illustrated Transformer" by Jay Alammar, which received notable attention on Hacker News and Reddit and has been referenced in courses at Stanford, Harvard, MIT, Princeton, and CMU, as well as in an MIT deep-learning lecture.

The piece focuses on the Transformer model, an attention‑based architecture that speeds up training, enables parallel processing, and often outperforms earlier neural machine translation models.

Transformers were introduced in the paper "Attention Is All You Need"; implementations are available in TensorFlow's Tensor2Tensor and a PyTorch guide from Harvard NLP.

The article aims to explain these concepts in an accessible way for non‑experts, and the author also created a video guide titled "Transformer Tour".

Simple Overview

The Transformer can be seen as a black box that converts a sentence in one language to its translation.

It consists of an encoder stack and a decoder stack, connected by attention mechanisms.

Each encoder layer has two sub‑layers: a self‑attention layer and a feed‑forward neural network.

The self‑attention layer lets each word consider all other words in the input, while the feed‑forward network processes each position independently.
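The "processes each position independently" part can be made concrete with a minimal NumPy sketch of the position-wise feed-forward network (toy dimensions and random weights stand in for the learned parameters; the paper uses 512 and 2048):

```python
import numpy as np

d_model, d_ff = 4, 8   # toy sizes; the paper uses d_model=512, d_ff=2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    """Position-wise FFN: the same two-layer network (ReLU between the
    layers) is applied independently to every position's vector."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

h = feed_forward(rng.normal(size=(3, d_model)))  # three positions in, three out
```

Because the same weights are applied to every row, positions never interact here; all cross-position mixing happens in the self-attention layer.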

The decoder mirrors the encoder but adds an extra attention layer that attends to encoder outputs, similar to seq2seq attention.

Bringing Tensors Into the Picture

After understanding the components, the article shows how vectors (embeddings) flow through the model.

Each word is embedded into a 512‑dimensional vector.

Example sentence: "The animal didn't cross the street because it was too tired"

The embedding occurs in the lowest encoder layer; subsequent layers process the vectors produced by the previous layer.
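The embedding step is just a table lookup. A minimal sketch with a toy vocabulary and a 4-dimensional embedding table (random values standing in for learned ones; the real model uses 512 dimensions):

```python
import numpy as np

# Toy vocabulary and embedding table (d_model = 4 here instead of 512).
vocab = {"the": 0, "animal": 1, "street": 2, "it": 3}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

def embed(tokens):
    """Map each token to its d_model-dimensional embedding vector."""
    return np.stack([embedding_table[vocab[t]] for t in tokens])

x = embed(["the", "animal"])
print(x.shape)  # (2, 4): one 4-dim vector per token
```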

Self‑Attention (High‑Level Understanding)

Self‑attention allows the model to relate a word like "it" to its antecedent "animal" by attending to other positions in the sequence.

Readers can explore a Tensor2Tensor notebook to interactively visualize a Transformer.

Self‑Attention Details

Self‑attention computes Query (Q), Key (K), and Value (V) vectors for each input token by multiplying the embedding with learned weight matrices.

These vectors have a reduced dimension (e.g., 64) compared to the original 512‑dimensional embeddings.

Scores are obtained by taking the dot product of Q with K, dividing by √dₖ (here √64 = 8), and then passing the results through a softmax to produce attention weights.

The weighted values are summed to produce the output of the self‑attention layer.

Matrix Formulation of Self‑Attention

All steps can be expressed as matrix operations: compute Q, K, V matrices from the input matrix X using weight matrices WQ, WK, WV.

The final step combines the scaled dot‑product attention results into a single matrix.
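Folded into matrix form, the whole computation is one expression, Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V. A compact NumPy version (a sketch, not an optimized implementation):

```python
import numpy as np

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V as one matrix expression."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # row-wise softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
Z = attention(Q, K, V)   # one output row per input position
```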

Multi‑Head Attention

Multi‑head attention replicates the self‑attention mechanism with multiple independent sets of Q/K/V matrices (typically eight heads), allowing the model to attend to information from different representation sub‑spaces.

Outputs from all heads are concatenated and projected with an additional weight matrix WO to produce a single matrix for the subsequent feed‑forward layer.
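The concatenate-then-project step can be sketched as follows (two heads instead of eight, random weights standing in for learned ones):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head. Head outputs
    are concatenated along the feature axis and projected by W_o."""
    outs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        outs.append(softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V)
    return np.concatenate(outs, axis=-1) @ W_o

d_model, n_heads = 8, 2
d_k = d_model // n_heads
rng = np.random.default_rng(2)
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))
x = rng.normal(size=(3, d_model))
y = multi_head_attention(x, heads, W_o)  # back to (3, d_model)
```

Note how W_o brings the concatenated heads back to d_model columns, so the feed-forward layer sees one matrix of the expected width.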

Positional Encoding

Since the Transformer lacks recurrence, positional encodings are added to input embeddings to convey token order.

The encoding vectors follow sinusoidal patterns; an example with dimension 4 is shown.

Code for generating them can be found in Tensor2Tensor's get_timing_signal_1d() function.
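A minimal NumPy version of the sinusoidal scheme from the paper (this interleaves sin and cos per dimension pair; Tensor2Tensor's get_timing_signal_1d() instead concatenates a sin block and a cos block, which works equally well):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: even dimensions use sin, odd use cos,
    with wavelengths forming a geometric progression from 2*pi to 10000*2*pi."""
    pos = np.arange(seq_len)[:, None]              # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]           # dimension-pair index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(4, 4)   # the dimension-4 example from the text
```

These vectors are simply added to the input embeddings, so each position gets a unique, deterministic signature.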

Residual Connections and Layer Normalization

Each sub‑layer (self‑attention or feed‑forward) is wrapped with a residual connection followed by layer‑normalization.
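The "Add & Norm" wrapper computes LayerNorm(x + Sublayer(x)). A minimal sketch (learned gain and bias parameters omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance
    (the learned scale and shift parameters are omitted here)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer_block(x, sublayer):
    """Residual connection followed by layer norm: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(3).normal(size=(3, 8))
y = sublayer_block(x, lambda t: t * 0.5)   # stand-in for attention or FFN
```

The residual path lets gradients flow around each sub-layer, which helps when many layers are stacked.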

Decoder Side

The decoder processes its own inputs (embedded tokens plus positional encodings) through masked self-attention, encoder-decoder attention, and feed-forward sub-layers; in the encoder-decoder attention layer, the K and V matrices are derived from the encoder's output, while Q comes from the decoder layer below.

Masking ensures that each position can only attend to earlier positions, implemented by setting future positions to -inf before the softmax.
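The -inf trick can be sketched directly: masked entries become zero after the softmax, because exp(-inf) = 0 (toy random scores below):

```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(4).normal(size=(seq_len, seq_len))

# Upper-triangular mask: position i may not attend to positions j > i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])   # position 0 can only attend to itself: [1, 0, 0, 0]
```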

Final Linear and Softmax Layers

The decoder stack outputs a vector that is passed through a linear (fully‑connected) layer to produce logits over the vocabulary, followed by a softmax that converts logits into probabilities.
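A toy sketch of this last step, with a 6-word vocabulary and random weights standing in for the learned projection:

```python
import numpy as np

vocab_size, d_model = 6, 4
rng = np.random.default_rng(5)
W_proj = rng.normal(size=(d_model, vocab_size))  # final linear layer

decoder_out = rng.normal(size=(d_model,))        # one decoder output vector
logits = decoder_out @ W_proj                    # one raw score per word

probs = np.exp(logits - logits.max())            # softmax -> probabilities
probs /= probs.sum()

predicted = int(np.argmax(probs))  # greedy decoding picks the top token
```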

Training Overview

During training, the model’s forward pass is compared against ground‑truth token sequences using a loss function (e.g., cross‑entropy). Gradients are back‑propagated to update weights.

Example: translating the French phrase "merci" to "thanks" illustrates how the model learns to assign high probability to the correct token.
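A toy illustration of that loss, assuming a hypothetical 5-word vocabulary in which "thanks" sits at index 2: cross-entropy against a one-hot target reduces to the negative log-probability the model assigns to the correct token.

```python
import numpy as np

# Model output after softmax (made-up values for illustration);
# the correct token ("thanks") is index 2 in this toy vocabulary.
target = 2
probs = np.array([0.05, 0.1, 0.7, 0.1, 0.05])

# Cross-entropy with a one-hot target is just -log p(correct token).
loss = -np.log(probs[target])   # ~0.357; shrinks as p(correct) -> 1
```

Training pushes probability mass toward the correct token, driving this loss toward zero.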

Beam search (with beam size 2) can be used to keep multiple candidate translations during decoding.
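One step of beam search can be sketched as follows: expand every current hypothesis with every next token, then keep only the `beam_size` highest-scoring sequences by total log-probability (toy token IDs and probabilities):

```python
import numpy as np

def beam_search_step(beams, next_log_probs, beam_size=2):
    """One decoding step. beams: list of (sequence, score) pairs;
    next_log_probs: per-beam log-probabilities over the vocabulary.
    Returns the beam_size best expanded (sequence, score) pairs."""
    candidates = []
    for (seq, score), log_probs in zip(beams, next_log_probs):
        for tok, lp in enumerate(log_probs):
            candidates.append((seq + [tok], score + lp))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]

beams = [([0], 0.0)]                           # start with one hypothesis
step1 = np.log(np.array([[0.1, 0.6, 0.3]]))    # toy next-token distribution
beams = beam_search_step(beams, step1)
print([seq for seq, _ in beams])               # [[0, 1], [0, 2]]
```

With beam size 1 this degenerates to greedy decoding; larger beams keep more alternatives alive at extra compute cost.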

Further Reading

The article recommends reading the original "Attention Is All You Need" paper, related Transformer blog posts, and exploring the Tensor2Tensor repository and notebooks.

Additional noteworthy works include depthwise separable convolutions, "One Model To Learn Them All", discrete autoencoders, image Transformers, training tips, relative position representations, fast decoding with discrete latent variables, and the Adafactor optimizer.

Acknowledgements

Thanks are given to Illia Polosukhin, Jakob Uszkoreit, Llion Jones, Łukasz Kaiser, Niki Parmar, and Noam Shazeer for early feedback, and readers are invited to discuss on Twitter.

Tags: Deep Learning, Transformer, Neural Networks, Machine Translation, Self-Attention
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
