Attention Mechanism, Transformer Architecture, and BERT: An In-Depth Overview
This article provides a comprehensive overview of the attention mechanism, its mathematical foundations, the transformer model architecture—including encoder and decoder components—and the BERT pre‑training model, detailing their principles, implementations, and applications in natural language processing.
BERT is a pre‑trained language model proposed by Google that is built on the transformer architecture. It relies heavily on the attention mechanism, which allows models to focus on the most relevant parts of the input when processing data.
The attention mechanism, inspired by cognitive neuroscience, uses three vectors: query, key, and value. The query simulates voluntary attention, the key simulates involuntary attention, and their interaction determines the attention scores. These scores are computed by an attention scoring function (e.g., scaled dot‑product), followed by a softmax to obtain attention weights α, and finally a weighted sum of the value vectors produces the output.
Figure 1: Attention score calculation illustration
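The score → softmax → weighted-sum pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation; the function and variable names are our own:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v) -> output (n_q, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # attention scoring function
    alpha = softmax(scores, axis=-1)    # attention weights, rows sum to 1
    return alpha @ V, alpha             # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, alpha = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4): one output row per query
```

The division by √d keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.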
The transformer model, introduced by Vaswani et al., replaces recurrent and convolutional structures with a fully attention‑based encoder‑decoder architecture. It consists of stacked self‑attention layers and feed‑forward networks, enabling parallel computation and reducing the maximum path length between tokens.
In the encoder, input tokens first pass through an embedding layer (including positional encoding). Each token is projected into query, key, and value vectors via learned matrices WQ, WK, and WV. Multi‑Head Attention applies several independent attention heads, allowing the model to capture information from multiple representation subspaces.
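The head-splitting idea can be sketched as follows: project with WQ, WK, WV, split the model dimension into heads, run attention per head, then concatenate and apply an output projection. This is a simplified NumPy sketch (single sequence, no batching; matrix names follow the text, the rest are our own):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """X: (n, d_model); all projection matrices: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def project_and_split(W):
        # (n, d_model) -> (num_heads, n, d_head)
        return (X @ W).reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = map(project_and_split, (W_Q, W_K, W_V))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head scores
    heads = softmax(scores, axis=-1) @ V                 # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O                                  # output projection

rng = np.random.default_rng(0)
d_model, n, num_heads = 8, 5, 2
X = rng.normal(size=(n, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, num_heads=num_heads)
print(out.shape)  # (5, 8)
```

Each head attends in its own d_model/num_heads‑dimensional subspace, which is what lets different heads specialize in different relations.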
Figure 2: Encoder‑decoder attention structure
The decoder mirrors the encoder but adds two attention mechanisms: a masked self‑attention that prevents attending to future tokens, and an encoder‑decoder attention that attends to the encoder’s output. Each sub‑layer is wrapped with a residual connection and layer normalization (Add & Norm) to stabilize training, followed by a position‑wise feed‑forward network (a two‑layer ReLU network in the original transformer; gated variants such as the Gated Linear Unit, GLU, appear in later models).
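The masking in decoder self‑attention is just an additive matrix of −∞ over the upper triangle of the score matrix, so the softmax assigns zero weight to future positions. A small sketch, assuming uniform (all‑zero) raw scores:

```python
import numpy as np

def masked_softmax(scores):
    # exp(-inf) == 0, so masked positions receive exactly zero weight.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
# -inf at every position j > i (strict upper triangle) blocks future tokens.
mask = np.triu(np.full((n, n), -np.inf), k=1)
scores = np.zeros((n, n)) + mask  # zero raw scores + causal mask
alpha = masked_softmax(scores)
# Row i attends uniformly over positions 0..i; e.g. row 1 is [0.5, 0.5, 0, 0].
print(alpha)
```

At training time this lets the decoder process the whole target sequence in parallel while still behaving autoregressively.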
BERT (Bidirectional Encoder Representations from Transformers) extends the transformer encoder by pre‑training on large unlabeled corpora using two tasks: masked language modeling (MLM) and next‑sentence prediction (NSP). In MLM, 15% of tokens are selected for prediction; of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The model learns to predict the original tokens, capturing deep bidirectional context.
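The 15% / 80‑10‑10 corruption scheme can be sketched with the standard library. This is an illustrative data‑prep function of our own, not BERT's actual preprocessing code:

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", seed=0):
    """BERT-style corruption: select 15% of positions; of those,
    80% -> [MASK], 10% -> random vocab token, 10% -> unchanged."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # the model predicts only selected positions
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:            # select this position
            labels[i] = tok                # target = original token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token  # 80%: mask it
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # else 10%: keep the original token unchanged
    return corrupted, labels

vocab = "the quick brown fox jumps over lazy dog".split()
corrupted, labels = mlm_mask(vocab * 10, vocab, seed=1)
```

Keeping 10% unchanged and swapping 10% at random forces the model to maintain a useful representation of every input token, since [MASK] never appears at fine‑tuning time.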
Input to BERT consists of three embeddings: token embedding, segment embedding, and position embedding. Special tokens [CLS] and [SEP] are added to denote the start of the sequence and sentence boundaries, respectively. The final hidden state of [CLS] serves as a pooled representation for classification tasks, while token‑level representations are used for token‑wise predictions.
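The input layout above is mechanical enough to show directly. A minimal sketch of assembling the token sequence, segment ids, and position ids for a sentence pair (helper name is our own; real implementations sum the three corresponding embedding vectors):

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble [CLS] A [SEP] (B [SEP]) with segment ids: 0 for
    sentence A, 1 for sentence B. Position ids are simply 0..n-1."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

toks, segs, pos = build_bert_input(["the", "cat"], ["sat", "down"])
print(toks)  # ['[CLS]', 'the', 'cat', '[SEP]', 'sat', 'down', '[SEP]']
print(segs)  # [0, 0, 0, 0, 1, 1, 1]
```

Each position's final input embedding is the sum of its token, segment, and position embeddings.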
Despite its strengths, BERT has notable limitations. Its token embeddings are anisotropic: they occupy a narrow cone of the embedding space rather than spreading evenly across dimensions, which degrades direct similarity comparisons unless post‑processing is applied. BERT also struggles with very long sequences: its learned position embeddings cap input at 512 tokens, so longer inputs must be truncated, which can hurt performance on tasks requiring long‑range dependencies.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.