Comprehensive Overview of BERT: Architecture, Pre‑training Tasks, and Applications
This article provides a detailed introduction to BERT, covering its bidirectional transformer encoder design, pre‑training objectives such as Masked Language Modeling and Next Sentence Prediction, model configurations, differences from GPT/ELMo, and a wide range of downstream NLP applications.
Basic Information
BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional transformer‑based language representation model introduced by Google. The original paper is BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/abs/1810.04805), and the official source code is available at https://github.com/google-research/bert. Its three keywords (pre‑training, deep, bidirectional) structure the sections below.
Key Characteristics
Pre‑training
BERT’s main innovation lies in its pre‑training method, which uses two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP) to capture word‑level and sentence‑level representations.
After large‑scale pre‑training, the model can be fine‑tuned on a small amount of labeled data for tasks such as sentiment classification, achieving strong performance.
Deep
BERT‑base consists of 12 encoder layers (L=12, H=768, A=12, ~110M parameters) and BERT‑large has 24 layers (L=24, H=1024, A=16, ~340M parameters). Training models of this depth depends on large text corpora and substantial GPU/TPU compute.
Bidirectional
Through the MLM task, BERT learns contextual information from both left and right sides of a token, enabling true bidirectional understanding.
Differences from ELMo and GPT
GPT uses a unidirectional transformer decoder, limiting its ability to capture full context.
ELMo concatenates the outputs of separately trained left‑to‑right and right‑to‑left LSTMs, providing only shallow bidirectionality.
BERT employs a full transformer encoder, uses both directions simultaneously, and requires only a lightweight fine‑tuning head for downstream tasks.
Differences from the Original Transformer
BERT uses only the encoder stack of the original transformer (12 layers in BERT‑base), discarding the decoder.
It adds Segment Embeddings and learns positional embeddings, unlike the fixed positional encodings of the vanilla transformer.
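As a sketch of how these three embeddings combine, the toy example below (random weights and toy dimensions, not BERT's real sizes) sums token, learned position, and segment embeddings and applies LayerNorm, mirroring BERT's input layer:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 100, 16, 8   # toy dimensions for illustration

tok_emb = rng.normal(size=(vocab_size, hidden))
pos_emb = rng.normal(size=(max_len, hidden))   # learned, unlike sinusoidal encodings
seg_emb = rng.normal(size=(2, hidden))         # segment A = 0, segment B = 1

def embed(token_ids, segment_ids):
    """Sum token, position, and segment embeddings, then LayerNorm."""
    positions = np.arange(len(token_ids))
    x = tok_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + 1e-12)   # gain/bias omitted for brevity

h = embed(np.array([5, 17, 42]), np.array([0, 0, 1]))
print(h.shape)  # (3, 8): one hidden vector per input token
```

The element-wise sum is what lets a single vector carry identity, order, and sentence membership at once.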
Model Specifications
BERT-base: L=12, H=768, A=12, total parameters ≈ 110M, GPU memory ≈ 7 GB+
BERT-large: L=24, H=1024, A=16, total parameters ≈ 340M, GPU memory ≈ 32 GB+
Original Transformer (for reference): L=6, H=512, A=8
Main Contributions
Introduced MLM and NSP as novel pre‑training objectives.
Demonstrated that larger models (12 → 24 layers) yield better performance.
Provided a universal fine‑tuning framework for many downstream NLP tasks.
Set new state‑of‑the‑art results across multiple benchmarks, sparking the surge of self‑supervised NLP.
Application Scenarios
Text Classification
Fine‑tuning BERT on a small labeled set dramatically improves multi‑class text classification accuracy.
Sentiment Analysis
BERT can be applied at document, sentence, or aspect level to predict polarity with high precision.
Named Entity Recognition (NER)
By treating NER as a token‑level classification problem, BERT achieves strong entity detection across categories such as PERSON, ORGANIZATION, LOCATION, etc.
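The token-level framing can be made concrete with a small sketch (the tag set, sentence, and labels below are illustrative, not taken from a real dataset): each token receives exactly one BIO label, which is what BERT's per-token softmax layer predicts.

```python
# Illustrative BIO tag set: B- opens an entity, I- continues it, O is outside.
tags = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokens = ["Sundar", "Pichai", "joined", "Google", "in", "California", "."]
# Gold labels the model is trained to predict, one per token:
labels = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "O"]

# In BERT, each token's final hidden state feeds a shared softmax over `tags`;
# here we just display the one-label-per-token framing.
for tok, lab in zip(tokens, labels):
    print(f"{tok:12s} {lab}")
```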
Machine Translation
When incorporated as the encoder in an encoder‑decoder architecture, BERT supplies rich semantic representations that boost translation quality.
Two‑Stage Model
BERT follows a two‑stage paradigm: a pre‑training stage (MLM + NSP) followed by a fine‑tuning stage where a simple output layer is added for each downstream task.
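How lightweight the second stage is can be seen in a minimal sketch: all task-specific machinery is one linear layer over the [CLS] hidden state. The vectors below are random stand-ins for a pre-trained encoder's output and a freshly initialized head.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, num_labels = 768, 2

cls_vector = rng.normal(size=hidden)              # encoder output at [CLS]
W = rng.normal(size=(num_labels, hidden)) * 0.02  # new head, trained in fine-tuning
b = np.zeros(num_labels)

logits = W @ cls_vector + b
probs = np.exp(logits - logits.max())             # numerically stable softmax
probs /= probs.sum()
print(probs)  # class probabilities, e.g. positive/negative for sentiment
```

During fine-tuning, gradients flow through both the head and the pre-trained encoder, but the only new parameters are W and b.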
Pre‑training Task: MLM
During training, 15% of tokens are selected for masking. Of those, 80% are replaced with [MASK], 10% with a random token, and 10% remain unchanged. The model predicts the original token at every selected position, including the unchanged ones.
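The 80/10/10 split can be sketched in a few lines (toy string tokens here; the real model works on WordPiece ids):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=random.Random(0)):
    """Apply BERT-style masking: of selected positions, 80% -> [MASK],
    10% -> random token, 10% -> kept unchanged (but still predicted)."""
    masked = list(tokens)
    targets = {}                      # position -> original token to predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: leave the token as-is
    return masked, targets

vocab = ["cat", "dog", "sat", "on", "the", "mat"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 10
masked, targets = mask_tokens(tokens, vocab)
print(len(targets), "positions selected out of", len(tokens))
```

Keeping 10% unchanged matters: it forces the model to produce good representations for every position, since it cannot tell which tokens were tampered with.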
Pre‑training Task: NSP
Pairs of sentences are either kept in original order (label = IsNext) or the second sentence is replaced with a random one (label = NotNext). The model uses the [CLS] token to perform binary classification.
Example inputs and labels (translated from the original Chinese examples):
Input1 = [CLS] I have to [MASK] class today [SEP] I'll call you once [MASK] is over [SEP], Label1 = IsNext
Input2 = [CLS] Large model [MASK] technology is developing fast [SEP] what should we have for [MASK] tonight [SEP], Label2 = NotNext
Symbols: [CLS] – sequence start, [SEP] – sentence separator, [MASK] – masked token.
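NSP example construction can be sketched as follows (the sentences and corpus below are made up for illustration): for each adjacent sentence pair in a document, keep the true successor half the time, otherwise substitute a random sentence.

```python
import random

def make_nsp_pairs(doc, corpus, rng=random.Random(0)):
    """Build (sentence_a, sentence_b, label) triples for NSP pre-training."""
    pairs = []
    for a, b in zip(doc, doc[1:]):
        if rng.random() < 0.5:
            pairs.append((a, b, "IsNext"))              # true next sentence
        else:
            pairs.append((a, rng.choice(corpus), "NotNext"))  # random substitute
    return pairs

doc = ["I have to attend class today.", "I'll call you afterwards.",
       "Don't forget your notebook."]
corpus = ["Large language models are developing quickly.",
          "What should we have for dinner tonight?"]
for a, b, label in make_nsp_pairs(doc, corpus):
    print(f"[CLS] {a} [SEP] {b} [SEP] -> {label}")
```

The [CLS] hidden state is then trained as a binary classifier over these labels, giving BERT a sentence-pair signal that pure language modeling lacks.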
Bidirectional Understanding
Because MLM predicts masked tokens using attention over both left and right contexts, BERT achieves true bidirectional representation, unlike GPT’s left‑to‑right attention mask.
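The structural difference is just the attention mask. A small sketch contrasting the two (full matrix vs. lower-triangular causal mask):

```python
import numpy as np

seq_len = 4
bert_mask = np.ones((seq_len, seq_len))          # every token attends to every token
gpt_mask = np.tril(np.ones((seq_len, seq_len)))  # position i sees only positions <= i

print(gpt_mask)
# Zeros above the diagonal mark future positions; in practice their attention
# scores are set to -inf before the softmax so they receive zero weight.
```

BERT can afford the full mask only because MLM never asks it to predict the next token from a prefix; GPT's objective requires the causal mask.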
Appendix
Pre‑training Benefits for NLP Tasks
Pre‑training improves tasks such as sentence‑level relation modeling, sentiment detection, and entity recognition.
Strategies for Using Pre‑trained Models
Feature‑based: treat the frozen pre‑trained embeddings as additional features (e.g., ELMo).
Fine‑tuning: add a lightweight task‑specific head and continue training on the downstream data.
Parameter Calculation
The learnable parameters come from the embedding matrices (token, position, and segment) and from each transformer block (the Q/K/V/output attention projections, the two feed‑forward layers, and the LayerNorms). Summing the per‑block count over the 12 blocks and adding the embeddings yields ≈110M parameters for BERT‑base.
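This back-of-the-envelope count can be checked directly, using the published BERT-base sizes (WordPiece vocabulary 30522, 512 positions, 2 segments):

```python
# BERT-base hyperparameters: L=12 layers, H=768 hidden, FF=3072 intermediate.
V, P, S, H, L, FF = 30522, 512, 2, 768, 12, 3072

embeddings = V * H + P * H + S * H + 2 * H      # token + position + segment + LayerNorm
attention  = 4 * (H * H + H)                    # Q, K, V, output projections (+ biases)
ffn        = (H * FF + FF) + (FF * H + H)       # two feed-forward layers (+ biases)
layernorms = 2 * (2 * H)                        # one LN after attention, one after FFN
per_block  = attention + ffn + layernorms

total = embeddings + L * per_block + (H * H + H)  # last term: the [CLS] pooler layer
print(f"{total:,}")  # prints 109,482,240, i.e. the ~110M usually quoted
```

Note that the attention head count A does not appear: the 12 heads partition the same H=768 projection matrices, so they add no parameters.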
BERT vs. GPT Comparison
Architecture
GPT: unidirectional decoder, trained with standard left‑to‑right language modeling only (no NSP).
BERT: bidirectional encoder, uses MLM and NSP.
Training Tasks
GPT predicts the next token; BERT predicts masked tokens and sentence continuity.
Datasets
GPT: BooksCorpus for GPT‑1; later versions used WebText and similar large corpora.
BERT: Wikipedia, BookCorpus, and other large text collections.
Application Domains
GPT: language generation, text completion, QA.
BERT: text classification, NER, sentiment analysis, etc.
Notes recorded on 2023‑11‑14 by 山海.