An Overview of BERT: Architecture, Pre‑training Tasks, Comparisons, and Applications
This article provides a comprehensive English overview of BERT, covering its original paper, model architecture, pre‑training objectives (Masked Language Model and Next Sentence Prediction), differences from ELMo, GPT and vanilla Transformers, parameter counts, main contributions, and a range of NLP application scenarios such as text classification, sentiment analysis, NER, and machine translation.
Basic Information
BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional transformer encoder model introduced by Google. The original paper is BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/abs/1810.04805). The source code is available at https://github.com/google-research/bert.
Characteristics of BERT
Pre‑training
The main innovation lies in the pre‑training stage, which uses two objectives: Masked Language Model (MLM) and Next Sentence Prediction (NSP) to capture token‑level and sentence‑level representations.
Deep
BERT‑base stacks 12 encoder layers with a hidden size of 768 and 12 attention heads; this depth, combined with pre‑training on massive corpora, yields state‑of‑the‑art results. Pre‑trained multilingual checkpoints are publicly released.
Bidirectional
Through the MLM task, BERT learns contextual information from both left and right sides of a token, achieving true bidirectional understanding.
Differences with ELMo / GPT
GPT uses a unidirectional (left‑to‑right) transformer trained to predict the next token, so each token's representation can draw only on its left context.
ELMo concatenates the outputs of independently trained left‑to‑right and right‑to‑left LSTMs, giving only a shallow bidirectional fusion, and downstream tasks require task‑specific architectures.
BERT, built on the transformer encoder, conditions on both left and right context during pre‑training without labeled data, and downstream tasks need only fine‑tuning with an additional output layer.
Differences with the Original Transformer
BERT uses only the transformer encoder and stacks many layers (e.g., 12).
It adds Segment Embeddings and learns positional embeddings, whereas the original transformer uses fixed positional encodings.
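The fixed encodings used by the original transformer can be computed in closed form, whereas BERT simply learns a position‑embedding table of shape (512, H). A minimal sketch of the sinusoidal variant (assuming an even model dimension; function name is illustrative):

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Fixed positional encoding from the original transformer:
    PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
    BERT replaces this with a learned embedding looked up by position index."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

# Position 0 encodes as alternating sin(0)=0 and cos(0)=1:
# sinusoidal_encoding(0, 8) -> [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Because BERT's positions are learned parameters rather than a fixed formula, its maximum sequence length is capped by the table size (512 in the released models).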
Model Details
Key hyper‑parameters: number of layers L, hidden dimension H, number of attention heads A, total parameters TP.
BERT‑base: L=12, H=768, A=12, TP≈110 M (GPU ≥ 7 GB).
BERT‑large: L=24, H=1024, A=16, TP≈340 M (GPU ≥ 32 GB).
Transformer (baseline): L=6, H=512, A=8.
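The parameter totals above can be roughly reproduced from the hyper‑parameters alone. A back‑of‑the‑envelope sketch (assuming the released models' 30,522‑token WordPiece vocabulary, 512 positions, 2 segment types, and a feed‑forward size of 4H; the helper name is illustrative):

```python
def bert_param_count(L, H, A, vocab=30522, max_pos=512, seg=2):
    """Rough BERT parameter count from its hyper-parameters.
    A (number of heads) only splits H across heads and does not change the total."""
    emb = (vocab + max_pos + seg) * H + 2 * H       # token/position/segment tables + embedding LayerNorm
    attn = 4 * (H * H + H)                          # Q, K, V and output projections (weights + biases)
    ffn = (H * 4 * H + 4 * H) + (4 * H * H + H)     # feed-forward H -> 4H -> H
    norms = 2 * 2 * H                               # two LayerNorms per encoder layer
    pooler = H * H + H                              # [CLS] pooler on top of the encoder
    return emb + L * (attn + ffn + norms) + pooler

base = bert_param_count(L=12, H=768, A=12)    # ~109.5 M, reported as 110 M
large = bert_param_count(L=24, H=1024, A=16)  # ~335 M, commonly rounded to 340 M
```

Most of the budget sits in the encoder layers; the embedding tables contribute about 24 M of BERT‑base's total.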
Main Contributions
Introduced Masked LM and the NSP objective, enabling deep bidirectional pre‑training.
Demonstrated that larger models (12 → 24 layers) yield better performance.
Provided a universal fine‑tuning framework that eliminates task‑specific model design.
Set new records on many NLP benchmarks, sparking the surge of unsupervised pre‑training.
Application Scenarios
Text Classification
Because pre‑training already supplies rich semantic representations, fine‑tuning BERT with relatively little labeled data yields strong multi‑class text classification accuracy.
Sentiment Analysis
BERT can be fine‑tuned for document‑level, sentence‑level, or aspect‑level sentiment classification, capturing nuanced polarity.
Named Entity Recognition
By treating NER as a token‑level classification task, BERT achieves high precision in identifying entities such as persons, organizations, locations, dates, etc.
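Casting NER as token‑level classification typically means assigning each token a BIO label that a classification head then predicts. A hypothetical sketch of the span‑to‑label conversion (function and entity types are illustrative, not from the BERT codebase):

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) entity spans into per-token BIO labels.
    `end` is exclusive; indices refer to token positions."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"            # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"            # continuation tokens
    return labels

tokens = ["Barack", "Obama", "visited", "Paris"]
labels = spans_to_bio(tokens, [(0, 2, "PER"), (3, 4, "LOC")])
# labels == ["B-PER", "I-PER", "O", "B-LOC"]
```

During fine‑tuning, each token's final hidden state is fed through a softmax over this label set.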
Machine Translation
In encoder‑decoder translation systems, a BERT encoder supplies enriched semantic embeddings to the decoder, improving translation quality.
Two‑Stage Model
BERT follows a two‑stage paradigm: first a large‑scale pre‑training phase, then a fine‑tuning phase on downstream tasks, requiring only an additional output layer.
Stage 1: Pre‑training
The pre‑training phase jointly optimizes MLM and NSP objectives.
Stage 2: Fine‑tuning
During fine‑tuning, the same architecture and pre‑trained parameters are used; only the final classification head is adapted to the specific task.
The special token [CLS] marks the start of every sequence, and its final hidden state serves as an aggregate representation for classification; [SEP] separates the two sentences of a pair and also terminates the sequence.
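Assembling a sentence pair in this format also produces the segment ids that feed the Segment Embeddings: 0 for sentence A (including [CLS] and its [SEP]), 1 for sentence B. A minimal sketch (helper name is illustrative):

```python
def build_bert_input(tokens_a, tokens_b):
    """Assemble a BERT sentence-pair input: [CLS] A [SEP] B [SEP],
    with segment id 0 for the A part and 1 for the B part."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segs = build_bert_input(["my", "dog"], ["he", "barks"])
# tokens == ['[CLS]', 'my', 'dog', '[SEP]', 'he', 'barks', '[SEP]']
# segs   == [0, 0, 0, 0, 1, 1, 1]
```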
Self‑Supervised Learning
Unlike supervised learning that requires human‑annotated labels, self‑supervised learning creates pseudo‑labels from the data itself (e.g., masking tokens). BERT’s massive pre‑training is based on this paradigm.
Pre‑training Tasks
Masked Language Model (MLM)
During pre‑training, 15% of the input token positions are selected for prediction. Each selected token is then processed as follows (example sentence: "my dog is hairy"):
80% probability: replace with [MASK] → "my dog is [MASK]".
10% probability: replace with a random token → "my dog is apple".
10% probability: keep unchanged → "my dog is hairy".
The model predicts the original token only for the masked positions, computing loss and back‑propagating gradients.
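The 15% / 80‑10‑10 procedure above can be sketched directly; this is a simplified illustration (helper name and the per‑token Bernoulli sampling are assumptions — the released code selects positions in a slightly different way):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style MLM masking sketch: each token is selected with ~15% probability.
    A selected token becomes [MASK] 80% of the time, a random vocabulary token
    10% of the time, and stays unchanged 10% of the time. The loss is computed
    only at selected positions (where `targets` is not None)."""
    rng = rng or random.Random()
    masked = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                     # model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)    # random replacement
            # else: keep the original token unchanged
    return masked, targets
```

Keeping 10% unchanged matters: it forces the model to maintain a good representation for every input token, since any position might be a prediction target.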
Next Sentence Prediction (NSP)
Pairs of sentences are constructed; 50% of the time they are consecutive (label IsNext), otherwise the second sentence is replaced with a random one from the corpus (label NotNext). The [CLS] token representation is used for the binary classification.
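Pair construction can be sketched as follows (a simplified illustration with hypothetical names; the released code additionally handles sequence-length budgets and document boundaries):

```python
import random

def make_nsp_pair(doc_sentences, corpus_sentences, rng):
    """Build one NSP training pair from a document:
    50% an actual consecutive pair (IsNext), 50% a random second
    sentence drawn from the corpus (NotNext)."""
    i = rng.randrange(len(doc_sentences) - 1)
    first = doc_sentences[i]
    if rng.random() < 0.5:
        return first, doc_sentences[i + 1], "IsNext"
    return first, rng.choice(corpus_sentences), "NotNext"
```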
Example
Input1 = [CLS] 我今天要[MASK]课 [SEP] 上完[MASK]给你打电话 [SEP]
("I have to [MASK] class today; I'll call you after [MASK].")
Label1 = IsNext
Input2 = [CLS] 大模型[MASK]技术发展很快 [SEP] 晚[MASK]吃什么 [SEP]
("Large‑model [MASK] technology is advancing quickly; what should we eat to[MASK]?")
Label2 = NotNext
BERT’s Bidirectional Understanding
Because self‑attention in the encoder lets each position attend to both its left and right context, MLM training yields truly bidirectional representations, unlike GPT's causal attention, which restricts each token to its left context.
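The difference comes down to the attention mask. A minimal sketch contrasting the two (helper name is illustrative; real implementations apply the mask as additive negative infinities on attention scores):

```python
def attention_mask(n, causal):
    """Boolean attention mask for a sequence of length n: entry [i][j] is True
    if position i may attend to position j. GPT applies a causal mask (j <= i);
    BERT's encoder lets every position attend to every other position."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

gpt_mask = attention_mask(4, causal=True)
bert_mask = attention_mask(4, causal=False)
# gpt_mask[1]  == [True, True, False, False]  -> token 1 sees only the left context
# bert_mask[1] == [True, True, True, True]    -> token 1 sees both sides
```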
Appendix
Further reading includes detailed analyses of the BERT paper, parameter calculations, and various tutorial links.