An Overview of BERT: Architecture, Pre‑training Tasks, Comparisons, and Applications
This article provides a comprehensive English overview of BERT, covering its original paper, model architecture, pre‑training objectives (Masked Language Model and Next Sentence Prediction), differences from ELMo, GPT and vanilla Transformers, parameter counts, main contributions, and a range of NLP application scenarios such as text classification, sentiment analysis, NER, and machine translation.
Basic Information
BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional transformer encoder model introduced by Google. The original paper is BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/abs/1810.04805). The source code is available at https://github.com/google-research/bert.
Characteristics of BERT
Pre‑training
The main innovation lies in the pre‑training stage, which uses two objectives: Masked Language Model (MLM) and Next Sentence Prediction (NSP) to capture token‑level and sentence‑level representations.
Deep
BERT‑base stacks 12 encoder layers with a hidden size of 768 and 12 attention heads; this depth, combined with pre‑training on massive corpora, yields state‑of‑the‑art results. Pre‑trained multilingual checkpoints are publicly released.
Bidirectional
Through the MLM task, BERT learns contextual information from both left and right sides of a token, achieving true bidirectional understanding.
Differences with ELMo / GPT
GPT uses a unidirectional (left‑to‑right) transformer trained to predict the next token, so each token's representation can draw only on its left context.
ELMo concatenates the outputs of independently trained left‑to‑right and right‑to‑left LSTMs, giving only a shallow bidirectional fusion, and downstream tasks require task‑specific architectures.
BERT, built on the transformer encoder, conditions on both left and right context during pre‑training without labeled data, and downstream tasks need only fine‑tuning with an additional output layer.
Differences with the Original Transformer
BERT uses only the transformer encoder and stacks many layers (e.g., 12).
It adds Segment Embeddings and learns positional embeddings, whereas the original transformer uses fixed positional encodings.
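The fixed encodings used by the original transformer can be computed in closed form, whereas BERT simply learns a position‑embedding table of shape (512, H). A minimal sketch of the sinusoidal variant (assuming an even model dimension; function name is illustrative):

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Fixed positional encoding from the original transformer:
    PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
    BERT replaces this with a learned embedding looked up by position index."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

# Position 0 encodes as alternating sin(0)=0 and cos(0)=1:
# sinusoidal_encoding(0, 8) -> [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Because BERT's positions are learned parameters rather than a fixed formula, its maximum sequence length is capped by the table size (512 in the released models).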
Model Details
Key hyper‑parameters: number of layers L, hidden dimension H, number of attention heads A, total parameters TP.
BERT‑base: L=12, H=768, A=12, TP≈110 M (GPU ≥ 7 GB).
BERT‑large: L=24, H=1024, A=16, TP≈340 M (GPU ≥ 32 GB).
Transformer (baseline): L=6, H=512, A=8.
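The parameter totals above can be roughly reproduced from the hyper‑parameters alone. A back‑of‑the‑envelope sketch (assuming the released models' 30,522‑token WordPiece vocabulary, 512 positions, 2 segment types, and a feed‑forward size of 4H; the helper name is illustrative):

```python
def bert_param_count(L, H, A, vocab=30522, max_pos=512, seg=2):
    """Rough BERT parameter count from its hyper-parameters.
    A (number of heads) only splits H across heads and does not change the total."""
    emb = (vocab + max_pos + seg) * H + 2 * H       # token/position/segment tables + embedding LayerNorm
    attn = 4 * (H * H + H)                          # Q, K, V and output projections (weights + biases)
    ffn = (H * 4 * H + 4 * H) + (4 * H * H + H)     # feed-forward H -> 4H -> H
    norms = 2 * 2 * H                               # two LayerNorms per encoder layer
    pooler = H * H + H                              # [CLS] pooler on top of the encoder
    return emb + L * (attn + ffn + norms) + pooler

base = bert_param_count(L=12, H=768, A=12)    # ~109.5 M, reported as 110 M
large = bert_param_count(L=24, H=1024, A=16)  # ~335 M, commonly rounded to 340 M
```

Most of the budget sits in the encoder layers; the embedding tables contribute about 24 M of BERT‑base's total.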
Main Contributions
Introduced Masked LM and the NSP objective, enabling deep bidirectional pre‑training.
Demonstrated that larger models (12 → 24 layers) yield better performance.
Provided a universal fine‑tuning framework that eliminates task‑specific model design.
Set new records on many NLP benchmarks, sparking the surge of unsupervised pre‑training.
Application Scenarios
Text Classification
Because pre‑training already supplies rich semantic representations, fine‑tuning BERT with relatively little labeled data yields strong multi‑class text classification accuracy.
Sentiment Analysis
BERT can be fine‑tuned for document‑level, sentence‑level, or aspect‑level sentiment classification, capturing nuanced polarity.
Named Entity Recognition
By treating NER as a token‑level classification task, BERT achieves high precision in identifying entities such as persons, organizations, locations, dates, etc.
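Casting NER as token‑level classification typically means assigning each token a BIO label that a classification head then predicts. A hypothetical sketch of the span‑to‑label conversion (function and entity types are illustrative, not from the BERT codebase):

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) entity spans into per-token BIO labels.
    `end` is exclusive; indices refer to token positions."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"            # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"            # continuation tokens
    return labels

tokens = ["Barack", "Obama", "visited", "Paris"]
labels = spans_to_bio(tokens, [(0, 2, "PER"), (3, 4, "LOC")])
# labels == ["B-PER", "I-PER", "O", "B-LOC"]
```

During fine‑tuning, each token's final hidden state is fed through a softmax over this label set.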
Machine Translation
In encoder‑decoder translation systems, a BERT encoder supplies enriched semantic embeddings to the decoder, improving translation quality.
Two‑Stage Model
BERT follows a two‑stage paradigm: first a large‑scale pre‑training phase, then a fine‑tuning phase on downstream tasks, requiring only an additional output layer.
Stage 1: Pre‑training
The pre‑training phase jointly optimizes MLM and NSP objectives.
Stage 2: Fine‑tuning
During fine‑tuning, the same architecture and pre‑trained parameters are used; only the final classification head is adapted to the specific task.
The special token [CLS] marks the start of every sequence, and its final hidden state serves as an aggregate representation for classification; [SEP] separates the two sentences of a pair and also terminates the sequence.
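Assembling a sentence pair in this format also produces the segment ids that feed the Segment Embeddings: 0 for sentence A (including [CLS] and its [SEP]), 1 for sentence B. A minimal sketch (helper name is illustrative):

```python
def build_bert_input(tokens_a, tokens_b):
    """Assemble a BERT sentence-pair input: [CLS] A [SEP] B [SEP],
    with segment id 0 for the A part and 1 for the B part."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segs = build_bert_input(["my", "dog"], ["he", "barks"])
# tokens == ['[CLS]', 'my', 'dog', '[SEP]', 'he', 'barks', '[SEP]']
# segs   == [0, 0, 0, 0, 1, 1, 1]
```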
Self‑Supervised Learning
Unlike supervised learning that requires human‑annotated labels, self‑supervised learning creates pseudo‑labels from the data itself (e.g., masking tokens). BERT’s massive pre‑training is based on this paradigm.
Pre‑training Tasks
Masked Language Model (MLM)
During pre‑training, 15% of the input token positions are selected for prediction. Each selected token is then processed as follows (example sentence: "my dog is hairy"):
80% probability: replace with [MASK] → "my dog is [MASK]".
10% probability: replace with a random token → "my dog is apple".
10% probability: keep unchanged → "my dog is hairy".
The model predicts the original token only for the masked positions, computing loss and back‑propagating gradients.
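The 15% / 80‑10‑10 procedure above can be sketched directly; this is a simplified illustration (helper name and the per‑token Bernoulli sampling are assumptions — the released code selects positions in a slightly different way):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style MLM masking sketch: each token is selected with ~15% probability.
    A selected token becomes [MASK] 80% of the time, a random vocabulary token
    10% of the time, and stays unchanged 10% of the time. The loss is computed
    only at selected positions (where `targets` is not None)."""
    rng = rng or random.Random()
    masked = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                     # model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)    # random replacement
            # else: keep the original token unchanged
    return masked, targets
```

Keeping 10% unchanged matters: it forces the model to maintain a good representation for every input token, since any position might be a prediction target.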
Next Sentence Prediction (NSP)
Pairs of sentences are constructed; 50% of the time they are consecutive (label IsNext), otherwise the second sentence is replaced with a random one from the corpus (label NotNext). The [CLS] token representation is used for the binary classification.
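Pair construction can be sketched as follows (a simplified illustration with hypothetical names; the released code additionally handles sequence-length budgets and document boundaries):

```python
import random

def make_nsp_pair(doc_sentences, corpus_sentences, rng):
    """Build one NSP training pair from a document:
    50% an actual consecutive pair (IsNext), 50% a random second
    sentence drawn from the corpus (NotNext)."""
    i = rng.randrange(len(doc_sentences) - 1)
    first = doc_sentences[i]
    if rng.random() < 0.5:
        return first, doc_sentences[i + 1], "IsNext"
    return first, rng.choice(corpus_sentences), "NotNext"
```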
Example
Input1 = [CLS] 我今天要[MASK]课 [SEP] 上完[MASK]给你打电话 [SEP]
("I have to [MASK] class today; I'll call you after [MASK].")
Label1 = IsNext
Input2 = [CLS] 大模型[MASK]技术发展很快 [SEP] 晚[MASK]吃什么 [SEP]
("Large‑model [MASK] technology is advancing quickly; what should we eat to[MASK]?")
Label2 = NotNext
BERT’s Bidirectional Understanding
Because self‑attention in the encoder lets each position attend to both its left and right context, MLM training yields truly bidirectional representations, unlike GPT's causal attention, which restricts each token to its left context.
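The difference comes down to the attention mask. A minimal sketch contrasting the two (helper name is illustrative; real implementations apply the mask as additive negative infinities on attention scores):

```python
def attention_mask(n, causal):
    """Boolean attention mask for a sequence of length n: entry [i][j] is True
    if position i may attend to position j. GPT applies a causal mask (j <= i);
    BERT's encoder lets every position attend to every other position."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

gpt_mask = attention_mask(4, causal=True)
bert_mask = attention_mask(4, causal=False)
# gpt_mask[1]  == [True, True, False, False]  -> token 1 sees only the left context
# bert_mask[1] == [True, True, True, True]    -> token 1 sees both sides
```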
Appendix
Further reading includes detailed analyses of the BERT paper, parameter calculations, and various tutorial links.