Comprehensive Overview of BERT: Architecture, Pre‑training Tasks, and Applications
This article provides a detailed introduction to BERT, covering its bidirectional transformer encoder design, pre‑training objectives such as Masked Language Modeling and Next Sentence Prediction, model configurations, differences from GPT/ELMo, and a wide range of downstream NLP applications.
Basic Information
BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional transformer‑based language representation model introduced by Google. The original paper is BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/abs/1810.04805), and the official source code is available at https://github.com/google-research/bert. Its three keywords (pre‑training, deep, bidirectional) structure the sections below.
Key Characteristics
Pre‑training
BERT’s main innovation lies in its pre‑training method, which uses two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP) to capture word‑level and sentence‑level representations.
After large‑scale pre‑training, the model can be fine‑tuned on a small amount of labeled data for tasks such as sentiment classification, achieving strong performance.
Deep
BERT‑base consists of 12 encoder layers (L=12, H=768, A=12, ~110M parameters) and BERT‑large has 24 layers (L=24, H=1024, A=16, ~340M parameters). Training models of this depth depends on large text corpora and substantial GPU/TPU compute.
Bidirectional
Through the MLM task, BERT learns contextual information from both left and right sides of a token, enabling true bidirectional understanding.
Differences from ELMo and GPT
GPT uses a unidirectional transformer decoder, limiting its ability to capture full context.
ELMo concatenates the outputs of separately trained left‑to‑right and right‑to‑left LSTMs, providing only shallow bidirectionality.
BERT employs a full transformer encoder, uses both directions simultaneously, and requires only a lightweight fine‑tuning head for downstream tasks.
Differences from the Original Transformer
BERT uses only the encoder stack of the original transformer (12 layers in BERT‑base), discarding the decoder.
It adds Segment Embeddings and learns positional embeddings, unlike the fixed positional encodings of the vanilla transformer.
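As a sketch of how these three embeddings combine, the toy example below (random weights and toy dimensions, not BERT's real sizes) sums token, learned position, and segment embeddings and applies LayerNorm, mirroring BERT's input layer:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 100, 16, 8   # toy dimensions for illustration

tok_emb = rng.normal(size=(vocab_size, hidden))
pos_emb = rng.normal(size=(max_len, hidden))   # learned, unlike sinusoidal encodings
seg_emb = rng.normal(size=(2, hidden))         # segment A = 0, segment B = 1

def embed(token_ids, segment_ids):
    """Sum token, position, and segment embeddings, then LayerNorm."""
    positions = np.arange(len(token_ids))
    x = tok_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + 1e-12)   # gain/bias omitted for brevity

h = embed(np.array([5, 17, 42]), np.array([0, 0, 1]))
print(h.shape)  # (3, 8): one hidden vector per input token
```

The element-wise sum is what lets a single vector carry identity, order, and sentence membership at once.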
Model Specifications
BERT-base: L=12, H=768, A=12, total parameters ≈ 110M, GPU memory ≈ 7 GB+
BERT-large: L=24, H=1024, A=16, total parameters ≈ 340M, GPU memory ≈ 32 GB+
Original Transformer (for reference): L=6, H=512, A=8
Main Contributions
Introduced MLM and NSP as novel pre‑training objectives.
Demonstrated that larger models (12 → 24 layers) yield better performance.
Provided a universal fine‑tuning framework for many downstream NLP tasks.
Set new state‑of‑the‑art results across multiple benchmarks, sparking the surge of self‑supervised NLP.
Application Scenarios
Text Classification
Fine‑tuning BERT on a small labeled set dramatically improves multi‑class text classification accuracy.
Sentiment Analysis
BERT can be applied at document, sentence, or aspect level to predict polarity with high precision.
Named Entity Recognition (NER)
By treating NER as a token‑level classification problem, BERT achieves strong entity detection across categories such as PERSON, ORGANIZATION, LOCATION, etc.
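The token-level framing can be made concrete with a small sketch (the tag set, sentence, and labels below are illustrative, not taken from a real dataset): each token receives exactly one BIO label, which is what BERT's per-token softmax layer predicts.

```python
# Illustrative BIO tag set: B- opens an entity, I- continues it, O is outside.
tags = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokens = ["Sundar", "Pichai", "joined", "Google", "in", "California", "."]
# Gold labels the model is trained to predict, one per token:
labels = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "O"]

# In BERT, each token's final hidden state feeds a shared softmax over `tags`;
# here we just display the one-label-per-token framing.
for tok, lab in zip(tokens, labels):
    print(f"{tok:12s} {lab}")
```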
Machine Translation
When incorporated as the encoder in an encoder‑decoder architecture, BERT supplies rich semantic representations that boost translation quality.
Two‑Stage Model
BERT follows a two‑stage paradigm: a pre‑training stage (MLM + NSP) followed by a fine‑tuning stage where a simple output layer is added for each downstream task.
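How lightweight the second stage is can be seen in a minimal sketch: all task-specific machinery is one linear layer over the [CLS] hidden state. The vectors below are random stand-ins for a pre-trained encoder's output and a freshly initialized head.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, num_labels = 768, 2

cls_vector = rng.normal(size=hidden)              # encoder output at [CLS]
W = rng.normal(size=(num_labels, hidden)) * 0.02  # new head, trained in fine-tuning
b = np.zeros(num_labels)

logits = W @ cls_vector + b
probs = np.exp(logits - logits.max())             # numerically stable softmax
probs /= probs.sum()
print(probs)  # class probabilities, e.g. positive/negative for sentiment
```

During fine-tuning, gradients flow through both the head and the pre-trained encoder, but the only new parameters are W and b.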
Pre‑training Task: MLM
During training, 15% of tokens are selected for masking. Of those, 80% are replaced with [MASK], 10% with a random token, and 10% remain unchanged. The model predicts the original token at every selected position, including the unchanged ones.
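The 80/10/10 split can be sketched in a few lines (toy string tokens here; the real model works on WordPiece ids):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=random.Random(0)):
    """Apply BERT-style masking: of selected positions, 80% -> [MASK],
    10% -> random token, 10% -> kept unchanged (but still predicted)."""
    masked = list(tokens)
    targets = {}                      # position -> original token to predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: leave the token as-is
    return masked, targets

vocab = ["cat", "dog", "sat", "on", "the", "mat"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 10
masked, targets = mask_tokens(tokens, vocab)
print(len(targets), "positions selected out of", len(tokens))
```

Keeping 10% unchanged matters: it forces the model to produce good representations for every position, since it cannot tell which tokens were tampered with.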
Pre‑training Task: NSP
Pairs of sentences are either kept in original order (label = IsNext) or the second sentence is replaced with a random one (label = NotNext). The model uses the [CLS] token to perform binary classification.
Example inputs and labels (translated from the original Chinese examples):
Input1 = [CLS] I have to [MASK] class today [SEP] I'll call you once [MASK] is over [SEP], Label1 = IsNext
Input2 = [CLS] Large model [MASK] technology is developing fast [SEP] what should we have for [MASK] tonight [SEP], Label2 = NotNext
Symbols: [CLS] – sequence start, [SEP] – sentence separator, [MASK] – masked token.
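NSP example construction can be sketched as follows (the sentences and corpus below are made up for illustration): for each adjacent sentence pair in a document, keep the true successor half the time, otherwise substitute a random sentence.

```python
import random

def make_nsp_pairs(doc, corpus, rng=random.Random(0)):
    """Build (sentence_a, sentence_b, label) triples for NSP pre-training."""
    pairs = []
    for a, b in zip(doc, doc[1:]):
        if rng.random() < 0.5:
            pairs.append((a, b, "IsNext"))              # true next sentence
        else:
            pairs.append((a, rng.choice(corpus), "NotNext"))  # random substitute
    return pairs

doc = ["I have to attend class today.", "I'll call you afterwards.",
       "Don't forget your notebook."]
corpus = ["Large language models are developing quickly.",
          "What should we have for dinner tonight?"]
for a, b, label in make_nsp_pairs(doc, corpus):
    print(f"[CLS] {a} [SEP] {b} [SEP] -> {label}")
```

The [CLS] hidden state is then trained as a binary classifier over these labels, giving BERT a sentence-pair signal that pure language modeling lacks.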
Bidirectional Understanding
Because MLM predicts masked tokens using attention over both left and right contexts, BERT achieves true bidirectional representation, unlike GPT’s left‑to‑right attention mask.
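The structural difference is just the attention mask. A small sketch contrasting the two (full matrix vs. lower-triangular causal mask):

```python
import numpy as np

seq_len = 4
bert_mask = np.ones((seq_len, seq_len))          # every token attends to every token
gpt_mask = np.tril(np.ones((seq_len, seq_len)))  # position i sees only positions <= i

print(gpt_mask)
# Zeros above the diagonal mark future positions; in practice their attention
# scores are set to -inf before the softmax so they receive zero weight.
```

BERT can afford the full mask only because MLM never asks it to predict the next token from a prefix; GPT's objective requires the causal mask.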
Appendix
Pre‑training Benefits for NLP Tasks
Pre‑training improves tasks such as sentence‑level relation modeling, sentiment detection, and entity recognition.
Strategies for Using Pre‑trained Models
Feature‑based: treat the frozen pre‑trained embeddings as additional features (e.g., ELMo).
Fine‑tuning: add a lightweight task‑specific head and continue training on the downstream data.
Parameter Calculation
The learnable parameters come from the embedding matrices (token, position, and segment) and from each transformer block (the Q/K/V/output attention projections, the two feed‑forward layers, and the LayerNorms). Summing the per‑block count over the 12 blocks and adding the embeddings yields ≈110M parameters for BERT‑base.
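This back-of-the-envelope count can be checked directly, using the published BERT-base sizes (WordPiece vocabulary 30522, 512 positions, 2 segments):

```python
# BERT-base hyperparameters: L=12 layers, H=768 hidden, FF=3072 intermediate.
V, P, S, H, L, FF = 30522, 512, 2, 768, 12, 3072

embeddings = V * H + P * H + S * H + 2 * H      # token + position + segment + LayerNorm
attention  = 4 * (H * H + H)                    # Q, K, V, output projections (+ biases)
ffn        = (H * FF + FF) + (FF * H + H)       # two feed-forward layers (+ biases)
layernorms = 2 * (2 * H)                        # one LN after attention, one after FFN
per_block  = attention + ffn + layernorms

total = embeddings + L * per_block + (H * H + H)  # last term: the [CLS] pooler layer
print(f"{total:,}")  # prints 109,482,240, i.e. the ~110M usually quoted
```

Note that the attention head count A does not appear: the 12 heads partition the same H=768 projection matrices, so they add no parameters.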
BERT vs. GPT Comparison
Architecture
GPT: unidirectional decoder, trained with standard left‑to‑right language modeling only (no NSP).
BERT: bidirectional encoder, uses MLM and NSP.
Training Tasks
GPT predicts the next token; BERT predicts masked tokens and sentence continuity.
Datasets
GPT: BooksCorpus for GPT‑1; later versions used WebText and similar large corpora.
BERT: Wikipedia, BookCorpus, and other large text collections.
Application Domains
GPT: language generation, text completion, QA.
BERT: text classification, NER, sentiment analysis, etc.
Notes recorded on 2023‑11‑14 by 山海.