BERT Model Overview: Inputs, Encoder, Fine‑tuning, and Variants
This article explains BERT's WordPiece tokenization, input embeddings (token, segment, and position embeddings), encoder architecture for Base and Large models, fine‑tuning strategies for various NLP tasks, and introduces popular variants such as RoBERTa and ALBERT.
BERT Inputs
Tokenization Method
BERT uses WordPiece embeddings: rare words are split into sub‑word units that appear frequently in the corpus, allowing a compact vocabulary (≈30k) to represent most text while keeping the number of tokens manageable.
The first token of every sequence is the special token [CLS] (classification) and a [SEP] token is appended after each sentence to mark separation.
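As a toy illustration of the greedy longest-match splitting that WordPiece performs, the sketch below uses a tiny hypothetical vocabulary (`TOY_VOCAB` and `wordpiece_tokenize` are illustrative names); real BERT ships a ≈30k-entry vocabulary file, and continuation pieces carry a `##` prefix.

```python
# Greedy longest-match WordPiece tokenization, sketched with a toy
# vocabulary. Real BERT uses a ~30k-entry vocab learned from the corpus.
TOY_VOCAB = {"[UNK]", "play", "##ing", "un", "##bel", "##iev", "##able"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB):
    """Split one word into sub-word units by greedy longest-match."""
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no sub-word match: whole word maps to [UNK]
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece_tokenize("playing"))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable"))  # ['un', '##bel', '##iev', '##able']
```

This is how a rare word like "unbelievable" can be covered by a compact vocabulary: each of its pieces is a frequent sub-word unit.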
Input Embeddings
BERT's input representation consists of three parts: token embeddings, segment embeddings, and position embeddings. For each token, the final input vector is the element-wise sum of the token's embedding (768-dimensional in BERT-Base), the embedding of the segment (sentence A or B) it belongs to, and a learned positional embedding.
Unlike the original Transformer, which uses fixed sinusoidal positional encodings, BERT learns position embeddings jointly with the rest of the model.
Token Embeddings: Look‑up table mapping each token to a 768‑dimensional vector.
Segment Embeddings: Distinguish the two sentences (segment 0 or 1) in a pair‑input scenario.
Position Embeddings: Learned positional vectors obtained via look‑up.
In the original Transformer, by contrast, the input is simply the sum of the word embedding and the fixed sinusoidal position encoding; there is no segment embedding.
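The three-way sum can be sketched with toy look-up tables. This is a minimal illustration, not BERT's actual weights: the dimensions are tiny (`HIDDEN = 4` instead of 768) and the tables are random.

```python
import random

random.seed(0)
HIDDEN = 4                       # toy size; BERT-Base uses 768
VOCAB, MAX_POS, SEGMENTS = 10, 8, 2

def table(rows, cols):
    """A random look-up table standing in for a learned embedding matrix."""
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

tok_emb = table(VOCAB, HIDDEN)      # token look-up table
seg_emb = table(SEGMENTS, HIDDEN)   # segment 0 / segment 1
pos_emb = table(MAX_POS, HIDDEN)    # learned positions (not sinusoidal)

def embed(token_ids, segment_ids):
    """Input vector per position = token + segment + position embedding."""
    return [
        [t + s + p for t, s, p in zip(tok_emb[tid], seg_emb[sid], pos_emb[i])]
        for i, (tid, sid) in enumerate(zip(token_ids, segment_ids))
    ]

vecs = embed([2, 5, 7], [0, 0, 1])  # three tokens; last one in segment B
```

Because the position table is an ordinary learned look-up, it is trained jointly with the rest of the model, as noted above.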
BERT Encoder
Architecture
The encoder comprises three components: the input layer (as described above), multi‑head self‑attention, and a feed‑forward neural network, mirroring the Transformer encoder. BERT only uses the encoder part of the Transformer, available in two sizes: BERT‑Base and BERT‑Large.
BERT‑Base Information
BERT‑Base stacks 12 Transformer encoder layers. Key specifications:
| Specification | Details |
| --- | --- |
| Encoder layers | 12 |
| Maximum sequence length | 512 |
| Hidden size (dim) | 768 |
| Number of attention heads | 12 |
| Parameters | ≈110 M |
| GPU memory requirement | ≈7 GB+ |
BERT‑Large Information
BERT‑Large stacks 24 encoder layers. Key specifications:
| Specification | Details |
| --- | --- |
| Encoder layers | 24 |
| Maximum sequence length | 512 |
| Hidden size (dim) | 1024 |
| Number of attention heads | 16 |
| Parameters | ≈340 M |
| GPU memory requirement | ≈32 GB+ |
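These parameter counts can be sanity-checked with a back-of-envelope calculation. The sketch below (`bert_param_count` is an illustrative helper) assumes the standard BERT dimensions: a ≈30,522-token vocabulary, 512 positions, a feed-forward width of 4×hidden, hidden size 768 for Base and 1024 for Large, and no task-specific head.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, ffn_mult=4):
    """Rough parameter count for a BERT-style encoder (weights + biases)."""
    # Token + position + segment embeddings, plus the embedding LayerNorm.
    emb = vocab * hidden + max_pos * hidden + 2 * hidden + 2 * hidden
    # Per layer: Q, K, V, and output projections (head count does not
    # change the total, since head_dim = hidden / heads).
    attn = 4 * (hidden * hidden + hidden)
    # Per layer: two feed-forward linear layers with biases.
    ffn = 2 * (hidden * ffn_mult * hidden) + ffn_mult * hidden + hidden
    norms = 2 * 2 * hidden              # two LayerNorms per layer
    pooler = hidden * hidden + hidden   # [CLS] pooler head
    return emb + layers * (attn + ffn + norms) + pooler

base = bert_param_count(12, 768)    # ≈ 110 M
large = bert_param_count(24, 1024)  # ≈ 335 M
```

Most of the per-layer budget sits in the feed-forward block, and the embedding table alone accounts for roughly 23 M of BERT-Base's parameters.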
BERT Fine‑tuning
Pre‑training and Fine‑tuning
BERT is first trained on massive corpora using self‑supervised objectives (Masked Language Modeling and Next Sentence Prediction) to learn generic word representations. The resulting model is then fine‑tuned on downstream tasks by adding a small task‑specific head.
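The Masked Language Modeling corruption can be sketched concretely. BERT selects ~15% of input tokens as prediction targets and, of those, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged; `mlm_mask` below is a minimal stand-in for this procedure (103 is [MASK]'s ID in the standard uncased vocabulary).

```python
import random

def mlm_mask(token_ids, vocab_size, mask_id, mask_prob=0.15, rng=random):
    """BERT-style MLM corruption: sample ~15% of positions as targets;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    corrupted, targets = list(token_ids), []
    for i in range(len(token_ids)):
        if rng.random() < mask_prob:
            targets.append(i)           # the model must predict this position
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id                  # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% of targets keep their original token
    return corrupted, targets

random.seed(1)
ids = list(range(100, 120))
corrupted, targets = mlm_mask(ids, vocab_size=30522, mask_id=103)
```

Keeping some targets unchanged (and randomizing others) discourages the model from relying on seeing [MASK] itself, since [MASK] never appears at fine-tuning time.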
Typical fine‑tuning tasks include:
Sentence‑pair Classification
Predict the relationship between two sentences (e.g., semantic similarity, coherence). Input format: [CLS] sentence1 [SEP] sentence2 [SEP]. Output: a label indicating the relation.
Single‑sentence Classification
Assign a category to a single sentence (e.g., news topic, intent detection). Input format: [CLS] sentence [SEP]. The [CLS] vector is used as the feature for classification.
Question Answering (QA)
Given a question and a passage, extract the answer span from the passage. Input format: [CLS] question [SEP] context [SEP]. Output: start and end indices of the answer.
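At inference time, span extraction reduces to scoring candidate (start, end) pairs from the model's per-token logits. A minimal sketch, assuming the start and end logits have already been produced (`best_span` and the numbers are illustrative):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair with the highest summed score,
    subject to start <= end and a maximum answer length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start_logits = [0.1, 2.0, 0.3, 0.2]  # toy per-token scores
end_logits   = [0.0, 0.5, 3.0, 0.1]
print(best_span(start_logits, end_logits))  # (1, 2)
```

The start <= end constraint is what makes the joint search necessary: taking the two argmaxes independently can produce an invalid span.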
Named Entity Recognition (NER)
Label each token with entity tags (e.g., PER, ORG, LOC) using a sequence labeling scheme such as BIOES.
BIOES tagging meanings: B = Begin, I = Intermediate, E = End, S = Single, O = Outside.
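The BIOES scheme can be made concrete by converting known entity spans into per-token tags; `spans_to_bioes` below is an illustrative helper, not part of any library.

```python
def spans_to_bioes(n_tokens, spans):
    """spans: list of (start, end_exclusive, type). Returns one tag per token."""
    tags = ["O"] * n_tokens                 # O = Outside any entity
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"      # S = Single-token entity
        else:
            tags[start] = f"B-{etype}"      # B = Begin
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"      # I = Intermediate
            tags[end - 1] = f"E-{etype}"    # E = End
    return tags

# Tokens: "Barack Obama visited Paris" -> PER span (0, 2), LOC span (3, 4)
tags = spans_to_bioes(4, [(0, 2, "PER"), (3, 4, "LOC")])
print(tags)  # ['B-PER', 'E-PER', 'O', 'S-LOC']
```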
BERT Variants
RoBERTa (Robustly Optimized BERT Pretraining Approach)
Key differences from original BERT:
Removed the Next Sentence Prediction (NSP) task.
Trained on longer sequences.
Increased pre‑training data from 16 GB to 160 GB.
Used larger batch sizes (up to 8 K).
Expanded vocabulary from 30 k to 50 k tokens.
ALBERT (A Lite BERT)
Factorized embedding parameterization: decouples embedding size (E) from hidden size (H) to reduce parameters.
Cross‑layer parameter sharing: the same Transformer layer is reused across all encoder blocks.
Replaces NSP with Sentence Order Prediction (SOP) task.
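The saving from factorized embedding parameterization is easy to quantify: a direct V×H table is replaced by V×E + E×H parameters, which is much smaller when E ≪ H. A sketch with illustrative sizes (V = 30,000, H = 768, E = 128; `embedding_params` is a hypothetical helper):

```python
def embedding_params(vocab, hidden, factor_dim=None):
    """Embedding parameter count: V*H directly, or V*E + E*H when the
    embedding size E is decoupled from the hidden size H (ALBERT-style)."""
    if factor_dim is None:
        return vocab * hidden                       # BERT: one V x H table
    return vocab * factor_dim + factor_dim * hidden  # ALBERT: V x E, then E x H

bert_style   = embedding_params(30000, 768)       # 23,040,000
albert_style = embedding_params(30000, 768, 128)  #  3,938,304
```

Combined with cross-layer parameter sharing, this is why ALBERT can match BERT's hidden size with far fewer parameters.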
Special Token Vocabulary in BERT
BERT defines several special tokens that mark structure and control the training process.
[CLS]
The classification token placed at the beginning of the sequence; its final hidden state is used for sentence‑level tasks such as classification or regression.
[SEP]
The separator token that delineates different segments (e.g., two sentences) and also marks the end of the input.
[PAD]
Padding token used to extend shorter sequences to a uniform length for batch processing.
[MASK]
Mask token employed during pre‑training for the Masked Language Modeling objective; random tokens are replaced with [MASK] and the model learns to predict the original token.
[UNK]
Unknown token representing out‑of‑vocabulary words that were not seen during pre‑training.
Additional custom special tokens can be introduced for domain‑specific applications.
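Padding and its companion attention mask can be sketched as follows (`pad_batch` is illustrative; 101 and 102 stand for the [CLS] and [SEP] IDs in the standard uncased vocabulary, 0 for [PAD]):

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length ID sequences to a uniform length and build
    the matching attention mask (1 = real token, 0 = [PAD])."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

padded, mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(padded)  # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask lets self-attention ignore [PAD] positions, so padding changes only the batch shape, not the model's output for the real tokens.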