BERT Model Overview: Inputs, Encoder, Fine‑tuning, and Variants
This article explains BERT's WordPiece tokenization, input embeddings (token, segment, and position embeddings), encoder architecture for Base and Large models, fine‑tuning strategies for various NLP tasks, and introduces popular variants such as RoBERTa and ALBERT.
BERT Inputs
Tokenization Method
BERT uses WordPiece embeddings: rare words are split into sub‑word units that appear frequently in the corpus, allowing a compact vocabulary (≈30k) to represent most text while keeping the number of tokens manageable.
The first token of every sequence is the special token [CLS] (classification) and a [SEP] token is appended after each sentence to mark separation.
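As a toy illustration of the greedy longest-match splitting that WordPiece performs, the sketch below uses a tiny hypothetical vocabulary (`TOY_VOCAB` and `wordpiece_tokenize` are illustrative names); real BERT ships a ≈30k-entry vocabulary file, and continuation pieces carry a `##` prefix.

```python
# Greedy longest-match WordPiece tokenization, sketched with a toy
# vocabulary. Real BERT uses a ~30k-entry vocab learned from the corpus.
TOY_VOCAB = {"[UNK]", "play", "##ing", "un", "##bel", "##iev", "##able"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB):
    """Split one word into sub-word units by greedy longest-match."""
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no sub-word match: whole word maps to [UNK]
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece_tokenize("playing"))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable"))  # ['un', '##bel', '##iev', '##able']
```

This is how a rare word like "unbelievable" can be covered by a compact vocabulary: each of its pieces is a frequent sub-word unit.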
Input Embeddings
BERT's input representation consists of three parts: token embeddings, segment embeddings, and position embeddings. For each token, the final input vector is the element-wise sum of the token's embedding (768-dimensional in BERT-Base), the embedding of the segment (sentence A or B) it belongs to, and a learned positional embedding.
Unlike the original Transformer, which uses fixed sinusoidal positional encodings, BERT learns position embeddings jointly with the rest of the model.
Token Embeddings: Look‑up table mapping each token to a 768‑dimensional vector.
Segment Embeddings: Distinguish the two sentences (segment 0 or 1) in a pair‑input scenario.
Position Embeddings: Learned positional vectors obtained via look‑up.
In the original Transformer, by contrast, the input is simply the sum of the word embedding and the fixed sinusoidal position encoding; there is no segment embedding.
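The three-way sum can be sketched with toy look-up tables. This is a minimal illustration, not BERT's actual weights: the dimensions are tiny (`HIDDEN = 4` instead of 768) and the tables are random.

```python
import random

random.seed(0)
HIDDEN = 4                       # toy size; BERT-Base uses 768
VOCAB, MAX_POS, SEGMENTS = 10, 8, 2

def table(rows, cols):
    """A random look-up table standing in for a learned embedding matrix."""
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

tok_emb = table(VOCAB, HIDDEN)      # token look-up table
seg_emb = table(SEGMENTS, HIDDEN)   # segment 0 / segment 1
pos_emb = table(MAX_POS, HIDDEN)    # learned positions (not sinusoidal)

def embed(token_ids, segment_ids):
    """Input vector per position = token + segment + position embedding."""
    return [
        [t + s + p for t, s, p in zip(tok_emb[tid], seg_emb[sid], pos_emb[i])]
        for i, (tid, sid) in enumerate(zip(token_ids, segment_ids))
    ]

vecs = embed([2, 5, 7], [0, 0, 1])  # three tokens; last one in segment B
```

Because the position table is an ordinary learned look-up, it is trained jointly with the rest of the model, as noted above.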
BERT Encoder
Architecture
The encoder comprises three components: the input layer (as described above), multi‑head self‑attention, and a feed‑forward neural network, mirroring the Transformer encoder. BERT only uses the encoder part of the Transformer, available in two sizes: BERT‑Base and BERT‑Large.
BERT‑Base Information
BERT‑Base stacks 12 Transformer encoder layers. Key specifications:
| Specification | Details |
| --- | --- |
| Encoder layers | 12 |
| Maximum sequence length | 512 |
| Hidden size (dim) | 768 |
| Number of attention heads | 12 |
| Parameters | ≈110 M |
| GPU memory requirement | ≈7 GB+ |
BERT‑Large Information
BERT‑Large stacks 24 encoder layers. Key specifications:
| Specification | Details |
| --- | --- |
| Encoder layers | 24 |
| Maximum sequence length | 512 |
| Hidden size (dim) | 1024 |
| Number of attention heads | 16 |
| Parameters | ≈340 M |
| GPU memory requirement | ≈32 GB+ |
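These parameter counts can be sanity-checked with a back-of-envelope calculation. The sketch below (`bert_param_count` is an illustrative helper) assumes the standard BERT dimensions: a ≈30,522-token vocabulary, 512 positions, a feed-forward width of 4×hidden, hidden size 768 for Base and 1024 for Large, and no task-specific head.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, ffn_mult=4):
    """Rough parameter count for a BERT-style encoder (weights + biases)."""
    # Token + position + segment embeddings, plus the embedding LayerNorm.
    emb = vocab * hidden + max_pos * hidden + 2 * hidden + 2 * hidden
    # Per layer: Q, K, V, and output projections (head count does not
    # change the total, since head_dim = hidden / heads).
    attn = 4 * (hidden * hidden + hidden)
    # Per layer: two feed-forward linear layers with biases.
    ffn = 2 * (hidden * ffn_mult * hidden) + ffn_mult * hidden + hidden
    norms = 2 * 2 * hidden              # two LayerNorms per layer
    pooler = hidden * hidden + hidden   # [CLS] pooler head
    return emb + layers * (attn + ffn + norms) + pooler

base = bert_param_count(12, 768)    # ≈ 110 M
large = bert_param_count(24, 1024)  # ≈ 335 M
```

Most of the per-layer budget sits in the feed-forward block, and the embedding table alone accounts for roughly 23 M of BERT-Base's parameters.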
BERT Fine‑tuning
Pre‑training and Fine‑tuning
BERT is first trained on massive corpora using self‑supervised objectives (Masked Language Modeling and Next Sentence Prediction) to learn generic word representations. The resulting model is then fine‑tuned on downstream tasks by adding a small task‑specific head.
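The Masked Language Modeling corruption can be sketched concretely. BERT selects ~15% of input tokens as prediction targets and, of those, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged; `mlm_mask` below is a minimal stand-in for this procedure (103 is [MASK]'s ID in the standard uncased vocabulary).

```python
import random

def mlm_mask(token_ids, vocab_size, mask_id, mask_prob=0.15, rng=random):
    """BERT-style MLM corruption: sample ~15% of positions as targets;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    corrupted, targets = list(token_ids), []
    for i in range(len(token_ids)):
        if rng.random() < mask_prob:
            targets.append(i)           # the model must predict this position
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id                  # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% of targets keep their original token
    return corrupted, targets

random.seed(1)
ids = list(range(100, 120))
corrupted, targets = mlm_mask(ids, vocab_size=30522, mask_id=103)
```

Keeping some targets unchanged (and randomizing others) discourages the model from relying on seeing [MASK] itself, since [MASK] never appears at fine-tuning time.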
Typical fine‑tuning tasks include:
Sentence‑pair Classification
Predict the relationship between two sentences (e.g., semantic similarity, coherence). Input format: [CLS] sentence1 [SEP] sentence2 [SEP]. Output: a label indicating the relation.
Single‑sentence Classification
Assign a category to a single sentence (e.g., news topic, intent detection). Input format: [CLS] sentence [SEP]. The [CLS] vector is used as the feature for classification.
Question Answering (QA)
Given a question and a passage, extract the answer span from the passage. Input format: [CLS] question [SEP] context [SEP]. Output: start and end indices of the answer.
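At inference time, span extraction reduces to scoring candidate (start, end) pairs from the model's per-token logits. A minimal sketch, assuming the start and end logits have already been produced (`best_span` and the numbers are illustrative):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair with the highest summed score,
    subject to start <= end and a maximum answer length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start_logits = [0.1, 2.0, 0.3, 0.2]  # toy per-token scores
end_logits   = [0.0, 0.5, 3.0, 0.1]
print(best_span(start_logits, end_logits))  # (1, 2)
```

The start <= end constraint is what makes the joint search necessary: taking the two argmaxes independently can produce an invalid span.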
Named Entity Recognition (NER)
Label each token with entity tags (e.g., PER, ORG, LOC) using a sequence labeling scheme such as BIOES.
BIOES tagging meanings: B = Begin, I = Intermediate, E = End, S = Single, O = Outside.
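The BIOES scheme can be made concrete by converting known entity spans into per-token tags; `spans_to_bioes` below is an illustrative helper, not part of any library.

```python
def spans_to_bioes(n_tokens, spans):
    """spans: list of (start, end_exclusive, type). Returns one tag per token."""
    tags = ["O"] * n_tokens                 # O = Outside any entity
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"      # S = Single-token entity
        else:
            tags[start] = f"B-{etype}"      # B = Begin
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"      # I = Intermediate
            tags[end - 1] = f"E-{etype}"    # E = End
    return tags

# Tokens: "Barack Obama visited Paris" -> PER span (0, 2), LOC span (3, 4)
tags = spans_to_bioes(4, [(0, 2, "PER"), (3, 4, "LOC")])
print(tags)  # ['B-PER', 'E-PER', 'O', 'S-LOC']
```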
BERT Variants
RoBERTa (Robustly Optimized BERT Pretraining Approach)
Key differences from original BERT:
Removed the Next Sentence Prediction (NSP) task.
Trained on longer sequences.
Increased pre‑training data from 16 GB to 160 GB.
Used larger batch sizes (up to 8 K).
Expanded vocabulary from 30 k to 50 k tokens.
ALBERT (A Lite BERT)
Factorized embedding parameterization: decouples embedding size (E) from hidden size (H) to reduce parameters.
Cross‑layer parameter sharing: the same Transformer layer is reused across all encoder blocks.
Replaces NSP with Sentence Order Prediction (SOP) task.
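The saving from factorized embedding parameterization is easy to quantify: a direct V×H table is replaced by V×E + E×H parameters, which is much smaller when E ≪ H. A sketch with illustrative sizes (V = 30,000, H = 768, E = 128; `embedding_params` is a hypothetical helper):

```python
def embedding_params(vocab, hidden, factor_dim=None):
    """Embedding parameter count: V*H directly, or V*E + E*H when the
    embedding size E is decoupled from the hidden size H (ALBERT-style)."""
    if factor_dim is None:
        return vocab * hidden                       # BERT: one V x H table
    return vocab * factor_dim + factor_dim * hidden  # ALBERT: V x E, then E x H

bert_style   = embedding_params(30000, 768)       # 23,040,000
albert_style = embedding_params(30000, 768, 128)  #  3,938,304
```

Combined with cross-layer parameter sharing, this is why ALBERT can match BERT's hidden size with far fewer parameters.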
Special Token Vocabulary in BERT
BERT defines several special tokens that mark structure and control the training process.
[CLS]
The classification token placed at the beginning of the sequence; its final hidden state is used for sentence‑level tasks such as classification or regression.
[SEP]
The separator token that delineates different segments (e.g., two sentences) and also marks the end of the input.
[PAD]
Padding token used to extend shorter sequences to a uniform length for batch processing.
[MASK]
Mask token employed during pre‑training for the Masked Language Modeling objective; random tokens are replaced with [MASK] and the model learns to predict the original token.
[UNK]
Unknown token representing out‑of‑vocabulary words that were not seen during pre‑training.
Additional custom special tokens can be introduced for domain‑specific applications.
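Padding and its companion attention mask can be sketched as follows (`pad_batch` is illustrative; 101 and 102 stand for the [CLS] and [SEP] IDs in the standard uncased vocabulary, 0 for [PAD]):

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length ID sequences to a uniform length and build
    the matching attention mask (1 = real token, 0 = [PAD])."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

padded, mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(padded)  # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask lets self-attention ignore [PAD] positions, so padding changes only the batch shape, not the model's output for the real tokens.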