
Compression Techniques for BERT: Analysis, Quantization, Pruning, Distillation, and Structure‑Preserving Methods

This article reviews BERT’s architecture, analyzes the storage and compute costs of each layer, and systematically presents compression methods—including quantization, pruning, knowledge distillation (Distilled BiLSTM and MobileBERT), and structure‑preserving techniques—aimed at enabling efficient deployment on resource‑constrained mobile devices.

DataFunSummit

The BERT model consists of an embedding layer, the linear projections before attention (query/key/value), multi‑head self‑attention, the linear projection after attention, and feed‑forward layers; storage and inference costs grow with the number of Transformer blocks, as illustrated in Figures 1‑3.
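To make the per‑layer costs concrete, the sketch below tallies approximate parameter counts for a BERT‑base‑sized configuration (hidden size 768, 12 blocks, FFN size 3072, WordPiece vocabulary of 30,522); the exact totals are illustrative and exclude LayerNorm and pooler parameters.

```python
# Approximate parameter counts for a BERT-base-like model (illustrative sizes).
vocab_size, hidden, ffn, layers, max_pos = 30522, 768, 3072, 12, 512

# Token + position + segment embeddings.
embedding = (vocab_size + max_pos + 2) * hidden

qkv = 3 * (hidden * hidden + hidden)        # linear layers before attention (Q, K, V)
attn_out = hidden * hidden + hidden         # linear layer after attention
ffn_params = (hidden * ffn + ffn) + (ffn * hidden + hidden)  # two FFN projections

per_block = qkv + attn_out + ffn_params
total = embedding + layers * per_block
print(f"embedding: {embedding:,}  per block: {per_block:,}  total: {total:,}")
```

The embedding table accounts for roughly 24M parameters and each Transformer block for roughly 7M, so the 12 stacked blocks dominate both storage and compute.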

Quantization reduces parameter precision (e.g., fp32 → fp16) to halve storage and accelerate inference on GPUs or FPGA platforms, while quantization‑aware training mitigates accuracy loss, especially for sensitive layers such as embeddings.
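A minimal post‑training quantization sketch (using NumPy for illustration; real deployments would use framework‑level tools such as quantization‑aware training) shows the storage effect of fp32 → fp16 and of symmetric int8 quantization with a single scale factor:

```python
import numpy as np

# A 768x768 weight matrix, as in one BERT attention projection.
w = np.random.randn(768, 768).astype(np.float32)

# fp32 -> fp16: 2 bytes per parameter instead of 4.
w_fp16 = w.astype(np.float16)

# Symmetric int8: 1 byte per parameter, plus one fp32 scale.
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale   # approximate reconstruction

err = np.abs(w - w_dequant).max()               # bounded by scale / 2
print(w.nbytes, w_fp16.nbytes, w_int8.nbytes)
```

The rounding error is bounded by half the quantization step; sensitive layers such as embeddings typically need quantization‑aware training to recover this loss.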

Pruning removes redundant parameters. Elementwise pruning sparsifies individual weights, whereas structured pruning eliminates entire attention heads or whole Transformer blocks, with trade‑offs between compression ratio and accuracy.
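The two pruning styles can be sketched as follows (head count and the choice of which heads to drop are illustrative): elementwise pruning zeroes small weights and needs sparse kernels to pay off, while structured pruning removes whole heads and directly shrinks the dense matrices.

```python
import numpy as np

w = np.random.randn(768, 768).astype(np.float32)

# Elementwise magnitude pruning: zero the 50% of weights with smallest magnitude.
threshold = np.percentile(np.abs(w), 50)
w_sparse = np.where(np.abs(w) < threshold, 0.0, w)

# Structured pruning: drop entire attention heads (12 heads of dimension 64 assumed).
num_heads, head_dim = 12, 64
heads_to_keep = [h for h in range(num_heads) if h not in {3, 7}]  # prune heads 3 and 7
w_heads = w.reshape(num_heads, head_dim, 768)
w_pruned = w_heads[heads_to_keep].reshape(-1, 768)  # dense matrix, now (640, 768)
```

Elementwise pruning usually reaches higher compression ratios at a given accuracy, but structured pruning yields speed‑ups on commodity hardware without sparse support.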

Knowledge Distillation transfers knowledge from a large teacher model to a smaller student model. Two examples are presented:

Distilled BiLSTM – a lightweight single‑layer BiLSTM learns from BERT via output‑probability distillation, achieving ~99.7% size reduction and 400× speed‑up.

MobileBERT – a narrow BERT variant that inserts bottleneck layers into each Transformer block, attaining up to 10× compression on its own and up to 40× when combined with quantization, while preserving task performance.
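The output‑probability distillation behind approaches like Distilled BiLSTM can be sketched as a KL divergence between temperature‑softened teacher and student distributions (the temperature, logits, and T² scaling follow Hinton et al.'s soft‑target formulation; the Distilled BiLSTM paper itself also distills logits directly with an MSE term):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T = 2.0                                           # illustrative temperature
teacher_logits = np.array([[2.0, 0.5, -1.0]])
student_logits = np.array([[1.5, 0.8, -0.5]])

p_t = softmax(teacher_logits, T)
p_s = softmax(student_logits, T)

# KL(teacher || student), scaled by T^2 so gradients keep a consistent magnitude.
kl = (p_t * (np.log(p_t) - np.log(p_s))).sum() * T**2
```

Training the student on these soft targets conveys the teacher's relative confidence across classes, which hard labels discard.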

MobileBERT training uses a staged layer‑wise distillation strategy, employing feature‑map transfer (hidden‑layer distillation) and attention transfer (KL‑divergence between teacher and student attention distributions) to align intermediate representations.
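The two transfer objectives can be sketched as follows (function names and tensor shapes are illustrative): attention transfer takes a per‑head KL divergence between teacher and student attention distributions, and feature‑map transfer takes an MSE between hidden states.

```python
import numpy as np

def attention_transfer_loss(attn_teacher, attn_student, eps=1e-9):
    """KL divergence between attention maps of shape (heads, seq, seq),
    where each row is a probability distribution over key positions."""
    kl = attn_teacher * (np.log(attn_teacher + eps) - np.log(attn_student + eps))
    return kl.sum(axis=-1).mean()   # sum over keys, average over heads and queries

def feature_map_loss(h_teacher, h_student):
    """MSE between teacher and student hidden states of shape (seq, hidden)."""
    return ((h_teacher - h_student) ** 2).mean()
```

In MobileBERT's staged strategy these losses are applied layer by layer, so each student block is aligned with its teacher counterpart before the final task‑level distillation.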

Structure‑Preserving Compression includes parameter sharing, low‑rank factorization of embeddings, and attention decoupling for sentence‑pair tasks, which reduce storage or inference cost without altering the model architecture.
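Low‑rank factorization of the embedding table, as popularized by ALBERT, illustrates the storage arithmetic: a V×H table is replaced by V×E and E×H factors, which shrinks parameters whenever E ≪ H (sizes below are illustrative, matching BERT‑base's vocabulary and hidden size).

```python
# Low-rank embedding factorization: V x H  ->  (V x E) + (E x H).
V, H, E = 30522, 768, 128   # vocabulary, hidden size, bottleneck size

full = V * H                # parameters in the original embedding table
factored = V * E + E * H    # parameters after factorization
ratio = full / factored     # ~5.9x fewer embedding parameters
```

Because the factorization is applied only at the embedding lookup, the Transformer blocks themselves are untouched, which is what makes the method structure‑preserving.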

The article concludes that a combination of these techniques enables BERT to run efficiently on mobile terminals, with references to recent research papers.

Tags: model compression, quantization, pruning, knowledge distillation, BERT, mobile deployment
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
