
Compression Techniques for BERT: Analysis, Quantization, Pruning, Distillation, and Structure‑Preserving Methods

This article reviews BERT’s architecture, analyzes the storage and compute costs of each layer, and systematically presents compression methods—including quantization, pruning, knowledge distillation (Distilled BiLSTM and MobileBERT), and structure‑preserving techniques—aimed at enabling efficient deployment on resource‑constrained mobile devices.

DataFunSummit

The BERT model consists of an embedding layer, the linear projections before attention (query/key/value), multi‑head self‑attention, the linear projection after attention, and feed‑forward layers; storage and inference costs grow with the number of Transformer blocks, as illustrated in Figures 1‑3.
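To make the per‑layer costs concrete, the sketch below tallies approximate parameter counts for a BERT‑base‑sized configuration (hidden size 768, 12 blocks, FFN size 3072, WordPiece vocabulary of 30,522); the exact totals are illustrative and exclude LayerNorm and pooler parameters.

```python
# Approximate parameter counts for a BERT-base-like model (illustrative sizes).
vocab_size, hidden, ffn, layers, max_pos = 30522, 768, 3072, 12, 512

# Token + position + segment embeddings.
embedding = (vocab_size + max_pos + 2) * hidden

qkv = 3 * (hidden * hidden + hidden)        # linear layers before attention (Q, K, V)
attn_out = hidden * hidden + hidden         # linear layer after attention
ffn_params = (hidden * ffn + ffn) + (ffn * hidden + hidden)  # two FFN projections

per_block = qkv + attn_out + ffn_params
total = embedding + layers * per_block
print(f"embedding: {embedding:,}  per block: {per_block:,}  total: {total:,}")
```

The embedding table accounts for roughly 24M parameters and each Transformer block for roughly 7M, so the 12 stacked blocks dominate both storage and compute.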

Quantization reduces parameter precision (e.g., fp32 → fp16) to halve storage and accelerate inference on GPUs or FPGA platforms, while quantization‑aware training mitigates accuracy loss, especially for sensitive layers such as embeddings.
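A minimal post‑training quantization sketch (using NumPy for illustration; real deployments would use framework‑level tools such as quantization‑aware training) shows the storage effect of fp32 → fp16 and of symmetric int8 quantization with a single scale factor:

```python
import numpy as np

# A 768x768 weight matrix, as in one BERT attention projection.
w = np.random.randn(768, 768).astype(np.float32)

# fp32 -> fp16: 2 bytes per parameter instead of 4.
w_fp16 = w.astype(np.float16)

# Symmetric int8: 1 byte per parameter, plus one fp32 scale.
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale   # approximate reconstruction

err = np.abs(w - w_dequant).max()               # bounded by scale / 2
print(w.nbytes, w_fp16.nbytes, w_int8.nbytes)
```

The rounding error is bounded by half the quantization step; sensitive layers such as embeddings typically need quantization‑aware training to recover this loss.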

Pruning removes redundant parameters. Elementwise pruning sparsifies individual weights, whereas structured pruning eliminates entire attention heads or whole Transformer blocks, with trade‑offs between compression ratio and accuracy.
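The two pruning styles can be sketched as follows (head count and the choice of which heads to drop are illustrative): elementwise pruning zeroes small weights and needs sparse kernels to pay off, while structured pruning removes whole heads and directly shrinks the dense matrices.

```python
import numpy as np

w = np.random.randn(768, 768).astype(np.float32)

# Elementwise magnitude pruning: zero the 50% of weights with smallest magnitude.
threshold = np.percentile(np.abs(w), 50)
w_sparse = np.where(np.abs(w) < threshold, 0.0, w)

# Structured pruning: drop entire attention heads (12 heads of dimension 64 assumed).
num_heads, head_dim = 12, 64
heads_to_keep = [h for h in range(num_heads) if h not in {3, 7}]  # prune heads 3 and 7
w_heads = w.reshape(num_heads, head_dim, 768)
w_pruned = w_heads[heads_to_keep].reshape(-1, 768)  # dense matrix, now (640, 768)
```

Elementwise pruning usually reaches higher compression ratios at a given accuracy, but structured pruning yields speed‑ups on commodity hardware without sparse support.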

Knowledge Distillation transfers knowledge from a large teacher model to a smaller student model. Two examples are presented:

Distilled BiLSTM – a lightweight single‑layer BiLSTM learns from BERT via output‑probability distillation, achieving ~99.7% size reduction and 400× speed‑up.

MobileBERT – a narrow BERT variant that inserts bottleneck layers into each Transformer block, attaining up to 10× compression on its own and up to 40× when combined with quantization, while preserving task performance.
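The output‑probability distillation behind approaches like Distilled BiLSTM can be sketched as a KL divergence between temperature‑softened teacher and student distributions (the temperature, logits, and T² scaling follow Hinton et al.'s soft‑target formulation; the Distilled BiLSTM paper itself also distills logits directly with an MSE term):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T = 2.0                                           # illustrative temperature
teacher_logits = np.array([[2.0, 0.5, -1.0]])
student_logits = np.array([[1.5, 0.8, -0.5]])

p_t = softmax(teacher_logits, T)
p_s = softmax(student_logits, T)

# KL(teacher || student), scaled by T^2 so gradients keep a consistent magnitude.
kl = (p_t * (np.log(p_t) - np.log(p_s))).sum() * T**2
```

Training the student on these soft targets conveys the teacher's relative confidence across classes, which hard labels discard.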

MobileBERT training uses a staged layer‑wise distillation strategy, employing feature‑map transfer (hidden‑layer distillation) and attention transfer (KL‑divergence between teacher and student attention distributions) to align intermediate representations.
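The two transfer objectives can be sketched as follows (function names and tensor shapes are illustrative): attention transfer takes a per‑head KL divergence between teacher and student attention distributions, and feature‑map transfer takes an MSE between hidden states.

```python
import numpy as np

def attention_transfer_loss(attn_teacher, attn_student, eps=1e-9):
    """KL divergence between attention maps of shape (heads, seq, seq),
    where each row is a probability distribution over key positions."""
    kl = attn_teacher * (np.log(attn_teacher + eps) - np.log(attn_student + eps))
    return kl.sum(axis=-1).mean()   # sum over keys, average over heads and queries

def feature_map_loss(h_teacher, h_student):
    """MSE between teacher and student hidden states of shape (seq, hidden)."""
    return ((h_teacher - h_student) ** 2).mean()
```

In MobileBERT's staged strategy these losses are applied layer by layer, so each student block is aligned with its teacher counterpart before the final task‑level distillation.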

Structure‑Preserving Compression includes parameter sharing, low‑rank factorization of embeddings, and attention decoupling for sentence‑pair tasks, which reduce storage or inference cost without altering the model architecture.
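Low‑rank factorization of the embedding table, as popularized by ALBERT, illustrates the storage arithmetic: a V×H table is replaced by V×E and E×H factors, which shrinks parameters whenever E ≪ H (sizes below are illustrative, matching BERT‑base's vocabulary and hidden size).

```python
# Low-rank embedding factorization: V x H  ->  (V x E) + (E x H).
V, H, E = 30522, 768, 128   # vocabulary, hidden size, bottleneck size

full = V * H                # parameters in the original embedding table
factored = V * E + E * H    # parameters after factorization
ratio = full / factored     # ~5.9x fewer embedding parameters
```

Because the factorization is applied only at the embedding lookup, the Transformer blocks themselves are untouched, which is what makes the method structure‑preserving.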

The article concludes that a combination of these techniques enables BERT to run efficiently on mobile terminals, with references to recent research papers.

Tags: model compression, quantization, pruning, knowledge distillation, BERT, mobile deployment
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
