
Overview of Main Model Compression and Acceleration Techniques: Structural Optimization, Pruning, Quantization, and Knowledge Distillation

This article reviews four mainstream model compression and acceleration methods—structural optimization, pruning, quantization, and knowledge distillation—explaining their principles, implementations, and performance, and presents practical examples such as DistilBERT, TinyBERT, and FastBERT with comparative results.

Sohu Tech Products

In recent years, deep learning models have achieved widespread success in computer vision, natural language processing, and other fields, but their large parameter counts lead to high computational and memory costs, making deployment on resource‑constrained platforms challenging. Model compression and acceleration aim to reduce model size and inference cost while preserving task performance.

1. Introduction

Model compression reduces the number of parameters, while acceleration reduces computational complexity; the two are related but not identical. Compression makes models smaller and easier to deploy, whereas acceleration focuses on faster inference.

Necessity: mainstream models such as VGG‑16 contain over 130 million parameters, occupy >500 MB, and require >3 × 10¹⁰ FLOPs per image.

Feasibility: many parameters are redundant or have limited impact, so training a subset can achieve comparable or even superior performance.

Compression and acceleration can be tackled at the algorithm, framework, and hardware levels; this article concentrates on algorithm‑level techniques.

2. Main Techniques

The four dominant algorithmic approaches are structural optimization, pruning, quantization, and knowledge distillation.

2.1 Structural Optimization

Designing more efficient network architectures reduces redundancy and computation. Common strategies include:

Matrix factorization (e.g., ALBERT embedding layer)

Parameter sharing (e.g., CNNs, ALBERT)

Grouped convolutions (e.g., ShuffleNet, MobileNet)

Decomposed convolutions (Inception V2 replaces a 5×5 kernel with two stacked 3×3 kernels; Inception V3 factorizes n×n kernels into asymmetric 1×n and n×1 pairs)

Replacing fully‑connected layers with global average pooling

Using 1×1 convolutions
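The parameter savings from matrix factorization, as in ALBERT's embedding layer, can be illustrated by simple counting: instead of a direct V×H embedding matrix, the vocabulary is first projected into a small space of size E and then up to H. A minimal sketch with hypothetical BERT-like sizes (the values of V, H, and E are illustrative, not from the article):

```python
# Hypothetical sizes, assuming a BERT-like vocabulary and hidden width.
V, H, E = 30000, 768, 128  # vocab size, hidden size, small embedding size

params_full = V * H                # direct V x H embedding matrix
params_factored = V * E + E * H    # factorized: V x E followed by E x H

print(params_full)      # 23040000
print(params_factored)  # 3938304 — roughly a 6x reduction in this layer
```

Because V is much larger than H, shrinking the vocabulary-facing dimension to E dominates the savings.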

2.2 Pruning

Pruning removes “redundant” parameters from a pretrained model based on evaluation criteria. It can be:

Unstructured pruning: fine-grained removal of individual weights, which produces irregular sparsity and therefore limited speedup on general-purpose hardware.

Structured pruning: coarse-grained removal of entire filters or channels, yielding regular sparsity that accelerates inference on existing hardware, though it may degrade accuracy and typically requires fine-tuning.
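Unstructured magnitude pruning can be sketched in a few lines: rank weights by absolute value and zero out the smallest fraction. This is an illustrative sketch (the `magnitude_prune` helper is hypothetical, not a library function):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude weights (unstructured pruning sketch)."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # Threshold at the k-th smallest absolute value; keep strictly larger weights.
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w, 0.5)
print((pruned == 0).mean())  # 0.5 — half the weights removed
```

A structured variant would instead rank whole rows (filters/channels) by their L2 norm and drop entire rows, which is what makes the resulting matrix genuinely smaller at inference time.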

2.3 Quantization

Quantization represents weights, activations, gradients, etc., with lower‑bit formats (e.g., 16‑bit, 8‑bit, 2‑bit, 1‑bit). Benefits include:

Significant reduction in storage and memory bandwidth (e.g., 8‑bit quantization cuts storage by ~75%).

Faster integer arithmetic and lower power consumption.

Drawbacks are potential accuracy loss and the need for specialized training or hardware support.
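The storage saving is easy to see in a minimal sketch of symmetric per-tensor 8-bit quantization, assuming float32 inputs; `quantize_int8` and `dequantize` are illustrative helpers, not a specific library's API:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization sketch: map [-max|x|, max|x|] to [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.linspace(-1.0, 1.0, 11, dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
print(x.nbytes, q.nbytes)        # 44 11 — int8 uses 1/4 the bytes of float32
print(np.abs(x - x_hat).max())   # small round-trip quantization error
```

The ~75% storage reduction quoted above is exactly this 4-byte-to-1-byte ratio; the accuracy cost shows up as the round-trip error, which grows with the dynamic range of the tensor.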

2.4 Knowledge Distillation

Introduced by Hinton et al. (2015), knowledge distillation trains a small Student model using the soft outputs of a large Teacher model. The process typically:

Trains the Teacher model.

Uses the Teacher’s temperature-softened softmax outputs as soft labels for the Student.

Optimizes a loss that combines soft‑label KL divergence with the hard‑label cross‑entropy.

This transfers knowledge, allowing the Student to achieve performance comparable to the Teacher while being much smaller. Limitations include reliance on softmax outputs, making it most suitable for classification tasks.
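The combined loss described in the steps above can be sketched as follows; the temperature T, the weight alpha, and the helper names are illustrative choices rather than values prescribed by the article (the T² factor follows Hinton et al.'s scaling of the soft-label gradient):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * soft-label KL divergence (at temperature T, scaled by T^2)
    plus (1 - alpha) * hard-label cross-entropy."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean() * T * T
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * kl + (1 - alpha) * ce

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 5))      # toy teacher logits
student = rng.normal(size=(8, 5))      # toy student logits
labels = rng.integers(0, 5, size=8)    # hard labels
loss = distillation_loss(student, teacher, labels)
```

When the Student's logits match the Teacher's exactly, the KL term vanishes, which is why only the cross-entropy term remains to anchor the Student to the ground truth.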

3. Application Examples

3.1 DistilBERT

DistilBERT (Sanh et al., 2019) halves BERT’s depth from 12 to 6 layers, removes token‑type embeddings and the pooler, and initializes the Student with Teacher parameters. The training loss combines the masked‑language‑modeling loss, KL divergence between the softened Teacher and Student output distributions, and a cosine‑embedding loss that aligns hidden states. The result is roughly 40% fewer parameters and about 60% faster inference while retaining about 97% of BERT’s performance on GLUE.

3.2 TinyBERT

TinyBERT (Jiao et al., 2019) follows a two‑stage distillation pipeline: (1) General Distillation from an unfine‑tuned BERT to obtain a general Student, and (2) Task‑specific Distillation using a fine‑tuned Teacher to further adapt the Student. The loss comprises MSE terms on the embedding layer, hidden states, and attention matrices, plus a cross‑entropy term on the prediction layer.
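The layer-to-layer MSE terms can be sketched as below. The layer mapping and the projection matrix W (needed because the Student's hidden size is smaller than the Teacher's) follow TinyBERT's general setup, but the function, shapes, and mapping here are illustrative:

```python
import numpy as np

def layer_wise_mse(student_hiddens, teacher_hiddens, layer_map, W):
    """Mean-squared error between each mapped pair of layers.
    layer_map[i] names the Teacher layer distilled into Student layer i;
    W projects the Student's smaller hidden size up to the Teacher's."""
    total = 0.0
    for i, j in enumerate(layer_map):
        total += np.mean((student_hiddens[i] @ W - teacher_hiddens[j]) ** 2)
    return total / len(layer_map)

rng = np.random.default_rng(0)
student = [rng.normal(size=(3, 4)) for _ in range(2)]  # 2 layers, hidden size 4
teacher = [rng.normal(size=(3, 8)) for _ in range(4)]  # 4 layers, hidden size 8
W = rng.normal(size=(4, 8))                            # learnable projection
loss = layer_wise_mse(student, teacher, [1, 3], W)     # map to every 2nd Teacher layer
```

In training, W is learned jointly with the Student, so the projection itself absorbs part of the dimensionality mismatch.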

3.3 FastBERT

FastBERT (Liu et al., 2020) introduces a self‑distillation mechanism in which the Teacher (backbone) and Students (branches) coexist in a single model: each Transformer layer is followed by a lightweight Student classifier. An adaptive inference strategy computes the normalized entropy of each Student’s output and stops early once it falls below a threshold.
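The adaptive early-exit rule can be sketched as follows, assuming each branch classifier outputs a probability distribution over classes; the threshold is an illustrative hyperparameter, and the helper names are not from the FastBERT codebase:

```python
import numpy as np

def normalized_entropy(p):
    """Entropy of a probability vector, scaled to [0, 1] by log(num_classes)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

def early_exit(branch_probs, threshold=0.3):
    """Return the index of the first branch whose prediction is confident
    enough (low normalized entropy) to stop; fall through to the last branch."""
    for i, p in enumerate(branch_probs):
        if normalized_entropy(p) < threshold:
            return i
    return len(branch_probs) - 1

uncertain = np.array([0.5, 0.5])    # normalized entropy 1.0 -> keep going
confident = np.array([0.99, 0.01])  # normalized entropy ~0.08 -> stop here
print(early_exit([uncertain, confident]))  # 1
```

Easy inputs thus exit after one or two layers while hard inputs traverse the full backbone, which is what yields FastBERT's input-dependent speedup.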

