Survey of Model Compression and Quantization Techniques for Deep Neural Networks
This article surveys model compression and acceleration for deep neural networks: pruning strategies and types, evaluation criteria, sparsity ratios, and fine-tuning procedures, along with linear and non-linear quantization approaches, their implementations, and practical considerations.
1. Overview
Recent advances in deep learning have led to models with ever‑increasing parameters and computational complexity, making deployment on resource‑constrained hardware challenging. Model compression and acceleration aim to reduce parameter count and computational cost while preserving performance.
Compression focuses on decreasing the number of parameters, whereas acceleration targets lowering computational complexity.
Techniques include architectural redesign (e.g., using smaller 3×3 kernels, replacing fully-connected layers with average pooling, employing depth-wise convolutions as in MobileNets), as well as pruning, quantization, and knowledge distillation.
Hardware‑level optimizations involve inference frameworks such as TensorRT, TFLite, NCNN, and MNN, and specialized hardware such as GPUs, FPGAs, ASICs, TPUs, and NPUs.
2. Pruning
2.1 Pruning Process
Deep neural networks contain many redundant parameters; pruning removes unimportant weights, neurons, or layers to reduce model size and inference cost, analogous to a gardener trimming a dense plant.
The typical workflow is:
Train a high‑performance original model.
Assess the importance of each parameter.
Remove parameters with low importance.
Fine‑tune on the training set to recover accuracy.
Check whether size, speed, and accuracy meet requirements; repeat if necessary.
2.2 Pruning Types
Pruning can be categorized by the basic operation unit:
Unstructured pruning: removes individual weight elements, resulting in sparse matrices.
Structured pruning: removes whole filters or channels, preserving dense matrix structure and enabling efficient execution on existing hardware.
2.2.1 Unstructured Pruning
Weights with the smallest absolute values are set to zero based on a global ranking.
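The global ranking described above can be sketched as follows. This is a minimal NumPy illustration (function and variable names are my own, not from the original article): all weights are pooled, a threshold is taken at the desired sparsity, and everything at or below it is zeroed.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest
    absolute values, ranked globally across all tensors."""
    flat = np.concatenate([w.ravel() for w in weights])
    k = int(sparsity * flat.size)
    if k == 0:
        return [w.copy() for w in weights]
    # Global threshold: the k-th smallest absolute value.
    threshold = np.sort(np.abs(flat))[k - 1]
    # Keep only weights strictly above the threshold.
    return [np.where(np.abs(w) > threshold, w, 0.0) for w in weights]

# Two toy layers; pruning at 50% sparsity zeroes the four smallest magnitudes.
layers = [np.array([0.1, -2.0, 0.05, 3.0]), np.array([-0.2, 1.5, 0.01, -0.3])]
pruned = magnitude_prune(layers, 0.5)
```

In practice the resulting sparse matrices only speed up inference on hardware or kernels with dedicated sparse support, which is the main drawback of unstructured pruning.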
2.2.2 Structured Pruning – Filter‑wise
Entire convolutional kernels (filters) are removed, which also reduces the corresponding feature‑map channels in the next layer.
2.2.3 Structured Pruning – Channel‑wise
Channels are pruned by leveraging the batch-norm scaling factors (the learnable gamma multiplying each normalized channel); channels whose scaling factors are small contribute little to the output and are considered less important.
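A minimal sketch of this selection step, assuming the per-channel gammas have already been extracted from a trained network (names here are illustrative):

```python
import numpy as np

def select_channels(bn_gamma, prune_ratio):
    """Return indices of channels to KEEP, judged by the magnitude of
    their batch-norm scaling factors (small gamma -> less important)."""
    n_prune = int(prune_ratio * bn_gamma.size)
    order = np.argsort(np.abs(bn_gamma))   # ascending importance
    keep = np.sort(order[n_prune:])        # drop the smallest gammas
    return keep

gamma = np.array([0.9, 0.02, 0.6, 0.001, 0.4, 0.05])
keep = select_channels(gamma, prune_ratio=0.5)   # prune 3 of 6 channels
```

The kept indices are then used to slice the convolution weights of both the pruned layer and the layer that consumes its output.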
2.2.4 Structured Pruning – Shape‑wise
The granularity is finer than filter- or channel-wise pruning: the same positions are pruned within every kernel of a layer, so all kernels in that layer share a common sparse shape.
2.2.5 Structured Pruning – Stripe‑wise (SWP)
Each 3×3×C kernel is decomposed into nine 1×1×C stripes, which are pruned according to importance scores learned by a Filter Skeleton (FS) module.
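The decomposition can be illustrated with a simple NumPy sketch that scores each stripe by its L1 norm (a stand-in for the learned FS scores; function names are my own):

```python
import numpy as np

def stripe_importance(filters):
    """L1 importance of each 1x1xC stripe in a bank of KxCx3x3 filters.
    Returns shape (K, 3, 3): one score per spatial position per filter."""
    # Sum absolute values over the channel axis (axis=1).
    return np.abs(filters).sum(axis=1)

rng = np.random.default_rng(0)
filters = rng.normal(size=(4, 8, 3, 3))   # 4 filters, 8 input channels
scores = stripe_importance(filters)
mask = scores > np.median(scores)         # keep the more important half
```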
2.3 Pruning Evaluation Criteria
Commonly a greedy approach ranks importance scores (e.g., weight magnitude, sum of absolute values) and removes a proportion of parameters. Regularization techniques such as L1 or group‑lasso are often added to encourage sparsity.
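One concrete way an L1 penalty drives weights toward exact zeros is its proximal step, soft-thresholding; this is an illustrative sketch, not the article's specific training recipe:

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal step for an L1 penalty: shrink each weight toward zero
    by lam, and clamp values inside [-lam, lam] to exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.8, -0.05, 0.3, -0.9, 0.02])
shrunk = soft_threshold(w, lam=0.1)   # small weights become exact zeros
```

Applying this after each gradient step yields exact zeros rather than merely small values, which makes the subsequent importance ranking cleaner.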
2.4 Sparsity Ratio / Pruning Rate
Sparsity can be predefined globally or locally per layer, or adaptively determined during training.
2.5 Fine‑Tuning
Since pruning alters the network structure, fine‑tuning is required to recover lost accuracy, often alternating pruning and fine‑tuning steps.
3. Quantization
3.1 Basic Principles
Quantization maps high‑precision floating‑point values to lower‑bit fixed‑point representations. Linear quantization (most common in industry) uses a scale (S) and zero‑point (Z) to convert between float and integer domains.
3.1.1 Linear Quantization
Formulas: Q = round(R/S) + Z and R = S·(Q − Z), where R is the float value and Q its integer counterpart. The scale S = (R_max − R_min)/(Q_max − Q_min) is derived from the min/max of the floating-point tensor and the target integer range, and the zero-point Z is the integer to which float zero maps.
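The formulas above translate directly into code. This is a minimal NumPy sketch for an unsigned 8-bit target range (names are illustrative):

```python
import numpy as np

def linear_quant_params(r_min, r_max, q_min=0, q_max=255):
    """Scale and zero-point for asymmetric linear quantization.
    The float range [r_min, r_max] maps onto integers [q_min, q_max]."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min=0, q_max=255):
    # Q = round(R / S) + Z, clipped to the integer range.
    return np.clip(np.round(r / scale) + zero_point, q_min, q_max).astype(np.int32)

def dequantize(q, scale, zero_point):
    # R = S * (Q - Z)
    return scale * (q.astype(np.float64) - zero_point)

r = np.array([-1.0, 0.0, 0.5, 2.0])
scale, zp = linear_quant_params(r.min(), r.max())
q = quantize(r, scale, zp)
r_hat = dequantize(q, scale, zp)
```

The round-trip error is bounded by half a quantization step (S/2), which is why wide or outlier-heavy float ranges degrade accuracy.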
3.1.2 Non‑Linear Quantization
Non‑linear mappings allocate more quantization levels to important weight ranges, often using clustering (e.g., K‑means) or piecewise functions.
3.2 Quantization Methods
3.2.1 Clustering Quantization
Weights are clustered into k centroids (e.g., -1, 0, 1, 2) and each weight is replaced by its nearest centroid.
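A small Lloyd-style 1-D k-means sketch of this idea (assumed implementation, not the article's exact one): weights share a compact codebook, and the model stores only indices plus the codebook.

```python
import numpy as np

def cluster_quantize(w, centroids, iters=10):
    """1-D k-means: assign each weight to its nearest centroid, then
    move each centroid to the mean of its assigned weights."""
    c = np.asarray(centroids, dtype=float)
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - c[None, :]), axis=1)
        for j in range(c.size):
            if np.any(idx == j):
                c[j] = w[idx == j].mean()   # move centroid to cluster mean
    return idx, c

w = np.array([-1.1, -0.9, 0.05, -0.02, 1.2, 0.95, 2.1, 1.9])
idx, codebook = cluster_quantize(w, centroids=[-1.0, 0.0, 1.0, 2.0])
w_hat = codebook[idx]   # reconstructed (quantized) weights
```

With k centroids, each weight needs only log2(k) index bits, which is the storage saving exploited by Deep Compression.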
3.2.2 Power‑of‑Two Quantization
Weights are rounded to the nearest power‑of‑two, enabling shift‑based multiplication.
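A sketch of the rounding step, taking the nearest integer exponent in log space (illustrative names; real schemes also bound the exponent range):

```python
import numpy as np

def pow2_quantize(w, eps=1e-12):
    """Round each weight to the nearest power of two (keeping its sign),
    so multiplication can be replaced by a bit shift."""
    mag = np.maximum(np.abs(w), eps)     # avoid log2(0)
    exp = np.round(np.log2(mag))         # nearest integer exponent
    return np.sign(w) * np.exp2(exp)

w = np.array([0.3, -0.7, 1.6, -0.12])
q = pow2_quantize(w)
```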
3.2.3 Binary Quantization (1‑bit)
Weights are binarized using a sign function or stochastic rounding; gradients are approximated with a straight‑through estimator.
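The deterministic variant and its straight-through estimator can be sketched as plain functions (the hard-tanh clipping window used here is one common STE choice, not the only one):

```python
import numpy as np

def binarize(w):
    """Deterministic sign binarization; zeros map to +1."""
    return np.where(w >= 0, 1.0, -1.0)

def ste_grad(w, upstream_grad):
    """Straight-through estimator: the sign function has zero gradient
    almost everywhere, so pass the upstream gradient through unchanged
    where |w| <= 1 and zero it elsewhere (hard-tanh surrogate)."""
    return upstream_grad * (np.abs(w) <= 1.0)

w = np.array([0.3, -0.8, 1.7, -0.1])
b = binarize(w)
g = ste_grad(w, upstream_grad=np.ones_like(w))
```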
3.2.4 8‑bit Quantization
Both symmetric (range [−128, 127], zero-point fixed at 0) and asymmetric ([0, 255], non-zero zero-point) schemes are widely supported (e.g., TensorFlow, TensorRT). Symmetric quantization can waste levels or clip outliers when the value distribution is skewed around zero; asymmetric quantization shifts the range with its zero-point to fit the data.
3.3 Post‑Training Quantization (PTQ) vs. Quantization‑Aware Training (QAT)
PTQ calibrates scale and zero‑point using a small calibration dataset, optionally applying KL‑divergence to select optimal ranges. QAT inserts fake‑quantization ops during training, using the straight‑through estimator to back‑propagate gradients through quantization.
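The fake-quantization op at the heart of QAT is a quantize-then-dequantize round trip; a minimal sketch (the scale and zero-point here are assumed to come from calibration):

```python
import numpy as np

def fake_quantize(x, scale, zero_point, q_min=0, q_max=255):
    """Quantize-then-dequantize in one step: the forward pass sees the
    rounding and clipping error that real int8 inference will introduce,
    while the backward pass (via a straight-through estimator) treats
    this op as identity inside the clipping range."""
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max)
    return scale * (q - zero_point)

x = np.array([-1.0, 0.013, 0.5, 3.0])
scale, zp = 0.02, 50                 # representable float range: [-1.0, 4.1]
x_fq = fake_quantize(x, scale, zp)
```

Because the network trains against these perturbed activations and weights, it learns parameters that remain accurate once the fake ops are replaced by real integer arithmetic.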
3.4 Fine‑Tuning after Quantization
When quantization induces noticeable accuracy loss, fine‑tuning (or QAT) can restore performance, often achieving <5% degradation for 8‑bit models and even acceptable results for 4‑bit models.
4. Summary
Model compression and acceleration remain active research areas. Pruning and quantization provide complementary ways to obtain lightweight, high‑accuracy, fast‑inference models. Selecting appropriate techniques, sparsity levels, and fine‑tuning strategies is crucial for successful deployment.
5. References
https://jinzhuojun.blog.csdn.net/article/details/100621397
https://cs.nju.edu.cn/wujx/paper/Pruning_Survey_MLA21.pdf
https://blog.csdn.net/weixin_49457347/article/details/117110458
https://zhuanlan.zhihu.com/p/138059904
https://blog.csdn.net/wspba/article/details/75675554
http://fjdu.github.io/machine/learning/2016/07/07/quantize-neural-networks-with-tensorflow.html
https://zhuanlan.zhihu.com/p/45496826
https://zhuanlan.zhihu.com/p/361957385
https://zhuanlan.zhihu.com/p/374374300
https://blog.csdn.net/WZZ18191171661/article/details/103332338
https://zhuanlan.zhihu.com/p/58182172
https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
https://developer.download.nvidia.cn/video/gputechconf/gtc/2020/presentations/s21664-toward-int8-inference-deploying-quantization-aware-trained-networks-using-tensorrt.pdf
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
Learning Structured Sparsity in Deep Neural Networks
Learning Efficient Convolutional Networks through Network Slimming
Accelerating Convolutional Neural Networks by Group‑wise 2D‑filter Pruning
Pruning Filters for Efficient ConvNets
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper
Data‑Free Quantization Through Weight Equalization and Bias Correction
Pruning Filter in Filter
Author: Li Xinke
Laiye Technology Team
Official account of Laiye Technology, featuring its best tech innovations, practical implementations, and cutting‑edge industry insights.