Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques
This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.
The session, presented by senior engineer Li Zhengxing from Tencent and organized by DataFunTalk, focuses on the practice of accelerating deep model inference. It introduces two real‑world projects, a text‑based Question Answering (QA) system and a speech‑based QA system used in games, describing their architectures, challenges, and performance results.
Case 1: Text QA
The text QA engine receives user‑entered text, performs intent detection, and returns the appropriate answer. Its processing pipeline includes text input, preprocessing (error correction, punctuation removal), tokenization & entity recognition, recall service to fetch a candidate list, and a ranking service that selects the top result.
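The stages above can be sketched as a chain of service calls. This is a minimal illustration, not the production system; every function name and the toy ranker are assumptions for the sketch:

```python
# Illustrative text QA pipeline: preprocess -> tokenize -> recall -> rank.
# All names are hypothetical stand-ins for the real services.

def preprocess(text):
    # Stand-in for error correction and punctuation removal:
    # here we simply strip punctuation.
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace()).strip()

def tokenize(text):
    # Stand-in for tokenization & entity recognition.
    return text.split()

def recall(tokens, index):
    # Fetch candidate answers for every token that hits the index.
    candidates = []
    for tok in tokens:
        candidates.extend(index.get(tok, []))
    return candidates

def rank(candidates, query_tokens):
    # Toy ranker: pick the candidate sharing the most tokens with the query.
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(set(c.split()) & set(query_tokens)))

def answer(text, index):
    tokens = tokenize(preprocess(text))
    return rank(recall(tokens, index), tokens)
```

In production, `rank` would be the BERT/XLM‑R ranking model and `recall` a retrieval service; the skeleton only shows how the stages compose.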
Challenges: large models (e.g., BERT, XLM‑R), high QPS, and the need to load dozens of models simultaneously, stressing memory and GPU VRAM.
Case 2: Speech QA
In the speech QA scenario (e.g., in the game “Peace Elite”), players speak to an intelligent NPC. The pipeline consists of voice capture, preprocessing, vectorization using Facebook’s wav2vec model, a recall service, and a rejection‑identification service that filters out‑of‑scope queries.
Challenges: very large models, extremely high QPS due to real‑time voice streams from many players.
Optimization Solutions
1. Industry Methods : TensorFlow Serving (TF‑Serving) and LibTorch offer broad hardware and model compatibility but little inference‑time performance optimization; TensorRT delivers better GPU performance for supported models but covers fewer model architectures and carries higher integration cost.
2. Model Decomposition : Visualizations of the MQA (BERT‑based) and speech QA (wav2vec‑based) architectures show that both rely on a 12‑layer transformer encoder, which becomes the performance bottleneck.
3. Model Compression : Reduce the 12‑layer transformer encoder to 3 layers via knowledge distillation (TinyBERT‑style self‑distillation). The loss function combines embedding, attention, hidden‑state, and logits losses. The compressed model achieves ~80 ms latency at ~78% accuracy.
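The combined distillation objective can be sketched as a weighted sum of MSE terms on the intermediate representations plus a soft‑target loss on the logits. The weights, temperature, and dictionary layout below are illustrative assumptions; the 12‑to‑3 layer mapping is assumed to happen upstream:

```python
import math

def mse(a, b):
    # Mean squared error between two equally sized flattened tensors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(logits, temperature=1.0):
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(teacher_logits, student_logits, temperature=2.0):
    # Soft-target loss: teacher's softened distribution vs the student's.
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(t, s))

def distill_loss(teacher, student, w=(1.0, 1.0, 1.0, 1.0)):
    # TinyBERT-style combined loss: embedding + attention + hidden-state
    # MSE terms plus soft cross-entropy on the logits. `teacher`/`student`
    # are dicts of flattened tensors (hypothetical layout for this sketch).
    return (w[0] * mse(teacher["emb"], student["emb"])
            + w[1] * mse(teacher["attn"], student["attn"])
            + w[2] * mse(teacher["hidden"], student["hidden"])
            + w[3] * soft_cross_entropy(teacher["logits"], student["logits"]))
```

With identical teacher and student activations, only the soft‑target term remains, which is the behavior the MSE terms are meant to drive toward during training.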
4. Multi‑Operator Fusion : Merge multiple small kernels (e.g., the three embedding look‑ups in BERT, or conv1d + GELU in speech QA) into single kernels, cutting kernel‑launch overhead and improving GPU utilization. FasterTransformer‑style optimizations further reduce the kernel count from 60 to 14.
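The embedding‑fusion idea can be shown in plain Python. BERT sums token, position, and segment embeddings; done naively that is three lookup passes plus adds, while the fused version touches each output element once. The tables and shapes below are toy assumptions:

```python
# Unfused vs fused embedding lookup. Tables are plain lists of vectors;
# on a GPU, each separate pass corresponds to an extra kernel launch and
# a round trip through global memory for the intermediate results.

def embed_unfused(token_ids, seg_ids, tok_tab, pos_tab, seg_tab):
    # Three separate passes, mimicking three kernels plus elementwise adds.
    tok = [tok_tab[t] for t in token_ids]
    pos = [pos_tab[i] for i in range(len(token_ids))]
    seg = [seg_tab[s] for s in seg_ids]
    return [[a + b + c for a, b, c in zip(x, y, z)]
            for x, y, z in zip(tok, pos, seg)]

def embed_fused(token_ids, seg_ids, tok_tab, pos_tab, seg_tab):
    # One pass: each output element reads all three tables and sums in
    # place, avoiding intermediate buffers and extra launches.
    dim = len(tok_tab[0])
    return [[tok_tab[t][d] + pos_tab[i][d] + seg_tab[s][d] for d in range(dim)]
            for i, (t, s) in enumerate(zip(token_ids, seg_ids))]
```

Both functions produce identical results; only the number of passes over memory differs, which is exactly what fusion buys on a GPU.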
5. Matrix Multiplication Optimization : Use cuBLAS, selecting the GEMM algorithm per GPU model; handle row‑major vs. column‑major storage via transposition identities to achieve optimal GEMM performance.
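The storage‑order trick rests on a simple identity: a column‑major GEMM reading a row‑major buffer effectively sees the transpose, so a row‑major C = A·B can be obtained by asking the column‑major routine for B·A with swapped dimensions, with no transpose kernels at all. A pure‑Python model of this (the reference GEMM stands in for a cuBLAS call):

```python
# Row-major/column-major trick used with cuBLAS-style (column-major) GEMMs.

def gemm_colmajor(m, n, k, a, b):
    # Reference column-major GEMM: a is m x k, b is k x n, both flat
    # column-major buffers; returns m x n column-major. Stand-in for cuBLAS.
    c = [0.0] * (m * n)
    for j in range(n):
        for i in range(m):
            c[j * m + i] = sum(a[p * m + i] * b[j * k + p] for p in range(k))
    return c

def matmul_rowmajor(m, n, k, a, b):
    # Row-major C = A @ B without any transposes: call the column-major
    # routine as if computing B @ A (n x k times k x m), passing the
    # row-major buffers unchanged. The flat result is C in row-major order,
    # because (A B)^T = B^T A^T.
    return gemm_colmajor(n, m, k, b, a)
```

This is why inference engines can feed row‑major tensors straight into column‑major GEMM libraries just by swapping the operand order and dimensions.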
6. Quantization : Apply FP16 (or INT8) quantization to halve memory bandwidth and improve compute speed. FP16 is sufficient for the production environment; half‑precision kernels (half2) are leveraged for higher throughput.
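What FP16 buys can be seen with the standard library alone: Python's `struct` module packs IEEE 754 half precision with the `'e'` format, so the bandwidth halving and the precision cost are easy to demonstrate (the weight values here are arbitrary):

```python
import struct

# FP16 vs FP32 storage: 2 bytes per value instead of 4, at ~3 decimal
# digits of precision. struct's 'e' format is IEEE 754 half precision.

def to_fp16_bytes(values):
    # Pack floats as half precision: halves the bytes moved per value.
    return struct.pack(f"<{len(values)}e", *values)

def from_fp16_bytes(buf):
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))

weights = [0.1234567, -3.25, 1.0, 1e-4]
buf16 = to_fp16_bytes(weights)
buf32 = struct.pack(f"<{len(weights)}f", *weights)
# buf16 is half the size of buf32; values like 1.0 and -3.25 are exactly
# representable, while 0.1234567 survives only to ~1e-4 absolute error.
```

On the GPU the same halving applies to VRAM and memory bandwidth, and `half2` kernels process two FP16 values per register for additional throughput.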
7. Dynamic Batching (Service‑Level Optimization) : Experiments show that processing multiple requests in a single batch, either by request count or by input length, significantly improves throughput while keeping latency within acceptable limits. The QA server is re‑architected to batch recall results, sort them, and dispatch batched tasks to worker processes.
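The two batching policies mentioned above, by request count or by input length, can be sketched as a small accumulator that flushes when either cap is hit. The class name, API, and thresholds are illustrative, not the production server's:

```python
# Sketch of dynamic batching: pending requests are dispatched as one batch
# when either a request-count cap or a total-input-length cap is reached.

class DynamicBatcher:
    def __init__(self, max_requests=8, max_total_len=256):
        self.max_requests = max_requests
        self.max_total_len = max_total_len
        self.pending = []        # list of (request_id, input_length)
        self.total_len = 0

    def submit(self, request_id, input_length):
        # Returns a batch (list of request ids) when a cap is hit, else None.
        self.pending.append((request_id, input_length))
        self.total_len += input_length
        if (len(self.pending) >= self.max_requests
                or self.total_len >= self.max_total_len):
            return self.flush()
        return None

    def flush(self):
        # Dispatch everything pending as one batch, e.g. to a worker process.
        batch = [rid for rid, _ in self.pending]
        self.pending, self.total_len = [], 0
        return batch
```

A production server would typically also sort pending requests by input length before batching to minimize padding waste, and add a timeout flush so a lone request never waits indefinitely.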
Results and Conclusion
Experimental results demonstrate that model distillation, quantization, and dynamic batching provide the most noticeable performance gains. The combined optimizations enable the deployment of multiple large models in a shared GPU environment while meeting real‑time latency requirements.
The presenter thanks the audience and invites further discussion.