Model Compression and Deployment of Pre‑trained Language Models at Meituan
This article presents Meituan's practical experience with compressing large pre‑trained language models—covering challenges of large‑model deployment, compression techniques such as knowledge distillation, pruning and quantization, the AutoDisc assistant‑model approach, multi‑teacher and iterative distillation, and real‑world applications in search advertising, intelligent assistants, and dual‑tower semantic matching.
The talk begins by outlining the rapid growth of pre‑trained language model sizes since BERT and the resulting challenges for real‑time NLP services at Meituan, including search recommendation, intelligent customer service, content moderation, and B2B FAQ matching.
Meituan's NLP pipeline processes user queries through intent detection, category prediction, syntactic analysis, and semantic matching, often relying on large models that deliver high accuracy but incur prohibitive inference latency.
To address latency, three main compression techniques are reviewed: knowledge distillation (teacher‑student learning), model pruning (removing less important attention heads or feed‑forward units), and quantization (reducing precision to int8/int4 for edge deployment).
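Of the three techniques, knowledge distillation is the one the rest of the talk builds on. Its core mechanism is training the student against the teacher's temperature-softened output distribution. The sketch below illustrates that loss in plain Python; the function names and the temperature value are illustrative, not Meituan's implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T yields a softer distribution."""
    z = [v / temperature for v in logits]
    m = max(z)  # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student distributions.

    The T^2 factor keeps gradient magnitudes comparable across
    temperatures, as in the standard distillation formulation.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# A student that matches the teacher exactly incurs zero loss;
# any mismatch yields a positive penalty.
t = [3.0, 1.0, 0.2]
print(round(distillation_loss(t, t), 6))          # → 0.0
print(distillation_loss(t, [0.2, 1.0, 3.0]) > 0)  # → True
```

Pruning and quantization then compress the distilled student further, trading a small amount of accuracy for memory and latency.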
For high compression ratios, Meituan proposes the AutoDisc method, which automatically searches for an optimal “assistant” model that balances parameter reduction with performance retention, using a shared‑parameter architecture and optimized down‑sampling.
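The search step can be pictured as follows. This is a hypothetical sketch, not AutoDisc itself: the `evaluate` callback and the toy scoring rule stand in for the cheap shared-parameter distillation runs the method uses to score each candidate assistant size.

```python
def search_assistant(teacher_size, student_size, candidate_sizes, evaluate):
    """Pick the assistant size that maximizes a caller-supplied score.

    `evaluate(size)` stands in for a lightweight distillation run that
    returns the final student's dev-set score; in AutoDisc this step is
    made cheap by sharing parameters among candidate assistants.
    """
    best_size, best_score = None, float("-inf")
    for size in candidate_sizes:
        if not (student_size < size < teacher_size):
            continue  # an assistant must sit strictly between the two
        score = evaluate(size)
        if score > best_score:
            best_size, best_score = size, score
    return best_size, best_score

# Toy evaluation (purely illustrative): assume quality peaks when the
# assistant is near the geometric mean of teacher and student sizes.
def toy_eval(size, teacher=24, student=4):
    ideal = (teacher * student) ** 0.5
    return -abs(size - ideal)

best, _ = search_assistant(24, 4, [6, 8, 10, 12, 16, 20], toy_eval)
print(best)  # → 10
```

The point of the automation is that the best assistant size is task-dependent, so hand-picking one intermediate model rarely matches a search over candidates.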
Experiments on the GLUE benchmark show that both manual and AutoDisc‑derived assistants outperform naïve single‑step distillation, with AutoDisc achieving the best trade‑off.
Multi‑teacher strategies and iterative distillation further improve performance: multiple assistant models capture diverse semantic information, and iterative distillation leverages soft labels from unsupervised data to pre‑train a compact student model before fine‑tuning on downstream tasks.
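One way to combine several teachers (or assistants) is to average their softened output distributions into a single distillation target, which is what the minimal sketch below does. The weighting scheme and function names are illustrative assumptions, not the talk's exact formulation.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]

def multi_teacher_soft_labels(teacher_logits_list, weights=None):
    """Average per-teacher soft labels into one distillation target.

    Each teacher contributes its own distribution; combining them exposes
    the student to more diverse semantic signals than any single teacher.
    Optional `weights` (summing to 1) can favor stronger teachers.
    """
    n = len(teacher_logits_list)
    weights = weights or [1.0 / n] * n
    dists = [softmax(logits) for logits in teacher_logits_list]
    return [sum(w * d[i] for w, d in zip(weights, dists))
            for i in range(len(dists[0]))]

target = multi_teacher_soft_labels([[2.0, 0.5, 0.1], [1.5, 1.0, 0.2]])
print(round(sum(target), 6))  # → 1.0 (still a valid probability distribution)
```

In the iterative variant, such soft labels are produced over large unlabeled corpora, so the compact student absorbs the teachers' knowledge before it ever sees task-specific labels.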
These techniques have been deployed in several Meituan services: a distilled BERT‑Medium model (≈20M parameters) improves search ad relevance and raises daily ad revenue by 2.7%; a dual‑tower model with virtual interaction (VIRT) boosts intelligent‑assistant answer quality, handling over 5,000 additional queries per day.
In semantic matching scenarios, a 300‑million‑parameter model is distilled to a 20‑million‑parameter dual‑tower model, preserving 96.2% of offline performance while achieving a 56× inference speedup, leading to more accurate search results for users.
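The dual-tower speedup comes from its structure: candidate embeddings are computed offline once, so only the query tower runs at serving time, and matching reduces to a vector similarity lookup. The sketch below shows that flow with a deliberately trivial stand-in encoder (a hashed bag of characters); the real towers are the distilled Transformer encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def encode(text):
    """Hypothetical toy encoder: hashed bag-of-characters into 8 dims."""
    vec = [0.0] * 8
    for ch in text:
        vec[ord(ch) % 8] += 1.0
    return vec

# Offline: embed all candidates once and cache them. This precomputation
# is where the dual-tower design's large inference speedup comes from.
corpus = ["takeout delivery", "hotel booking", "movie tickets"]
doc_vecs = [encode(d) for d in corpus]

def match(query):
    q = encode(query)  # the only encoding done at serving time
    scores = [cosine(q, d) for d in doc_vecs]
    return corpus[scores.index(max(scores))]

print(match("takeout deliver"))  # → takeout delivery
```

Because the two towers never attend to each other, techniques like VIRT inject "virtual" cross-tower interaction during training to recover some of the accuracy that full cross-attention models get, without paying their online cost.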
The presentation concludes with a summary of the practical impact of model compression on Meituan's AI services and an invitation to follow the DataFunTalk community for further AI and big‑data insights.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.