Vivo's Self‑Developed Streaming Speech‑Recognition Inference Engine and KunlunChip High‑Performance Inference Library
This article details vivo's high‑accuracy, high‑performance streaming speech‑recognition inference engine built on the wenet framework, its optimization techniques such as dynamic batching and memory pooling, its collaborative acceleration with KunlunChip's high‑performance inference library, and performance benchmarks demonstrating multi‑batch gains on both GPU and XPU.
Speech recognition is a core AI service for vivo, underpinning applications such as the Jovi input method and voice assistant. To provide billions of users with a smooth experience, vivo built a high‑accuracy, high‑performance engine based on the open‑source wenet end‑to‑end toolkit, deeply optimizing it for both offline and streaming scenarios.
As user volume grew, CPU‑based inference showed high latency (TP99) and could not meet demanding workloads. The vivo AI Engineering Center therefore created a streaming inference engine that supports dynamic batching, memory‑pooling, and bucket‑sorted data dispatch, and runs on both CPU and GPU, achieving notable acceleration on GPU.
KunlunChip Technology, with over a decade of AI accelerator experience, supplies general‑purpose AI chips and a software stack. In collaboration with vivo, they launched an AI multi‑compute project, focusing first on speech‑recognition and achieving early breakthroughs.
Vivo Self‑Developed Streaming Inference Engine
The engine consists of four parts:
Wenet decoder pipeline: front‑end processing (features, VAD) → encoder → language model (WFST) → decoder.
Data scheduling: dynamic batching and bucket sorting.
Runtime adaptation layer: abstracts model inference interfaces for different back‑ends.
Backend inference layer: supports onnxruntime, GPU, and Kunlun XpuRT.
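The layered design above can be sketched in a few lines of Python. This is purely illustrative: the class and function names (`InferenceBackend`, `run_pipeline`, etc.) are invented for this sketch and are not vivo's actual API, and the front‑end, language model, and decoder stages are stubbed out.

```python
from abc import ABC, abstractmethod

class InferenceBackend(ABC):
    """Runtime adaptation layer: one interface, many back-ends
    (onnxruntime / GPU / Kunlun XpuRT in the article)."""

    @abstractmethod
    def encode(self, features):
        ...

    @abstractmethod
    def decode(self, encoder_out):
        ...

class OnnxBackend(InferenceBackend):
    """Stand-in for one concrete back-end of the adaptation layer."""

    def encode(self, features):
        # Placeholder: a real back-end would run the encoder model here.
        return [f * 2 for f in features]

    def decode(self, encoder_out):
        # Placeholder: WFST rescoring + CTC/attention decoding would go here.
        return "".join(str(x) for x in encoder_out)

def run_pipeline(backend: InferenceBackend, audio_frames):
    # 1. Front-end: feature extraction / VAD (stubbed as identity here).
    features = audio_frames
    # 2. Encoder forward pass on the chosen back-end.
    enc = backend.encode(features)
    # 3-4. Language model (WFST) + decoder, folded into decode() in this sketch.
    return backend.decode(enc)
```

The point of the abstraction is that the decoder pipeline and data scheduler never touch a back‑end directly; swapping onnxruntime for XpuRT means implementing the same two methods.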
Key engine features include:
Support for multi‑batch streaming requests, fully exploiting hardware parallelism.
Dynamic batching that automatically assembles batches within a short time window to boost throughput.
Bucket sorting to reduce padding waste across batches.
Memory‑pool for GPU buffers, lowering allocation overhead under high concurrency.
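To make the first three features concrete, here is a minimal sketch of a time‑windowed dynamic batcher with length‑based bucket sorting. All names and the window/batch parameters are assumptions for illustration; the production engine implements this in its scheduling layer, not in Python.

```python
import threading
import time
from collections import deque

class DynamicBatcher:
    """Collects requests for up to `window_ms`, then emits a batch.
    Requests are sorted by sequence length so that padding-to-max
    inside a batch wastes less compute (bucket sorting).
    Illustrative sketch, not vivo's actual scheduler API."""

    def __init__(self, window_ms=10, max_batch=16):
        self.window = window_ms / 1000.0
        self.max_batch = max_batch
        self.queue = deque()
        self.lock = threading.Lock()

    def submit(self, request):
        # Called by request threads; requests accumulate until dispatch.
        with self.lock:
            self.queue.append(request)

    def next_batch(self):
        # Wait until the batch fills up or the time window expires,
        # whichever comes first.
        deadline = time.monotonic() + self.window
        while time.monotonic() < deadline:
            with self.lock:
                if len(self.queue) >= self.max_batch:
                    break
            time.sleep(0.001)
        with self.lock:
            batch = [self.queue.popleft()
                     for _ in range(min(self.max_batch, len(self.queue)))]
        # Bucket sorting: group similar-length sequences together.
        batch.sort(key=len)
        return batch
```

The trade‑off is latency versus throughput: a longer window assembles fuller batches but delays the first frame of every request in it.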
Language‑model (WFST) optimizations:
Pooling of small objects (ForwardLink, BackpointerToken) reduces per‑search time from 14 ms to 5 ms.
Thread model refined from one pthread per session to bthread coroutines, decreasing system load in the GPU deployment.
AsrDecoder object pooling improves CPU performance by 27%.
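The two pooling optimizations above (small search objects and whole AsrDecoder instances) share one idea: reuse instead of reallocate. A minimal sketch, assuming a generic pool rather than the engine's actual C++ implementation:

```python
class ObjectPool:
    """Recycles expensive-to-construct objects (e.g. decoder tokens or
    whole decoder instances) instead of allocating per request.
    Illustrative only; the engine's pools are implemented in C++."""

    def __init__(self, factory, size):
        self._factory = factory
        # Pre-allocate `size` objects up front.
        self._free = [factory() for _ in range(size)]

    def acquire(self):
        # Reuse a pooled object if one is free, else fall back to allocation.
        return self._free.pop() if self._free else self._factory()

    def release(self, obj):
        # A real pool would reset the object's state here before reuse.
        self._free.append(obj)
```

For short‑lived objects created millions of times per search (such as ForwardLink), removing the allocator from the hot path is what drives the 14 ms → 5 ms improvement reported above.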
KunlunChip High‑Performance Inference Library Solution
The biggest technical challenge for AI inference engines is simultaneously meeting rapid product iteration and high hardware performance. KunlunChip offered two solutions—graph‑compiler and high‑performance library—and vivo chose the latter. The library implements large operators (Encoder and Decoder) built on KunlunChip’s high‑performance kernel API.
Library highlights:
Dynamic‑shape support with no performance loss compared to static shapes, saving valuable device memory.
Multi‑batch streaming inference with efficient cache management.
Deep graph optimizations such as ffn_kernel_fusion and attention_fusion, plus variable‑length optimizations.
Quantization strategies: FP16, INT8 (dynamic/static), and mixed‑precision.
Custom operator fusion (e.g., RelPos fusion in ConformerEncoder).
Automation tools for one‑click model import.
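Of the quantization modes listed, dynamic INT8 is the simplest to illustrate: the scale is computed at runtime from the tensor's observed range rather than calibrated offline. The sketch below shows the principle only and is not KunlunChip's kernel implementation.

```python
def quantize_int8_dynamic(values):
    """Per-tensor dynamic INT8 quantization: pick the scale at runtime
    from the tensor's own max magnitude, then round to [-127, 127]."""
    amax = max(abs(v) for v in values) or 1.0  # guard against all-zero input
    scale = amax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate float values."""
    return [v * scale for v in q]
```

Static INT8 differs only in where the scale comes from (a calibration pass over representative data), and mixed precision keeps numerically sensitive layers such as softmax in FP16 while quantizing the large matrix multiplies.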
Performance Tests
Hardware configuration:

Hardware    Configuration
CPU         2 × Intel(R) Xeon(R) Gold 6330 @ 2.00 GHz
Memory      512 GB
GPU         165 W GPU
XPU         1 × KunlunChip R200 AI accelerator
Maximum concurrency (routes):

Solution             Concurrent routes
onnxruntime (CPU)    350
GPU (165 W)          1 700
KunlunChip R200      1 400
Latency (first‑word / last‑word) for FP16 quantized runs shows that all back‑ends maintain lossless accuracy. The KunlunChip library reaches 1 400 concurrent streams on a single card—about four times the CPU baseline—while dramatically reducing first‑word and last‑word latency. Multi‑card tests (4 × GPU or 4 × R200) achieve up to 4 000 concurrent streams.
Beyond matching mainstream 165 W GPUs, KunlunChip's high‑performance inference library also provides convenient tools for model fusion, quantization, and pruning tailored to specific business needs.
KunlunChip Support in wenet
wenet is China’s largest open‑source speech community. KunlunChip’s second‑generation AI chip is the first heterogeneous AI inference chip supported by wenet, with source code merged into the mainline. Future work will deliver both a graph engine and a high‑performance library backend, enabling multi‑batch streaming decoding and end‑to‑end deployment solutions.
KunlunChip will continue to leverage its leading position in the inference ecosystem to improve voice‑service user experience and collaborate closely with the community.
About the vivo AI Research Institute
Founded in 2017, the institute focuses on foundational AI research and application in computer vision, speech, NLP, machine learning, deep learning, and reinforcement learning, aiming to deliver ubiquitous AI‑driven conveniences to users.
Reference Links
vivo official site: https://www.vivo.com/
wenet project: https://github.com/wenet-e2e/wenet
KunlunChip official site: https://www.kunlunxin.com.cn/
KunlunChip XPU support in wenet: https://github.com/wenet-e2e/wenet/tree/main/runtime/kunlun
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.