Vivo's Self‑Developed Streaming Speech‑Recognition Inference Engine and KunlunChip High‑Performance Inference Library
This article details vivo's high‑accuracy, high‑performance streaming speech‑recognition inference engine built on the wenet framework, its optimization techniques such as dynamic batching and memory pooling, its collaborative acceleration with KunlunChip's high‑performance inference library, and performance benchmarks demonstrating multi‑batch gains on both GPU and XPU.
Speech recognition is a core AI service for vivo, underpinning applications such as the Jovi input method and voice assistant. To provide billions of users with a smooth experience, vivo built a high‑accuracy, high‑performance engine based on the open‑source wenet end‑to‑end toolkit, deeply optimizing it for both offline and streaming scenarios.
As user volume grew, CPU‑based inference showed high latency (TP99) and could not meet demanding workloads. The vivo AI Engineering Center therefore created a streaming inference engine that supports dynamic batching, memory‑pooling, and bucket‑sorted data dispatch, and runs on both CPU and GPU, achieving notable acceleration on GPU.
KunlunChip Technology, with over a decade of AI accelerator experience, supplies general‑purpose AI chips and a software stack. In collaboration with vivo, they launched an AI multi‑compute project, focusing first on speech‑recognition and achieving early breakthroughs.
Vivo Self‑Developed Streaming Inference Engine
The engine consists of four parts:
Wenet decoder pipeline: front‑end processing (features, VAD) → encoder → language model (WFST) → decoder.
Data scheduling: dynamic batching and bucket sorting.
Runtime adaptation layer: abstracts model inference interfaces for different back‑ends.
Backend inference layer: supports onnxruntime, GPU, and Kunlun XpuRT.
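The layered design above can be sketched in a few lines of Python. This is purely illustrative: the class and function names (`InferenceBackend`, `run_pipeline`, etc.) are invented for this sketch and are not vivo's actual API, and the front‑end, language model, and decoder stages are stubbed out.

```python
from abc import ABC, abstractmethod

class InferenceBackend(ABC):
    """Runtime adaptation layer: one interface, many back-ends
    (onnxruntime / GPU / Kunlun XpuRT in the article)."""

    @abstractmethod
    def encode(self, features):
        ...

    @abstractmethod
    def decode(self, encoder_out):
        ...

class OnnxBackend(InferenceBackend):
    """Stand-in for one concrete back-end of the adaptation layer."""

    def encode(self, features):
        # Placeholder: a real back-end would run the encoder model here.
        return [f * 2 for f in features]

    def decode(self, encoder_out):
        # Placeholder: WFST rescoring + CTC/attention decoding would go here.
        return "".join(str(x) for x in encoder_out)

def run_pipeline(backend: InferenceBackend, audio_frames):
    # 1. Front-end: feature extraction / VAD (stubbed as identity here).
    features = audio_frames
    # 2. Encoder forward pass on the chosen back-end.
    enc = backend.encode(features)
    # 3-4. Language model (WFST) + decoder, folded into decode() in this sketch.
    return backend.decode(enc)
```

The point of the abstraction is that the decoder pipeline and data scheduler never touch a back‑end directly; swapping onnxruntime for XpuRT means implementing the same two methods.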
Key engine features include:
Support for multi‑batch streaming requests, fully exploiting hardware parallelism.
Dynamic batching that automatically assembles batches within a short time window to boost throughput.
Bucket sorting to reduce padding waste across batches.
Memory‑pool for GPU buffers, lowering allocation overhead under high concurrency.
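To make the first three features concrete, here is a minimal sketch of a time‑windowed dynamic batcher with length‑based bucket sorting. All names and the window/batch parameters are assumptions for illustration; the production engine implements this in its scheduling layer, not in Python.

```python
import threading
import time
from collections import deque

class DynamicBatcher:
    """Collects requests for up to `window_ms`, then emits a batch.
    Requests are sorted by sequence length so that padding-to-max
    inside a batch wastes less compute (bucket sorting).
    Illustrative sketch, not vivo's actual scheduler API."""

    def __init__(self, window_ms=10, max_batch=16):
        self.window = window_ms / 1000.0
        self.max_batch = max_batch
        self.queue = deque()
        self.lock = threading.Lock()

    def submit(self, request):
        # Called by request threads; requests accumulate until dispatch.
        with self.lock:
            self.queue.append(request)

    def next_batch(self):
        # Wait until the batch fills up or the time window expires,
        # whichever comes first.
        deadline = time.monotonic() + self.window
        while time.monotonic() < deadline:
            with self.lock:
                if len(self.queue) >= self.max_batch:
                    break
            time.sleep(0.001)
        with self.lock:
            batch = [self.queue.popleft()
                     for _ in range(min(self.max_batch, len(self.queue)))]
        # Bucket sorting: group similar-length sequences together.
        batch.sort(key=len)
        return batch
```

The trade‑off is latency versus throughput: a longer window assembles fuller batches but delays the first frame of every request in it.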
Language‑model (WFST) optimizations:
Pooling of small objects (ForwardLink, BackpointerToken) reduces per‑search time from 14 ms to 5 ms.
Thread model refined from one pthread per session to bthread coroutines, decreasing system load in the GPU deployment.
AsrDecoder object pooling improves CPU performance by 27%.
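The two pooling optimizations above (small search objects and whole AsrDecoder instances) share one idea: reuse instead of reallocate. A minimal sketch, assuming a generic pool rather than the engine's actual C++ implementation:

```python
class ObjectPool:
    """Recycles expensive-to-construct objects (e.g. decoder tokens or
    whole decoder instances) instead of allocating per request.
    Illustrative only; the engine's pools are implemented in C++."""

    def __init__(self, factory, size):
        self._factory = factory
        # Pre-allocate `size` objects up front.
        self._free = [factory() for _ in range(size)]

    def acquire(self):
        # Reuse a pooled object if one is free, else fall back to allocation.
        return self._free.pop() if self._free else self._factory()

    def release(self, obj):
        # A real pool would reset the object's state here before reuse.
        self._free.append(obj)
```

For short‑lived objects created millions of times per search (such as ForwardLink), removing the allocator from the hot path is what drives the 14 ms → 5 ms improvement reported above.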
KunlunChip High‑Performance Inference Library Solution
The biggest technical challenge for AI inference engines is simultaneously meeting rapid product iteration and high hardware performance. KunlunChip offered two solutions—graph‑compiler and high‑performance library—and vivo chose the latter. The library implements large operators (Encoder and Decoder) built on KunlunChip’s high‑performance kernel API.
Library highlights:
Dynamic‑shape support with no performance loss compared to static shapes, saving valuable device memory.
Multi‑batch streaming inference with efficient cache management.
Deep graph optimizations such as ffn_kernel_fusion and attention_fusion, plus variable‑length optimizations.
Quantization strategies: FP16, INT8 (dynamic/static), and mixed‑precision.
Custom operator fusion (e.g., RelPos fusion in ConformerEncoder).
Automation tools for one‑click model import.
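Of the quantization modes listed, dynamic INT8 is the simplest to illustrate: the scale is computed at runtime from the tensor's observed range rather than calibrated offline. The sketch below shows the principle only and is not KunlunChip's kernel implementation.

```python
def quantize_int8_dynamic(values):
    """Per-tensor dynamic INT8 quantization: pick the scale at runtime
    from the tensor's own max magnitude, then round to [-127, 127]."""
    amax = max(abs(v) for v in values) or 1.0  # guard against all-zero input
    scale = amax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate float values."""
    return [v * scale for v in q]
```

Static INT8 differs only in where the scale comes from (a calibration pass over representative data), and mixed precision keeps numerically sensitive layers such as softmax in FP16 while quantizing the large matrix multiplies.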
Performance Tests
Hardware configuration:

Hardware    Configuration
CPU         2 × Intel(R) Xeon(R) Gold 6330 @ 2.00 GHz
Memory      512 GB
GPU         165 W GPU
XPU         1 × KunlunChip R200 AI accelerator
Maximum concurrency (routes):

Solution             Concurrent routes
onnxruntime (CPU)    350
GPU (165 W)          1 700
KunlunChip R200      1 400
Latency (first‑word / last‑word) for FP16 quantized runs shows that all back‑ends maintain lossless accuracy. The KunlunChip library reaches 1 400 concurrent streams on a single card—about four times the CPU baseline—while dramatically reducing first‑word and last‑word latency. Multi‑card tests (4 × GPU or 4 × R200) achieve up to 4 000 concurrent streams.
Beyond matching mainstream 165 W GPUs, KunlunChip's high‑performance inference library also provides convenient tools for model fusion, quantization, and pruning tailored to specific business needs.
KunlunChip Support in wenet
wenet is China’s largest open‑source speech community. KunlunChip’s second‑generation AI chip is the first heterogeneous AI inference chip supported by wenet, with source code merged into the mainline. Future work will deliver both a graph engine and a high‑performance library backend, enabling multi‑batch streaming decoding and end‑to‑end deployment solutions.
KunlunChip will continue to leverage its leading position in the inference ecosystem to improve voice‑service user experience and collaborate closely with the community.
About the vivo AI Research Institute
Founded in 2017, the institute focuses on foundational AI research and application in computer vision, speech, NLP, machine learning, deep learning, and reinforcement learning, aiming to deliver ubiquitous AI‑driven conveniences to users.
Reference Links
vivo official site: https://www.vivo.com/
wenet project: https://github.com/wenet-e2e/wenet
KunlunChip official site: https://www.kunlunxin.com.cn/
KunlunChip XPU support in wenet: https://github.com/wenet-e2e/wenet/tree/main/runtime/kunlun
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.