
FPGA-Based Real-Time Streaming ASR Acceleration for Kuaishou: A Case Study in Domain-Specific Hardware Optimization

This paper presents a fully fixed-point, FPGA-based hardware acceleration solution for TDNN+LSTM acoustic models in real-time streaming ASR, achieving a 37.67% latency reduction and a 7.5x concurrency improvement through software-hardware co-design and domain-specific optimization.


This paper presents a comprehensive FPGA-based hardware acceleration solution for real-time streaming Automatic Speech Recognition (ASR) developed by Kuaishou's Heterogeneous Computing Center in collaboration with the MMU Audio Center. The solution targets TDNN+LSTM acoustic models, which are widely used in voice search, live streaming input methods, and other interactive applications where low latency and high concurrency are critical performance metrics.

In the system architecture, a host processor handles feature extraction, decoding, and post-processing, while neural network inference is offloaded to FPGA accelerator cards. The FPGA design employs a VLIW-based domain-specific processor with multiple acceleration engines, each containing 32 vector processing units and a 64x32 matrix multiply array for 16-bit fixed-point operations. The hardware design incorporates advanced optimization techniques including loop unrolling, pipeline optimization, and ping-pong buffering to maximize performance.
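To make the 16-bit fixed-point arithmetic concrete, the sketch below models one row of a MAC array performing a fixed-point matrix-vector multiply. The Q8.8 format, saturation behavior, and all function names are illustrative assumptions, not details from the paper:

```python
# Illustrative sketch: 16-bit fixed-point (Q8.8) matrix-vector multiply,
# loosely mimicking a row of the 64x32 matrix multiply array. The Q8.8
# split and saturation policy are assumptions for this example.

Q_FRAC = 8                               # fractional bits in Q8.8
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def to_fixed(x: float) -> int:
    """Quantize a float to a saturated signed 16-bit Q8.8 integer."""
    return max(INT16_MIN, min(INT16_MAX, round(x * (1 << Q_FRAC))))

def fixed_matvec(W, v):
    """Multiply Q8.8 weight rows by a Q8.8 vector.

    Products accumulate in a wide Python int (standing in for a wide
    hardware accumulator), then are rescaled to Q8.8 with saturation.
    """
    out = []
    for row in W:
        acc = sum(w * x for w, x in zip(row, v))        # wide accumulate
        out.append(max(INT16_MIN, min(INT16_MAX, acc >> Q_FRAC)))
    return out

W = [[to_fixed(0.5), to_fixed(-0.25)],
     [to_fixed(1.0), to_fixed(2.0)]]
v = [to_fixed(1.0), to_fixed(4.0)]
print([x / (1 << Q_FRAC) for x in fixed_matvec(W, v)])  # [-0.5, 9.0]
```

A real accelerator would run 64x32 such MACs per cycle per engine; the point here is only the quantize-accumulate-rescale pattern.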

Key innovations span multiple levels: algorithm-level graph fusion reduces model layers by ~20%, temporal profiling enables precise fixed-point quantization of LSTM models without retraining, and graph partitioning optimizes computation distribution across FPGA resources. The software framework provides zero-code model upgrades through configuration files, supports multi-model switching, and implements OpenCL-based task scheduling with event-driven synchronization.
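The temporal profiling idea can be sketched as follows: run calibration audio through the model, record each tensor's activation range across time steps, and derive a fixed-point format from the observed peak, with no retraining. All names, the calibration data, and the bit-allocation rule are assumptions for illustration:

```python
# Illustrative sketch of temporal profiling for fixed-point quantization:
# track per-tensor activation peaks over time steps, then allocate
# integer vs. fractional bits in a 16-bit word. Names are hypothetical.

def profile_ranges(activations_over_time):
    """Record the max |activation| seen per tensor across all frames."""
    peaks = {}
    for frame in activations_over_time:      # one dict per time step
        for name, values in frame.items():
            peak = max(abs(v) for v in values)
            peaks[name] = max(peaks.get(name, 0.0), peak)
    return peaks

def choose_frac_bits(peak, total_bits=16):
    """Pick fractional bits so the peak fits in a signed fixed-point word."""
    int_bits = 0
    while (1 << int_bits) <= peak:           # bits needed for integer part
        int_bits += 1
    return total_bits - 1 - int_bits         # minus 1 sign bit

# Toy calibration run: two frames of one LSTM cell-state tensor.
frames = [{"lstm_c": [0.3, -1.7]}, {"lstm_c": [2.9, 0.1]}]
peaks = profile_ranges(frames)
print({k: choose_frac_bits(p) for k, p in peaks.items()})  # {'lstm_c': 13}
```

Profiling over time steps matters for LSTMs in particular, since cell-state magnitudes can grow as a stream progresses; a scale chosen from a single frame would clip later frames.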

System-level optimizations include batch processing mechanisms to maximize kernel utilization, dynamic load balancing across multiple models, and efficient PCIe data transfer. The solution achieves significant performance improvements: 37.67% average latency reduction, 20.7% P99 latency improvement, and 7.5x increase in concurrent processing capacity while maintaining recognition accuracy. The project demonstrates successful large-scale deployment of FPGA in data center voice processing scenarios, addressing the growing demand for AI infrastructure efficiency through domain-specific hardware acceleration.
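The batching mechanism described above can be sketched as a small scheduler that groups pending requests so each kernel launch runs as close to full batch width as possible. The batch size, class name, and dispatch policy are illustrative assumptions:

```python
# Illustrative sketch of the batch-processing mechanism: queue incoming
# utterance requests and dispatch them in groups so each accelerator
# kernel launch is well utilized. Details are assumptions, not the
# paper's actual scheduler.

from collections import deque

class BatchScheduler:
    def __init__(self, batch_size=8):
        self.batch_size = batch_size
        self.pending = deque()               # FIFO of waiting requests

    def submit(self, request):
        self.pending.append(request)

    def next_batch(self):
        """Dispatch up to batch_size requests; a partial batch still runs,
        trading some kernel utilization for bounded per-request latency."""
        batch = []
        while self.pending and len(batch) < self.batch_size:
            batch.append(self.pending.popleft())
        return batch

sched = BatchScheduler(batch_size=4)
for i in range(6):
    sched.submit(f"utt-{i}")
print(sched.next_batch())   # first four requests launch together
print(sched.next_batch())   # the remaining two form a partial batch
```

In a production system this loop would be driven by OpenCL events and would also weigh per-model load when choosing which queue to drain, per the dynamic load balancing the paper describes.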

Tags: Domain-Specific Architecture, fixed-point quantization, FPGA acceleration, hardware-software co-design, low latency optimization, real-time ASR, TDNN+LSTM, voice processing
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
