
FPGA-Based Real-Time Streaming ASR Acceleration for Kuaishou: A Case Study in Domain-Specific Hardware Optimization

This paper presents a fully fixed-point, FPGA-based hardware acceleration solution for TDNN+LSTM acoustic models in real-time streaming ASR, achieving a 37.67% latency reduction and a 7.5x concurrency improvement through software-hardware co-design and domain-specific optimization.


This paper presents a comprehensive FPGA-based hardware acceleration solution for real-time streaming Automatic Speech Recognition (ASR) developed by Kuaishou's Heterogeneous Computing Center in collaboration with the MMU Audio Center. The solution targets TDNN+LSTM acoustic models, which are widely used in voice search, live streaming input methods, and other interactive applications where low latency and high concurrency are critical performance metrics.

In the system architecture, a host processor handles feature extraction, decoding, and post-processing, while neural network inference is offloaded to FPGA accelerator cards. The FPGA design employs a VLIW-based domain-specific processor with multiple acceleration engines, each containing 32 vector processing units and a 64x32 matrix multiply array for 16-bit fixed-point operations. The hardware design incorporates advanced optimization techniques including loop unrolling, pipeline optimization, and ping-pong buffering to maximize performance.
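To make the 16-bit fixed-point arithmetic concrete, the sketch below models one row of a MAC array performing a fixed-point matrix-vector multiply. The Q8.8 format, saturation behavior, and all function names are illustrative assumptions, not details from the paper:

```python
# Illustrative sketch: 16-bit fixed-point (Q8.8) matrix-vector multiply,
# loosely mimicking a row of the 64x32 matrix multiply array. The Q8.8
# split and saturation policy are assumptions for this example.

Q_FRAC = 8                               # fractional bits in Q8.8
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def to_fixed(x: float) -> int:
    """Quantize a float to a saturated signed 16-bit Q8.8 integer."""
    return max(INT16_MIN, min(INT16_MAX, round(x * (1 << Q_FRAC))))

def fixed_matvec(W, v):
    """Multiply Q8.8 weight rows by a Q8.8 vector.

    Products accumulate in a wide Python int (standing in for a wide
    hardware accumulator), then are rescaled to Q8.8 with saturation.
    """
    out = []
    for row in W:
        acc = sum(w * x for w, x in zip(row, v))        # wide accumulate
        out.append(max(INT16_MIN, min(INT16_MAX, acc >> Q_FRAC)))
    return out

W = [[to_fixed(0.5), to_fixed(-0.25)],
     [to_fixed(1.0), to_fixed(2.0)]]
v = [to_fixed(1.0), to_fixed(4.0)]
print([x / (1 << Q_FRAC) for x in fixed_matvec(W, v)])  # [-0.5, 9.0]
```

A real accelerator would run 64x32 such MACs per cycle per engine; the point here is only the quantize-accumulate-rescale pattern.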

Key innovations span multiple levels: algorithm-level graph fusion reduces model layers by ~20%, temporal profiling enables precise fixed-point quantization of LSTM models without retraining, and graph partitioning optimizes computation distribution across FPGA resources. The software framework provides zero-code model upgrades through configuration files, supports multi-model switching, and implements OpenCL-based task scheduling with event-driven synchronization.
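The temporal profiling idea can be sketched as follows: run calibration audio through the model, record each tensor's activation range across time steps, and derive a fixed-point format from the observed peak, with no retraining. All names, the calibration data, and the bit-allocation rule are assumptions for illustration:

```python
# Illustrative sketch of temporal profiling for fixed-point quantization:
# track per-tensor activation peaks over time steps, then allocate
# integer vs. fractional bits in a 16-bit word. Names are hypothetical.

def profile_ranges(activations_over_time):
    """Record the max |activation| seen per tensor across all frames."""
    peaks = {}
    for frame in activations_over_time:      # one dict per time step
        for name, values in frame.items():
            peak = max(abs(v) for v in values)
            peaks[name] = max(peaks.get(name, 0.0), peak)
    return peaks

def choose_frac_bits(peak, total_bits=16):
    """Pick fractional bits so the peak fits in a signed fixed-point word."""
    int_bits = 0
    while (1 << int_bits) <= peak:           # bits needed for integer part
        int_bits += 1
    return total_bits - 1 - int_bits         # minus 1 sign bit

# Toy calibration run: two frames of one LSTM cell-state tensor.
frames = [{"lstm_c": [0.3, -1.7]}, {"lstm_c": [2.9, 0.1]}]
peaks = profile_ranges(frames)
print({k: choose_frac_bits(p) for k, p in peaks.items()})  # {'lstm_c': 13}
```

Profiling over time steps matters for LSTMs in particular, since cell-state magnitudes can grow as a stream progresses; a scale chosen from a single frame would clip later frames.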

System-level optimizations include batch processing mechanisms to maximize kernel utilization, dynamic load balancing across multiple models, and efficient PCIe data transfer. The solution achieves significant performance improvements: 37.67% average latency reduction, 20.7% P99 latency improvement, and 7.5x increase in concurrent processing capacity while maintaining recognition accuracy. The project demonstrates successful large-scale deployment of FPGA in data center voice processing scenarios, addressing the growing demand for AI infrastructure efficiency through domain-specific hardware acceleration.
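The batching mechanism described above can be sketched as a small scheduler that groups pending requests so each kernel launch runs as close to full batch width as possible. The batch size, class name, and dispatch policy are illustrative assumptions:

```python
# Illustrative sketch of the batch-processing mechanism: queue incoming
# utterance requests and dispatch them in groups so each accelerator
# kernel launch is well utilized. Details are assumptions, not the
# paper's actual scheduler.

from collections import deque

class BatchScheduler:
    def __init__(self, batch_size=8):
        self.batch_size = batch_size
        self.pending = deque()               # FIFO of waiting requests

    def submit(self, request):
        self.pending.append(request)

    def next_batch(self):
        """Dispatch up to batch_size requests; a partial batch still runs,
        trading some kernel utilization for bounded per-request latency."""
        batch = []
        while self.pending and len(batch) < self.batch_size:
            batch.append(self.pending.popleft())
        return batch

sched = BatchScheduler(batch_size=4)
for i in range(6):
    sched.submit(f"utt-{i}")
print(sched.next_batch())   # first four requests launch together
print(sched.next_batch())   # the remaining two form a partial batch
```

In a production system this loop would be driven by OpenCL events and would also weigh per-model load when choosing which queue to drain, per the dynamic load balancing the paper describes.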

Tags: Domain-Specific Architecture, fixed-point quantization, FPGA acceleration, hardware-software co-design, low latency optimization, real-time ASR, TDNN+LSTM, voice processing
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
