
High‑Precision Low‑Latency Intelligent Danmu Blocking Solution for Kuaishou Video

The Kuaishou audio‑video team designed a high‑precision, low‑latency intelligent danmu‑blocking system that uses advanced image‑segmentation and temporal‑stability techniques to generate accurate masks, improve scene robustness, eliminate mask delay, and enhance user experience across diverse video content.

Kuaishou Tech

In the era of bullet‑screen (danmu) comments, dense overlays often obscure key parts of the picture. The Kuaishou long‑video channel faced the same problem, which prompted the development of a high‑precision, low‑latency intelligent danmu‑blocking solution that automatically detects the regions viewers care about and routes danmu around them.

Traditional adaptive blocking methods rely on portrait (human‑figure) masks, which suffer from two problems: mis‑detection, where the mask covers the wrong region, and latency, where the mask lags behind the frame it belongs to.

To improve mask accuracy, the team built a high‑precision mask‑generation algorithm based on image segmentation (U2‑Net [1]) and incorporated a non‑local module [2] to fuse features from multiple frames, enhancing temporal stability; a guidance mask from the previous frame further stabilizes predictions.
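The multi‑frame fusion idea can be sketched as a single non‑local attention step: each position in the current frame attends to all positions in the preceding frames and mixes in the similar ones. This is a minimal NumPy sketch under assumed shapes; in the actual model this would be a trainable network layer, and the function names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_fuse(curr, prevs):
    """Fuse features from preceding frames into the current frame.

    curr:  (N, C) feature vectors of the current frame (N spatial positions).
    prevs: (T, N, C) feature vectors of T preceding frames.
    Returns fused (N, C) features: current features plus an
    attention-weighted sum of temporal context (residual fusion).
    """
    context = prevs.reshape(-1, curr.shape[1])                 # (T*N, C)
    # Similarity between every current position and every past position,
    # normalized into attention weights.
    attn = softmax(curr @ context.T / np.sqrt(curr.shape[1]))  # (N, T*N)
    return curr + attn @ context

# Toy check: fusion preserves the current frame's feature shape.
rng = np.random.default_rng(0)
curr = rng.random((16, 8))
prevs = rng.random((2, 16, 8))
fused = non_local_fuse(curr, prevs)
print(fused.shape)  # (16, 8)
```

The residual form (`curr + …`) matches the non‑local design in [2]: temporal context reinforces the current frame's features rather than replacing them.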

Temporal stability is defined as (1) consistency of masks across consecutive frames and (2) responsiveness during scene transitions. The non‑local architecture computes similarity between the current frame and preceding frames, merging these features to reinforce temporal information. SSIM is used to assess frame similarity and decide whether to apply temporal cues, thereby preventing mask delay during rapid scene changes.
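The SSIM gate can be sketched as follows: when consecutive frames are similar, the temporal branch is applied; when similarity collapses (a shot cut), history is dropped so the stale mask cannot bleed into the new scene. This sketch uses a simplified global SSIM (single means/variances, no sliding window), and the threshold value is an assumption, not the production setting.

```python
import numpy as np

def ssim_global(a, b, L=255.0):
    """Simplified global SSIM between two grayscale frames (no windowing)."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def use_temporal_cues(prev_frame, curr_frame, thresh=0.5):
    """Gate the temporal branch: skip it when a shot cut is detected."""
    return ssim_global(prev_frame, curr_frame) >= thresh

same = np.full((64, 64), 128.0)  # flat gray frame
cut = np.zeros((64, 64))         # frame after a hard cut to black
print(use_temporal_cues(same, same))  # True  (similar frames: fuse history)
print(use_temporal_cues(same, cut))   # False (scene change: drop history)
```

In practice a windowed SSIM (e.g. `skimage.metrics.structural_similarity`) gives a more local measure, but the gating logic is the same.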

To ensure robustness across diverse scenes, a comprehensive data‑annotation pipeline was built, covering data collection, filtering, multi‑model labeling, and quality evaluation. Millions of annotated samples from various domains (e.g., food‑broadcast, street interviews, movies) were used to train the model, significantly reducing background mis‑detections.
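One common way to realize the "multi‑model labeling plus quality evaluation" steps is to pseudo‑label each frame with several models and keep only pixels (and samples) the models agree on. The article does not specify the aggregation rule, so the majority vote and confidence score below are illustrative assumptions.

```python
import numpy as np

def consensus_mask(masks, agree=0.5):
    """Majority-vote pseudo-label from several models' binary masks.

    masks: (M, H, W) array of 0/1 predictions from M labeling models.
    A pixel becomes positive when at least `agree` of the models fire.
    """
    return (np.mean(masks, axis=0) >= agree).astype(np.uint8)

def label_confidence(masks):
    """Mean per-pixel agreement; low values flag samples for human review."""
    vote = np.mean(masks, axis=0)
    return float(np.mean(np.maximum(vote, 1 - vote)))

# Three models label a tiny 2x3 frame; only unanimous/majority pixels survive.
masks = np.array([
    [[1, 1, 0], [0, 0, 0]],
    [[1, 0, 0], [0, 1, 0]],
    [[1, 1, 0], [0, 0, 0]],
])
print(consensus_mask(masks))   # pixels (0,0) and (0,1) win the vote
print(label_confidence(masks))
```

Filtering on the confidence score gives a cheap automatic quality gate before annotated samples enter the training set.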

Mask delay was traced to two main causes: mismatched video codecs (different bitrate versions leading to frame‑level desynchronization) and renderer lag (using a previous‑frame mask for the current frame). Aligning timestamps during transcoding and optimizing player rendering eliminated these delays.
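The timestamp-alignment fix amounts to matching each rendered frame to the mask with the nearest timestamp, and refusing to draw a mask whose drift is too large. This is a sketch of that lookup under assumed units (millisecond PTS values); the drift tolerance and function names are illustrative, not the player's actual API.

```python
import bisect

def mask_for_frame(mask_pts, frame_pts, max_drift_ms=20):
    """Pick the mask whose timestamp best matches the frame being rendered.

    mask_pts: sorted list of mask timestamps (ms), one per segmented frame.
    Returns the index of the matching mask, or None when the closest mask
    drifts beyond `max_drift_ms` (better no mask than a stale one).
    """
    i = bisect.bisect_left(mask_pts, frame_pts)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(mask_pts)]
    best = min(candidates, key=lambda j: abs(mask_pts[j] - frame_pts))
    return best if abs(mask_pts[best] - frame_pts) <= max_drift_ms else None

pts = [0, 40, 80, 120]           # masks generated at 25 fps
print(mask_for_frame(pts, 42))   # 1    -> mask at 40 ms
print(mask_for_frame(pts, 300))  # None -> too far; skip blocking
```

Keying the lookup on timestamps rather than frame indices is what makes it robust across bitrate variants, whose frame sequences need not line up one‑to‑one.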

Extensive testing across multiple scenarios—multiplayer, fast‑cut scenes, complex motions—showed a subjective accuracy exceeding 95%, and the deployment increased video consumption time and active user count on the long‑video page.

References

[1] Qin X, Zhang Z, Huang C, et al. U2‑Net: Going deeper with nested U‑structure for salient object detection. Pattern Recognition, 2020, 106: 107404.

[2] Wang X, Girshick R, Gupta A, et al. Non‑local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7794‑7803.

Tags: AI, image segmentation, video processing, Kuaishou, danmu blocking, non‑local network
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
