High‑Precision Low‑Latency Intelligent Danmu Blocking Solution for Kuaishou Video
The Kuaishou audio‑video team designed a high‑precision, low‑latency intelligent danmu‑blocking system that uses advanced image‑segmentation and temporal‑stability techniques to generate accurate masks, improve scene robustness, eliminate mask delay, and enhance user experience across diverse video content.
In the era of bullet‑screen (danmu) comments, dense overlays often obscure key scenes. The Kuaishou long‑video channel faced the same problem, prompting the development of a high‑precision, low‑latency intelligent danmu‑blocking solution that automatically detects regions of user interest and routes danmu around them.
Traditional adaptive blocking methods rely on portrait masks, which suffer from both mis‑detection (masks covering the wrong regions) and latency (masks lagging behind the video).
To improve mask accuracy, the team built a high‑precision mask‑generation algorithm based on image segmentation (U²‑Net [1]) and incorporated a non‑local module [2] to fuse features from multiple frames, enhancing temporal stability; an additional guidance mask from the previous frame further stabilizes predictions.
Temporal stability is defined as (1) sequential frame mask consistency and (2) real‑time stability during transitions. The non‑local architecture computes similarity between the current frame and preceding frames, merging these features to reinforce temporal information. SSIM is used to assess frame similarity and decide whether to apply temporal cues, thereby mitigating mask delay during rapid scene changes.
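The SSIM gate described above can be sketched as follows. This is a single‑window (global) SSIM approximation rather than the standard sliding‑window SSIM, and the threshold value and function names are assumptions for illustration: when similarity between consecutive frames drops (a scene cut), temporal cues are skipped so the stale previous‑frame mask cannot bleed into the new scene.

```python
import numpy as np

def global_ssim(a, b, c1=0.01**2, c2=0.03**2):
    """Global SSIM over two grayscale frames in [0, 1].
    Single-window approximation of the standard SSIM formula."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2))

def use_temporal_cues(cur, prev, thresh=0.5):
    """Gate: apply previous-frame guidance only when frames are similar."""
    return global_ssim(cur, prev) >= thresh

same = np.full((8, 8), 0.5)                # identical frames
cut = np.zeros((8, 8)); cut[:, 4:] = 1.0   # hard scene change
print(use_temporal_cues(same, same))  # True  -> fuse temporal features
print(use_temporal_cues(same, cut))   # False -> predict from frame alone
```

The threshold trades stability against responsiveness: too low and masks lag through cuts, too high and flicker returns on static scenes.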
To ensure robustness across diverse scenes, a comprehensive data‑annotation pipeline was built, covering data collection, filtering, multi‑model labeling, and quality evaluation. Millions of annotated samples from various domains (e.g., food‑broadcast, street interviews, movies) were used to train the model, significantly reducing background mis‑detections.
Mask delay was traced to two main causes: mismatched video codecs (different bitrate versions leading to frame‑level desynchronization) and renderer lag (using a previous‑frame mask for the current frame). Aligning timestamps during transcoding and optimizing player rendering eliminated these delays.
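The renderer‑side fix amounts to matching masks to frames by presentation timestamp rather than by frame index, which stays correct across transcoded variants with different frame rates. This is a hypothetical sketch of that lookup; the function name, tolerance value, and fallback behavior are assumptions, not the player's actual API.

```python
from bisect import bisect_left

def mask_for_frame(mask_pts, frame_pts, tol_ms=20):
    """Pick the mask whose presentation timestamp (PTS) is closest to the
    frame's PTS. Returns the mask index, or None if nothing falls within
    tolerance, so the player degrades to unmasked danmu rather than
    rendering a stale mask. `mask_pts` must be sorted (milliseconds)."""
    i = bisect_left(mask_pts, frame_pts)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(mask_pts)]
    best = min(candidates, key=lambda j: abs(mask_pts[j] - frame_pts))
    return best if abs(mask_pts[best] - frame_pts) <= tol_ms else None

pts = [0, 40, 80, 120]            # mask timestamps at 25 fps
print(mask_for_frame(pts, 82))    # 2    (mask at 80 ms)
print(mask_for_frame(pts, 200))   # None (no mask close enough)
```

Using the previous frame's mask for the current frame is exactly the off‑by‑one this lookup avoids: the nearest‑PTS match is symmetric, so a mask is never held past its tolerance window.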
Extensive testing across multiple scenarios—multiplayer, fast‑cut scenes, complex motions—showed a subjective accuracy exceeding 95%, and the deployment increased video consumption time and active user count on the long‑video page.
References:
[1] Qin X, Zhang Z, Huang C, et al. U²‑Net: Going deeper with nested U‑structure for salient object detection. Pattern Recognition, 2020, 106: 107404.
[2] Wang X, Girshick R, Gupta A, et al. Non‑local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7794‑7803.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.