
Design and Optimization of Bilibili's Large-Scale Video Duplicate Detection System

Bilibili built a massive video‑duplicate detection platform that trains a self‑supervised ResNet‑50 feature extractor, removes black borders, and uses a two‑stage ANN‑plus‑segment‑level matching pipeline accelerated by custom GPU decoding and inference, boosting duplicate rejection 7.5×, recall 3.75×, and cutting manual misses from 65 to 5 per day.

Bilibili Tech

Background: Bilibili faces a large volume of low-editing duplicate video submissions, in which creators re-upload the same content with minor edits such as black borders, cropping, filters, or overlays. These duplicates increase the workload of safety and community review, distort fair traffic distribution, degrade the user experience, and raise computational costs.

The platform needs a massive‑scale video retrieval system (referred to as the “collision system”) that can compare every newly uploaded video against the entire historical library, identify low‑editing duplicates, and provide auditors with source video hints to protect original creators.

Challenges:

Absence of pre‑trained features that capture varying degrees of video editing; a custom feature extractor must be trained on Bilibili’s data.

The 224×224 input resolution chosen for fast inference discards important content when redundant borders dominate a low-editing video, so a preprocessing step must first remove irrelevant regions.
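The border-removal idea can be illustrated with a simple luminance-threshold crop. The article does not publish its algorithm, so this plain-numpy sketch only shows the principle of discarding near-black rows and columns; the function name and threshold value are made up:

```python
import numpy as np

def crop_black_borders(frame, thresh=10.0):
    """Drop leading/trailing rows and columns whose mean luminance is
    below `thresh`. `frame` is an H x W grayscale array; the threshold
    is an illustrative value, not one from the article."""
    rows = np.where(frame.mean(axis=1) > thresh)[0]
    cols = np.where(frame.mean(axis=0) > thresh)[0]
    if rows.size == 0 or cols.size == 0:
        return frame  # fully dark frame: leave it untouched
    return frame[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# a 100x100 frame with 20-pixel black bars on top and bottom
frame = np.zeros((100, 100))
frame[20:80, :] = 128.0
cropped = crop_black_borders(frame)
```

The production system runs this kind of logic as CUDA kernels (see the pre-processing section below); the numpy form is only for clarity.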

Billions of videos in the repository demand a two-stage retrieval pipeline: a coarse filter that generates candidate sets, followed by precise segment-level matching, all within 10 seconds for a 720p video sampled at 1 fps.

Overall Architecture: The system consists of four subsystems: the main collision system, a timeout fallback system, downstream subsystems (e.g., copyright), and a filtering module. The main system handles video preprocessing, feature extraction, coarse candidate retrieval, and precise segment matching; it consumes the majority of resources while delivering the highest accuracy and recall.

Algorithm Optimizations:

Feature Extraction: A self-supervised training pipeline builds an embedding extractor tailored to Bilibili's editing-distance metric, using a ResNet-50 backbone. Positive pairs are generated by applying realistic editing augmentations (crop, flip, color shift, blur) to the same frame; negative pairs are drawn from a large, dynamically updated queue.

Training adds multi-stage data augmentation, knowledge distillation from a large ViT teacher, and 8-bit quantization for faster inference.

Evaluation on a 30k-pair test set shows progressive improvements from the ImageNet baseline through MoCo, augmentation, ViT distillation, and quantization.
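The contrastive setup described above (a positive pair from two augmentations of the same frame, negatives from a dynamic queue) amounts to a MoCo-style InfoNCE objective. The plain-numpy sketch below is illustrative only: the temperature and shapes are assumptions, and the real system trains a ResNet-50 against this kind of loss rather than computing it in numpy:

```python
import numpy as np

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """InfoNCE loss for a single query embedding.

    q:      (d,)   embedding of one augmented view of a frame
    k_pos:  (d,)   embedding of a second augmentation of the same frame
    queue:  (n, d) negative keys drawn from the dynamic queue
    The temperature value is an assumption, not from the article.
    """
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    # positive logit first, then one logit per negative key
    logits = np.concatenate([[q @ k_pos], queue @ q]) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # positive sits at index 0
```

Minimizing this loss pulls the two views of the same frame together and pushes them away from the queued negatives, which is what makes the resulting embeddings robust to the editing operations listed above.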

Two-Stage Matching Strategy:

Coarse retrieval uses approximate nearest neighbor (ANN) search with product quantization (PQ32) and over one million inverted lists, achieving sub-second latency on a billion-scale vector index.

For each video, a fingerprint is stored at a rate of one frame every 2 seconds to reduce memory and compute. The coarse stage returns a wide candidate set (high recall, low precision).
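The inverted-list idea behind the coarse stage can be sketched without the Faiss machinery: assign database vectors to coarse cells, probe only the closest few cells at query time, and rank the surviving candidates exactly. Everything here (cell count, `nprobe`, dimensions, the centroid choice) is illustrative, and the production index adds PQ32 compression on top:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_db, n_cells = 32, 1000, 16            # illustrative sizes only

db = rng.standard_normal((n_db, d)).astype(np.float32)
# crude coarse quantizer: centroids sampled from the data itself
centroids = db[rng.choice(n_db, n_cells, replace=False)]
assign = ((db[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
inv_lists = {c: np.where(assign == c)[0] for c in range(n_cells)}

def coarse_search(q, nprobe=4, topk=10):
    """Scan only the `nprobe` nearest inverted lists, then rank the
    surviving candidates by exact distance."""
    cell_d = ((centroids - q) ** 2).sum(-1)
    probed = np.argsort(cell_d)[:nprobe]
    cand = np.concatenate([inv_lists[c] for c in probed])
    dist = ((db[cand] - q) ** 2).sum(-1)
    return cand[np.argsort(dist)[:topk]]

hits = coarse_search(db[42])               # query with a stored vector
```

Scanning 4 of 16 cells touches roughly a quarter of the database; at Bilibili's scale, probing a handful of a million-plus lists is what brings billion-vector search under a second.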

Fine‑grained matching treats the problem as aligning two sequences of frame‑level features, using a similarity matrix, longest‑common‑subsequence extraction, and non‑maximum suppression to resolve many‑to‑many matches.
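The alignment step above can be sketched as an LCS-style dynamic program over the frame-similarity matrix. This simplified version returns only the length of the best temporally monotonic chain of matches; the similarity threshold is an assumption, and the article's non-maximum suppression over many-to-many matches is omitted:

```python
import numpy as np

def align_segments(feat_a, feat_b, sim_thresh=0.8):
    """Length of the longest temporally monotonic chain of matching
    frames between two L2-normalised feature sequences of shapes
    (m, d) and (n, d). A plain LCS over the thresholded similarity
    matrix; the threshold is illustrative."""
    sim = feat_a @ feat_b.T                  # cosine similarity matrix
    m, n = sim.shape
    dp = np.zeros((m + 1, n + 1), dtype=int)
    for i in range(m):
        for j in range(n):
            if sim[i, j] >= sim_thresh:      # frames i and j match
                dp[i + 1, j + 1] = dp[i, j] + 1
            else:
                dp[i + 1, j + 1] = max(dp[i, j + 1], dp[i + 1, j])
    return dp[m, n]
```

Because the chain must advance through both sequences in order, a long LCS indicates a genuinely overlapping segment rather than a handful of coincidentally similar frames.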

Engineering Performance Optimizations:

Model Inference: The custom InferX framework accelerates ResNet-50 inference on Volta/Turing GPUs by more than 5× compared to LibTorch, achieving over 2000 QPS on a single T4 with only 2 GB of memory usage.

Video Decoding: A GPU-only decoder built on the NvCodec SDK converts video streams directly to torch CUDA tensors, eliminating host-device copies. It supports YUV→RGB conversion via CUDA kernels and optional JPEG-encoded frame output for network transmission.

Image Pre-processing: GPU kernels perform resize, normalization, and black-border removal using warp-shuffle reductions and shared-memory optimizations.

Audio Feature Extraction: Log-FilterBank and MFCC calculations are re-implemented in C++ with vectorization, Intel MKL FFT/GEMM, and optional cuBLAS/cuFFT acceleration, yielding a 10× speed-up over the original Python version.
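For reference, the log-FilterBank computation that the team re-implemented in C++ can be sketched in plain numpy for a single frame. This uses the common HTK-style mel formula; framing, windowing, and the MKL/cuFFT acceleration are omitted, and all parameter values are illustrative rather than Bilibili's:

```python
import numpy as np

def log_fbank(frame, sr=16000, n_fft=512, n_mels=40):
    """Log mel-filterbank energies for one audio frame. HTK-style mel
    scale; parameters are illustrative and windowing is omitted."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2        # power spectrum
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # filter edge frequencies, equally spaced on the mel scale
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):                  # triangular mel filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(fb @ spec + 1e-10)

feats = log_fbank(np.sin(2 * np.pi * 1000 * np.arange(512) / 16000))
```

An MFCC would apply a DCT to this output; the speed-up in the production version comes from replacing per-call Python overhead with vectorized FFT/GEMM, which the `fb @ spec` matrix product hints at.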

Vector Retrieval: A Faiss-based distributed index with FP16/Tensor Core support and optional binary hashing reduces the memory footprint and enables billion-scale similarity search on CPU or GPU.

Results: After two years of development, the system improves duplicate-rejection volume by ~7.5×, recall by ~3.75×, and suggestion count by 1.7× compared to the 2020 baseline. Model accuracy reaches ~88%, and daily manual miss-detections drop from 65 to 5 (≈1/13 of the baseline). The system now serves Bilibili's safety, copyright, high-risk image/video review, and recommendation deduplication pipelines.

Tags: system architecture, deep learning, GPU acceleration, feature extraction, video deduplication, large-scale retrieval
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
