
Video Copyright Detection Using CDVS and CDVA: Solution Overview and Technical Details

The BoYun Vision team developed a CDVS‑ and CDVA‑based video copyright detection system that extracts key frames, combines handcrafted and deep features, aligns timestamps with a CDVS temporal node filter, and achieved a top‑2 result in the 2019 iQIYI‑CCF competition.

iQIYI Technical Product Team

With the rapid development of mobile Internet and the widespread adoption of smartphones, short videos have become a major medium for information dissemination, but they also bring massive copyright infringement issues. An automated method is needed to detect video infringement and protect the rights of video production companies and original creators.

The BoYun Vision team proposed a video copyright detection method based on CDVS (Compact Descriptors for Visual Search) and CDVA (Compact Descriptors for Video Analysis). Their solution achieved a top‑2 result in the "2019 CCF Big Data & Computing Intelligence Competition – Video Copyright Detection" jointly organized by iQIYI and CCF.

The competition provided 1,600 infringing short videos (the query set) and 205 original long videos (the reference set). For each short video, the algorithm had to output the corresponding long video along with the start and end timestamps, with a timing error of at most 3 seconds. An additional 3,000 short videos were supplied as a training/validation set.

The task required linking transformed short videos to their source long videos, extracting key frames, computing visual fingerprints, and performing video similarity retrieval. Besides robustness of video features, the solution needed to be real‑time and scalable.

Core Strategy and Algorithm

The core idea follows the CDVA international standard completed by the BoYun Vision team in 2019, with specific improvements for the competition. The pipeline includes:

Key‑frame detection for both query and reference videos.

Extraction of handcrafted CDVS features and deep learning features from key frames.

Feature‑based video matching to generate a set of video pairs with matching scores.

To achieve precise temporal alignment, a CDVS‑based temporal node filtering method was designed. It combines frame ordering, noise suppression of timestamp errors, and correlation filtering between the two video sequences to obtain aligned time nodes.
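The retrieval stage of the pipeline above can be sketched as follows. This is a minimal illustration, not the team's code: it assumes key‑frame features are already extracted as real‑valued vectors, and scores each candidate reference by accumulating every query frame's best cosine match (the function names `frame_scores` and `rank_references` are invented for this sketch).

```python
import numpy as np

def frame_scores(query_feats, ref_feats):
    """Cosine similarity between every query key frame and every reference key frame."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return q @ r.T

def rank_references(query_feats, reference_index):
    """Score each reference video by summing each query frame's best per-frame
    match, then rank references by total score (highest first)."""
    ranked = []
    for ref_id, ref_feats in reference_index.items():
        sim = frame_scores(query_feats, ref_feats)
        ranked.append((ref_id, float(sim.max(axis=1).sum())))
    return sorted(ranked, key=lambda t: t[1], reverse=True)
```

The top‑ranked pairs from this step are what the temporal node filter then refines into precise timestamps.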

Technical Details

CDVS (Compact Descriptors for Visual Search) is a handcrafted feature similar to SIFT, suited to image retrieval and matching. Extraction consists of five steps: interest‑point selection, local feature description, descriptor compression, location compression, and descriptor aggregation. With a GPU implementation, a single machine can extract features from up to 200 images per second, with descriptor sizes ranging from 512 bytes to 16 KB.
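To give a feel for the descriptor‑compression step, the toy sketch below binarizes a real‑valued local descriptor against its median and compares codes by Hamming distance. The actual CDVS standard uses a more elaborate low‑complexity transform and quantization scheme; the function names here are invented for illustration.

```python
import numpy as np

def compress_descriptor(desc, thresholds=None):
    """Binarize a real-valued local descriptor (e.g. a SIFT-like 128-D vector)
    against per-dimension thresholds. A toy stand-in for CDVS descriptor
    compression; by default each dimension is thresholded at the median."""
    if thresholds is None:
        thresholds = np.median(desc)
    return (desc > thresholds).astype(np.uint8)

def hamming_distance(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))
```

Binary codes like these are what make matching fast: comparing two compressed descriptors is a bit‑count rather than a floating‑point distance.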

CDVA (Compact Descriptors for Video Analysis) extends CDVS to video matching and retrieval. The workflow includes:

Fast key‑frame localization using color histograms.

Frame‑level matching by combining deep CNN features (NIP) and CDVS handcrafted features.

Plain deep features, however, are sensitive to rotation: as the competition results showed, matching performance drops noticeably when the query video is rotated.
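Stepping back to the first CDVA stage, the color‑histogram key‑frame localization can be sketched as follows (a toy version, not the team's implementation): a frame is kept as a key frame when its normalized color histogram differs from the previous key frame's by more than a threshold. The threshold value here is an arbitrary illustration.

```python
import numpy as np

def color_histogram(frame, bins=16):
    """Per-channel color histogram of an (H, W, C) frame, L1-normalized."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(frame.shape[-1])]
    h = np.concatenate(hists).astype(np.float64)
    return h / h.sum()

def detect_keyframes(frames, threshold=0.3):
    """Keep frame i as a key frame when its histogram differs from the last
    key frame's histogram by more than `threshold` (L1 distance)."""
    keyframes = [0]
    prev = color_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        h = color_histogram(frame)
        if np.abs(h - prev).sum() > threshold:
            keyframes.append(i)
            prev = h
    return keyframes
```

Histogram comparison is cheap, which is why it works as a fast first pass before the more expensive frame‑level feature matching.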

To improve robustness, the team enhanced the CNN pooling layer with a Nested Invariant Pooling (NIP) method, producing compact global features invariant to translation, scale, and rotation. Compared with MAC and R‑MAC, NIP matches salient regions more accurately and delivers better video matching performance.
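A simplified way to see how pooling over a rotation group yields rotation invariance: compute a regional (R‑MAC‑like) descriptor on each 90‑degree rotation of the CNN feature map, then average them. The real NIP nests several pooling operations over transformation groups; this sketch (with invented function names) keeps only the rotation‑pooling idea.

```python
import numpy as np

def regional_desc(fm):
    """Concatenate max-pooled descriptors of the four quadrants of an
    (H, W, C) feature map -- a crude R-MAC-like regional descriptor."""
    H, W, _ = fm.shape
    h, w = H // 2, W // 2
    quads = [fm[:h, :w], fm[:h, w:], fm[h:, :w], fm[h:, w:]]
    return np.concatenate([q.max(axis=(0, 1)) for q in quads])

def nip_like(fm):
    """Average the regional descriptors of all four 90-degree rotations of
    the feature map. The rotation group is pooled away, so the resulting
    L2-normalized descriptor is unchanged when the input is rotated."""
    d = np.mean([regional_desc(np.rot90(fm, k)) for k in range(4)], axis=0)
    return d / (np.linalg.norm(d) + 1e-12)
```

Because the four rotations form a closed group, rotating the input merely permutes which descriptor is which before averaging, leaving the pooled result unchanged.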

After obtaining candidate video pairs and their scores, a temporal node filtering process refines the selection:

Frame extraction and CDVS‑based frame matching to compute similarity scores.

Score accumulation and re‑ranking of video pairs to select the highest‑scoring reference video.

Temporal alignment using frame ordering, noise suppression of mismatched frames, and a sliding‑window approach to output precise timestamp locations.
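The sliding‑window alignment in the last step can be sketched as a search over diagonal offsets of a query‑versus‑reference frame‑similarity matrix. This is illustrative only: it assumes both videos are sampled at the same frame rate, so a matched segment shows up as a high‑scoring diagonal.

```python
import numpy as np

def align_segment(sim):
    """Slide the query along the reference via the diagonals of the
    similarity matrix `sim` (n_query x n_ref). Returns the reference-frame
    offset whose diagonal sum is highest, i.e. the start of the matched
    segment, together with that score."""
    n_q, n_r = sim.shape
    best_off, best_score = 0, -np.inf
    for off in range(n_r - n_q + 1):
        score = np.trace(sim[:, off:off + n_q])
        if score > best_score:
            best_off, best_score = off, score
    return best_off, best_score
```

In practice the per‑frame scores would first be cleaned up by the noise‑suppression step, so isolated mismatched frames do not pull the window off the true segment.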

The solution achieved top‑2 in the competition, demonstrating strong performance in both feature robustness and real‑time processing.

Award Statement

The BoYun Vision team entered the finals on their first attempt and secured a top‑2 finish. Despite a slight score drop in the second round, the team gained valuable experience in algorithm design, coding, experimentation, and collaborative problem solving. Parts of the source code have been uploaded to GitHub for public reference.

Team Introduction

The team consists of AI researchers and engineers from BoYun Vision (Beijing) Technology Co., Ltd., focusing on visual search and analysis. Members include team leader Xie Zhangxiang, R&D director Lou Yuhang (Ph.D., Peking University, research on large‑scale image/video retrieval and feature compression), researcher Bai Yan (Ph.D., Peking University, image/video retrieval and feature standardization), founder and CEO Chen Jie (Ph.D., Peking University, AI standards), and engineer Zhang Zhenbin.

Tags: deep learning, video copyright detection, CDVA, CDVS, video retrieval, visual search