Multimodal Speaker Diarization for Long-Form Video Scripts
iQIYI’s multimodal speaker diarization system splits long‑form video using subtitle timestamps and scene detection, extracts voiceprints with a custom model, clusters them hierarchically, and applies an Active Speaker Detection (ASD) algorithm combined with face recognition to assign speakers. The pipeline achieves around 90% precision and recall and boosts downstream tasks such as summarization, translation, and dubbing.
Background: Video scripts contain dialogue and speaker information that are essential for understanding the plot. However, after multiple edits and cuts on long‑video platforms, this script information is often lost, creating a need for dialogue speaker identification technology. The technology extracts and identifies speaker segments from a single episode, enabling structured management of massive video content. It improves downstream tasks such as highlight detection (accuracy ~85%, 5% higher than with text‑only input) and supports video description, summarization, translation, and dubbing with precision and recall around 90%.
Existing solutions fall into two categories: clustering‑based cascaded frameworks and end‑to‑end frameworks. End‑to‑end systems handle overlapping speech better, but because most dialogue lines in dramas belong to a single speaker, a clustering‑based approach is adopted. The typical pipeline includes voice activity detection (VAD) for segmentation, voiceprint feature extraction, and unsupervised clustering. Long videos pose additional challenges, including varied scene types, fluctuating speaker counts, background music, and large intra‑speaker variability, all of which degrade clustering purity.
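The clustering stage of such a cascaded pipeline can be sketched in a few lines. The article does not specify the linkage or distance metric, so the following is a minimal illustration assuming average‑linkage agglomerative clustering over voiceprint embeddings with a cosine‑distance merge threshold; `threshold` and the greedy merge loop are illustrative choices, not iQIYI's actual implementation.

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two embedding vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def agglomerative_cluster(embeddings, threshold=0.4):
    """Greedy average-linkage clustering: repeatedly merge the two
    closest clusters until no pair is closer than `threshold`."""
    clusters = [[i] for i in range(len(embeddings))]

    def centroid(c):
        return np.mean([embeddings[i] for i in c], axis=0)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cosine_dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # remaining clusters are distinct speakers
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With well‑separated embeddings, two synthetic speakers collapse into exactly two clusters; in practice the threshold trades cluster purity against fragmentation, which is the tension the article's "purity over quantity" choice resolves.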
To address these issues, iQIYI proposes a multimodal speaker diarization method. Audio is split using subtitle timestamps as segmentation points, and scene‑transition detection removes opening/ending songs and interludes. A proprietary voiceprint model extracts features, which are clustered to form high‑purity dialogue clusters. Finally, an Active Speaker Detection (ASD) algorithm together with face recognition assigns speaker identities to each audio segment through a multi‑level association strategy.
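Using subtitle timestamps as cut points is straightforward to sketch. The article does not name a subtitle format, so the snippet below assumes SRT‑style `HH:MM:SS,mmm --> HH:MM:SS,mmm` cue lines purely for illustration; each resulting `(start, end)` pair defines one short clip expected to contain a single speaker's line.

```python
import re

# SRT-style timestamp, e.g. "00:01:02,500" (format assumed, not stated in the article)
TS = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")

def parse_ts(s):
    """Convert 'HH:MM:SS,mmm' to seconds as a float."""
    h, m, sec, ms = map(int, TS.match(s).groups())
    return h * 3600 + m * 60 + sec + ms / 1000.0

def subtitle_segments(lines):
    """Turn 'start --> end' cue lines into (start, end) pairs in seconds.
    These pairs replace VAD output as the audio segmentation points."""
    segs = []
    for line in lines:
        if "-->" in line:
            start, end = (part.strip() for part in line.split("-->"))
            segs.append((parse_ts(start), parse_ts(end)))
    return segs
```

For example, the cue `00:01:02,500 --> 00:01:04,000` yields the segment `(62.5, 64.0)`, i.e. a 1.5‑second clip for one subtitle line.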
Technical solution (three modules):

1. Audio‑Video Splitting: Combine scene‑transition detection and music recognition to discard non‑dialogue segments. Use subtitle start/end times rather than VAD so that each short audio clip corresponds to a single speaker.
2. Voiceprint Feature Extraction & Clustering: Build a large‑scale drama voiceprint dataset (2,000 speakers, 270k utterances, ~200 h total) from iQIYI’s long‑video library. Train a custom voiceprint model and perform hierarchical clustering, first within scenes, then across the whole episode, prioritizing purity over quantity.
3. Multi‑Level Speaker Association: Associate speakers at three granularities: dialogue, scene, and whole episode. At the dialogue level, ASD and face recognition link each subtitle line to the speaking person. At the scene level, cluster‑level speaker attributes correct intra‑cluster errors. At the episode level, global clustering fills gaps where lower‑level association fails.
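The scene‑level correction step in module 3 can be illustrated with a simple majority vote. The function below is a hypothetical sketch, not iQIYI's implementation: within each voiceprint cluster it takes the majority identity produced by dialogue‑level ASD + face recognition and uses it to overwrite dissenting or missing labels, which is one way "cluster‑level speaker attributes correct intra‑cluster errors".

```python
from collections import Counter

def scene_level_correct(labels, clusters):
    """Scene-level label correction (illustrative sketch).

    labels:   {segment_id: speaker_name or None} from dialogue-level
              ASD + face recognition.
    clusters: lists of segment ids that share one voiceprint cluster.

    Within each cluster, the majority identity wins and fills gaps."""
    fixed = dict(labels)
    for cluster in clusters:
        votes = Counter(labels[i] for i in cluster if labels.get(i))
        if not votes:
            continue  # no dialogue-level evidence; leave for episode level
        majority, _ = votes.most_common(1)[0]
        for i in cluster:
            fixed[i] = majority
    return fixed
```

Clusters with no dialogue‑level label at all are deliberately left untouched here; per the article, those gaps are filled by episode‑level global clustering.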
The ASD algorithm is a multimodal, end‑to‑end detector that takes candidate face sequences and corresponding audio as input, extracts visual and audio features, fuses them via an attention‑based module, and outputs a binary speaking‑person decision. This handles cases where multiple faces appear, the speaker is off‑camera, or the speaker faces away.
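The attention‑based fusion can be sketched as cross‑modal attention: audio frames attend over the candidate face sequence, and the attended visual context is concatenated with the audio features before the final speaking/non‑speaking classifier. The article does not describe the exact architecture, so the NumPy snippet below is only a minimal, assumption‑laden illustration of that fusion pattern.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(audio_feats, face_feats):
    """Cross-modal attention fusion (illustrative, not iQIYI's model).

    audio_feats: (Ta, d) audio frame embeddings (queries).
    face_feats:  (Tv, d) face-track embeddings (keys/values).

    Returns (Ta, 2d): each audio frame concatenated with its
    attention-weighted visual context, ready for a binary head."""
    d = audio_feats.shape[-1]
    scores = audio_feats @ face_feats.T / np.sqrt(d)   # (Ta, Tv)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    visual_context = weights @ face_feats              # (Ta, d)
    return np.concatenate([audio_feats, visual_context], axis=-1)
```

Because attention pools over whichever faces are present, the same mechanism degrades gracefully when the speaker is off‑camera or turned away: the visual context simply carries little evidence and the decision leans on the audio features.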
Results: The custom iQIYI voiceprint model outperforms open‑source models on both the internal drama dataset and public benchmarks (see comparison table in the original). The overall diarization pipeline achieves speaker identification precision and recall around 90% and has been deployed in multiple iQIYI services such as video summarization, description, and translation.
Conclusion & Future Work: The technology is already in production, and future research will focus on improving the underlying algorithms, enhancing clustering methods, and incorporating dialogue semantics to build a more complete speaker diarization system for iQIYI’s video ecosystem.
iQIYI Technical Product Team