Tag

voiceprint clustering


iQIYI Technical Product Team
Nov 7, 2024 · Artificial Intelligence

Multimodal Speaker Diarization for Long-Form Video Scripts

iQIYI's multimodal speaker diarization system segments long-form video using subtitle timestamps and scene detection, extracts voiceprints with a custom model, and groups them by hierarchical clustering. It then combines an Active Speaker Detection algorithm with face recognition to assign speakers, reaching around 90% precision and recall and improving downstream tasks such as summarization, translation, and dubbing.
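
The hierarchical clustering step might look roughly like the sketch below. It is an illustration only, not iQIYI's implementation: the summary does not name the distance metric, linkage rule, or stopping criterion, so cosine distance, average linkage, the `threshold` value, the function name `cluster_voiceprints`, and the toy 3-dimensional embeddings are all assumptions. Embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_voiceprints(embeddings, threshold=0.3):
    """Average-linkage agglomerative clustering (assumed variant).

    Returns one integer cluster label per embedding. `threshold` is a
    hypothetical cut-off: merging stops once the closest pair of clusters
    is farther apart than this value.
    """
    # Start with every segment in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between members of two clusters.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j])
            for i in c1
            for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are treated as distinct speakers
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy voiceprints for four speech segments (made-up values).
segments = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
labels = cluster_voiceprints(segments)
print(labels)
```

In this toy input the first two segments and the last two segments end up in separate clusters. In a pipeline like the one described, such cluster labels would then be matched to on-screen identities by the Active Speaker Detection and face-recognition stage.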

dialogue recognition · iQIYI · multimodal AI
0 likes · 8 min read