iQIYI Technical Product Team
Nov 7, 2024 · Artificial Intelligence
Multimodal Speaker Diarization for Long-Form Video Scripts
iQIYI’s multimodal speaker diarization system segments long-form video using subtitle timestamps and scene detection, extracts voiceprints with a custom model, and clusters them hierarchically. It then combines an Active Speaker Detection (ASD) algorithm with face recognition to assign speakers. The system reaches roughly 90% precision and recall and improves downstream tasks such as summarization, translation, and dubbing.
dialogue recognition · iQIYI · multimodal AI
8 min read
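The hierarchical clustering of voiceprints mentioned in the summary can be illustrated with a minimal sketch. The summary does not say which distance metric, linkage rule, or stopping criterion iQIYI uses, so the cosine distance, average linkage, and the `threshold` value below are assumptions, and `cluster_voiceprints` is a hypothetical helper name rather than part of the described system.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Cosine distance between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_voiceprints(embeddings, threshold=0.3):
    """Agglomerative (bottom-up) clustering with average linkage.

    Assumption: merging stops once the closest pair of clusters is
    farther apart than `threshold`. Returns one cluster label per
    embedding, in input order.
    """
    # Start with every voiceprint in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between members of two clusters.
        dists = [
            cosine_distance(embeddings[i], embeddings[j])
            for i in c1
            for j in c2
        ]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j by construction).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break  # remaining clusters are treated as distinct speakers
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: j > i, so index i is unaffected

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy 2-D "voiceprints": two segments per speaker.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = cluster_voiceprints(toy_embeddings)
print(labels)
```

In a real pipeline the embeddings would come from the voiceprint model, one per subtitle-aligned segment, and the resulting cluster labels would then be matched to on-screen identities by the ASD and face-recognition stage. This brute-force version recomputes linkages on every pass, which is fine for a sketch but slow for long videos.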