Multimodal Speaker Diarization for Long-Form Video Scripts
iQIYI’s multimodal speaker diarization system splits long‑form video using subtitle timestamps and scene detection, extracts voiceprints with a custom model, hierarchically clusters them, and applies an Activate Speaker Detection algorithm combined with face‑recognition to assign speakers, achieving around 90 % precision and recall and boosting downstream tasks such as summarization, translation, and dubbing.