ControlFoley: An Open‑Source Model for Fully Controllable Video Sound Generation
ControlFoley, released by Xiaomi's large‑model team, is an open‑source framework for generating video‑aligned sound effects. Creators can explicitly control content, style, and timing through text prompts, the video itself, or reference audio. The team reports SOTA performance on multiple benchmarks.
Video‑to‑audio (V2A) models can automatically add sound to silent video, but creators often need finer control over the sound content and style. Existing methods typically follow the visual cue, ignoring textual or audio references, which limits creative flexibility.
ControlFoley Overview
ControlFoley is a unified, controllable video sound generation framework that supports three task types:
TV2A (Text‑guided Video‑to‑Audio) : generate synchronized sound from video and optional text that refines the semantic meaning.
TC‑V2A (Text‑Controlled Video Dubbing) : when text and visual semantics conflict, the model follows the text intent while preserving temporal alignment with the video.
AC‑V2A (Audio‑Reference Controlled Video Dubbing) : use a reference audio clip to control timbre and style while the video dictates timing.
The model is designed to be "controllable"—the word "Control" in its name reflects the ability for creators to dictate what the sound should be, even when visual and textual cues disagree.
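To make the three settings concrete, the sketch below maps a combination of inputs to the task name it corresponds to. It is only an illustration: `FoleyRequest`, `select_task`, and the `text_conflicts_with_video` flag are hypothetical names that do not come from the ControlFoley codebase, and the unified model itself does not necessarily need any such routing step.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Task(Enum):
    TV2A = "text-guided video-to-audio"
    TC_V2A = "text-controlled video dubbing"
    AC_V2A = "audio-reference controlled video dubbing"


@dataclass
class FoleyRequest:
    """Hypothetical request container (an assumption, not a ControlFoley type)."""
    video_path: str
    text_prompt: Optional[str] = None
    reference_audio_path: Optional[str] = None
    # Supplied by the caller; this sketch does not detect conflicts itself.
    text_conflicts_with_video: bool = False


def select_task(req: FoleyRequest) -> Task:
    """Name the setting that a given input combination corresponds to."""
    if req.reference_audio_path is not None:
        # Reference audio controls timbre and style; video dictates timing.
        return Task.AC_V2A
    if req.text_prompt is not None and req.text_conflicts_with_video:
        # Text overrides visual semantics; video still dictates timing.
        return Task.TC_V2A
    # Video alone, or video plus text that refines its meaning.
    return Task.TV2A
```

For example, `select_task(FoleyRequest("clip.mp4", reference_audio_path="ref.wav"))` returns `Task.AC_V2A`, while a request with only a video path falls back to `Task.TV2A`.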
Key Technical Innovations
Joint Visual Encoding (CAV‑MAE‑ST) : A spatio‑temporal audio‑visual encoder trained specifically for video dubbing, enhancing the model’s understanding of action timing and audio‑visual synchronization beyond generic CLIP representations.
Time‑Timbre Decoupling : Reference audio is processed to suppress its temporal structure, preserving only global timbre features. This prevents the reference’s rhythm from disrupting video‑driven synchronization.
Modal‑Robust Training : Random modality dropout and a unified multimodal alignment objective (REPA) enable the model to handle any combination of inputs (video only, video + text, video + audio, or all three) without performance degradation.
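Two of these ideas can be pictured with a minimal sketch. It is not ControlFoley's implementation. Mean pooling over time stands in for whatever mechanism the model actually uses to suppress temporal structure. The function names, the `p_drop` default, and the choice to always keep the video input are assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def global_timbre(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame features over time.

    The result is the same for any ordering of the frames, so it carries
    no rhythm or onset information, only a global summary of the clip.
    """
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def drop_modalities(
    inputs: Dict[str, Optional[object]],
    p_drop: float = 0.3,  # assumed value, not taken from the source
    rng: Optional[random.Random] = None,
    always_keep: Tuple[str, ...] = ("video",),  # assumption: video is never dropped
) -> Dict[str, Optional[object]]:
    """Randomly replace optional conditioning inputs with None during training."""
    rng = rng or random.Random()
    out: Dict[str, Optional[object]] = {}
    for name, value in inputs.items():
        if name in always_keep or value is None:
            out[name] = value
        else:
            out[name] = None if rng.random() < p_drop else value
    return out


if __name__ == "__main__":
    ref_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(global_timbre(ref_frames))  # [3.0, 4.0]
    batch = {"video": "video_feats", "text": "text_feats", "ref_audio": "audio_feats"}
    print(drop_modalities(batch, rng=random.Random(0)))
```

Because each training step sees a different subset of conditions, a model trained this way is exposed to video only, video + text, video + audio, and all three, which matches the input combinations listed above.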
Benchmark Evaluation
ControlFoley was evaluated on three task families:
TV2A : Achieved open‑source SOTA on VGGSound‑Test, Kling‑Audio‑Eval, and MovieGen‑Audio‑Bench, showing superior semantic alignment, timing, and sound quality.
TC‑V2A : Tested on the VGGSound‑TVC benchmark, which varies text‑visual conflict from level L0 to L3. As conflict increased, the IB score (dependence on video semantics) decreased while the CLAP score (text‑audio semantic similarity) remained high, demonstrating strong text control. DeSync scores stayed low across conflict levels, indicating that text‑driven control does not compromise audio‑video synchronization.
AC‑V2A : Compared against CondFoleyGen and the commercial Kling‑Foley. ControlFoley outperformed both on timbre similarity, temporal sync, and overall audio quality, confirming effective style transfer from reference audio while maintaining video‑driven timing.
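The IB and CLAP scores mentioned for TC‑V2A measure how close the generated audio is to the video and to the text, respectively. The source does not spell out how they are computed. The sketch below assumes both are cosine similarities between embedding vectors, which is a common way to define such scores. The embeddings would come from pretrained encoders that are not shown here, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_similarity(
    pairs: Sequence[Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Average similarity over (audio_embedding, other_embedding) pairs.

    With video embeddings as the second element this plays the role of an
    IB-style score; with text embeddings, a CLAP-style score (assumption).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

Under this reading, the reported trend is the expected one: as the text moves away from what the video shows, audio that follows the text drifts away from the video embedding while staying close to the text embedding.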
Practical Impact
By decoupling modality responsibilities—visual cues dictate when sounds occur, text or reference audio dictate what they sound like—ControlFoley enables creators to produce tailored sound effects for short videos, film clips, game animations, or ads. The open‑source release includes code, model weights, a technical report, an online demo, and a ready‑to‑use Skill, facilitating reproducibility and further research.
Resources
For full results and interactive demos, see the project homepage and GitHub repository linked below.
Technical Report: https://arxiv.org/abs/2604.15086
GitHub: https://github.com/xiaomi-research/controlfoley
Model Weights: https://huggingface.co/YJX-Xiaomi/ControlFoley
Online Demo & Skill: https://yjx-research.github.io/ControlFoley_web_page/
