Xiaomi’s AI Breakthroughs Earn Spot at ICASSP 2026
Xiaomi announced that a suite of its AI research papers has been accepted to ICASSP 2026. The papers cover a large‑scale audio‑text dataset, a federated learning framework for domain and class generalization, a dual‑encoder music evaluation model, a cross‑domain audio‑text pre‑training system, a one‑step video‑to‑audio synthesis method, a training‑free frame‑selection technique for long‑video understanding, and a unified multimodal retrieval architecture. Together they report methods and benchmark results across audio, vision, and multimodal AI applications.
ICASSP 2026 Acceptance Overview
The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026 accepted multiple Xiaomi research papers spanning audio understanding, music generation evaluation, general audio‑text pre‑training, video‑to‑audio synthesis, long‑video understanding, federated learning generalization, and multimodal multilingual retrieval. The selections highlight Xiaomi’s sustained investment in signal‑processing AI.
ACAVCaps: Large‑Scale Fine‑Grained Audio Understanding Dataset
The ACAVCaps dataset contains roughly 4.7 million audio‑text pairs. An automated pipeline extracts sound events, music features, speaker attributes, and speech content using parallel expert models, then employs a large language model with a Chain‑of‑Thought (CoT) reasoning strategy to integrate fragmented metadata into coherent natural‑language annotations. This multi‑level labeling transforms isolated tags into hierarchical, context‑rich descriptions, and the dataset will be fully open‑sourced.
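The integration step can be pictured as merging per‑expert outputs into a single prompt for the language model. The sketch below is only an illustration: the function name `build_caption_prompt`, the metadata keys, and the prompt wording are all hypothetical and are not taken from the ACAVCaps pipeline.

```python
def build_caption_prompt(metadata: dict[str, list[str]]) -> str:
    """Assemble one LLM prompt from per-expert metadata (hypothetical format)."""
    sections = []
    for expert, findings in metadata.items():
        if findings:  # skip experts that detected nothing in this clip
            sections.append(f"- {expert}: {', '.join(findings)}")
    return (
        "You are given outputs from independent audio analysis models.\n"
        + "\n".join(sections)
        + "\nThink step by step about how these cues relate, "
        "then write one fluent caption describing the clip."
    )


# Made-up expert outputs for a single clip (illustrative values only).
example = {
    "sound_events": ["door slam", "footsteps"],
    "music": [],
    "speaker": ["adult female", "calm tone"],
    "speech_content": ["'I'll be right back'"],
}
prompt = build_caption_prompt(example)
print(prompt)
```

Empty expert outputs are dropped so the model is not asked to reason about cues that were never detected.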
FedDCG: Federated Joint Learning for Domain and Class Generalization
FedDCG introduces a domain‑grouping strategy that partitions client data by domain and trains independent class‑generalization networks within each group to avoid decision‑boundary confusion. The framework combines class‑specific domain‑group collaborative training with cross‑attention prompts and separates global and domain prompts to decouple generic and specific knowledge. Experiments on Office‑Home (tested on ImageNet‑R) achieve an average accuracy of 70.30 %, about 3 % higher than the next‑best method, DiPrompT, and the advantage holds under a 50 % low‑sampling regime.
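A minimal sketch of the domain‑grouping idea follows. It assumes each client reports a domain label, a sample count, and a flat weight vector, and it uses plain FedAvg‑style weighted averaging within each group as a stand‑in for the paper's training procedure. The prompt separation and cross‑attention components are not modeled, and all names and values are hypothetical.

```python
from collections import defaultdict


def group_by_domain(clients):
    """Partition client updates by their domain label."""
    groups = defaultdict(list)
    for client in clients:
        groups[client["domain"]].append(client)
    return dict(groups)


def weighted_average(clients):
    """FedAvg-style aggregation: weight each client by its sample count."""
    total = sum(c["n_samples"] for c in clients)
    dim = len(clients[0]["weights"])
    return [
        sum(c["weights"][i] * c["n_samples"] for c in clients) / total
        for i in range(dim)
    ]


# Toy client updates (illustrative values, not from the paper).
clients = [
    {"domain": "art", "n_samples": 100, "weights": [1.0, 2.0]},
    {"domain": "art", "n_samples": 300, "weights": [3.0, 6.0]},
    {"domain": "photo", "n_samples": 50, "weights": [10.0, 0.0]},
]

# One aggregated model per domain group, so domains never mix their updates.
domain_models = {
    domain: weighted_average(group)
    for domain, group in group_by_domain(clients).items()
}
print(domain_models)
```

Keeping one aggregate per group is what prevents updates from one domain from shifting another domain's decision boundaries in this toy setup.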
FUSEMOS: Dual‑Encoder Perceptual Evaluation for Text‑to‑Music Generation
FUSEMOS fuses the CLAP and MERT pre‑trained models via a late‑fusion architecture. CLAP aligns audio and text semantics, while MERT captures melodic, rhythmic, and harmonic structures from large‑scale music data. A ranking‑aware composite loss (truncated regression + contrastive ranking) improves both mean‑square error and Spearman correlation on the MusicEval benchmark, delivering predictions that better reflect human preference ordering.
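To make the idea of a ranking‑aware composite loss concrete, here is a small sketch. The exact loss forms used by FUSEMOS are not given above, so this example assumes a truncated L1 regression term (errors below a tolerance `tau` are ignored) plus a pairwise hinge ranking term as a simple substitute for the contrastive ranking component. The parameters `tau`, `margin`, and `alpha` are hypothetical.

```python
def truncated_l1(pred, target, tau=0.25):
    """Regression term: absolute errors smaller than tau cost nothing."""
    return sum(max(abs(p - t) - tau, 0.0) for p, t in zip(pred, target)) / len(pred)


def pairwise_ranking(pred, target, margin=0.1):
    """Ranking term: penalize pairs whose predicted order contradicts the labels."""
    losses = []
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            if target[i] == target[j]:
                continue  # ties carry no ordering information
            sign = 1.0 if target[i] > target[j] else -1.0
            losses.append(max(0.0, margin - sign * (pred[i] - pred[j])))
    return sum(losses) / len(losses) if losses else 0.0


def composite_loss(pred, target, alpha=0.5):
    """Weighted sum of the regression and ranking terms (alpha is an assumption)."""
    return truncated_l1(pred, target) + alpha * pairwise_ranking(pred, target)


human = [4.0, 3.0, 2.0]           # toy mean opinion scores
well_ordered = [3.9, 3.1, 2.2]    # close values, correct order
mis_ordered = [2.2, 3.1, 3.9]     # reversed order

print(composite_loss(well_ordered, human))
print(composite_loss(mis_ordered, human))
```

The ranking term is what rewards a model for ordering clips the way listeners do, even when the absolute scores are slightly off.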
GLAP: General Contrastive Audio‑Text Pre‑Training Across Domains and Languages
GLAP jointly optimizes speech, music, and environmental sound retrieval and classification within a single framework. It reaches recall@1 of ≈94 % on LibriSpeech (English) and ≈99 % on AISHELL‑2 (Chinese), while maintaining state‑of‑the‑art performance on AudioCaps. The model exhibits zero‑shot keyword spotting in 50 languages without target‑language fine‑tuning.
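For readers unfamiliar with the metric, recall@1 measures how often the top‑ranked retrieved item is the correct match. The sketch below computes it from a similarity matrix, assuming query `i` is paired with candidate `i`. The matrix values are made up for illustration and have no relation to GLAP's reported numbers.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose true match (same index) is in the top-k candidates."""
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(similarity)


# Rows are audio queries, columns are text candidates (toy similarities).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
    [0.7, 0.2, 0.6],  # true match (column 2) is outscored by column 0
]

print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```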
MeanFlow: One‑Step Multimodal Video‑to‑Audio Synthesis
MeanFlow replaces traditional flow‑matching with average‑velocity‑field modeling, enabling one‑step generation of video‑synchronized audio (V2A). A scalar rescaling mechanism balances conditional and unconditional predictions, mitigating distortion in one‑step synthesis. The model achieves 2× to 500× faster inference (e.g., generating 8 s of audio in 0.056 s) while preserving state‑of‑the‑art audio quality, temporal alignment, and cross‑modal consistency, and it extends naturally to text‑to‑audio tasks.
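The toy example below suggests why modeling an average velocity permits one‑step sampling. It assumes the common convention that t = 1 is noise and t = 0 is data, and it uses a made‑up scalar velocity v(t) = a·t whose average over an interval has a closed form. In the real model the average velocity would be predicted by a trained network, and the scalar rescaling mechanism is not modeled here.

```python
def instantaneous_velocity(t, a=2.0):
    """Toy instantaneous velocity field v(t) = a * t (independent of the state)."""
    return a * t


def average_velocity(r, t, a=2.0):
    """Closed form of (1 / (t - r)) * integral from r to t of a * tau, i.e. a * (t + r) / 2."""
    return a * (t + r) / 2.0


def euler_sample(z1, steps, a=2.0):
    """Multi-step sampler: integrate the instantaneous velocity from t=1 down to t=0."""
    z, dt = z1, 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        z -= dt * instantaneous_velocity(t, a)
    return z


def one_step_sample(z1, a=2.0):
    """One-step sampler: a single jump using the average velocity over [0, 1]."""
    return z1 - (1.0 - 0.0) * average_velocity(0.0, 1.0, a)


z1 = 3.0  # starting "noise" value
print(one_step_sample(z1))     # single evaluation of the average velocity
print(euler_sample(z1, 1))     # single Euler step with the instantaneous velocity
print(euler_sample(z1, 1000))  # many Euler steps approach the one-step result
```

In this toy setting a single Euler step with the instantaneous velocity overshoots, while a single step with the average velocity lands where the fine‑grained integration converges.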
Think‑Clip‑Sample (TCS): Training‑Free Frame Selection for Long‑Video Understanding
TCS employs multi‑query reasoning to generate diverse semantic queries from a question, then uses CLIP to score frame similarity. A clip‑level slow‑fast sampling scheme splits the frame budget into dense “slow” sampling on high‑similarity segments and sparse “fast” sampling elsewhere, balancing detail and global context. On the MLVU, LongVideoBench, and VideoMME benchmarks, TCS improves accuracy by up to 6.9 % (MLVU) and cuts inference time by over 50 % on models such as Qwen2‑VL‑7B and MiMo‑VL‑7B.
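The slow‑fast split can be sketched as follows. This is a simplified illustration under several assumptions: per‑frame similarity scores are already available, clips have fixed length, the top‑scoring clips receive a fixed share of the budget, and frames are picked uniformly within each pool. The names `slow_ratio` and `top_clips` are hypothetical, not parameters from the paper.

```python
def uniform_pick(items, k):
    """Pick k roughly evenly spaced elements from a list."""
    if k <= 0 or not items:
        return []
    if k >= len(items):
        return list(items)
    step = len(items) / k
    return [items[int(i * step)] for i in range(k)]


def slow_fast_sample(scores, clip_len, budget, slow_ratio=0.5, top_clips=1):
    """Spend part of the frame budget densely on the best clips, the rest sparsely elsewhere."""
    n = len(scores)
    clips = [(s, min(s + clip_len, n)) for s in range(0, n, clip_len)]
    clip_scores = [sum(scores[a:b]) / (b - a) for a, b in clips]
    ranked = sorted(range(len(clips)), key=lambda i: clip_scores[i], reverse=True)
    slow_ids = set(ranked[:top_clips])

    slow_frames = [f for i in sorted(slow_ids) for f in range(*clips[i])]
    fast_frames = [
        f for i in range(len(clips)) if i not in slow_ids for f in range(*clips[i])
    ]

    slow_budget = min(len(slow_frames), round(budget * slow_ratio))
    fast_budget = min(len(fast_frames), budget - slow_budget)
    return sorted(
        uniform_pick(slow_frames, slow_budget) + uniform_pick(fast_frames, fast_budget)
    )


# 20 frames; frames 10-14 are highly similar to the query (toy scores).
scores = [0.1] * 10 + [0.9] * 5 + [0.1] * 5
selected = slow_fast_sample(scores, clip_len=5, budget=8)
print(selected)
```

Half of the eight selected frames fall inside the five‑frame high‑similarity clip, while the other half are spread across the remaining fifteen frames to retain global context.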
Unified Multimodal and Multilingual Retrieval via Multi‑Task Learning with NLU Integration
The proposed framework consolidates image search, text search, and intent understanding into a two‑model architecture. A shared text encoder aligns image and text semantics, while cross‑attention with an NLU model injects intent awareness. The system attains 93.3 % average recall on XTD10 and 94.8 % on Multi30K, surpassing baselines by 1.1 to 2.7 %, and reaches 85.1 % on COCO‑QLTI. It also improves retrieval speed and reduces memory footprint.
Conclusion
These accepted works illustrate Xiaomi’s comprehensive AI research pipeline—from dataset creation and model innovation to cross‑modal and multilingual capabilities—positioning the company to integrate cutting‑edge AI across its “person‑car‑home” ecosystem.
