Xiaomi’s AI Breakthroughs Earn Spot at ICASSP 2026

Xiaomi announced that a suite of its AI research papers has been accepted to ICASSP 2026. The papers cover a large‑scale audio‑text dataset, a federated learning framework for domain and class generalization, a dual‑encoder music evaluation model, a cross‑domain audio‑text pre‑training system, a one‑step video‑to‑audio synthesis method, a training‑free frame‑selection technique for long‑video understanding, and a unified multimodal retrieval architecture. This summary outlines each paper’s method, benchmark results, and potential impact across audio, vision, and multimodal AI applications.

Xiaomi Tech

Jan 21, 2026

ICASSP 2026 Acceptance Overview

The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026 has accepted multiple Xiaomi research papers spanning audio understanding, music generation evaluation, general audio‑text pre‑training, video‑to‑audio synthesis, long‑video understanding, federated learning generalization, and multimodal multilingual retrieval. The selections highlight Xiaomi’s sustained investment in signal‑processing AI.

ACAVCaps: Large‑Scale Fine‑Grained Audio Understanding Dataset

The ACAVCaps dataset contains roughly 4.7 million audio‑text pairs. An automated pipeline extracts sound events, music features, speaker attributes, and speech content using parallel expert models, then employs a large language model with a Chain‑of‑Thought (CoT) reasoning strategy to integrate fragmented metadata into coherent natural‑language annotations. This multi‑level labeling transforms isolated tags into hierarchical, context‑rich descriptions, and the dataset will be fully open‑sourced.
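The annotation flow described above can be sketched roughly as follows. This is only an illustration: the four expert functions, their return values, and the prompt wording are hypothetical stand-ins, and no real LLM is called. It shows the two stages the text names, parallel metadata extraction followed by a step-by-step prompt that asks a language model to merge the fragments.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the expert models; each returns one metadata fragment.
def sound_event_expert(clip_id):
    return {"sound_events": ["dog bark", "door slam"]}

def music_expert(clip_id):
    return {"music": "none"}

def speaker_expert(clip_id):
    return {"speaker": "adult male, calm"}

def speech_expert(clip_id):
    return {"transcript": "come here, boy"}

EXPERTS = [sound_event_expert, music_expert, speaker_expert, speech_expert]

def collect_metadata(clip_id):
    """Run all experts in parallel and merge their fragments into one dict."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(EXPERTS)) as pool:
        for fragment in pool.map(lambda expert: expert(clip_id), EXPERTS):
            merged.update(fragment)
    return merged

def build_cot_prompt(metadata):
    """Format the merged metadata as a step-by-step captioning prompt (assumed wording)."""
    lines = [f"- {key}: {value}" for key, value in sorted(metadata.items())]
    return (
        "Audio metadata:\n" + "\n".join(lines) + "\n"
        "Think step by step: relate the events, speech, and speakers, "
        "then write one coherent caption."
    )

metadata = collect_metadata("clip_0001")
prompt = build_cot_prompt(metadata)
print(prompt)
```

In a real pipeline the prompt would be sent to a language model and the returned caption stored alongside the audio clip.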

FedDCG: Federated Joint Learning for Domain and Class Generalization

FedDCG introduces a domain‑grouping strategy that partitions client data by domain and trains independent class‑generalization networks within each group to avoid decision‑boundary confusion. The framework combines class‑specific domain‑group collaborative training with cross‑attention prompts and separates global and domain prompts to decouple generic and specific knowledge. In experiments on Office‑Home (tested on ImageNet‑R), it achieves an average accuracy of 70.30 %, about 3 % higher than the next‑best method, DiPrompT, and it remains ahead under a 50 % low‑sampling regime.
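To make the split between global and domain prompts concrete, here is a minimal server-side aggregation sketch. It assumes plain averaging (FedAvg-style) as the aggregation rule, which the summary does not confirm, and the field names `domain`, `global_prompt`, and `domain_prompt` are hypothetical. The point is only that generic knowledge is pooled across all clients while domain-specific knowledge is pooled within each domain group.

```python
from collections import defaultdict

def average(vectors):
    """Element-wise mean of equal-length lists of floats."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def aggregate_prompts(client_updates):
    """Average global prompts over all clients and domain prompts per domain group."""
    groups = defaultdict(list)
    for update in client_updates:
        groups[update["domain"]].append(update["domain_prompt"])
    # Generic knowledge: shared by every client regardless of domain.
    global_prompt = average([u["global_prompt"] for u in client_updates])
    # Domain-specific knowledge: never mixed across groups.
    domain_prompts = {domain: average(ps) for domain, ps in groups.items()}
    return global_prompt, domain_prompts

# Three hypothetical clients, two from the "art" domain and one from "photo".
updates = [
    {"domain": "art", "global_prompt": [1.0, 0.0], "domain_prompt": [2.0, 2.0]},
    {"domain": "art", "global_prompt": [3.0, 0.0], "domain_prompt": [4.0, 0.0]},
    {"domain": "photo", "global_prompt": [2.0, 3.0], "domain_prompt": [0.0, 1.0]},
]
global_prompt, domain_prompts = aggregate_prompts(updates)
print(global_prompt)   # [2.0, 1.0]
print(domain_prompts)  # {'art': [3.0, 1.0], 'photo': [0.0, 1.0]}
```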

FUSEMOS: Dual‑Encoder Perceptual Evaluation for Text‑to‑Music Generation

FUSEMOS fuses the CLAP and MERT pre‑trained models via a late‑fusion architecture. CLAP aligns audio and text semantics, while MERT captures melodic, rhythmic, and harmonic structures from large‑scale music data. A ranking‑aware composite loss (truncated regression + contrastive ranking) improves both mean‑square error and Spearman correlation on the MusicEval benchmark, delivering predictions that better reflect human preference ordering.
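A ranking-aware composite loss of this kind might look like the sketch below. The exact formulation is not given in this summary, so two assumptions are made and should be read as such: the truncated regression term is taken to be an absolute error that is ignored below a threshold `tau`, and a pairwise hinge loss stands in for the contrastive ranking term. The threshold, margin, and weight values are illustrative.

```python
def truncated_l1(pred, target, tau=0.25):
    """Absolute error, ignored when it is already below tau (assumed form)."""
    err = abs(pred - target)
    return err if err > tau else 0.0

def pairwise_ranking(preds, targets, margin=0.1):
    """Hinge penalty for pairs whose predicted order contradicts the labels."""
    total, pairs = 0.0, 0
    for i in range(len(preds)):
        for j in range(i + 1, len(preds)):
            if targets[i] == targets[j]:
                continue  # tied labels carry no ordering information
            sign = 1.0 if targets[i] > targets[j] else -1.0
            total += max(0.0, margin - sign * (preds[i] - preds[j]))
            pairs += 1
    return total / pairs if pairs else 0.0

def composite_loss(preds, targets, weight=1.0):
    """Truncated regression term plus a weighted ranking term."""
    reg = sum(truncated_l1(p, t) for p, t in zip(preds, targets)) / len(preds)
    return reg + weight * pairwise_ranking(preds, targets)

# Scores are close to the labels but in the wrong order.
swapped = composite_loss([3.1, 3.0], [3.0, 3.1])
# Scores match the labels exactly.
perfect = composite_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(swapped > 0.0, perfect)
```

The `swapped` case shows why the two terms complement each other: both errors fall under the regression threshold, so only the ranking term penalizes the inverted order.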

GLAP: General Contrastive Audio‑Text Pre‑Training Across Domains and Languages

GLAP jointly optimizes speech, music, and environmental sound retrieval and classification within a single framework. It reaches recall@1 of ≈94 % on LibriSpeech (English) and ≈99 % on AISHELL‑2 (Chinese), while maintaining state‑of‑the‑art performance on AudioCaps. The model exhibits zero‑shot keyword spotting in 50 languages without target‑language fine‑tuning.
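For readers unfamiliar with the metric, recall@1 in a contrastive retrieval setting can be computed as in the following sketch. The two-dimensional embeddings are made-up toy values, not GLAP outputs, and cosine similarity is assumed as the scoring function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def recall_at_1(text_embs, audio_embs):
    """Fraction of texts whose most similar audio is the paired one (same index)."""
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, audio) for audio in audio_embs]
        if scores.index(max(scores)) == i:
            hits += 1
    return hits / len(text_embs)

# Toy embeddings: pair i consists of text_embs[i] and audio_embs[i].
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_embs = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.2]]

recall = recall_at_1(text_embs, audio_embs)
print(round(recall, 3))  # 0.667: the third text retrieves the second audio instead
```

Zero-shot keyword spotting can be framed the same way: each keyword is embedded as text, and the keyword with the highest similarity to the audio clip is taken as the prediction.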

MeanFlow: One‑Step Multimodal Video‑to‑Audio Synthesis

MeanFlow replaces traditional flow matching with average‑velocity‑field modeling, enabling one‑step generation of video‑synchronized audio (V2A). A scalar rescaling mechanism balances conditional and unconditional predictions, mitigating distortion in one‑step synthesis. The model achieves 2× to 500× faster inference (e.g., generating 8 s of audio in 0.056 s) while preserving state‑of‑the‑art audio quality, temporal alignment, and cross‑modal consistency, and it extends naturally to text‑to‑audio tasks.
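The reason an average velocity permits one-step sampling can be seen on a toy problem. If u(z_t, r, t) is the mean of the instantaneous velocity along the trajectory between times r and t, then z_r = z_t - (t - r) * u(z_t, r, t) holds exactly, so a single evaluation replaces many small integration steps. The sketch below assumes a hypothetical scalar velocity field v(z, t) = A * z, chosen only because its average velocity has a closed form. In the actual model a neural network would learn this quantity.

```python
import math

A = 2.0  # toy rate constant; the instantaneous velocity is v(z, t) = A * z

def velocity(z, t):
    """Instantaneous velocity of the toy flow."""
    return A * z

def average_velocity(z_t, r, t):
    """Closed-form mean of `velocity` along the trajectory from time r to t."""
    return z_t * (1.0 - math.exp(-A * (t - r))) / (t - r)

def euler_sample(z_1, steps):
    """Integrate backward from t=1 to t=0 with the instantaneous velocity."""
    z, dt = z_1, 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        z = z - dt * velocity(z, t)
    return z

def one_step_sample(z_1):
    """Single jump from t=1 to t=0 using the average velocity."""
    return z_1 - (1.0 - 0.0) * average_velocity(z_1, 0.0, 1.0)

z_1 = 1.0
exact = z_1 * math.exp(-A)  # true endpoint of the toy trajectory
print("exact        ", exact)
print("one step     ", one_step_sample(z_1))
print("Euler, 1 step", euler_sample(z_1, 1))
print("Euler, 100   ", euler_sample(z_1, 100))
```

One Euler step with the instantaneous velocity lands far from the true endpoint, while one step with the average velocity reproduces it. As a side calculation, generating 8 s of audio in 0.056 s corresponds to roughly 143 times faster than real time (8 / 0.056 ≈ 142.9).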

Think‑Clip‑Sample (TCS): Training‑Free Frame Selection for Long‑Video Understanding

TCS employs multi‑query reasoning to generate diverse semantic queries from a question, then uses CLIP to score frame similarity. A clip‑level slow‑fast sampling scheme splits the frame budget into dense “slow” sampling on high‑similarity segments and sparse “fast” sampling elsewhere, balancing detail and global context. On the MLVU, LongVideoBench, and VideoMME benchmarks, TCS improves accuracy by up to 6.9 % (MLVU) and cuts inference time by over 50 % on models such as Qwen2‑VL‑7B and MiMo‑VL‑7B.
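The slow-fast split can be sketched as below. This is an illustrative reading of the description, not the paper's algorithm: the per-frame scores are assumed to be already combined across queries, and the parameters `clip_len`, `slow_ratio`, and `top_clips` are hypothetical knobs introduced for the example.

```python
def slow_fast_sample(scores, clip_len, budget, slow_ratio=0.75, top_clips=1):
    """Pick `budget` frame indices: dense in the best clips, sparse elsewhere."""
    n = len(scores)
    clips = [range(s, min(s + clip_len, n)) for s in range(0, n, clip_len)]
    # Rank clips by mean frame score (a stand-in for CLIP similarity).
    ranked = sorted(
        clips, key=lambda c: sum(scores[i] for i in c) / len(c), reverse=True
    )
    slow_pool = sorted(i for c in ranked[:top_clips] for i in c)
    fast_pool = sorted(i for c in ranked[top_clips:] for i in c)

    def uniform(pool, k):
        """Take up to k evenly spaced items from pool."""
        k = min(k, len(pool))
        if k == 0:
            return []
        step = len(pool) / k
        return [pool[int(j * step)] for j in range(k)]

    # Dense "slow" sampling on the high-similarity clips.
    slow = uniform(slow_pool, int(budget * slow_ratio))
    # Sparse "fast" sampling over the rest of the video with the leftover budget.
    fast = uniform(fast_pool, budget - len(slow))
    return sorted(slow + fast)

# Twelve frames in three clips of four; the middle clip matches the query best.
scores = [0.1] * 4 + [0.9] * 4 + [0.2] * 4
frames = slow_fast_sample(scores, clip_len=4, budget=6)
print(frames)  # [0, 4, 5, 6, 7, 8]
```

Here four of the six frames come from the relevant middle clip, and the remaining two give coarse coverage of the rest of the video.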

Unified Multimodal and Multilingual Retrieval via Multi‑Task Learning with NLU Integration

The proposed framework consolidates image search, text search, and intent understanding into a two‑model architecture. A shared text encoder aligns image and text semantics, while cross‑attention with an NLU model injects intent awareness. The system attains 93.3 % average recall on XTD10 and 94.8 % on Multi30K, surpassing baselines by 1.1 to 2.7 %, and reaches 85.1 % on COCO‑QLTI, while also improving retrieval speed and reducing memory footprint.

Conclusion

These accepted works illustrate Xiaomi’s comprehensive AI research pipeline—from dataset creation and model innovation to cross‑modal and multilingual capabilities—positioning the company to integrate cutting‑edge AI across its “person‑car‑home” ecosystem.