
Tencent Music Tianqin Lab’s Practice and Applications of Audio Representation Large Models

This article reviews Tencent Music Tianqin Lab’s research on audio representation large models, covering background, the evolution of audio features, self‑supervised methods such as SimCLR, BYOL, MAE, MLM, benchmark results, multimodal extensions, and real‑world applications like song authenticity detection and search ranking.

DataFunSummit

Introduction – The article shares the practice and applications of audio representation large models developed by Tencent Music’s Tianqin Lab, aiming to build a universal audio representation that captures both expert‑level and ordinary listeners’ music perception.

Audio Representation Background – User music preferences involve multiple dimensions (artist, melody, timbre, genre, etc.). Traditional low‑level features (spectral roll‑off, centroid) evolved to mid‑level MFCCs and high‑level musical attributes (chords, rhythm). However, these high‑level features still differ from everyday listeners’ understanding, motivating a more comprehensive representation.

Evolution of Audio Representation – The field has shifted from handcrafted low‑level signals to self‑supervised deep models. Recent breakthroughs include AudioMAE, JukeBox, and multimodal models that use a strong audio encoder as one branch of cross‑modal systems.

Self‑Supervised Learning Methods

SimCLR – contrastive learning that pulls positive audio pairs together while pushing negatives apart.
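The contrastive objective behind SimCLR is usually the NT‑Xent (normalized temperature‑scaled cross‑entropy) loss. A minimal NumPy sketch, assuming `z1` and `z2` are embeddings of two augmented views of the same batch of audio clips (function name and shapes are illustrative):

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss as used in SimCLR. z1, z2: (N, D) embeddings of
    two augmented views of the same N clips; row i of z1 and row i
    of z2 form a positive pair, everything else is a negative."""
    z = np.concatenate([z1, z2], axis=0)                # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit vectors
    sim = z @ z.T / temperature                         # scaled cosine sims
    n = z1.shape[0]
    # the positive partner of row i is row i+n (and vice versa)
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    np.fill_diagonal(sim, -np.inf)                      # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos_idx].mean()
```

Minimizing this pulls each positive pair together relative to all in‑batch negatives, which is the "pull positives, push negatives" behavior described above.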

BYOL – uses an online and a target network to learn representations without negative samples, achieving strong performance with limited data.
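The target network in BYOL is not trained by gradient descent; it slowly tracks the online network via an exponential moving average. A sketch of that update, with `tau` and the flat parameter lists as illustrative simplifications:

```python
def ema_update(target_params, online_params, tau=0.99):
    """BYOL-style target update: each target weight slowly tracks the
    corresponding online weight; no gradients flow through the target."""
    return [tau * t + (1 - tau) * o
            for t, o in zip(target_params, online_params)]
```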

MAE – masks random audio patches and reconstructs them, enabling fast training and good results on tasks such as environmental sound recognition.
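The masking step can be sketched as follows, assuming a log‑mel spectrogram input; the 16×16 patch size and 75% mask ratio follow common AudioMAE‑style defaults, and all names are illustrative:

```python
import numpy as np

def mask_patches(spec, patch=16, mask_ratio=0.75, rng=None):
    """Tile a (freq, time) spectrogram into non-overlapping patches,
    hide a random subset, and return the visible patches plus a
    boolean mask marking which patches the decoder must reconstruct."""
    rng = rng or np.random.default_rng()
    f, t = spec.shape[0] // patch, spec.shape[1] // patch
    tiles = (spec[:f * patch, :t * patch]
             .reshape(f, patch, t, patch)
             .transpose(0, 2, 1, 3)
             .reshape(f * t, patch * patch))
    n_keep = round(f * t * (1 - mask_ratio))
    keep = np.sort(rng.permutation(f * t)[:n_keep])
    mask = np.ones(f * t, dtype=bool)
    mask[keep] = False          # False = visible to the encoder
    return tiles[keep], mask
```

Because the encoder only sees the kept 25% of patches, training is much cheaper than processing the full spectrogram, which is the speed advantage noted above.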

MLM (masked language modeling, e.g., Wav2Vec 2.0, MERT) – masks portions of the audio signal and predicts them, often combined with auxiliary acoustic or musical teachers (e.g., EnCodec codes, CQT features) for richer supervision.

Datasets and Model Scale – Representative models are summarized with their parameter counts and training resources. Larger models require substantially more music data; efficient training tricks are needed to keep resource usage reasonable.

Benchmarking – The MARBLE benchmark (Music Audio Representation Benchmark for universal Evaluation) evaluates models on tasks such as tagging, genre classification, emotion analysis, pitch estimation, singer identification, and vocal technique classification. Results show that a single universal audio representation transfers well across these downstream tasks without task‑specific feature engineering.

Multimodal Extensions – Audio can be combined with text (CLAP), images, or video. CLAP aligns audio and textual embeddings, enabling cross‑modal retrieval and tag generation. The MU‑LLaMA model integrates audio with large language models to generate music descriptions.
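Once both encoders are trained, CLAP‑style retrieval reduces to cosine similarity in the shared embedding space. A sketch assuming precomputed embeddings (function and variable names are illustrative):

```python
import numpy as np

def retrieve_texts(audio_emb, text_embs, k=3):
    """Rank candidate text embeddings against one audio embedding by
    cosine similarity, as in CLAP-style audio-text retrieval.
    Returns the indices of the top-k texts and their scores."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = t @ a                      # cosine similarity per candidate
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```

Tag generation works the same way in reverse: score a fixed vocabulary of tag embeddings against the audio embedding and keep the highest‑scoring tags.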

Practical Applications

Song authenticity detection – distinguishing AI‑generated vocals from real recordings using fine‑grained audio features.

Search ranking – leveraging audio embeddings to surface cold‑start or newly released high‑quality songs.
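One simple way to use such embeddings for cold‑start ranking is to blend a conventional relevance score with audio similarity to material the user already likes. The formula below is an illustrative sketch, not TME's production ranking; `alpha` and all names are assumptions:

```python
import numpy as np

def blend_rank(text_scores, taste_emb, song_embs, alpha=0.3):
    """Re-rank candidate songs: combine text relevance with the cosine
    similarity between each song's audio embedding and a 'taste'
    embedding, so new songs with no play history can still surface."""
    q = taste_emb / np.linalg.norm(taste_emb)
    s = song_embs / np.linalg.norm(song_embs, axis=1, keepdims=True)
    blended = (1 - alpha) * np.asarray(text_scores) + alpha * (s @ q)
    return np.argsort(-blended)         # best-first song indices
```

Because the audio term depends only on the waveform, a newly released song with zero engagement signals can still rank well if it sounds right.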

Music understanding – using models like MU‑LLaMA to generate descriptive captions, though current AI still lags behind expert human analysis.

Conclusion – Universal audio representation offers rich, efficient, and precise features that advance music information retrieval, recommendation, and multimodal understanding, while ongoing research focuses on deeper music perception and scaling to larger datasets.

Tags: multimodal AI, large models, self-supervised learning, music information retrieval, audio representation, Tencent Music
Written by DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
