
Multimodal + Music: MMatch Series Technologies and Their Applications at Tencent Music

This article presents the multimodal learning demands of QQ Music, introduces the MMatch series of multimodal matching technologies (image-text matching, music similarity matching, AI tagging, and video-music matching), and details their practical applications in business scenarios such as the merchant public-play product, search, and recommendation, closing with future product ideas.

DataFunTalk

In this talk, Moyan, a senior algorithm researcher at Tencent Music, outlines the multimodal learning needs driven by QQ Music and its related services, emphasizing that music data is accompanied by visual, textual, and video modalities that can enrich content understanding and retrieval.

The discussion is organized into four parts: (1) a demand inventory for "multimodal + music"; (2) the MMatch series of technologies; (3) concrete applications of MMatch; and (4) future plans.

1. Multimodal + Music demand inventory – QQ Music contains album covers, user comments, playlists, and videos, all of which can be linked to the audio track. These diverse modalities enable fine‑grained music matching beyond coarse tags, support short‑video music pairing, and satisfy varied merchant environments such as supermarkets, restaurants, and nightclubs.

2. MMatch series technologies – The team abstracts common requirements into a data‑domain concept and builds embeddings for image, text, audio, and video. Four key techniques are introduced:

Image‑text matching (MMatch 图文配乐) that builds a semantic pool from visual and textual cues and retrieves similar songs.

Music similarity matching (MMatch 音乐匹配) that predicts multi‑label descriptors (instrumentation, singer timbre, genre, language) and uses embedding similarity for cold‑start songs.

AI tagging (AI标签技术) that generates soft tags from embeddings and refines them with lightweight classifiers, achieving >90% accuracy.

Video‑music matching (MMatch 视频配乐) that jointly models lyrics, audio, and video, using high‑quality MVs as alignment targets.
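The techniques above all reduce to retrieving nearest neighbors in a shared embedding space. As a minimal sketch of that core step (the embedding dimension, pool size, and function names here are illustrative assumptions, not Tencent's implementation), cosine-similarity retrieval over a pool of song embeddings looks like this:

```python
import numpy as np

def build_index(song_embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize song embeddings so a dot product equals cosine similarity."""
    norms = np.linalg.norm(song_embeddings, axis=1, keepdims=True)
    return song_embeddings / np.clip(norms, 1e-12, None)

def retrieve(query_embedding: np.ndarray, index: np.ndarray, top_k: int = 5):
    """Return indices and scores of the top_k songs most similar to the query."""
    q = query_embedding / max(np.linalg.norm(query_embedding), 1e-12)
    scores = index @ q                      # cosine similarities against the pool
    return np.argsort(-scores)[:top_k], scores

# Toy pool: 100 songs with 64-dim embeddings (e.g. from an audio encoder).
rng = np.random.default_rng(0)
pool = build_index(rng.standard_normal((100, 64)))

# Query with a slightly perturbed copy of song 3; it should rank first.
ids, scores = retrieve(pool[3] + 0.01 * rng.standard_normal(64), pool)
```

In a production cold-start setting, the query embedding would come from whatever modality is available (cover art, lyrics, or a short audio clip), and the brute-force dot product would be replaced by an approximate nearest-neighbor index.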

3. MMatch applications – The technology powers the QQ Music merchant version (public‑play), enabling compliance‑aware song selection, environment‑based recommendation, and seed‑playlist expansion. It also improves search (10% higher effective play rate) and recommendation (1.35% higher completion, better tail‑song exposure). Additional use cases include running‑radio BPM matching, automated tag generation for long‑tail content, and visual‑music pairing.
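The running-radio use case can be sketched as simple tempo filtering; the field names, tolerance, and catalog below are illustrative assumptions rather than the production logic:

```python
def match_bpm(songs, target_bpm, tolerance=5.0):
    """Select songs whose BPM, or half/double-time feel, is within
    `tolerance` of the runner's cadence."""
    matches = []
    for song in songs:
        bpm = song["bpm"]
        # Half-time and double-time feels are perceptually compatible
        # with the same running cadence.
        candidates = (bpm, bpm / 2, bpm * 2)
        if any(abs(c - target_bpm) <= tolerance for c in candidates):
            matches.append(song)
    return matches

catalog = [
    {"title": "A", "bpm": 172},
    {"title": "B", "bpm": 86},   # half-time of 172
    {"title": "C", "bpm": 128},
]
selected = [s["title"] for s in match_bpm(catalog, target_bpm=170)]
# selected == ["A", "B"]
```

A real system would estimate BPM from the audio signal and combine this filter with the embedding-based relevance scores described above.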

Extensive experiments show a 2.33% objective gain over musicnn on the MSD‑18w dataset, a five‑fold inference speedup, and subjective listening tests reaching 91% matching accuracy across thousands of song pairs.

4. Future plans – Prospects include user‑customizable tag‑based recommendations, multimodal playlist re‑ordering for smoother listening transitions, and broader deployment of MMatch in short‑video, retail, and gaming contexts to promote emerging artists.

The session concludes with a Q&A covering compliance, the role of tags, reference papers, and data acquisition for merchant environments.

Tags: Artificial Intelligence, Recommendation Systems, Multimodal Learning, Tencent Music, Embedding Matching, Music Retrieval
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
