
Applying Multimodal Large Models to Music Recommendation at NetEase Cloud Music

This article details how NetEase Cloud Music leverages multimodal large language models to improve music recommendation across daily, personalized, and playlist scenarios: by extracting rich audio, text, and visual features, the system addresses data skew and cold-start challenges and achieves measurable gains in user engagement and distribution efficiency.

DataFunSummit

Background: Large language models (LLMs) have advanced dramatically, and multimodal LLMs that process text, images, audio, and video are now reshaping industries, including music recommendation.

Music Recommendation Challenges: NetEase Cloud Music faces data skew (the Matthew effect), cold-start for new songs, and the need for high-quality, diverse recommendations across daily feeds, personalized streams, and UGC/MGC playlists.

Solution Overview: A three-layer system of data, feature, and application layers uses multimodal LLMs to generate comprehensive song representations (lyrics, cover images, audio, metadata, user comments) that are aligned with existing ID-based embeddings.

Technical Details:

Prompt construction combines song metadata, user reviews, instrument tags, lyrics, and visual/audio features as inputs to the LLM.
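A minimal sketch of how such a prompt might be assembled. All field names here (`title`, `instrument_tags`, `top_comment`) are illustrative assumptions, not NetEase's actual schema:

```python
def build_song_prompt(song: dict) -> str:
    """Assemble one LLM prompt from a song's metadata, tags, lyrics, and comments."""
    parts = [
        f"Title: {song['title']} by {song['artist']}",
        f"Instruments: {', '.join(song.get('instrument_tags', []))}",
        f"Lyrics excerpt: {song.get('lyrics', '')[:200]}",
        f"Top comment: {song.get('top_comment', '')}",
    ]
    # A fixed instruction line, followed by one field per line.
    return "Describe this song's mood, genre, and listening context.\n" + "\n".join(parts)
```

In practice the visual and audio inputs would enter as encoder features rather than text, but the text side of the prompt follows this concatenation pattern.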

Feature extraction employs Baichuan for text, ViT‑base for images, and MERT for audio, processed in parallel workers.
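Per-modality extraction in parallel workers could be organized roughly as below. The stub extractors merely stand in for real Baichuan (text), ViT-base (image), and MERT (audio) inference calls; the fan-out/collect structure is the point:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder extractors; in production each would wrap a model server
# for Baichuan, ViT-base, or MERT respectively.
def extract_text(lyrics: str) -> list[float]:
    return [float(len(lyrics))]        # stub "embedding"

def extract_image(cover_path: str) -> list[float]:
    return [float(len(cover_path))]    # stub "embedding"

def extract_audio(audio_path: str) -> list[float]:
    return [float(len(audio_path))]    # stub "embedding"

def extract_song_features(song: dict) -> dict:
    """Run the three modality extractors concurrently for one song."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "text": pool.submit(extract_text, song["lyrics"]),
            "image": pool.submit(extract_image, song["cover"]),
            "audio": pool.submit(extract_audio, song["audio"]),
        }
        return {name: f.result() for name, f in futures.items()}
```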

Offline validation compares LLM‑based recall with traditional collaborative‑filtering and NLP models, showing superior emotional and contextual matching.

Alignment techniques include multimodal‑to‑ID mapping layers, auxiliary networks, contrastive learning, and a two‑stage modeling pipeline.
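The contrastive-learning piece can be illustrated with an InfoNCE-style loss that pulls each song's multimodal embedding toward its own ID embedding and away from other songs in the batch. This is a generic sketch of the technique, not the talk's exact formulation:

```python
import numpy as np

def info_nce(multimodal: np.ndarray, id_emb: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss: row i of `multimodal` should match row i of `id_emb`."""
    # L2-normalize both sets of embeddings.
    m = multimodal / np.linalg.norm(multimodal, axis=1, keepdims=True)
    i = id_emb / np.linalg.norm(id_emb, axis=1, keepdims=True)
    # Pairwise similarities; diagonal entries are the positive pairs.
    logits = m @ i.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Aligned batches (matching rows) yield a lower loss than mismatched ones, which is what drives the multimodal space toward the existing ID-embedding space during training.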

Results: Deploying multimodal representations boosted average playtime by 3%, click-through rate by 3%, playlist distribution by 50%, new-song distribution efficiency by 3%, and long-audio exposure by 4%.

Future Work: Further explore contrastive alignment, two-stage pre-training for multimodal models, and new applications such as LLaVA-style end-to-end multimodal fusion.

Tags: multimodal AI, Large Language Models, Recommendation systems, Feature Extraction, music recommendation, NetEase Cloud Music
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
