
Applying Multimodal Large Models to Music Recommendation at NetEase Cloud Music

This article details how NetEase Cloud Music leverages multimodal large language models to improve music recommendation across daily, personalized, and playlist scenarios: by extracting rich audio, text, and visual features, the system addresses data skew and cold-start challenges and achieves measurable gains in user engagement and distribution efficiency.

DataFunSummit

Background: Large language models (LLMs) have advanced dramatically, and multimodal LLMs that process text, images, audio, and video are now reshaping industries, including music recommendation.

Music Recommendation Challenges: NetEase Cloud Music faces data skew (the Matthew effect), cold-start for new songs, and the need for high-quality, diverse recommendations across daily feeds, personalized streams, and UGC/MGC playlists.

Solution Overview: A three-layer system of data, feature, and application layers uses multimodal LLMs to generate comprehensive song representations (lyrics, cover images, audio, metadata, user comments) that are aligned with existing ID-based embeddings.

Technical Details:

Prompt construction combines song metadata, user reviews, instrument tags, lyrics, and visual/audio features as inputs to the LLM.
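A minimal sketch of how such a prompt might be assembled. All field names here (`title`, `instrument_tags`, `top_comment`) are illustrative assumptions, not NetEase's actual schema:

```python
def build_song_prompt(song: dict) -> str:
    """Assemble one LLM prompt from a song's metadata, tags, lyrics, and comments."""
    parts = [
        f"Title: {song['title']} by {song['artist']}",
        f"Instruments: {', '.join(song.get('instrument_tags', []))}",
        f"Lyrics excerpt: {song.get('lyrics', '')[:200]}",
        f"Top comment: {song.get('top_comment', '')}",
    ]
    # A fixed instruction line, followed by one field per line.
    return "Describe this song's mood, genre, and listening context.\n" + "\n".join(parts)
```

In practice the visual and audio inputs would enter as encoder features rather than text, but the text side of the prompt follows this concatenation pattern.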

Feature extraction employs Baichuan for text, ViT‑base for images, and MERT for audio, processed in parallel workers.
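Per-modality extraction in parallel workers could be organized roughly as below. The stub extractors merely stand in for real Baichuan (text), ViT-base (image), and MERT (audio) inference calls; the fan-out/collect structure is the point:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder extractors; in production each would wrap a model server
# for Baichuan, ViT-base, or MERT respectively.
def extract_text(lyrics: str) -> list[float]:
    return [float(len(lyrics))]        # stub "embedding"

def extract_image(cover_path: str) -> list[float]:
    return [float(len(cover_path))]    # stub "embedding"

def extract_audio(audio_path: str) -> list[float]:
    return [float(len(audio_path))]    # stub "embedding"

def extract_song_features(song: dict) -> dict:
    """Run the three modality extractors concurrently for one song."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "text": pool.submit(extract_text, song["lyrics"]),
            "image": pool.submit(extract_image, song["cover"]),
            "audio": pool.submit(extract_audio, song["audio"]),
        }
        return {name: f.result() for name, f in futures.items()}
```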

Offline validation compares LLM‑based recall with traditional collaborative‑filtering and NLP models, showing superior emotional and contextual matching.

Alignment techniques include multimodal‑to‑ID mapping layers, auxiliary networks, contrastive learning, and a two‑stage modeling pipeline.
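The contrastive-learning piece can be illustrated with an InfoNCE-style loss that pulls each song's multimodal embedding toward its own ID embedding and away from other songs in the batch. This is a generic sketch of the technique, not the talk's exact formulation:

```python
import numpy as np

def info_nce(multimodal: np.ndarray, id_emb: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss: row i of `multimodal` should match row i of `id_emb`."""
    # L2-normalize both sets of embeddings.
    m = multimodal / np.linalg.norm(multimodal, axis=1, keepdims=True)
    i = id_emb / np.linalg.norm(id_emb, axis=1, keepdims=True)
    # Pairwise similarities; diagonal entries are the positive pairs.
    logits = m @ i.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Aligned batches (matching rows) yield a lower loss than mismatched ones, which is what drives the multimodal space toward the existing ID-embedding space during training.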

Results: Deploying multimodal representations boosted average playtime by 3%, click-through rate by 3%, playlist distribution by 50%, new-song distribution efficiency by 3%, and long-audio exposure by 4%.

Future Work: Further explore contrastive alignment, two-stage pre-training for multimodal models, and new applications such as LLaVA-style end-to-end multimodal fusion.

Tags: multimodal AI, Large Language Models, Recommendation systems, Feature Extraction, music recommendation, NetEase Cloud Music
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
