
Multimodal Cold-Start Techniques for Music Recommendation at NetEase Cloud Music

This article presents NetEase Cloud Music's multimodal cold-start recommendation approach, detailing the problem's significance, feature extraction using CLIP, I2I2U indirect modeling, U2I DSSM direct modeling with contrastive learning and interest‑boundary mechanisms, deployment pipeline, evaluation results, and future optimization directions.

DataFunTalk

Cold-start modeling is crucial for music recommendation platforms to enrich content ecosystems and improve user experience, especially for long-tail items lacking interaction data.

The presented solution leverages multimodal features (audio, text, tags) extracted via a CLIP-based framework, combining pretrained encoders for audio (Transformer) and text (BERT) to obtain robust song representations.
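As a minimal sketch of this idea (all shapes, weights, and the temperature here are illustrative stand-ins, not the production encoders), the audio and text embeddings can be projected into one shared space and compared CLIP-style, with matched audio-text pairs on the diagonal of the similarity matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pretrained encoder outputs: a Transformer audio encoder
# and a BERT text encoder, as described in the article.
audio_emb = rng.normal(size=(4, 512))   # batch of 4 songs, audio features
text_emb = rng.normal(size=(4, 768))    # matching text/tag features

# Learned projection heads map both modalities into one shared space.
W_audio = rng.normal(size=(512, 128)) * 0.02
W_text = rng.normal(size=(768, 128)) * 0.02

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

a = l2_normalize(audio_emb @ W_audio)   # (4, 128) audio-side embeddings
t = l2_normalize(text_emb @ W_text)     # (4, 128) text-side embeddings

# CLIP-style similarity matrix: diagonal entries are matched audio-text
# pairs, off-diagonal entries act as in-batch negatives.
logits = a @ t.T / 0.07                 # temperature-scaled cosine sims
```

Training then pulls the diagonal entries above the off-diagonal ones, so the two modalities agree on a single song representation.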

Two modeling strategies are introduced:

I2I2U indirect modeling: maps a cold-start item to similar items (I2I) and then to those items' users (2U), using vector similarity and collaborative-filtering-based supervision with a BPR loss.

U2I direct modeling: a multimodal DSSM consisting of an ItemTower and a UserTower, enhanced with an interest-boundary tower that separates positive from negative samples.
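The two-tower U2I setup with a BPR objective can be sketched as follows (tower widths, feature dimensions, and the single-hidden-layer MLPs are hypothetical; the real towers are deeper and trained end to end):

```python
import numpy as np

rng = np.random.default_rng(1)

def tower(x, w1, w2):
    """A minimal MLP tower: one hidden ReLU layer, L2-normalized output."""
    h = np.maximum(x @ w1, 0.0)
    v = h @ w2
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical inputs: user features (32-d) and multimodal item features (128-d).
user_x = rng.normal(size=(8, 32))
pos_x = rng.normal(size=(8, 128))   # items each user interacted with
neg_x = rng.normal(size=(8, 128))   # sampled negative items

uw1, uw2 = rng.normal(size=(32, 64)) * 0.1, rng.normal(size=(64, 16)) * 0.1
iw1, iw2 = rng.normal(size=(128, 64)) * 0.1, rng.normal(size=(64, 16)) * 0.1

u = tower(user_x, uw1, uw2)         # UserTower embeddings
p = tower(pos_x, iw1, iw2)          # ItemTower embeddings, positives
n = tower(neg_x, iw1, iw2)          # ItemTower embeddings, negatives

# BPR loss: per user, push the positive item's score above the negative's.
pos_score = np.sum(u * p, axis=1)
neg_score = np.sum(u * n, axis=1)
bpr_loss = -np.mean(np.log(1.0 / (1.0 + np.exp(-(pos_score - neg_score)))))
```

Sharing the ItemTower between positives and negatives is what lets a cold-start item be scored purely from its multimodal features.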

Contrastive learning is applied to mitigate popularity bias: two augmented views of each item are generated via random feature masking and noise, and an InfoNCE loss is combined with the BPR loss.
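A compact sketch of that augment-and-contrast step (masking probability, noise scale, and temperature are illustrative choices, not the paper's exact hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(2)

def augment(x, mask_prob=0.2, noise_std=0.05):
    """Random feature masking plus Gaussian noise, per the article."""
    mask = rng.random(x.shape) > mask_prob
    return x * mask + rng.normal(scale=noise_std, size=x.shape)

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE: an item's two views are positives; the other items in
    the batch serve as negatives regardless of popularity, which is
    what counteracts popularity bias."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature               # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # positives on the diagonal

item_features = rng.normal(size=(16, 64))          # hypothetical item embeddings
cl_loss = info_nce(augment(item_features), augment(item_features))
```

In training this term would simply be added to the BPR loss, weighted by a coefficient.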

The interest‑boundary mechanism computes a boundary vector for each user; during inference, an item is recommended only if its score exceeds the user's boundary, preventing irrelevant cold‑start items from being shown to users who prefer popular content.
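The gating logic at inference time might look like the following sketch (computing the threshold as a user-boundary dot product is an assumption for illustration; the article only states that the boundary tower yields a per-user cutoff):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical embeddings: one user, that user's boundary vector from
# the interest-boundary tower, and candidate cold-start items.
user_vec = rng.normal(size=16)
boundary_vec = rng.normal(size=16)     # output of the boundary tower
items = rng.normal(size=(10, 16))

item_scores = items @ user_vec         # user-item relevance scores
threshold = user_vec @ boundary_vec    # per-user decision boundary

# Recommend only items whose score clears this user's own boundary;
# a user who prefers popular content gets a high threshold, so few or
# no cold-start items pass the gate.
recommended = np.where(item_scores > threshold)[0]
```

The key property is that the cutoff is learned per user rather than set as one global score threshold.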

After offline training, song vectors are indexed for fast nearest‑neighbor retrieval; new items are encoded online, matched to similar items, and then delivered to users who have interacted with those similar items.
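The retrieval step above reduces to nearest-neighbor search over the stored song vectors; here is a brute-force cosine-similarity sketch standing in for a real ANN index (catalog size, dimensionality, and the `top_k_similar` helper are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Offline: normalized song vectors stored for retrieval. Brute-force
# cosine search here stands in for a production ANN index.
catalog = rng.normal(size=(1000, 64))
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)

def top_k_similar(query_vec, k=10):
    """Return indices of the k most similar catalog songs, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = catalog @ q
    idx = np.argpartition(-sims, k)[:k]        # unordered top-k
    return idx[np.argsort(-sims[idx])]         # sort those k by similarity

# Online: encode a new song from its multimodal features, look up its
# neighbors, then surface it to users who interacted with those neighbors.
new_song_vec = rng.normal(size=64)
neighbors = top_k_similar(new_song_vec, k=5)
```

Because both indexing and querying use the same ItemTower embedding space, a brand-new song is searchable the moment it is encoded.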

Evaluation shows significant improvements: +38% more target users, +1.95% increase in collection rate, and +1.42% rise in completion rate, with strong clustering of songs by genre in the embedding space.

Future work includes multimodal fusion of content and behavior features and end‑to‑end optimization of the recall‑ranking pipeline.

The article concludes with a Q&A covering key metrics, feature preprocessing, model architecture, and the role of contrastive learning and interest boundaries in cold‑start recommendation.

Tags: deep learning · contrastive learning · recommendation system · cold start · multimodal learning · music recommendation
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
