Multimodal Content Understanding and Cold-Start Practices in NetEase Cloud Music Community Recommendation System
This article details how NetEase Cloud Music leverages multimodal content understanding—using audio models like MusicCLIP and Audio MAE and image‑text fusion via FLAVA—to improve recommendation performance for new content and new users, covering system architecture, cold‑start solutions, and future AI‑driven directions.
Introduction: This article presents the application of multimodal content understanding in the NetEase Cloud Music community recommendation system, covering the audio and image‑text modalities and emphasizing their role in optimizing for new content and new users.
Community Overview: The Cloud Music community includes four content types—comments (text), moments (image‑text), dynamics (user and artist posts), and fan groups—aiming to grow user scale and retention through a loop of users, creators, and content.
Multimodal Content Understanding – Audio: Audio representations are learned using MusicCLIP and Audio MAE. MusicCLIP aligns audio with textual tags via a CLIP‑like contrastive framework, while Audio MAE masks spectrogram patches and reconstructs them with an encoder‑decoder trained by MSE.
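The CLIP‑like objective mentioned above can be sketched as a symmetric contrastive loss over paired audio/text embeddings. This is a minimal illustration with random placeholder embeddings; the actual MusicCLIP encoders, batch construction, and temperature handling are not specified in the talk.

```python
import numpy as np

def clip_style_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    Illustrative sketch of a CLIP-like objective; the production model's
    encoders and hyperparameters may differ.
    """
    # L2-normalize so dot products become cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature            # (B, B); matching pairs on the diagonal
    labels = np.arange(len(a))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # cross-entropy in both directions (audio -> text and text -> audio)
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly matched pairs the diagonal dominates and the loss approaches zero; with unrelated pairs it stays near `log(batch_size)`.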
Multimodal Content Understanding – Image‑Text: The FLAVA architecture encodes both images and text with ViT backbones and merges their features; additional techniques include freezing the image encoder while training the text encoder, handling misaligned image‑text pairs, and exploring vision‑language models such as Qwen‑VL with prompts for feature extraction.
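The "freeze the image encoder, train the text encoder" idea can be sketched in miniature: a fixed random projection stands in for the pretrained image branch, and only the text projection is updated to align with it. All matrices, dimensions, and the learning rate here are illustrative assumptions, not the production setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Frozen image branch: a fixed projection standing in for a pretrained ViT.
W_img = rng.normal(size=(512, 128))   # never updated ("freeze the image encoder")
# Trainable text branch: a projection we adapt toward the image space.
W_txt = rng.normal(size=(300, 128))

img_feat = rng.normal(size=512)       # placeholder image feature
txt_feat = rng.normal(size=300)       # placeholder text feature

img_emb = img_feat @ W_img            # fixed alignment target
for _ in range(200):                  # update only the text projection
    txt_emb = txt_feat @ W_txt
    # gradient of 0.5 * ||txt_emb - img_emb||^2 with respect to W_txt
    grad = np.outer(txt_feat, txt_emb - img_emb)
    W_txt -= 1e-3 * grad
```

After training, the text embedding is nearly collinear with the frozen image embedding, which is the alignment effect the trick aims for.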
Cold‑Start Practices – Content Cold‑Start: A CB2CF (content‑based to collaborative‑filtering) pipeline improves recall for new content: concatenated multimodal features are passed through SENet and an MLP and aligned with the online DSSM twin‑tower embedding space, lifting interaction rates and exposure for new items.
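The content tower described above can be sketched as SENet‑style field gating followed by concatenation and an MLP projection. Weights here are random placeholders and the output dimension is illustrative; in CB2CF these parameters would be trained to match the CF item embedding.

```python
import numpy as np

def senet_gate(fields, r=2):
    """Squeeze-and-Excitation over feature fields: reweight each field's
    embedding by an importance score. Weights are random placeholders."""
    B, F, D = fields.shape
    z = fields.mean(axis=2)                     # squeeze: (B, F) field summaries
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(F, F // r))
    W2 = rng.normal(size=(F // r, F))
    s = 1 / (1 + np.exp(-(np.maximum(z @ W1, 0) @ W2)))  # excitation: (B, F)
    return fields * s[:, :, None]               # reweight each field

def cb2cf_content_tower(fields, out_dim=64):
    """Concatenate SENet-gated multimodal fields and project with an MLP
    toward the online DSSM item-tower space (dimensions illustrative)."""
    gated = senet_gate(fields)
    x = gated.reshape(gated.shape[0], -1)       # concatenate fields
    rng = np.random.default_rng(1)
    W = rng.normal(size=(x.shape[1], out_dim)) / np.sqrt(x.shape[1])
    emb = np.maximum(x @ W, 0)                  # one MLP layer (sketch)
    return emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
```

Because the output is L2‑normalized, alignment with the CF tower can be trained with a simple cosine or dot‑product loss.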
Cold‑Start Practices – User Cold‑Start: User embeddings are first generated by LightGCN from behavior data, then enhanced with multimodal embeddings from MusicCLIP and Audio MAE, yielding higher coverage and interaction gains, especially for users with niche preferences.
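The LightGCN step can be sketched as parameter‑free propagation over the symmetrically normalized user‑item graph, averaging embeddings across hops. Random initial embeddings stand in for trained ones, and the dimensions are illustrative.

```python
import numpy as np

def lightgcn_embeddings(R, dim=16, n_layers=3, seed=0):
    """LightGCN propagation: average embeddings over K hops of the
    symmetrically normalized user-item graph (no feature transforms,
    no nonlinearities). Random init stands in for trained embeddings."""
    n_users, n_items = R.shape
    n = n_users + n_items
    A = np.zeros((n, n))                 # bipartite adjacency
    A[:n_users, n_users:] = R
    A[n_users:, :n_users] = R.T
    d = A.sum(axis=1)
    d[d == 0] = 1                        # guard isolated nodes
    A_hat = A / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    E = np.random.default_rng(seed).normal(size=(n, dim))
    layers = [E]
    for _ in range(n_layers):
        E = A_hat @ E                    # one hop of neighborhood smoothing
        layers.append(E)
    final = np.mean(layers, axis=0)      # layer-combination by averaging
    return final[:n_users], final[n_users:]
```

The resulting user embeddings could then be concatenated with aggregated MusicCLIP / Audio MAE content embeddings of consumed items, which is one plausible reading of the "enhanced with multimodal embeddings" step.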
Personalized Grouping: Differences between new and old users are captured via category‑level interest deltas, and gender‑based priors inform group‑wise gating mechanisms integrated into a DeepFM + MMoE architecture, yielding stable user segmentation and metric gains.
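The group‑wise gating idea can be sketched as an MMoE forward pass where each user segment (e.g. new vs. old users, or gender‑prior groups) gets its own gate per task, so expert mixing differs across segments. Weights are random placeholders and the shapes are illustrative.

```python
import numpy as np

def mmoe_forward(x, group_id, n_experts=4, n_tasks=2, n_groups=2, seed=0):
    """MMoE with group-wise gates: one softmax gate per (group, task) pair,
    so each user segment mixes the shared experts differently.
    Weights are random placeholders for illustration."""
    rng = np.random.default_rng(seed)
    d, h = x.shape[0], 8
    experts = [rng.normal(size=(d, h)) for _ in range(n_experts)]
    gates = rng.normal(size=(n_groups, n_tasks, d, n_experts))
    expert_out = np.stack([np.maximum(x @ W, 0) for W in experts])  # (E, h)
    outs = []
    for t in range(n_tasks):
        logits = x @ gates[group_id, t]                  # (E,) gate logits
        w = np.exp(logits - logits.max())
        w /= w.sum()                                     # softmax over experts
        outs.append(w @ expert_out)                      # (h,) per-task mix
    return outs
```

In the described system these per‑task representations would feed the DeepFM prediction heads; the group index selects which gate is active for a given user.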
Outlook: Future work includes leveraging large models for tag representation, generating additional features, and adopting transformer‑based generative recommendation models that follow scaling laws.
Q&A: The session addresses recall‑to‑ranking exposure, pair‑wise loss definitions, multimodal user tower comparisons, and typical traffic split ratios for experiments.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.