
Live Streaming Recommendation Practices in NetEase Cloud Music: Real-time, Multi-target, and Multimodal Approaches

This article describes NetEase Cloud Music's LOOK live-streaming recommendation system for the song-playback page. The system combines real-time feature pipelines (millisecond-level to hour-level features), multi-target optimization (click, watch, gift, comment) via ESMM+FM and MMoE models, GradNorm-based loss fusion, and a multimodal host-text-avatar ranking model. The multimodal model lifts CTR by 10.21% and CTCVR by 8.48%, and the system as a whole aims to balance producer and consumer retention.

NetEase Cloud Music Tech Team

The article presents NetEase Cloud Music's LOOK live streaming recommendation system, focusing on the entry point on the song playback page.

It outlines the business background, describing multiple traffic entry points and goals such as CTR, watch duration, follow conversion, and payment conversion, while balancing producer and consumer retention.

Key challenges include streamers whose status changes in real time, multi-target optimization (click, watch, gift, comment), multimodal content (text, image, video, audio), limited display space (only a small avatar), and a high proportion of new users, which limits personalization.

For real-time recommendation, the authors detail a streaming sample pipeline using Flink and Kafka, covering millisecond, second/minute, and hour-level features, and discuss sample attribution (delayed feedback) and incremental training strategies.
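The sample attribution step can be illustrated with a small sketch. It is not the authors' Flink job: it only shows the general idea of labelling an impression as positive when a click arrives within an attribution window, and as negative otherwise. The function name, the tuple layout, and the 600-second window are assumptions made for illustration.

```python
from collections import defaultdict


def attribute_samples(impressions, clicks, window_s=600):
    """Label impressions using clicks that arrive within an attribution window.

    impressions: iterable of (impression_id, timestamp_seconds)
    clicks:      iterable of (impression_id, timestamp_seconds)
    window_s:    hypothetical attribution window; the article does not
                 state the value used in production.
    Returns a list of (impression_id, label) with label in {0, 1}.
    """
    # Group click timestamps by the impression they refer to.
    clicks_by_impression = defaultdict(list)
    for impression_id, ts in clicks:
        clicks_by_impression[impression_id].append(ts)

    samples = []
    for impression_id, shown_at in impressions:
        # A click counts only if it happens after the impression and
        # no later than window_s seconds afterwards.
        clicked = any(
            0 <= ts - shown_at <= window_s
            for ts in clicks_by_impression.get(impression_id, [])
        )
        samples.append((impression_id, 1 if clicked else 0))
    return samples
```

A short window makes samples available for incremental training sooner but mislabels late feedback as negative; a long window does the opposite. This is the trade-off behind the delayed-feedback discussion.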

For multi-target fusion, the article surveys separate modeling, joint modeling, sample weighting, and Learning to Rank, then focuses on the ESMM+FM and MMoE approaches. ESMM+FM improves CTR by 7.1% and CTCVR by 6.4%, while MMoE yields similar CTR and a 1.5% CTCVR lift.
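As a rough illustration of the two model families named above, the sketch below shows the core of each in plain Python. ESMM models the click-and-convert probability over all impressions as pCTCVR = pCTR × pCVR and trains on the CTR and CTCVR losses. MMoE gives each task its own softmax gate over a set of shared experts. The function names and the scalar or list inputs are assumptions for illustration; the FM component and the production network are not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(label, prob, eps=1e-7):
    """Binary cross-entropy for one example, with clipping for stability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def esmm_loss(ctr_logit, cvr_logit, clicked, converted):
    """ESMM-style loss for a single impression.

    The CVR tower is never supervised directly; it is trained through
    the product pCTCVR = pCTR * pCVR over the whole impression space.
    """
    p_ctr = sigmoid(ctr_logit)
    p_cvr = sigmoid(cvr_logit)
    p_ctcvr = p_ctr * p_cvr
    # The CTCVR label is 1 only when the impression was clicked AND converted.
    loss = bce(clicked, p_ctr) + bce(clicked * converted, p_ctcvr)
    return p_ctr, p_ctcvr, loss


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mmoe_mix(expert_outputs, gate_logits):
    """Mix shared expert outputs for ONE task using that task's gate.

    expert_outputs: list of equally sized vectors, one per expert
    gate_logits:    one logit per expert, produced by the task's gate
    """
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * expert[d] for w, expert in zip(weights, expert_outputs))
        for d in range(dim)
    ]
```

In a full MMoE model each task calls `mmoe_mix` with its own gate logits and feeds the result to its own tower, so tasks can share experts to different degrees.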

For loss fusion, the article compares manually weighted losses with GradNorm. GradNorm improves CTR by 0.6% and CTCVR by 0.4% over manual weighting.
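The following is a minimal sketch of one GradNorm weight update, assuming the per-task gradient norms on the shared layer have already been computed by the training framework. It follows the general GradNorm recipe: tasks that train more slowly than average get a larger target gradient norm, and each loss weight is nudged toward its target. The function name and the default `alpha` and `lr` values are illustrative assumptions, not settings reported in the article.

```python
def gradnorm_step(weights, grad_norms, losses, initial_losses,
                  alpha=1.5, lr=0.025):
    """One GradNorm update of the task loss weights.

    weights:        current loss weight w_i per task
    grad_norms:     norm of the UNWEIGHTED task gradient on the shared layer
    losses:         current loss L_i(t) per task
    initial_losses: loss L_i(0) per task at the start of training
    alpha, lr:      hypothetical hyperparameter values
    """
    n = len(weights)
    # Gradient norm of the weighted loss: G_i = w_i * ||grad L_i||.
    weighted_norms = [w * g for w, g in zip(weights, grad_norms)]
    mean_norm = sum(weighted_norms) / n

    # Relative inverse training rate r_i: above 1 means the task lags behind.
    ratios = [l / l0 for l, l0 in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / n
    rel_rates = [r / mean_ratio for r in ratios]

    # Target norm per task, treated as a constant in the update.
    targets = [mean_norm * (r ** alpha) for r in rel_rates]

    # Gradient of sum_i |G_i - target_i| with respect to w_i.
    def sign(x):
        return (x > 0) - (x < 0)

    grads = [sign(G - t) * g
             for G, t, g in zip(weighted_norms, targets, grad_norms)]

    # Gradient step, then renormalise so the weights sum to the task count.
    updated = [max(w - lr * d, 1e-6) for w, d in zip(weights, grads)]
    scale = n / sum(updated)
    return [w * scale for w in updated]
```

With two tasks such as CTR and CTCVR, the task whose weighted gradient dominates the shared layer has its weight reduced, which replaces hand-tuning of the loss weights.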

Multimodal exploration covers real-time feature/model updates and controlling streamer avatar and text via model-driven selection, leading to a host‑text‑avatar triplet ranking model that boosts CTR by 10.21% and CTCVR by 8.48%.
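One way to picture host-text-avatar triplet ranking is sketched below: for each host, every candidate (text, avatar) pair is scored, the best pair is kept, and hosts are then ranked by that best score. The `score_fn` callable stands in for the trained model and is purely hypothetical, as are the dictionary keys. The article does not describe the model's interface.

```python
from itertools import product


def rank_triplets(hosts, score_fn, top_k=1):
    """Pick the best (text, avatar) per host, then rank hosts by score.

    hosts:    list of dicts with hypothetical keys "id", "texts", "avatars"
    score_fn: hypothetical callable (host_id, text, avatar) -> float,
              standing in for the trained triplet ranking model
    Returns up to top_k tuples (score, host_id, text, avatar), best first.
    """
    best_per_host = []
    for host in hosts:
        # Score every text/avatar combination available for this host.
        scored = [
            (score_fn(host["id"], text, avatar), host["id"], text, avatar)
            for text, avatar in product(host["texts"], host["avatars"])
        ]
        if scored:
            best_per_host.append(max(scored, key=lambda item: item[0]))

    # Rank hosts by the score of their best creative.
    best_per_host.sort(key=lambda item: item[0], reverse=True)
    return best_per_host[:top_k]
```

Because the display slot shows only a small avatar and a short text, selecting the creative jointly with the host lets the same streamer be presented differently to different users.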

The article concludes that model choice must fit the business scene, emphasizing feature extraction aligned with user behavior and data.

Tags: real-time, Live Streaming, recommendation system, multimodal, MMoE, ESMM, GradNorm, multi-target learning