
Weibo Machine Learning Platform (WML) Overview and Flink Applications

This article presents an in‑depth overview of Weibo's large‑scale machine learning platform, detailing its multi‑layer architecture, development workflow, and CTR model evolution; how Apache Flink is employed for real‑time data processing, sample services, multi‑stream joins, and multimedia feature generation; and the platform's future roadmap.


Weibo, one of China's leading social media platforms, serves over 220 million daily active users and 516 million monthly active users, requiring a massive machine‑learning infrastructure to deliver real‑time content recommendations.

The Weibo Machine Learning Platform (WML) provides an end‑to‑end service chain for CTR, multimedia, and other machine‑learning tasks, covering sample processing, model training, deployment, and inference. Its architecture consists of six layers: Cluster, Scheduling, Computing Platform, Model Training, Online Inference, and Business Application, with custom concepts such as sample, model, and service libraries and unified submission methods (WeiClient CLI and WAIC UI).

Development in WML follows a two‑level DAG design. The inner DAG (WeiLearn) allows users to implement custom UDFs for offline and real‑time stages, forming individual tasks. The outer DAG (WeiFlow) composes these tasks into cross‑cluster workflows for execution.
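The two‑level design can be illustrated with a minimal sketch: inner tasks wrap user‑defined functions, and an outer workflow chains them. All class and method names below are hypothetical stand‑ins, since the actual WeiLearn/WeiFlow APIs are not shown in the talk.

```python
class Task:
    """Inner DAG (WeiLearn-style): a task wraps a user-defined function."""
    def __init__(self, name, udf):
        self.name = name
        self.udf = udf

    def run(self, data):
        return self.udf(data)


class Workflow:
    """Outer DAG (WeiFlow-style): composes tasks into one pipeline."""
    def __init__(self):
        self.tasks = []

    def then(self, task):
        self.tasks.append(task)
        return self

    def execute(self, data):
        # In the real platform each task may run on a different cluster;
        # here they simply run in sequence.
        for task in self.tasks:
            data = task.run(data)
        return data


# Example: a feature-extraction task feeding a tiny "training" task.
extract = Task("extract", lambda logs: [len(x) for x in logs])
train = Task("train", lambda feats: sum(feats) / len(feats))

wf = Workflow().then(extract).then(train)
result = wf.execute(["click", "impression", "like"])  # average token length
```

The point of the split is that UDF authors only touch the inner level, while the outer level handles scheduling across clusters.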

CTR models have evolved through six versions, expanding from simple offline LR learning to support GBDT, FM/FFM, Wide&Deep, DeepFM, DSSM, and online FM/FFM, eventually reaching a billion‑parameter scale with peak throughput in the millions of QPS and a model‑update cycle of about ten minutes.

Flink is integrated into WML for both real‑time and batch pipelines. It underpins the real‑time computation layer (alongside Storm, Flume, and Grafana) and supports data sources such as Kafka, Redis, and HDFS. Key use cases include sample services, multi‑stream joins, and multimedia feature generation, where offline deep‑learning models are deployed on GPU clusters for online inference.

Sample services generate training data by joining multiple streams, handling out‑of‑order logs with a custom 10‑minute window trigger that emits results as soon as the join succeeds, reducing latency. Optimizations include a sample‑trigger mechanism, PU‑loss compensation based on a 2019 Twitter paper, and RocksDB state storage with Gemini‑inspired I/O improvements.
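The early‑emit trigger can be sketched as follows: rather than waiting for the 10‑minute window to close, a joined sample is emitted the moment both streams have arrived for the same key, and unmatched events are evicted when the window expires. This is a pure‑Python illustration of the idea, not Flink's actual `Trigger` API, and the class and field names are invented.

```python
WINDOW_MS = 10 * 60 * 1000  # 10-minute join window


class EarlyEmitJoin:
    def __init__(self):
        self.pending = {}  # key -> (arrival_ts, side, payload)

    def on_event(self, key, side, payload, ts):
        """Emit a joined sample immediately if the other side is buffered."""
        other = self.pending.get(key)
        if other and other[1] != side:
            del self.pending[key]
            return {"key": key, side: payload, other[1]: other[2]}
        self.pending[key] = (ts, side, payload)
        return None

    def on_timer(self, now):
        """Evict events older than the window (e.g. unclicked exposures)."""
        expired = [k for k, (ts, _, _) in self.pending.items()
                   if now - ts > WINDOW_MS]
        for k in expired:
            del self.pending[k]
        return expired


join = EarlyEmitJoin()
join.on_event("u1", "impression", {"pos": 3}, ts=0)       # buffered, no emit
sample = join.on_event("u1", "click", {"dwell": 8}, ts=5_000)
# Joined sample emitted 5 seconds after the click, not 10 minutes later.
```

The latency win comes entirely from decoupling emission from window closure; the window boundary only governs eviction.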

Multi‑stream join scenarios involve N data sources filtered, mapped, and keyed before being joined in a window, followed by further processing and storage into the sample library.
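The pipeline shape described above (filter, map, and key each source independently, then join on the shared key) can be shown with a small stdlib stand‑in for the Flink operators. The helper names and sample records are illustrative, not part of WML.

```python
from collections import defaultdict


def prepare(stream, keep, transform, key_of):
    """filter -> map -> keyBy for one source."""
    keyed = defaultdict(list)
    for event in stream:
        if keep(event):
            e = transform(event)
            keyed[key_of(e)].append(e)
    return keyed


def join_keyed(*keyed_streams):
    """Join records that appear under the same key in every stream."""
    common = set.intersection(*(set(s) for s in keyed_streams))
    return {k: [s[k] for s in keyed_streams] for k in common}


# Two of the N sources: exposure logs and click logs, joined on (uid, item).
exposures = [{"uid": 1, "item": "a"}, {"uid": 2, "item": "b"}]
clicks = [{"uid": 1, "item": "a", "ts": 9}]

joined = join_keyed(
    prepare(exposures, lambda e: True, lambda e: e,
            lambda e: (e["uid"], e["item"])),
    prepare(clicks, lambda e: e["ts"] >= 0, lambda e: e,
            lambda e: (e["uid"], e["item"])),
)
# Only (uid=1, item="a") appears in both streams and survives the join.
```

In the real pipeline the joined records would then be post‑processed and written to the sample library.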

Multimedia feature generation relies on offline GPU‑accelerated deep‑learning models for image, text, and video streams, with online inference via RPC calls to the models, ensuring four‑nines availability, sub‑second latency, and configurable development modes.
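One common way to hold a sub‑second latency target on such an RPC path is a per‑call deadline with a graceful fallback, so a slow or failing model service degrades the feature rather than stalling the stream. The sketch below assumes this pattern; the service stub and constants are placeholders, not Weibo's actual interface.

```python
import time

DEADLINE_S = 0.8              # assumed sub-second latency budget
DEFAULT_EMBEDDING = [0.0] * 4  # neutral feature used on failure


def call_with_deadline(rpc, payload, deadline=DEADLINE_S, fallback=None):
    """Call an inference RPC; fall back instead of blocking the pipeline."""
    start = time.monotonic()
    try:
        result = rpc(payload)
        if time.monotonic() - start > deadline:
            return fallback  # too slow: degrade rather than stall
        return result
    except Exception:
        return fallback      # model service error: degrade gracefully


def fake_image_model(payload):
    """Stand-in for a GPU-hosted image model returning an embedding."""
    return [0.1, 0.2, 0.3, 0.4]


emb = call_with_deadline(fake_image_model, b"jpeg-bytes",
                         fallback=DEFAULT_EMBEDDING)
```

A production client would also need connection pooling and server‑side deadline propagation, which are omitted here.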

Reliability measures include end‑to‑end monitoring and alerting, at‑least‑once message delivery, automatic restarts with checkpoint recovery, and retry queues to guarantee data persistence.
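The retry‑queue idea behind at‑least‑once delivery can be sketched simply: a failed write goes back onto a queue instead of being dropped, and only after exhausting its attempts is it parked for manual replay. Queue and sink names below are illustrative.

```python
from collections import deque


def deliver(messages, sink, max_attempts=3):
    """At-least-once delivery: retry each message until the sink accepts it."""
    queue = deque((m, 0) for m in messages)
    delivered, dead = [], []
    while queue:
        msg, attempts = queue.popleft()
        try:
            sink(msg)
            delivered.append(msg)
        except IOError:
            if attempts + 1 < max_attempts:
                queue.append((msg, attempts + 1))  # back onto the retry queue
            else:
                dead.append(msg)  # parked for manual replay, never silently lost
    return delivered, dead


calls = {"n": 0}


def flaky_sink(msg):
    """Fails on the first call, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("transient failure")


delivered, dead = deliver(["s1", "s2"], flaky_sink)
# "s1" fails once, is re-queued, and succeeds on retry; nothing is lost.
```

Note that at‑least‑once implies possible duplicates on retry, so downstream consumers of such a queue are typically idempotent.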

Future plans focus on unifying real‑time and batch table registration via a common API and migrating offline deep‑learning pipelines to online TensorFlow‑on‑Flink, enabling continuous model updates and incremental training.

The presentation was delivered by Yu Qian, Senior Algorithm Engineer at Weibo's Machine Learning R&D Center, who has extensive experience building real‑time recommendation systems with Flink.

Tags: machine learning, real-time processing, Flink, CTR, data platform, Weibo
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
