Technical Evolution of Xianyu Real-Time Selection System for Double Eleven
To meet Double‑Eleven’s sub‑second, billion‑item feed demands, Alibaba’s Xianyu selection system evolved from a Solr‑based search pipeline through offline batch and PostgreSQL attempts to a Blink‑powered real‑time stream platform using Niagara’s low‑latency LSM storage, delivering high‑throughput, personalized product feeds.
Double Eleven is a massive shopping festival in China. For Alibaba engineers it is an annual stress test that drives a series of innovations across networking, infrastructure, capacity planning, performance optimization, personalized recommendation, intelligent search and complex marketing scenarios.
The Xianyu selection system ("选品投放系统", the product selection and delivery system) was built to address two key characteristics of Double‑Eleven event pages: channels grouped around similar product categories (e.g., sports goods) and personalized, per‑user product feeds. The system must select relevant items from billions of products in real time and deliver them to each user.
2. Xianyu Selection System Evolution
2.1 Business Problems
The e‑commerce flow consists of three core stages: product publishing, product display, and product transaction. The selection system focuses on the display stage, especially during large‑scale events, by combining operational rules with algorithmic personalization to build many event channels and present high‑quality items.
2.2 Search‑Based Implementation
Early Xianyu relied on a search engine (Solr) as the data source for feeds. The pipeline:
Publish → write to the product DB → send an async message via RocketMQ.
A consumer reads the message, enriches the data via RPC, writes it to a MySQL dump table, then updates the search index via HTTP.
Nightly full‑dump jobs rebuild the index.
Operators configure selection rules through a UI that translates them into search queries.
Front‑end feed requests are served by converting rule IDs into search queries.
Limitations include latency for real‑time updates (index refresh on the order of 100 ms at best, which falls short of the real‑time requirement) and scalability concerns for large‑scale events.
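To make the rule‑to‑query translation concrete, here is a hypothetical sketch of flattening an operator‑configured rule into a Solr‑style filter query. The field names and query syntax are illustrative assumptions, not the actual Xianyu schema:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: turn an operator-configured selection rule
// (a map of field -> required value) into a Solr-style filter query.
public class RuleToQuery {
    public static String toFilterQuery(Map<String, String> rule) {
        return rule.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(" AND "));
    }

    public static void main(String[] args) {
        Map<String, String> rule = new LinkedHashMap<>();
        rule.put("category", "sports");
        rule.put("status", "on_sale");
        System.out.println(toFilterQuery(rule));
        // category:sports AND status:on_sale
    }
}
```

A front‑end feed request would then resolve its rule ID to such a query and execute it against the index.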
2.3 Offline Computation Approach
A big‑data platform (MaxCompute, or a custom Kafka + Hadoop + Hive + Spark stack) was considered for running batch jobs that compute selection rules, but batch computation cannot meet Xianyu’s sub‑second real‑time requirements.
2.4 PostgreSQL Attempt
PostgreSQL was evaluated for its OLTP/OLAP capabilities, fast JSON handling, triggers, and NOTIFY mechanism. However, trigger overhead, notification reliability, and KV store performance issues prevented it from being a viable production solution.
2.5 Real‑Time Computing Platform (Blink)
Blink, an Alibaba‑enhanced Flink platform, was chosen for its real‑time stream processing, high‑performance state backend, and operational maturity. Key challenges addressed:
Incremental and full‑load data merging in a wide table.
High write/read throughput for billions of rows (≈2 × 10⁵ TPS).
Efficient rule evaluation and result set updates.
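The first challenge, merging incremental updates into the full‑load wide table, can be sketched as a last‑write‑wins merge keyed on timestamp. This is a hedged illustration; the field names (id, ts, payload) are assumptions, not Xianyu’s actual schema:

```java
// Sketch of merging full-load and incremental data in a wide table:
// keep whichever record for a product carries the newer timestamp.
public class WideTableMerge {
    public static final class Record {
        final String id;      // product id
        final long ts;        // event timestamp
        final String payload; // flattened wide-table fields
        public Record(String id, long ts, String payload) {
            this.id = id; this.ts = ts; this.payload = payload;
        }
    }

    // Returns the record that should survive in the wide table.
    public static Record merge(Record current, Record incoming) {
        if (current == null) return incoming;
        return incoming.ts >= current.ts ? incoming : current;
    }
}
```

In the real pipeline this merge runs inside Blink state at very high throughput, which is why the choice of state backend matters.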
Blink’s state backend options include:
Gemini – high‑performance in‑memory backend.
RocksDBStateBackend – Flink native, slower.
Niagara – Alibaba’s custom distributed backend based on Seastar, offering low‑latency LSM‑tree storage.
Custom UDAFs are used for aggregation. Example code:
```java
/**
 * @param <T>   UDAF output type
 * @param <ACC> UDAF accumulator type
 */
public abstract class AggregateFunction<T, ACC> extends UserDefinedFunction {
    public abstract ACC createAccumulator();
    public abstract T getValue(ACC accumulator);
    // By convention, implementations also provide:
    // public void accumulate(ACC accumulator, ...[user inputs]...);
}
```

The per‑product state stores fields such as _mergeRes (merged product info with timestamps), _ruleRes (rule hit flag), _changeFlag (change marker), and _ruleDiff (diff for incremental updates). Niagara’s LSM‑tree design, Masstree memtable, and adaptive compaction deliver the required throughput.
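As an illustration of how these fields might interact, here is a minimal accumulator sketch. Only the field names come from the text above; the update logic (emitting an ADD/DEL diff whenever the rule hit flag flips) is an assumption:

```java
// Minimal sketch of an accumulator holding the state fields named above.
// The diff-emission rule is assumed, not taken from the source.
public class RuleAccumulator {
    String mergeRes = "";       // _mergeRes: merged product info
    boolean ruleRes = false;    // _ruleRes: whether the rule currently hits
    boolean changeFlag = false; // _changeFlag: did the hit flag just change?
    String ruleDiff = "";       // _ruleDiff: incremental diff ("ADD"/"DEL")

    public void accumulate(String mergedRecord, boolean hit) {
        mergeRes = mergedRecord;
        changeFlag = (hit != ruleRes);
        if (changeFlag) {
            ruleDiff = hit ? "ADD" : "DEL"; // push incremental update downstream
        } else {
            ruleDiff = "";                  // nothing to emit
        }
        ruleRes = hit;
    }
}
```

Emitting only the diff keeps downstream result‑set updates proportional to what changed, rather than to the full selection.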
2.6 Real‑Time Full Selection and Preview
HybridDB for MySQL (an OLAP database) provides trillion‑scale, millisecond‑level multi‑dimensional analysis, integrated with MaxCompute and Blink for seamless rule evaluation and KV‑store version switching.
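The KV‑store version switching mentioned here can be pictured as staging a full selection result under a fresh version key and then flipping a "current" pointer, so readers never observe a half‑written result set. This is a hypothetical sketch; the key layout is assumed:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of KV-store version switching: a full selection
// result is written under a new version key, then the "current" pointer
// is flipped atomically for readers.
public class VersionedKv {
    private final Map<String, List<String>> store = new HashMap<>();
    private volatile String currentVersion = null;

    public void publish(String version, List<String> productIds) {
        store.put(version, productIds); // stage the new result set
        currentVersion = version;       // single-reference flip for readers
    }

    public List<String> read() {
        return currentVersion == null ? List.of() : store.get(currentVersion);
    }
}
```

Old versions can then be garbage‑collected lazily without blocking reads.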
2.7 Real‑Time Recommendation Layer
Details are omitted for brevity; the recommendation layer builds on the real‑time selection results.
2.8 Performance Metrics
The system, named “Mach” (after the Mach number, a measure of speed), now powers almost all Xianyu feed‑type services, delivering high QPS, low latency, and robust reliability.
References
[1] RocketMQ: https://rocketmq.apache.org/
[2] Real‑time search solution based on Lucene.
[3] MaxCompute: https://www.aliyun.com/product/odps
[4] PostgreSQL NOTIFY: https://www.postgresql.org/docs/current/static/sql-notify.html
[5] Seastar: http://www.seastar-project.org
[6] Masstree: https://pdos.csail.mit.edu/papers/masstree:eurosys12.pdf
[7] HybridDB for MySQL: https://www.aliyun.com/product/petadata
Xianyu Technology
Official account of the Xianyu technology team