Evolution of Real-Time Computing at Youzan: From Storm to Flink and Future Directions
Youzan’s real‑time computing platform progressed from early Storm deployments through Spark Streaming to a Flink‑based architecture, adding unified task management, monitoring, and dedicated streaming clusters, while now pursuing SQL‑driven jobs, a Druid OLAP engine, and a future real‑time data warehouse.
Youzan is a merchant service company that provides e‑commerce solutions across all industries and scenarios. A large number of its business products rely on real‑time data processing, which is supported by a core technology component serving dozens of products and hundreds of streaming jobs (e.g., transaction dashboards, product statistics, log platforms, tracing, risk control).
1. Overview
The article outlines the development history and current architecture of Youzan's real‑time computing platform.
2. Real‑time Computing Development at Youzan
From a technology‑stack perspective, Youzan followed the industry trend: early Storm, then JStorm, Spark Streaming, and most recently Flink. The evolution can be divided into two phases: the initial (startup) phase and the platform‑ization phase.
2.1 Startup Phase
During this phase there was no overall plan for real‑time computing and no platform‑level task management, monitoring, or alerting. Users submitted jobs from the command line on AG servers, which made reliability and usability hard to guarantee, although many use cases were accumulated during this period.
2.1.1 Storm Introduction
In early 2014, Storm was introduced to decouple real‑time event statistics from business logic: jobs listened to MySQL binlog updates and wrote aggregated results back to MySQL or Redis. Over two years, nearly a hundred real‑time applications were built on Storm, but throughput was limited, and many workloads did not actually need Storm's per‑event latency.
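The pattern described above can be sketched in a few lines. This is an illustrative simulation, not Youzan's actual code: binlog‑style change events drive real‑time counters that are kept outside the business write path, with a plain dict standing in for Redis, and the event/field names (`orders`, `product_id`, `amount`) are hypothetical.

```python
from collections import defaultdict

def apply_binlog_event(counters, event):
    """Update per-product counters from one binlog-style change event.

    `event` is a hypothetical dict such as
    {"table": "orders", "op": "INSERT", "row": {"product_id": 1, "amount": 30}}.
    """
    if event["table"] == "orders" and event["op"] == "INSERT":
        pid = event["row"]["product_id"]
        counters[pid]["orders"] += 1
        counters[pid]["gmv"] += event["row"]["amount"]
    return counters

events = [
    {"table": "orders", "op": "INSERT", "row": {"product_id": 1, "amount": 30}},
    {"table": "orders", "op": "INSERT", "row": {"product_id": 1, "amount": 20}},
    {"table": "orders", "op": "UPDATE", "row": {"product_id": 2, "amount": 99}},
]

# A dict stands in for the Redis/MySQL result store.
redis_like_store = defaultdict(lambda: {"orders": 0, "gmv": 0})
for e in events:
    apply_binlog_event(redis_like_store, e)

print(redis_like_store[1])  # {'orders': 2, 'gmv': 50}
```

The point of the decoupling is that the business database only emits change events; the statistics job can be rewritten, restarted, or scaled without touching the transactional path.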
2.1.2 Spark Streaming Adoption
By late 2016, Spark Streaming replaced Storm for many workloads due to better throughput and performance. Users still submitted jobs via AG servers to a YARN cluster. By the end of 2017, Spark Streaming became the dominant engine.
2.1.3 Summary of Startup Issues
Lack of business management mechanisms made it hard to map running jobs to owners.
Separate monitoring/alerting tools for Storm and Spark Streaming caused duplication.
Compute resources were not isolated; offline and streaming jobs shared the same YARN pool, leading to contention.
Resource scheduling was rigid; tasks specified resources in launch scripts, hindering failover.
These problems highlighted the need for a unified real‑time computing platform.
2.2 Platform‑ization Phase
Key requirements identified:
Business management functions to record real‑time applications and associate them with owners.
Task‑level monitoring, automatic fault recovery, customizable alerts, and traffic dashboards.
Dedicated YARN clusters for streaming to avoid interference with offline jobs.
Zero‑downtime task migration between clusters.
In early 2018, the first phase of the platform was built, initially supporting Spark Streaming. After two months all Spark Streaming jobs were migrated, followed by Storm support and migration. AG servers were decommissioned, eliminating manual job submission.
2.2.1 Introducing Flink
By mid‑2018 both Storm and Spark Streaming were in production, but each had drawbacks. A comparative evaluation favored Flink for:
Lower latency (Flink processes events at sub‑second latency, versus the roughly 15‑second micro‑batch intervals used with Spark Streaming).
Comparable throughput.
Superior state storage (in‑memory or RocksDB).
Better SQL support.
More flexible APIs.
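The latency gap in the first point above is structural: in a micro‑batch engine, an event arriving inside a batch window is not processed until the window closes, so it waits on average about half the batch interval. A back‑of‑the‑envelope illustration (the numbers are illustrative, not benchmarks of Spark or Flink):

```python
def micro_batch_wait(arrival_s, interval_s=15.0):
    """Seconds an event waits before its micro-batch boundary is reached."""
    return interval_s - (arrival_s % interval_s)

# Events arriving every 0.5 s over a 30 s window.
arrivals = [i * 0.5 for i in range(60)]
avg_wait = sum(micro_batch_wait(t) for t in arrivals) / len(arrivals)

print(round(avg_wait, 2))  # close to interval / 2, i.e. several seconds
```

A per‑event engine adds no such queueing delay, which is why sub‑second end‑to‑end latency is feasible with Flink but not with seconds‑level micro‑batches.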
Consequently, Flink was added to the platform (see the blog “Flink in Youzan’s Real‑Time Computing”).
2.2.2 New Challenges
Despite a stable platform, onboarding efficiency became a bottleneck. The typical workflow required:
Learning the SDK (≈½ day).
Applying for upstream/downstream resources (hours).
Developing, testing, and deploying the job (1‑3 days).
Code quality depended on the individual developer; without systematic reviews, production issues were common.
The total onboarding time was 2‑3 days, prompting two strategic directions:
SQL‑based real‑time tasks.
Introducing additional technology stacks for simple analytical scenarios.
2.2.2.1 SQL‑Based Real‑Time Tasks
SQLification aims to simplify development and shorten deployment cycles. The roadmap includes:
Stream‑to‑stream jobs based on Kafka.
Stream‑to‑storage jobs using HBaseSink.
Support for user‑defined functions (UDFs).
Implementation is in progress.
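To make concrete what SQLification buys, consider a job a user would otherwise hand‑code against the SDK: it amounts to a single SQL‑like statement (for instance, a `GROUP BY` with a `COUNT(*)`) plus a UDF. The sketch below simulates that statement in plain Python; every name in it is hypothetical and it is not the platform's actual SQL dialect or API.

```python
from collections import Counter

def mask_shop_id(shop_id):
    """A toy user-defined function: anonymize the grouping key."""
    return f"shop-{shop_id % 100:02d}"

# A stand-in for a Kafka stream of order records.
stream = [{"shop_id": 101}, {"shop_id": 202}, {"shop_id": 101}]

# Equivalent of: SELECT mask_shop_id(shop_id), COUNT(*) FROM orders
#                GROUP BY mask_shop_id(shop_id)
result = Counter(mask_shop_id(r["shop_id"]) for r in stream)
print(dict(result))  # {'shop-01': 2, 'shop-02': 1}
```

Expressed as SQL plus a registered UDF, the same logic needs no SDK learning, no build pipeline, and no per‑job code review, which is exactly the onboarding cost the roadmap targets.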
2.2.2.2 Real‑Time OLAP Engine
For high‑frequency UV/PV statistics, a real‑time OLAP store is needed. After evaluating Kudu (C++) and Druid (Java), Druid was chosen for its query performance, ease of integration, and lower operational cost.
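For reference, the UV/PV metrics mentioned above reduce to two aggregations: PV counts every page view, while UV counts distinct users. OLAP stores such as Druid typically approximate the distinct count with sketches (e.g., HyperLogLog) to bound memory; the minimal sketch below uses an exact set for clarity, with illustrative data only.

```python
page_views = [
    {"user": "u1", "page": "/item/1"},
    {"user": "u2", "page": "/item/1"},
    {"user": "u1", "page": "/item/1"},
]

pv = len(page_views)                       # PV: every view counts
uv = len({v["user"] for v in page_views})  # UV: distinct users only

print(pv, uv)  # 3 2
```

Serving such queries at high frequency and with flexible dimensions is what a pre‑aggregating OLAP engine handles better than a hand‑rolled streaming job per metric.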
3. Future Plans
The immediate goal is to fully realize SQL‑based tasks, covering about 70% of use cases, thereby improving development efficiency. After that, Youzan plans to build a real‑time data warehouse, with an initial design shown in the diagram below.
Building a complete real‑time warehouse will also require metadata management, data quality components, and other supporting infrastructure.
4. Conclusion
Youzan’s real‑time computing has continuously evolved according to business needs, shifting between technologies to achieve the best cost‑performance balance. The article provides a timeline overview rather than deep technical details.
Finally, the author includes a brief contact note: the Youzan big‑data infrastructure team is responsible for DP, real‑time computing (Storm, Spark Streaming, Flink), offline computing (HDFS, YARN, Hive, Spark SQL), online storage (HBase), and real‑time OLAP (Druid). Interested readers can contact [email protected].
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.