
Evolution of Real-Time Computing at Youzan: From Storm to Flink and Future Directions

Youzan’s real‑time computing platform progressed from early Storm deployments through Spark Streaming to a Flink‑based architecture, adding unified task management, monitoring, and dedicated streaming clusters, while now pursuing SQL‑driven jobs, a Druid OLAP engine, and a future real‑time data warehouse.

Youzan Coder

Youzan is a merchant service company that provides e‑commerce solutions across all industries and scenarios. Many of its business products rely on real‑time data processing, which makes real‑time computing a core technology component: it serves dozens of products and hundreds of streaming jobs (e.g., transaction dashboards, product statistics, log platforms, tracing, risk control).

1. Overview

The article outlines the development history and current architecture of Youzan's real‑time computing platform.

2. Real‑time Computing Development at Youzan

From a technology‑stack perspective, Youzan followed the industry trend: early Storm, then JStorm, Spark Streaming, and most recently Flink. The evolution can be divided into two phases: the initial (startup) phase and the platform‑ization phase.

2.1 Startup Phase

During this phase there was no overall plan for real‑time computing and no platform‑level task management, monitoring, or alerting. Users submitted jobs from the command line on AG servers, so reliability and usability were hard to guarantee, although many use cases accumulated.

2.1.1 Storm Introduction

In early 2014 Storm was used to decouple real‑time event statistics from business logic, listening to MySQL binlog updates and writing results back to MySQL or Redis. Over two years, nearly a hundred real‑time applications were built on Storm, but the system suffered from limited throughput and latency insensitivity.

2.1.2 Spark Streaming Adoption

In late 2016, Spark Streaming began replacing Storm for many workloads because of its better throughput and performance. Users still submitted jobs from AG servers to a YARN cluster. By the end of 2017, Spark Streaming had become the dominant engine.

2.1.3 Summary of Startup Issues

Lack of business management mechanisms made it hard to map running jobs to owners.

Separate monitoring/alerting tools for Storm and Spark Streaming caused duplication.

Compute resources were not isolated; offline and streaming jobs shared the same YARN pool, leading to contention.

Resource scheduling was rigid; tasks specified resources in launch scripts, hindering failover.

These problems highlighted the need for a unified real‑time computing platform.

2.2 Platform‑ization Phase

2.2.1 Building the Platform

Key requirements identified:

Business management functions to record real‑time applications and associate them with owners.

Task‑level monitoring, automatic fault recovery, customizable alerts, and traffic dashboards.

Dedicated YARN clusters for streaming to avoid interference with offline jobs.

Zero‑downtime task migration between clusters.
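
The first, second, and fourth requirements above can be pictured as a small job registry. The following is only a minimal sketch under stated assumptions, not Youzan's implementation: the class names, field names, cluster names, and the lag‑based alert threshold are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingJob:
    """One registered real-time application (all field names are hypothetical)."""
    name: str
    engine: str                # e.g. "storm", "spark-streaming", "flink"
    owner: str                 # person or team to notify
    cluster: str               # YARN cluster that currently runs the job
    max_lag_seconds: int = 60  # per-job alert threshold (assumed metric)


@dataclass
class JobRegistry:
    jobs: dict = field(default_factory=dict)

    def register(self, job: StreamingJob) -> None:
        self.jobs[job.name] = job

    def owner_of(self, name: str) -> str:
        return self.jobs[name].owner

    def alerts(self, observed_lag: dict) -> list:
        """Return (owner, job name) pairs whose observed lag exceeds the job's threshold."""
        return [
            (self.jobs[name].owner, name)
            for name, lag in sorted(observed_lag.items())
            if name in self.jobs and lag > self.jobs[name].max_lag_seconds
        ]

    def migrate(self, name: str, target_cluster: str) -> None:
        """Record a cluster change; a real migration would also move state and offsets."""
        self.jobs[name].cluster = target_cluster


registry = JobRegistry()
registry.register(StreamingJob("trade-dashboard", "spark-streaming", "team-trade", "yarn-streaming-a"))
registry.register(StreamingJob("risk-control", "storm", "team-risk", "yarn-streaming-a", max_lag_seconds=10))
```

With such a record, `registry.owner_of("trade-dashboard")` answers the "who owns this job" question that was hard to answer in the startup phase, and `registry.alerts({...})` routes a threshold breach to the right owner.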

In early 2018, the first phase of the platform was built, initially supporting Spark Streaming. After two months all Spark Streaming jobs were migrated, followed by Storm support and migration. AG servers were decommissioned, eliminating manual job submission.

By mid‑2018 both Storm and Spark Streaming were in production, but each had drawbacks. A comparative evaluation favored Flink for:

Lower latency (Flink processes streams at sub‑second level versus Spark’s 15‑second micro‑batches).

Comparable throughput.

Superior state storage (in‑memory or RocksDB).

Better SQL support.

More flexible APIs.

Consequently, Flink was added to the platform (see the blog “Flink in Youzan’s Real‑Time Computing”).
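
The latency point in the comparison can be made concrete with a little arithmetic. The sketch below is an illustration only: it assumes a 15‑second micro‑batch interval as quoted above, ignores processing time, and uses made‑up arrival times. It computes how long each event waits before the micro‑batch that contains it is closed; a record‑at‑a‑time engine does not have this waiting component.

```python
def micro_batch_wait(arrival_s: float, batch_interval_s: float) -> float:
    """Seconds an event waits until the micro-batch containing it is closed."""
    batches_elapsed = int(arrival_s // batch_interval_s)
    batch_close = (batches_elapsed + 1) * batch_interval_s
    return batch_close - arrival_s


# Made-up arrival times in seconds, for illustration only.
arrivals = [0.0, 3.0, 7.5, 14.0, 15.0, 29.0]
waits = [micro_batch_wait(t, 15.0) for t in arrivals]
average_wait = sum(waits) / len(waits)
```

Under these assumptions an event waits anywhere from almost nothing up to the full 15 seconds, depending on where in the batch window it arrives, before any processing even starts.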

2.2.2 New Challenges

Despite a stable platform, onboarding efficiency became a bottleneck. The typical workflow required:

Learning the SDK (≈½ day).

Applying for upstream/downstream resources (hours).

Developing, testing, and deploying the job (1‑3 days).

Ensuring code quality, which was difficult without systematic reviews and led to production issues.

The total onboarding time was 2‑3 days, prompting two strategic directions:

SQL‑based real‑time tasks.

Introducing additional technology stacks for simple analytical scenarios.

2.2.2.1 SQL‑Based Real‑Time Tasks

SQLification aims to simplify development and shorten deployment cycles. The roadmap includes:

Stream‑to‑stream jobs based on Kafka.

Stream‑to‑storage jobs using HBaseSink.

Support for user‑defined functions (UDFs).

Implementation is in progress.
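
To illustrate why a SQL‑defined job is quicker to write than an SDK‑based one, here is a minimal sketch. It is not Flink SQL and not Youzan's implementation: Python's built‑in `sqlite3` module stands in for the streaming SQL engine, a plain list stands in for a Kafka topic, and the table name `orders`, its columns, and the UDF `mask_phone` are hypothetical. The point is only that the whole job reduces to one SQL statement plus a registered UDF.

```python
import sqlite3


def mask_phone(phone: str) -> str:
    """Example UDF: hide the middle digits of a phone number."""
    return phone[:3] + "****" + phone[-4:]


# The "job" is just this statement; filtering and projection are declarative.
JOB_SQL = """
    SELECT shop_id, mask_phone(buyer_phone) AS buyer, amount
    FROM orders
    WHERE amount >= 100
"""


def run_sql_job(batch: list) -> list:
    """Apply JOB_SQL to one batch of (shop_id, buyer_phone, amount) events."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("mask_phone", 1, mask_phone)  # register the UDF
    conn.execute("CREATE TABLE orders (shop_id INTEGER, buyer_phone TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", batch)
    rows = conn.execute(JOB_SQL).fetchall()
    conn.close()
    return rows


# A list stands in for the upstream topic; the result stands in for the sink.
source_batch = [(1, "13812345678", 250), (2, "13987654321", 40), (1, "13700001111", 100)]
sink = run_sql_job(source_batch)
```

In a real streaming engine the statement would run continuously over an unbounded source rather than over one in‑memory batch, but the user‑facing artifact is the same: a query and, where needed, a UDF.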

2.2.2.2 Real‑Time OLAP Engine

For high‑frequency UV/PV statistics, a real‑time OLAP store is needed. After evaluating Kudu (C++) and Druid (Java), Druid was chosen for its query performance, ease of integration, and lower operational cost.
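
For readers unfamiliar with the metrics: PV (page views) counts every view event, while UV (unique visitors) counts distinct users. The sketch below shows the rollup such a store has to answer, computed naively per minute and per page. It is an illustration under assumptions, not how Druid works internally: the event layout `(epoch_seconds, user_id, page)` and the one‑minute granularity are hypothetical.

```python
from collections import defaultdict


def pv_uv_per_minute(events):
    """events: iterable of (epoch_seconds, user_id, page).

    Returns {(minute_start, page): (pv, uv)}.
    """
    views = defaultdict(int)
    visitors = defaultdict(set)
    for ts, user_id, page in events:
        key = (ts - ts % 60, page)   # truncate the timestamp to the minute
        views[key] += 1              # PV: every event counts
        visitors[key].add(user_id)   # UV: each user counts once per bucket
    return {key: (views[key], len(visitors[key])) for key in views}


# Made-up events for illustration.
events = [
    (1_000_020, "u1", "/item/1"),
    (1_000_025, "u1", "/item/1"),
    (1_000_050, "u2", "/item/1"),
    (1_000_085, "u1", "/item/1"),
]
stats = pv_uv_per_minute(events)
```

Keeping an exact set of users per bucket becomes expensive at high traffic, which is one reason such workloads are usually handed to a dedicated OLAP engine; engines of this kind commonly offer approximate distinct‑count aggregations instead of exact sets.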

3. Future Plans

The immediate goal is to fully realize SQL‑based tasks, covering about 70% of use cases, thereby improving development efficiency. After that, Youzan plans to build a real‑time data warehouse, with an initial design shown in the diagram below.

Building a complete real‑time warehouse will also require metadata management, data quality components, and other supporting infrastructure.

4. Conclusion

Youzan’s real‑time computing has continuously evolved according to business needs, shifting between technologies to achieve the best cost‑performance balance. The article provides a timeline overview rather than deep technical details.

The Youzan big‑data infrastructure team is responsible for DP, real‑time computing (Storm, Spark Streaming, Flink), offline computing (HDFS, YARN, Hive, Spark SQL), online storage (HBase), and real‑time OLAP (Druid). Interested readers can contact [email protected].
