Star River Data Scheduling Platform: Architecture, Evolution, and Intelligent Operations at 58.com
This article details the design, evolution, and core capabilities of 58.com's self‑developed Star River data scheduling platform, covering its positioning, architectural challenges, high‑availability master design, intelligent monitoring, baseline management, and future roadmap for big‑data operations.
As big‑data workloads expand, effective operations become crucial; 58.com introduced the self‑built Star River data development platform to provide a robust scheduling system that connects low‑level big‑data components with upper‑level applications.
The scheduling system acts as the "heart" of the data middle platform, managing task dependencies, ensuring timely execution, and addressing challenges of stability, performance, and extensibility amid millions of daily tasks.
Its evolution spans three phases: from 2016 to 2019, a custom scheduler that eventually hit scalability limits; in 2020, a major architectural overhaul that introduced a Master/Worker model, Quartz‑driven task generation, Kafka‑based state propagation, and CGroup resource isolation; and from 2021 onward, the addition of intelligent operations, baseline monitoring, and resource‑aware scheduling.
Star River’s core capabilities include a visual drag‑and‑drop UI, support for diverse task types (Hive, Shell, MR, SparkSQL, Python, custom plugins), high‑availability with fault‑tolerant master failover via Zookeeper locks, decentralized master election, and flexible dependency handling (time, event, self‑dependency).
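To illustrate the failover idea, here is a minimal sketch of lock-based master election. The article only says that failover uses Zookeeper locks, so this uses an in-process stand-in for ZooKeeper's ephemeral-node lock; the names (`ZkLockRegistry`, `MasterCandidate`) are illustrative, not Star River's actual API.

```python
import threading

class ZkLockRegistry:
    """In-process stand-in for a ZooKeeper lock path: the first client to
    create the ephemeral node holds the master lock; when that client's
    session ends, the node vanishes and another candidate can acquire it."""
    def __init__(self):
        self._lock = threading.Lock()
        self._holder = None

    def try_acquire(self, client_id):
        with self._lock:
            if self._holder is None:
                self._holder = client_id
                return True
            return False

    def release(self, client_id):
        # Models the ephemeral node disappearing when the session expires.
        with self._lock:
            if self._holder == client_id:
                self._holder = None

class MasterCandidate:
    """Each master node races to create the lock; the winner becomes the
    active master, the losers stay on standby and retry later."""
    def __init__(self, node_id, registry):
        self.node_id = node_id
        self.registry = registry
        self.is_active = False

    def campaign(self):
        self.is_active = self.registry.try_acquire(self.node_id)
        return self.is_active

registry = ZkLockRegistry()
m1 = MasterCandidate("master-1", registry)
m2 = MasterCandidate("master-2", registry)
m1.campaign()                 # master-1 wins the lock and becomes active
m2.campaign()                 # master-2 stays on standby
registry.release("master-1")  # simulate master-1's session expiring
m2.campaign()                 # the standby node takes over
```

In a real deployment the ephemeral node's deletion on session loss is what makes the election decentralized: no node has to observe the old master fail directly, it only has to retry the lock.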
Architecturally, it distinguishes between static workflow definitions (as in Airflow) and dynamic, implicit ones, opting for the latter to simplify task management; the Master orchestrates task scanning, dependency checks, and rate limiting, and dispatches work to Workers, which run tasks in isolated threads and report status through Kafka and Zookeeper.
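The Master's dependency check under a dynamic, implicit workflow model can be sketched as follows: a task instance becomes dispatchable once every upstream instance for the same cycle has succeeded, and a self-dependent task additionally waits for its own previous-cycle instance. This is a sketch under those stated assumptions; the function and field names are hypothetical, not Star River's real interface.

```python
def is_dispatchable(task, cycle, status, deps, self_dep=frozenset()):
    """Return True when a task instance for `cycle` can be dispatched.

    status   -- dict mapping (task, cycle) -> state string, e.g. "SUCCESS"
    deps     -- dict mapping task -> list of upstream tasks
    self_dep -- set of tasks whose previous-cycle instance must also succeed
    """
    # Event dependency: every upstream instance for this cycle must succeed.
    for upstream in deps.get(task, ()):
        if status.get((upstream, cycle)) != "SUCCESS":
            return False
    # Self-dependency: the same task's previous cycle must have succeeded.
    # A missing record (the very first cycle) is treated as satisfied.
    if task in self_dep and status.get((task, cycle - 1), "SUCCESS") != "SUCCESS":
        return False
    return True

deps = {"daily_report": ["etl_orders", "etl_users"]}
status = {("etl_orders", 20240101): "SUCCESS",
          ("etl_users", 20240101): "RUNNING"}
is_dispatchable("daily_report", 20240101, status, deps)  # blocked: etl_users still running
status[("etl_users", 20240101)] = "SUCCESS"
is_dispatchable("daily_report", 20240101, status, deps)  # now dispatchable
```

The appeal of the implicit model is visible here: the "workflow" is never materialized as a DAG object, it emerges cycle by cycle from per-task dependency declarations, so adding or removing one task does not require editing a shared workflow definition.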
Intelligent operations address task‑volume growth through multi‑dimensional throttling, tiered protection (P0‑P2), baseline monitoring with warning and breach thresholds, expected‑time prediction models based on historical runtimes, and key‑path analysis to optimize critical paths.
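The expected-time prediction behind baseline monitoring can be sketched in a few lines. The article says predictions come from historical runtimes with separate warning and breach thresholds; this sketch assumes a simple estimator (median finish time plus one standard deviation) and a fixed warning margin before the deadline — both are my assumptions, not the platform's actual model.

```python
import statistics

def baseline_thresholds(history_finish_minutes, deadline_minute, warn_margin=30):
    """Predict a task's finish time (minute-of-day) from historical finishes
    and flag it against the baseline's warning and breach thresholds.

    history_finish_minutes -- past finish times, as minutes after midnight
    deadline_minute        -- the baseline breach threshold
    warn_margin            -- minutes before the deadline where warnings start
                              (assumed value, not from the article)
    """
    spread = statistics.pstdev(history_finish_minutes) if len(history_finish_minutes) > 1 else 0
    # Median + one stddev: a conservative "expected finish" estimate.
    expected = statistics.median(history_finish_minutes) + spread
    warning_at = deadline_minute - warn_margin
    return {"expected_finish": expected,
            "warn": expected > warning_at,      # warning threshold crossed
            "breach": expected > deadline_minute}  # baseline itself missed

# A P0 task that usually finishes between 5:00 and 5:40, baseline at 6:00:
baseline_thresholds([300, 310, 320, 330, 340], deadline_minute=360)
```

Wired into key-path analysis, the same estimate per task lets the scheduler sum expected durations along the critical path and raise the warning before the terminal task actually misses its baseline.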
Future plans focus on integrating data quality checks, enhancing intelligent resource allocation, and automating data‑warehouse task creation to further lower the barrier for big‑data users.
The session concludes with a Q&A and an invitation to follow DataFun for more big‑data and AI insights.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.