Kuaishou Flink Real‑Time Architecture and Spring Festival Gala Assurance Practices
This article details Kuaishou's Flink‑based real‑time computing architecture, its massive cluster scale, and the comprehensive strategies—including overload protection, system stability, pressure testing, and resource guarantees—implemented to ensure reliable streaming for the 2020 Spring Festival Gala and its real‑time dashboard.
Liu Jiangang of Kuaishou describes how the company safeguarded its real‑time pipelines for the 2020 Spring Festival Gala. The article has four parts: a Flink overview, the gala real‑time guarantee scheme, a real‑time dashboard case, and future plans.
1. Kuaishou Flink Overview
Kuaishou operates a Flink cluster of over 3,000 machines that processes more than 20 trillion records per day, with a peak of 38 billion records. The platform supports four major scenarios: a hosted real‑time SQL platform, short‑video and live‑stream metric computation, machine‑learning data preprocessing for advertising models, and log splitting/synchronization.
2. Spring Festival Real‑Time Guarantee Scheme
To meet the unprecedented data volume of the CCTV Spring Festival Gala, Kuaishou designed a set of measures focusing on overload protection, system stability, pressure testing, and resource guarantee.
2.1 Overload Protection
When traffic spikes or a source becomes a bottleneck, Flink jobs may freeze or fail. Kuaishou combines health checks, intelligent throttling, and source‑side control. TaskManagers periodically report their health to the Master. When the Master detects extreme pressure, it limits all sources to 50% of their original QPS, then restores input in 10% increments as the job stabilises. A lightweight hot‑update model allows runtime adjustments (e.g., disabling snapshots, setting sampling rates, throttling sources) via a RESTful API without stopping jobs.
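The throttle-and-recover rule described above can be sketched in a few lines of Python. This is only an illustration, not Kuaishou's implementation: the class name `SourceThrottle`, the method `on_health_report`, and the reading of "10% increments" as 10% of the original QPS per stable health-check round are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class SourceThrottle:
    """Hypothetical controller for the rate limit applied to a job's sources."""

    original_qps: float
    percent: int = 100  # share of the original QPS currently allowed

    def on_health_report(self, overloaded: bool) -> float:
        """Process one health-check round and return the allowed QPS."""
        if overloaded:
            # Extreme pressure: cut the sources to 50% of the original QPS.
            self.percent = 50
        elif self.percent < 100:
            # Job is stable: give back 10% of the original QPS per round
            # (assumed interpretation of "10% increments").
            self.percent = min(100, self.percent + 10)
        return self.original_qps * self.percent / 100
```

With an original rate of 1,000 QPS, one overloaded report drops the limit to 500 QPS, and five consecutive healthy reports bring it back to 1,000 QPS.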
2.2 System Stability
Fault handling includes component‑wise troubleshooting (YARN, HDFS, Kafka, Zookeeper), fault‑injection drills, and a comprehensive fault‑prevention plan. The job‑management system offers high availability, checklist‑driven development standards, global log queries, rapid migration tools, and alert/metric dashboards for early issue detection.
2.3 Pressure Testing
A full‑scale simulation reproduces the gala workload. Data models are built by mapping user‑level statistics to topic‑level distributions. Traffic is then generated in one of three ways: data doubling (byte‑level scaling), time compression (raising QPS without altering the data distribution), and sample‑based generation (producing data that matches the target QPS and UV).
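Of the three generation methods, time compression is the easiest to illustrate. The sketch below assumes recorded events are `(timestamp_seconds, payload)` tuples sorted by timestamp; the function name `compress_timeline` is hypothetical. Dividing every offset from the first event by a constant factor raises QPS by that factor while keeping the order and relative spacing of the events.

```python
def compress_timeline(events, factor):
    """Replay recorded events `factor` times faster.

    `events` is a list of (timestamp_seconds, payload) tuples, assumed to be
    sorted by timestamp. Payloads are left untouched, so the data
    distribution is unchanged; only the arrival rate goes up.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not events:
        return []
    start = events[0][0]
    return [(start + (ts - start) / factor, payload) for ts, payload in events]
```

For example, three events recorded at 0 s, 2 s and 4 s are replayed at 0 s, 1 s and 2 s with `factor=2`, doubling the rate seen by the job under test.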
2.4 Resource Guarantee
Jobs are classified into three priority levels (P0, P1, P2). Before the gala, all P2 jobs are stopped to free resources for P0/P1. During the event, P1 resources can be downgraded to protect P0. After the gala, P2 jobs resume from the latest Kafka offset. Additional strategies include multi‑cluster redundancy, rapid scaling based on real‑time metrics (throughput, latency, snapshot health, physical health), and a tiered resource‑selection process.
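The priority policy can be written down as a small decision function. This is a sketch under assumptions: the phase names (`"before"`, `"during"`, `"after"`), the action labels, and the `p0_under_pressure` flag are hypothetical, and a real scheduler would act on cluster resources rather than return strings.

```python
from enum import IntEnum


class Priority(IntEnum):
    P0 = 0  # must never be degraded
    P1 = 1  # may give up resources to protect P0
    P2 = 2  # stopped for the duration of the gala


def plan_actions(jobs, phase, p0_under_pressure=False):
    """Map each job name to an action for the given phase of the event."""
    if phase not in ("before", "during", "after"):
        raise ValueError(f"unknown phase: {phase}")
    actions = {}
    for name, priority in jobs.items():
        if priority == Priority.P2:
            # P2 stays stopped until the gala is over, then resumes from
            # the latest Kafka offset instead of replaying the backlog.
            actions[name] = "resume_from_latest_offset" if phase == "after" else "stop"
        elif priority == Priority.P1 and phase == "during" and p0_under_pressure:
            actions[name] = "downgrade"
        else:
            actions[name] = "run"
    return actions
```

Calling `plan_actions({"dashboard": Priority.P0, "ads": Priority.P1, "adhoc": Priority.P2}, "during", p0_under_pressure=True)` keeps the dashboard running, downgrades the ads job, and leaves the ad hoc job stopped.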
3. Real‑Time Dashboard Case
The dashboard visualises over 100 live metrics (online users, red‑packet interactions, etc.) and must sustain million‑level QPS, sub‑second latency, and four‑nines (99.99%) availability. The architecture uses Flink as the core compute engine and Redis for fast KV lookups. Device‑ID deduplication is performed with bitmap structures, for which three implementations were considered: Flink+HBase, Flink+Redis, or Flink’s native dictionary. Flink+Redis was chosen for its balance of performance and simplicity.
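Bitmap deduplication works by first mapping each device ID to a dense integer and then setting the bit at that position, so a unique-device count is just the number of set bits. The sketch below is a single-process illustration only: `DeviceDictionary` is a hypothetical stand-in for the external ID dictionary (kept in Redis in the chosen design), and `UvBitmap` uses a Python integer where a production system would use a dedicated bitmap structure.

```python
class DeviceDictionary:
    """Hypothetical in-memory stand-in for the device-ID-to-integer dictionary."""

    def __init__(self):
        self._ids = {}

    def encode(self, device_id: str) -> int:
        # Assign the next dense integer the first time an ID is seen.
        return self._ids.setdefault(device_id, len(self._ids))


class UvBitmap:
    """One bitmap per metric; bit i is set once device i has been counted."""

    def __init__(self):
        self._bits = 0

    def add(self, position: int) -> bool:
        """Mark a device as seen; return True only on its first appearance."""
        mask = 1 << position
        is_new = not (self._bits & mask)
        self._bits |= mask
        return is_new

    def count(self) -> int:
        """Number of distinct devices recorded so far."""
        return bin(self._bits).count("1")
```

Sharing one dictionary across metrics lets each metric keep its own compact bitmap: feeding the IDs `a, b, a, c, b` into an "online users" bitmap yields a count of 3, regardless of how often each device reports.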
Deployment includes dual‑data‑center hot‑standby, multi‑link physical isolation, and seamless failover that shields users from underlying failures.
4. Future Plans
Kuaishou aims to promote SQL for unified batch‑stream processing, develop the SlimBase state backend for storage‑compute separation, enhance Flink’s self‑healing capabilities, build job‑diagnostic models for rapid issue localisation, and explore integrations with databases and Kubernetes.