Flink on Kubernetes: Kuaishou’s Practice, Migration, and Future Refactoring
This article details Kuaishou’s five‑year evolution of Flink: the background, the production refactoring for Kubernetes, migration practices, and planned improvements, with emphasis on architecture layers, resource management, observability, and testing strategies for large‑scale stream processing.
Background: Kuaishou’s Flink architecture has evolved over five years through three stages—initial real‑time platform construction (2018‑2020), deep optimization for stability and scale (2021‑2022), and migration to Kubernetes along with runtime adaptation and AI integration (2022‑2023).
Application Scenarios: Flink is used extensively for real‑time data streams (audio‑video, recommendation), unified batch‑stream processing, and large‑scale AI workloads, handling over a million CPU cores, 10 billion events per second, and petabytes of data daily.
Architecture Evolution: Early deployments ran on Yarn due to its scheduling performance and Hadoop ecosystem integration. From 2022‑2023 the system shifted to Kubernetes for unified resource and application management, better isolation, and cloud‑native benefits.
Current Architecture: The stack consists of a resource/storage layer (K8s/Yarn, HDFS, Kwaistore), a compute layer (Flink Streaming & Batch runtime), an application layer (online/offline platforms), and a business layer serving various company departments.
Production Refactoring: The migration to K8s addressed core pain points—a smooth Yarn‑to‑K8s transition, minimal user impact, a unified resource abstraction, and extensive testing (integration, fault, performance, regression). System components such as Dispatcher, Resource Manager, JobMaster, LogService, MetricReporter, Ingress/Service, and Kwaistore were redesigned for Kubernetes.
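The "unified resource abstraction" idea can be sketched as follows: the platform accepts one job spec and renders backend‑specific Flink options, so moving a job between Yarn and K8s becomes a configuration flip rather than a rewrite. The `JobSpec` schema below is illustrative, not Kuaishou's actual one; the option keys are standard Flink configuration names.

```python
from dataclasses import dataclass

@dataclass
class JobSpec:
    """Backend-agnostic job description (field names are illustrative)."""
    name: str
    parallelism: int
    tm_cores: int
    tm_memory_mb: int
    backend: str  # "yarn" or "k8s"

def render_launch_config(spec: JobSpec) -> dict:
    """Render one spec into backend-specific Flink options."""
    common = {
        "parallelism.default": spec.parallelism,
        "taskmanager.memory.process.size": f"{spec.tm_memory_mb}m",
    }
    if spec.backend == "yarn":
        common["yarn.containers.vcores"] = spec.tm_cores
    elif spec.backend == "k8s":
        common["kubernetes.taskmanager.cpu"] = spec.tm_cores
    else:
        raise ValueError(f"unknown backend: {spec.backend}")
    return common
```

With this shape, switching a job's target cluster touches only the `backend` field; everything else in the spec, and the user-facing platform API, stays the same.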
Observability & Debugging: Metrics from Flink and K8s are aggregated via a KafkaGateway into a unified OLAP store, reducing metric explosion. A dedicated log service decouples logs from pod lifecycles, storing them on hostPath and exposing them via a web service for easier troubleshooting.
Migration Practice: Migration includes seamless user‑level configuration switching between Yarn and K8s, batch migration using Flink queues, health checks with a 0‑10 scoring model, and one‑click rollback for unhealthy jobs.
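A 0‑10 health‑scoring model with a rollback threshold might look like the sketch below. The signals, weights, and threshold are assumptions chosen for illustration, not Kuaishou's actual scoring rules.

```python
from dataclasses import dataclass

@dataclass
class JobHealth:
    """Post-migration signals for one job (names are illustrative)."""
    checkpoint_ok: bool   # checkpoints completing on K8s
    restart_count: int    # restarts since migration
    lag_ratio: float      # consumer lag vs. pre-migration baseline (1.0 = same)

def health_score(h: JobHealth) -> int:
    """Map signals to a 0-10 score; weights here are assumed, not Kuaishou's."""
    score = 10
    if not h.checkpoint_ok:
        score -= 5
    score -= min(h.restart_count, 3)  # up to -3 for instability
    if h.lag_ratio > 1.5:
        score -= 2                    # falling behind the Yarn baseline
    return max(score, 0)

def should_rollback(h: JobHealth, threshold: int = 6) -> bool:
    """Trigger one-click rollback to Yarn when the score drops below threshold."""
    return health_score(h) < threshold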
Future Refactoring: Planned work focuses on compute‑storage separation with Kwaistore, priority‑based resource preemption, runtime adaptation for dynamic scaling, and unifying real‑time, near‑real‑time, and batch jobs on Kubernetes.
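Priority‑based preemption can be illustrated with a simple victim‑selection routine: when a high‑priority job needs cores, evict the lowest‑priority running jobs until the demand is met. This is a minimal sketch of the general technique, not Flink's or Kuaishou's scheduler logic.

```python
def pick_preemption_victims(running, needed_cores, incoming_priority):
    """Choose lowest-priority jobs to preempt until enough cores are freed.

    running: list of (job_id, priority, cores); lower number = lower priority.
    Returns job_ids to evict, or [] if preemption cannot satisfy the demand
    (in which case nothing should be killed).
    """
    victims, freed = [], 0
    # Walk candidates from lowest priority upward; never touch jobs at or
    # above the incoming job's priority.
    for job_id, prio, cores in sorted(running, key=lambda j: j[1]):
        if prio >= incoming_priority:
            break
        victims.append(job_id)
        freed += cores
        if freed >= needed_cores:
            return victims
    return []
```

Returning an empty list when the demand cannot be met is deliberate: preempting jobs without admitting the incoming one would waste work on both sides.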
Overall, the talk outlines Kuaishou’s comprehensive journey of scaling Flink on Kubernetes, emphasizing architecture redesign, operational robustness, and forward‑looking enhancements.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.