Flink on Kubernetes: Kuaishou’s Practice, Migration, and Future Refactoring
This article details Kuaishou’s five‑year evolution of Flink: the background, the production refactoring for Kubernetes, migration practices, and planned improvements, with emphasis on architecture layers, resource management, observability, and testing strategies for large‑scale stream processing.
Background: Kuaishou’s Flink architecture has evolved over five years through three stages—initial real‑time platform construction (2018‑2020), deep optimization for stability and scale (2021‑2022), and migration to Kubernetes along with runtime adaptation and AI integration (2022‑2023).
Application Scenarios: Flink is used extensively for real‑time data streams (audio‑video, recommendation), unified batch‑stream processing, and large‑scale AI workloads, handling over a million CPU cores, 10 billion events per second, and petabytes of data daily.
Architecture Evolution: Early deployments ran on Yarn due to its scheduling performance and Hadoop ecosystem integration. From 2022‑2023 the system shifted to Kubernetes for unified resource and application management, better isolation, and cloud‑native benefits.
Current Architecture: The stack consists of a resource/storage layer (K8s/Yarn, HDFS, Kwaistore), a compute layer (Flink Streaming & Batch runtime), an application layer (online/offline platforms), and a business layer serving various company departments.
Production Refactoring: The migration to K8s addressed core pain points—a smooth Yarn‑to‑K8s transition, minimal user impact, a unified resource abstraction, and extensive testing (integration, fault, performance, regression). System components such as Dispatcher, Resource Manager, JobMaster, LogService, MetricReporter, Ingress/Service, and Kwaistore were redesigned for Kubernetes.
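The "unified resource abstraction" idea can be sketched as follows: the platform accepts one job spec and renders backend‑specific Flink options, so moving a job between Yarn and K8s becomes a configuration flip rather than a rewrite. The `JobSpec` schema below is illustrative, not Kuaishou's actual one; the option keys are standard Flink configuration names.

```python
from dataclasses import dataclass

@dataclass
class JobSpec:
    """Backend-agnostic job description (field names are illustrative)."""
    name: str
    parallelism: int
    tm_cores: int
    tm_memory_mb: int
    backend: str  # "yarn" or "k8s"

def render_launch_config(spec: JobSpec) -> dict:
    """Render one spec into backend-specific Flink options."""
    common = {
        "parallelism.default": spec.parallelism,
        "taskmanager.memory.process.size": f"{spec.tm_memory_mb}m",
    }
    if spec.backend == "yarn":
        common["yarn.containers.vcores"] = spec.tm_cores
    elif spec.backend == "k8s":
        common["kubernetes.taskmanager.cpu"] = spec.tm_cores
    else:
        raise ValueError(f"unknown backend: {spec.backend}")
    return common
```

With this shape, switching a job's target cluster touches only the `backend` field; everything else in the spec, and the user-facing platform API, stays the same.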
Observability & Debugging: Metrics from Flink and K8s are aggregated via a KafkaGateway into a unified OLAP store, reducing metric explosion. A dedicated log service decouples logs from pod lifecycles, storing them on hostPath and exposing them via a web service for easier troubleshooting.
Migration Practice: Migration includes seamless user‑level configuration switching between Yarn and K8s, batch migration using Flink queues, health checks with a 0‑10 scoring model, and one‑click rollback for unhealthy jobs.
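A 0‑10 health‑scoring model with a rollback threshold might look like the sketch below. The signals, weights, and threshold are assumptions chosen for illustration, not Kuaishou's actual scoring rules.

```python
from dataclasses import dataclass

@dataclass
class JobHealth:
    """Post-migration signals for one job (names are illustrative)."""
    checkpoint_ok: bool   # checkpoints completing on K8s
    restart_count: int    # restarts since migration
    lag_ratio: float      # consumer lag vs. pre-migration baseline (1.0 = same)

def health_score(h: JobHealth) -> int:
    """Map signals to a 0-10 score; weights here are assumed, not Kuaishou's."""
    score = 10
    if not h.checkpoint_ok:
        score -= 5
    score -= min(h.restart_count, 3)  # up to -3 for instability
    if h.lag_ratio > 1.5:
        score -= 2                    # falling behind the Yarn baseline
    return max(score, 0)

def should_rollback(h: JobHealth, threshold: int = 6) -> bool:
    """Trigger one-click rollback to Yarn when the score drops below threshold."""
    return health_score(h) < threshold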
Future Refactoring: Planned work focuses on compute‑storage separation with Kwaistore, priority‑based resource preemption, runtime adaptation for dynamic scaling, and unifying real‑time, near‑real‑time, and batch jobs on Kubernetes.
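Priority‑based preemption can be illustrated with a simple victim‑selection routine: when a high‑priority job needs cores, evict the lowest‑priority running jobs until the demand is met. This is a minimal sketch of the general technique, not Flink's or Kuaishou's scheduler logic.

```python
def pick_preemption_victims(running, needed_cores, incoming_priority):
    """Choose lowest-priority jobs to preempt until enough cores are freed.

    running: list of (job_id, priority, cores); lower number = lower priority.
    Returns job_ids to evict, or [] if preemption cannot satisfy the demand
    (in which case nothing should be killed).
    """
    victims, freed = [], 0
    # Walk candidates from lowest priority upward; never touch jobs at or
    # above the incoming job's priority.
    for job_id, prio, cores in sorted(running, key=lambda j: j[1]):
        if prio >= incoming_priority:
            break
        victims.append(job_id)
        freed += cores
        if freed >= needed_cores:
            return victims
    return []
```

Returning an empty list when the demand cannot be met is deliberate: preempting jobs without admitting the incoming one would waste work on both sides.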
Overall, the talk outlines Kuaishou’s comprehensive journey of scaling Flink on Kubernetes, emphasizing architecture redesign, operational robustness, and forward‑looking enhancements.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.