
Scaling WeChat’s Big Data and AI Workloads on Kubernetes: Challenges and Optimizations

This article details WeChat's migration of large‑scale big data and AI workloads to a cloud‑native Kubernetes platform, discussing performance bottlenecks, API server and ETCD overload protection, scheduler enhancements, observability solutions, resource utilization gains, and future serverless directions.

DataFunSummit

With the rapid development of cloud computing, big data, and artificial intelligence, and the growing demand for integrated applications, Kubernetes‑based cloud‑native orchestration has become a popular way to decouple the application layer from the IaaS layer, attracting many big data and AI workloads.

Since 2020, WeChat's big data platform has undergone systematic cloud migration, moving over ten open‑source and proprietary frameworks such as Flink, Spark, TensorFlow, MPI, and PyTorch to a cloud‑native foundation that offers high stability and performance.

At the DA Shenzhen Digital Intelligence Conference (July 25‑26), senior engineer Wu Qianhao from WeChat’s technical architecture department presented the optimization practices for scaling big data jobs on Kubernetes.

1. Performance collapse and operational black‑box issues: Wu explained that Spark Submit creates one controller per job; the resulting flood of short‑lived controllers stresses the API server and ETCD through frequent pod creation and heavy scheduler load.

2. Comparing Kubernetes with Hadoop YARN: While YARN’s scheduling throughput is several times higher for offline tasks, Kubernetes offers a richer ecosystem and operational benefits; however, large‑scale job submission still pressures the API server and ETCD, requiring full‑stack optimizations.

3. API server optimization strategies: Implemented rate limiting and circuit breaking, forced expensive List requests to be served from cache, introduced business‑dedicated API servers to isolate traffic, and added token‑bucket throttling keyed by UserAgent.
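Per‑UserAgent token‑bucket throttling can be sketched as follows. This is a minimal illustrative model, not WeChat’s actual implementation; the class and parameter names (`TokenBucket`, `PerAgentLimiter`, `rate`, `capacity`) are assumptions for the sketch. The key idea is that each client identity gets its own bucket, so one noisy framework cannot exhaust the shared request budget.

```python
import time
from collections import defaultdict

class TokenBucket:
    """A token bucket: refills at `rate` tokens/sec, holds up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

class PerAgentLimiter:
    """One bucket per UserAgent, so a flood of spark-submit requests
    cannot starve, say, flink-client traffic."""
    def __init__(self, rate: float, capacity: float):
        self.buckets = defaultdict(lambda: TokenBucket(rate, capacity))

    def allow(self, user_agent: str) -> bool:
        return self.buckets[user_agent].allow()
```

In practice the Kubernetes API server’s built‑in Priority and Fairness machinery plays a similar role; the sketch above only shows the throttling logic in isolation.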

4. Scheduler throughput improvements: Adopted Volcano for batch scheduling in early stages, later extended Kubernetes Scheduler Framework with plugins such as Capacity Scheduling, resource borrowing, and fair multi‑queue scheduling, achieving significant throughput gains.
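The fairness and borrowing ideas behind capacity scheduling can be sketched in a few lines. This is a simplified model under assumed names (`Queue`, `pick_queue`, `guaranteed`, `limit`), not the actual Volcano or Scheduler Framework plugin code: each queue has a guaranteed quota, may borrow idle cluster capacity up to a hard limit, and the most under‑served queue (lowest usage‑to‑guarantee ratio) is admitted first.

```python
class Queue:
    def __init__(self, name: str, guaranteed: int, limit: int):
        self.name = name
        self.guaranteed = guaranteed  # quota always available to this queue
        self.limit = limit            # hard cap, including borrowed capacity
        self.used = 0                 # resources currently consumed

def pick_queue(queues, cluster_free: int):
    """Pick the most under-served queue that can still admit one unit.

    A queue may exceed its guarantee by borrowing idle cluster capacity,
    but never its limit. Fairness: lowest used/guaranteed ratio goes first.
    """
    eligible = [q for q in queues if q.used < q.limit and cluster_free > 0]
    if not eligible:
        return None
    return min(eligible, key=lambda q: q.used / q.guaranteed)
```

Real implementations also weigh multiple resource dimensions (CPU, memory, GPU) and support preemptive reclaim of borrowed capacity; the sketch keeps a single dimension for clarity.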

5. Observability and alerting: Integrated internal monitoring tools for Flink, Spark, and other frameworks, collected container metrics via Kvass + Thanos into Prometheus, and stored audit logs in CLS, enabling detailed job, component, and resource analysis.

6. MPI job coordination: Used PodGroup and gang‑scheduling to achieve all‑or‑nothing task start, with community Scheduler‑Plugins providing Coscheduling support.
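The all‑or‑nothing semantics of gang scheduling reduce to one check: a PodGroup is admitted only if at least its `minMember` pods can be placed, otherwise none are, which prevents a partially started MPI job from deadlocking while holding resources. A minimal sketch of that admission rule (function names are illustrative, not the Coscheduling plugin’s API):

```python
def gang_schedulable(pod_group_size: int, min_member: int, free_slots: int) -> bool:
    """All-or-nothing admission: the group may start only if at least
    `min_member` of its pods fit in the currently free slots."""
    placeable = min(pod_group_size, free_slots)
    return placeable >= min_member

def schedule_gang(pods, min_member: int, free_slots: int):
    """Return the pods to bind, or [] if the gang cannot start yet."""
    if not gang_schedulable(len(pods), min_member, free_slots):
        return []
    return pods[: min(len(pods), free_slots)]
```

The community Coscheduling plugin implements this idea as Scheduler Framework extension points (e.g. rejecting pods at the permit stage until the whole group fits) rather than as a single function.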

7. Post‑optimization performance: Job submission rate increased from under 1 k/min to 6 k/min, and API server/ETCD P99 write latency dropped nearly tenfold, validated through overload fault injection and rate‑limit testing.

8. Resource utilization gains: Overall utilization improved by over 20 %; HPA and VPA were employed for dynamic scaling, and QoS policies (CPU preemption, memory OOM priority, I/O throttling) protected mixed‑workload environments.
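For context, the HPA’s documented scaling rule is `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, with a tolerance band so small metric wobbles do not cause replica flapping. A minimal sketch of that decision (the tolerance value and bounds here are illustrative defaults, not WeChat’s settings):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """HPA-style scaling: desired = ceil(current * metric/target),
    skipped when the ratio is within the tolerance band, and
    clamped to [min_replicas, max_replicas]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

VPA instead adjusts each pod’s resource requests in place; combining both with QoS‑based eviction priorities is what makes mixed online/offline deployment safe.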

9. Value for smaller enterprises: Many optimizations—custom scheduler plugins, ETCD compression, and modular control components—are open‑source and can be adapted, though cross‑cloud multi‑cluster scenarios remain under development.

10. Future outlook: Exploration of serverless frameworks such as Knative/OpenFaaS and edge‑lightweight Kubernetes nodes is planned, with attention to network latency and offline synchronization challenges.

The conference also featured lightning talks, workshops, and a discounted ticket offer (9% off) for attendees, with travel and hotel expenses covered for selected speakers.

Tags: Cloud Native, Performance Optimization, Big Data, AI, Observability, Kubernetes, Scheduling
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
