
Automated Elastic Scaling for Million‑Scale Core Services and Mixed Workloads on ByteDance's Private Cloud Platform

This article presents ByteDance's private cloud platform TCE architecture and explains how automated elastic scaling, dynamic over‑commit, and mixed‑workload deployment are used to improve resource utilization for millions of services, balancing online peak demand with offline batch tasks.

DataFunTalk

ByteDance's private cloud platform, TCE, runs almost all of the company's stateless services—including microservices, recommendation, and advertising—as Kubernetes deployments, spanning more than 40 Kubernetes clusters, hundreds of thousands of servers, and millions of pods.

Analysis of traffic patterns shows that daily utilization follows a stable cycle, yet services routinely request more resources than they actually use, leaving a gap between allocated and consumed capacity that widens during off‑peak periods.

To address this, ByteDance applies three complementary techniques: dynamic over‑commit based on weekly peak utilization, elastic scaling that adjusts replica counts according to real‑time CPU/memory/QPS thresholds, and mixed‑deployment (mix‑node) that reallocates idle online resources to offline jobs such as video transcoding and model training.
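The over‑commit idea can be sketched in a few lines: reserve only what a service's observed weekly peak (plus a safety margin) justifies, and treat the remainder as reclaimable. This is a minimal illustration assuming a simple peak‑plus‑margin policy; the function names, margin, and floor are hypothetical, not ByteDance's actual formula.

```python
def overcommit_ratio(weekly_peak_util: float, safety_margin: float = 0.1) -> float:
    """Fraction of the original request to actually reserve.

    weekly_peak_util: peak (used / requested) CPU over the past week, in [0, 1].
    safety_margin: extra headroom kept above the observed peak.
    """
    reserved = min(1.0, weekly_peak_util + safety_margin)
    return max(reserved, 0.1)  # never reserve less than 10% of the request


def reclaimable_cores(requested_cores: float, weekly_peak_util: float) -> float:
    """Cores that can be lent to offline jobs after over-commit."""
    return requested_cores * (1.0 - overcommit_ratio(weekly_peak_util))
```

For example, a service requesting 10 cores whose weekly peak utilization is 40% would keep a 50% reservation and free up 5 cores for offline work.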

The elastic scaling control loop monitors per‑replica utilization, compares it with configurable thresholds, and scales out or in while ensuring stability through fast recovery, cluster‑wide monitoring, and a quota system that guarantees resource limits during scaling.
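The control loop's scaling decision can be sketched as a proportional rule with a dead band: below/above configurable thresholds, choose a replica count that steers average utilization back to a target, clamped by quota‑backed limits. The thresholds, target, and field names below are assumptions for illustration, not TCE's actual configuration.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalePolicy:
    target_util: float = 0.5   # utilization the autoscaler steers toward
    high: float = 0.7          # scale out above this threshold
    low: float = 0.3           # scale in below this threshold
    min_replicas: int = 2      # stability floor
    max_replicas: int = 100    # quota-enforced ceiling


def desired_replicas(current: int, avg_util: float, p: ScalePolicy) -> int:
    """One iteration of the scaling decision for a service."""
    if p.low <= avg_util <= p.high:
        return current  # inside the dead band: do nothing
    # Proportional rule: keep total load constant while hitting target_util.
    proposed = math.ceil(current * avg_util / p.target_util)
    return max(p.min_replicas, min(p.max_replicas, proposed))
```

The dead band between the low and high thresholds prevents oscillation, while the min/max clamps reflect the quota system's guarantee that scaling cannot exceed a group's resource limits.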

Underlying infrastructure improvements include prioritized API server request handling, federation across multiple clusters for failover, a custom high‑performance monitoring stack (SysProbe, Metrics Agent, Proxy, Store) that provides sub‑second latency and supports multi‑dimensional metrics, and a bespoke quota CRD that tracks group‑level resource usage.
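The group‑level quota tracking can be illustrated with a small accounting sketch: every scale‑out must fit inside its group's limit, and releases return capacity to the pool. This is a hypothetical in‑memory model of the behavior, not the CRD's actual schema or controller logic.

```python
class GroupQuota:
    """Toy model of group-level quota accounting: admit or reject allocations."""

    def __init__(self, limit_cores: float):
        self.limit = limit_cores
        self.used = 0.0

    def try_allocate(self, cores: float) -> bool:
        """Admit the request only if it fits under the group's limit."""
        if self.used + cores > self.limit:
            return False
        self.used += cores
        return True

    def release(self, cores: float) -> None:
        """Return capacity when replicas are scaled in."""
        self.used = max(0.0, self.used - cores)
```

In the real system this check would be enforced by a controller reconciling the quota CRD, so concurrent scale‑outs across a group cannot collectively exceed the limit.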

Mixed‑deployment uses a node‑level water‑mark to identify under‑utilized nodes, marks them offline, and hands them to offline clusters; a state‑machine drives node transitions (Online ↔ Offline) with hooks to gracefully drain pods and minimize impact on long‑running offline tasks.
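The node transitions above can be sketched as a small state machine: a node below the water‑mark is drained and handed to the offline cluster, and pulled back when online capacity is needed. State names, the water‑mark value, and hook names are assumptions for illustration.

```python
from enum import Enum


class NodeState(Enum):
    ONLINE = "online"
    DRAINING = "draining"
    OFFLINE = "offline"


WATER_MARK = 0.3  # nodes below 30% utilization are candidates for reclaim


class Node:
    def __init__(self, name: str, util: float):
        self.name, self.util = name, util
        self.state = NodeState.ONLINE

    def evaluate(self) -> None:
        """Reclaim the node for offline jobs if it sits below the water-mark."""
        if self.state is NodeState.ONLINE and self.util < WATER_MARK:
            self.state = NodeState.DRAINING
            self.drain()                    # hook: gracefully evict online pods
            self.state = NodeState.OFFLINE  # hand the node to the offline cluster

    def return_to_online(self) -> None:
        """Hook for peak events: pull the node back for online traffic."""
        if self.state is NodeState.OFFLINE:
            self.state = NodeState.ONLINE

    def drain(self) -> None:
        # Placeholder: cordon the node and evict pods with a grace period,
        # minimizing disruption to long-running offline tasks.
        pass
```

Running the hooks inside explicit state transitions is what lets the system drain pods gracefully rather than killing offline tasks mid‑flight.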

For peak traffic events, a “shadow service” approach replicates online services in the offline cluster with zero replicas under normal conditions; during spikes, these shadows are scaled up to consume the reclaimed offline resources, providing rapid burst capacity.
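The shadow‑service pattern can be sketched as follows: a zero‑replica copy of the online service lives in the offline cluster, and a spike handler sizes it to fill the reclaimed capacity. The class and function names are hypothetical; a real implementation would patch the deployment's replica count through the Kubernetes API.

```python
class ShadowService:
    """A copy of an online service kept at zero replicas in steady state."""

    def __init__(self, name: str):
        self.name = name
        self.replicas = 0  # consumes nothing until a spike occurs

    def scale(self, n: int) -> None:
        self.replicas = n


def handle_traffic_spike(shadow: ShadowService,
                         reclaimed_cores: float,
                         cores_per_replica: float) -> None:
    """Scale the shadow up to consume the reclaimed offline capacity."""
    shadow.scale(int(reclaimed_cores // cores_per_replica))
```

For example, reclaiming 100 cores for a service whose replicas need 4 cores each yields 25 burst replicas, giving rapid extra capacity without permanently reserving it.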

Future work aims to decouple resource providers and consumers via a unified resource market and tiered quota system, enabling elastic resources to be treated like regular resources and improving cost accounting and auditability.

Tags: cloud native, Kubernetes, elastic scaling, resource utilization, mixed workloads, quota system
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
