
How We Scaled 10,000+ K8s CronJobs with Serverless and Solved Node Instability

This article describes the challenges encountered when migrating tens of thousands of Kubernetes cronjobs from VMs to a cluster—node instability, low resource utilization, and scheduling delays—and explains how introducing a serverless architecture with virtual nodes, a custom job scheduler, unified logging and monitoring, and sandbox reuse restored stability, improved performance, and reduced resource costs by about 70%.

Zuoyebang Tech Team

Background Introduction

During the cloud‑native container migration at Zuoyebang, scheduled tasks originally running on virtual machines were moved to Kubernetes CronJobs. The system performed well with fewer than 1,000 CronJobs, but problems emerged as the scale grew to tens of thousands.

Problem Discovery

Two main issues were identified: (1) node instability within the cluster, and (2) low resource utilization.

Issue 1: Node Stability

Frequent minute‑level tasks caused rapid pod creation and destruction, resulting in hundreds of containers being created and removed per minute on a single node. This led to excessive cgroup entries, especially memory cgroups that were not reclaimed promptly. The kubelet’s periodic reads of `/sys/fs/cgroup/memory/memory.stat` slowed down, increasing CPU time in kernel mode and causing noticeable network latency.
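A quick way to check whether a node is accumulating stale memory cgroups is to count the directories under the memory controller's hierarchy, since each subdirectory is one cgroup. A minimal sketch, assuming a cgroup‑v1 host (the helper name is ours):

```python
import os

def count_memory_cgroups(root="/sys/fs/cgroup/memory"):
    """Count memory cgroup directories under the v1 hierarchy.

    Each subdirectory is one memory cgroup; on a healthy node this is
    typically in the hundreds, not the tens of thousands seen here.
    """
    total = 0
    for _dirpath, dirnames, _filenames in os.walk(root):
        total += len(dirnames)
    return total

if __name__ == "__main__":
    print(count_memory_cgroups())
```

A count that keeps climbing while containers are being destroyed is the signature of the delayed-reclamation problem described below.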

Performance profiling (`perf record cat /sys/fs/cgroup/memory/memory.stat` followed by `perf report`) showed that most CPU time was spent in `memcg_stat_show`. In cgroup v1, `memcg_stat_show` traverses the memory cgroup tree many times per CPU core; with millions of memory cgroup entries, this became disastrous.

Memory cgroups are not released immediately after container termination because the kernel must walk all cached pages, which can be slow. This delayed reclamation strategy works for typical workloads but fails when a single machine creates and destroys hundreds of containers per minute, leading to tens of thousands of memory cgroup entries and seconds‑long reads of `memory.stat`. This in turn caused high dockerd load, slow kubelet PLEG relisting, and nodes flapping to NotReady.

Issue 2: Resource Utilization

Under the CNI network mode, a large share of pod capacity had to be reserved for cronjob pods, yet many of these pods run for only a few seconds and consume minimal resources, leaving substantial capacity idle.

Other Issues: Scheduling Speed and Service Isolation

At peak times (e.g., midnight), thousands of jobs need to start simultaneously. The default Kubernetes scheduler processes pod placement serially, taking minutes to schedule all jobs, which is unacceptable for workloads requiring sub‑second precision. Additionally, CPU‑ or I/O‑intensive pods can interfere with normal services due to incomplete cgroup isolation.

Using Serverless in the K8s Cluster

To achieve stronger isolation, finer‑grained node control, and faster scheduling for cronjob workloads, a serverless solution was adopted. Virtual nodes were introduced, allowing pods to run on serverless nodes with the same security isolation and network connectivity as regular nodes, but without reserved resources and with pay‑per‑use billing.
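Steering a CronJob's pods onto virtual nodes is typically a matter of a node selector plus a toleration for the virtual node's taint. A hedged sketch (the exact label and taint keys depend on the cloud provider's virtual-kubelet implementation; the ones below follow the open-source virtual-kubelet convention, and the job name and image are made up):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report-job                       # hypothetical job name
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            type: virtual-kubelet        # provider-specific label
          tolerations:
            - key: virtual-kubelet.io/provider
              operator: Exists
              effect: NoSchedule
          containers:
            - name: report
              image: example.com/report:latest   # hypothetical image
          restartPolicy: Never
```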

Job Scheduler

All cronjob workloads now use a custom job scheduler that dispatches pods to serverless nodes in parallel, achieving millisecond‑level scheduling and falling back to regular nodes if serverless resources are insufficient.
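The default scheduler binds pods one at a time; a job-specific scheduler can afford to bind in parallel because cronjob pods carry no complex affinity constraints. A minimal sketch of the dispatch loop, assuming a `bind(pod, node)` call that returns success or failure (in the real scheduler this would be the API server's binding call) and hypothetical node lists:

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch(pods, bind, serverless_nodes, regular_nodes):
    """Bind pods to serverless nodes in parallel, falling back to
    regular nodes when every serverless bind fails (e.g. no capacity)."""
    def place(pod):
        for node in serverless_nodes:
            if bind(pod, node):
                return pod, node
        for node in regular_nodes:        # fallback path
            if bind(pod, node):
                return pod, node
        return pod, None                  # nothing accepted the pod

    # Parallel binds are what turn minutes of serial scheduling
    # into millisecond-level dispatch at midnight peaks.
    with ThreadPoolExecutor(max_workers=64) as pool:
        return list(pool.map(place, pods))
```

The fallback ordering mirrors the policy above: serverless first, regular nodes only when serverless resources are insufficient.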

Bridging Differences Between Serverless and Regular Pods

Key integration points include:

Unified Log Collection: Since virtual nodes cannot run DaemonSets, a custom log consumer aggregates logs from various cloud‑provider log services, normalizes them, and forwards them to a shared Kafka cluster.

Unified Monitoring and Alerting: Serverless pods expose the same Prometheus metrics (CPU, memory, disk, network) as regular pods, ensuring consistent observability.
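The log path above can be pictured as a small normalizer: each provider's log service delivers records in its own shape, and the consumer maps them onto one schema before producing to the shared Kafka cluster. A sketch with made-up provider names and field mappings (real provider schemas differ):

```python
import json

# Hypothetical field mappings from two cloud log services to one schema.
FIELD_MAPS = {
    "provider_a": {"ts": "timestamp", "msg": "message", "pod": "pod_name"},
    "provider_b": {"time": "timestamp", "log": "message", "k8s_pod": "pod_name"},
}

def normalize(provider, record):
    """Rename provider-specific fields to the unified schema."""
    mapping = FIELD_MAPS[provider]
    return {unified: record[raw] for raw, unified in mapping.items()}

def to_kafka_payload(provider, record):
    """Serialize a normalized record; in production this would be
    handed to a Kafka producer for the shared cluster."""
    return json.dumps(normalize(provider, record), sort_keys=True)
```

Downstream consumers then see one record shape regardless of which cloud's log service the pod ran behind.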

Improving Startup Performance

Serverless jobs require second‑level startup to meet strict timing constraints. The main latency sources are sandbox creation/initialization and pulling the business image. Reusing sandboxes across identical workloads means the first launch still pays the full cost, but subsequent launches start almost instantly.
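Sandbox reuse behaves like a keyed cache: the first request for a given workload key (say, image plus resource spec) pays the creation cost, and later identical requests skip it. A sketch of the idea, with `create_sandbox` standing in for the far more involved real provisioning step:

```python
class SandboxPool:
    """Cache warm sandboxes by workload key (e.g. image + resource spec)."""

    def __init__(self, create_sandbox):
        self._create = create_sandbox
        self._warm = {}

    def acquire(self, key):
        """Return (sandbox, reused): a warm sandbox if one exists,
        otherwise create one (the slow first-launch path)."""
        if key in self._warm:
            return self._warm.pop(key), True
        return self._create(key), False

    def release(self, key, sandbox):
        """Keep the finished sandbox warm for the next identical launch."""
        self._warm[key] = sandbox
```

A production pool would also need TTL-based eviction and capacity limits so idle sandboxes do not accumulate, but the cache-hit path is what delivers the near-instant repeat launches.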

Conclusion

Through a custom job scheduler, isolation of serverless pods from regular pods, and performance optimizations for serverless pod startup, the migration to serverless was transparent to developers. The approach eliminated the need to reserve resources for cronjobs, freeing roughly 10% of cluster capacity (tens of thousands of pods) and cutting cronjob resource costs by about 70%, while also resolving node instability caused by excessive pod churn.

Cloud Native · Serverless · Kubernetes · Resource Optimization · CronJob