
How Ctrip Scaled Its Cloud Platform to 10k Nodes: Real‑World Kubernetes Ops Lessons

This article shares Ctrip's practical experience in scaling a hybrid private‑cloud platform to over ten thousand nodes, covering Kubernetes control‑plane stability, host monitoring, network observability, image management, and capacity planning to ensure high availability for massive online services.


1. Ctrip Container Cloud Overview

Ctrip's cloud platform follows a hybrid private‑cloud architecture with a dual‑active setup in Shanghai and a test IDC in Nantong. The production environment now runs over 10,000 nodes and more than 200,000 pods.

Cloud platform architecture:

Application layer: online services, DB, Kafka, ES.

PaaS platform: CI/CD, Web Console, HPA, Serverless and other cloud‑native capabilities.

Image management: private Harbor for on‑prem, ECR (Amazon) and ACR (Alibaba) for public clouds, with synchronization and overseas acceleration.

K8s clusters: multiple clusters per IDC managed by a Meta cluster, with cross‑IDC dedicated lines.

IaaS: kernel, CNI, CSI, CRI and other infrastructure services.

Logging and monitoring are integrated across all layers.

Basic components

Production K8s version 1.19; migrating from Kubefed to Karmada for multi‑cluster management.

Network virtualization evolved from OpenStack + Neutron to Cilium + BGP (60% completed, target 80% by 2021).

Kernel upgraded from 4.14 → 4.19 → 5.10; 5.10 accounts for 10% of nodes.

2. Kubernetes Control‑Plane Stability

The stability of the control plane directly impacts application availability. As the platform grew, maintaining cluster stability became more challenging.

Case study

One night an etcd leader performed a defrag, causing leader election churn, which triggered massive restarts of components such as the Cilium agent, leading to OOM on the host and a full cluster outage.

etcd repeatedly switched leaders.

apiserver kept restarting.

Control‑plane host OOM.

Cluster avalanche.

Analysis

Defrag caused heavy block I/O, making etcd unavailable. Clients lost their leases, and components reloaded massive amounts of metadata (e.g., Cilium pulling >1,000 endpoints), exhausting memory on the host shared by etcd and the apiserver and triggering the OOM.

Improvements

1. Monitoring & alerting

Integrate with a centralized alerting platform, define priority levels (P0 for etcd leader churn, P1 for single‑node core‑component failures), and use tools such as Kuberhealthy for synthetic health checks.

Establish SOPs for incident response to ensure rapid recovery.
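As one concrete example of the alerting side, a P0 rule for etcd leader churn could be expressed as a Prometheus alerting rule on the standard `etcd_server_leader_changes_seen_total` metric. The threshold, group name, and labels below are illustrative, not Ctrip's actual configuration:

```yaml
groups:
  - name: etcd-control-plane
    rules:
      - alert: EtcdLeaderChurn
        # More than two leader changes in 15 minutes indicates election churn.
        expr: increase(etcd_server_leader_changes_seen_total[15m]) > 2
        labels:
          priority: P0
        annotations:
          summary: "etcd leader churning on {{ $labels.instance }}"
```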

2. Deployment architecture

Isolate core components to reduce mutual impact.

Run etcd on high‑performance SSD/NVMe disks and perform per‑node I/O benchmarking.

Split apiserver traffic by major request sources (e.g., Cilium agent, kubelet) and deploy multiple ingress points.

3. Application refactoring

Avoid over‑populating a single namespace to enable namespace indexing.

Reduce excessive List calls by using client‑go's reflector/informer pattern: list once to seed a local cache, then keep it fresh with watch events.

Limit CRD usage to lower control‑plane pressure.
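The list-then-watch caching that this refactoring relies on can be sketched in a few lines. This is a minimal, language-agnostic illustration of the reflector idea, not the real Kubernetes API (event and object shapes are simplified):

```python
class ResourceCache:
    """Reflector-style cache: one initial List seeds the local copy,
    then incremental Watch events keep it fresh -- no repeated full Lists."""

    def __init__(self, initial_list, resource_version):
        # One expensive List call populates the cache.
        self.items = {obj["name"]: obj for obj in initial_list}
        self.resource_version = resource_version

    def apply_event(self, event):
        # Watch events are cheap deltas instead of full re-Lists.
        kind, obj = event["type"], event["object"]
        if kind in ("ADDED", "MODIFIED"):
            self.items[obj["name"]] = obj
        elif kind == "DELETED":
            self.items.pop(obj["name"], None)
        self.resource_version = obj["resourceVersion"]


cache = ResourceCache([{"name": "ep-1", "resourceVersion": "100"}], "100")
cache.apply_event({"type": "ADDED",
                   "object": {"name": "ep-2", "resourceVersion": "101"}})
cache.apply_event({"type": "DELETED",
                   "object": {"name": "ep-1", "resourceVersion": "102"}})
```

The `resourceVersion` bookkeeping is what lets a client resume a watch after a disconnect instead of re-listing everything.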

apiserver tuning

Monitor node IO/CPU/memory; increase max‑requests‑inflight and max‑mutating‑requests‑inflight for large clusters.

Enable APIPriorityAndFairness and the GOAWAY mechanism (--goaway-chance) to balance traffic across apiserver instances.
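These knobs correspond to real kube-apiserver flags; the values below are illustrative only and must be sized per cluster:

```shell
# Illustrative kube-apiserver settings (not Ctrip's actual values):
kube-apiserver \
  --max-requests-inflight=3000 \
  --max-mutating-requests-inflight=1000 \
  --feature-gates=APIPriorityAndFairness=true \
  --goaway-chance=0.001
# --goaway-chance makes the server occasionally send an HTTP/2 GOAWAY frame,
# so long-lived client connections re-balance across apiserver instances
# behind the load balancer.
```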

etcd tuning

Upgrade to etcd 3.4.9+ to avoid deadlocks, data inconsistency, and memory leaks.

Disable defrag; use periodic compaction instead to keep the DB size stable.

Increase heartbeat and election timeouts.

Deploy learner nodes in another IDC for disaster recovery.

Use etcd‑operator to back up to S3‑compatible storage.
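The tuning items above map onto standard etcd flags and the learner feature introduced in etcd 3.4. The numbers here are examples for illustration, not Ctrip's actual configuration:

```shell
# Illustrative etcd settings (values are examples only):
etcd \
  --heartbeat-interval=300 \
  --election-timeout=3000 \
  --auto-compaction-mode=periodic \
  --auto-compaction-retention=1h \
  --quota-backend-bytes=8589934592
# Heartbeat raised above the 100 ms default for cross-rack latency; the
# election timeout is kept roughly 10x the heartbeat. Periodic auto-compaction
# replaces manual defrag; the backend quota here is 8 GiB.

# Add a learner member in the DR IDC (no quorum vote, promotable later;
# the member name and URL are invented placeholders):
etcdctl member add dr-node --learner --peer-urls=https://dr-node.example:2380
```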

After these changes, leader switches became rare and stability improved.

3. Host Inspection

Host stability directly affects workloads. Fast detection and remediation of faulty nodes are essential.

Two main host issues:

Faulty hosts: hardware failures or component crashes (docker, cilium, kubelet).

Hotspot hosts: CPU overselling leading to high load and resource contention.

To detect issues early, Ctrip deployed node-problem-detector (NPD) as a DaemonSet; it collects OOM, deadlock, and custom health signals and exports them via Prometheus for alerting.
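A minimal sketch of the kind of check such a probe runs: scan kernel log lines for OOM kills and produce counts suitable for export as a metric. The log lines and pattern below are simplified for illustration:

```python
import re

# Kernel OOM-kill messages look like:
#   "Out of memory: Killed process 1234 (java) ..."
OOM_PATTERN = re.compile(r"Out of memory: Killed process (\d+) \((\S+)\)")


def count_oom_kills(kmsg_lines):
    """Return {process_name: kill_count} parsed from kernel log lines."""
    kills = {}
    for line in kmsg_lines:
        m = OOM_PATTERN.search(line)
        if m:
            name = m.group(2)
            kills[name] = kills.get(name, 0) + 1
    return kills


log = [
    "kernel: Out of memory: Killed process 1234 (java)",
    "kernel: Out of memory: Killed process 5678 (cilium-agent)",
    "kernel: systemd started session",
]
```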

Hotspot host mitigation

Scheduling policy: combine real‑load‑aware and consolidation strategies to balance load while reducing fragmentation.

Re‑scheduling: evict pods from overloaded hosts during high CPU/LOAD periods (suitable for stateless services).

Monitoring & alerts: real‑time detection and automated remediation workflows.
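The re-scheduling policy can be pictured as a simple greedy eviction pass: on hosts above the load target, evict the largest stateless pods first until the host is back under target. The data shapes below are invented for illustration, not the Kubernetes eviction API:

```python
def pick_evictions(node_cpu_used, cpu_target, pods):
    """pods: list of (name, cpu_used, stateless) tuples.
    Returns the names of pods to evict, biggest stateless consumers first."""
    evict = []
    candidates = sorted(
        (p for p in pods if p[2]),         # stateless pods only
        key=lambda p: p[1], reverse=True,  # biggest CPU consumers first
    )
    for name, cpu, _ in candidates:
        if node_cpu_used <= cpu_target:
            break
        evict.append(name)
        node_cpu_used -= cpu
    return evict
```

For example, a host at 90% CPU with a 70% target would shed its two largest stateless pods and leave the stateful one alone.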

Abnormal pod inspection

Using kube‑state‑metrics, Ctrip aggregates pod‑level alerts (e.g., Pending, Terminating, image‑pull errors) by namespace.
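The aggregation on top of kube-state-metrics amounts to rolling abnormal pod states up to per-namespace counts. A sketch, with the set of "abnormal" state names chosen here only as examples:

```python
from collections import Counter

# Example abnormal states; real deployments would match the state/reason
# labels exposed by kube-state-metrics.
ABNORMAL = {"Pending", "Terminating", "ImagePullBackOff"}


def abnormal_by_namespace(pods):
    """pods: iterable of (namespace, state) pairs.
    Returns {(namespace, state): count} for abnormal states only."""
    counts = Counter()
    for ns, state in pods:
        if state in ABNORMAL:
            counts[(ns, state)] += 1
    return dict(counts)
```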

4. Network Observability

Multi‑dimensional monitoring covers top‑of‑rack switches, hosts, and individual instances.

Host network fine‑grained monitoring

Metrics include NIC errors, link status, kernel packet loss, and TCP state aggregation.
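One of these fine-grained signals, NIC error counters, can be read straight from `/proc/net/dev` on Linux. A sketch of the parsing (the column layout follows the standard procfs format; the sample text is invented):

```python
def nic_errors(proc_net_dev_text):
    """Return {iface: (rx_errs, tx_errs)} from /proc/net/dev content.
    Each data line has 8 receive columns then 8 transmit columns:
    bytes packets errs drop fifo frame compressed multicast | bytes packets errs ..."""
    result = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # skip the 2 header lines
        iface, data = line.split(":", 1)
        fields = data.split()
        result[iface.strip()] = (int(fields[2]), int(fields[10]))
    return result


sample = (
    "Inter-|   Receive                | Transmit\n"
    " face |bytes packets errs ...    | bytes packets errs ...\n"
    "  eth0: 100 10 2 0 0 0 0 0 200 20 3 0 0 0 0 0\n"
)
```

In production the same counters would be scraped periodically and exported as deltas, so a rising error rate on one NIC stands out immediately.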

Co‑host instance aggregation

Aggregated metrics across all instances on the same host help identify whether slow API responses stem from host network issues.

Cilium TCP flow collection

Cilium captures high‑volume TCP flow data (source/destination IP/port), useful for security audits and identifying abnormal access patterns.

5. Image Management

Private‑cloud images are stored in Harbor backed by Ceph. Daily new images number in the tens of thousands, stressing GC and pull performance.

Ceph performance boost

Upgrade the kernel and Ceph to use BlueStore, reducing latency by 20‑30%.

Enable io_uring on kernel 5.x, increasing throughput by 20‑30%.

Image layering strategy

Images are built in layers: OS, debugging tools, business base (Java/Redis/MySQL), component versions, application code, and optional custom layers. This reduces pull size to only the topmost layer during deployments.
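The layering can be pictured as a chain of base images, each built and pushed separately, with the application image sitting on top; a normal release then changes only the top layers. All image names below are invented placeholders:

```dockerfile
# Base chain (each link built and pushed separately):
#   base/os:7  ->  base/os-tools:7  ->  base/java8:stable
# The application image builds only on the last link:
FROM harbor.internal.example/base/java8:stable

# Top layers: application artifact and entrypoint only, so a routine
# deployment pulls just these layers, not the OS or JDK beneath them.
COPY app.jar /opt/app/app.jar
CMD ["java", "-jar", "/opt/app/app.jar"]
```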

Harbor HA & mirroring

Each IDC runs a dual‑active Harbor cluster; writes go to the primary, which syncs to a standby.

Both clusters share the same DB for seamless failover.

Registry‑mirror caches images locally, reducing Ceph load during large‑scale rollouts.

Provides near‑site access for faster pulls.
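One common way to implement such a cache — used here purely as an illustration, since the article does not name Ctrip's mirror implementation — is the open-source Distribution registry in pull-through proxy mode:

```yaml
# config.yml for a Distribution ("docker registry") pull-through cache;
# the upstream URL is an invented placeholder.
version: 0.1
proxy:
  remoteurl: https://harbor.idc1.example.com
storage:
  filesystem:
    rootdirectory: /var/lib/registry
```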

Harbor sharding & cleanup

To address GC bottlenecks, Harbor sharding separates application images from base images into different clusters. When an application cluster reaches a size threshold, older versions are pruned, keeping only the current and a few historical releases.
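The pruning policy described above — keep the currently deployed tag plus a few recent historical releases — reduces to a small selection function. A sketch with invented data shapes:

```python
def tags_to_delete(tags, current, keep_history=3):
    """tags: list of (tag, build_time) tuples.
    Keeps the current tag plus the keep_history most recent others;
    returns the tags to remove."""
    history = sorted(
        (t for t in tags if t[0] != current),  # never delete the live tag
        key=lambda t: t[1], reverse=True,      # newest first
    )
    return [tag for tag, _ in history[keep_history:]]
```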

6. Capacity Management

Capacity planning uses kube‑state‑metrics. Nodes across IDC clusters form a shared cache pool; under‑provisioned clusters can borrow nodes from this pool.

Holiday peak scaling

Before holidays, traffic peaks are forecasted from historical order data, followed by N‑fold full‑stack load testing. Resources are pre‑purchased for peak demand, and excess capacity is later absorbed by normal growth or migrated to public cloud for off‑loading.
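The sizing step is back-of-envelope arithmetic: scale the historical peak by a forecast factor, add headroom, and convert pod demand into nodes to pre-purchase. All numbers and the headroom factor below are invented for illustration:

```python
import math


def nodes_to_prebuy(historical_peak_pods, growth_factor, pods_per_node,
                    current_nodes, headroom=1.2):
    """Nodes to pre-purchase ahead of a holiday peak (illustrative model)."""
    peak_pods = historical_peak_pods * growth_factor
    needed = math.ceil(peak_pods * headroom / pods_per_node)
    return max(0, needed - current_nodes)
```

For instance, a historical peak of 200,000 pods forecast to grow 1.5x, at 30 pods per node with 20% headroom, calls for 12,000 nodes — 2,000 more than a 10,000-node fleet.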

Tags: monitoring, performance optimization, Kubernetes, cloud operations, container images, network observability
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
