
Large‑Scale Cost Optimization for Kubernetes/TKE: Data Collection, Measures, and Implementation

The article details a Tencent‑led, end‑to‑end cost‑optimization project for large‑scale Kubernetes/TKE clusters that collected extensive workload metrics and applied VPA/HPA enhancements, custom scheduling, and node downscaling via the open‑source Crane platform, ultimately delivering up to 70% CPU and 50% memory savings with zero‑fault deployments.

Tencent Cloud Developer

Author Tang Cong, a Tencent Cloud container technology expert, is the founder of the open‑source projects kstone and the internal prototype of crane, an active etcd contributor, and is responsible for large‑scale Kubernetes and etcd platform stability, performance optimization, cost reduction, and stateful service containerization.

Background

Since the second half of 2021, under the pressure of the COVID‑19 pandemic and new internet policies, major internet companies have been pursuing cost reduction and efficiency improvement. A core approach is optimizing computing‑resource costs. This article uses a Tencent internal Kubernetes/TKE business as a case study to describe a complete 0‑to‑1 cost‑optimization practice, covering data collection and analysis, optimization measures, industry status and solution selection, design and implementation, rollout, and summary. The practice achieved up to 70% CPU savings and 50% memory savings, with zero faults. The solution was first implemented in the internal prototype of the open‑source project Crane and later contributed to the public Crane project (https://github.com/gocrane/crane), which provides a one‑stop cloud‑native cost‑optimization solution.

Business Status

(1) Data Collection & Analysis

All workloads are containerized in a TKE cluster. After two to three years of rapid user growth, the cluster expanded to millions of pods, generating monthly bills in the tens of millions of RMB. To understand the cost drivers, the following dimensions were collected and analyzed:

Cost bills: total cost per product/module, monthly trends, and identification of cost‑heavy services and cloud resources.

Resource details of the main cost driver (CVM nodes): total CPU/Memory/Extended resources per region and cluster.

Node resource utilization rates (actual load).

Node resource allocation rates (Kubernetes request allocation).

Business pod component load.

Pod CPU/Memory allocation vs. actual usage, OOM statistics.

HPA effectiveness data (coverage, min/max replica settings, trigger status).

Business analysis (load characteristics, service types, workload types).

Key findings:

CVM resources account for ~80% of total cost, used by three major business lines.

Node CPU load averages ~5% with peaks around 15%; node allocation rate is ~55%, with uneven load distribution.

Significant gap between pod request and actual usage; some pods experience OOM without auto‑scaling.

HPA coverage is low and replica settings are unreasonable.

Workloads consist mainly of Deployments and custom Operators; both stateless and stateful services are present.

Optimization Measures

Based on the data, four improvement directions were defined:

Increase business pod resource utilization (introduce VPA, improve HPA coverage, use CronHPA for periodic workloads).

Improve node allocation rate (select optimal instance types, switch scheduler priority from LeastRequestedPriority to MostRequestedPriority, enlarge Pod CIDR).

Boost node load (admission webhook to adjust node resource requests, enable node‑level over‑commit, use dynamic scheduler and Descheduler).

Billing optimization (choose appropriate billing mode: spot, reserved, or pay‑as‑you‑go).
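On recent Kubernetes versions, the scheduler change in the second direction maps to the NodeResourcesFit plugin's MostAllocated scoring strategy, the successor to the legacy MostRequestedPriority. A minimal sketch, assuming a scheduler configured through the upstream plugin framework:

```yaml
# KubeSchedulerConfiguration enabling bin-packing scoring,
# equivalent in effect to the legacy MostRequestedPriority policy.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated     # prefer nodes with higher allocation
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
```

Packing pods onto fewer, fuller nodes is what makes the later node‑downscaling step possible.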

Industry Status & Solution Selection

VPA (Vertical Pod Autoscaler) and HPA (Horizontal Pod Autoscaler) are the core community solutions. Their architectures and limitations were examined:

VPA

Components: Metrics Server, History Storage (usually Prometheus), VPA Controller (Recommender + Updater), VPA Admission Controller.

Limitations: performance issues at large scale, lack of custom business metrics, need to create a VPA object per workload, eviction‑based update, limited algorithms, weak observability.
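The per‑workload object the community VPA requires (one of the limitations above) looks like this; a minimal sketch using the upstream autoscaling.k8s.io/v1 API, with illustrative names and bounds:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: php-apache-vpa        # one VPA object needed per workload
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  updatePolicy:
    updateMode: "Auto"        # recommendations applied by evicting pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
```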

HPA

APIs: autoscaling/v1 (CPU only), autoscaling/v2 (multiple metrics, custom metrics).

Metrics sources: metrics.k8s.io (metrics‑server), custom.metrics.k8s.io (e.g., prometheus‑adapter), external.metrics.k8s.io (external systems).

Scaling formula: DesiredReplicas = ceil[currentReplicas * (currentMetric / desiredMetric)]. For example, 4 replicas running at 80% CPU utilization against a 50% target yield ceil[4 * (80/50)] = 7 replicas.

Example HPA manifest (autoscaling/v2):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: 10k
```

HPA drawbacks: latency due to reactive nature, limited observability, no dry‑run support.

Google Autopilot

Google’s Autopilot combines vertical and horizontal scaling. It uses a Recommender (multiple algorithms), an Autopilot service to apply recommendations via Borgmaster API, and Borglet (similar to Kubelet). Autopilot demonstrates the effectiveness of combined VPA/HPA for reducing OOM and CPU overload.

Solution Design & Implementation

Goals:

Scalability: support multi‑business, per‑namespace component providers, pluggable portrait algorithms, diverse trigger and updater providers, flexible record storage.

Observability: predict scaling impact, expose VPA/HPA/EHPA metrics, track scaling latency, queue length, OOM counts, etc.

Stability & Efficiency: dry‑run → gray‑release → adaptive rate‑limiting, safe node drain, rolling updates.

Architecture consists of two main modules:

Portrait Module: collects real‑time and historical metrics (metrics‑server, Prometheus, ES), runs algorithms (exponential‑decay histogram, XGBoost, SMA), and generates Portrait CRs describing workload resource profiles.

KMetis Module: provides unified VPA + HPA/EHPA + node‑downscale and self‑healing services. Core APIs are CSetScaler (component‑set scaling policies) and NodeScaler (node‑downscale tasks).
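The CSetScaler API is internal and its schema is not published; the following is a purely illustrative sketch of what a component‑set scaling policy CR might look like (all field names and the API group are hypothetical):

```yaml
# Hypothetical CSetScaler CR -- every field name here is illustrative;
# the real internal schema is not published.
apiVersion: scaling.example.com/v1alpha1
kind: CSetScaler
metadata:
  name: business-a-frontend
spec:
  componentSelector:
    matchLabels:
      app: frontend
  vertical:
    enabled: true
    portraitAlgorithm: exponential-decay-histogram
    updateStrategy: InPlaceUpdate   # or Evict / RollingUpdate
  horizontal:
    minReplicas: 2
    maxReplicas: 20
    triggers:
    - type: cpu
      averageUtilization: 50
  dryRun: true                      # predict impact before acting
```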

Key workflow:

ComponentSetScaler defines VPA/HPA/EHPA strategies per business component.

NodeScaler handles safe node eviction, disables scheduling, expands replicas if needed, verifies readiness, and finally drains the node.

Scaling loop: evaluate triggers (overload, OOM, custom metrics), estimate optimal resources, apply updates via Evict, RollingUpdate, In‑Place Update, record events.

Affinity example:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - preference:
        matchExpressions:
        - key: level
          operator: In
          values:
          - small
        - key: x.y.z/component
          operator: In
          values:
          - normal
      weight: 10
```
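Safe node eviction of the kind NodeScaler performs generally has to respect application availability; in upstream Kubernetes this is expressed with a PodDisruptionBudget, which eviction‑based drain honors. A minimal sketch with illustrative names:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
spec:
  minAvailable: 80%        # eviction is blocked if it would drop below this
  selector:
    matchLabels:
      app: frontend
```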

Deployment & Results

The rollout followed four steps:

Dry‑run mode (no actual updates) to validate correctness and performance and to obtain predicted savings.

Gray‑release with adaptive rate‑limiting (max 20 concurrent scaling actions) to catch hidden issues.

Full‑scale rollout with custom scheduling (MostRequestedPriority) and dynamic scheduler/Descheduler to improve node allocation.

Safe node downscale using NodeScaler, node‑drain, and Descheduler.
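Assuming the upstream descheduler project, the rebalancing used in steps 3 and 4 can be configured roughly as follows with its LowNodeUtilization strategy (thresholds are illustrative):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # nodes below all of these count as underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:    # evict from nodes above these to rebalance
          cpu: 50
          memory: 50
          pods: 50
```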

Key outcomes:

CPU savings: up to 70% for business A, 45% for B, 50% for C.

Node allocation rate increased from ~50% to 99% for CPU and 88% for memory.

Average CPU utilization rose from 5% to 21.4%.

Zero‑fault deployment across a large‑scale TKE cluster.

Stability measures included NodeProblemDetectorPlus for kernel/Docker bugs, preStop scripts for graceful termination, and strict pod‑termination handling.
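The preStop scripts mentioned above hook into the standard container lifecycle; a minimal sketch of a graceful‑termination hook in a pod spec (image, command, and timing are illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: example/app:latest     # illustrative image
    lifecycle:
      preStop:
        exec:
          # give the load balancer time to deregister the pod and
          # let in-flight requests drain before SIGTERM is delivered
          command: ["/bin/sh", "-c", "sleep 10"]
```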

Future Work

Further improve CPU utilization by leveraging node‑level over‑commit and continue refining portrait algorithms.

References: the metrics.k8s.io, custom.metrics.k8s.io, and external.metrics.k8s.io metrics API groups.

Tags: cloud-native, resource management, autoscaling, cost optimization, Kubernetes, HPA, VPA