Elastic Scaling Practices in Cloud‑Native Kubernetes Environments
To overcome the limits of the native HPA and business-specific constraints in a fully containerized, cloud-native Kubernetes environment, we implemented a dual-threshold water-level and scheduled scaling engine, a hybrid-cloud ClusterAutoScale, mixed-deployment resource prioritization, and comprehensive Prometheus-based observability. Together these delivered higher utilization and lower costs, and set a roadmap toward deeper optimization and AIOps.
After the organization completed full-network containerization, front-line developers ran into a series of operational issues around timing, capacity, efficiency, and cost. Elastic scaling therefore became an inevitable technology choice in a cloud-native containerized environment.
Problems with native HPA
Initial attempts to use the native Horizontal Pod Autoscaler (HPA) revealed many limitations: no support for custom metrics, no scheduled scaling, reliance on resources.requests, and a single-goroutine execution model. Business-specific constraints, such as non-interruptible job instances and downstream database availability, further complicated its use.
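The reliance on resources.requests is worth spelling out. The HPA's documented replica calculation scales on utilization relative to the declared request, so a missing or badly tuned request skews every decision. A minimal sketch of that formula (the 10% tolerance is the Kubernetes default):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         desired_metric: float, tolerance: float = 0.1) -> int:
    """Native HPA calculation per the Kubernetes docs:
    desiredReplicas = ceil(currentReplicas * currentMetric / desiredMetric).
    currentMetric is utilization measured against resources.requests,
    which is why an inaccurate request distorts the whole result.
    """
    ratio = current_metric / desired_metric
    # The HPA skips scaling when the ratio is within the tolerance band.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 replicas at 90% utilization against a 60% target -> scale out to 6.
print(hpa_desired_replicas(4, 90.0, 60.0))  # 6
```

Note also that the formula has no notion of a schedule or a lower bound from business rules, which is exactly the gap the sections below address.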
Business‑driven elastic capability
We built an elastic mechanism based on actual instance water‑level and effective load, featuring:
High‑low dual‑threshold control to bound stability for fluctuating workloads.
Ceiling‑based scaling (ceil) for expansion and floor‑based scaling (floor) for contraction.
Data denoising to exclude non‑ready instances, strong business‑relationship instances, and metric gaps.
Performance enhancement through namespace‑level listening and concurrency control.
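The first three points above can be sketched together. The function below is a simplified illustration with hypothetical names, not the production engine: it drops missing samples before averaging (denoising), expands with ceil when the water level breaches the high threshold, and contracts with floor only below the low threshold, so workloads fluctuating inside the band stay stable.

```python
import math

def scale_decision(replicas: int, samples: list, low: float,
                   high: float, target: float) -> int:
    """Dual-threshold water-level scaling sketch (hypothetical names).
    samples: per-instance load readings; None marks a metric gap or a
    non-ready instance that must be excluded before deciding."""
    clean = [s for s in samples if s is not None]  # denoise
    if not clean:
        return replicas  # no usable data: never act blind
    level = sum(clean) / len(clean)
    if level > high:
        # Expansion rounds up (ceil) to resolve overload quickly.
        return math.ceil(replicas * level / target)
    if level < low:
        # Contraction rounds down (floor) but never below one replica.
        return max(1, math.floor(replicas * level / target))
    return replicas  # inside the band: hold steady

# Overloaded pool with one metric gap -> expand from 10 to 15.
print(scale_decision(10, [0.9, 0.85, None], low=0.3, high=0.8, target=0.6))
```

The asymmetry (ceil out, floor in) biases the system toward availability: over-provisioning briefly is cheaper than an overload incident.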
Fusion of water‑level and timed elasticity
The solution merges water‑level thresholds with scheduled scaling, ensuring that expansion chooses the larger of the two triggers while contraction never falls below the scheduled replica count.
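The merge rule reduces to a single max: expansion takes the larger of the two triggers, and because the scheduled count participates in that max, contraction can never drop below it. A minimal sketch:

```python
def fused_replicas(water_level_target: int, scheduled_floor: int) -> int:
    """Fuse water-level and scheduled elasticity (sketch):
    the bigger trigger wins on expansion, and the scheduled
    replica count acts as a hard floor on contraction."""
    return max(water_level_target, scheduled_floor)

print(fused_replicas(12, 8))  # water level wants more -> 12
print(fused_replicas(3, 8))   # contraction bottoms out at the schedule -> 8
```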
Hybrid‑cloud ClusterAutoScale
To address the growing resource pool in a hybrid‑cloud scenario, we designed a ClusterAutoScale that integrates:
Image‑as‑a‑service.
CloudProvider adapters for private‑cloud APIs.
Node initialization and reclamation workflows.
Two trigger strategies are used: unschedulable pod events and resource‑pool water‑level thresholds. Additional challenges solved include private‑cloud capacity assessment, pod CIDR routing, and gray‑scale resource reclamation.
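The two trigger strategies compose with a simple OR: any unschedulable pod event fires immediately, and otherwise the pool water level is checked against its threshold. A hedged sketch with hypothetical parameter names:

```python
def cluster_autoscale_trigger(unschedulable_pods: int, pool_used: float,
                              pool_capacity: float,
                              threshold: float = 0.8) -> bool:
    """Sketch of the two ClusterAutoScale triggers (names are illustrative):
    1. event-driven: any pod the scheduler cannot place fires a scale-out;
    2. level-driven: the resource-pool water level crossing the threshold
       fires a scale-out before pods ever become unschedulable."""
    if unschedulable_pods > 0:
        return True  # reactive path: pending pods need capacity now
    return pool_used / pool_capacity >= threshold  # proactive path

print(cluster_autoscale_trigger(0, 85.0, 100.0))  # True
```

Combining the two matters: the event trigger alone reacts only after scheduling has already failed, while the water-level trigger buys node-initialization lead time.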
Operational considerations
When deploying in production, attention must be paid to business‑pool capacity, instance volatility standards, health‑probe vs. readiness‑probe differences, metric thresholds, rule inspections, minimal filtering, and independence from external platforms.
Middleware containerization and mixed deployment
We merged Redis and Flink resource pools for time‑shared reuse, eliminating resource fragmentation, reducing cross‑cluster data aggregation, and simplifying operations.
Mixed‑deployment strategy
The overall approach abstracts factors into three layers: application tiering, mixed‑deployment scheduling, and resource QoS. Key directions include:
Application tier labeling (S1‑S4) stored in CMDB and reflected as K8s priority labels.
Resource pool prioritization for critical services while dispersing lower‑priority workloads.
Request recommendation using VPA histogram percentile (P95) multiplied by a water‑level factor, combined with elasticity and health‑state machines.
Load scheduling based on ideal‑value weighting and bin‑packing algorithms, filtering high‑water‑level nodes and predicting future node water‑levels.
Resource dispersion strategies (host‑level, zone‑level, MDU) to maximize distribution.
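The request-recommendation direction above can be illustrated concretely. The sketch below is a simplified stand-in for the VPA-style histogram logic (parameter names are hypothetical): take the P95 of observed usage and multiply by a water-level factor to leave headroom for the elasticity and health state machines.

```python
def recommend_request(usage_samples: list, percentile: float = 0.95,
                      water_level_factor: float = 1.2) -> float:
    """Request recommendation sketch: P95 of observed usage times a
    water-level factor. A real VPA recommender uses a decaying
    histogram; a plain sorted-sample percentile suffices to show
    the idea."""
    ordered = sorted(usage_samples)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] * water_level_factor
```

Recommending near-P95 rather than peak usage is what frees the capacity that mixed deployment then reclaims; the water-level factor is the safety margin that keeps QoS-sensitive tiers healthy.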
Results and challenges
Resource utilization improved markedly and cost bills decreased year‑over‑year. However, larger physical‑machine failure radii introduced new stability concerns and increased difficulty in root‑cause analysis.
K8s observability and stability
We built a Prometheus-based monitoring platform comprising Thanos, Vertex Exporter, SentryD, CheckD, and an alerting system. Additional components include:
Event persistence with full‑resource list‑watch collection.
Log aggregation via Kafka.
Trace analysis with ID‑based querying, tag filtering, and topology inspection.
A stability dashboard tracking native component health, cluster capacity water‑level, resource load, abnormal instances, and cloud‑platform availability.
Future roadmap
The plan focuses on four areas: deeper mixed‑deployment and optimization, containerizing data stores (databases, NoSQL), exploring serverless scenarios for algorithmic and job workloads, and leveraging AIOps with time‑series prediction for proactive fault detection.
HelloTech
Official Hello technology account, sharing tech insights and developments.