Design Principles and Implementation Details of Kubernetes Horizontal Pod Autoscaler and Custom Water Pod Autoscaler
The article explains Kubernetes’ built‑in Horizontal Pod Autoscaler, then details the custom Water Pod Autoscaler (WPA) that extends HPA with dual‑signal (load and SOA registration) detection, dual‑threshold scaling, noise filtering, configurable cooldown, frequency limits, tolerance buffers, and integrated alerting for reliable elastic scaling.
What is Elastic Scaling
Elastic scaling in Kubernetes is provided by the Horizontal Pod Autoscaler (HPA), a built‑in controller that monitors pod load and automatically adjusts the number of service pods toward the replica count computed by the scaling algorithm. It has become a standard feature of the major cloud providers.
The scaling loop consists of three main components:
Metrics Server – collects runtime load metrics from the pods of Deployments/RCs and exposes them to the HPA.
RC/Deployment – the resources that HPA can scale (any resource exposing a Scale sub‑resource, such as Deployment, StatefulSet, RC).
HPA Controller – the brain that fetches metrics, applies the scaling algorithm, and updates the replica count.
Application Scenarios
Elastic scaling resolves the tension between capacity planning and sudden load spikes. A typical example is a hot search on a social platform: traffic surges require rapid scaling up, and after the peak the system must scale down to reduce cost.
Advantages
Automatic – the controller handles scaling without human intervention.
Stable – scaling up adds instances to sustain high load, keeping the service stable.
Cost‑effective – scaling down releases unused instances, reducing waste.
Cloud‑Native Platform Implementation Details
The native HPA is simple to deploy but cannot directly serve the specific business needs of the company because it does not understand SOA registration states. To address this, a custom Water Pod Autoscaler (WPA) was built, extending HPA with dual‑signal detection (load + SOA registration).
WPA is implemented as a CRD. It gathers metrics from the Metrics Server and SOA registration information from the hahas platform, aggregates them, and computes the desired replica count.
Core Algorithm
To avoid frequent scaling, WPA uses a dual‑threshold (upper and lower) instead of a single line. The upper threshold triggers scaling up, the lower threshold triggers scaling down.
Scale‑Up Algorithm (average mode)
In average mode the per‑pod load is multiplied back up by the replica count, so averaged = n; in absolute mode, averaged = 1. The proposal is upScaleProposal = ceil(averaged × load / highWatermark). Example: 5 current replicas, average load 1500m, upper threshold 1200m → upScaleProposal = ceil(5 × 1500 / 1200) = ceil(6.25) = 7, so WPA adds 2 replicas.
Scale‑Down Algorithm (average mode)
Example: 7 replicas, average load 300m, min threshold 400m → downScaleProposal = floor(7 × 300 / 400) = floor(5.25) = 5, so WPA removes 2 replicas. The algorithm uses ceil for scale‑up and floor for scale‑down to maximize responsiveness in both directions.
Noise Handling
Two main noise sources are:
Pods in Starting or Stopping states inflate the count.
New pods without metrics appear as empty values.
WPA filters out non‑running pods and handles missing metrics. The relevant code is preserved below:

// Inside the per‑pod filtering loop:
// Pods that are being deleted or have failed are ignored entirely.
if pod.DeletionTimestamp != nil || pod.Status.Phase == corev1.PodFailed {
    ignoredPods.Insert(pod.Name)
    continue
}
// Pending pods are counted as not yet ready.
if pod.Status.Phase == corev1.PodPending {
    unReadyPods.Insert(pod.Name)
    continue
}
// A pod with no Ready condition or no recorded start time is not ready.
if condition == nil || pod.Status.StartTime == nil {
    unReady = true
} else {
    if pod.Status.StartTime.Add(cpuInitializationPeriod).After(time.Now()) {
        // Within the CPU initialization window, also treat stale metrics as not ready.
        unReady = condition.Status == corev1.ConditionFalse || metric.Timestamp.Before(condition.LastTransitionTime.Time.Add(metric.Window))
    } else {
        unReady = condition.Status == corev1.ConditionFalse && pod.Status.StartTime.Add(delayOfInitialReadinessStatus).After(condition.LastTransitionTime.Time)
    }
}
if unReady {
    unReadyPods.Insert(pod.Name)
    continue
}

// After the loop, drop the metrics of every filtered pod.
if ignoredPods != nil && ignoredPods.Len() > 0 {
    removeMetricsForPods(metrics, ignoredPods)
}
if unReadyPods != nil && unReadyPods.Len() > 0 {
    removeMetricsForPods(metrics, unReadyPods)
}

For missing metrics, WPA substitutes a tolerant value or zero depending on the scaling direction:
// Pods missing metrics
// Inside the per‑pod loop: record pods that report no metric yet.
metric, found := metrics[pod.Name]
if !found {
    missingPods.Insert(pod.Name)
    continue
}

// After the loop: substitute a value for each missing metric.
if len(missingPods) > 0 {
    if action == v1alpha1.CronScaleDown {
        // Scale-down: assume the pod sits at the high watermark plus tolerance,
        // so a missing metric can never trigger an unwarranted scale-down.
        for podName := range missingPods {
            metrics[podName] = metricsclient.PodMetric{Value: metric.Resource.HighWatermark.MilliValue() + metric.Resource.HighWatermark.MilliValue()*wpa.Spec.Tolerance.MilliValue()/1000}
        }
    } else {
        // Scale-up: assume zero load, so a missing metric can never trigger
        // an unwarranted scale-up.
        for podName := range missingPods {
            metrics[podName] = metricsclient.PodMetric{Value: 0}
        }
    }
}

Cooldown Period
The cooldown period defines a waiting time after a scaling action to avoid rapid oscillations. It can be configured directly in the platform.
Frequency Control
Limits the number of replicas changed in a single scaling operation. The formula is:

newReplicas = min(currentReplicas + max(1, floor(currentReplicas × upScalePercent)), maxReplicas)

Example: 3 current replicas, target 6, up‑scale percent 20 %, max 7 → tentative total = min(3 + max(1, floor(3 × 0.2)), 7) = min(3 + 1, 7) = 4. After frequency control, only 1 replica is added even though the proposal asked for 3.
Similar logic applies to down‑scaling.
Tolerance
Tolerance adds a buffer to the upper and lower thresholds (default 1 %). This smooths out minor metric fluctuations.
Upper threshold: highWaterMark * (1 + Tolerance)
Lower threshold: lowWaterMark * (1 - Tolerance)
Alerting and Notification
The custom autoscaler integrates with monitoring and alerting systems, providing real‑time notifications for scaling actions and version mismatches, and alerts for critical inconsistencies.
HelloTech
Official Hello technology account, sharing tech insights and developments.