How Hupu Scaled to Millions: Inside the Flex Auto‑Scaling Platform
This article covers Hupu's sports-traffic environment and the design and implementation of the Flex auto-scaling platform: its architecture, its core functions (resource statistics, node and Pod scaling, scenario scheduling), and the performance optimizations that enable rapid, cost-effective scaling across multi-cloud Kubernetes clusters.
Overview
Hupu, a sports community founded in 2004, serves over 100 million registered users, with daily peak traffic of 230 million and 11 million monthly active users, demanding high reliability.
1. Hupu Business Scenario
Hupu runs about 500 online production services written in Java, Python, Go, PHP, and Node across four environments (development, test, pre-release, production). Since 2019, 80% of workloads have been containerized on Kubernetes; peak node utilization exceeds 60%, and cloud resource costs have been cut by 50%.
Multiple clusters are deployed on Tencent Cloud and Alibaba Cloud, covering test, pre‑release, production, operation, and disaster‑recovery environments.
Traffic Characteristics
Traffic is low at night and stable during the day, but spikes dramatically during sports events or unexpected incidents, making it hard to predict.
Challenges
Hupu must keep sufficient standby resources while minimizing cost (only a 15-20% buffer in production clusters), support multi-cloud deployment, and ensure compatibility between Pods and VMs.
2. Flex Auto‑Scaling Platform
Overview
Flex provides a dashboard with cluster‑level and application‑level resource views, scenario scheduling status, and hot‑match information for operators.
The left menu includes scaling for Pods, cloud VMs, nodes, database scaling, scenario switches, audit logs, and permission management.
Architecture
Flex stores application metadata (name, project, owner, department, resource configuration, replica count) in a CMDB. When scaling is required, Flex retrieves the target application and ensures a minimum number of ready instances.
For scaling up, Prometheus collects metrics from Kubernetes and cloud VMs, processes alerts via the Rig system, and passes decisions to Flex. An API layer abstracts calls to K8s, Tencent Cloud, and Alibaba Cloud, enabling unified Pod and VM scaling.
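The API layer described above can be sketched as a small routing abstraction. This is a hypothetical illustration, not Flex's actual code; all class and method names are invented, and the real implementation would call the Kubernetes and cloud-provider SDKs instead of returning strings.

```python
# Illustrative sketch of a unified scaling API layer: one interface,
# with backends for Kubernetes Pods and for cloud VMs. Names are invented.
from abc import ABC, abstractmethod

class ScalingBackend(ABC):
    @abstractmethod
    def scale(self, target: str, replicas: int) -> str: ...

class KubernetesBackend(ScalingBackend):
    def scale(self, target, replicas):
        # In production this would patch the Deployment's replica count
        # through the Kubernetes API; here we only describe the action.
        return f"k8s: scale deployment {target} to {replicas} replicas"

class CloudVMBackend(ScalingBackend):
    def __init__(self, provider):
        self.provider = provider  # e.g. "tencent" or "alibaba"
    def scale(self, target, replicas):
        return f"{self.provider}: resize scaling group {target} to {replicas} VMs"

class FlexScaler:
    """Routes a scaling decision to the matching backend."""
    def __init__(self):
        self.backends = {
            "k8s": KubernetesBackend(),
            "tencent": CloudVMBackend("tencent"),
            "alibaba": CloudVMBackend("alibaba"),
        }
    def scale(self, platform, target, replicas):
        return self.backends[platform].scale(target, replicas)
```

Keeping the decision logic above this interface is what lets one alert pipeline drive both Pod and VM scaling across clouds.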
Core Functions
Flex's core functions are resource statistics, node scaling, Pod and cloud-VM scaling, and scenario scheduling.
Resource Statistics
The dashboard shows node counts, distinguishing between subscription‑based and pay‑as‑you‑go nodes. Pay‑as‑you‑go nodes running more than 8 hours trigger evaluation to add subscription nodes for cost optimization.
Resource allocation is tuned via Requests; when usage approaches thresholds, scaling is considered, otherwise excess subscription capacity is reduced. Application‑level rankings display replica, resource usage, and cost, helping identify hot spots for performance tuning and cost reduction.
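The 8-hour rule above reduces to a simple filter over node billing records. A minimal sketch, with made-up node data and field names:

```python
# Hedged sketch of the cost-optimization rule from the article:
# pay-as-you-go nodes that stay up longer than 8 hours are flagged
# for evaluation as cheaper subscription (reserved) nodes.
def nodes_to_convert(nodes, threshold_hours=8):
    """Return names of pay-as-you-go nodes running past the threshold."""
    return [n["name"] for n in nodes
            if n["billing"] == "pay-as-you-go" and n["uptime_h"] > threshold_hours]

nodes = [
    {"name": "node-a", "billing": "pay-as-you-go", "uptime_h": 12.5},
    {"name": "node-b", "billing": "pay-as-you-go", "uptime_h": 2.0},
    {"name": "node-c", "billing": "subscription",  "uptime_h": 720.0},
]
print(nodes_to_convert(nodes))  # ['node-a']
```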
Node Scaling
Initially, scaling was triggered only by Pod Pending events. A reserved-resource dimension was added later: the reserve started at 35-40% and was compressed to 15-20% after cost analysis.
Standard resource packages (e.g., 2C4G, 8C16G) are reserved to avoid fragmentation; an 8C16G reserve allows at least ten instances to be created. Over three years, node-scaling operations have exceeded 200,000.
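The reserve-ratio logic can be illustrated with a small calculation: if free capacity drops below the target reserve, add whole standard-package nodes until the reserve holds again. The numbers and helper name are assumptions for illustration, not Flex's actual formula.

```python
# Illustrative reserve-capacity check: keep ~15% of cluster CPU free,
# and when the reserve is breached, add whole standard-package nodes
# (e.g. 8-core packages) rather than odd sizes, avoiding fragmentation.
import math

def nodes_needed(total_cpu, used_cpu, reserve_ratio=0.15, package_cpu=8):
    """How many package nodes to add so free capacity >= reserve_ratio."""
    free = total_cpu - used_cpu
    if free >= total_cpu * reserve_ratio:
        return 0
    # Adding n nodes must satisfy:
    #   free + n*package_cpu >= (total_cpu + n*package_cpu) * reserve_ratio
    deficit = total_cpu * reserve_ratio - free
    per_node_gain = package_cpu * (1 - reserve_ratio)
    return math.ceil(deficit / per_node_gain)
```

For example, a 100-core cluster with 95 cores in use has a 10-core deficit against the 15% target and would need two 8-core package nodes.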
Pod & Cloud‑VM Scaling
Scaling decisions consider CPU, memory, QPS, JVM thread count, and both real‑time and predictive metrics (10‑minute forecast). Predictive scaling can pre‑empt traffic spikes.
Black‑list rules can forbid scaling up or down for specific applications, and thresholds can be defined by percentage or absolute numbers.
An API allows services to control replica counts directly.
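Putting the pieces above together, a scale-up decision can be sketched as a check of both the real-time metric and the 10-minute forecast, gated by the blacklist. The threshold value and function shape are assumptions:

```python
# Sketch of a scale-up decision combining a real-time metric with a
# 10-minute predictive forecast, gated by a per-application blacklist,
# as described above. Threshold and signature are illustrative.
def should_scale_up(app, cpu_now, cpu_forecast_10m, threshold=0.7,
                    scale_up_blacklist=frozenset()):
    """Scale up if current OR predicted CPU crosses the threshold,
    unless the application is forbidden from scaling up."""
    if app in scale_up_blacklist:
        return False
    return cpu_now >= threshold or cpu_forecast_10m >= threshold
```

Because the forecast term can fire before the real-time term, this is what lets predictive scaling pre-empt a traffic spike instead of reacting to it.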
Scenario Scheduling
Pre‑defined strategies handle special events (e.g., NBA start) by activating specific scaling phases before operators arrive, ensuring resources are ready for sudden traffic surges.
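A scenario schedule of this kind boils down to event entries with a lead time, checked against the clock. The event, service names, replica targets, and lead window below are all invented for illustration:

```python
# Hypothetical scenario schedule: before a known event (e.g. an NBA
# game tipping off at 08:00), pre-scale key services so capacity is
# ready before operators arrive. All entries are invented examples.
from datetime import datetime, timedelta

SCENARIOS = [
    # (event start, lead time, service -> target replicas)
    (datetime(2023, 6, 2, 8, 0), timedelta(minutes=30),
     {"bbs-api": 40, "live-comment": 60}),
]

def actions_due(now):
    """Return the pre-scaling targets whose lead window has opened."""
    due = {}
    for start, lead, targets in SCENARIOS:
        if start - lead <= now < start:
            due.update(targets)
    return due
```

A scheduler polling `actions_due` every minute would start scaling at 07:30, half an hour before tip-off.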
Other Scaling
An alpha version adds MySQL/Redis scaling based on CPU, memory, disk, connections, and shard count.
3. Problems and Optimizations
Slow Node Join
Node join time on Alibaba Cloud was ~2 minutes; after script tuning it dropped to 10 seconds. Asynchronous batch creation of up to 50 nodes further improved speed.
Pod Scheduling Delays
Creating an initial batch of 200 Pods took ~20 seconds due to the arms-pilot component; after it was removed, creation fell to about 1 second. DaemonSet overhead was reduced by embedding the functionality into the nodes themselves.
A separate Scale call was added immediately after patching the HPA, shortening reaction time.
Image Pull Latency
Layered Docker images are used: base layer (CentOS), language layer (OpenJDK/Golang/Python), intermediate layer (common packages), and business layer (application code). A custom NodeImage plugin pre‑pulls core images, and Harbor proxy accelerates concurrent pulls.
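The layering scheme above might look like the following Dockerfile sketch; the registry host, image names, and paths are invented examples, not Hupu's actual images. Because the base and language layers are pre-pulled onto every node, a deploy only needs to download the small business layer.

```dockerfile
# Base + language layers (CentOS + OpenJDK), pre-pulled by NodeImage
FROM harbor.example.com/base/centos-openjdk:8
# Intermediate layer: common packages shared across services
COPY common-agent/ /opt/agent/
# Business layer: only this thin layer changes per release
COPY target/app.jar /opt/app/app.jar
CMD ["java", "-jar", "/opt/app/app.jar"]
```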
Pod Startup Slowness
Three tactics are applied: lazy‑load agents, adjust health checks (delay liveness, advance readiness), and move postStart/preStop logic to external processes.
Master Node Performance
During massive scaling, master CPU spiked to 100% and memory was exhausted. Profiling traced the overhead to JSON decoding; temporarily scaling the masters to 64C128G (or 64C256G) during peak periods balances cost and stability.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.