
Automated Capacity Planning and Auto‑Scaling for Hotel Services During Traffic Peaks

This article describes a capacity‑planning solution for hotel services that predicts the impact of traffic peaks, automatically estimates the required CPU resources, creates timed scaling tasks, and evaluates the results with detailed metrics. It improves operational efficiency and reduces manual effort during events such as exam‑ticket printing and holiday travel surges.

Qunar Tech Salon

1. Background

Traffic spikes caused by events such as exam‑ticket printing or holiday travel can overwhelm hotel services, leading to performance degradation, throttling, or crashes. With the existing Horizontal Pod Autoscaler (HPA) setup, the number of machines required has to be calculated manually, which is inaccurate and inefficient.

Automatically estimating event impact and pre‑scaling services can protect stability and improve operational efficiency.

2. Overall Solution

The solution integrates a traffic‑calendar platform, an algorithm service, and Ops (operations) interfaces to predict CPU requirements and trigger automatic scaling.

2.1 System Architecture

(1) The traffic‑calendar platform aggregates business monitoring data and obtains CPU core counts from Ops.

(2) It determines the peak order/QPS value for the event and calls the algorithm service to predict the total CPU cores needed.

(3) Ops converts the predicted CPU cores into an estimated instance count and schedules automatic scaling tasks.
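Step (3) can be illustrated with a minimal sketch of the cores-to-instances conversion. The function name `cores_to_instances`, the `cores_per_instance` parameter, and the sample numbers are assumptions made for illustration; the article does not describe how Ops performs this conversion. Rounding up is one plausible choice, so that the scheduled instances never provide fewer cores than predicted.

```python
import math


def cores_to_instances(predicted_cores: float, cores_per_instance: int) -> int:
    """Convert a predicted core count into a whole number of instances.

    Hypothetical helper: rounds up so capacity is never below the prediction.
    """
    if cores_per_instance <= 0:
        raise ValueError("cores_per_instance must be positive")
    if predicted_cores < 0:
        raise ValueError("predicted_cores must not be negative")
    return math.ceil(predicted_cores / cores_per_instance)


# Illustrative numbers only: 130 predicted cores on 4-core instances.
instances = cores_to_instances(predicted_cores=130, cores_per_instance=4)
print(instances)  # 33
```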

2.2 Business Process

The event lifecycle includes nine stages: pre‑judgment → pending evaluation → evaluating → evaluation completed → task creation → scaling → scaling completed → review → closed.
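The nine stages can be modeled as a simple linear state machine. The sketch below is only an illustration: the `Stage` enum and `next_stage` function are hypothetical names, and mapping "the event ends" at pre‑judgment to the closed stage is an assumption, not something the article confirms.

```python
from enum import Enum


class Stage(Enum):
    PRE_JUDGMENT = "pre-judgment"
    PENDING_EVALUATION = "pending evaluation"
    EVALUATING = "evaluating"
    EVALUATION_COMPLETED = "evaluation completed"
    TASK_CREATION = "task creation"
    SCALING = "scaling"
    SCALING_COMPLETED = "scaling completed"
    REVIEW = "review"
    CLOSED = "closed"


# Enum members iterate in definition order, which matches the lifecycle order.
_ORDER = list(Stage)


def next_stage(current: Stage, needs_scaling: bool = True) -> Stage:
    """Return the stage that follows `current`."""
    if current is Stage.CLOSED:
        raise ValueError("a closed event has no next stage")
    # Assumption: an event that needs no automatic scaling goes straight to closed.
    if current is Stage.PRE_JUDGMENT and not needs_scaling:
        return Stage.CLOSED
    return _ORDER[_ORDER.index(current) + 1]


# Walk an event through the full lifecycle.
stage = Stage.PRE_JUDGMENT
path = [stage.value]
while stage is not Stage.CLOSED:
    stage = next_stage(stage)
    path.append(stage.value)
print(" -> ".join(path))
```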

Key steps:

Hotspot Event Entry: Events can be imported from Ctrip or created manually; either way, they enter the pre‑judgment state.

Event Pre‑judgment: Determines whether automatic scaling is needed; if not, the event ends.

Pending Evaluation: Estimates peak business volume using the formula Peak Business = Baseline × (1 + Growth Rate).

Evaluation: The traffic‑calendar platform calls the algorithm service to predict CPU usage for the estimated peak.

Task Creation: Ops creates timed scaling tasks based on the predicted instance count, using a mix of on‑premise and cloud resources.

Review & Closure: After the peak, the system reviews accuracy and coverage metrics.

2.3 Metrics

Key indicators include:

Prediction accuracy (a, M, N, K codes)

Coverage rate

Mean absolute percentage error (MAPE)

Order‑CPU correlation coefficient

Average actual CPU usage

Mean absolute error of CPU prediction

Formulas such as Platform Estimated Cores = Algorithm Predicted Cores × (1 + Safety Threshold) are used to compute final scaling numbers.

2.4 Algorithm Details

The model is a neural network trained on the most recent two months of data from containerized applications with auto‑scaling enabled. Each training record includes the application, sub‑environment, timestamp, order volume, and CPU usage. The model treats order volume as the primary factor influencing CPU consumption.

Model validation uses metrics such as MAPE (0.08) and the order‑CPU correlation coefficient (0.91). The model is updated periodically, with offline validation and online performance monitoring.
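The two validation metrics can be computed as in the sketch below. It assumes the correlation coefficient is Pearson's r, which the article does not state, and the sample order and CPU series are invented purely to exercise the functions.

```python
import math
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, returned as a fraction (0.08 means 8%)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    # Actual values must be non-zero, otherwise the percentage error is undefined.
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient (assumed here; the article names no variant)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented sample data: actual vs. predicted CPU cores, and order volume vs. CPU.
actual_cpu = [100.0, 200.0]
predicted_cpu = [90.0, 220.0]
print(mape(actual_cpu, predicted_cpu))

orders = [1_000.0, 2_000.0, 3_000.0]
cpu_cores = [10.0, 20.0, 30.0]
print(pearson(orders, cpu_cores))
```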

2.5 HPA Scaling Safety Strategy

Safety limits are applied to maximum and minimum replica counts. If predicted instances exceed the configured maximum, creation is blocked with a notification. If predicted instances fall below the minimum threshold (1‑a%), scaling is restricted to maintain stability.
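A minimal sketch of this safety check follows. Several details are assumptions: `ScalingDecision` and `check_scaling` are hypothetical names, the minimum is passed in as a plain replica count rather than derived from the (1‑a%) threshold (whose exact definition the article does not give), and "restricted" is interpreted here as holding the replica count at the minimum.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingDecision:
    allowed: bool             # whether a scaling task may be created
    replicas: Optional[int]   # replica count to schedule, None when blocked
    reason: str


def check_scaling(predicted: int, min_replicas: int, max_replicas: int) -> ScalingDecision:
    """Apply upper and lower safety limits to a predicted instance count."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    if predicted > max_replicas:
        # Per the article: creation is blocked and a notification is sent.
        return ScalingDecision(False, None, "prediction exceeds configured maximum; notify owner")
    if predicted < min_replicas:
        # Assumption: restricting the scale-down means holding at the minimum.
        return ScalingDecision(True, min_replicas, "prediction below minimum; held at minimum")
    return ScalingDecision(True, predicted, "within safety limits")


# Illustrative limits only.
for predicted in (60, 3, 20):
    print(predicted, check_scaling(predicted, min_replicas=5, max_replicas=50))
```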

3. Project Data & Value

• Over 150 hotel applications (≈90% of total cores) are integrated.

• Completed high‑peak event protection for exam‑related and holiday traffic.

• Average coverage: 96%; average accuracy: 89%.

• Each peak event saves ~3 person‑days of manual ops, totaling ~270 person‑days annually, and reduces resource prediction cost by ~20%.

4. Future Plans

Expand intelligent scaling to all application scenarios, including bare‑metal and KVM, and add DB/Redis resource checks.

Leverage AI to improve business‑volume forecasting and strengthen order‑CPU correlation.

Broaden adoption across all business lines for company‑wide resource orchestration.

Tags: Algorithm, cloud computing, operations, resource management, capacity planning, auto scaling