Capacity Management: Goals, Stages, Optimization Techniques, and Scaling Practices
The article explains how capacity management balances cost control and service quality through defined goals, three development stages, detailed resource optimization methods, stress‑testing metrics and standards, and automated scaling to achieve significant cost reductions while maintaining system stability.
Background: As ZhaiZhai's business expands, hardware and infrastructure investments increase but resource utilization declines, prompting the need for capacity management to balance cost and service quality.
1. Goals of Capacity Management
Capacity management has two goals: cost control and business support. It keeps services within their SLAs while optimizing resource usage.
2. Development Stages
Capacity management evolved through three stages: (1) no capacity management, with services mixed on physical machines and KVM virtual machines; (2) availability and performance analysis to reduce mixed deployment, decommission KVM, and raise utilization, which cut resource cost by about 50%; (3) the cloud era, in which stress‑test standards were introduced and cost was roughly halved again.
3. Capacity Management Practices
3.1 Capacity Water Level
The capacity water level is the ratio of resources actually consumed to total available resources. It is measured for cloud hosts (CPU, memory, disk, NIC) and for application services (JVM memory, threads, GC frequency, QPS, response time).
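The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not the article's tooling: the `ResourceUsage` class, the resource names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResourceUsage:
    """One measured resource on a host (hypothetical structure)."""
    name: str
    used: float
    total: float

    def water_level(self) -> float:
        """Ratio of consumed to total available resource."""
        if self.total <= 0:
            raise ValueError(f"total for {self.name} must be positive")
        return self.used / self.total


def host_water_levels(usages):
    """Map each resource name to its water level."""
    return {u.name: u.water_level() for u in usages}


def bottleneck(usages):
    """Return the resource with the highest water level."""
    return max(usages, key=lambda u: u.water_level())


# Illustrative numbers only, not taken from the article.
sample = [
    ResourceUsage("cpu_cores", 1.2, 4),
    ResourceUsage("memory_gib", 6, 8),
    ResourceUsage("disk_gib", 40, 200),
    ResourceUsage("nic_mbps", 150, 1000),
]

if __name__ == "__main__":
    for name, level in host_water_levels(sample).items():
        print(f"{name}: {level:.0%}")
    print("bottleneck:", bottleneck(sample).name)
```

Tracking the highest water level per host is one simple way to see which resource limits further consolidation; the same ratio applies to service-level figures such as thread count against the thread-pool limit.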
3.2 Resource Capacity Optimization
Examples include reducing a service's CPU allocation from 4 cores to 2 when average usage is low; sizing JVM memory with the formula JVM total memory = heap + thread stack size (-Xss) * thread count + constant overhead; and mixed deployment of high‑ and low‑priority services.
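As a worked example of the JVM memory formula, the sketch below plugs in illustrative values. The function name, the units, and the sample figures (2 GiB heap, 1 MiB thread stacks, 500 threads, 256 MiB overhead) are assumptions for demonstration; the article does not give concrete numbers.

```python
def jvm_total_memory_mib(heap_mib, thread_stack_kib, thread_count, overhead_mib):
    """Estimate JVM total memory in MiB.

    total = heap + thread stack size * thread count + constant overhead
    thread_stack_kib corresponds to the per-thread stack size (the -Xss setting).
    """
    thread_stacks_mib = thread_stack_kib * thread_count / 1024
    return heap_mib + thread_stacks_mib + overhead_mib


if __name__ == "__main__":
    # Hypothetical service: 2048 MiB heap, 1024 KiB stacks, 500 threads,
    # 256 MiB assumed for the constant overhead term.
    total = jvm_total_memory_mib(2048, 1024, 500, 256)
    print(f"estimated JVM footprint: {total:.0f} MiB")
```

With these inputs the thread stacks alone account for 500 MiB, which shows why thread count matters when fitting a service into a smaller instance.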
3.3 Cluster Capacity
Accurate cluster capacity is determined by combining stress testing with the capacity water level. Tests are run either per instance, using log replay or TCP‑Copy, or against the whole cluster.
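One way to turn a per-instance stress-test result into a cluster figure is sketched below. It assumes throughput scales linearly with instance count, which whole-cluster tests exist to verify; the function names and the numbers are hypothetical.

```python
import math


def cluster_capacity_qps(per_instance_max_qps, instance_count):
    """Cluster capacity, assuming linear scaling across instances."""
    return per_instance_max_qps * instance_count


def cluster_water_level(peak_qps, per_instance_max_qps, instance_count):
    """Observed peak traffic as a fraction of stress-tested capacity."""
    return peak_qps / cluster_capacity_qps(per_instance_max_qps, instance_count)


def instances_needed(expected_peak_qps, per_instance_max_qps, target_water_level):
    """Instances required to serve the expected peak at the target water level."""
    if not 0 < target_water_level <= 1:
        raise ValueError("target_water_level must be in (0, 1]")
    usable_qps = per_instance_max_qps * target_water_level
    return math.ceil(expected_peak_qps / usable_qps)


if __name__ == "__main__":
    # Illustrative: each instance sustained 500 QPS in a stress test.
    print(cluster_capacity_qps(500, 10))        # capacity of 10 instances
    print(cluster_water_level(3000, 500, 10))   # current water level
    print(instances_needed(6000, 500, 0.5))     # sizing for a larger peak
```

Shared dependencies such as databases or caches can break the linear assumption, so a figure derived this way is best treated as an upper bound until a whole-cluster test confirms it.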
3.4 Stress‑Test Metrics
System metrics: CPU, memory, disk I/O, NIC bandwidth. Service metrics: response time, latency percentiles, error rate, slow‑request ratio.
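The service metrics listed above can be derived from raw request records. The following sketch assumes each record is a `(latency_ms, succeeded)` pair and uses the nearest-rank method for percentiles; the record shape, the function names, and the slow-request threshold are illustrative choices, not details from the article.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)  # integer ceiling of pct * n / 100
    return sorted_values[rank - 1]


def summarize(requests, slow_threshold_ms):
    """Compute service metrics from (latency_ms, succeeded) records."""
    if not requests:
        raise ValueError("no requests recorded")
    latencies = sorted(latency for latency, _ in requests)
    total = len(requests)
    failures = sum(1 for _, ok in requests if not ok)
    slow = sum(1 for latency in latencies if latency > slow_threshold_ms)
    return {
        "avg_ms": sum(latencies) / total,
        "p50_ms": percentile(latencies, 50),
        "p90_ms": percentile(latencies, 90),
        "p99_ms": percentile(latencies, 99),
        "error_rate": failures / total,
        "slow_ratio": slow / total,
    }


# Synthetic data: latencies 1..100 ms, with two failed requests.
sample_requests = [(ms, ms not in (37, 91)) for ms in range(1, 101)]

if __name__ == "__main__":
    for key, value in summarize(sample_requests, slow_threshold_ms=80).items():
        print(f"{key}: {value}")
```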
3.5 Stress‑Test Standards
The standards set maximum acceptable error rates (≤1% for A‑level services, ≤3% for B‑level, ≤5% for all others) and response‑time thresholds for the median, 90th, and 99th percentiles, each expressed relative to the average response time.
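A pass/fail check against these standards might look like the sketch below. The error-rate limits come from the text above; the percentile multipliers passed to `latency_ok` are placeholders, because the article states only that the thresholds are relative to the average and does not give the ratios.

```python
# Error-rate limits from the standards: A <= 1%, B <= 3%, all others <= 5%.
ERROR_RATE_LIMITS = {"A": 0.01, "B": 0.03}
DEFAULT_ERROR_RATE_LIMIT = 0.05


def error_rate_ok(service_level, error_rate):
    """True if the measured error rate is within the limit for this level."""
    limit = ERROR_RATE_LIMITS.get(service_level, DEFAULT_ERROR_RATE_LIMIT)
    return error_rate <= limit


def latency_ok(avg_ms, percentiles_ms, max_ratio_to_avg):
    """True if every percentile stays within its allowed multiple of the average.

    Both dicts are keyed by percentile name, e.g. "p50", "p90", "p99".
    The multipliers are caller-supplied assumptions, not published values.
    """
    return all(
        percentiles_ms[name] <= avg_ms * ratio
        for name, ratio in max_ratio_to_avg.items()
    )


if __name__ == "__main__":
    hypothetical_ratios = {"p50": 1.0, "p90": 2.0, "p99": 4.0}
    measured = {"p50": 45, "p90": 90, "p99": 210}
    print(error_rate_ok("A", 0.02))                      # over the 1% limit
    print(latency_ok(50, measured, hypothetical_ratios)) # p99 exceeds 4x average
```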
4. Scaling Operations
Based on capacity data, automatic scaling is applied during promotional activities and for daily service quality assurance.
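The article does not describe its scaling algorithm, so the following is only one plausible rule: resize the cluster in proportion to how far the current water level is from a target level, the same shape of formula the Kubernetes Horizontal Pod Autoscaler uses. The function name, the target level, and the instance bounds are hypothetical.

```python
import math


def desired_instances(current_instances, water_level, target_level,
                      min_instances=2, max_instances=100):
    """Proportional scaling: size the cluster so load returns to target_level."""
    if target_level <= 0:
        raise ValueError("target_level must be positive")
    raw = current_instances * water_level / target_level
    # Round first so floating-point noise does not add a spurious instance.
    desired = math.ceil(round(raw, 6))
    return max(min_instances, min(max_instances, desired))


if __name__ == "__main__":
    # Promotion traffic pushes 10 instances to an 80% water level; target is 50%.
    print(desired_instances(10, 0.8, 0.5))
    # Quiet period: the same cluster at 20% can shrink.
    print(desired_instances(10, 0.2, 0.5))
```

The minimum bound keeps redundancy during quiet periods, and the maximum bound caps cost if a metric spike turns out to be spurious.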
5. Summary
Capacity management is a complex engineering discipline that combines strategies, processes, and standards to achieve cost reduction and efficiency while ensuring service stability.