Optimizing Container Resource Utilization and Cost at TAL Education
This article describes TAL Education's exploration and practice in container resource optimization: dynamic scaling to cut resource usage during off-peak hours, resource overcommit and mixed online-offline deployment to raise CPU utilization, and guidance on choosing public-cloud compute types and billing models to reduce cost, supported by real-world data and best-practice recommendations.
Current Situation Overview
Most business lines have been containerized, which makes resource management finer-grained and faster. However, overall cluster CPU utilization remains low (daily average below 10%), while memory is abundant. Applications often request far more CPU than they actually use, so a node's allocatable CPU is consumed by requests long before the node is actually busy, and new pods are left Pending. Figure 1 shows CPU utilization of an IDC Kubernetes cluster.
Dynamic Scaling Based on Traffic Cycle
Internet traffic exhibits clear daily and weekly peaks. Using Alibaba Cloud's CronHPA and elastic node pools, resources are requested and released on a schedule, aligning compute capacity with demand and cutting costs. Figure 3 illustrates the elastic node pool.
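The idea of schedule-driven scaling can be sketched in a few lines of Python. This is only an illustration of the concept, not CronHPA's configuration format or implementation; the window times, replica counts, and the names `PEAK_WINDOWS`, `BASELINE_REPLICAS`, and `desired_replicas` are all hypothetical.

```python
from datetime import datetime, time

# Hypothetical daily peak windows: (start, end, replicas). Values are illustrative only.
PEAK_WINDOWS = [
    (time(8, 0), time(12, 0), 20),
    (time(18, 0), time(22, 0), 30),
]
# Hypothetical replica count kept outside every peak window.
BASELINE_REPLICAS = 6


def desired_replicas(now: datetime) -> int:
    """Return the replica count for the window containing `now`, else the baseline."""
    current = now.time()
    for start, end, replicas in PEAK_WINDOWS:
        # Windows are half-open: the start is included, the end is not.
        if start <= current < end:
            return replicas
    return BASELINE_REPLICAS
```

For example, `desired_replicas(datetime(2024, 1, 1, 19, 0))` falls in the evening window and returns 30, while a 03:00 timestamp returns the baseline of 6. Scaling the replica count down is what allows an elastic node pool to release the nodes that become empty.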
In a sample service, about half of the nodes are fixed and the other half are elastic, achieving roughly a 23% reduction in resource cost; fully elastic scaling could save over 30%.
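A back-of-the-envelope model helps interpret these figures. The sketch below assumes that fixed and elastic nodes cost the same per hour and that elastic nodes are billed only while they run; both are simplifying assumptions, and the function names are hypothetical. It is not TAL's cost accounting.

```python
def cost_saving(elastic_fraction: float, duty_cycle: float) -> float:
    """Fraction of cost saved versus running every node all the time.

    elastic_fraction: share of nodes that are elastic (0..1).
    duty_cycle: share of time the elastic nodes are actually running (0..1).
    """
    return elastic_fraction * (1.0 - duty_cycle)


def implied_duty_cycle(elastic_fraction: float, saving: float) -> float:
    """Invert cost_saving: the duty cycle that would produce a given saving."""
    return 1.0 - saving / elastic_fraction


# With half the nodes elastic and a 23% saving, the elastic half would be
# running a little over half of the time under this simplified model.
duty = implied_duty_cycle(0.5, 0.23)
print(f"implied duty cycle of elastic nodes: {duty:.2f}")
```

Under these assumptions the saving grows with both the elastic share and the idle time of the elastic nodes, which is why moving more of the pool to elastic capacity increases the reduction.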
Application Resource Over-Commit
Applications typically request far more CPU than they use at peak. Kubernetes assigns each pod a QoS class (Guaranteed, Burstable, or BestEffort) according to how its resource requests and limits are set. In practice, requests are often 2-3× the actual peak (in some cases more), while limits can be 5-6×, leading to significant over-commit. Figure 4 shows the current resource utilization.
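The QoS rules can be made concrete with a small classifier. This is a simplified sketch of the documented Kubernetes rules, not the kubelet's code: it compares quantity strings literally (so `"500m"` and `"0.5"` would be treated as different) and considers only CPU and memory. The function name and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List

# Each container is described as {"requests": {...}, "limits": {...}}.
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    has_any = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # A missing request defaults to the limit when a limit is set.
        requests = {**limits, **c.get("requests", {})}
        if requests or limits:
            has_any = True
        for res in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, equal to the request.
            if res not in limits or requests.get(res) != limits[res]:
                guaranteed = False
    if not has_any:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

A pod whose only container sets `requests={"cpu": "500m"}` and `limits={"cpu": "2"}` is classified as Burstable, which is the typical shape of an over-committed application: the scheduler reserves the request, while the container may burst up to the limit.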
Node Over-Commit Imbalance
Over-committing creates uneven CPU utilization across nodes, causing instability. Figure 5 displays the variance in node CPU usage. The container team is developing scheduler plugins to mitigate this issue.
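One common way to reduce this kind of imbalance is to take measured usage into account when placing pods, rather than requests alone. The sketch below illustrates that general idea only; it is not the plugin the container team is building, and the `Node` fields, `utilization_spread`, and `pick_node` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import List, Optional


@dataclass
class Node:
    name: str
    allocatable_cpu: float  # cores available to pods
    requested_cpu: float    # sum of pod CPU requests, in cores
    used_cpu: float         # measured CPU usage, in cores

    @property
    def utilization(self) -> float:
        return self.used_cpu / self.allocatable_cpu


def utilization_spread(nodes: List[Node]) -> float:
    """Population standard deviation of node CPU utilization (0 means balanced)."""
    return pstdev([n.utilization for n in nodes])


def pick_node(nodes: List[Node], pod_request: float) -> Optional[Node]:
    """Among nodes with enough unrequested CPU, prefer the least-utilized one."""
    candidates = [
        n for n in nodes if n.allocatable_cpu - n.requested_cpu >= pod_request
    ]
    if not candidates:
        return None  # the pod would stay Pending
    return min(candidates, key=lambda n: n.utilization)
```

Filtering on requests keeps the scheduler's capacity guarantee, while ranking on measured utilization steers new pods away from nodes that are already hot.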
Node Over-Commit Testing
Tests show that over-committed nodes achieve about 10% higher CPU utilization than non-over-committed nodes (Figure 6). The gain comes with higher stability risk, so utilization has to be weighed against reliability.
Mixed Deployment of Offline Services
By separating online and offline workloads into distinct resource pools, compute resources can be dynamically borrowed between them, improving overall utilization. The approach is detailed in the referenced YARN mixed-deployment article.
Public-Cloud Resource and Cost Optimization
TAL Education heavily uses Alibaba Cloud for container workloads. Elastic scaling can save over 20% of compute costs. Serverless K8s (ASK/ECI) is convenient but up to 40% more expensive for persistent services compared to bare-metal ACK clusters. Bare-metal machines are preferred where possible; AMD instances and various billing models (reserved, spot, elastic) are evaluated for cost efficiency.
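The 40% figure suggests a simple break-even rule of thumb. The sketch below assumes that the premium applies per running hour, that serverless capacity is billed only while it runs, and that the fixed cluster is billed around the clock; these are simplifications, and the function names are hypothetical.

```python
def break_even_duty_cycle(premium: float) -> float:
    """Share of time a workload must run before serverless stops being cheaper.

    premium: extra hourly cost of serverless relative to fixed capacity,
             e.g. 0.40 for "40% more expensive".
    """
    return 1.0 / (1.0 + premium)


def cheaper_option(duty_cycle: float, premium: float = 0.40) -> str:
    """Compare normalized cost: a fixed node costs 1.0 regardless of usage."""
    serverless_cost = duty_cycle * (1.0 + premium)
    return "serverless" if serverless_cost < 1.0 else "fixed"


print(f"break-even duty cycle at a 40% premium: {break_even_duty_cycle(0.40):.0%}")
```

Under these assumptions a workload that runs continuously is cheaper on fixed capacity, while one that runs only a fraction of the day can still come out ahead on serverless despite the higher hourly price.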
Collaboration with Development Teams
Resource and cost optimization requires joint effort. The container team provides billing dashboards and usage rankings, and works with development teams to keep services stable while reducing expenses. It also implements safeguards to balance over-commit benefits against potential instability.
TAL Education Technology
TAL Education is a technology-driven education company committed to the mission of 'making education better through love and technology'. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.