Soul's Container Cluster Cost Governance: A Case Study on Resource Optimization
Soul's container cluster cost governance case study describes how the team optimized resource utilization on Kubernetes. It covers challenges such as resource fragmentation, and strategies such as SNAS for elastic node scaling and HPA+CronHPA coordination, which together achieved significant cost reductions.
The governance process had to address several obstacles: limits on HPA-driven node expansion during traffic surges, resource preemption between services that affected stability, wasted resource pool capacity during tidal traffic fluctuations, and the complexity of ongoing operations. The solutions were service governance improvements (HPA+CronHPA coordination), resource pool elasticity upgrades (the SNAS implementation), and a resource usage observation mechanism.
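One common way to coordinate HPA with CronHPA can be sketched as follows. The case study does not spell out Soul's exact mechanism, so this is an assumption: a scheduled window raises a replica floor ahead of a known peak, while the reactive HPA rule still scales above that floor when the metric demands it. The names `CronWindow`, `scheduled_floor`, and `coordinated_replicas` are hypothetical; only the ratio rule in `hpa_desired` follows the documented Kubernetes HPA formula (with its tolerance band omitted for brevity).

```python
import math
from dataclasses import dataclass
from datetime import time
from typing import List


@dataclass(frozen=True)
class CronWindow:
    """Hypothetical scheduled window that enforces a replica floor."""
    start: time
    end: time
    min_replicas: int


def hpa_desired(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Kubernetes HPA ratio rule: ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def scheduled_floor(now: time, windows: List[CronWindow], default_min: int) -> int:
    """Highest floor among the windows active at `now`, else the default minimum."""
    active = [w.min_replicas for w in windows if w.start <= now < w.end]
    return max(active, default=default_min)


def coordinated_replicas(
    now: time,
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    windows: List[CronWindow],
    default_min: int,
    max_replicas: int,
) -> int:
    """Take the larger of the reactive and scheduled values, capped at max_replicas."""
    floor = scheduled_floor(now, windows, default_min)
    reactive = hpa_desired(current_replicas, current_metric, target_metric)
    return min(max(reactive, floor), max_replicas)


# Assumed evening peak: keep at least 20 replicas between 19:00 and 23:00.
evening_peak = [CronWindow(time(19, 0), time(23, 0), min_replicas=20)]
```

Under these assumptions, a quiet service at 19:30 still holds 20 replicas because the scheduled floor wins, while a midday surge is handled by the reactive rule alone. Taking the maximum of the two values keeps the schedule from scaling a service down while it is under real load.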
Key technical implementations comprised:
Service Governance: Optimized HPA+CronHPA coordination to handle traffic surges and ensure resource availability during peak periods.
Resource Pool Elasticity: Deployed SNAS (Soul Node AutoScaler) to dynamically adjust node counts based on resource pool water levels, reducing waste while maintaining service continuity.
Service Binding: Separated CPU and GPU services, optimized resource pool assignments, and implemented resource pool water level control.
Hotspot Rescheduling: Used Koord-descheduler's load-aware rescheduling (LowNodeLoad) to migrate pods off hotspot nodes during resource contention.
Cost Control: Established resource approval workflows, implemented cost monitoring dashboards, and created service load inspection mechanisms.
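The water-level idea behind the resource pool elasticity item can be illustrated with a small sketch. The case study does not describe SNAS internals, so everything here is an assumption for illustration: "water level" is taken to mean requested CPU divided by allocatable CPU in the pool, and the `low`, `high`, and `target` thresholds, along with the function names, are hypothetical.

```python
import math


def water_level(requested_cores: float, node_count: int, cores_per_node: float) -> float:
    """Fraction of the pool's allocatable CPU that is already requested."""
    if node_count <= 0 or cores_per_node <= 0:
        raise ValueError("node_count and cores_per_node must be positive")
    return requested_cores / (node_count * cores_per_node)


def desired_node_count(
    requested_cores: float,
    node_count: int,
    cores_per_node: float,
    low: float = 0.6,     # assumed scale-in threshold
    high: float = 0.8,    # assumed scale-out threshold
    target: float = 0.7,  # assumed level to return to after resizing
    min_nodes: int = 1,
) -> int:
    """Keep the pool size while the level is inside [low, high]; otherwise resize to the target."""
    level = water_level(requested_cores, node_count, cores_per_node)
    if low <= level <= high:
        return node_count
    needed = math.ceil(requested_cores / (target * cores_per_node))
    return max(needed, min_nodes)
```

With 10 nodes of 50 cores each, 450 requested cores put the level at 0.9, so this sketch grows the pool to 13 nodes; at 130 requested cores the level is 0.26 and the pool shrinks to 4 nodes. The gap between `low` and `high` acts as a dead band so that small fluctuations do not trigger repeated scale-in and scale-out.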
The governance outcomes were improved resource utilization (90%+), reduced overall costs (20%+), and better operational stability through systematic monitoring and optimization.
Soul Technical Team
Technical practice sharing from Soul