Root Cause Analysis and Resolution of Disk Exhaustion During a Promotion Event
During a large-scale promotion, an online service ran short of disk space because logs grew faster than they were cleaned up and deleted log files were still held open by an SLS process. The issue was resolved by identifying the open handles, terminating the process, and introducing log-level controls to prevent recurrence.
During a major promotion, an online application suddenly generated a large number of alerts indicating that disk usage had surged to over 80%, severely impacting the cluster.
The first step was to log in to the affected machines and run df to view filesystem usage, which confirmed that the root partition was nearly full.
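The same check can be scripted when many machines need to be inspected. The following is a minimal sketch, assuming Python 3 is available on the host; the function name usage_percent and the 80% threshold (taken from the alert described above) are illustrative. The figure may differ slightly from the Use% column of df, which excludes blocks reserved for root.

```python
import shutil


def usage_percent(path="/"):
    """Return how full the filesystem containing `path` is, as a percentage."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total


if __name__ == "__main__":
    pct = usage_percent("/")
    # 80% mirrors the alert threshold mentioned in the write-up.
    status = "ALERT" if pct > 80.0 else "ok"
    print(f"/ is {pct:.1f}% used ({status})")
```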
Although the system is configured to automatically compress and clean up logs when they reach certain sizes, the cleanup did not trigger on the promotion day, causing the disk to fill up.
Using du -sm * | sort -nr, the team identified several very large log files (e.g., service.log.20201105193331) that were still growing.
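For reference, a rough Python equivalent of that du and sort pipeline might look like the sketch below. The helper names largest_entries and tree_size are hypothetical, and the sizes are apparent file sizes (st_size) rather than the allocated blocks that du reports, so the numbers can differ for sparse files.

```python
import os


def tree_size(path):
    """Sum the apparent sizes of all files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_entries(directory, top=5):
    """Return (size_in_bytes, name) for the biggest entries directly under `directory`."""
    results = []
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
        elif entry.is_dir(follow_symlinks=False):
            size = tree_size(entry.path)
        else:
            continue  # symlinks, sockets, and other special entries
        results.append((size, entry.name))
    results.sort(reverse=True)  # largest first, like sort -nr
    return results[:top]


if __name__ == "__main__":
    for size, name in largest_entries("."):
        print(f"{size / (1024 * 1024):10.1f} MiB  {name}")
```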
Manual deletion of some log files with rm service.log.20201105193331 did not reduce disk usage because the space was still held by an active process.
Running lsof | grep deleted revealed that an SLS (Alibaba Log Service) process (PID 11526) still had the deleted log files open, preventing the filesystem from reclaiming the space.
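Where lsof is not installed, similar information can be read from /proc on Linux. This is a sketch under that assumption: the function name deleted_open_files is hypothetical, it relies on the kernel appending " (deleted)" to the link target of an unlinked file, and inspecting another user's process generally requires root.

```python
import os


def deleted_open_files(pid):
    """Return (fd, target, size_in_bytes) for descriptors of `pid` that point at unlinked files."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
            if target.endswith(" (deleted)"):
                # stat() follows the /proc link to the still-open inode.
                size = os.stat(link).st_size
                found.append((int(name), target, size))
        except OSError:
            continue  # descriptor was closed while we were scanning
    return found


if __name__ == "__main__":
    # Inspects the current process; pass another PID (for example the one
    # reported by lsof) to inspect a different process.
    for fd, target, size in deleted_open_files(os.getpid()):
        print(f"fd {fd}: {target} still holds {size} bytes")
```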
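This follows from how Linux frees file data: the kernel releases an inode's blocks only when both its link count (i_nlink) and its in-memory reference count (i_count) drop to zero. rm removes the directory entry and brings i_nlink to 0, but a process that still holds an open file descriptor keeps i_count above 0, so the space is not reclaimed.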
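The behavior is easy to reproduce on a Linux machine. The sketch below (the function name unlink_while_open is illustrative) creates a temporary file, deletes it while a descriptor is still open, and then inspects the descriptor: the path is gone and the link count is 0, yet the data is still there until the descriptor is closed.

```python
import os
import tempfile


def unlink_while_open():
    """Delete a file that is still open and report what the kernel says about it."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 1024)
        os.unlink(path)   # same effect as rm: the directory entry is removed
        st = os.fstat(fd)  # the inode is still reachable through the descriptor
        return {
            "path_exists": os.path.exists(path),
            "nlink": st.st_nlink,
            "size": st.st_size,
        }
    finally:
        os.close(fd)  # only now can the kernel free the blocks


if __name__ == "__main__":
    print(unlink_while_open())
```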
To resolve the issue, the SLS process was force-killed with kill -9 11526. Once the process exited, its file descriptors were closed, and df showed disk usage falling back to normal levels.
Post-mortem analysis identified two root causes: excessive log generation during the promotion and slow log pulling by the shared SLS project. Mitigations include downgrading log levels during high-traffic periods and separating the SLS configuration for critical services.
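The article does not show how the log-level downgrade is implemented. As one possible illustration, assuming a service that uses Python's standard logging module, the threshold could be raised while a high-traffic flag is set. The function name apply_log_downgrade, the logger name, and the choice of WARNING and INFO as the two levels are all assumptions.

```python
import logging


def apply_log_downgrade(logger, high_traffic):
    """Raise the logger's threshold during high traffic and restore it afterwards."""
    # Assumption: INFO is the normal level and WARNING the reduced-volume level.
    level = logging.WARNING if high_traffic else logging.INFO
    logger.setLevel(level)
    return level


if __name__ == "__main__":
    logging.basicConfig(format="%(levelname)s %(message)s")
    log = logging.getLogger("service")  # hypothetical logger name

    apply_log_downgrade(log, high_traffic=True)
    log.info("order detail dump")        # suppressed during the promotion
    log.warning("inventory call slow")   # still written

    apply_log_downgrade(log, high_traffic=False)
    log.info("back to normal verbosity")  # written again
```

In practice the flag would come from a configuration center or a traffic metric so that it can be flipped without a redeploy.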