
Root Cause Analysis and Resolution of Disk Exhaustion During a Promotion Event

During a large-scale promotion, an online service ran short of disk space because log files that had already been deleted were still held open by an SLS process, so their space was never reclaimed. The issue was resolved by identifying the open handles, terminating the process, and implementing log-level controls to prevent recurrence.

Full-Stack Internet Architecture

During a major promotion, an online application suddenly generated a large number of alerts indicating that disk usage had surged to over 80%, severely impacting the cluster.

The first step was to log into the affected machines and run df to view filesystem usage, which confirmed that the root partition was indeed near full capacity.
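The same check can be scripted. The following is a minimal sketch, not the team's tooling: the function names and the 80% default (taken from the alert level mentioned above) are illustrative assumptions. It computes usage the way df does, as used space divided by used plus available space, so the result should be close to the Use% column.

```python
import shutil


def usage_percent(path="/"):
    """Approximate df's Use% for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    # df divides by used + available (not by total) because some blocks
    # may be reserved for root; shutil's `free` is the available space.
    return usage.used / (usage.used + usage.free) * 100


def is_over_threshold(path="/", threshold=80.0):
    """True when usage exceeds `threshold` percent (hypothetical alert level)."""
    return usage_percent(path) > threshold


if __name__ == "__main__":
    print(f"/ is {usage_percent('/'):.1f}% used")
```

Running it on each affected host gives a quick yes/no answer before digging into individual directories.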

Although the system is configured to automatically compress and clean up logs when they reach certain sizes, the cleanup did not trigger on the promotion day, causing the disk to fill up.

Using du -sm * | sort -nr, the team identified several massive log files (e.g., service.log.20201105193331) that were continuously growing.
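A rough Python equivalent of that du pipeline is sketched below, assuming a hypothetical helper name largest_files. One difference to keep in mind: it ranks files by apparent size (st_size), whereas du reports allocated blocks, so sparse files may rank differently.

```python
import os
import stat


def largest_files(directory, top_n=10):
    """Return up to `top_n` (size_bytes, path) pairs, largest first."""
    sizes = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                sizes.append((st.st_size, path))
    sizes.sort(reverse=True)
    return sizes[:top_n]
```

Calling largest_files on the log directory returns the biggest files first, which is usually enough to spot a runaway log.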

Manually deleting some of these log files with rm service.log.20201105193331 did not reduce disk usage, because a running process still held the files open and the filesystem could not release their blocks.

Running lsof | grep deleted revealed that an SLS (Alibaba Log Service) process (PID 11526) still had the deleted log files open, preventing the filesystem from reclaiming the space.
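The information lsof reports here can also be read directly from /proc on Linux, where each /proc/PID/fd/N entry is a symlink whose target ends in " (deleted)" once the file has been unlinked. The sketch below assumes a Linux host and sufficient privileges to read other processes' fd directories; the function name is hypothetical.

```python
import os


def deleted_open_files(proc_root="/proc"):
    """Yield (pid, fd, target) for open descriptors whose file was unlinked."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            # Linux appends " (deleted)" to the link target of unlinked files.
            # Anonymous memory files (memfd) can match too, so inspect the path.
            if target.endswith(" (deleted)"):
                yield int(entry), int(fd), target
```

Grouping the results by PID points at the process that must close its descriptors (or be restarted) before the space comes back.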

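The underlying Linux mechanism explains this behavior: a file's data is only freed when both its link count (i_nlink) and its in-memory reference count (i_count) drop to zero. rm only decrements i_nlink, while an open file descriptor keeps i_count above zero, so the blocks stay allocated until the last descriptor is closed.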
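This behavior is easy to reproduce on a POSIX system. The following sketch (the function name is illustrative) creates a temporary file, unlinks it while a descriptor is still open, and shows that the data remains readable until the descriptor is closed.

```python
import os
import tempfile


def demo_unlink_while_open():
    """Unlink an open file and show its data survives until close()."""
    fd, path = tempfile.mkstemp()
    os.write(fd, b"x" * 4096)
    os.unlink(path)  # removes the directory entry; link count drops
    st = os.fstat(fd)  # the inode is still reachable through the descriptor
    state = (st.st_nlink, st.st_size, os.path.exists(path))
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 4096)  # contents are still intact
    os.close(fd)  # last reference released; the kernel can free the blocks
    return state, len(data)


if __name__ == "__main__":
    (nlink, size, exists), nbytes = demo_unlink_while_open()
    print(f"nlink={nlink} size={size} path_exists={exists} read={nbytes}")
```

The path disappears from the directory immediately, yet the size reported by fstat and the bytes read back are unchanged, which is the situation the SLS process was in with the deleted logs.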

To resolve the issue, the SLS process was force-killed with kill -9 11526, after which df showed disk usage falling back to normal levels.

Post‑mortem analysis identified two root causes: excessive log generation during the promotion and slow log pulling by the shared SLS project. Mitigations include implementing log‑level downgrade strategies during high‑traffic periods and separating SLS configurations for critical services.
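A log-level downgrade can be as simple as raising the logger threshold while traffic is high. The sketch below uses Python's standard logging module purely for illustration; the logger name "service", the high_traffic flag, and the choice of WARNING versus INFO are assumptions, and a Java service would use the equivalent switch in its own logging framework.

```python
import logging

# Hypothetical application logger; the real service's logger name is not given.
logger = logging.getLogger("service")


def apply_log_level(target, high_traffic):
    """Raise the threshold to WARNING under load, restore INFO otherwise."""
    target.setLevel(logging.WARNING if high_traffic else logging.INFO)


if __name__ == "__main__":
    logging.basicConfig()
    apply_log_level(logger, high_traffic=True)
    logger.info("suppressed during the promotion")
    logger.warning("still recorded during the promotion")
```

In practice the flag would be driven by a configuration switch or a traffic metric so the level can be changed without redeploying.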
