
Essential Linux Ops: Proven Troubleshooting Steps for Common Failures

This guide outlines a systematic Linux operations troubleshooting framework—emphasizing error messages, log analysis, root‑cause isolation, and step‑by‑step solutions for six real‑world scenarios ranging from filesystem corruption to inode exhaustion and read‑only file‑system errors.

As a professional Linux operations engineer, having a clear troubleshooting framework is essential for quickly pinpointing and resolving issues.

First, pay close attention to error messages, as they often indicate where the problem lies.

Next, examine logs—both system logs (e.g., files under /var/log) and application‑specific logs—to gain deeper insight.

Then analyze and locate the root cause by correlating error messages, log entries, and contextual information.

Finally, address the core issue once its cause is identified.
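As a concrete illustration of the log-analysis step, the sketch below greps a sample log for error lines; the file path and log contents are hypothetical.

```shell
# Hypothetical log file: write two sample lines, then isolate the errors.
printf '%s\n' \
  'Jan 01 10:00:01 host app[100]: service started' \
  'Jan 01 10:00:02 host app[100]: error: no space left on device' \
  > /tmp/app.log

grep -i 'error' /tmp/app.log     # show error lines with full context
grep -ci 'error' /tmp/app.log    # count them — prints 1
```

On systemd hosts, <code>journalctl -p err -b</code> gives the equivalent system-wide view of error-priority messages from the current boot.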

Scenario 1: Filesystem corruption prevents boot

When the root filesystem on /dev/sda6 shows errors at boot, boot into rescue mode (an in-use root filesystem cannot be unmounted from the running system), then force a check and repair:

<code># umount /dev/sda6
# fsck.ext3 -y /dev/sda6
</code>

Scenario 2: “Argument list too long” and disk‑full errors

Saving a crontab can fail with “no space left on device” because /var is full. Confirm with <code>df -h</code>, then identify large directories with <code>du</code>. Deleting files in /var/spool/clientmqueue frees space:

<code># rm /var/spool/clientmqueue/*
</code>

For “argument list too long”, you can delete files in batches, use <code>find</code>, or write a shell script:

<code># rm -rf [a-n]*
# rm -rf [o-z]*
# find /var/spool/clientmqueue -type f -print -exec rm -f {} \;

#!/bin/bash
RM_DIR='/var/spool/clientmqueue'
cd "$RM_DIR" || exit 1
for i in ./*; do
    rm -f "$i"
done
</code>
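The “argument list too long” error comes from the kernel's limit on the combined size of arguments passed to a single exec call when the shell expands <code>*</code>; <code>find</code> sidesteps it by removing files one at a time. A toy demonstration with a hypothetical directory:

```shell
# Create many files, then remove them with find, which never builds
# one giant argument list the way 'rm *' does.
mkdir -p /tmp/mq_demo
for i in $(seq 1 1000); do : > "/tmp/mq_demo/f$i"; done

find /tmp/mq_demo -type f -delete
ls /tmp/mq_demo | wc -l    # prints 0
```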

Scenario 3: Inode exhaustion causing service failure

An Oracle database may refuse to start despite sufficient disk space because inodes are exhausted. Check with <code>df -i</code> and clear inode‑heavy files:

<code># df -i
# dumpe2fs -h /dev/sda3 | grep 'Inode count'
# find /var/spool/clientmqueue/ -type f -exec rm -f {} \;
</code>
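To find which directory is actually consuming the inodes, count entries per subdirectory (every file, link, and directory costs one inode) and sort. A toy sketch with hypothetical paths:

```shell
# Build a demo tree: 'a' holds three files, 'b' holds one.
mkdir -p /tmp/inode_demo/a /tmp/inode_demo/b
for i in 1 2 3; do : > "/tmp/inode_demo/a/f$i"; done
: > /tmp/inode_demo/b/f1

# Count entries per subdirectory, largest consumer first.
for d in /tmp/inode_demo/*/; do
  printf '%d %s\n' "$(find "$d" | wc -l)" "$d"
done | sort -rn
```

Against a real system you would point the loop at the suspect mount, e.g. <code>/var/*/</code>.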

Scenario 4: Deleted files still occupying space

If large files are removed from /tmp but space is not released, an httpd process may still hold the deleted <code>access_log</code> open. Identify the open file and truncate it or restart the service:

<code># lsof | grep deleted
# echo "" > /tmp/access_log          # works while the path still exists
# echo "" > /proc/<PID>/fd/<FD>      # if the file has already been deleted
</code>
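The behaviour is easy to reproduce: a deleted file's blocks stay allocated until the last open descriptor on it is closed. A minimal Linux-only sketch using /proc (the path is illustrative):

```shell
exec 3> /tmp/held.log       # open descriptor 3 for writing
echo "some data" >&3
rm /tmp/held.log            # directory entry gone, blocks still allocated
ls -l /proc/$$/fd/3         # symlink target now ends in '(deleted)'

: > /proc/$$/fd/3           # truncate via /proc to release the space...
exec 3>&-                   # ...or simply close the descriptor
```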

Scenario 5: “Too many open files” in a Java web app

Check the file‑descriptor limit with <code>ulimit -n</code> and raise it by setting appropriate <code>nofile</code> values in <code>/etc/security/limits.conf</code>. Remember to restart Tomcat afterwards: a process only inherits the limits that were in effect when it started.
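A quick way to inspect the limits actually in effect (the values printed vary per system):

```shell
ulimit -n       # soft limit on open file descriptors for this shell
ulimit -Hn      # hard limit (the ceiling a non-root user may raise -n to)
grep 'open files' /proc/self/limits    # limits of the current process
```

In limits.conf a soft/hard pair of <code>nofile</code> entries for the service account (e.g. a hypothetical <code>tomcat</code> user: <code>tomcat soft nofile 65535</code> and <code>tomcat hard nofile 65535</code>) raises the limit for new PAM logins; note that services launched directly by systemd take their limits from the unit file (<code>LimitNOFILE=</code>) instead.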

Scenario 6: “Read‑only file system” error

When a filesystem becomes read‑only due to corruption, stop any processes using the affected partition, unmount it, run <code>fsck</code>, and remount:

<code># fuser -m /dev/sdb1
# ps -ef | grep <PID>
# kill <PID>
# umount /dev/sdb1
# fsck -V -a /dev/sdb1
# mount /dev/sdb1 /www/data
</code>
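Before and after the repair, it is worth confirming which mounts are actually read‑only; /proc/mounts lists the live mount options. A small sketch (output depends on the system and may be empty):

```shell
# Print device and mount point for every filesystem mounted read-only.
# Field 4 of /proc/mounts is the comma-separated option list.
awk '$4 ~ /(^|,)ro(,|$)/ {print $1, $2}' /proc/mounts
```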
Tags: Operations, Linux, Troubleshooting, System Administration, Shell Commands
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely-read original technical articles. We focus on the operations field and will accompany you throughout your operations career as we grow together.
