
Mastering strace: Diagnose Linux Process Issues with Real-World Examples

This article explains what strace is and how it works, then walks through step-by-step examples (fixing a failed service start, tracing nginx, diagnosing a process crash, resolving a shared-memory error, and analyzing performance) to help operations engineers quickly locate and resolve Linux system problems.

Efficient Ops

What is strace?

According to its official description, strace is a Linux user‑space tracer used for diagnosis, debugging, and teaching. It monitors interactions between user‑space processes and the kernel, such as system calls, signals, and process state changes.

strace uses the kernel's ptrace feature under the hood.

In daily operations, fault handling and diagnosis are essential skills. strace, as a dynamic tracing tool, helps operators efficiently locate process and service failures by revealing the “trace” of system calls.

What can strace do?

Example: a package called some_server fails to start.

Startup command

<code>./some_server ../conf/some_server.conf</code>

Output

<code>FATAL: InitLogFile failed iRet: -1!
Init error: -1655</code>

Running strace shows the underlying cause.

<code>strace -tt -f ./some_server ../conf/some_server.conf</code>

In the strace output, the line just before the fatal error shows an open system call:

<code>23:14:24.448034 open("/usr/local/apps/some_server/log//server_agent.log", O_RDWR|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = -1 ENOENT (No such file or directory)</code>

The call fails with ENOENT because the log directory does not exist. When O_CREAT is set, ENOENT means that a directory component of the pathname does not exist or is a dangling symbolic link.

Checking the path:

<code>ls -l /usr/local/apps/some_server/log
ls: cannot access /usr/local/apps/some_server/log: No such file or directory
ls -l /usr/local/apps/some_server
... (bin and conf directories exist)</code>

The missing log subdirectory caused the failure; creating it fixes the problem.
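The failure and its fix can be reproduced safely in a scratch directory. The following is a minimal sketch, assuming a bash-compatible shell; the try_open_log helper and the temporary layout are illustrative and not part of some_server.

```shell
#!/usr/bin/env bash
# Rebuild the broken layout in a scratch directory: bin/ and conf/ exist, log/ does not.
app_root=$(mktemp -d)
mkdir -p "$app_root/bin" "$app_root/conf"

try_open_log() {
    # ">>" opens the file with O_CREAT|O_APPEND: it can create the file,
    # but never the directory that should contain it.
    if ( : >> "$1/log/server_agent.log" ) 2>/dev/null; then
        echo "open ok"
    else
        echo "open failed"
    fi
}

before=$(try_open_log "$app_root")
mkdir -p "$app_root/log"        # the fix: create the missing directory
after=$(try_open_log "$app_root")

echo "before fix: $before"
echo "after fix:  $after"
rm -rf "$app_root"
```

The open only succeeds once the parent directory exists, which is why creating log/ is enough to let the service start.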

strace opens the “black box” of an application and tells you roughly what the process is doing.

How to use strace?

Before using strace, understand system calls.

About system calls

A system call is a request from a user‑space program to the operating system kernel for privileged services.

The kernel runs directly on hardware, providing device management, memory management, scheduling, etc.

User space requests kernel services via APIs, which are the system calls.

On Linux, applications usually invoke system calls through glibc wrapper functions rather than calling the kernel directly.

Linux has over 300 system calls, grouped as:

<code>File and device access: open, close, read, write, chmod, ...
Process management: fork, clone, execve, exit, getpid, ...
Signal handling: signal, sigaction, kill, ...
Memory management: brk, mmap, mlock, ...
IPC: shmget, semget, message queues, ...
Network: socket, connect, sendto, sendmsg, ...
Other</code>
For deeper study, see “Linux System Programming” or “Advanced Programming in the Unix Environment”.

strace has two running modes:

- Prefix a command with strace to start and trace a new process, e.g. strace ls -lh /var/log/messages.
- Attach to an existing process with -p pid, e.g. find the PID with pidof some_server, then run strace -p 17553. Terminate tracing with Ctrl+C.
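When a service runs as several processes, pidof prints several PIDs, and strace accepts the -p option more than once. A small sketch of turning such a PID list into repeated -p options; the pid_args helper is hypothetical and only builds the arguments, it does not run strace.

```shell
#!/usr/bin/env bash
# Turn a list of PIDs (the format pidof prints) into repeated -p options.
pid_args() {
    local pid args=""
    for pid in "$@"; do
        case $pid in
            ''|*[!0-9]*) echo "not a PID: $pid" >&2; return 1 ;;
        esac
        args="$args -p $pid"
    done
    echo "${args# }"    # drop the leading space
}

# Usage (some_server is the article's example name):
#   strace $(pid_args $(pidof some_server))
pid_args 17553 17554
```

Validating each argument first avoids passing an empty or malformed value to strace if pidof finds nothing.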

Common options

Example command:

<code>strace -tt -T -v -f -e trace=file -o /data/log/strace.log -s 1024 -p 23489</code>
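- -tt: prefix each line with a wall-clock timestamp with microsecond resolution
- -T: show the time spent in each system call
- -v: print unabbreviated versions of structures such as environment and stat data
- -f: follow child processes created by the traced process
- -e trace=…: select which calls to trace (e.g., file for calls that take a file name)
- -o: write the output to a file instead of stderr
- -s: set the maximum length of printed strings (the default is 32)
- -p: attach to the process with the given PID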
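Once the trace has been saved with -o, failed calls are usually the first thing to look for, because strace prints them as "= -1" followed by the errno name. The sketch below filters and summarizes them; the failed_calls and errno_summary helpers are hypothetical, and the log lines are made-up samples (the first is adapted from the some_server example).

```shell
#!/usr/bin/env bash
# Made-up sample in the layout produced by -f -tt -T -o (PID, timestamp, call, duration).
log=$(mktemp)
cat > "$log" <<'EOF'
23489 23:14:24.448034 open("/usr/local/apps/some_server/log//server_agent.log", O_RDWR|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = -1 ENOENT (No such file or directory) <0.000021>
23489 23:14:24.448310 open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 3 <0.000010>
23489 23:14:24.448502 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) <0.000009>
23489 23:14:24.448611 open("/tmp/hypothetical.lock", O_RDWR) = -1 EACCES (Permission denied) <0.000011>
EOF

failed_calls() {
    # A failed call ends in "= -1 ERRNO (description)", optionally followed by the -T duration.
    grep -E ' = -1 E[A-Z0-9]+ \(' "$1"
}

errno_summary() {
    # Count failures per errno name, printed as "NAME: count" in alphabetical order.
    failed_calls "$1" |
        sed -E 's/.* = -1 (E[A-Z0-9]+) \(.*/\1/' |
        sort | uniq -c | awk '{ print $2 ": " $1 }'
}

summary=$(errno_summary "$log")
failed_count=$(failed_calls "$log" | wc -l | tr -d ' ')
echo "$summary"
rm -f "$log"
```

Not every failed call is a problem (programs routinely probe for optional files), so the summary is a starting point for reading the log rather than a verdict.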

Example: tracing nginx file accesses:

<code>strace -tt -T -f -e trace=file -o /data/log/strace.log -s 1024 ./nginx</code>

In the output, the first column shows the PID, the timestamp column comes from -tt, and the final column shows the time spent per call thanks to -T. The output is limited to file-related calls because of -e trace=file.
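Because -T appends the duration to every completed call, a saved log can be ranked by time spent. A minimal sketch, assuming the column layout described above; the slowest_calls helper and the sample lines (including all paths and PIDs) are made up for illustration.

```shell
#!/usr/bin/env bash
# Made-up sample lines in the -f -tt -T layout; the last call never completed.
log=$(mktemp)
cat > "$log" <<'EOF'
4021 10:00:01.000100 open("/etc/nginx/nginx.conf", O_RDONLY) = 3 <0.000015>
4021 10:00:01.000400 stat("/var/www/html/index.html", {st_mode=S_IFREG|0644, st_size=612, ...}) = 0 <0.000008>
4022 10:00:01.002000 open("/var/log/nginx/access.log", O_WRONLY|O_CREAT|O_APPEND, 0644) = 4 <0.031200>
4022 10:00:01.040000 unlink("/var/run/nginx.pid.oldbin" <unfinished ...>
EOF

slowest_calls() {
    # $1 = log file, $2 = number of lines to show (default 5)
    awk '{
        d = $NF
        if (d ~ /^<[0-9.]+>$/) {        # -T writes the duration as <seconds>
            gsub(/[<>]/, "", d)
            print d "\t" $0             # prefix the line with its duration for sorting
        }
    }' "$1" | LC_ALL=C sort -rn | head -n "${2:-5}" | cut -f2-
}

slowest=$(slowest_calls "$log" 1)
timed_count=$(slowest_calls "$log" 10 | wc -l | tr -d ' ')
echo "$slowest"
rm -f "$log"
```

Lines without a trailing duration, such as unfinished calls, are skipped rather than guessed at.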

strace troubleshooting cases

1. Locating a process crash

Problem: a long-running script, run.sh, dies after about a minute.

Solution: find its PID (e.g., 24298) and trace it:

<code>strace -o strace.log -tt -p 24298</code>

At the end of strace.log we see:

<code>22:47:42.803937 wait4(-1, <unfinished ...>
22:47:43.228422 +++ killed by SIGKILL +++</code>

The trace shows that the process was terminated by SIGKILL. The signal turned out to come from another watchdog script that killed run.sh by mistake.

When a process exits normally, strace shows an exit_group call followed by +++ exited with X +++, where X is the exit status. For example, a small test program and its trace:

<code>#include <stdio.h>
#include <stdlib.h>
int main(){ exit(1); }</code>
<code>23:07:24.672849 execve("./test_exit", ["./test_exit"], [...]) = 0
23:07:24.674665 arch_prctl(ARCH_SET_FS, 0x7f1c0eca7740) = 0
23:07:24.675108 exit_group(1) = ?
23:07:24.675259 +++ exited with 1 +++</code>
The glibc exit function ultimately invokes the exit_group system call, which terminates all threads of the process.
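The two endings seen so far, killed by a signal and exited with a status, can be told apart from the last +++ line of a saved log. The following sketch uses the two excerpts shown above as input; the trace_ending helper is hypothetical.

```shell
#!/usr/bin/env bash
# The two log endings shown in this section, written to scratch files.
killed_log=$(mktemp)
exited_log=$(mktemp)
cat > "$killed_log" <<'EOF'
22:47:42.803937 wait4(-1, <unfinished ...>
22:47:43.228422 +++ killed by SIGKILL +++
EOF
cat > "$exited_log" <<'EOF'
23:07:24.675108 exit_group(1) = ?
23:07:24.675259 +++ exited with 1 +++
EOF

trace_ending() {
    # Look at the last "+++ ... +++" line, which strace prints when the process ends.
    local last
    last=$(grep -E '\+\+\+ .* \+\+\+$' "$1" | tail -n 1)
    case $last in
        *"+++ killed by "*)
            printf '%s\n' "$last" | sed -E 's/.*killed by ([A-Z0-9]+).*/killed by \1/' ;;
        *"+++ exited with "*)
            printf '%s\n' "$last" | sed -E 's/.*exited with ([0-9]+).*/exited with status \1/' ;;
        *)
            echo "no termination line found" ;;
    esac
}

killed_result=$(trace_ending "$killed_log")
exited_result=$(trace_ending "$exited_log")
echo "$killed_result"
echo "$exited_result"
rm -f "$killed_log" "$exited_log"
```

A "killed by" result says which signal ended the process, not who sent it; finding the sender still takes separate investigation, as in the watchdog case.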

2. Shared‑memory error

Problem: a service fails to start with the error shmget 267264 30097568: Invalid argument. Trace its IPC calls with:

<code>strace -tt -f -e trace=ipc ./a_mon_svr ../conf/a_mon_svr.conf</code>

Output shows:

<code>22:46:36.351798 shmget(0x5feb, 12000, 0666) = 0
22:46:36.351939 shmat(0, 0, 0) = ?
Process 21406 attached
22:46:36.355439 shmget(0x41400, 30097568, 0666) = -1 EINVAL (Invalid argument)</code>

The shmget manual page lists three possible causes of EINVAL; the third applies here: a segment with the same key already exists, but its size does not match the request. Checking with ipcs -m confirms that the existing segment size (30095516) differs from the requested size (30097568), and the request is the larger of the two.

The mismatch was caused by mixing 32-bit and 64-bit binaries; recompiling both as 64-bit resolves it.
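Two small checks tie the error message, the trace, and the ipcs -m output together: the key in the error message (267264) is the decimal form of the 0x41400 that strace prints, and the requested size exceeds the existing segment. A sketch with the numbers from this case; both helper names are hypothetical, and the EINVAL rule in the comment follows the shmget(2) manual page.

```shell
#!/usr/bin/env bash
key_to_decimal() {
    # strace prints the key in hex, the application log prints it in decimal.
    printf '%d\n' "$1"
}

size_check() {
    # $1 = size requested by shmget, $2 = size of the existing segment (from ipcs -m).
    # Per shmget(2), asking for more than an existing segment holds yields EINVAL.
    if [ "$1" -gt "$2" ]; then
        echo "EINVAL expected: request exceeds existing segment by $(( $1 - $2 )) bytes"
    else
        echo "size fits the existing segment"
    fi
}

key_to_decimal 0x41400
size_check 30097568 30095516
```

A difference of a couple of kilobytes between two builds of the same structure is consistent with differing type sizes or alignment, which matches the 32-bit versus 64-bit explanation.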

3. Performance analysis

Two shell scripts count lines of code in the Linux 4.5.4 source tree. Profiling them with strace -c -f shows that the efficient good_script.sh finishes in about 2 seconds, while the naive poor_script.sh takes about 539 seconds and creates over 126,000 processes, versus only 3 for good_script.sh.

This shows that process creation overhead dominates the run time of the slow script; using fewer processes and efficient system-call patterns yields dramatic speedups.
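The two scripts are not reproduced here, so the following is only a guess at the general pattern, not their actual contents: one function starts a process per file, the other lets a handful of processes handle the whole tree. The function names and the tiny scratch tree are assumptions made for illustration.

```shell
#!/usr/bin/env bash
count_lines_per_file() {
    # Starts a new wc process for every file, so process creation dominates on large trees.
    find "$1" -type f -name '*.c' | {
        total=0
        while IFS= read -r f; do
            n=$(wc -l < "$f")
            total=$(( total + n ))
        done
        echo "$total"
    }
}

count_lines_batched() {
    # find passes files to cat in large batches and a single wc counts the combined stream.
    find "$1" -type f -name '*.c' -exec cat {} + | wc -l | tr -d ' '
}

# Scratch tree with three small files (6 lines in total) to compare the two functions.
src=$(mktemp -d)
mkdir -p "$src/sub"
printf 'a\nb\n'    > "$src/a.c"
printf 'a\nb\nc\n' > "$src/b.c"
printf 'a\n'       > "$src/sub/c.c"

per_file=$(count_lines_per_file "$src")
batched=$(count_lines_batched "$src")
echo "per-file: $per_file, batched: $batched"
rm -rf "$src"
```

Both functions return the same total; on a real source tree, wrapping each in a script and running strace -c -f on it would show the difference in the number of process-creation calls.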

Summary

When a process or service behaves abnormally, strace lets you trace its system calls to find the root cause. Familiarity with common system calls and strace options makes debugging and performance tuning more effective. More advanced tools (gdb, perf, SystemTap) complement strace in cases where it produces no useful output.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
