
Master Linux Performance: Tools & Flame Graphs for Fast Issue Diagnosis

This article presents a comprehensive guide to Linux performance analysis, covering CPU, memory, disk I/O, network, system load, flame‑graph techniques, and a real‑world Nginx case study, enabling engineers to quickly locate and resolve bottlenecks.


1. Background

Sometimes we encounter difficult problems that monitoring plugins cannot immediately reveal; deep analysis on the server is required. Accumulated technical experience and broad knowledge are needed to locate issues, and good analysis tools can greatly improve efficiency.

2. Description

This article introduces various problem‑location tools and combines case studies for analysis.

3. Problem‑analysis methodology

Applying the 5W2H method raises several performance‑analysis questions:

What – what does the phenomenon look like

When – when does it happen

Why – why does it happen

Where – where does it happen

How much – how many resources are consumed

How – how to solve it

4. CPU

4.1 Description

For applications we usually focus on kernel CPU scheduler functionality and performance. Thread‑state analysis distinguishes on‑CPU (user and sys time) and off‑CPU (waiting for I/O, lock, paging, etc.).

If most time is spent on‑CPU, CPU profiling quickly explains the cause; if most time is off‑CPU, locating the problem takes longer. Key CPU concepts include:

Processor

Core

Hardware thread

CPU cache

Clock frequency

CPI / IPC

Instruction set

Utilization

User time / kernel time

Scheduler

Run queue

Preemption

Multi‑process

Multi‑thread

Word size

4.2 Analysis tools

uptime, vmstat, mpstat, top, pidstat – show CPU and load usage.

perf – traces function‑level CPU time and can target specific kernel functions.

4.3 Usage

<code># view system CPU usage
top
# view per‑CPU info
mpstat -P ALL 1
# view CPU usage and load average
vmstat 1
# per‑process CPU statistics
pidstat -u 1 -p <pid>
# trace function‑level CPU usage of a process
perf top -p <pid> -e cpu-clock</code>
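Before reaching for perf, the `us`/`sy`/`id`/`wa` columns of vmstat already tell you whether the time is on‑CPU or off‑CPU. A minimal sketch, run here against a canned sample (column positions assume the standard 17‑column `vmstat 1` layout; in practice pipe the live command through the same awk):

```shell
# Hypothetical vmstat samples; live equivalent: vmstat 1 5 | tail -n +3 | awk ...
cat <<'EOF' > vmstat.sample
 4  0      0 812036  86448 612340    0    0     0    12  900 1800 72 18  8  2  0
 5  0      0 811020  86448 612356    0    0     0     0  950 1900 75 20  3  2  0
EOF
# us+sy near 100% -> profile on-CPU (perf, flame graphs); high wa -> look at disk I/O
awk '{ printf "on-cpu=%d%% idle=%d%% iowait=%d%%\n", $13+$14, $15, $16 }' vmstat.sample
```

Here both samples show ~90% on‑CPU time and negligible iowait, so CPU profiling (section 9) is the right next step.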

5. Memory

5.1 Description

Memory problems affect not only performance but also service availability. Key concepts include:

Main memory

Virtual memory

Resident memory

Address space

OOM

Page cache

Page fault

Swapping

Swap space

Allocator libraries (libc, glibc, libmalloc, mtmalloc)

Linux SLUB allocator

5.2 Analysis tools

free, vmstat, top, pidstat, pmap – report memory usage.

valgrind – detects memory leaks.

dtrace – dynamic tracing of kernel functions (requires D language scripts).

5.3 Usage

<code># view system memory usage
free -m
# view virtual memory stats
vmstat 1
# view memory usage
top
# per‑process memory stats
pidstat -p <pid> -r 1
# view process memory map
pmap -d <pid>
# detect memory leaks
valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program</code>
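When reading `free -m`, the pressure signal is a low "available" figure plus swap usage, not a low "free" figure (page cache inflates "used"). A small sketch over a canned `free -m`-style snapshot (the numbers are invented for illustration):

```shell
# Hypothetical free -m output; live equivalent: free -m | awk '/^Mem:/ ...'
cat <<'EOF' > free.sample
              total        used        free      shared  buff/cache   available
Mem:           7821        5120         412         102        2289        2260
Swap:          2047         512        1535
EOF
# "available" already discounts reclaimable page cache, so it is the real headroom
awk '/^Mem:/  { printf "mem used: %d%%, available: %d MiB\n", $3*100/$2, $7 }
     /^Swap:/ { printf "swap used: %d MiB\n", $3 }' free.sample
```

If "available" trends toward zero while `vmstat` shows nonzero `si`/`so`, the box is genuinely short of memory rather than just caching aggressively.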

6. Disk I/O

6.1 Description

Disk is the slowest subsystem and a common performance bottleneck. Understanding file system, VFS, page cache, buffer cache, inode, and I/O scheduling is essential.

6.2 Analysis tools

iotop, iostat – report per‑device and per‑process disk I/O.

pidstat – per‑process I/O statistics.

perf – traces block‑layer events such as block_rq_issue.

6.3 Usage

<code># view I/O by process
iotop
# detailed per‑device I/O stats
iostat -d -x -k 1 10
# per‑process I/O
pidstat -d 1 -p <pid>
# investigate I/O anomalies via block‑layer tracepoints
perf record -e block:block_rq_issue -ag
perf report</code>
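In `iostat -x` output, a `%util` (last column) near 100% means the device is saturated; compare `await` against the device's normal latency before blaming it. A sketch over invented device lines (real `iostat -x` has a header and more columns, but `%util` is likewise the last field):

```shell
# Hypothetical iostat -x device lines; last column is %util
cat <<'EOF' > iostat.sample
sda 0.50 12.30 4.00 96.00 220.00 3520.00 74.88 2.10 21.00 1.20 98.40
sdb 0.00  0.30 0.10  1.00   0.40   12.00 24.80 0.00  0.30 0.20  0.10
EOF
# flag any device whose utilization exceeds 80%
awk '$NF > 80 { print $1, "saturated: util=" $NF "%" }' iostat.sample
```

Here only sda is flagged; the next step would be `pidstat -d` to find which process is issuing the writes.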

7. Network

7.1 Description

Network monitoring is complex due to latency, blocking, collisions, packet loss, and external devices such as routers and switches.

7.2 Analysis tools

netstat, ss – socket and protocol statistics.

sar – TCP, retransmission, and per‑interface throughput statistics.

tcpdump, tcpflow – packet and flow capture.

7.3 Usage

<code># protocol statistics
netstat -s
# UDP connections
netstat -nu
# UDP port usage
netstat -apu
# count connections per TCP state
netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
# TCP connections
ss -t -a
# socket summary
ss -s
# UDP sockets
ss -u -a
# TCP and ETCP (retransmission) stats
sar -n TCP,ETCP 1
# per‑interface network I/O
sar -n DEV 1
# packet capture
tcpdump -i eth1 host 192.168.1.1 and port 80
# flow capture (print to console)
tcpflow -cp host 192.168.1.1</code>
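A quick health check from the `netstat -s` counters is the retransmission ratio: retransmitted segments divided by segments sent. A rising ratio usually means loss or congestion on the path. The counter values below are invented for illustration; in practice extract them from `netstat -s | grep -i segments`:

```shell
# Hypothetical cumulative TCP counters taken from netstat -s
sent=184220; retrans=1250
# anything persistently above ~1% is worth investigating
awk -v s="$sent" -v r="$retrans" 'BEGIN { printf "retransmit ratio: %.2f%%\n", r*100/s }'
```

Since these counters are cumulative since boot, sample them twice and diff to get the current rate, or let `sar -n ETCP 1` do the per‑second math for you.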

8. System Load

8.1 Description

Load measures how much work the system is doing. Load Average is the average number of tasks that are runnable or (on Linux) in uninterruptible sleep, reported over 1, 5, and 15 minutes.
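A load average is only meaningful relative to the number of CPUs: a sustained per‑CPU load above 1.0 means tasks are queueing. A minimal sketch reading the live 1‑minute value on Linux:

```shell
# 1-minute load average from /proc/loadavg, normalized by CPU count
load1=$(awk '{ print $1 }' /proc/loadavg)
ncpu=$(nproc)
awk -v l="$load1" -v n="$ncpu" 'BEGIN { printf "load per CPU: %.2f\n", l/n }'
```

For example, a load of 6.0 on a 4‑CPU box is 1.50 per CPU, i.e. the run queue is half again longer than the machine can drain. Remember that on Linux the figure includes tasks blocked in uninterruptible I/O, so a high load with idle CPUs points at disk, not CPU.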

8.2 Analysis tools

uptime, top, vmstat – show load averages and run‑queue length.

strace – summarizes or traces system‑call latency.

dmesg – shows kernel log messages (OOM kills, hardware errors).

8.3 Usage

<code># view load averages
uptime
top
vmstat
# system‑call latency summary
strace -c -p <pid>
# trace a specific syscall
strace -T -e epoll_wait -p <pid>
# view kernel logs
dmesg</code>

9. Flame Graphs

9.1 Description

Flame Graphs, created by Brendan Gregg, visualize sampled call stacks. The y‑axis shows stack depth; the x‑axis represents the sample population (sorted alphabetically, not by time). A frame's width is proportional to how many samples contained that function, so wider blocks indicate functions that consume more CPU time.
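The intermediate "folded" format that flamegraph.pl consumes makes the width rule concrete: one line per unique stack (frames joined by `;`) followed by its sample count. The stacks below use invented Nginx‑flavored function names purely for illustration:

```shell
# Hypothetical folded stacks in stackcollapse-perf.pl output format
cat <<'EOF' > folded.txt
nginx;main;ngx_process_events;epoll_wait 120
nginx;main;ngx_http_handler;cjson_decode 340
nginx;main;ngx_http_handler;malloc 90
EOF
# the widest frame in the rendered SVG is simply the stack with the most samples
sort -rn -k2,2 folded.txt | head -1
```

Reading the graph then amounts to scanning the top edge for wide plateaus: functions that are themselves on‑CPU for many samples, rather than merely calling something expensive.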

9.2 Install dependencies

<code># install systemtap
yum install systemtap systemtap-runtime
# install kernel debug packages matching the running kernel
uname -r
# then install the corresponding kernel-debuginfo and kernel-devel packages</code>

9.3 Install

<code>git clone https://github.com/lidaohang/quick_location.git
cd quick_location</code>

9.4 On‑CPU flame graph

High CPU usage can be pinpointed to the functions that dominate the on‑CPU flame graph.

<code># on‑CPU, user mode
sh ngx_on_cpu_u.sh <pid>
cd ngx_on_cpu_u
python -m SimpleHTTPServer 8088
# open http://127.0.0.1:8088/<pid>.svg in a browser</code>

9.4.1 on‑CPU

CPU time is split into user and kernel.

9.4.2 off‑CPU

Off‑CPU time represents waiting for I/O, locks, paging, etc.

<code># off‑CPU, user mode
sh ngx_off_cpu_u.sh <pid>
cd ngx_off_cpu_u
python -m SimpleHTTPServer 8088</code>

9.5 Memory‑level flame graph

Useful for locating memory‑leak hotspots.

<code>sh ngx_on_memory.sh <pid>
cd ngx_on_memory
python -m SimpleHTTPServer 8088</code>

9.6 Differential (red‑blue) flame graph

Compares two profiles to highlight performance regressions.

<code># capture before the change (perf record writes perf.data; redirect does not work)
perf record -F 99 -p <pid> -g -o perf.data.old -- sleep 30
# capture after the change
perf record -F 99 -p <pid> -g -o perf.data.new -- sleep 30
# fold the stacks and generate the diff
perf script -i perf.data.old | ./FlameGraph/stackcollapse-perf.pl > folded1
perf script -i perf.data.new | ./FlameGraph/stackcollapse-perf.pl > folded2
./FlameGraph/difffolded.pl folded1 folded2 | ./FlameGraph/flamegraph.pl > diff.svg</code>
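The diff stage works on lines of the form "stack before‑count after‑count"; in the rendered SVG, frames that grew are red and frames that shrank are blue. A sketch of finding the largest regression numerically, using invented stacks and counts in that assumed format:

```shell
# Hypothetical difffolded.pl-style output: stack, before-count, after-count
cat <<'EOF' > diff.folded
nginx;main;ngx_http_handler;cjson_decode 120 340
nginx;main;ngx_process_events;epoll_wait 300 310
EOF
# largest regression = biggest positive (after - before) sample delta
awk '{ d = $3 - $2; if (d > max) { max = d; top = $1 } } END { print top, "+" max }' diff.folded
```

This is handy when the diff SVG is too busy to eyeball: sort the deltas first, then search for the offending frame in the graph.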

10. Case Study – Nginx Cluster Anomaly

10.1 Symptom

On 2017‑09‑25 the Nginx cluster showed many 499/5xx responses and increased CPU usage.

10.2 Nginx metrics analysis

Traffic did not spike; response time increased, likely due to upstream latency.

10.3 System CPU analysis

Top showed high CPU usage by Nginx workers; perf top revealed most time spent in free, malloc, and JSON parsing.

10.4 Flame‑graph analysis

On‑CPU flame graph confirmed heavy JSON parsing cost.

10.5 Summary

Root causes: upstream latency and inefficient JSON parsing in Nginx modules. Disabling the costly module reduced CPU usage and restored normal traffic.

11. References

http://www.brendangregg.com/

http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html

http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html

http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html

https://github.com/openresty/openresty-systemtap-toolkit

https://github.com/brendangregg/FlameGraph

https://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs

Tags: monitoring, system optimization, Linux, performance analysis, CPU profiling, flame graphs
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together.