
How to Pinpoint Disk I/O Bottlenecks on Linux with iostat and blktrace

This guide walks you through a step‑by‑step, non‑disruptive workflow for diagnosing high disk I/O on Linux servers using built‑in tools such as vmstat, iostat, iotop, pidstat, lsof and the low‑level tracer blktrace, then shows how to interpret the data, identify common root causes, apply targeted fixes, and verify improvements with fio benchmarks and continuous monitoring.

Ops Community

Background

Three symptoms together point to a disk I/O bottleneck: high %iowait, a load average far above the CPU core count, and latency spikes even though disk throughput is not saturated. The goal is to locate the exact I/O source (process, file, or device) without restarting services.

Tool preparation

Typical Linux distributions already provide the following utilities:

iostat – reports CPU and block-device statistics (package sysstat)

iotop – real-time per-process I/O usage

blktrace – traces the full path of block I/O requests

pidstat – per-process I/O statistics (package sysstat)

df / du – disk-space analysis (built-in)

lsof – lists open files per process

strace – traces system calls, including I/O

perf – performance profiling

Verify availability:

which iostat iotop blktrace pidstat
iostat -V   # show sysstat version

Phase 1 – Determine whether I/O is the problem

1.1 vmstat overview

vmstat 1 10

Key columns:

bi – blocks read per second

bo – blocks written per second

wa – % CPU time waiting for I/O

b – processes blocked in uninterruptible sleep

Interpretation thresholds:

wa consistently > 20-30 % signals strong I/O pressure.

b greater than the number of CPU cores indicates many processes blocked on I/O.

If bi and bo are low but wa is high, random I/O latency is likely the culprit.
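The thresholds above can be checked mechanically. The following is a minimal sketch, assuming vmstat output arrives on stdin; the helper name flag_io_pressure and its default wa limit of 20 % are illustrative choices, not part of vmstat. Keep in mind that the first sample vmstat prints is an average since boot.

```shell
# Sketch: flag vmstat samples where wa or b crosses the thresholds above.
# flag_io_pressure is a hypothetical helper: $1 = CPU cores, $2 = wa limit in % (default 20).
flag_io_pressure() {
  awk -v cores="$1" -v wa_limit="${2:-20}" '
    # Header row: locate the "b" and "wa" columns by name.
    $1 == "r" && $2 == "b" {
      for (i = 1; i <= NF; i++) {
        if ($i == "b")  bcol = i
        if ($i == "wa") wcol = i
      }
      next
    }
    # Data rows start with a number; the "procs ---memory---" banner does not.
    bcol && $1 ~ /^[0-9]+$/ {
      n++
      msg = ""
      if ($wcol + 0 > wa_limit) msg = msg " high-iowait(" $wcol "%)"
      if ($bcol + 0 > cores)    msg = msg " blocked(" $bcol ">" cores ")"
      if (msg != "") print "sample " n ":" msg
    }
  '
}
```

A typical call would be vmstat 1 10 | flag_io_pressure "$(nproc)". Columns are found by header name, so extra columns in newer vmstat builds should not shift the result.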

1.2 iostat global device statistics

iostat -xzm 2 5

Important fields (rough thresholds):

r/s, w/s – requests per second (business-dependent; SSD vs HDD)

rMB/s, wMB/s – throughput in MiB/s (watch for approaching device rating)

await – average I/O wait (ms); > 10 ms for HDD, > 1 ms for SSD is concerning (recent sysstat releases report r_await and w_await separately)

avgqu-sz – average queue length; > 2 for HDD indicates serious backlog (named aqu-sz in recent sysstat releases)

%util – device utilization; > 60-70 % means near saturation

Do not rely solely on %util. A 7200 RPM HDD at 40 % util with avgqu‑sz = 8 still suffers high latency.
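To act on that warning, latency and queue length can be checked alongside %util. The sketch below assumes iostat -x output on stdin and finds columns by header name, so it accepts both the older avgqu-sz/await layout and the newer aqu-sz/r_await/w_await layout. The helper name flag_slow_devices and its defaults (10 ms, queue length 2) are assumptions taken from the HDD thresholds above.

```shell
# Sketch: list devices whose latency or queue length is high, whatever %util says.
# flag_slow_devices is a hypothetical helper: $1 = await limit in ms (default 10),
# $2 = queue-length limit (default 2).
flag_slow_devices() {
  awk -v await_ms="${1:-10}" -v queue_len="${2:-2}" '
    # Header row: map column names to positions for this report.
    $1 == "Device" || $1 == "Device:" {
      split("", col)
      for (i = 1; i <= NF; i++) col[$i] = i
      q = ("aqu-sz" in col) ? col["aqu-sz"] : col["avgqu-sz"]
      u = col["%util"]
      next
    }
    NF == 0 { q = 0; next }   # a blank line ends the device table
    q {
      if ("await" in col) a = $(col["await"])
      else a = ($(col["r_await"]) + 0 > $(col["w_await"]) + 0) ? $(col["r_await"]) : $(col["w_await"])
      if (a + 0 > await_ms || $q + 0 > queue_len)
        printf "%s await=%sms queue=%s util=%s%%\n", $1, a, $q, $u
    }
  '
}
```

For example, iostat -dxy 2 1 | flag_slow_devices 10 2 would print only the devices that exceed either limit. When await is split into read and write values, the sketch reports the larger of the two.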

1.3 Identify the busy device

# -y skips the since-boot averages, so the single report covers the 2-second interval
iostat -d -y -p ALL 2 1 | grep -E "^sd|^nvme|^dm-"
cat /proc/diskstats

If both a whole disk (e.g. /dev/sda) and its partition (e.g. /dev/sda1) show activity, drill down:

iostat -p sda 2 3
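The raw counters in /proc/diskstats can also be turned into a utilization figure without iostat. The sketch below assumes two snapshots saved to files a known number of seconds apart; field 13 of each line is the cumulative time in milliseconds that the device spent doing I/O. The helper name disk_util and the snapshot paths in the usage note are hypothetical.

```shell
# Sketch: %util per device between two /proc/diskstats snapshots.
# disk_util is a hypothetical helper: $1 = earlier snapshot, $2 = later snapshot,
# $3 = whole seconds between them.
disk_util() {
  awk -v ms="$(( $3 * 1000 ))" '
    NR == FNR { busy[$3] = $13; next }   # first file: remember the old counter
    ($3 in busy) {
      pct = ($13 - busy[$3]) * 100 / ms  # field 13 = cumulative ms spent doing I/O
      if (pct > 0) printf "%s %.1f\n", $3, pct
    }
  ' "$1" "$2"
}
```

Usage could look like: cat /proc/diskstats > /tmp/ds1; sleep 5; cat /proc/diskstats > /tmp/ds2; disk_util /tmp/ds1 /tmp/ds2 5. Devices with no busy time in the interval are omitted.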

Phase 2 – Pinpoint the process

2.1 iotop per‑process I/O

sudo iotop -o -a -d 2

Options:

-o – show only processes doing I/O

-a – display accumulated I/O

-d 2 – refresh every 2 seconds

Typical output shows, for example, mysqld reading at 152 MiB/s and redis‑server writing at 120 MiB/s.

2.2 pidstat per‑PID I/O

# All processes every 2 seconds
pidstat -d 2 5
# Specific PID (e.g., 12433)
pidstat -d -p 12433 1 10

Watch the kB_rd/s and kB_wr/s columns.

2.3 Locate hot files with lsof

sudo lsof -p 12433 | grep -E "REG|DIR" | awk '{print $NF}' | sort | uniq -c | sort -rn | head -20

Common culprits for MySQL include the InnoDB data files (located via innodb_data_file_path), the redo logs (sized by innodb_log_file_size), and large temporary files written under tmpdir when sorts outgrow sort_buffer_size.

Phase 3 – Analyse I/O pattern (sequential vs random)

3.1 iostat extended metrics

iostat -xzm 2 5

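Key indicator: avgrq-sz, the average request size. iostat reports it in 512-byte sectors, so a value of 8 means 4 KiB; newer sysstat releases replace it with rareq-sz and wareq-sz, reported in kilobytes. Large averages (on the order of megabytes, e.g. around 2.5 MiB) imply large sequential I/O; averages between 4 KB and 32 KB suggest small random I/O typical of database page reads.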
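Because avgrq-sz is counted in 512-byte sectors, converting it before comparing against these ranges avoids misreading the number. The following is a small sketch; the helper name rq_size_hint and the 256 KiB cutoff for "large" are assumptions for illustration, while the 32 KiB cutoff comes from the range above.

```shell
# Sketch: convert an avgrq-sz value (512-byte sectors) to KiB and label the pattern.
# rq_size_hint is a hypothetical helper; the 256 KiB "large" cutoff is an assumed value.
rq_size_hint() {
  awk -v sectors="$1" 'BEGIN {
    kib = sectors * 512 / 1024
    if (kib <= 32)       kind = "small, likely random"
    else if (kib >= 256) kind = "large, likely sequential"
    else                 kind = "mixed"
    printf "%.1f KiB: %s\n", kib, kind
  }'
}
```

For instance, rq_size_hint 8 describes a 4 KiB average request, which is the size of a single page read on many systems.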

3.2 blktrace deep trace

Install:

# RHEL/CentOS
sudo yum install blktrace -y
# Debian/Ubuntu
sudo apt install blktrace -y

Start tracing a device (e.g., sda) for a full business cycle:

sudo blktrace -d /dev/sda -o ./blktrace_output

Stop with Ctrl+C or specify a duration:

sudo blktrace -d /dev/sda -w 60 -o ./blktrace_output

Convert binary output to readable form:

sudo blkparse -i ./blktrace_output | head -100

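In the default blkparse output, each line carries: device (major,minor), CPU ID, sequence number, timestamp in seconds, PID, operation stage (Q/G/I/D/C), RWBS flags (R = read, W = write), start sector + request size in sectors, and the process name in brackets.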

Analysis checklist:

Large Q→D delta → scheduler queue delay (kernel bottleneck).

Large D→C delta → disk’s own service time (hardware bottleneck).

Read Q→C latency > write latency → read‑heavy queueing, often random reads.

Combine with avgqu‑sz to gauge overall backlog.
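The first two checklist items can be approximated directly from blkparse text output. The sketch below assumes the default blkparse line layout and a trace of a single device; it matches Q, D and C events by start sector and ignores merged or split requests, so treat the numbers as a rough cross-check rather than a replacement for btt. The helper name stage_latency is hypothetical.

```shell
# Sketch: average Q->D and D->C time from blkparse text output on stdin.
# stage_latency is a hypothetical helper; merges and splits are ignored.
stage_latency() {
  awk '
    # Default blkparse layout: dev cpu seq time pid action rwbs sector + size [proc]
    $6 == "Q" { q[$8] = $4 }
    $6 == "D" && ($8 in q) { d[$8] = $4; q2d += $4 - q[$8]; nq++ }
    $6 == "C" && ($8 in d) { d2c += $4 - d[$8]; nc++; delete q[$8]; delete d[$8] }
    END {
      if (nq) printf "avg Q2D %.3f ms over %d requests\n", q2d / nq * 1000, nq
      if (nc) printf "avg D2C %.3f ms over %d requests\n", d2c / nc * 1000, nc
    }
  '
}
```

It could be fed with sudo blkparse -i ./blktrace_output | stage_latency. A Q2D average much larger than D2C points at queueing in the kernel; the reverse points at the device.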

Generate a concise report with the bundled btt tool:

# Merge the per-CPU traces into one binary file for btt (-O suppresses the text output)
sudo blkparse -i ./blktrace_output -d /tmp/blktrace.bin -O
btt -i /tmp/blktrace.bin > /tmp/btt_report.txt
cat /tmp/btt_report.txt

Typical btt output shows average latencies for each stage (e.g., Q2G, I2D, D2C).

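If I2D > 10 ms, consider switching the scheduler (e.g., to none for SSDs on multi-queue kernels, or noop on older single-queue kernels) or increasing queue_depth. If D2C > 20 ms on HDD, the drive itself is the bottleneck and hardware replacement may be required.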

Phase 4 – Common root causes and fixes

4.1 Database high‑concurrency random reads

Typical symptom: MySQL InnoDB buffer-pool misses force large numbers of pages to be read from disk.

mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -E "Buffer pool hit rate|Page reads"
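The hit rate can also be derived from two status counters: Innodb_buffer_pool_reads (logical reads that had to go to disk) and Innodb_buffer_pool_read_requests (all logical read requests). The sketch below only does the arithmetic; the helper name bp_hit_rate is hypothetical, and fetching the two counters from SHOW GLOBAL STATUS is left to the caller.

```shell
# Sketch: buffer-pool hit rate in percent from two InnoDB status counters.
# bp_hit_rate is a hypothetical helper: $1 = Innodb_buffer_pool_reads,
# $2 = Innodb_buffer_pool_read_requests.
bp_hit_rate() {
  awk -v disk_reads="$1" -v requests="$2" 'BEGIN {
    if (requests == 0) { print "n/a"; exit }
    printf "%.2f\n", (1 - disk_reads / requests) * 100
  }'
}
```

Both counters are cumulative since server start, so sampling them twice and passing the differences gives the hit rate for the current workload rather than the lifetime average.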

Remediation steps (illustrated in a MySQL my.cnf snippet):

Set innodb_buffer_pool_size to 60‑80 % of RAM.

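Enable innodb_buffer_pool_dump_at_shutdown=ON together with innodb_buffer_pool_load_at_startup=ON for faster warm-up after a restart.

Lower innodb_max_dirty_pages_pct to flush dirty pages earlier.

If strict durability is not required, set innodb_flush_log_at_trx_commit=2 (risk: up to about 1 s of committed transactions can be lost on an OS crash or power failure).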

innodb_buffer_pool_size = 32G   # 60‑80 % of available memory
innodb_buffer_pool_instances = 8
innodb_max_dirty_pages_pct = 75
innodb_flush_log_at_trx_commit = 2
innodb_read_io_threads = 16
innodb_write_io_threads = 16

4.2 Swap thrashing

Typical symptom: Memory exhaustion leads to rapid swap I/O.

vmstat 1 5   # watch si/so columns
# Top 10 processes by swap usage (kB)
for f in /proc/[0-9]*/status; do awk '/^Name:/{n=$2} /^VmSwap:/{s=$2} END{if(s>0)print n,s}' "$f" 2>/dev/null; done | sort -k2 -rn | head -10

Fixes:

Validate application memory usage; fix leaks.

Lower vm.swappiness (e.g., to 10‑30).

Add physical RAM.

Apply cgroup memory limits or reduce max_connections for MySQL, pm.max_children for php‑fpm.

# View current swappiness
cat /proc/sys/vm/swappiness
# Temporary change
sudo sysctl -w vm.swappiness=10
# Permanent change
echo "vm.swappiness = 10" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

4.3 Mismatched I/O scheduler

Check current scheduler and device type:

cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/rotational   # 1=HDD, 0=SSD/NVMe
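On hosts with many disks, the same two files can be read for every device in one pass. The sketch below relies on the kernel marking the active scheduler with square brackets in queue/scheduler; the helper names active_scheduler and list_schedulers are hypothetical.

```shell
# Sketch: print device name, rotational flag and active scheduler for all block devices.
# active_scheduler and list_schedulers are hypothetical helper names.
active_scheduler() {
  # The active entry is wrapped in square brackets,
  # e.g. "none [mq-deadline] kyber bfq" -> mq-deadline
  sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

list_schedulers() {
  for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s rotational=%s scheduler=%s\n' "$dev" \
      "$(cat "/sys/block/$dev/queue/rotational")" "$(active_scheduler < "$f")"
  done
}
```

Calling list_schedulers prints one line per device. An empty scheduler field means the file had no bracketed entry, which some virtual devices show.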

Common schedulers:

mq-deadline – SSD/NVMe; latency-first, low delay

bfq – desktop/multimedia; fair queuing, good interactivity

cfq – traditional HDD; fair but unstable latency (deprecated, removed in Linux 5.0)

noop – SSD, high-IOPS databases; no reordering, minimal overhead (called none on multi-queue kernels)

Switch temporarily:

echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

Persist via udev rules (RHEL/CentOS example):

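echo 'ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"' | sudo tee /etc/udev/rules.d/60-ssd-scheduler.rules
sudo udevadm control --reload-rules && sudo udevadm trigger --subsystem-match=block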

4.4 Filesystem fragmentation or bad mount options

Typical symptom: ext4 fragmentation after large file deletions or missing noatime causing extra writes.

# Check mount options
mount | grep sda1
# Check ext4 fragmentation (read-only; on a mounted filesystem the result is only approximate)
sudo fsck -n /dev/sda1
# For xfs (read-only fragmentation report)
sudo xfs_db -r -c frag /dev/sda1

Fixes:

Add noatime,nodiratime to mount options (noatime already implies nodiratime, so the second option is redundant but harmless).

For severe ext4 fragmentation, run e4defrag (use with caution in production).

# Temporary remount
sudo mount -o remount,noatime,nodiratime /dev/sda1 /data
# Permanent entry in /etc/fstab
# /dev/sda1 /data ext4 defaults,noatime,nodiratime 0 2
# Report the fragmentation score only (-c); drop -c to actually defragment online (e4defrag ships with e2fsprogs)
sudo e4defrag -c /data

4.5 Excessive log writes

Typical symptom: Applications writing massive DEBUG logs or mis‑configured logrotate cause synchronous write stalls.

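# List the most recently modified log files
sudo find /var/log -type f -name "*.log" -exec ls -lth {} + | head -20
# Show only processes with a non-zero write rate (the kB_wr/s column is located by name)
pidstat -d 1 5 | grep -v "^Average" | awk '/kB_wr\/s/ {for (i = 1; i <= NF; i++) if ($i == "kB_wr/s") c = i; next} c && $c > 0'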

Mitigations:

Move log directories to a dedicated SSD.

Reduce log level; disable unnecessary DEBUG output.

Enable asynchronous logging if supported.

Configure logrotate to keep files small and compress old logs.

# Example /etc/logrotate.d/myapp
/var/log/myapp/*.log {
    daily
    rotate 14
    missingok
    notifempty
    compress
    delaycompress
    sharedscripts
    postrotate
        kill -USR1 $(cat /var/run/myapp.pid 2>/dev/null)
    endscript
}

Phase 5 – Validation and ongoing monitoring

5.1 Benchmark before/after with fio

# Install fio
sudo yum install fio -y   # or apt install fio
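# Random read (4 KB, 4 jobs); --direct=1 bypasses the page cache so the device itself is measured
# Point --filename at the filesystem under test (/tmp may be RAM-backed tmpfs); /data is an example path
sudo fio --name=randread --filename=/data/fio_test --ioengine=libaio --direct=1 \
  --rw=randread --bs=4k --numjobs=4 --size=1G --runtime=60 \
  --group_reporting --iodepth=32
# Sequential write (1 MiB), fsync after every write
sudo fio --name=seqwrite --filename=/data/fio_test --ioengine=libaio \
  --rw=write --bs=1M --numjobs=1 --size=2G --runtime=60 \
  --group_reporting --fsync=1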

Key metrics: IOPS, bandwidth (BW), average latency and 99th‑percentile latency.
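When comparing a run taken before a fix with one taken after, a relative change is easier to read than two raw numbers. The following is a minimal sketch; the helper name pct_change is hypothetical, and the values are whatever metric you copy from the two fio reports (IOPS, bandwidth, or a latency figure).

```shell
# Sketch: relative change between a "before" and an "after" benchmark value.
# pct_change is a hypothetical helper: $1 = before, $2 = after.
pct_change() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before == 0) { print "n/a"; exit }
    printf "%+.1f%%\n", (after - before) / before * 100
  }'
}
```

A positive result is an improvement for IOPS and bandwidth, while for latency a negative result is the improvement.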

5.2 Continuous monitoring

Collect iostat periodically via cron:

*/5 * * * * /usr/bin/iostat -xzm 2 3 >> /var/log/iostat.log 2>&1

Simple alert script for %util > 80 %:

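#!/bin/bash
# %util is the last column of iostat -x; -y skips the since-boot report
DEV=sda
UTIL=$(LC_ALL=C iostat -dxy "$DEV" 5 1 | awk -v d="$DEV" '$1 == d {print $NF}')
if [ -n "$UTIL" ] && (( $(echo "$UTIL > 80" | bc -l) )); then
  echo "Disk IO alert: ${DEV} util ${UTIL}%" | tee /dev/kmsg   # writing to /dev/kmsg needs root
  # Insert notification channel (e.g., PagerDuty)
fi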

For richer metrics, enable node_exporter and monitor node_disk_io_time_seconds_total, node_disk_read_time_seconds_total in Prometheus/Grafana, setting appropriate thresholds.

Conclusion

The recommended troubleshooting chain:

vmstat → iostat → iotop → pidstat/lsof → blktrace → identify root cause (DB, swap, scheduler, filesystem, logs) → apply targeted fix → fio validation → monitoring

Key principles: start with high‑level metrics before deep tracing, combine tools for cross‑validation, avoid unnecessary restarts, and codify findings into repeatable SOPs and alerts.
