How to Pinpoint Disk I/O Bottlenecks on Linux with iostat and blktrace
This guide walks through a step-by-step, non-disruptive workflow for diagnosing high disk I/O on Linux servers. It uses standard tools (vmstat, iostat, iotop, pidstat, lsof) and the low-level tracer blktrace, then shows how to interpret the data, identify common root causes, apply targeted fixes, and verify improvements with fio benchmarks and continuous monitoring.
Background
High %iowait, load average far above CPU core count, and latency spikes while disk throughput is not saturated indicate a disk‑IO bottleneck. The goal is to locate the exact I/O source—process, file, or device—without restarting services.
Tool preparation
Typical Linux distributions already provide the following utilities:
iostat – reports CPU and block-device statistics (package sysstat)
iotop – real-time per-process I/O usage
blktrace – traces the full path of block I/O requests
pidstat – per-process I/O statistics (package sysstat)
df / du – disk-space analysis (built-in)
lsof – lists open files per process
strace – traces system calls, including I/O
perf – performance profiling
Verify availability:
which iostat iotop blktrace pidstat
iostat -V # show sysstat version
Phase 1 – Determine whether I/O is the problem
1.1 vmstat overview
vmstat 1 10
Key columns:
bi – blocks read from block devices per second
bo – blocks written to block devices per second
wa – % CPU time waiting for I/O
b – processes blocked in uninterruptible sleep
The first output row reports averages since boot, so judge by the rows that follow.
Interpretation thresholds:
wa consistently > 20-30 % signals strong I/O pressure.
b greater than the number of CPU cores indicates many processes blocked on I/O.
If bi and bo are low but wa is high, random I/O latency is likely the culprit.
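The thresholds above can be applied to captured vmstat output automatically. The following is a minimal sketch, not a definitive check: the function name check_vmstat and its arguments are hypothetical, and it assumes the usual vmstat header row that starts with "r b" and contains a "wa" column.

```shell
# check_vmstat CORES [WA_LIMIT]: reads `vmstat 1 N` output on stdin and prints
# one line per sample whose wa or b value exceeds the given limits.
# (Hypothetical helper; WA_LIMIT defaults to 20.)
check_vmstat() {
  awk -v cores="$1" -v wa_limit="${2:-20}" '
    # Header row: remember the position of every named column.
    $1 == "r" && $2 == "b" {
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Data rows start with a number; the dashed banner line does not.
    ("wa" in col) && $1 ~ /^[0-9]+$/ {
      n++
      if (n == 1) next            # first sample is the since-boot average
      b = $(col["b"]); wa = $(col["wa"])
      if (wa + 0 > wa_limit + 0 || b + 0 > cores + 0)
        printf "sample %d: wa=%s%% blocked=%s\n", n, wa, b
    }
  '
}

# Usage: vmstat 1 10 | check_vmstat "$(nproc)" 20
```

Columns are located by header name rather than by fixed position, since some vmstat versions append extra CPU columns after wa.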
1.2 iostat global device statistics
iostat -xzm 2 5
Important fields (rough thresholds):
r/s, w/s – requests per second (business-dependent; SSD vs HDD)
rMB/s, wMB/s – throughput in MiB/s (watch for values approaching the device rating)
await – average I/O wait (ms); > 10 ms for HDD, > 1 ms for SSD is concerning (newer sysstat versions report r_await and w_await separately)
avgqu-sz – average queue length (named aqu-sz in newer sysstat versions); > 2 for HDD indicates serious backlog
%util – device utilization; > 60-70 % means near saturation
Do not rely solely on %util. A 7200 RPM HDD at 40 % util with avgqu‑sz = 8 still suffers high latency.
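The queue-length caveat can be checked mechanically. The sketch below is illustrative only: flag_queued_devices is a hypothetical name, and it assumes `iostat -x` style output whose header row starts with "Device" and includes a queue column (avgqu-sz or aqu-sz) and a %util column.

```shell
# flag_queued_devices [QUEUE_LIMIT]: reads `iostat -x` output on stdin and prints
# every device whose average queue length exceeds the limit, regardless of %util.
# (Hypothetical helper; QUEUE_LIMIT defaults to 2.)
flag_queued_devices() {
  awk -v qlimit="${1:-2}" '
    # Header row: map column names to positions (repeated for each report).
    $1 ~ /^Device/ {
      split("", col)
      for (i = 1; i <= NF; i++) col[$i] = i
      # Accept both the older and the newer name of the queue column.
      q = ("aqu-sz" in col) ? col["aqu-sz"] : col["avgqu-sz"]
      u = col["%util"]
      next
    }
    # Device rows have at least as many fields as the header.
    q && NF >= u && $q + 0 > qlimit + 0 {
      printf "%s queue=%s util=%s%%\n", $1, $q, $u
    }
  '
}

# Usage: LC_ALL=C iostat -dxy 2 5 | flag_queued_devices 2
```

Setting LC_ALL=C keeps the decimal separator a dot, and -y skips the since-boot report so only interval data is evaluated.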
1.3 Identify the busy device
iostat -d -p ALL 2 1 | grep -E "^sd|^nvme|^dm-"
cat /proc/diskstats
If both a whole disk (e.g. /dev/sda) and its partition (e.g. /dev/sda1) show activity, drill down:
iostat -p sda 2 3
Phase 2 – Pinpoint the process
2.1 iotop per‑process I/O
sudo iotop -o -a -d 2
Options:
-o – show only processes doing I/O
-a – display accumulated I/O
-d 2 – refresh every 2 seconds
Typical output shows, for example, mysqld reading at 152 MiB/s and redis‑server writing at 120 MiB/s.
2.2 pidstat per‑PID I/O
# All processes every 2 seconds
pidstat -d 2 5
# Specific PID (e.g., 12433)
pidstat -d -p 12433 1 10
Watch the kB_rd/s and kB_wr/s columns.
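When many processes are active, ranking them by write rate is quicker than reading the table by eye. A minimal sketch, assuming the pidstat run ends with its usual "Average:" summary section (which requires an interval and a count); top_writers is a hypothetical name.

```shell
# top_writers [N]: reads `pidstat -d INTERVAL COUNT` output on stdin and prints
# "kB_wr/s PID Command" for the N processes with the highest average write rate.
# (Hypothetical helper; N defaults to 5.)
top_writers() {
  awk '
    # Header of the summary section: map column names to positions.
    $1 == "Average:" && /kB_wr\/s/ {
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Summary rows, one per process.
    $1 == "Average:" && ("kB_wr/s" in col) {
      printf "%s %s %s\n", $(col["kB_wr/s"]), $(col["PID"]), $(col["Command"])
    }
  ' | sort -rn | head -n "${1:-5}"
}

# Usage: LC_ALL=C pidstat -d 2 5 | top_writers 10
```

Only the summary rows are used because their layout does not depend on the timestamp format of the per-interval rows.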
2.3 Locate hot files with lsof
sudo lsof -p 12433 | grep -E "REG|DIR" | awk '{print $NF}' | sort | uniq -c | sort -rn | head -20
Common culprits for MySQL include the InnoDB data files (location set by innodb_data_file_path), the redo log files (sized by innodb_log_file_size), and large temporary files under tmpdir, which appear when sorts outgrow sort_buffer_size.
Phase 3 – Analyse I/O pattern (sequential vs random)
3.1 iostat extended metrics
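iostat -xzm 2 5
Key indicator: avgrq-sz, the average request size. Older sysstat versions report it in 512-byte sectors; newer ones replace it with rareq-sz and wareq-sz in kB. Values around 2.5 MiB imply large sequential I/O; values between 4 KB and 32 KB suggest small random I/O, typical of database page reads.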
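Where avgrq-sz is printed in 512-byte sectors, a quick conversion makes the figure comparable with the sizes quoted above. This is a sketch under stated assumptions: classify_request_size is a hypothetical name, and treating everything above 32 KiB as "larger" is a simplification of the rough ranges in the text, not a fixed rule.

```shell
# classify_request_size SECTORS: converts a request size given in 512-byte
# sectors to KiB and labels it with the rough 32 KiB cut-off.
# (Hypothetical helper; the cut-off is an assumption for illustration.)
classify_request_size() {
  awk -v s="$1" 'BEGIN {
    kib = s * 512 / 1024
    label = (kib <= 32) ? "small, likely random" : "larger, leaning sequential"
    printf "%.1f KiB (%s)\n", kib, label
  }'
}

# Usage: classify_request_size 16
```

For example, 16 sectors correspond to 8 KiB and 5120 sectors to 2560 KiB (2.5 MiB).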
3.2 blktrace deep trace
Install:
# RHEL/CentOS
sudo yum install blktrace -y
# Debian/Ubuntu
sudo apt install blktrace -y
Start tracing a device (e.g., sda) for a full business cycle:
sudo blktrace -d /dev/sda -o ./blktrace_output
Stop with Ctrl+C or specify a duration:
sudo blktrace -d /dev/sda -w 60 -o ./blktrace_output
Convert binary output to readable form:
sudo blkparse -i ./blktrace_output | head -100
Each line shows the device (major,minor), CPU ID, sequence number, timestamp, PID, action (Q/G/I/D/C), operation type (RWBS), start sector and request size, and process name.
Analysis checklist:
Large Q→D delta → scheduler queue delay (kernel bottleneck).
Large D→C delta → disk’s own service time (hardware bottleneck).
Read Q→C latency > write latency → read‑heavy queueing, often random reads.
Combine with avgqu‑sz to gauge overall backlog.
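The first two checklist items can be approximated directly from blkparse text output. The sketch below is a rough illustration, not a replacement for btt: stage_latency is a hypothetical name, it assumes the default blkparse line layout, and it pairs events by device and start sector, so merged or split requests are not matched.

```shell
# stage_latency: reads default `blkparse` text output on stdin and prints the
# mean Q->D and D->C time in milliseconds. (Hypothetical helper.)
stage_latency() {
  awk '
    # Assumed layout: dev cpu seq time pid action rwbs sector + nblocks [proc]
    $9 == "+" && $6 ~ /^[QDC]$/ {
      key = $1 SUBSEP $8                 # device plus start sector
      if ($6 == "Q") {
        q[key] = $4                      # request queued
      } else if ($6 == "D") {
        d[key] = $4                      # request dispatched to the driver
        if (key in q) { q2d += $4 - q[key]; nq++; delete q[key] }
      } else if (key in d) {             # C: completion of a dispatched request
        d2c += $4 - d[key]; nc++; delete d[key]
      }
    }
    END {
      if (nq) printf "Q2D avg %.3f ms over %d requests\n", q2d / nq * 1000, nq
      if (nc) printf "D2C avg %.3f ms over %d requests\n", d2c / nc * 1000, nc
    }
  '
}

# Usage: sudo blkparse -i ./blktrace_output | stage_latency
```

A Q2D figure that dominates D2C points at queueing in the kernel, while the reverse points at the device itself.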
Generate a concise report with btt, which ships with blktrace:
sudo blkparse -i ./blktrace_output -O -d /tmp/blktrace.bin
btt -i /tmp/blktrace.bin > /tmp/btt_report.txt
cat /tmp/btt_report.txt
Typical btt output shows average latencies for each stage (e.g., Q2G, I2D, D2C).
If I2D > 10 ms, consider switching the scheduler (e.g., to noop on an SSD, called none on multi-queue kernels) or increasing queue_depth. If D2C > 20 ms on HDD, hardware replacement may be required.
Phase 4 – Common root causes and fixes
4.1 Database high‑concurrency random reads
Typical symptom: MySQL InnoDB buffer-pool misses force large numbers of page reads from disk.
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -E "Buffer pool hit rate|Pages read"
Remediation steps (illustrated in a MySQL my.cnf snippet):
Set innodb_buffer_pool_size to 60-80 % of RAM on a dedicated database host.
Enable innodb_buffer_pool_dump_at_shutdown=ON together with innodb_buffer_pool_load_at_startup=ON for faster warm-up.
Lower innodb_max_dirty_pages_pct to flush dirty pages earlier.
If strict durability is not required, set innodb_flush_log_at_trx_commit=2 (risk: up to about 1 s of committed transactions lost on an OS crash or power failure).
innodb_buffer_pool_size = 32G # 60‑80 % of available memory
innodb_buffer_pool_instances = 8
innodb_max_dirty_pages_pct = 75
innodb_flush_log_at_trx_commit = 2
innodb_read_io_threads = 16
innodb_write_io_threads = 16
4.2 Swap thrashing
Typical symptom: Memory exhaustion leads to rapid swap I/O.
vmstat 1 5 # watch si/so columns
for f in /proc/[0-9]*/status; do awk '/^VmSwap:/{s=$2} /^Name:/{n=$2} END{if(s>0)print n,s}' "$f" 2>/dev/null; done | sort -k2 -rn | head -10
Fixes:
Validate application memory usage; fix leaks.
Lower vm.swappiness (e.g., to 10‑30).
Add physical RAM.
Apply cgroup memory limits or reduce max_connections for MySQL, pm.max_children for php‑fpm.
# View current swappiness
cat /proc/sys/vm/swappiness
# Temporary change
sudo sysctl -w vm.swappiness=10
# Permanent change
echo "vm.swappiness = 10" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
4.3 Mismatched I/O scheduler
Check current scheduler and device type:
cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/rotational # 1=HDD, 0=SSD/NVMe
Common schedulers:
mq-deadline – SSD/NVMe; latency-first, low delay
bfq – desktop/multimedia; fair queueing, interactive
cfq – traditional HDD; fair but unstable latency (deprecated, removed in kernel 5.0)
noop – SSD, high-IOPS databases; no reordering, minimal overhead (appears as none on multi-queue kernels)
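On hosts with several disks it helps to see device type and active scheduler side by side before changing anything. A minimal sketch, assuming the usual sysfs layout under /sys/block; scheduler_report is a hypothetical name, and the optional directory argument exists only so the function can be pointed at a copy of that tree.

```shell
# scheduler_report [SYS_BLOCK_DIR]: prints "device rotational active-scheduler"
# for every block device that exposes a scheduler file. (Hypothetical helper.)
scheduler_report() {
  base="${1:-/sys/block}"
  for q in "$base"/*/queue; do
    [ -r "$q/scheduler" ] || continue
    dev=${q%/queue}; dev=${dev##*/}
    rot=$(cat "$q/rotational" 2>/dev/null)
    # The active scheduler is the bracketed entry, e.g. "[mq-deadline] kyber bfq none".
    active=$(sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$q/scheduler")
    # Some kernels print a bare "none" without brackets; report that as none.
    printf '%s %s %s\n' "$dev" "${rot:-?}" "${active:-none}"
  done
}

# Usage: scheduler_report
```

A rotational value of 1 next to none, or 0 next to bfq, is a quick hint that the scheduler may not match the device type.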
Switch temporarily:
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
Persist via udev rules (RHEL/CentOS example):
echo 'ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"' | sudo tee /etc/udev/rules.d/60-ssd-scheduler.rules
4.4 Filesystem fragmentation or bad mount options
Typical symptom: ext4 fragmentation after large file deletions or missing noatime causing extra writes.
# Check mount options
mount | grep sda1
# Check ext4 fragmentation (read-only; results are unreliable on a mounted filesystem)
sudo fsck -n /dev/sda1
# For xfs
sudo xfs_db -r -c frag /dev/sda1
Fixes:
Add noatime,nodiratime to mount options.
For severe ext4 fragmentation, run e4defrag (use with caution in production).
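Before remounting, it is worth confirming which atime policy a mount point already uses, since relatime (the kernel default) already avoids most access-time writes and the gain from noatime may be modest. A minimal sketch, assuming input in /proc/mounts format; atime_mode is a hypothetical name.

```shell
# atime_mode MOUNTPOINT: reads /proc/mounts-style lines on stdin and prints
# noatime, relatime, or strictatime for the given mount point.
# Exits non-zero if the mount point is not listed. (Hypothetical helper.)
atime_mode() {
  awk -v mp="$1" '
    $2 == mp {
      mode = "strictatime"               # neither flag present in the options
      n = split($4, opt, ",")
      for (i = 1; i <= n; i++)
        if (opt[i] == "noatime" || opt[i] == "relatime") mode = opt[i]
      found = 1
    }
    END { if (found) print mode; else exit 1 }
  '
}

# Usage: atime_mode /data < /proc/mounts
```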
# Temporary remount
sudo mount -o remount,noatime,nodiratime /dev/sda1 /data
# Permanent entry in /etc/fstab
# /dev/sda1 /data ext4 defaults,noatime,nodiratime 0 2
# Report fragmentation only (-c); drop -c to defragment online (e4defrag ships with e2fsprogs)
sudo e4defrag -c /data
4.5 Excessive log writes
Typical symptom: Applications writing massive DEBUG logs or mis‑configured logrotate cause synchronous write stalls.
# Find frequently written log files
sudo find /var/log -type f -name "*.log" -exec ls -lth {} + | head -20
# Observe per‑process write I/O
LC_ALL=C pidstat -d 1 5 | awk '$1 != "Average:" && $5+0 > 0'
Mitigations:
Move log directories to a dedicated SSD.
Reduce log level; disable unnecessary DEBUG output.
Enable asynchronous logging if supported.
Configure logrotate to keep files small and compress old logs.
# Example /etc/logrotate.d/myapp
/var/log/myapp/*.log {
    daily
    rotate 14
    missingok
    notifempty
    compress
    delaycompress
    sharedscripts
    postrotate
        kill -USR1 "$(cat /var/run/myapp.pid 2>/dev/null)" 2>/dev/null || true
    endscript
}
Phase 5 – Validation and ongoing monitoring
5.1 Benchmark before/after with fio
# Install fio
sudo yum install fio -y # or apt install fio
# Random read (4 KB, 4 jobs); --direct=1 bypasses the page cache so --iodepth takes effect
# Put the test file on the device under test (/tmp may be tmpfs)
sudo fio --name=randread --filename=/tmp/fio_test --ioengine=libaio --direct=1 \
--rw=randread --bs=4k --numjobs=4 --size=1G --runtime=60 \
--group_reporting --iodepth=32
# Sequential write (1 MiB)
sudo fio --name=seqwrite --filename=/tmp/fio_test --ioengine=libaio \
--rw=write --bs=1M --numjobs=1 --size=2G --runtime=60 \
--group_reporting --fsync=1
Key metrics: IOPS, bandwidth (BW), average latency and 99th-percentile latency.
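Comparing a metric before and after a fix is easier as a relative change. The helper below is a small arithmetic sketch; pct_change is a hypothetical name, and the figures in the usage line are made-up placeholders rather than measured results.

```shell
# pct_change BEFORE AFTER: prints the relative change from BEFORE to AFTER
# as a signed percentage. (Hypothetical helper.)
pct_change() {
  awk -v b="$1" -v a="$2" 'BEGIN {
    if (b + 0 == 0) { print "n/a"; exit 1 }   # avoid division by zero
    printf "%+.1f%%\n", (a - b) / b * 100
  }'
}

# Usage with placeholder IOPS values: pct_change 4200 6300
```

For IOPS and bandwidth a positive change is an improvement; for latency the sign reads the other way round.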
5.2 Continuous monitoring
Collect iostat periodically via cron:
*/5 * * * * /usr/bin/iostat -xzm 2 3 >> /var/log/iostat.log 2>&1
Simple alert script for %util > 80 %:
#!/bin/bash
# Take two reports and keep the second: the first covers the time since boot.
UTIL=$(LC_ALL=C iostat -dx sda 1 2 | awk '$1 == "sda" {u = $NF} END {print u}')
if (( $(echo "${UTIL:-0} > 80" | bc -l) )); then
echo "Disk IO alert: sda util ${UTIL}%" | tee /dev/kmsg
# Insert notification channel (e.g., PagerDuty)
fi
For richer metrics, enable node_exporter and monitor node_disk_io_time_seconds_total and node_disk_read_time_seconds_total in Prometheus/Grafana, setting appropriate thresholds.
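Utilization can also be derived without sysstat from two snapshots of /proc/diskstats, using the "time spent doing I/O" counter that node_disk_io_time_seconds_total appears to be based on. A minimal sketch, assuming the standard diskstats field order (device name in field 3, milliseconds spent doing I/O in field 13); disk_util is a hypothetical name.

```shell
# disk_util DEVICE INTERVAL_MS BEFORE_FILE AFTER_FILE: prints the percentage of
# the interval during which DEVICE had I/O in flight, from two saved copies of
# /proc/diskstats. (Hypothetical helper.)
disk_util() {
  awk -v dev="$1" -v ms="$2" '
    $3 == dev && NR == FNR { before = $13; seen++ }   # first snapshot
    $3 == dev && NR != FNR { after  = $13; seen++ }   # second snapshot
    END {
      if (seen != 2 || ms <= 0) exit 1
      printf "%.1f\n", (after - before) / ms * 100
    }
  ' "$3" "$4"
}

# Usage:
#   cat /proc/diskstats > /tmp/ds.1; sleep 5; cat /proc/diskstats > /tmp/ds.2
#   disk_util sda 5000 /tmp/ds.1 /tmp/ds.2
```

The same caveat as for %util applies: on devices that serve many requests in parallel, a high value does not by itself prove saturation.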
Conclusion
The recommended troubleshooting chain:
vmstat → iostat → iotop → pidstat/lsof → blktrace → identify root cause (DB, swap, scheduler, filesystem, logs) → apply targeted fix → fio validation → monitoring
Key principles: start with high-level metrics before deep tracing, combine tools for cross-validation, avoid unnecessary restarts, and codify findings into repeatable SOPs and alerts.
Ops Community
A leading IT operations community where professionals share and grow together.
