Demystifying Linux I/O: From VFS and Inodes to ZFS and Block Layer
This article explains how Linux handles I/O operations, covering the virtual file system, inode and dentry structures, superblock layout, ZFS features, disk types, the generic block layer, I/O scheduling strategies, and key performance metrics for storage.
File System
What is a file system
File systems are mechanisms that organize and manage files on storage devices; different organization methods produce different file systems such as Ext4, XFS, ZFS, and NFS.
Application developers usually interact only with system calls like open, read, write, and close, without worrying about the underlying file system type, disk interface, or storage medium.
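As a minimal sketch of those four calls, the Python snippet below uses the os module, whose functions are thin wrappers around the same system calls. The helper name roundtrip and the file name demo.txt are made up for illustration.

```python
import os
import tempfile


def roundtrip(payload: bytes) -> bytes:
    """Write payload to a temporary file and read it back via raw file descriptors."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")  # hypothetical file name

        # open(2): create the file and obtain a file descriptor for writing
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, payload)  # write(2)
        finally:
            os.close(fd)  # close(2)

        # Reopen read-only and fetch the bytes back
        fd = os.open(path, os.O_RDONLY)
        try:
            return os.read(fd, len(payload))  # read(2)
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(roundtrip(b"hello, vfs"))
```

The code never mentions Ext4, XFS, or the disk underneath; the same calls work on whatever file system backs the temporary directory.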
How the file system works (VFS)
Linux files
In Linux, everything is a file, including regular files, directories, block devices, sockets, and pipes.
<code>brw-r--r-- 1 root root 1, 2 Apr 25 11:03 bnod // block device file
crw-r--r-- 1 root root 1, 2 Apr 25 11:04 cnod // character device file
drwxr-xr-x 2 user user 6 Apr 25 11:01 dir // directory
-rw-r--r-- 1 user user 0 Apr 25 11:01 file // regular file
prw-r--r-- 1 root root 0 Apr 25 11:04 pipeline // named pipe
srwxr-xr-x 1 root root 0 Apr 25 11:06 socket.sock // socket file
lrwxrwxrwx 1 root root 4 Apr 25 11:04 softlink -> file // symbolic link
-rw-r--r-- 2 user user 0 Apr 25 11:07 hardlink // hard link (also a regular file)</code>
inode (index node): stores metadata such as inode number, size, permissions, timestamps, and data location.
dentry (directory entry): stores the file name, inode pointer, and directory hierarchy.
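The relationship between names and inodes can be observed from user space. The sketch below is an illustration only: the function name link_demo and the file names are assumptions chosen to mirror the listing above. It creates a regular file, a hard link, and a symbolic link, then compares what stat reports for each.

```python
import os
import tempfile


def link_demo() -> dict:
    """Create a file, a hard link and a symbolic link, then compare their inodes."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "file")
        hard = os.path.join(tmp, "hardlink")
        soft = os.path.join(tmp, "softlink")

        open(target, "w").close()   # regular file: one name, one inode
        os.link(target, hard)       # a second name for the same inode
        os.symlink("file", soft)    # a separate inode that stores a path

        st_file = os.stat(target)
        st_hard = os.stat(hard)
        st_soft = os.lstat(soft)    # lstat: inspect the link itself, do not follow it

        # Directory entries are what map names to inode numbers
        names = {entry.name: entry.inode() for entry in os.scandir(tmp)}

        return {
            "same_inode": st_file.st_ino == st_hard.st_ino,
            "link_count": st_file.st_nlink,
            "symlink_has_own_inode": st_soft.st_ino != st_file.st_ino,
            "names": names,
        }


if __name__ == "__main__":
    print(link_demo())
```

The hard link shares the inode of the original file and raises its link count to 2, while the symbolic link gets an inode of its own.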
inode and dentry
An inode records a file's metadata; it is persisted on disk, so inodes themselves consume storage space.
<code>stat file
File: file
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: fe21h/65057d Inode: 32828 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 3041/ user) Gid: ( 3041/ user)
Access: 2021-04-25 11:07:59.603745534 +0800
Modify: 2021-04-25 11:07:59.603745534 +0800
Change: 2021-04-25 11:08:04.739848692 +0800
Birth: -</code>
A dentry keeps the file name, a pointer to the inode, and its relationship to other dentries, which together form the directory tree. Dentries are maintained in memory (the dentry cache).
<code>tree
.
├── dir
│ └── file_in_dir
├── file
└── hardlink</code>
ZFS
ZFS is a widely used file system; many database applications rely on it.
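Typical ZFS hierarchy: physical disks are grouped into virtual devices (vdevs, such as the RAID‑Z groups below), vdevs are combined into a storage pool (zpool), and filesystems are created on top of the pool.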
ZFS operations
Create zpool
<code>root@:~ # zpool create tank raidz /dev/ada1 /dev/ada2 /dev/ada3 raidz /dev/ada4 /dev/ada5 /dev/ada6
root@:~ # zpool list tank
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank    11G   824K  11.0G        -         -     0%     0%  1.00x  ONLINE  -
root@:~ # zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:
        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            ada1      ONLINE       0     0     0
            ada2      ONLINE       0     0     0
            ada3      ONLINE       0     0     0
          raidz1-1    ONLINE       0     0     0
            ada4      ONLINE       0     0     0
            ada5      ONLINE       0     0     0
            ada6      ONLINE       0     0     0</code>
This creates a zpool named tank from two RAID‑Z (RAID5‑like) groups of three disks each.
Create ZFS filesystem
<code>root@:~ # zfs create -o mountpoint=/mnt/srev tank/srev
root@:~ # df -h tank/srev
Filesystem Size Used Avail Capacity Mounted on
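tank/srev 7.1G 117K 7.1G 0% /mnt/srev</code>
This creates the ZFS filesystem and mounts it at /mnt/srev. With no quota set, its size is the pool's usable capacity: 7.1G here, because the 11G reported by zpool list is raw capacity that includes RAID‑Z parity.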
Set ZFS quota
<code>root@:~ # zfs set quota=1G tank/srev
root@:~ # df -h tank/srev
Filesystem Size Used Avail Capacity Mounted on
tank/srev 1.0G 118K 1.0G 0% /mnt/srev</code>
ZFS features
Pool storage: a zpool can be expanded dynamically, and multiple filesystems share the same pool without pre‑allocation.
Transactional filesystem: updates are copy‑on‑write and committed atomically, so a power loss cannot leave a partially written state on disk.
ARC cache: the Adaptive Replacement Cache balances recency (LRU) against frequency (LFU) according to the workload, using four lists (LRU, LFU, LRU ghost, LFU ghost).
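To make the four lists concrete, here is a compact sketch of the textbook ARC algorithm. It is not ZFS's implementation, which is a modified variant; the class name ToyARC and its method names are invented for this example. T1 and T2 play the role of the LRU and LFU lists, and B1 and B2 are their ghost lists, which remember only the keys of recently evicted entries.

```python
from collections import OrderedDict


class ToyARC:
    """Textbook ARC sketch (capacity >= 1). Stores keys only, no values."""

    def __init__(self, capacity: int):
        self.c = capacity
        self.p = 0                # target size of T1, adapted on ghost hits
        self.t1 = OrderedDict()   # cached, seen once recently (LRU side)
        self.t2 = OrderedDict()   # cached, seen at least twice (LFU side)
        self.b1 = OrderedDict()   # ghosts of keys evicted from T1
        self.b2 = OrderedDict()   # ghosts of keys evicted from T2

    def _replace(self, key) -> None:
        """Evict the oldest key of T1 or T2 into the matching ghost list."""
        t1_over_target = len(self.t1) > self.p or (
            key in self.b2 and len(self.t1) == self.p
        )
        if self.t1 and t1_over_target:
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = None
        elif self.t2:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = None

    def access(self, key) -> bool:
        """Record an access and return True on a cache hit."""
        if key in self.t1 or key in self.t2:   # hit: move to the fresh end of T2
            self.t1.pop(key, None)
            self.t2.pop(key, None)
            self.t2[key] = None
            return True
        if key in self.b1:                     # ghost hit: T1 was too small, grow p
            self.p = min(self.c, self.p + max(1, len(self.b2) // len(self.b1)))
            self._replace(key)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:                     # ghost hit: T2 was too small, shrink p
            self.p = max(0, self.p - max(1, len(self.b1) // len(self.b2)))
            self._replace(key)
            del self.b2[key]
            self.t2[key] = None
            return False
        # Complete miss: make room if needed, then insert into T1
        total = len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2)
        if len(self.t1) + len(self.b1) == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(key)
            else:
                self.t1.popitem(last=False)
        elif total >= self.c:
            if total == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(key)
        self.t1[key] = None
        return False

    def cached(self) -> set:
        return set(self.t1) | set(self.t2)
```

A hit in a ghost list is the adaptive part: it signals that the corresponding side of the cache was too small, so the target size p shifts towards it.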
Disk Types
Storage media
HDD (mechanical hard drive)
SSD (solid‑state drive)
Interfaces
IDE
SCSI
SAS
SATA
Linux disk management
Disks appear as block devices with major/minor numbers; e.g., /dev/sda has major number 8, indicating an sd‑type block device.
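The same numbers can be read programmatically. In the sketch below, describe_device is a hypothetical helper; os.major, os.minor, and os.makedev are standard library functions that split or build the combined device number.

```python
import os
import stat


def describe_device(path: str):
    """Return (major, minor) for a block device node, or None for anything else."""
    st = os.stat(path)
    if not stat.S_ISBLK(st.st_mode):
        return None
    return os.major(st.st_rdev), os.minor(st.st_rdev)


if __name__ == "__main__":
    # The kernel packs major and minor into a single device number.
    dev = os.makedev(8, 1)  # the 8, 1 pair used by a first sd partition
    print(os.major(dev), os.minor(dev))
    print(describe_device("."))  # a directory is not a block device
```

Calling describe_device on an actual disk node usually requires that the node exists and is readable by the current user.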
<code>ls -l /dev/sda*
brw-rw---- 1 root disk 8, 0 Apr 25 15:53 /dev/sda
brw-rw---- 1 root disk 8, 1 Apr 25 15:53 /dev/sda1
...</code>
Generic Block Layer
The Generic Block Layer abstracts heterogeneous block devices for the VFS and provides a unified framework for drivers and I/O scheduling.
I/O Scheduling
Classic single‑queue schedulers:
NOOP – simple FIFO with basic request merging.
CFQ – Completely Fair Queueing, gives each process a fair share.
Deadline – prioritises requests that approach their deadline.
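The Deadline idea can be illustrated with a toy model. It is a simplification under stated assumptions, not the kernel's code: the real scheduler keeps separate read and write queues with different expiry times and dispatches in batches. The class ToyDeadline and its time units are invented for this sketch. Requests are normally served in ascending sector order, but once the oldest request has waited past its expiry it is served first.

```python
import bisect
from collections import deque


class ToyDeadline:
    def __init__(self, expire: int):
        self.expire = expire   # how long a request may wait (arbitrary time units)
        self.fifo = deque()    # (deadline, sector) in arrival order
        self.by_sector = []    # pending sectors, kept sorted
        self.head = 0          # sector of the last dispatched request

    def add(self, sector: int, now: int) -> None:
        self.fifo.append((now + self.expire, sector))
        bisect.insort(self.by_sector, sector)

    def dispatch(self, now: int):
        """Return the next sector to serve, or None if nothing is pending."""
        if not self.fifo:
            return None
        deadline, oldest = self.fifo[0]
        if deadline <= now:
            sector = oldest  # the oldest request has expired: serve it first
        else:
            # Otherwise continue in ascending sector order, wrapping at the end
            i = bisect.bisect_left(self.by_sector, self.head)
            sector = self.by_sector[i % len(self.by_sector)]
        self.by_sector.remove(sector)
        j = next(k for k, (_, s) in enumerate(self.fifo) if s == sector)
        del self.fifo[j]
        self.head = sector
        return sector
```

Without the expiry check, a steady stream of requests near the current head position could starve a request for a distant sector indefinitely.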
Multi‑queue (blk‑mq) schedulers:
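BFQ – Budget Fair Queueing, gives each process a budget measured in sectors and shares disk bandwidth among processes in proportion to their weights.
Kyber – maintains separate sync/async queues and throttles the number of outstanding requests to meet target latencies.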
mq‑deadline – multi‑queue version of Deadline.
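On a running system, the schedulers available for a device are listed in /sys/block/&lt;device&gt;/queue/scheduler, with the active one in square brackets. The helper names below are hypothetical, and the device name passed to read_schedulers (for example sda) is an assumption about the machine at hand.

```python
from pathlib import Path


def parse_schedulers(text: str):
    """Split a sysfs scheduler line into (active, available)."""
    names = text.split()
    active = next(
        (n[1:-1] for n in names if n.startswith("[") and n.endswith("]")),
        None,
    )
    available = [n.strip("[]") for n in names]
    return active, available


def read_schedulers(device: str):
    """Read the scheduler line for a block device such as 'sda'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())


if __name__ == "__main__":
    print(parse_schedulers("mq-deadline kyber [bfq] none\n"))
```

Writing one of the listed names back to the same file as root switches the scheduler for that device.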
Performance Metrics
Common I/O performance indicators:
Utilisation (ioutil) – percentage of time the disk spends handling I/O.
IOPS – number of I/O operations per second.
Throughput/Bandwidth – amount of data transferred per second (MB/s or GB/s).
Latency – time from issuing an I/O request to receiving a response.
Saturation – overall busy level of the disk, often inferred from queue length or latency.
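Most of these indicators can be derived from two samples of a device's line in /proc/diskstats, which is roughly what tools such as iostat do. The sketch below assumes the standard field order of that file, with sectors counted in 512-byte units; the function names and the sample numbers are made up for illustration.

```python
def parse_diskstats_line(line: str) -> dict:
    """Pick the counters needed below out of one /proc/diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),
        "sectors_read": int(f[5]),
        "ms_reading": int(f[6]),
        "writes": int(f[7]),
        "sectors_written": int(f[9]),
        "ms_writing": int(f[10]),
        "ms_doing_io": int(f[12]),
        "weighted_ms": int(f[13]),
    }


def io_metrics(before: str, after: str, interval_s: float) -> dict:
    """Derive rates from two samples taken interval_s seconds apart."""
    a, b = parse_diskstats_line(before), parse_diskstats_line(after)
    d = {k: b[k] - a[k] for k in a if k != "name"}
    ios = d["reads"] + d["writes"]
    sectors = d["sectors_read"] + d["sectors_written"]
    return {
        "iops": ios / interval_s,
        "throughput_mb_s": sectors * 512 / interval_s / 1e6,
        "avg_latency_ms": (d["ms_reading"] + d["ms_writing"]) / ios if ios else 0.0,
        "utilisation_pct": 100.0 * d["ms_doing_io"] / (interval_s * 1000),
        "avg_queue_len": d["weighted_ms"] / (interval_s * 1000),
    }


if __name__ == "__main__":
    # Made-up samples for a device called sda, taken one second apart
    t0 = "8 0 sda 1000 0 80000 500 2000 0 160000 1500 0 900 2000"
    t1 = "8 0 sda 1100 0 88000 560 2300 0 184000 1740 0 1400 2300"
    print(io_metrics(t0, t1, interval_s=1.0))
```

The average queue length serves as a simple saturation signal: a value that stays well above what the device can service in parallel suggests requests are waiting rather than being handled.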
Typical monitoring commands:
iostat -d -x – shows per‑device I/O statistics.
pidstat -d – shows I/O of individual processes.
iotop – interactive view of processes sorted by I/O usage.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.