Understanding Linux File I/O: From User Read Calls to Disk Operations
This article explains how a simple read of a single byte in user space triggers a complex Linux I/O stack involving the read system call, VFS, page cache, generic block layer, and I/O scheduler, and clarifies when actual disk I/O occurs and how many bytes are transferred.
At a job interview, a candidate argued that reading a configuration file for each request would cause an extra disk I/O and degrade performance; this sparked a deeper look into how Linux actually handles file reads.
We start with a minimal C program that opens a file and reads one byte:
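#include <fcntl.h>   /* open, O_RDONLY */
#include <unistd.h>  /* read, close */

int main(void)
{
    char c;
    int in;
    in = open("in.txt", O_RDONLY);
    if (in < 0)
        return 1;        /* the file could not be opened */
    read(in, &c, 1);     /* request a single byte */
    close(in);
    return 0;
}
To answer two questions (whether a disk I/O occurs, and how many bytes Linux really reads), we need to examine the Linux I/O stack.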
1. Linux I/O Stack Overview
A simplified diagram of the Linux I/O stack (source: http://www.ilinuxkernel.com/files/Linux.IO.stack_v1.0.pdf) shows the layers involved from the user request down to the hardware.
The stack includes the I/O engine, VFS, page cache, generic block layer, and I/O scheduler.
2. I/O Engine
The read and write functions belong to the synchronous (sync) I/O engine; other engines include mmap, libaio, and posixaio. A sync read ultimately enters the kernel through the read system call, which dispatches into the VFS.
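To make the difference between engines concrete, here is a minimal sketch that fetches the first byte of a file through the sync engine and through mmap. The helper names first_byte_read and first_byte_mmap are hypothetical, chosen only for this illustration; the system calls themselves are standard POSIX.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sync engine: a single read() system call copies the byte to user space.
 * Returns the byte value (0..255), or -1 on error or on an empty file. */
int first_byte_read(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    unsigned char c;
    ssize_t n = read(fd, &c, 1);
    close(fd);
    return n == 1 ? (int)c : -1;
}

/* mmap engine: map the start of the file and load the byte from memory.
 * Returns the byte value (0..255), or -1 on error or on an empty file. */
int first_byte_mmap(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size < 1) {
        close(fd);
        return -1;
    }
    unsigned char *p = mmap(NULL, 1, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);               /* the mapping remains valid after close() */
    if (p == MAP_FAILED)
        return -1;
    int c = p[0];            /* first touch of the mapped page */
    munmap(p, 1);
    return c;
}
```

Both helpers return the same byte for the same file; what differs is how the data reaches user space (an explicit copy for read, a memory mapping for mmap).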
3. VFS (Virtual File System)
VFS abstracts different file systems and provides a uniform API. Its core structures are superblock, inode, file, and dentry. Operations such as mkdir and rename are defined in inode_operations, while read and write are defined in file_operations:
struct inode_operations {
...
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*mkdir) (struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
...
};
struct file_operations {
...
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
...
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
};
4. Page Cache
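The page cache is a pure in-memory cache that speeds up disk access. If the requested data is already cached, no actual disk I/O occurs; otherwise the kernel allocates a new page, adds it to the page cache, and reads the corresponding block from disk into it. On the read() path this is handled directly by the kernel's read code; a page fault is only involved when the file is accessed through an mmap mapping.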
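Whether a given page is currently cached can be probed from user space. The sketch below assumes Linux with glibc and uses mmap plus mincore to ask whether the first page of a file is resident in memory. The function name first_page_cached is hypothetical. Treat the result as a hint only: recent kernels restrict what mincore reports for files the caller cannot write, and the page can be evicted right after the check.

```c
#define _DEFAULT_SOURCE      /* exposes mincore() in glibc */
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if the first page of the file is reported resident in memory,
 * 0 if it is not, and -1 on error or on an empty file. */
int first_page_cached(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size < 1) {
        close(fd);
        return -1;
    }
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    /* Mapping the page does not read it; nothing is touched here. */
    void *p = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;
    unsigned char vec[1];    /* one status byte per page in the range */
    int rc = mincore(p, page, vec);
    munmap(p, page);
    if (rc < 0)
        return -1;
    return vec[0] & 1;       /* the low bit signals residency */
}
```

Calling first_page_cached("in.txt") before and after the one-byte read from the opening example is one way to observe the cache being populated, assuming the file was not already cached.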
5. File System Layer
File systems manage inode and block structures; a typical block size is 4 KB. Example structures for ext4 are shown below:
const struct file_operations ext4_file_operations = {
.read_iter = ext4_file_read_iter,
.write_iter = ext4_file_write_iter,
.mmap = ext4_file_mmap,
.open = ext4_file_open,
...
};
const struct inode_operations ext4_file_inode_operations = {
.setattr = ext4_setattr,
.getattr = ext4_file_getattr,
...
};
6. Generic Block Layer
The generic block layer handles all block‑device I/O requests using the bio structure. A bio represents an I/O operation composed of one or more segments, each segment being a full page or a part of a page.
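The segment idea can be modeled in a few lines of user-space C. This is a simplified stand-in, not the kernel's definition: the kernel describes a segment with struct bio_vec (a page, an offset, and a length), while the struct segment and request_bytes names below are hypothetical, and the 4 KB page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u   /* assumed page size */

/* Hypothetical model of one segment: a page plus a byte range inside it. */
struct segment {
    unsigned long page_index;   /* which page holds the data */
    unsigned int offset;        /* start of the range within the page */
    unsigned int len;           /* number of bytes in the range */
};

/* Sums the bytes covered by all segments of one I/O operation.
 * Returns 0 if any segment would extend past the end of its page. */
size_t request_bytes(const struct segment *segs, size_t count)
{
    assert(segs != NULL || count == 0);
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        if (segs[i].offset >= PAGE_SIZE_BYTES ||
            segs[i].len > PAGE_SIZE_BYTES - segs[i].offset)
            return 0;           /* a segment never spans two pages */
        total += segs[i].len;
    }
    return total;
}
```

An operation made of one full page and one half page covers 4096 + 2048 = 6144 bytes in this model.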
7. I/O Scheduler
After the block layer creates a request, the I/O scheduler orders requests (e.g., noop, deadline, cfq) to maximize throughput, often using an elevator‑like algorithm.
You can view supported schedulers with dmesg | grep -i scheduler.
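The kernel also exposes the schedulers per device in sysfs, in /sys/block/<device>/queue/scheduler, as a single line in which the active scheduler is wrapped in square brackets, for example "noop deadline [cfq]". The sketch below parses such a line; the function name active_scheduler is hypothetical, and reading the file itself is left to the caller because the device name depends on the machine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the bracketed (active) scheduler name from a line such as
 * "noop deadline [cfq]" into out.
 * Returns 0 on success, -1 if no bracketed name is found or it does not fit. */
int active_scheduler(const char *line, char *out, size_t out_size)
{
    assert(line != NULL && out != NULL);
    const char *lb = strchr(line, '[');
    if (lb == NULL)
        return -1;
    const char *rb = strchr(lb, ']');
    if (rb == NULL)
        return -1;
    size_t len = (size_t)(rb - lb - 1);   /* characters between the brackets */
    if (len == 0 || len >= out_size)
        return -1;
    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}
```

A caller would typically read the line with fgets from the sysfs file of the device of interest (sda is a common name, but that is an assumption) and pass it to this function.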
8. Full Read Flow
The library read function enters the sys_read system call.
sys_read calls VFS functions like vfs_read and generic_file_read.
If the page cache hits, data is returned immediately.
If not, the kernel allocates a new page, adds it to the page cache, and sends a block I/O request to the generic block layer, while the calling process sleeps until the data arrives.
The block layer queues the request as a bio.
The I/O scheduler orders the request.
The driver issues a DMA read to the disk, filling the new page in the cache.
An interrupt notifies completion, and the byte is copied to user space.
The process is awakened.
When the page cache hits, no disk I/O occurs. When it misses, the smallest unit transferred is a sector (typically 512 bytes). Higher layers work with larger units: the block layer with segments (often a full 4 KB page), the page cache with pages (4 KB), and the file system with blocks (commonly 4 KB). Consequently, reading a single byte can cause the kernel to read several kilobytes from disk.
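The unit arithmetic can be written out directly. The sketch below assumes 512-byte sectors and 4 KB pages, as in the text, and computes which page a byte offset falls in and how many sectors one page fill covers. The names read_span and span_for_offset are hypothetical, and first_sector is counted from the start of the file under the simplifying assumption of a contiguous layout, so it is not a physical disk address.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u    /* assumed sector size */
#define PAGE_BYTES  4096u   /* assumed page size */

/* What a one-byte read at a given file offset pulls in on a cache miss,
 * ignoring readahead. */
struct read_span {
    uint64_t page_index;    /* page of the file that holds the byte */
    uint64_t first_sector;  /* first sector of that page, file-relative */
    uint32_t sector_count;  /* sectors needed to fill the page */
};

struct read_span span_for_offset(uint64_t offset)
{
    assert(PAGE_BYTES % SECTOR_SIZE == 0);
    struct read_span s;
    s.sector_count = PAGE_BYTES / SECTOR_SIZE;        /* 4096 / 512 = 8 */
    s.page_index = offset / PAGE_BYTES;
    s.first_sector = s.page_index * s.sector_count;
    return s;
}
```

For the byte at offset 5000, this gives page 1 and sectors 8 through 15, so a one-byte request turns into at least 4096 bytes of disk transfer on a miss. Kernel readahead may fetch additional pages beyond that.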
Additional caches (disk internal cache, RAID controller cache) may further hide physical disk activity, so a miss in the page cache does not always mean the spindle spins.
Understanding these mechanisms helps developers reason about performance and diagnose latency issues in production systems.
Refining Core Development Skills
Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.