
Can Distributed File Systems Outperform Local NVMe? A Deep Performance Evaluation

This article explains what a distributed file system (DFS) is; outlines key evaluation criteria such as reliability, availability, performance, and scalability; compares HDD and SSD performance; investigates whether a DFS can surpass local NVMe in large-IO workloads; and discusses user-side, cluster-level, and cache-level performance assessment methods.

360 Zhihui Cloud Developer

A distributed file system (DFS) is a file system whose physical storage resources are not directly attached to the local node but are accessed over a network. It presents users with a single hierarchical, tree-structured namespace while providing scalability, high availability, low cost, and elastic capacity.

HDD vs. SSD Performance Comparison

Mechanical hard drives (HDDs) provide large capacity at low cost but limited performance: roughly 75–200 IOPS and 100–200 MB/s of sequential bandwidth, with millisecond-level seek latency. NVMe SSDs attach over PCIe and deliver far more: 3,000–5,000 MB/s of bandwidth on PCIe 3.0/4.0 drives (PCIe 5.0 drives reach 10 GB/s and beyond), 150,000–500,000 IOPS, and latency in the tens of microseconds.
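The gap these figures imply is easiest to see with a little arithmetic. The sketch below uses illustrative mid-range values from the ranges above, not measurements from any specific drive:

```python
def transfer_seconds(total_bytes: int, bandwidth_mb_s: float) -> float:
    """Time to stream total_bytes sequentially at the given bandwidth (MB/s)."""
    return total_bytes / (bandwidth_mb_s * 1024 * 1024)

def random_io_seconds(num_ios: int, iops: float) -> float:
    """Time to complete num_ios random operations at the given IOPS."""
    return num_ios / iops

GIB = 1024 ** 3

# Streaming a 100 GiB dataset:
hdd_seq  = transfer_seconds(100 * GIB, 180)    # HDD  ~180 MB/s  -> ~569 s
nvme_seq = transfer_seconds(100 * GIB, 4000)   # NVMe ~4,000 MB/s -> 25.6 s

# One million 4 KB random reads:
hdd_rand  = random_io_seconds(1_000_000, 150)      # HDD  ~150 IOPS  -> ~6,667 s
nvme_rand = random_io_seconds(1_000_000, 300_000)  # NVMe ~300k IOPS -> ~3.3 s
```

Sequential streaming is roughly a 20x difference; random small IO is closer to 2,000x, which is why the access pattern matters so much in the evaluations below.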

Can Distributed File Systems Exceed NVMe?

DFS design must consider specific scenarios; typical use cases include log storage, AI training, inference, databases, and shared files. Evaluation metrics such as IO size, read/write pattern, file count, and concurrency are used to assess suitability.

In tests with 1 MiB IO blocks, a DFS can surpass a single local NVMe drive in large-file read/write throughput, though small-IO latency remains higher.

From the user perspective, file systems are presented as tree structures, but internal implementations may differ. For example, XFS uses B+ trees for inode management. Local NVMe reads 4 KB data in ~20 µs, while DFS incurs network latency (~100 µs) plus protocol and software overhead, resulting in several hundred microseconds to milliseconds per IO.
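A per-IO latency budget makes the comparison concrete. The ~20 µs and ~100 µs figures come from the text; how the remaining software overhead splits between protocol handling and server-side work is an assumption for the sketch:

```python
# Illustrative per-IO latency budget, in microseconds.
local_nvme_us = 20  # 4 KB read on local NVMe (from the text)

dfs_budget_us = {
    "network round trip":      100,  # from the text
    "protocol/serialization":   50,  # assumed split
    "server-side software":    150,  # assumed split
    "media access":             20,  # same NVMe media underneath
}
dfs_total_us = sum(dfs_budget_us.values())  # 320 µs end to end

# Serial IOPS (queue depth 1) each path can sustain:
local_iops = 1_000_000 / local_nvme_us   # 50,000
dfs_iops   = 1_000_000 / dfs_total_us    # 3,125
```

At queue depth 1, the DFS path completes an order of magnitude fewer IOs per second, even though the media at the end of the path is identical; the extra hops dominate.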

NVMe SSDs connect directly to the CPU via PCIe, achieving microsecond‑level latency, whereas DFS must traverse the network, handle data sharding, replication, and consistency protocols, adding ~100 µs overhead. DFS mitigates latency through concurrency, prefetching, caching, zero‑copy, DirectIO, AIO, and network optimizations.

Overall, single‑client latency favors local NVMe, but DFS can achieve higher aggregate bandwidth and IOPS in large‑IO scenarios by aggregating resources and increasing concurrency.
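Why concurrency closes the gap follows from Little's law: in-flight IOs = throughput x latency. A higher-latency path simply needs proportionally more IOs in flight to reach the same throughput, until some shared resource (typically the NIC) saturates. The latency and NIC figures below are assumptions for illustration:

```python
def required_queue_depth(target_iops: float, latency_us: float) -> float:
    """Little's law: in-flight IOs needed to sustain target_iops at this latency."""
    return target_iops * latency_us / 1_000_000

# Matching 50,000 IOPS:
qd_local = required_queue_depth(50_000, 20)    # 1.0  -- one IO at a time suffices
qd_dfs   = required_queue_depth(50_000, 400)   # 20.0 -- needs 20 concurrent IOs

# Large-IO bandwidth: 64 in-flight 1 MiB reads, 400 us each (assumed)
inflight, io_bytes, lat_s = 64, 1 << 20, 400e-6
theoretical_bytes_s = inflight * io_bytes / lat_s   # ~168 GB/s on paper

# ...but a 100 Gbps client NIC caps what one client can actually pull:
nic_bytes_s = 100e9 / 8                             # 12.5 GB/s
achievable  = min(theoretical_bytes_s, nic_bytes_s) # NIC-bound
```

The queue-depth math is exact; the takeaway is that aggregate DFS bandwidth is usually bounded by the network fabric rather than by per-IO latency, which is why multi-client tests matter.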

User‑Dimension Evaluation

Performance must be measured against real business workloads. In AI training, thousands of GPUs read millions of small files, requiring high metadata throughput and cache efficiency. In AI inference, millisecond‑level response and high‑concurrency reads are critical. Online services often exhibit a write‑once‑many‑read pattern demanding low read latency. Container workloads focus on resource consumption.

Evaluation should consider IO size, concurrency, access pattern, file size, file count, total data volume, and client‑side resource usage (CPU, memory, ports). Tuning parameters such as multi‑level data and metadata caches, readahead, prefetch, write‑back policies, and consistency settings can optimize performance for specific scenarios.

Cluster‑Dimension Performance Evaluation

Cluster‑level goals include measuring maximum IOPS, aggregate bandwidth, and latency at a fixed scale, as well as linear scalability as the cluster grows. Tests involve multi‑client, multi‑thread workloads across various IO sizes (e.g., 4 KB, 1 MiB) and patterns, monitoring hardware (CPU, memory, NIC, disks) and software bottlenecks.

Key metrics include raw read/write bandwidth, client‑side aggregate bandwidth, service‑side aggregate bandwidth, and traffic amplification caused by prefetching. Analysis must identify whether performance limits stem from hardware saturation or software design.
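Traffic amplification is simply the ratio of bytes moved on the service side to bytes the application actually requested; prefetch overshoot and replication both inflate it. The counters below are hypothetical values for illustration:

```python
def amplification(server_bytes: int, client_requested_bytes: int) -> float:
    """Service-side traffic divided by application-requested traffic."""
    return server_bytes / client_requested_bytes

GIB = 1024 ** 3
requested = 10 * GIB   # application read 10 GiB
served    = 14 * GIB   # servers shipped 14 GiB (readahead fetched data never used)

ratio = amplification(served, requested)  # 1.4
```

A ratio near 1.0 means prefetching is well matched to the access pattern; a ratio well above 1.0 means readahead is wasting backend bandwidth, and the "raw" service-side numbers overstate useful throughput by that factor.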

Cache Performance Evaluation

Caching (software, OS page cache, local, distributed) shortens access distance and boosts performance. Evaluation should measure full‑hit, full‑miss, and partial‑hit scenarios, as well as performance decay at various hit rates.
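The "decay at various hit rates" is governed by a simple weighted average: effective latency = h * t_hit + (1 - h) * t_miss. The 5 µs hit and 500 µs miss latencies below are assumed figures, chosen only to show the shape of the curve:

```python
def effective_latency_us(hit_rate: float,
                         t_hit: float = 5.0,
                         t_miss: float = 500.0) -> float:
    """Average access latency for a given cache hit rate (assumed timings)."""
    return hit_rate * t_hit + (1.0 - hit_rate) * t_miss

for h in (1.0, 0.99, 0.9, 0.5, 0.0):
    print(f"hit rate {h:4.0%}: {effective_latency_us(h):6.2f} us")
```

Note how steep the decay is near the top: dropping from a 100% to a 99% hit rate already doubles average latency (5.00 µs to 9.95 µs), because the slow misses dominate the average. This is why full-hit, full-miss, and partial-hit scenarios must all be measured rather than interpolated.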

PoleFS employs separate read and write caches: read cache accelerates data retrieval, while write cache merges/delays writes and can hide slow storage. Tests should observe read cache hit/miss behavior, write cache capacity versus write volume, and backend object‑store performance.

Tags: caching, storage, performance evaluation, distributed file system, NVMe, cluster scaling
Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
