
Design and Analysis of 3FS: An AI‑Optimized Distributed File System

This article provides a comprehensive overview of 3FS, an AI-focused distributed file system that leverages FoundationDB for metadata, CRAQ for chunk replication, and a hybrid FUSE/native client architecture. It details the system's design, components, fault handling, and performance considerations for large-scale training workloads.

AntData

3FS Design Document

DeepSeek AI released the 3FS project in February 2025, attracting strong industry interest due to its AI‑centric storage capabilities. Ant Group’s Data Intelligence team conducted an early investigation, analyzing design documents and source code to explain how 3FS supports AI workloads such as training data preprocessing, dataset loading, checkpointing, KVCache for inference, and embedding vector search.

System Architecture

Overall Architecture

Core Components

Cluster Manager – manages cluster state and fail-over (uses ZooKeeper or etcd for leader election).

Metadata Service – stores file metadata in an external FoundationDB cluster, relying on its strong consistency.

Storage Service – stores data chunks; replication follows the CRAQ protocol to guarantee high availability.

Client – offers a FUSE client for latency-insensitive workloads and a native client for performance-critical applications.

External Dependencies

ClickHouse – stores generated metrics.

FoundationDB – provides transactional metadata storage.

ZooKeeper/etcd – support the Cluster Manager's multi-replica leader election.

MetaService (Metadata Service)

Design

Similar to ADLS, 3FS stores metadata in FoundationDB using SSI (serializable snapshot isolation) transactions, which provide serializable isolation. All directory-tree operations are executed as transactions, eliminating consistency anomalies such as rename cycles.

Implementation

Metadata operations use FoundationDB transactions:

Read‑only transactions for queries: fstat, lookup, listdir, etc.

Read‑write transactions for updates: create, link, unlink, rename, etc.

Conflicts automatically trigger transaction retries, simplifying error handling.
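The retry-on-conflict pattern can be sketched with a self-contained toy model of FoundationDB's retry loop; the ConflictError class, key strings, and handler functions below are illustrative stand-ins, not the real fdb bindings:

```python
class ConflictError(Exception):
    """Raised when a transaction's read set was modified concurrently."""

def run_transactional(db, txn_fn, max_retries=10):
    # Mirror of FoundationDB's retry loop: on a conflict, the
    # transaction function is simply re-executed against fresh state.
    for _ in range(max_retries):
        try:
            return txn_fn(db)
        except ConflictError:
            continue
    raise RuntimeError("transaction retry budget exhausted")

# Hypothetical metadata store keyed like 3FS: DENT{parent_ino}{name} -> entry
db = {"DENT/1/etc": {"ino": 42}}

def lookup(db):
    # Read-only transaction: resolve a directory entry.
    return db.get("DENT/1/etc")

def create(db):
    # Read-write transaction: insert a new dentry.
    db["DENT/1/home"] = {"ino": 43}
    return db["DENT/1/home"]

run_transactional(db, lookup)   # returns {'ino': 42}
run_transactional(db, create)
```

Because the transaction function is re-runnable from scratch, callers never see partial state and never write bespoke conflict-recovery code.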

Analysis

The approach mirrors systems such as HopsFS, Tectonic, and Baidu CFS, moving complexity to the underlying distributed KV store and enabling stateless metadata services.

Chunk Storage Service

Data is striped similarly to Ceph. When a file is created, a configurable number of chains (CRAQ replication groups) is allocated. Each chunk's ID and location are deterministic: chunk_id = {inode}{chunk_index}, and the chain is selected via a shuffle seed and the chain table stored in MetaService.
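Under those rules, locating a chunk needs no per-chunk lookup table. A minimal sketch, assuming a hypothetical hex ID format and round-robin striping over the seeded shuffle (the real encoding may differ):

```python
import random

def chunk_id(inode: int, chunk_index: int) -> str:
    # chunk_id = {inode}{chunk_index}: any client can derive a chunk's
    # identity from the inode and the offset-derived index alone.
    return f"{inode:016x}-{chunk_index:08x}"

def select_chain(chain_table, stripe_size, shuffle_seed, chunk_index):
    # A chain (CRAQ replication group) is picked deterministically:
    # shuffle the chain table with the per-file seed, keep the first
    # stripe_size chains, and stripe chunks across them round-robin.
    rng = random.Random(shuffle_seed)
    shuffled = list(chain_table)
    rng.shuffle(shuffled)
    selected = shuffled[:stripe_size]
    return selected[chunk_index % stripe_size]

chains = ["chain-A", "chain-B", "chain-C", "chain-D"]
cid = chunk_id(inode=7, chunk_index=3)
chain = select_chain(chains, stripe_size=2, shuffle_seed=7, chunk_index=3)
```

Because the seed lives in the inode, every client computes the same placement without consulting the storage service.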

Data Structures

Metadata

dentry
Key: DENT{parent_ino}{name}
Value: {parent_ino}{ino}{chain_table}{chunk_size}{stripe_size}{owner}{permission}{atime}{mtime}{ctime}
Description: the DENT prefix keeps a directory's entries contiguous in FoundationDB.

inode
Key: INOD{ino}
Value: {file_length}{chunk_size}{selected_range_in_chain_table}{shuffle_seed}{owner}{permission}{atime}{mtime}{ctime}{target_path (for symlinks)}
Description: little-endian encoding of the inode number spreads keys across shards.
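The key construction might look like the following sketch; the struct layout and field widths are assumptions based on the key formats above:

```python
import struct

def dentry_key(parent_ino: int, name: str) -> bytes:
    # DENT{parent_ino}{name}: the shared prefix keeps all entries of a
    # directory adjacent, so listdir becomes a single range scan.
    return b"DENT" + struct.pack("<Q", parent_ino) + name.encode()

def inode_key(ino: int) -> bytes:
    # INOD{ino} with the inode number little-endian encoded: the
    # low-order bytes vary fastest, spreading sequentially allocated
    # inodes across shards instead of clustering them on one.
    return b"INOD" + struct.pack("<Q", ino)
```

Note the contrast: dentry keys deliberately cluster (for cheap directory listing), while inode keys deliberately scatter (for balanced load).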

Dynamic File Attributes

Delayed reclamation of files deleted while write‑opened, using a delay_unlink prefix to track active clients.

Lazy file‑length updates: clients report the maximum write offset every few seconds; MetaService adopts the new length if no concurrent truncate occurs.
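The adopt-if-larger rule can be sketched as below; the truncate_version field is a hypothetical stand-in for however MetaService detects a concurrent truncate:

```python
def adopt_reported_length(inode, reported_max_offset, truncate_version_seen):
    # Clients periodically report the maximum offset they have written.
    # MetaService only adopts a larger length, and only if no truncate
    # ran since the client last observed the file.
    if inode["truncate_version"] != truncate_version_seen:
        return inode["length"]          # concurrent truncate: ignore report
    if reported_max_offset > inode["length"]:
        inode["length"] = reported_max_offset
    return inode["length"]

inode = {"length": 4096, "truncate_version": 1}
adopt_reported_length(inode, 8192, truncate_version_seen=1)   # grows to 8192
adopt_reported_length(inode, 2048, truncate_version_seen=1)   # stays at 8192
```

The length is therefore monotone between truncates, which is why stale or reordered client reports are harmless.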

Fault Handling

Each chain maintains a version number visible to clients. When a node fails or a lease expires, the Cluster Manager increments the chain version; clients with stale versions are rejected, providing a form of fencing.

Chunk and chain versions are managed separately: chunk versions are internal to CRAQ, while chain versions are exposed for client‑side consistency.
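A minimal sketch of the fencing check on the chain version, with invented names:

```python
class StaleChainVersion(Exception):
    pass

def serve_write(request_chain_ver: int, current_chain_ver: int) -> str:
    # Fencing: after a failover the Cluster Manager bumps the chain
    # version; any request carrying the old version is rejected, so a
    # partitioned client cannot write through a superseded chain layout.
    if request_chain_ver != current_chain_ver:
        raise StaleChainVersion(
            f"client saw v{request_chain_ver}, chain is at v{current_chain_ver}")
    return "ack"
```

On rejection, a real client would refetch the chain table and retry, rather than surfacing the error to the application.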

MetaService Lifecycle

MetaService is stateless; if its lease expires or it crashes, the Cluster Manager removes it and clients retry against another MetaService instance, updating topology as needed.

Data Node Recovery

When a data node leaves, its chains are marked as failed, a new target is added, and the system syncs missing chunks before the node rejoins the cluster.
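The chunk-resync step might look like this full-copy sketch; real 3FS recovery is more involved (it can limit transfer to chunks whose versions diverged), so treat this as an illustration only:

```python
def resync_node(chain_chunks: dict, recovering_node: dict) -> list:
    # Before a node rejoins a chain, any chunk the healthy replicas
    # hold but the node lost is copied over; only then is the node
    # marked serving again.
    missing = {cid: data for cid, data in chain_chunks.items()
               if cid not in recovering_node}
    recovering_node.update(missing)
    return sorted(missing)
```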

Fuse and Native Client

3FS uses a user-space libfuse daemon for control-path operations (open, close, stat), while the data path bypasses the kernel: applications place payloads in shared-memory buffers ("iov") and submit requests through shared I/O rings ("ior") for zero-copy transfers. This design avoids FUSE's kernel-crossing overhead and enables high-throughput RDMA communication.
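The submission/completion handoff can be modeled with a toy ring; the real iov/ior structures live in shared memory and carry RDMA descriptors, none of which is reproduced in this sketch:

```python
from collections import deque

class IoRing:
    # Toy model of the shared I/O rings: the application enqueues I/O
    # descriptors referencing pre-registered shared buffers, and the
    # FUSE daemon completes them without ever copying the payload.
    def __init__(self, depth: int):
        self.depth = depth
        self.submissions = deque()
        self.completions = deque()

    def submit(self, buf_id: int, offset: int, length: int) -> bool:
        if len(self.submissions) >= self.depth:
            return False                 # ring full: caller backs off
        self.submissions.append((buf_id, offset, length))
        return True

    def daemon_poll(self):
        # Daemon side: drain submissions, issue the transfer on the
        # shared buffer (elided), and post completions back.
        while self.submissions:
            entry = self.submissions.popleft()
            self.completions.append((entry, "ok"))
```

Only small descriptors cross the ring; the payload stays in the shared buffer, which is the essence of the zero-copy path.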

Training‑Related Workloads

3FS accelerates checkpointing by splitting model and optimizer files into small chunks, asynchronously moving data from GPU to CPU, and overlapping metadata processing with I/O. Reported single‑node throughput reaches 10 GB/s, with checkpoint intervals as low as five minutes.
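The chunk-and-overlap idea can be sketched as follows; the 4 MiB chunk size, worker count, and write_chunk callback are illustrative assumptions, and the GPU-to-CPU copy is elided:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4 << 20   # hypothetical 4 MiB chunk size

def write_checkpoint(host_buffer: bytes, write_chunk) -> int:
    # Checkpoint path sketch: the device-to-host copy (elided) yields a
    # host buffer, which is split into small chunks written in parallel
    # so metadata work on one chunk overlaps with I/O on another.
    chunks = [host_buffer[i:i + CHUNK]
              for i in range(0, len(host_buffer), CHUNK)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(write_chunk, range(len(chunks)), chunks))
    return len(chunks)
```

Small chunks also spread one checkpoint across many chains, so aggregate write bandwidth scales with the cluster rather than with a single node.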

Summary

3FS is an AI‑focused, read‑optimized storage system that co‑designs file system, network, and training framework layers. Its strengths include zero‑copy RDMA data paths, metadata scalability via FoundationDB, and support for concurrent writes and overwrites, making it well‑suited for large‑scale model training.

Tags: cloud-native, distributed file system, metadata service, FoundationDB, AI storage, checkpointing, CRAQ replication, zero-copy RDMA
Written by

AntData

Ant Data leverages Ant Group's leading technological innovation in big data, databases, and multimedia, with years of industry practice. Through long-term technology planning and continuous innovation, we strive to build world-class data technology and products.
