
How 3FS Revolutionizes AI Storage with High‑Throughput Distributed Filesystem

3FS, DeepSeek’s high‑performance parallel file system, is engineered for AI workloads: ultra‑low‑latency, high‑throughput storage built on RDMA and CRAQ consistency, with clean cloud‑native integration. This article walks through its architecture, deployment steps, performance benchmarks, and cost‑saving strategies for large‑scale model training and inference.

Instant Consumer Technology Team

Background

As AI models enter the trillion‑parameter era, compute demand and multimodal training data grow exponentially, creating new challenges for distributed parallel file storage. DeepSeek’s 3FS (Fire‑Flyer File System) was built to provide high‑throughput storage for these workloads.

3FS Architecture Overview

Design Philosophy

3FS moves the performance‑critical data path out of the kernel, using user‑space zero‑copy RDMA transfers to maximize hardware performance, and focuses on the large files and high‑bandwidth access patterns typical of AI workloads.

Key Features

Hardware Performance : Bypasses the FUSE kernel layer and uses user‑space zero‑copy RDMA transmission.

AI‑Centric Design : Drops the “one‑size‑fits‑all” approach of generic file systems, concentrating on large files and high bandwidth.

Separation of Compute and Storage : Allows independent scaling of storage and compute resources.

Strong Data Consistency : Implements the CRAQ (Chain Replication with Apportioned Queries) protocol to guarantee strong consistency across nodes.

Storage Throughput Optimisation : Uses Direct I/O and RDMA to avoid OS cache overhead and achieve high I/O throughput.
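The CRAQ idea behind the consistency guarantee can be illustrated with a tiny in‑memory simulation. This is a sketch of the protocol's read path only, not the 3FS implementation: writes propagate head to tail, any replica may serve reads, and a replica holding a not‑yet‑committed ("dirty") version defers to the tail.

```python
class CraqNode:
    """One replica in a CRAQ chain (illustrative, not the 3FS code)."""
    def __init__(self):
        self.clean = {}   # key -> committed value
        self.dirty = {}   # key -> value propagated but not yet committed


class CraqChain:
    """Minimal CRAQ sketch: writes propagate head -> tail; reads are
    'apportioned' across all replicas, and a replica holding a dirty
    version asks the tail which version is authoritative."""
    def __init__(self, replicas=3):
        self.nodes = [CraqNode() for _ in range(replicas)]

    def write(self, key, value):
        # 1. Propagate down the chain; each node marks the key dirty.
        for node in self.nodes:
            node.dirty[key] = value
        # 2. The tail commits immediately; it is always authoritative.
        tail = self.nodes[-1]
        tail.clean[key] = tail.dirty.pop(key)

    def ack(self, key):
        # 3. The tail's ack flows back up; earlier nodes commit the key.
        for node in self.nodes[:-1]:
            node.clean[key] = node.dirty.pop(key)

    def read(self, key, replica=0):
        node = self.nodes[replica]
        if key in node.dirty:            # version uncertain: ask the tail
            return self.nodes[-1].clean.get(key)
        return node.clean.get(key)
```

The key property is that reads scale with chain length instead of all landing on the tail, while never observing an uncommitted value.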

Applicable Scenarios

Data preparation: hierarchical directory structures for massive intermediate outputs.

Data loading: random access to training samples without pre‑fetching.

Checkpoint storage: high‑throughput parallel checkpoint access for large‑scale training.

Model inference: KVCache interface provides a cost‑effective DRAM alternative with higher throughput.

Software Design

The system consists of four components: Client, Cluster Manager, Meta Service, and Storage Service, all communicating over RDMA.

Cluster Manager

Provides high availability with a primary‑backup architecture and uses etcd for failover.

Cluster change sync: real‑time node and configuration updates.

Health management: periodic heartbeats from Meta and Storage services.

Meta Service

Stateless service with multi‑instance scalability; metadata is persisted in FoundationDB and managed at chunk granularity using CRAQ for consistency.
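To make the ordered key‑value idea concrete, here is a hypothetical key layout (not the actual 3FS schema) for storing directory entries and inodes in a store like FoundationDB. Because keys are sorted, listing a directory becomes a single range scan over one prefix:

```python
import struct

# Distinct one-byte prefixes keep directory entries and inodes in
# separate, contiguous key ranges of the ordered store.
DENT, INODE = b"D", b"I"

def dent_key(parent_ino, name):
    """Directory entry: (parent inode, child name) -> child metadata."""
    return DENT + struct.pack(">Q", parent_ino) + name.encode()

def inode_key(ino):
    """Inode record: inode number -> attributes."""
    return INODE + struct.pack(">Q", ino)

def list_dir(kv, parent_ino):
    """Range-scan all entries of one directory (kv: dict stand-in for
    an ordered key-value store)."""
    prefix = DENT + struct.pack(">Q", parent_ino)
    return sorted(k[len(prefix):].decode() for k in kv if k.startswith(prefix))
```

Big‑endian packing (`>Q`) makes the byte order of keys match the numeric order of inode numbers, which is what makes the range scan correct.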

Storage Service

Handles data persistence via a Chunk Engine composed of Chunk Allocator and MetaStore.
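In outline, such an engine pairs a free‑space allocator with a small metadata index. A toy version (names and policy are illustrative, not the actual Rust API) might balance new chunks across disks and record placement in a metastore:

```python
class ChunkAllocator:
    """Toy chunk engine: per-disk free lists plus a chunk-id -> placement
    index standing in for the MetaStore (illustrative only)."""
    def __init__(self, num_disks=8, chunks_per_disk=1024):
        self.free = {d: list(range(chunks_per_disk)) for d in range(num_disks)}
        self.metastore = {}   # chunk_id -> (disk, slot)
        self.next_id = 0

    def allocate(self):
        # Pick the disk with the most free slots to balance usage.
        disk = max(self.free, key=lambda d: len(self.free[d]))
        slot = self.free[disk].pop()
        cid = self.next_id
        self.next_id += 1
        self.metastore[cid] = (disk, slot)
        return cid

    def release(self, cid):
        # Return the slot to its disk's free list.
        disk, slot = self.metastore.pop(cid)
        self.free[disk].append(slot)
```

The real engine persists this index durably and handles crash recovery; the sketch only shows the allocate/release bookkeeping.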

Client

Uses a FUSE client to connect to any Meta Service, retrieve node information, and perform I/O on the appropriate Storage Server.

Replication Strategy

Currently supports a three‑replica policy (no erasure coding). Default ChunkSize is 1 MiB and StripeSize is 16.
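Under these defaults, a file's placement can be sketched as a round‑robin mapping of 1 MiB chunks onto 16 chains, each chain holding three replicas via CRAQ. This mapping is illustrative; real placement uses the precomputed chain tables:

```python
CHUNK_SIZE = 1 << 20    # default ChunkSize: 1 MiB
STRIPE_SIZE = 16        # default StripeSize: chunks spread over 16 chains

def chunk_placement(file_size):
    """Map a file to (chunk_index, chain_index) pairs, assigning chunks
    round-robin across STRIPE_SIZE chains (illustrative sketch)."""
    num_chunks = -(-file_size // CHUNK_SIZE)   # ceiling division
    return [(i, i % STRIPE_SIZE) for i in range(num_chunks)]
```

So a 40 MiB file becomes 40 chunks spread over all 16 chains, letting a single large read fan out across many disks at once.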

I/O Model

For random sample reads during training, 3FS employs asynchronous Direct I/O via io_uring to avoid frequent system‑call context switches and eliminate file‑cache overhead.

FFRecord Format

FFRecord is a binary sequence format optimised for 3FS, compatible with PyTorch’s Dataset and DataLoader interfaces, enabling efficient data loading for training.

Deployment at Mashang Consumer Finance ("马上消费", MSXF)

The following steps outline the end‑to‑end deployment of 3FS in a production environment.

Hardware Setup

4 storage ECS instances, each with 128 vCPUs, 1 TiB RAM, a 100 Gbps RDMA NIC, and 8 × 3.84 TiB NVMe SSDs.

4 compute ECS instances, each with 192 vCPUs, 512 GiB RAM, and a 100 Gbps RDMA NIC.

Software Installation

Install the required build dependencies, libfuse ≥ 3.16, the Rust toolchain, and FoundationDB.

# for Ubuntu 22.04
apt install cmake libuv1-dev liblz4-dev liblzma-dev libdouble-conversion-dev libdwarf-dev libunwind-dev libaio-dev libgflags-dev libgoogle-glog-dev libgtest-dev libgmock-dev clang-format-14 clang-14 clang-tidy-14 lld-14 libgoogle-perftools-dev google-perftools libssl-dev gcc-12 g++-12 libboost-all-dev build-essential

# Install libfuse 3.16
wget https://github.com/libfuse/libfuse/releases/download/fuse-3.16.1/fuse-3.16.1.tar.gz
apt install -y build-essential meson ninja-build pkg-config libudev-dev
tar -xzf fuse-3.16.1.tar.gz
cd fuse-3.16.1
mkdir build && cd build
meson setup ..
ninja
sudo ninja install
sudo ldconfig

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install FoundationDB
wget https://github.com/apple/foundationdb/releases/download/7.3.63/foundationdb-server_7.3.63-1_amd64.deb
wget https://github.com/apple/foundationdb/releases/download/7.3.63/foundationdb-clients_7.3.63-1_amd64.deb
sudo dpkg -i foundationdb-{server,clients}_7.3.63-1_amd64.deb
sudo systemctl start foundationdb
fdbcli --exec "status"

Building 3FS

# Clone and build
git clone https://github.com/deepseek-ai/3fs
cd 3fs
git submodule update --init --recursive
./patches/apply.sh

cmake -S . -B build -DCMAKE_CXX_COMPILER=clang++-14 -DCMAKE_C_COMPILER=clang-14 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
cmake --build build -j 56

Service Deployment

Deploy Cluster Manager, Meta Service, Storage Service, and FUSE Client on the respective nodes using systemd unit files, configuring etcd addresses, token authentication, and RDMA endpoints for each service.
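As a reference point, a meta-service unit file might look like the following. Paths, binary names, and flags here are illustrative and must be adapted to your install layout and the upstream 3FS deployment docs:

```ini
# /usr/lib/systemd/system/meta_main.service (illustrative layout)
[Unit]
Description=3FS meta service
After=network-online.target foundationdb.service

[Service]
ExecStart=/opt/3fs/bin/meta_main --launcher_cfg /opt/3fs/etc/meta_main_launcher.toml
Restart=on-failure
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
```

A high file-descriptor limit matters because the service holds many RDMA queue pairs and FoundationDB connections concurrently.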

Data Placement

Generate chain tables and placement policies with the provided Python scripts, then upload them via admin_cli to the management service.

# Generate chain table
python3 ~/3fs/deploy/data_placement/src/model/data_placement.py -ql -relax -type CR --num_nodes 4 --replication_factor 3 --min_targets_per_disk 6
python3 ~/3fs/deploy/data_placement/src/setup/gen_chain_table.py --chain_table_type CR --node_id_begin 10001 --node_id_end 10004 --num_disks_per_node 8 --num_targets_per_disk 6 --target_id_prefix 1 --chain_id_prefix 9 --incidence_matrix_path output/DataPlacementModel-v_4-b_8-r_6-k_3-λ_2-lb_2-ub_2/incidence_matrix.pickle

Performance Evaluation

Using fio, the single‑client read bandwidth reaches 9.5 GB/s (approaching the ~12.5 GB/s line rate of a 100 Gbps RDMA link), write bandwidth 3.3 GB/s, and read IOPS ~117 K. Multi‑client tests show linear scaling of aggregate bandwidth.

# Random read bandwidth test
fio -numjobs=64 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 -rw=randread -bs=4M --group_reporting -size=100M -time_based -runtime=120 -name=iotest -directory=/3fs/stage/iotest

# Random read IOPS test
fio -numjobs=64 -fallocate=none -iodepth=2 -ioengine=libaio -direct=1 -rw=randread -bs=4K --group_reporting -size=100M -time_based -runtime=120 -name=iotest -directory=/3fs/stage/iotest

Benefits

Accelerated AI Workflows : High‑throughput data loading and fast checkpoint storage reduce training iteration time.

Unified Storage Strategy : Integration with CSI enables seamless use of 3FS in Kubernetes, breaking the POSIX/HDFS/S3 silos.

Cost Reduction : Combining 3FS with object storage (OSS) lets roughly 90 % of data move to cheap storage, cutting overall storage cost by more than 50 % while keeping hot data on high‑performance 3FS.
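The arithmetic behind that claim can be sanity‑checked with a toy tiering model. The prices below are placeholders chosen only to show the shape of the calculation, not quoted rates:

```python
def monthly_cost(total_tib, hot_fraction, hot_price_per_tib, cold_price_per_tib):
    """Blended monthly cost for a hot (3FS) + cold (OSS) tiering split."""
    hot = total_tib * hot_fraction * hot_price_per_tib
    cold = total_tib * (1.0 - hot_fraction) * cold_price_per_tib
    return hot + cold

# Example: 1 PiB total, cold tier priced at 1/10 of the hot tier.
all_hot = monthly_cost(1024, 1.0, 100, 10)   # everything on 3FS
tiered  = monthly_cost(1024, 0.1, 100, 10)   # 10% hot, 90% on OSS
savings = 1 - tiered / all_hot
```

With a 10× price gap between tiers, moving 90 % of data cold yields roughly 80 % savings, comfortably clearing the >50 % figure even with less favourable price ratios.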

Conclusion

The deployment demonstrates a complete, production‑grade 3FS stack that supports large‑scale AI training and inference, delivers multi‑fold performance gains over traditional storage, and provides a cost‑effective tiered storage solution when coupled with object storage.
