
Analysis and Optimization of CephFS I/O Performance for AI Training on the Xingchen Compute Platform

This article investigates why AI training tasks on Tencent's Xingchen compute platform experience severe I/O slowdown when using CephFS, analyzes the underlying Ceph‑FUSE and MDS mechanisms, and proposes metadata‑caching and file‑caching optimizations that speed up training by three to four times.

Tencent Architect

Background – With the rapid growth of big data and AI, the Xingchen compute platform aggregates GPU resources for internal algorithm engineers. Training tasks that read massive numbers of small files from CephFS suffer from severe I/O slowdown, especially for computer‑vision workloads.

CephFS I/O Flow – When a client issues open, read, or readdir, it first contacts the Metadata Server (MDS) to obtain the inode and attributes such as uid and gid. The MDS caches hot metadata in memory to avoid costly RADOS lookups. The client then uses the CRUSH algorithm to compute the data's location in RADOS and talks to the storage devices (OSDs) directly, with no central data directory.

Problem – In AI training, the same dataset is read repeatedly across epochs, yet the client's cache is repeatedly cleared. Ceph‑FUSE logs show periodic trim_caps messages that remove 5,000 CAPS every 5 seconds, invalidating the cache and slowing down subsequent epochs.

Root‑Cause Analysis – The MDS periodically runs tick, which calls mdcache->trim() and mdcache->trim_client_leases(). When the number of CAPS held by a client exceeds mds_max_caps_per_client (default 1 Mi, i.e. 1,048,576), the MDS sends a recall_client_state request to trim CAPS, releasing at most mds_recall_max_caps (default 5,000) per tick. On receiving CEPH_SESSION_RECALL_STATE, the Ceph‑FUSE client executes trim_caps, which clears the corresponding dentry, inode, and page caches and even triggers a kernel remount to purge the kernel's caches.

void MDSRankDispatcher::tick() {
  ...
  if (is_active() || is_stopping()) {
    // Ask clients holding too many CAPS to release some (ENFORCE_MAX).
    server->recall_client_state(nullptr, Server::RecallFlags::ENFORCE_MAX);
    mdcache->trim();                // trim the MDS metadata cache
    mdcache->trim_client_leases();  // expire client dentry leases
    mdcache->check_memory_usage();
    mdlog->trim();
  }
  ...
}
std::pair<bool, uint64_t>
Server::recall_client_state(MDSGatherBuilder* gather, RecallFlags flags) {
  const bool enforce_max = flags & RecallFlags::ENFORCE_MAX;
  const uint64_t max_caps_per_client =
      g_conf->get_val<uint64_t>("mds_max_caps_per_client");
  const uint64_t recall_max_caps =
      g_conf->get_val<uint64_t>("mds_recall_max_caps");
  ...
}

Solution – Metadata Cache Optimization – Mark files opened read-only with an I_CACHED flag so that Ceph‑FUSE can serve their metadata from its own cache without contacting the MDS. When trim_caps occurs, keep the cached metadata (setting the inode state to I_ORPHAN) and skip the kernel remount, preserving the dentry cache.

void Client::trim_caps(MetaSession *s, uint64_t max) {
  uint64_t caps_size = s->caps.size();
  uint64_t trimmed = 0;
  std::set<Dentry *> to_trim;
  auto p = s->caps.begin();
  while ((caps_size - trimmed) > max && !p.end()) {
    ... // drop excess CAPS, collecting their dentries in to_trim
  }
  for (const auto &dn : to_trim) {
    trim_dentry(dn);
  }
  if (s->caps.size() > max)
    _invalidate_kernel_dcache(); // previously triggered a full kernel remount
}

Solution – File Cache Layer – After the first epoch, cache file contents on a local SSD as a single large cache file. Record each file’s offset in a metadata cache. Subsequent reads use pread on the local cache, falling back to CephFS only on a miss, thus turning network I/O into local SSD I/O.

// For reference, the original remount callback that trim_caps used to
// trigger, which purged the kernel dentry cache by remounting:
static int remount_cb(void *handle) {
  char cmd[1024];
  CephFuse::Handle *cfuse = (CephFuse::Handle *)handle;
  snprintf(cmd, sizeof(cmd), "mount -i -o remount %s", cfuse->opts.mountpoint);
  int r = system(cmd);
  ...
}

Extended Scenarios – Beyond the two optimizations above, several other acceleration techniques apply to AI workloads: metadata caching, file aggregation (LMDB, TFRecord), data prefetching, NVIDIA DALI, GPUDirect Storage, GPUDirect RDMA, and feature storage. Each targets a different bottleneck (I/O-bound, GPU-compute-bound, or CPU-compute-bound).

Conclusion – By reducing MDS‑driven CAPS reclamation and preserving client‑side metadata, training speed improves 3–4× and becomes comparable to local SSD performance. Future work includes deeper integration with GPUDirect Storage and RADOS APIs for even lower latency.

Tags: performance optimization, distributed storage, AI training, CephFS, Ceph-FUSE, metadata cache