Tag: RDMA

18 articles collected around this technical thread.

Architects' Tech Alliance
Jun 10, 2025 · Fundamentals

Why RDMA Is Revolutionizing High‑Performance Computing and AI

This article explores how Remote Direct Memory Access (RDMA) technology transforms high‑performance computing, artificial intelligence, and cloud storage by eliminating data copies, bypassing the kernel, and offloading protocols to hardware, while reviewing key metrics, product ecosystems, real‑world use cases, challenges, and future trends.

Artificial Intelligence · DPU · High Performance Computing
11 min read
Architects' Tech Alliance
Jun 3, 2025 · Artificial Intelligence

Comprehensive Analysis of RDMA Technology: Principles, Features, Products, and Applications in HPC, AI, and Cloud Storage

The article provides an in‑depth technical overview of Remote Direct Memory Access (RDMA), covering its zero‑copy, kernel‑bypass, and protocol‑offload features, hardware and software ecosystems, and its impact on high‑performance computing, artificial intelligence, cloud storage, finance, and edge computing.

Artificial Intelligence · High Performance Computing · RDMA
10 min read
Architects' Tech Alliance
May 26, 2025 · Fundamentals

Understanding RDMA, InfiniBand, and RoCEv2 for High‑Performance Distributed Training

The article explains how distributed AI training performance depends on reducing inter‑card communication latency, introduces RDMA technology and its implementations (InfiniBand, RoCEv2, iWARP), compares their latency and scalability against traditional TCP/IP, and outlines the hardware components and trade‑offs of InfiniBand and RoCEv2 networks.

High Performance Computing · InfiniBand · RDMA
12 min read
AntData
Mar 14, 2025 · Fundamentals

Analysis of DeepSeek 3FS Storage Service Architecture and Design

This article provides an in‑depth technical analysis of DeepSeek's open‑source 3FS distributed file system, focusing on the StorageService architecture, space pooling, allocation mechanisms, reference counting, fragmentation handling, and the RDMA‑based read/write data path.

Distributed Storage · File System · RDMA
15 min read
ByteDance Cloud Native
Mar 13, 2025 · Backend Development

Inside DeepSeek 3FS: Architecture of a High‑Performance Parallel File System

This article dissects DeepSeek's 3FS parallel file system, detailing its four‑component architecture, high‑throughput RDMA networking, metadata handling with FoundationDB, client access methods, chain replication (CRAQ), custom FFRecord format, and recovery mechanisms, offering a deep technical perspective for storage engineers.

RDMA · chain replication · distributed file system
22 min read
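The chain replication (CRAQ) scheme mentioned in this abstract can be sketched in a few lines. This is a toy model for intuition only, not 3FS's actual implementation; the class and method names are invented:

```python
class Node:
    def __init__(self):
        self.clean = {}   # key -> committed value
        self.dirty = {}   # key -> value propagated but not yet committed

class Chain:
    """Toy CRAQ chain: writes flow head -> tail, commit acks flow
    tail -> head, and reads can be served by any replica."""
    def __init__(self, length=3):
        self.nodes = [Node() for _ in range(length)]

    def write(self, key, value):
        for n in self.nodes:              # propagate down the chain
            n.dirty[key] = value
        for n in reversed(self.nodes):    # tail commits first, acks upstream
            n.clean[key] = n.dirty.pop(key)

    def read(self, key, replica=0):
        n = self.nodes[replica]
        if key in n.dirty:                # dirty read: defer to the tail
            return self.nodes[-1].clean.get(key)
        return n.clean.get(key)
```

The point of CRAQ over plain chain replication is the read path: clean keys can be served by any replica, so read throughput scales with chain length instead of bottlenecking on the tail.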
AntData
Mar 5, 2025 · Cloud Native

DeepSeek 3FS Network Communication Module: Design, Implementation, and Impact on AI Infrastructure

This article provides an in‑depth analysis of DeepSeek's open‑source 3FS distributed storage system, focusing on its network communication module, RDMA‑based design, core classes such as IBSocket, Listener, and IOWorker, and how these innovations advance high‑performance AI infrastructure.

AI infrastructure · Distributed Storage · Folly Coroutines
15 min read
360 Zhihui Cloud Developer
Jan 7, 2025 · Big Data

High‑Performance Distributed Storage: Ceph vs Alibaba Pangu 2.0 vs XSKY INFINI

This article compares three high‑performance distributed storage systems—Ceph, Alibaba's Pangu 2.0, and XSKY INFINI—examining their architectures, key technologies such as RTC thread models, append‑only writes, kernel‑bypass, RDMA, data compression, and metadata management to reveal how they exploit modern flash hardware.

Ceph · Distributed Storage · NVMe
21 min read
Deepin Linux
Dec 25, 2024 · Fundamentals

An Introduction to RDMA: Principles, Programming, and Applications

This article explains RDMA technology, covering its core principles, programming model with Verbs API, various communication modes, and its impact on data‑center networking, high‑performance computing, and distributed storage, highlighting its low‑latency, zero‑copy advantages over traditional TCP/IP.

High Performance Computing · RDMA · Zero Copy
30 min read
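The zero-copy idea highlighted in this abstract can be felt even in plain user space: a `bytes` construction duplicates data the way a traditional network stack copies it at each hop, while a `memoryview` is a window onto the same buffer. This is a loose analogy for intuition only, not RDMA itself:

```python
buf = bytearray(4 * 1024 * 1024)   # a 4 MiB source buffer

copied = bytes(buf)       # copy path: the data is duplicated
view = memoryview(buf)    # zero-copy path: same memory, no duplication

buf[0] = 0xFF             # mutate the original buffer
assert view[0] == 0xFF    # the view observes the change immediately
assert copied[0] == 0x00  # the copy is a stale duplicate
```

Real RDMA goes further: the NIC reads and writes registered application memory directly, so neither endpoint's CPU copies the payload at all.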
Architects' Tech Alliance
Sep 12, 2024 · Artificial Intelligence

Comparison of InfiniBand and RoCEv2 Architectures for AI Compute Networks

This article examines the two dominant AI compute network architectures, InfiniBand and RoCEv2, detailing their designs, flow‑control mechanisms, performance, cost and scalability characteristics, and evaluates their respective advantages and limitations to guide network selection for AI data centers.

AI compute · InfiniBand · RDMA
9 min read
Architects' Tech Alliance
Aug 18, 2024 · Artificial Intelligence

RDMA, InfiniBand, RoCE, and iWARP: High‑Performance Networking for Large‑Scale Generative AI Model Training

The article explains how RDMA technologies—including InfiniBand, RoCE, and iWARP—deliver high‑throughput, low‑latency data transfer that bypasses the CPU for massive generative AI model training, compares their architectures, and discusses modern network designs and load‑balancing strategies for optimizing AI‑focused data‑center networks.

AI training · High Performance Computing · InfiniBand
11 min read
Architects' Tech Alliance
Aug 14, 2024 · Artificial Intelligence

Network Architecture and Performance Requirements for Training Large-Scale Generative AI Models

The article examines the ultra‑large‑scale, high‑bandwidth, low‑latency, and automated network infrastructure needed for training generative AI models, covering custom network designs, congestion control, deterministic RDMA, topology choices such as Fat‑Tree, and emerging deterministic networking technologies.

High Bandwidth · Network Automation · RDMA
8 min read
Baidu Geek Talk
Jul 10, 2024 · Artificial Intelligence

Baidu HPN Network: Solving Hash Collision for 95% Physical Network Bandwidth Efficiency in Large Model Training

Baidu's HPN network tackles hash‑collision bottlenecks in large‑model training by combining ToR‑affinity scheduling with Dynamic Load Balancing on self‑developed switches, raising physical network bandwidth efficiency to about 95%, improving throughput by roughly 10%, and adding a further 1.5% training‑speed gain via the BCCL library.

Baidu Cloud · Collective Communication · DLB Dynamic Load Balancing
12 min read
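The hash-collision effect that HPN addresses is easy to reproduce in a toy simulation. Topology sizes, addresses, and the hash function below are illustrative assumptions, not Baidu's design:

```python
import hashlib
from collections import Counter

def ecmp_uplink(src, dst, sport, dport, num_paths):
    """Pick an equal-cost uplink by hashing the flow tuple, as ECMP does."""
    key = f"{src}|{dst}|{sport}|{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % num_paths

# 16 long-lived "elephant" flows from one training job, 8 equal-cost
# uplinks (4791 is the RoCEv2 UDP destination port).
flows = [("10.0.0.1", "10.0.1.1", 40000 + i, 4791) for i in range(16)]
loads = Counter(ecmp_uplink(*f, num_paths=8) for f in flows)

# By the pigeonhole principle at least one uplink carries 2+ flows, and in
# practice the spread is uneven: some links saturate while others sit idle.
print(sorted(loads.values(), reverse=True))
```

Dynamic Load Balancing sidesteps the static hash by steering flows based on observed link load, which is how schemes like HPN recover bandwidth that per-flow ECMP leaves stranded.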
Architects' Tech Alliance
Jun 20, 2024 · Artificial Intelligence

Comparative Analysis of InfiniBand and RoCEv2 Architectures for AI Compute Networks

This article provides a detailed comparison of InfiniBand and RoCEv2 network architectures, examining their technical features, flow‑control mechanisms, performance, cost, and suitability for AI compute environments to guide designers in selecting the optimal solution.

AI compute · InfiniBand · RDMA
9 min read
Architects' Tech Alliance
May 3, 2024 · Fundamentals

From OSI Model to RDMA: High‑Performance Networking, Leaf‑Spine Architecture, and Switch Selection

This article examines the evolution of network protocols from the OSI seven‑layer model and TCP/IP to RDMA technologies such as InfiniBand and RoCE, compares traditional three‑tier and leaf‑spine data‑center designs, and evaluates Ethernet, InfiniBand, and RoCE switches for high‑throughput, low‑latency HPC environments.

High Performance Computing · InfiniBand · RDMA
13 min read
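The leaf‑spine design this article covers comes with simple sizing arithmetic. The sketch below is the generic textbook calculation (not taken from the article), deriving fabric size from switch port counts and the desired oversubscription ratio:

```python
def leaf_spine_size(leaf_ports, spine_ports, oversub=1.0):
    """Two-tier leaf-spine sizing.

    oversub is the downlink:uplink ratio per leaf switch;
    1.0 means a non-blocking (1:1) fabric.
    """
    uplinks = int(leaf_ports / (1 + oversub))   # leaf ports toward spines
    downlinks = leaf_ports - uplinks            # leaf ports toward hosts
    return {
        "spines": uplinks,           # one uplink per spine switch
        "leaves": spine_ports,       # each spine port feeds one leaf
        "hosts": spine_ports * downlinks,
    }

# 64-port switches, non-blocking: 32 spines, 64 leaves, 2048 host ports.
print(leaf_spine_size(64, 64))
```

Raising `oversub` (e.g. to 3.0 for a 3:1 fabric) trades uplink bandwidth for more host ports per leaf, which is the usual cost lever in Ethernet designs.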
360 Smart Cloud
Apr 25, 2024 · Cloud Native

Building High‑Performance RoCE v2 and InfiniBand Networks in a Cloud‑Native Environment for Large‑Model Training

This article explains how to construct high‑performance RoCE v2 and InfiniBand networks within a cloud‑native Kubernetes environment, detailing the underlying technologies, required components, configuration steps, and performance test results that demonstrate significant communication speed improvements for large‑scale AI model training.

AI training · InfiniBand · Kubernetes
12 min read
Architects' Tech Alliance
Apr 21, 2024 · Fundamentals

Understanding RDMA: InfiniBand, RoCE, and Their Role in High‑Performance AI Model Training

This article explains how Remote Direct Memory Access (RDMA) technologies such as InfiniBand and RoCE bypass OS kernels to achieve ultra‑low latency and high bandwidth, discusses their hardware implementations, cost considerations, and their critical impact on large‑scale AI model training and HPC network design.

AI · GPU · High Performance Computing
11 min read
vivo Internet Technology
Dec 13, 2023 · Artificial Intelligence

Practice of Multi-NIC Container Network Acceleration for Offline Training

The talk explains how vivo uses a Kubernetes‑based solution combining Calico and RoCEv2 to migrate offline training workloads from single‑NIC to multi‑NIC, integrating lossless RDMA, planning topology and IP allocation, and employing Volcano, SpiderPool, Macvlan, and Multus CNI for efficient container networking.

Container Networking · Kubernetes · Multi-NIC
4 min read
Architects' Tech Alliance
Dec 6, 2023 · Artificial Intelligence

The Relationship Between Switches, Network Protocols, and AI in Modern Data Centers

This article explains how network protocols and switch architectures—including OSI layers, TCP/IP, RDMA, InfiniBand, RoCE, and leaf‑spine designs—support high‑throughput, low‑latency AI and HPC workloads, compares the Ethernet and InfiniBand markets, and examines NVIDIA's Spectrum‑X and SuperPOD solutions.

AI · InfiniBand · Nvidia
11 min read