
How Tencent’s DeepEP Doubles GPU Communication Speed on RoCE Networks

Tencent engineers have contributed a major speedup to DeepSeek's open-source DeepEP communication framework. Their TRMT-based optimizations (dynamic multi-QP topology awareness, IBGDA-driven CPU bypass, and atomic signaling) roughly double GPU communication throughput on RoCE networks and add a further 30% gain on InfiniBand, cutting the networking cost of training and serving large AI models.

Tencent Tech

DeepSeek engineers recently highlighted a "huge speedup" on GitHub, showcasing a performance boost contributed by Tencent.

The key technology is Tencent’s long‑term work on data‑center and GPU communication, distilled into the TRMT (Tensor‑Remote‑Memory‑Transport) technique that powers DeepSeek’s open‑source communication framework DeepEP, pushing its performance to a new level.

In February, DeepSeek open‑sourced five major codebases, including DeepEP, revealing how they achieve the performance of a traditional ten‑thousand‑GPU cluster using only one‑fifth of the hardware resources.

DeepEP, a communication framework that breaks through the NCCL performance bottleneck, delivers up to a 300% increase in communication efficiency, allowing many MoE-based large models to run without depending on NVIDIA's NCCL.

However, the technology shines on costly InfiniBand (IB) networks but struggles on the far more common RoCE networks, like a supercar that can only run on a professional race track.

Leveraging years of RoCE experience, Tencent quickly identified two critical breakthrough points:

Low lane utilization: RoCE NICs often have a dual-port architecture, but existing systems cannot distribute traffic intelligently, so one port becomes congested while the other sits idle.

CPU control bottleneck: Although DeepEP uses RDMA for GPU‑direct communication, the control plane still relies on CPU mediation, leaving room for latency and energy‑efficiency improvements.

// Fully utilize dual lanes: topology‑aware multi‑QP linking

The core idea is to use a dynamic allocation algorithm to maximize bandwidth utilization of dual‑port NICs.

When an AI model starts, multiple GPUs form communication groups. Within each group, every GPU must establish communication links, and each GPU pair creates multiple QPs (queue pairs).

This architecture resembles an intelligent traffic‑management system: when 2,048 special vehicles (GPU packets) need to traverse a city network (RoCE), the controller opens dedicated routes (QP‑bound ports) for each cargo type.

By dynamically assigning UDP source ports as entry ramps, the algorithm balances traffic across the NIC's two physical ports, preventing congestion and approaching the NIC's theoretical peak bandwidth.
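As a rough illustration of the idea (not DeepEP's actual code), the port-balancing step can be sketched as follows. RoCEv2 carries RDMA over UDP, and switches and NICs hash the UDP source port when choosing a path, so giving each QP its own source port and alternating physical ports per QP spreads flows across both lanes. The struct and function names here are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: spread N queue pairs across the two physical ports of a
// dual-port RoCE NIC by choosing each QP's UDP source port.
struct QpPlan {
    int      qp_index;
    int      nic_port;     // 0 or 1: which physical port this QP is pinned to
    uint16_t udp_src_port; // distinct source port steering the flow's hash bucket
};

std::vector<QpPlan> plan_qps(int num_qps, uint16_t base_port /* e.g. 49152 */) {
    std::vector<QpPlan> plan;
    for (int i = 0; i < num_qps; ++i) {
        // Round-robin across the two ports; distinct UDP source ports keep the
        // flows in distinct hash buckets so neither lane is oversubscribed.
        plan.push_back({i, i % 2, static_cast<uint16_t>(base_port + i)});
    }
    return plan;
}
```

With eight QPs, four land on each port, so both lanes carry traffic instead of one lane saturating while the other idles.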

// Bypass the CPU: IBGDA‑based multi‑channel load‑balanced data transfer

RDMA‑direct GPU communication is like a port where cargo ships unload and continue without stopping; however, the control plane still requires CPU handling, similar to a port managing arrival schedules.

Tencent applied IBGDA (InfiniBand GPUDirect Async) to let the control plane also bypass the CPU, reducing control latency to the hardware limit.

Additionally, each GPU can use multiple channels simultaneously, with automatic data distribution that prevents any channel from being overloaded while others stay idle.
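A minimal sketch of this load-balancing idea, assuming a simple bytes-in-flight metric per channel (the names are illustrative, not DeepEP's API):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: each GPU owns several channels; every message is routed
// to the least-loaded channel so no channel is overloaded while others idle.
struct Channel {
    uint64_t bytes_in_flight = 0; // outstanding data queued on this channel
};

int pick_channel(const std::vector<Channel>& chans) {
    int best = 0;
    for (int i = 1; i < static_cast<int>(chans.size()); ++i)
        if (chans[i].bytes_in_flight < chans[best].bytes_in_flight) best = i;
    return best;
}

void submit(std::vector<Channel>& chans, uint64_t msg_bytes) {
    chans[pick_channel(chans)].bytes_in_flight += msg_bytes;
}
```

With equal-sized messages this degenerates to round-robin; with mixed sizes the least-loaded rule keeps channel backlogs balanced automatically.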

// No ordering errors: atomic QP sequencing lock

When GPU A writes data directly into GPU B’s memory, B cannot know when the data arrives, leading to out‑of‑order delivery if many transfers occur in parallel.

Engineers introduced a "QP internal sequencing lock" mechanism that attaches a hardware-level sequence tag to each transmission; the receiver acknowledges packets strictly in order, ensuring reliable sequencing even with thousands of concurrent tasks.
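The ordering guarantee can be sketched with an in-process atomic counter; in the real system the equivalent update would be an RDMA atomic into the receiver's memory, and all names below are hypothetical:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch of per-QP sequencing: the sender stamps each completed
// write with the next sequence number; the counter only advances when the
// published transfer is the next expected one, so out-of-order arrivals must
// wait for their predecessors before the receiver may consume them.
struct SeqLock {
    std::atomic<uint64_t> completed{0}; // count of contiguously-arrived transfers

    // Sender side: announce that transfer `seq` has landed. Succeeds only if
    // `seq` is the next expected sequence number, enforcing strict ordering.
    bool publish(uint64_t seq) {
        uint64_t expect = seq;
        return completed.compare_exchange_strong(expect, seq + 1);
    }

    // Receiver side: transfer `seq` is safe to read once the counter has
    // advanced past it.
    bool ready(uint64_t seq) const { return completed.load() > seq; }
};
```

If transfer 1 arrives before transfer 0, its publish attempt fails and is retried after transfer 0 lands, so the receiver never observes a gap.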

These three optimizations enable DeepEP to double performance on RoCE networks and add an extra 30% gain when the same techniques are applied to InfiniBand.

All of these technologies have been fully open‑sourced to the DeepEP community and are already used in Tencent’s Hunyuan large‑model training and inference pipelines, demonstrating broad applicability in high‑performance AI environments.

Thanks to the DeepSeek engineers and all collaborators for exploring GPU communication bottlenecks, and thanks for the spirit of open source.

Written by Tencent Tech, Tencent's official tech account, delivering quality technical content to serve developers.