How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours
This guide shows how to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes on the AIBrix distributed inference platform. It covers cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.
DeepSeek‑R1 671B demonstrates strong logical reasoning with 671 billion total parameters, 37 billion parameters active per token, and a 128K context window, but its size creates demanding deployment challenges.
AIBrix provides container‑orchestrated solutions for multi‑node GPU resource allocation, seamless distributed inference management, RDMA‑based high‑performance networking, and automated elastic scaling, reducing deployment time from weeks to hours.
1. Prerequisites
Download the model weights to object storage or a shared filesystem and prepare a custom container image. The example cluster on Volcano Engine uses two ecs.ebmhpcpni3l.48xlarge instances, each with 8 × 96 GB GPUs, 192 vCPUs, 2048 GiB RAM, 8 × 400 Gbps RDMA, and local NVMe disks.
1.1 Cluster Configuration
Cloud platform: Volcano Engine
Instance: ecs.ebmhpcpni3l.48xlarge × 2
CPU: 192 vCPU
Memory: 2048 GiB DRAM
GPU: 96 GB × 8
Network: 400 Gbps × 8 RDMA + 96 Gbps
Disk: NVMe 3576 GiB × 4
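A quick back‑of‑envelope check on why two of these nodes are needed (assuming FP8 weights at roughly 1 byte per parameter; KV cache and activations need additional headroom on top):

```shell
# 671B parameters at ~1 byte each (FP8) => ~671 GB of weights alone
WEIGHT_GB=671
# Aggregate GPU memory across the fleet: 2 nodes x 8 GPUs x 96 GB
GPU_MEM_GB=$(( 2 * 8 * 96 ))
echo "weights ~${WEIGHT_GB} GB vs ${GPU_MEM_GB} GB aggregate GPU memory"
# A single node (8 x 96 GB = 768 GB) barely fits the weights, leaving no
# room for the 128K-token KV cache -- hence the two-node deployment.
```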
1.2 vLLM Image
Use the custom image aibrix/vllm-openai:v0.7.3.self.post1. It upgrades nvidia-nccl-cu12 to 2.25.1 to fix NCCL hangs and reinstalls ray[default,adag]==2.40.0 to address a Ray regression.
<code>FROM vllm/vllm-openai:v0.7.3
# Pin the Ray release that fixes the regression mentioned above
RUN pip3 install -U "ray[default,adag]==2.40.0"
# Pin the NCCL release that resolves the hang
RUN pip3 install -U nvidia-nccl-cu12==2.25.1
ENTRYPOINT [""]
</code>
For users in China, prepend aibrix-container-registry-cn-beijing.cr.volces.com/ to the image name.
1.3 Model Weights
Four storage options are discussed: HuggingFace (not recommended for DeepSeek‑R1), Persistent Volumes via cloud CSI, object storage (e.g., S3, GCS), and local disks.
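For the Persistent Volume route, a claim might look like the following sketch. The name, storage class, and access mode are assumptions; check your cloud's CSI documentation for the supported values:

```yaml
# Hypothetical PVC for staging the DeepSeek-R1 weights (~700 GB in FP8)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-r1-weights      # hypothetical name
  namespace: default
spec:
  accessModes:
  - ReadOnlyMany                 # weights are read-only at serve time;
                                 # depends on your CSI driver's support
  storageClassName: ebs-ssd      # assumption: replace with your CSI class
  resources:
    requests:
      storage: 1Ti               # headroom above the raw weight size
```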
1.4 High‑Performance Network
Configure the pod annotation k8s.volcengine.com/pod-networks with the RDMA CNI and set vke.volcengine.com/rdma: "8". Add the IPC_LOCK capability in the security context.
<code>k8s.volcengine.com/pod-networks: |
  [
    {"cniConf":{"name":"rdma"}},
    ...
  ]
securityContext:
  capabilities:
    add:
    - IPC_LOCK
</code>
2. Component Installation
Install AIBrix v0.2.1 core and dependencies:
<code>kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-dependency-v0.2.1.yaml
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-core-v0.2.1.yaml
</code>
3. How AIBrix Supports DeepSeek‑R1
AIBrix orchestrates RayClusterFleet, the Gateway plugin, and the Autoscaler to manage distributed inference, route traffic to the Ray head node, and autoscale based on pod metrics.
4. Model Deployment
Apply the runtime and autoscaling manifests:
<code>kubectl apply -f deepseek-r1-ai-runtime.yaml
kubectl apply -f deepseek-r1-autoscaling.yaml
</code>
Verify that the head pod (deepseek-r1-671b-...-head-...) and its worker pods are all in the Running state.
5. Sending Requests
Expose the endpoint via LoadBalancer or port‑forwarding and send a chat completion request:
<code># LoadBalancer
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
# Port‑forward (no LB)
kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
ENDPOINT="localhost:8888"
curl http://${ENDPOINT}/v1/chat/completions \
-H "Content-Type: application/json" -H "routing-strategy: least-request" \
-d '{"model":"deepseek-r1-671b","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who won the world series in 2020?"}]}'
</code>
Remove the routing-strategy header to use the default Kubernetes routing.
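When iterating on prompts, it can help to keep the request body in a file and validate it locally; a malformed body otherwise surfaces only as an opaque 400 from the gateway. A small sketch (the /tmp path is arbitrary):

```shell
# Write the request body once and reuse it with curl's -d @file syntax.
cat > /tmp/r1-request.json <<'EOF'
{
  "model": "deepseek-r1-671b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"}
  ]
}
EOF
# python -m json.tool parses the file and fails loudly on invalid JSON
python3 -m json.tool /tmp/r1-request.json >/dev/null && echo "request body is valid JSON"
# Then: curl http://${ENDPOINT}/v1/chat/completions \
#   -H "Content-Type: application/json" -d @/tmp/r1-request.json
```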
6. Observability
Deploy a ServiceMonitor to collect metrics from the Ray head pod:
<code>apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: deepseek-r1-svc-discover
  namespace: default
  labels:
    volcengine.vmp: "true"
spec:
  endpoints:
  - port: service
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      ray.io/node-type: head
</code>
Import the provided Grafana dashboard (link in the original article) to visualize model performance.
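If you build your own dashboard instead, vLLM's OpenAI server exports Prometheus metrics under the vllm: prefix. A few queries worth starting from (metric names are from vLLM v0.7.x; verify them against your pod's /metrics endpoint):

```promql
# Requests currently being processed vs queued, per pod
vllm:num_requests_running
vllm:num_requests_waiting

# KV-cache utilization (0-1); sustained values near 1 mean cache pressure
vllm:gpu_cache_usage_perc

# p95 time-to-first-token over 5m, from the histogram buckets
histogram_quantile(0.95,
  sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))
```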
7. Further Help
For questions, join the AIBrix Slack channel.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.