How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours
This guide shows how to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes on the AIBrix distributed inference platform. It covers cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.
DeepSeek‑R1 671B demonstrates strong logical reasoning with 671 billion total parameters, 37 billion parameters active per token, and a 128K context window, but its size creates demanding deployment challenges.
AIBrix provides container‑orchestrated solutions for multi‑node GPU resource allocation, seamless distributed inference management, RDMA‑based high‑performance networking, and automated elastic scaling, reducing deployment time from weeks to hours.
1. Prerequisites
Download the model weights to object storage or a shared filesystem and prepare a custom container image. The example cluster on Volcano Engine uses two ecs.ebmhpcpni3l.48xlarge instances, each with 8 × 96 GB GPUs, 192 vCPUs, 2048 GiB RAM, 8 × 400 Gbps RDMA, and local NVMe disks.
1.1 Cluster Configuration
Cloud platform: Volcano Engine
Instance: ecs.ebmhpcpni3l.48xlarge × 2
CPU: 192 vCPU
Memory: 2048 GiB DRAM
GPU: 96 GB × 8
Network: 400 Gbps × 8 RDMA + 96 Gbps
Disk: NVMe 3576 GiB × 4
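A quick back‑of‑envelope check on why two of these nodes are needed (assuming FP8 weights at roughly 1 byte per parameter; KV cache and activations need additional headroom on top):

```shell
# 671B parameters at ~1 byte each (FP8) => ~671 GB of weights alone
WEIGHT_GB=671
# Aggregate GPU memory across the fleet: 2 nodes x 8 GPUs x 96 GB
GPU_MEM_GB=$(( 2 * 8 * 96 ))
echo "weights ~${WEIGHT_GB} GB vs ${GPU_MEM_GB} GB aggregate GPU memory"
# A single node (8 x 96 GB = 768 GB) barely fits the weights, leaving no
# room for the 128K-token KV cache -- hence the two-node deployment.
```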
1.2 vLLM Image
Use the custom image aibrix/vllm-openai:v0.7.3.self.post1. It upgrades nvidia-nccl-cu12 to 2.25.1 to fix NCCL hangs and reinstalls ray[default,adag]==2.40.0 to address a Ray regression.
<code>FROM vllm/vllm-openai:v0.7.3
# Pin the Ray release that fixes the regression mentioned above
RUN pip3 install -U "ray[default,adag]==2.40.0"
# Pin the NCCL release that resolves the hang
RUN pip3 install -U nvidia-nccl-cu12==2.25.1
ENTRYPOINT [""]
</code>
For users in China, prepend aibrix-container-registry-cn-beijing.cr.volces.com/ to the image name.
1.3 Model Weights
Four storage options are discussed: HuggingFace (not recommended for DeepSeek‑R1), Persistent Volumes via cloud CSI, object storage (e.g., S3, GCS), and local disks.
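For the Persistent Volume route, a claim might look like the following sketch. The name, storage class, and access mode are assumptions; check your cloud's CSI documentation for the supported values:

```yaml
# Hypothetical PVC for staging the DeepSeek-R1 weights (~700 GB in FP8)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-r1-weights      # hypothetical name
  namespace: default
spec:
  accessModes:
  - ReadOnlyMany                 # weights are read-only at serve time;
                                 # depends on your CSI driver's support
  storageClassName: ebs-ssd      # assumption: replace with your CSI class
  resources:
    requests:
      storage: 1Ti               # headroom above the raw weight size
```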
1.4 High‑Performance Network
Configure the pod annotation k8s.volcengine.com/pod-networks with the RDMA CNI and set vke.volcengine.com/rdma: "8". Add the IPC_LOCK capability in the security context.
<code>k8s.volcengine.com/pod-networks: |
  [
    {"cniConf":{"name":"rdma"}},
    ...
  ]
securityContext:
  capabilities:
    add:
    - IPC_LOCK
</code>
2. Component Installation
Install AIBrix v0.2.1 core and dependencies:
<code>kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-dependency-v0.2.1.yaml
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-core-v0.2.1.yaml
</code>
3. How AIBrix Supports DeepSeek‑R1
AIBrix orchestrates RayClusterFleet, the Gateway plugin, and the Autoscaler to manage distributed inference, route traffic to the Ray head node, and autoscale based on pod metrics.
4. Model Deployment
Apply the runtime and autoscaling manifests:
<code>kubectl apply -f deepseek-r1-ai-runtime.yaml
kubectl apply -f deepseek-r1-autoscaling.yaml
</code>
Verify that the head pod (deepseek-r1-671b-...-head-...) and its worker pods are all in the Running state.
5. Sending Requests
Expose the endpoint via LoadBalancer or port‑forwarding and send a chat completion request:
<code># LoadBalancer
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
# Port‑forward (no LB)
kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
ENDPOINT="localhost:8888"
curl http://${ENDPOINT}/v1/chat/completions \
-H "Content-Type: application/json" -H "routing-strategy: least-request" \
-d '{"model":"deepseek-r1-671b","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who won the world series in 2020?"}]}'
</code>
Remove the routing-strategy header to use the default Kubernetes routing.
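When iterating on prompts, it can help to keep the request body in a file and validate it locally; a malformed body otherwise surfaces only as an opaque 400 from the gateway. A small sketch (the /tmp path is arbitrary):

```shell
# Write the request body once and reuse it with curl's -d @file syntax.
cat > /tmp/r1-request.json <<'EOF'
{
  "model": "deepseek-r1-671b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"}
  ]
}
EOF
# python -m json.tool parses the file and fails loudly on invalid JSON
python3 -m json.tool /tmp/r1-request.json >/dev/null && echo "request body is valid JSON"
# Then: curl http://${ENDPOINT}/v1/chat/completions \
#   -H "Content-Type: application/json" -d @/tmp/r1-request.json
```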
6. Observability
Deploy a ServiceMonitor to collect metrics from the Ray head pod:
<code>apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: deepseek-r1-svc-discover
  namespace: default
  labels:
    volcengine.vmp: "true"
spec:
  endpoints:
  - port: service
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      ray.io/node-type: head
</code>
Import the provided Grafana dashboard (link in the original article) to visualize model performance.
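If you build your own dashboard instead, vLLM's OpenAI server exports Prometheus metrics under the vllm: prefix. A few queries worth starting from (metric names are from vLLM v0.7.x; verify them against your pod's /metrics endpoint):

```promql
# Requests currently being processed vs queued, per pod
vllm:num_requests_running
vllm:num_requests_waiting

# KV-cache utilization (0-1); sustained values near 1 mean cache pressure
vllm:gpu_cache_usage_perc

# p95 time-to-first-token over 5m, from the histogram buckets
histogram_quantile(0.95,
  sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))
```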
7. Further Help
For questions, join the AIBrix Slack channel.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.