
Deploying DeepSeek‑R1 671B Distributed Inference Service on Alibaba Cloud ACK with vLLM and Dify

This article explains how to quickly deploy the full‑parameter DeepSeek‑R1 671B model in a multi‑node GPU‑enabled Kubernetes cluster on Alibaba Cloud ACK, covering prerequisites, model parallelism, vLLM‑Ray distributed deployment, service verification, and integration with Dify to build a private AI Q&A assistant.

Alibaba Cloud Infrastructure

Background: DeepSeek‑R1 is DeepSeek's first‑generation reasoning model. With 671 billion parameters, it performs strongly on mathematical reasoning, programming contests, and general Q&A tasks. The full‑parameter model is far too large for a single GPU, so it must be split across multiple GPUs using tensor and pipeline parallelism.

Prerequisites: a GPU‑enabled ACK Kubernetes cluster, the Cloud‑Native AI Suite installed, the Arena client (version 0.14.0 or later), and optionally the ack‑dify component.
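Before deploying, it is worth confirming the tooling and the GPU nodes from a configured workstation. These are generic pre‑flight checks, not steps from the original guide, and they assume kubectl and arena already point at the ACK cluster:

```shell
# Confirm the Arena client meets the minimum version (>= 0.14.0).
arena version

# Confirm the cluster is reachable and the GPU nodes are Ready.
kubectl get nodes -o wide
```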

Model Partitioning: the model is divided with tensor parallelism (TP=8) and pipeline parallelism (PP=2). Pipeline parallelism splits the model's layers into two sequential stages (M1, M2) placed on separate GPU nodes, while tensor parallelism distributes each stage's computation across that node's eight GPUs.
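As a sanity check, the two parallel degrees must multiply out to the number of GPUs actually available (two nodes with eight GPUs each). A trivial sketch, with variable names of our own choosing rather than vLLM flags:

```shell
# TP x PP must equal the total GPU count across the cluster.
TP=8    # tensor-parallel degree: GPUs sharing each stage's matrices
PP=2    # pipeline-parallel degree: sequential stages M1, M2
TOTAL=$((TP * PP))
echo "world size: ${TOTAL} GPUs = ${PP} nodes x ${TP} GPUs each"
```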

Distributed Deployment: the deployment uses vLLM with Ray. Two vLLM Pods run on two eight‑GPU EGS nodes, one acting as the Ray head (leader) and the other as a Ray worker. The following Arena command creates the distributed serving job:

arena serve distributed \
        --name=vllm-dist \
        --version=v1 \
        --restful-port=8080 \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.7.2 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8080" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --share-memory=30Gi \
        --data=llm-model:/mnt/models \
        --leader-num=1 \
        --leader-gpus=8 \
        --leader-command="/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); vllm serve /mnt/models/DeepSeek-R1 --port 8080 --trust-remote-code --served-model-name deepseek-r1 --enable-prefix-caching --max-model-len 8192 --gpu-memory-utilization 0.98 --tensor-parallel-size 8 --pipeline-parallel-size 2 --enable-chunked-prefill" \
        --worker-num=1 \
        --worker-gpus=8 \
        --worker-command="/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"

Key parameters: --name and --version identify the serving job (vllm-dist, v1); --image selects the vLLM container image; --restful-port exposes the API on 8080; --share-memory=30Gi sizes the shared-memory volume; --data=llm-model:/mnt/models mounts the model weights; --leader-num/--leader-gpus and --worker-num/--worker-gpus allocate one leader and one worker Pod with eight GPUs each; the leader command bootstraps the Ray cluster and then launches vllm serve with --tensor-parallel-size 8 and --pipeline-parallel-size 2.

Verification: after deployment, run arena serve get vllm-dist to check the status, then forward the service port with kubectl port-forward svc/vllm-dist-v1 8080:8080 and send a test request via curl to http://localhost:8080/v1/completions. The expected JSON response contains a short completion.
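The verification steps above can be run as follows. The request body is a minimal sketch of our own (prompt and token count are arbitrary); the model name matches the --served-model-name flag from the deployment command, and the commands require a live connection to the cluster:

```shell
# Forward the Service port to localhost in the background.
kubectl port-forward svc/vllm-dist-v1 8080:8080 &

# Probe the OpenAI-compatible completions endpoint.
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-r1", "prompt": "Hello, who are you?", "max_tokens": 32}'
```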

Integration with Dify: install the Dify application, add an OpenAI‑API‑compatible model provider, set the model name to deepseek-r1, point the API endpoint at the locally deployed service, and create a chat‑assistant app. The resulting assistant answers queries using the private DeepSeek‑R1 model.
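The provider settings amount to roughly the following. The in‑cluster base URL is an assumption inferred from the vllm-dist-v1 Service used for port‑forwarding above (adjust the namespace to match your cluster):

```
Provider      : OpenAI-API-compatible
Model Type    : LLM
Model Name    : deepseek-r1
API Base URL  : http://vllm-dist-v1.default.svc.cluster.local:8080/v1
```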

Conclusion: the guide demonstrates end‑to‑end deployment of the full DeepSeek‑R1 671B model on ACK, combining model parallelism, vLLM‑Ray distributed serving, and Dify into a private Q&A assistant, and closes with hints for further performance optimization.

Tags: LLM, Kubernetes, vLLM, DeepSeek, Dify, Distributed Deployment
Written by Alibaba Cloud Infrastructure