
Deploying DeepSeek R1 Model Inference on ACK Edge with Virtual Nodes and Serverless GPU

This article explains how to use Alibaba Cloud ACK Edge to manage on‑premise GPU resources and seamlessly fall back to cloud‑based ACS Serverless GPU via virtual nodes for deploying DeepSeek R1 inference, covering environment preparation, model download, storage setup, custom scheduling, and scaling strategies.

Alibaba Cloud Infrastructure

Alibaba Cloud ACK Edge clusters adopt a cloud‑edge integrated architecture, hosting the Kubernetes control plane in the cloud while IDC machines act as data‑plane nodes, enabling containerized management of existing on‑premise GPU resources and improving deployment efficiency.

With the rapid growth of AI large‑model services, ACK Edge has helped many customers manage IDC GPU machines and quickly deploy inference workloads. The full DeepSeek R1 model, however, uses a Mixture‑of‑Experts architecture that needs at least eight GPUs, and its FP8 precision requires newer GPU generations, creating a resource challenge for IDC environments.

This guide demonstrates how to manage IDC GPU machines through ACK Edge and deploy the DeepSeek inference service using the ACK AI suite. The workflow prioritizes running inference Pods on IDC GPUs, and when those resources are insufficient, it automatically creates cloud‑based ACS Serverless GPU virtual nodes to run the Pods, achieving business scalability and cost optimization.

Solution Advantages

• Extreme elasticity: massive, second‑level scaling to absorb traffic spikes.
• Fine‑grained cost control: pay‑as‑you‑go, with no need to purchase servers.
• Rich elastic resources: supports CPU, GPU, and other instance types.

Usage Example

Prepare Environment

• Choose a region as the central region and create an ACK Edge cluster.
• Install the virtual‑node component (see the component management documentation).
• Install KServe (see the ack‑kserve component guide).
• Install Arena (see the Arena client configuration guide).
• Deploy monitoring components and configure GPU metrics for auto‑scaling.
• Create an edge node pool in a dedicated VPC and add IDC resources to the pool.

Step 1: Download DeepSeek‑R1‑Distill‑Qwen‑7B model

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B/
git lfs pull

Upload the model to OSS (create a bucket directory first):

ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B

Step 2: Create PV and PVC for the model

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: <your-bucket-endpoint>
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /models/DeepSeek-R1-Distill-Qwen-7B/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
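Save the manifests above to a file and apply them, then confirm that the static PV and the PVC have bound (the file name `model-storage.yaml` is chosen here for illustration):

```shell
kubectl apply -f model-storage.yaml

# The PVC should report STATUS "Bound" once it matches the PV
# through the alicloud-pvname label selector
kubectl get pv llm-model
kubectl get pvc llm-model
```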

Step 3: Create a custom scheduling policy

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: deepseek
  namespace: default
spec:
  selector:
    app: isvc.deepseek-predictor
  strategy: prefer
  units:
    - resource: ecs
      nodeSelector:
        alibabacloud.com/nodepool-id: np*********
    - resource: eci
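With `strategy: prefer`, the scheduler tries the units in order: Pods matching the `app: isvc.deepseek-predictor` label land on the IDC edge node pool (the `ecs` unit) first and overflow to the serverless virtual node (the `eci` unit) only when IDC GPUs are exhausted. Apply the policy and verify it was accepted (the file name below is illustrative):

```shell
kubectl apply -f resourcepolicy.yaml
kubectl get resourcepolicy deepseek -n default
```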

Step 4: Deploy the model with Arena/KServe

arena serve kserve \
    --name=deepseek \
    --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6e-c12g1.3xlarge \
    --annotation=k8s.aliyun.com/eci-vswitch=vsw-*********,vsw-********* \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
    --scale-target=50 \
    --min-replicas=1 \
    --max-replicas=3 \
    --data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager --dtype=half"

Check node status:

kubectl get nodes -owide

Expected output shows one IDC node (idc001) with a V100 GPU and one virtual node.

Query the inference service:

arena serve get deepseek

Expected output confirms the Pod is scheduled on the IDC node.
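Before load‑testing, you can send a single request to the OpenAI‑compatible endpoint that vLLM exposes. The gateway address and port depend on your ingress setup, so the placeholders below must be replaced with your own values:

```shell
# One-off smoke test against the deployed service; the Host header
# matches the route created by KServe for the "deepseek" service
curl -X POST http://<gateway-ip>:<port>/v1/chat/completions \
    -H "Host: deepseek-default.example.com" \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 64}'
```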

Step 5: Simulate traffic spikes to trigger cloud‑side scaling

hey -z 5m -c 5 \
    -m POST -host deepseek-default.example.com \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
    http://<gateway-ip>:<port>/v1/chat/completions

When GPU utilization exceeds the threshold, the HPA creates additional replicas on the virtual node.
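You can watch the scale‑out as it happens; replicas beyond the IDC node's capacity should be scheduled onto the virtual node:

```shell
# Watch the HPA raise desired replicas as DCGM_CUSTOM_PROCESS_SM_UTIL
# crosses the 50% target configured in the arena command
kubectl get hpa

# New predictor Pods should appear with the virtual node in the NODE column
kubectl get pods -o wide -l app=isvc.deepseek-predictor
```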

Final summary: ACK Edge provides a cloud‑native, edge‑integrated Kubernetes platform that manages IDC, ENS, and cross‑region ECS resources, reducing operational complexity while seamlessly leveraging cloud elasticity. Combining ACK Edge with virtual nodes enables fine‑grained cost control and reliable scaling for AI inference workloads.
