Deploy DeepSeek‑R1‑Distill on Volcengine CPU Cloud for Low‑Cost AI Inference
This guide walks you through deploying the DeepSeek‑R1‑Distill model on Volcengine CPU ECS instances, covering use‑case scenarios, recommended server types, Docker setup, environment configuration, and verification steps to achieve cost‑effective, high‑compatibility AI inference.
This is the third part of the "Cloud Practice" series, presenting a solution for deploying the DeepSeek‑R1‑Distill model service on Volcengine CPU cloud servers, which offers advantages in cost, universality, maintenance, scalability, and energy consumption.
Typical scenarios:
Personal trial: AI performance needs are modest, so the lower cost of CPU instances is sufficient for a typical first experience.
Enterprise API debugging: CPU deployment avoids GPU driver and CUDA compatibility issues, reducing development and management costs.
Lightweight model demand: small-scale tasks (low-frequency calls, small data batches) run well on multi-core CPUs, making this suitable for internal knowledge-base Q&A systems.
Testing on an ecs.c3il.8xlarge instance shows a throughput of 14 tokens/s with bf16 precision, meeting normal usage requirements.
Deployment Overview
We recommend different Volcengine CPU ECS types for various model sizes; ensure memory exceeds the model size.
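As a quick sanity check on the memory rule above, a model's weight footprint can be estimated from its parameter count and precision. A minimal sketch; the 1.2x overhead factor is an illustrative assumption, not a Volcengine sizing rule:

```python
# Bytes per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}

def model_memory_gib(num_params: float, dtype: str, overhead: float = 1.2) -> float:
    """Estimated GiB needed to hold the weights, with a rough runtime overhead
    factor for KV cache and activations (assumed, not an official figure)."""
    return num_params * BYTES_PER_PARAM[dtype] * overhead / 2**30

# DeepSeek-R1-Distill-Qwen-7B at bf16: roughly 7e9 params * 2 bytes each.
print(f"{model_memory_gib(7e9, 'bf16'):.1f} GiB")
```

By this estimate a 7B model at bf16 wants roughly 16 GiB of headroom, which is why instance memory must exceed the raw model size.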
Step 1: Create ECS Instance
Log in to the Volcengine ECS console (https://console.volcengine.com/ecs), select a region and availability zone, choose an appropriate instance type, and configure storage. The example uses an instance in the Shanghai region.
Step 2: Deploy Docker Environment and Enable the Model
Install Docker on the instance:
<code>sudo apt update</code>
<code>sudo apt install docker.io</code>
Run the Docker container with the DeepSeek‑R1‑Distill model:
<code>docker run -d --network host --privileged --shm-size 15g -v /data00/models:/data00/models -e MODEL_PATH=/data00/models -e PORT=8000 -e MODEL_NAME=DeepSeek-R1-Distill-Qwen-7B -e DTYPE=bf16 -e KV_CACHE_DTYPE=fp16 ai-containers-cn-shanghai.cr.volces.com/deeplearning/xft-vllm:1.8.2.iaas bash /llama2/entrypoint.sh</code>
For the Beijing region, use the image <code>ai-containers-cn-beijing.cr.volces.com/deeplearning/xft-vllm:1.8.2.iaas</code> instead.
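After the container starts, the service still needs time to load the model weights. A small readiness probe, assuming the container exposes the vLLM-style OpenAI-compatible <code>/v1/models</code> endpoint on the configured port (a sketch, not part of the official image):

```python
import urllib.request

def service_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True once the OpenAI-compatible server answers GET /v1/models."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP/URL error
        return False

if __name__ == "__main__":
    import time
    # Poll the local deployment until the model finishes loading.
    while not service_ready("http://127.0.0.1:8000"):
        time.sleep(5)
    print("service is ready")
```

You can also watch progress directly with <code>docker logs</code> on the container.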
The environment variables control the deployment:
MODEL_PATH: directory inside the container where the model weights are mounted.
PORT: port the inference service listens on.
MODEL_NAME: model to load (here DeepSeek-R1-Distill-Qwen-7B).
DTYPE: computation precision (bf16).
KV_CACHE_DTYPE: precision of the KV cache (fp16).
Step 3: Test and Verify
Execute a curl request to confirm the service is running:
<code>curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "xft",
"messages":[{"role":"user","content":"你好!请问你是谁?"}],
"max_tokens": 256,
"temperature": 0.6
}'</code>
A successful response returns a JSON body containing the model's reply, confirming the service is up.
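The same request can be issued programmatically. A sketch using only the Python standard library; the endpoint and model name match the deployment above:

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 256,
                       temperature: float = 0.6) -> dict:
    """Build an OpenAI-compatible chat-completions payload for the xft service."""
    return {
        "model": "xft",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(base_url: str, prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("http://127.0.0.1:8000", "Hello! Who are you?"))
```

This mirrors the curl call one-for-one, so it can serve as the starting point for integrating the service into an application.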
Conclusion
The entire process demonstrates how to quickly launch the DeepSeek‑R1‑Distill model service using Volcengine CPU cloud products, offering a low‑cost, high‑compatibility solution for interested users.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.