Deploy DeepSeek‑R1‑Distill on Volcengine CPU Cloud for Low‑Cost AI Inference
This guide walks you through deploying the DeepSeek‑R1‑Distill model on Volcengine CPU ECS instances, covering use‑case scenarios, recommended server types, Docker setup, environment configuration, and verification steps to achieve cost‑effective, high‑compatibility AI inference.
This is the third part of the "Cloud Practice" series, presenting a solution for deploying the DeepSeek‑R1‑Distill model service on Volcengine CPU cloud servers, which offers advantages in cost, universality, maintenance, scalability, and energy consumption.
Typical scenarios:
Personal trial: AI performance needs are modest, so the lower cost of CPU instances is sufficient for a typical first experience.
Enterprise API debugging: CPU deployment avoids GPU driver and CUDA compatibility issues, reducing development and management costs.
Lightweight model demand: small-scale tasks (low-frequency calls, small data batches) run well on multi-core CPUs, making this suitable for internal knowledge-base Q&A systems.
Testing on an ecs.c3il.8xlarge instance shows a throughput of 14 tokens/s with bf16 precision, meeting normal usage requirements.
Deployment Overview
We recommend different Volcengine CPU ECS types for various model sizes; ensure memory exceeds the model size.
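As a quick sanity check on the memory rule above, a model's weight footprint can be estimated from its parameter count and precision. A minimal sketch; the 1.2x overhead factor is an illustrative assumption, not a Volcengine sizing rule:

```python
# Bytes per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}

def model_memory_gib(num_params: float, dtype: str, overhead: float = 1.2) -> float:
    """Estimated GiB needed to hold the weights, with a rough runtime overhead
    factor for KV cache and activations (assumed, not an official figure)."""
    return num_params * BYTES_PER_PARAM[dtype] * overhead / 2**30

# DeepSeek-R1-Distill-Qwen-7B at bf16: roughly 7e9 params * 2 bytes each.
print(f"{model_memory_gib(7e9, 'bf16'):.1f} GiB")
```

By this estimate a 7B model at bf16 wants roughly 16 GiB of headroom, which is why instance memory must exceed the raw model size.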
Step 1: Create ECS Instance
Log in to the Volcengine ECS console (https://console.volcengine.com/ecs), select a region and availability zone, choose an appropriate instance type, and configure storage. The example uses an instance in the Shanghai region.
Step 2: Deploy Docker Environment and Enable the Model
Install Docker on the instance:
<code>sudo apt update</code>
<code>sudo apt install docker.io</code>
Run the Docker container with the DeepSeek‑R1‑Distill model:
<code>docker run -d --network host --privileged --shm-size 15g -v /data00/models:/data00/models -e MODEL_PATH=/data00/models -e PORT=8000 -e MODEL_NAME=DeepSeek-R1-Distill-Qwen-7B -e DTYPE=bf16 -e KV_CACHE_DTYPE=fp16 ai-containers-cn-shanghai.cr.volces.com/deeplearning/xft-vllm:1.8.2.iaas bash /llama2/entrypoint.sh</code>
For the Beijing region, use the image <code>ai-containers-cn-beijing.cr.volces.com/deeplearning/xft-vllm:1.8.2.iaas</code> instead.
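After the container starts, the service still needs time to load the model weights. A small readiness probe, assuming the container exposes the vLLM-style OpenAI-compatible <code>/v1/models</code> endpoint on the configured port (a sketch, not part of the official image):

```python
import urllib.request

def service_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True once the OpenAI-compatible server answers GET /v1/models."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP/URL error
        return False

if __name__ == "__main__":
    import time
    # Poll the local deployment until the model finishes loading.
    while not service_ready("http://127.0.0.1:8000"):
        time.sleep(5)
    print("service is ready")
```

You can also watch progress directly with <code>docker logs</code> on the container.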
The environment variables control the deployment:
MODEL_PATH: directory inside the container where the model weights are mounted.
PORT: port the inference service listens on.
MODEL_NAME: model to load (here DeepSeek-R1-Distill-Qwen-7B).
DTYPE: computation precision (bf16).
KV_CACHE_DTYPE: precision of the KV cache (fp16).
Step 3: Test and Verify
Execute a curl request to confirm the service is running:
<code>curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "xft",
"messages":[{"role":"user","content":"你好!请问你是谁?"}],
"max_tokens": 256,
"temperature": 0.6
}'</code>
A successful response returns a JSON body containing the model's reply, confirming the service is up.
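The same request can be issued programmatically. A sketch using only the Python standard library; the endpoint and model name match the deployment above:

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 256,
                       temperature: float = 0.6) -> dict:
    """Build an OpenAI-compatible chat-completions payload for the xft service."""
    return {
        "model": "xft",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(base_url: str, prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("http://127.0.0.1:8000", "Hello! Who are you?"))
```

This mirrors the curl call one-for-one, so it can serve as the starting point for integrating the service into an application.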
Conclusion
The entire process demonstrates how to quickly launch the DeepSeek‑R1‑Distill model service using Volcengine CPU cloud products, offering a low‑cost, high‑compatibility solution for interested users.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.