Deploy DeepSeek, Llama, Qwen Models Fast on Baidu Baige AI Heterogeneous Platform

This guide walks you through creating a lightweight compute instance, adding it to Baidu Baige AI heterogeneous computing platform, deploying the vLLM tool, loading and serving small‑scale dense models such as DeepSeek, Llama and Qwen, and provides recommended configuration lists to achieve low‑cost, high‑performance inference.

Baidu Geek Talk
Baidu Geek Talk
Baidu Geek Talk
Deploy DeepSeek, Llama, Qwen Models Fast on Baidu Baige AI Heterogeneous Platform

1. Provision a lightweight compute instance

Use the Baidu Baige AI platform to create a compute instance of type H20 (identifier ebc.lgn7t.c208m2048.8h20.4d). This instance provides 8 vCPU, 20 GB memory, and GPU resources suitable for dense small‑scale models such as DeepSeek‑V3, DeepSeek‑R1, Llama, and Qwen.

2. Install vLLM from the Tool Market

In the left navigation panel select “Tool Market”, locate the vLLM tool and click Deploy . The platform pulls the vLLM container image and starts the service on the provisioned instance.

3. Prepare the model and start inference

After vLLM is running, SSH into the instance, download the desired model checkpoint from its official repository, and launch vLLM with appropriate arguments. Example commands:

git clone https://github.com/deepseek-ai/DeepSeek-Model.git
cd DeepSeek-Model
python -m vllm.entrypoints.openai \
    --model-path ./deepseek-v3 \
    --tensor-parallel-size 2 \
    --max-model-len 4096

Optionally install a WebUI client (e.g., an OpenAI‑compatible UI) and send a POST request to http://<instance_ip>:8000/v1/chat/completions with the standard JSON payload to start a conversation.

4. Recommended hardware configurations

The following table (illustrated) lists the minimum instance specifications for each model series. For example, DeepSeek‑V3 requires at least 8 GB GPU memory, while Llama‑2‑13B benefits from 16 GB GPU memory.

Model configuration table
Model configuration table

5. Platform capabilities

Baidu Baige AI offers full lifecycle management, proprietary inference acceleration, and automatic resource fragmentation handling. These features improve service stability, lower inference cost, and increase throughput for deployed models.

Reference: https://cloud.baidu.com/product/aihc.html

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

vLLMDeepSeekinferenceAI model deploymentCloud AIBaidu BaigeModel configuration
Baidu Geek Talk
Written by

Baidu Geek Talk

Follow us to discover more Baidu tech insights.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.