
How to Deploy the QwQ-32B Large Language Model on Volcengine Cloud in Minutes

This guide walks you through the end‑to‑end process of deploying the open‑source QwQ‑32B reasoning model on Volcengine's cloud platform, covering GPU ECS selection, VKE cluster creation, Continuous Delivery (CP) setup, vLLM service launch, and API gateway exposure.

ByteDance Cloud Native

In the past year, AI technology has advanced rapidly, becoming a primary driver of innovation across industries. Enterprises are eager to operationalize large models to drive business growth and achieve intelligent transformation. Volcengine Cloud Base introduces a series of cloud‑based practices to help users quickly experience various large models.

QwQ‑32B is a newly open‑sourced reasoning model that excels on benchmarks such as AIME24 (mathematical reasoning), LiveCodeBench (coding), LiveBench, IFEval (instruction following), and BFCL. It leverages large‑scale reinforcement learning to strengthen its reasoning abilities, and it explicitly shows its chain of thought during inference for better interpretability.

To let enterprise users quickly try QwQ‑32B in a cloud environment, this article combines Volcengine GPU ECS, Container Service VKE, and Continuous Delivery CP to propose a rapid deployment solution using vLLM.

Step 1: Create a VKE Cluster

Before deploying the QwQ‑32B inference service, a VKE (Kubernetes‑based) cluster must be created. VKE efficiently manages the massive heterogeneous compute, storage, and network resources required by AI workloads and provides cloud‑native capabilities such as elastic scaling across clouds.

In the VKE console, create a managed cluster; the VPC‑CNI network model is recommended.

QwQ‑32B has 32 billion parameters and requires at least 80 GB of total GPU memory. Choose GPU ECS instances accordingly; recommended instance types are listed in the accompanying table.
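The 80 GB figure can be sanity‑checked with a back‑of‑envelope calculation. The sketch below is illustrative, not official sizing guidance: it only accounts for the weights themselves, assuming 2 bytes per parameter (BF16/FP16).

```python
def estimate_weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the model weights, in GiB.

    BF16/FP16 weights take 2 bytes per parameter; quantized formats take less.
    """
    return num_params * bytes_per_param / 1024**3

# 32 billion parameters at 2 bytes each -> roughly 60 GiB for weights alone.
weights_gib = estimate_weight_memory_gib(32e9)
print(f"weights: {weights_gib:.1f} GiB")
```

The KV cache, activations, and CUDA runtime overhead come on top of the weights, which is why the recommendation is at least 80 GB of total GPU memory rather than just the ~60 GiB the weights occupy.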

Step 2: Create a Deployment Cluster

Use Volcengine Continuous Delivery CP’s AI Application feature, which provides pre‑configured templates with popular AI frameworks. This accelerates deployment in the container service.

In the CP console, go to “Resource Management → Deployment Resources” and click “Create Deployment Resource”.

In the form, set the access type to “Container Service VKE”, choose the region and the VKE cluster created in Step 1, and set the sharing scope to “All workspaces”.

Step 3: Create the AI Application

In the CP console, go to “AI Application” (invite‑only) and click “Create Application”.

Select “Custom Create”.

Provide the application name and select the deployment cluster created earlier.

Choose the vLLM image for deployment and select the official QwQ‑32B model, mounting it at <code>/model</code>.

The default vLLM launch command is:

<code>vllm serve /model --host 0.0.0.0 --port 8080 --max-model-len 2048 --gpu-memory-utilization 0.9 --tensor-parallel-size ${GPU_NUM}</code>

Replace <code>${GPU_NUM}</code> with the actual number of GPU cards in the selected instance.
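The substitution can be sketched in a few lines of Python. This is an illustrative stand‑in for whatever mechanism performs the replacement (for example, the container entrypoint could obtain the card count from <code>nvidia-smi -L | wc -l</code> on the node):

```python
from string import Template

# The launch command from the article, with ${GPU_NUM} as a template placeholder.
LAUNCH_TEMPLATE = Template(
    "vllm serve /model --host 0.0.0.0 --port 8080 "
    "--max-model-len 2048 --gpu-memory-utilization 0.9 "
    "--tensor-parallel-size ${GPU_NUM}"
)

def render_launch_command(gpu_num: int) -> str:
    """Fill in the GPU count to produce the final vLLM launch command."""
    return LAUNCH_TEMPLATE.substitute(GPU_NUM=gpu_num)

print(render_launch_command(4))
```

With <code>--tensor-parallel-size</code> set to the card count, vLLM shards the model's weights evenly across all GPUs in the instance, so the per‑card memory requirement drops proportionally.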

Step 4: Configure API Gateway Access

Volcengine API Gateway (APIG) provides a cloud‑native, highly available gateway service. To expose the inference service:

In the AI Application page, click “Access Settings”.

Add an API Gateway, selecting HTTP 1.1. If no gateway exists, create one.

Ensure the gateway’s private network matches the VKE cluster’s network, and choose a 1c2g (1 vCPU, 2 GB memory) node specification with two nodes.

After creation, the public domain appears in the “Access Settings” page.

Finally, test the service with a curl request:

<code>curl -X POST http://example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{"role": "user", "content": "Your question"}],
    "temperature": 0.7
}'</code>
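The same test can be issued from Python using only the standard library. The sketch below builds the identical request; <code>example.com</code> is a placeholder for the public domain shown in the “Access Settings” page:

```python
import json
import urllib.request

def build_chat_request(question: str, base_url: str = "http://example.com"):
    """Build the OpenAI-compatible chat completion request sent by the curl test."""
    payload = {
        "model": "/model",  # vLLM serves the model under its mount path
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Against a live gateway, send the request and read the first choice:
# resp = urllib.request.urlopen(build_chat_request("Your question"))
# print(json.load(resp)["choices"][0]["message"]["content"])
```

Note that the model name in the payload is the mount path <code>/model</code>, matching the path passed to <code>vllm serve</code>.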

In summary, using Volcengine GPU ECS, VKE, and CP enables rapid deployment of the QwQ‑32B large model. Enterprise customers can further tune the architecture to fully leverage the model’s capabilities.

Tags: vLLM, large language model, cloud deployment, QwQ-32B, GPU ECS, VKE
Written by ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.