
Deploying QwQ-32B LLM with vLLM on Alibaba Cloud ACK and Configuring Intelligent Routing

This guide explains how to deploy the QwQ-32B large language model using vLLM on an Alibaba Cloud ACK Kubernetes cluster, configure storage, set up OpenWebUI, enable ACK Gateway with AI Extension for intelligent routing, and benchmark the inference service performance.

Alibaba Cloud Infrastructure

Background: Alibaba Cloud recently released the QwQ-32B model (32 billion parameters), whose performance rivals that of DeepSeek‑R1 (671 billion parameters). The vLLM framework provides efficient inference with features such as PagedAttention, continuous batching, and model quantization.

Prerequisites: A GPU‑enabled ACK Kubernetes cluster (e.g., an ecs.gn7i-c32g1.32xlarge instance with 4 × A10 GPUs) and an OSS bucket for model storage.

Step 1 – Prepare Model Data:

GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
cd QwQ-32B
git lfs pull

Upload the downloaded model files to OSS (replace <your-bucket> with the name of your bucket):

ossutil mkdir oss://<your-bucket>/QwQ-32B
ossutil cp -r ./QwQ-32B oss://<your-bucket>/QwQ-32B

Configure a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that use the OSS static volume (example configurations are shown in the original tables).
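As one possible shape for that configuration, a static OSS volume on ACK can be declared roughly as follows. This is a hedged sketch, not the article's exact manifest: the bucket name, region endpoint, secret name, and ossfs mount options are assumptions to adapt to your cluster.

```shell
kubectl apply -f- <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: qwq-32b-pv
  labels:
    alicloud-pvname: qwq-32b-pv
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadOnlyMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com   # ACK's OSS CSI driver
    volumeHandle: qwq-32b-pv
    nodePublishSecretRef:
      name: oss-secret                       # Secret holding the AccessKey for the bucket (assumed name)
      namespace: default
    volumeAttributes:
      bucket: "<your-bucket>"
      url: "oss-cn-hangzhou-internal.aliyuncs.com"   # assumed region endpoint
      path: "/QwQ-32B"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwq-32b-pvc
spec:
  accessModes: ["ReadOnlyMany"]
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      alicloud-pvname: qwq-32b-pv
EOF
```

The PVC selects the PV by label so the model files become mountable read-only in the inference pods.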

Step 2 – Deploy Inference Service (vLLM deployment):

Apply the vLLM Deployment and its Service with kubectl apply -f- (the inline manifest is truncated in this extract).
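A minimal sketch of what such a manifest might look like, assuming the PVC from the storage step, a public vLLM OpenAI-compatible server image, and 4-way tensor parallelism across the A10 GPUs (image tag, names, and ports are assumptions, not the article's exact manifest):

```shell
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwq-32b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwq-32b
  template:
    metadata:
      labels:
        app: qwq-32b
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwq-32b-pvc
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed image; a registry mirror may be faster on ACK
          command: ["vllm", "serve", "/models/QwQ-32B",
                    "--served-model-name", "qwq-32b",
                    "--tensor-parallel-size", "4",
                    "--dtype", "bfloat16",
                    "--port", "8000"]
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: model
              mountPath: /models/QwQ-32B    # OSS volume mounted read-only with the model weights
          resources:
            limits:
              nvidia.com/gpu: "4"
---
apiVersion: v1
kind: Service
metadata:
  name: qwq-32b
spec:
  selector:
    app: qwq-32b
  ports:
    - port: 8000
      targetPort: 8000
EOF
```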

Step 3 – Deploy OpenWebUI:

Apply the OpenWebUI Deployment and Service with kubectl apply -f- (the inline manifest is truncated in this extract).
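A sketch of one way this could look, assuming the vLLM Service is named qwq-32b on port 8000 (the image tag and names are assumptions; OPENAI_API_BASE_URL is Open WebUI's standard way to point at an OpenAI-compatible backend):

```shell
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
        - name: openwebui
          image: ghcr.io/open-webui/open-webui:main
          env:
            - name: OPENAI_API_BASE_URL   # point the UI at the vLLM OpenAI-compatible endpoint
              value: "http://qwq-32b:8000/v1"
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: openwebui
spec:
  selector:
    app: openwebui
  ports:
    - port: 8080
      targetPort: 8080
EOF
```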

Step 4 – Verify Inference Service: Run kubectl port-forward svc/openwebui 8080:8080, open http://localhost:8080, log into OpenWebUI, and test a prompt (e.g., "0.11和0.9谁大?" — "Which is larger, 0.11 or 0.9?").

Optional Step 5 – Benchmark Inference Service:

Deploy a benchmark pod, download the ShareGPT_V3 dataset, and run benchmark_serving.py with appropriate parameters. Sample output reports request throughput, token throughput, and latency metrics (e.g., mean TTFT ≈ 4.9 s and output token throughput ≈ 101.89 tok/s at a request concurrency of 8).
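Run from inside the benchmark pod, the step might look roughly like this. The dataset URL is the commonly used ShareGPT_V3 mirror on Hugging Face; the script flags follow vLLM's benchmark_serving.py conventions but the exact values the article used are assumptions:

```shell
# Fetch the ShareGPT_V3 dataset used as benchmark input
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Drive load against the in-cluster vLLM endpoint (service name and flags are assumptions)
python3 benchmark_serving.py \
  --backend vllm \
  --base-url http://qwq-32b:8000 \
  --model /models/QwQ-32B \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 80 \
  --max-concurrency 8
```

The script prints request throughput, input/output token throughput, and TTFT/TPOT latency percentiles at the end of the run.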

Intelligent Routing with ACK Gateway:

Enable the ACK Gateway with AI Extension component, create a GatewayClass and Gateway with listeners on ports 8080 (standard HTTP) and 8081 (inference extension). Define HTTPRoute for the backend service and create InferencePool and InferenceModel CRDs to bind the QwQ‑32B model to the gateway.
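The InferencePool/InferenceModel binding could be sketched as follows, using the upstream Gateway API Inference Extension CRD shapes; the API version, extension reference, and names are assumptions and may differ in the ACK-packaged component:

```shell
kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwq-32b-pool
spec:
  targetPortNumber: 8000     # port the vLLM pods serve on
  selector:
    app: qwq-32b             # selects the vLLM inference pods
  extensionRef:
    name: qwq-32b-epp        # endpoint-picker extension installed by the gateway component (assumed name)
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwq-32b
spec:
  modelName: qwq-32b         # model name clients send in requests
  criticality: Critical
  poolRef:
    name: qwq-32b-pool
EOF
```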

Verify routing by sending a POST request to the gateway IP on the appropriate port, specifying the model name in the request body.
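Such a verification request might look like the following; the gateway IP is a placeholder, and the served model name qwq-32b is an assumption:

```shell
curl -X POST "http://<gateway-ip>:8081/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwq-32b",
        "messages": [{"role": "user", "content": "Which is larger, 0.11 or 0.9?"}],
        "max_tokens": 256
      }'
```

A successful response confirms the gateway is routing requests through the inference extension to the vLLM backend.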

Observing Performance:

Collect vLLM metrics via Prometheus (e.g., gpu_cache_usage_perc, request_queue_time_seconds_sum, num_requests_running, avg_generation_throughput_toks_per_s, time_to_first_token_seconds_bucket). Import the provided Grafana JSON model to visualise these metrics.
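For instance, P99 TTFT can be derived from the histogram buckets with a PromQL query like the one below; the vllm: metric prefix and the Prometheus address are assumptions:

```shell
# Query P99 time-to-first-token over the last 5 minutes via the Prometheus HTTP API
curl -s "http://<prometheus-host>:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))'
```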

Run comparative benchmarks against the default gateway (port 8080) and the inference‑extension gateway (port 8081). Results show the extension reduces mean TTFT by 26.8 % and P99 TTFT by 62.32 % while improving cache utilisation.

Conclusion: The tutorial demonstrates rapid deployment of the QwQ‑32B model on ACK with modest resource requirements (bf16 precision needs roughly 64 GB of GPU memory, served here on 4 × A10 GPUs). The ACK Gateway with AI Extension provides superior routing for LLM workloads, yielding lower latency and higher throughput than traditional least‑request scheduling.

Tags: LLM, Kubernetes, vLLM, benchmark, Inference, ACK, QwQ-32B
Written by Alibaba Cloud Infrastructure