Artificial Intelligence 17 min read

Deploying QwQ-32B LLM with vLLM on Alibaba Cloud ACK and Configuring Intelligent Routing

This guide explains how to deploy the QwQ-32B large language model using vLLM on an Alibaba Cloud ACK Kubernetes cluster, configure storage, set up OpenWebUI, enable ACK Gateway with AI Extension for intelligent routing, and benchmark the inference service performance.

Alibaba Cloud Infrastructure

Mar 8, 2025

Deploying QwQ-32B LLM with vLLM on Alibaba Cloud ACK and Configuring Intelligent Routing

Background : Alibaba Cloud recently released the QwQ-32B model (320 billion parameters) whose performance rivals DeepSeek‑R1 671B. The vLLM framework provides efficient inference with features such as PagedAttention, dynamic batching, and model quantization.

Prerequisites : A GPU‑enabled ACK Kubernetes cluster (e.g., ecs.gn7i-c32g1.32xlarge with 4 × A10 GPUs) and an OSS bucket for model storage.

Step 1 – Prepare Model Data :

GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
cd QwQ-32B
git lfs pull

Upload the downloaded model files to OSS:

ossutil mkdir oss://<Your-Bucket-Name>/QwQ-32B
ossutil cp -r ./QwQ-32B oss://<Your-Bucket-Name>/QwQ-32B

Configure a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that use the OSS static volume (example configurations are shown in the original tables).

Step 2 – Deploy Inference Service (vLLM deployment):

kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwq-32b
  name: qwq-32b
  namespace: default
spec:
  replicas: 5 # adjust to GPU node count
  selector:
    matchLabels:
      app: qwq-32b
  template:
    metadata:
      labels:
        app: qwq-32b
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
      containers:
        - command:
            - sh
            - -c
            - vllm serve /models/QwQ-32B --port 8000 --trust-remote-code --served-model-name qwq-32b --tensor-parallel=4 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
          image: registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:v0.7.2
          name: vllm
          ports:
            - containerPort: 8000
          readinessProbe:
            tcpSocket:
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            limits:
              nvidia.com/gpu: "4"
          volumeMounts:
            - mountPath: /models/QwQ-32B
              name: model
            - mountPath: /dev/shm
              name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: qwq-32b-v1
spec:
  type: ClusterIP
  ports:
    - port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: qwq-32b
EOF

Step 3 – Deploy OpenWebUI :

kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
        - env:
            - name: ENABLE_OPENAI_API
              value: "True"
            - name: ENABLE_OLLAMA_API
              value: "False"
            - name: OPENAI_API_BASE_URL
              value: http://qwq-32b-v1:8000/v1
            - name: ENABLE_AUTOCOMPLETE_GENERATION
              value: "False"
            - name: ENABLE_TAGS_GENERATION
              value: "False"
          image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/open-webui:main
          name: openwebui
          ports:
            - containerPort: 8080
          volumeMounts:
            - mountPath: /app/backend/data
              name: data-volume
      volumes:
        - emptyDir: {}
          name: data-volume
---
apiVersion: v1
kind: Service
metadata:
  name: openwebui
spec:
  type: ClusterIP
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app: openwebui
EOF

Step 4 – Verify Inference Service : Use kubectl port-forward svc/openwebui 8080:8080 and access http://localhost:8080 to log into OpenWebUI and test a prompt (e.g., "0.11和0.9谁大？").

Optional Step 5 – Benchmark Inference Service :

Deploy a benchmark pod, download the ShareGPT_V3 dataset, and run benchmark_serving.py with appropriate parameters. Sample output shows request throughput, token throughput, and latency metrics (e.g., TTFT ≈ 4.9 s, output token throughput ≈ 101.89 tok/s for 8‑concurrency).

Intelligent Routing with ACK Gateway :

Enable the ACK Gateway with AI Extension component, create a GatewayClass and Gateway with listeners on ports 8080 (standard HTTP) and 8081 (inference extension). Define HTTPRoute for the backend service and create InferencePool and InferenceModel CRDs to bind the QwQ‑32B model to the gateway.

Verify routing by sending a POST request to the gateway IP on the appropriate port and model name.

Observing Performance :

Collect vLLM metrics via Prometheus (e.g., gpu_cache_usage_perc, request_queue_time_seconds_sum, num_requests_running, avg_generation_throughput_toks_per_s, time_to_first_token_seconds_bucket). Import the provided Grafana JSON model to visualise these metrics.

Run comparative benchmarks against the default gateway (port 8080) and the inference‑extension gateway (port 8081). Results show the extension reduces mean TTFT by 26.8 % and P99 TTFT by 62.32 % while improving cache utilisation.

Conclusion : The tutorial demonstrates rapid deployment of the QwQ‑32B model on ACK with reduced resource requirements (bf16 precision on 64 GB GPU memory, 4 × A10 GPUs). The ACK Gateway with AI Extension provides superior routing for LLM workloads, yielding lower latency and higher throughput compared to traditional least‑request scheduling.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM Kubernetes vLLM benchmark inference ACK QwQ-32B

Written by

Alibaba Cloud Infrastructure

For uninterrupted computing services

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.