
Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide to deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes. It combines ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, and closes with performance benchmarking and analysis.

Alibaba Cloud Infrastructure

The ACK Gateway with Inference Extension component is designed for LLM inference scenarios. It offers Layer 4/Layer 7 traffic routing and load balancing based on model‑server load awareness, and supports custom traffic‑splitting strategies such as model gray release (canary rollout) and traffic mirroring via the InferencePool and InferenceModel CRDs.

vLLM provides high‑performance inference for massive language models by employing tensor parallelism (splitting weight matrices across GPUs) and pipeline parallelism (partitioning model layers across devices), enabling efficient multi‑node deployment.
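As a rough illustration of how these two parallelism dimensions compose (vLLM performs this placement internally; the rank ordering below is illustrative, not vLLM's actual scheme), tensor‑parallel size 2 and pipeline‑parallel size 2 mean each of the four GPU ranks owns one tensor shard of one pipeline stage:

```python
def placement(tp_size: int, pp_size: int) -> dict[int, tuple[int, int]]:
    """Map each global rank to (pipeline stage, tensor shard).

    Tensor parallelism splits each weight matrix across tp_size GPUs;
    pipeline parallelism splits the stack of layers into pp_size stages.
    Total GPUs required is tp_size * pp_size.
    """
    world_size = tp_size * pp_size
    return {rank: (rank // tp_size, rank % tp_size) for rank in range(world_size)}

# The deployment in this article uses tp=2, pp=2 → 4 GPUs across 2 nodes.
print(placement(2, 2))
# {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
```

With this layout, tensor‑parallel peers (same stage) sit on the same node for fast intra‑node interconnect, while pipeline stages span nodes, where only activations cross the network.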

Environment preparation includes creating a GPU‑enabled Kubernetes cluster, ensuring at least four GPUs across nodes, and installing the LeaderWorkerSet controller.
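A minimal sketch of these preparation checks, assuming the upstream kubernetes‑sigs LeaderWorkerSet release manifest (the version tag is an assumption; consult the lws releases page for the current one):

```shell
# Verify at least 4 allocatable GPUs exist across the cluster's nodes
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Install the LeaderWorkerSet controller (version shown is illustrative)
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.5.1/manifests.yaml
```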

Step 1 – Model data: download the QwQ‑32B model, push it to OSS, and configure a PersistentVolume (PV) and PersistentVolumeClaim (PVC) for the model files.

GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
cd QwQ-32B
git lfs pull
cd ..
ossutil mkdir oss://<your-bucket>/QwQ-32B
ossutil cp -r ./QwQ-32B oss://<your-bucket>/QwQ-32B

Step 2 – Deploy inference service: apply a LeaderWorkerSet YAML that creates a leader pod and a worker pod (each with 2 GPUs) forming a Ray cluster, and runs vLLM with --tensor-parallel-size 2 and --pipeline-parallel-size 2.

kubectl apply -f- <<EOF
... (LeaderWorkerSet manifest) ...
EOF
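The manifest body is elided in the article; the sketch below shows what such a LeaderWorkerSet might look like (apiVersion per kubernetes‑sigs/lws; the names, image, PVC claim, and mount paths are assumptions, not the article's exact YAML):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-qwq-32b
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                                  # 1 leader + 1 worker pod, 2 GPUs each
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest     # image name is an assumption
          command: ["sh", "-c"]
          args:
          - ray start --head --port=6379;
            vllm serve /models/QwQ-32B
              --served-model-name qwq
              --tensor-parallel-size 2
              --pipeline-parallel-size 2
              --distributed-executor-backend ray
          resources:
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
          - name: model
            mountPath: /models
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwq-32b-pvc           # PVC configured in Step 1 (name assumed)
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:
          - ray start --address=$(LWS_LEADER_ADDRESS):6379 --block
          resources:
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
          - name: model
            mountPath: /models
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwq-32b-pvc
```

The worker joins the leader's Ray cluster via the LWS‑injected LWS_LEADER_ADDRESS variable, so the vLLM process on the leader can schedule pipeline stages across both pods.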

Step 3 – ACK Gateway configuration: create a GatewayClass and a Gateway exposing ports 8080 (standard HTTP routing) and 8081 (inference‑extension routing), then define a BackendTrafficPolicy, a ClientTrafficPolicy, and an HTTPRoute that forwards traffic to the distributed‑serving Service.

kubectl apply -f- <<EOF
... (GatewayClass, Gateway, BackendTrafficPolicy, ClientTrafficPolicy, and HTTPRoute manifests) ...
EOF
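A sketch of the Gateway‑side resources using standard Gateway API kinds (the gatewayClassName, resource names, and Service name/port are assumptions; the ACK‑specific BackendTrafficPolicy and ClientTrafficPolicy CRDs are omitted here):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: ack-gateway            # assumed class name
  listeners:
  - name: http                             # port 8080: standard HTTP routing
    protocol: HTTP
    port: 8080
  - name: inference                        # port 8081: inference-extension routing
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-route
spec:
  parentRefs:
  - name: inference-gateway
    sectionName: http
  rules:
  - backendRefs:
    - name: vllm-qwq-32b-service           # distributed-serving Service (name assumed)
      port: 8000
```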

Step 4 – Enable inference extension: create an InferencePool that selects the leader pods and an InferenceModel that routes 100% of requests for the model name qwq to the QwQ‑32B model.

kubectl apply -f- <<EOF
... (InferencePool manifest) ...
EOF
kubectl apply -f- <<EOF
... (InferenceModel manifest) ...
EOF
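A sketch of the two inference‑extension CRDs; field names follow the upstream Gateway API Inference Extension, and the pod‑selector label, pool name, and port are assumptions:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwq-pool
spec:
  targetPortNumber: 8000
  selector:
    leaderworkerset.sigs.k8s.io/role: leader   # select only leader pods (label assumed)
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwq
spec:
  modelName: qwq                               # model name clients send in requests
  poolRef:
    name: qwq-pool
  targetModels:
  - name: QwQ-32B
    weight: 100                                # route 100% of "qwq" traffic here
```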

Step 5 – Benchmarking: deploy a vLLM‑benchmark pod, download a ShareGPT dataset, and run the provided Python benchmark script against both ports (8080 and 8081). The results show that ACK Gateway's intelligent routing reduces average TTFT from 10,909 ms to 7,336 ms (≈32% improvement) and slightly increases token throughput.
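A sketch of the benchmark run against both listeners, assuming vLLM's benchmark_serving.py script, the commonly used ShareGPT_V3 dataset file, and a GATEWAY_IP variable holding the Gateway address (the prompt count and request rate are illustrative):

```shell
# Download the ShareGPT dataset used for realistic prompt/response lengths
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Port 8080: standard HTTP routing; port 8081: inference-extension routing
for port in 8080 8081; do
  python3 benchmark_serving.py \
    --backend vllm \
    --base-url "http://${GATEWAY_IP}:${port}" \
    --model qwq \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 \
    --request-rate 4
done
```

Comparing the two runs isolates the effect of load‑aware routing, since the backend pods and model are identical in both cases.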

Overall, the combination of ACK Gateway’s load‑aware routing, vLLM’s parallelism, and Kubernetes orchestration delivers lower latency, higher throughput, and better cache utilization for large‑scale LLM inference workloads.

Tags: LLM, Distributed Inference, Kubernetes, vLLM, Pipeline Parallel, ACK Gateway, Tensor Parallel