
Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide to deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes. It combines ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, and closes with performance benchmarking and analysis.

Alibaba Cloud Infrastructure

The ACK Gateway with Inference Extension component is designed for LLM inference scenarios. It offers Layer 4/Layer 7 traffic routing and load balancing based on model‑server load awareness, and supports custom traffic‑splitting strategies such as model gray release (canary rollout) and traffic mirroring via the InferencePool and InferenceModel CRDs.

vLLM provides high‑performance inference for massive language models by employing tensor parallelism (splitting weight matrices across GPUs) and pipeline parallelism (partitioning model layers across devices), enabling efficient multi‑node deployment.
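As a rough illustration of how these two parallelism dimensions compose (vLLM performs this placement internally; the rank ordering below is illustrative, not vLLM's actual scheme), tensor‑parallel size 2 and pipeline‑parallel size 2 mean each of the four GPU ranks owns one tensor shard of one pipeline stage:

```python
def placement(tp_size: int, pp_size: int) -> dict[int, tuple[int, int]]:
    """Map each global rank to (pipeline stage, tensor shard).

    Tensor parallelism splits each weight matrix across tp_size GPUs;
    pipeline parallelism splits the stack of layers into pp_size stages.
    Total GPUs required is tp_size * pp_size.
    """
    world_size = tp_size * pp_size
    return {rank: (rank // tp_size, rank % tp_size) for rank in range(world_size)}

# The deployment in this article uses tp=2, pp=2 → 4 GPUs across 2 nodes.
print(placement(2, 2))
# {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
```

With this layout, tensor‑parallel peers (same stage) sit on the same node for fast intra‑node interconnect, while pipeline stages span nodes, where only activations cross the network.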

Environment preparation includes creating a GPU‑enabled Kubernetes cluster, ensuring at least four GPUs across nodes, and installing the LeaderWorkerSet controller.
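A minimal sketch of these preparation checks, assuming the upstream kubernetes‑sigs LeaderWorkerSet release manifest (the version tag is an assumption; consult the lws releases page for the current one):

```shell
# Verify at least 4 allocatable GPUs exist across the cluster's nodes
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Install the LeaderWorkerSet controller (version shown is illustrative)
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.5.1/manifests.yaml
```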

Step 1 – Model data: download the QwQ‑32B model, push it to OSS, and configure a PersistentVolume (PV) and PersistentVolumeClaim (PVC) for the model files.

GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
cd QwQ-32B
git lfs pull
cd ..
ossutil mkdir oss://<your-bucket>/QwQ-32B
ossutil cp -r ./QwQ-32B oss://<your-bucket>/QwQ-32B

Step 2 – Deploy inference service: apply a LeaderWorkerSet YAML that creates a leader pod and a worker pod (each with 2 GPUs) forming a Ray cluster, and runs vLLM with --tensor-parallel-size 2 and --pipeline-parallel-size 2.

kubectl apply -f- <<EOF
... (LeaderWorkerSet manifest) ...
EOF
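The manifest body is elided in the article; the sketch below shows what such a LeaderWorkerSet might look like (apiVersion per kubernetes‑sigs/lws; the names, image, PVC claim, and mount paths are assumptions, not the article's exact YAML):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-qwq-32b
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                                  # 1 leader + 1 worker pod, 2 GPUs each
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest     # image name is an assumption
          command: ["sh", "-c"]
          args:
          - ray start --head --port=6379;
            vllm serve /models/QwQ-32B
              --served-model-name qwq
              --tensor-parallel-size 2
              --pipeline-parallel-size 2
              --distributed-executor-backend ray
          resources:
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
          - name: model
            mountPath: /models
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwq-32b-pvc           # PVC configured in Step 1 (name assumed)
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:
          - ray start --address=$(LWS_LEADER_ADDRESS):6379 --block
          resources:
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
          - name: model
            mountPath: /models
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwq-32b-pvc
```

The worker joins the leader's Ray cluster via the LWS‑injected LWS_LEADER_ADDRESS variable, so the vLLM process on the leader can schedule pipeline stages across both pods.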

Step 3 – ACK Gateway configuration: create a GatewayClass and a Gateway exposing ports 8080 (standard HTTP routing) and 8081 (inference‑extension routing), then define a BackendTrafficPolicy, a ClientTrafficPolicy, and an HTTPRoute that forwards traffic to the distributed‑serving Service.

kubectl apply -f- <<EOF
... (GatewayClass, Gateway, BackendTrafficPolicy, ClientTrafficPolicy, and HTTPRoute manifests) ...
EOF
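A sketch of the Gateway‑side resources using standard Gateway API kinds (the gatewayClassName, resource names, and Service name/port are assumptions; the ACK‑specific BackendTrafficPolicy and ClientTrafficPolicy CRDs are omitted here):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: ack-gateway            # assumed class name
  listeners:
  - name: http                             # port 8080: standard HTTP routing
    protocol: HTTP
    port: 8080
  - name: inference                        # port 8081: inference-extension routing
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-route
spec:
  parentRefs:
  - name: inference-gateway
    sectionName: http
  rules:
  - backendRefs:
    - name: vllm-qwq-32b-service           # distributed-serving Service (name assumed)
      port: 8000
```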

Step 4 – Enable inference extension: create an InferencePool that selects the leader pods and an InferenceModel that routes 100% of requests for the model name qwq to the QwQ‑32B model.

kubectl apply -f- <<EOF
... (InferencePool manifest) ...
EOF
kubectl apply -f- <<EOF
... (InferenceModel manifest) ...
EOF
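A sketch of the two inference‑extension CRDs; field names follow the upstream Gateway API Inference Extension, and the pod‑selector label, pool name, and port are assumptions:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwq-pool
spec:
  targetPortNumber: 8000
  selector:
    leaderworkerset.sigs.k8s.io/role: leader   # select only leader pods (label assumed)
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwq
spec:
  modelName: qwq                               # model name clients send in requests
  poolRef:
    name: qwq-pool
  targetModels:
  - name: QwQ-32B
    weight: 100                                # route 100% of "qwq" traffic here
```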

Step 5 – Benchmarking: deploy a vLLM‑benchmark pod, download a ShareGPT dataset, and run the provided Python benchmark script against both ports (8080 and 8081). The results show that ACK Gateway's intelligent routing reduces average TTFT from 10,909 ms to 7,336 ms (≈32% improvement) and slightly increases token throughput.
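A sketch of the benchmark run against both listeners, assuming vLLM's benchmark_serving.py script, the commonly used ShareGPT_V3 dataset file, and a GATEWAY_IP variable holding the Gateway address (the prompt count and request rate are illustrative):

```shell
# Download the ShareGPT dataset used for realistic prompt/response lengths
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Port 8080: standard HTTP routing; port 8081: inference-extension routing
for port in 8080 8081; do
  python3 benchmark_serving.py \
    --backend vllm \
    --base-url "http://${GATEWAY_IP}:${port}" \
    --model qwq \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 \
    --request-rate 4
done
```

Comparing the two runs isolates the effect of load‑aware routing, since the backend pods and model are identical in both cases.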

Overall, the combination of ACK Gateway’s load‑aware routing, vLLM’s parallelism, and Kubernetes orchestration delivers lower latency, higher throughput, and better cache utilization for large‑scale LLM inference workloads.

Tags: LLM, Distributed Inference, Kubernetes, vLLM, Pipeline Parallel, ACK Gateway, Tensor Parallel