Optimizing TorchServe Inference Service Architecture for High‑Performance AI Deployment
This article details the engineering practice of optimizing TorchServe‑based AI inference services, covering background challenges, framework selection, GPU‑accelerated Torch‑TRT integration, CPU‑side preprocessing improvements, and deployment on Kubernetes to achieve higher throughput and lower resource consumption.
1 Background
Zhuanzhuan, a second‑hand e‑commerce platform, applies AI across search, recommendation, quality inspection, and customer service, but has faced insufficient GPU execution optimization, wasted compute resources, high application costs, and duplicated online/offline development logic.
This document presents an engineering practice of optimizing inference service deployment using TorchServe.
2 Problems and Solution Ideas
2.1 Current Situation
The previous architecture separated CPU and GPU services, so the CPU side (pre‑processing) became the performance bottleneck.
2.2 Problems
Iterative efficiency suffers because custom pre‑/post‑processing logic on CPU requires separate development and language stacks.
Network communication overhead is high for large images used in quality inspection.
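As a rough illustration of that overhead, consider shipping one uncompressed quality‑inspection photo between the CPU and GPU services. The image resolution and link speed below are illustrative assumptions, not measurements from the article:

```python
# Rough estimate of per-request network cost for sending a large image
# between a CPU preprocessing service and a GPU inference service.
# The image size and link speed are illustrative assumptions.

width, height, channels = 4000, 3000, 3    # hypothetical uncompressed RGB photo
payload_bytes = width * height * channels  # bytes on the wire if sent raw

link_gbps = 1.0                            # assumed 1 Gbps service-to-service link
transfer_ms = payload_bytes * 8 / (link_gbps * 1e9) * 1000

print(f"payload: {payload_bytes / 2**20:.1f} MiB, transfer: {transfer_ms:.0f} ms")
```

Even under these generous assumptions the raw transfer alone costs hundreds of milliseconds per request, which is why co‑locating pre‑processing with inference pays off.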
2.3 Solution Ideas
2.3.1 Framework Survey
A comparison of Triton, TorchServe, and TensorFlow Serving shows that all three meet the performance requirements, but their support for custom logic varies.

| Feature | Triton | TorchServe | TensorFlow Serving |
| --- | --- | --- | --- |
| Supported frameworks | TensorFlow, PyTorch, ONNX, TensorRT, OpenVINO, etc. | PyTorch only | TensorFlow only |
| Performance | High‑performance server with dynamic batching and model parallelism. | Good performance, multi‑threaded, strong GPU support. | Good performance, multi‑threaded, strong GPU support. |
| Ease of use | Complex configuration. | CLI tools and Python API; easier. | CLI tools and gRPC/REST API. |
| Community | NVIDIA, active. | Facebook, active. | Google, very active. |
Custom logic support is critical; TensorFlow’s @tf.function has limitations, while Triton Python Backend and TorchServe custom handlers both allow flexible Python logic.
```python
import tensorflow as tf

@tf.function
def fizzbuzz(n):
  for i in tf.range(n):
    if i % 3 == 0:
      tf.print('Fizz')
    elif i % 5 == 0:
      tf.print('Buzz')
    else:
      tf.print(i)

fizzbuzz(tf.constant(15))
```

Example of the Triton Python Backend:
```python
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:  # Triton requires this exact class name in model.py
    def initialize(self, args):
        self.model_config = args['model_config']

    def execute(self, requests):
        responses = []
        for request in requests:
            # Input names must match those declared in config.pbtxt
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1")
            out_0 = in_0.as_numpy() + in_1.as_numpy()
            out_tensor_0 = pb_utils.Tensor("OUT0", out_0.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor_0]))
        return responses
```

Example of a TorchServe custom handler:
```python
import numpy as np
import torch
from ts.torch_handler.base_handler import BaseHandler

class ImageClassifierHandler(BaseHandler):
    def initialize(self, context):
        """Initialize the model once when the worker starts."""
        self.model = SimpleCNN()  # SimpleCNN: the project's model class, defined elsewhere
        self.model.load_state_dict(
            torch.load('model.pth', map_location=torch.device('cuda:0')))
        self.model.eval()

    def preprocess(self, batch):
        """Convert incoming PIL images into a normalized batch tensor."""
        images = [img.convert('RGB').resize((224, 224)) for img in batch]
        tensors = [torch.tensor(np.array(img)).permute(2, 0, 1).float() / 255.0
                   for img in images]
        return torch.stack(tensors)

    def postprocess(self, outputs):
        """Return the predicted class index for each image."""
        _, predicted = torch.max(outputs, 1)
        return predicted.tolist()
```

2.3.2 Framework Selection
TensorFlow Serving was excluded due to limited framework support and declining popularity. TorchServe was chosen for its deep integration with PyTorch, ease of use, and sufficient support for custom logic.
3 TorchServe Practice
3.1 TorchServe Usage and Tuning
3.1.1 Workflow
The workflow is: (1) package the model weights and custom handler into a .mar archive, (2) register the .mar file with TorchServe, and (3) serve requests, each of which goes through download, pre‑processing, inference, and post‑processing.
```shell
torch-model-archiver --model-name your_model_name --version 1.0 \
  --serialized-file path_to_your_model.pth \
  --handler custom_handler.py \
  --extra-files path_to_any_extra_files
```

Using TorchServe custom handlers saved roughly 32 person‑days of development effort.
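Registration can then be driven through TorchServe's management API, which listens on port 8081 by default; `POST /models` accepts `url`, `initial_workers`, `batch_size`, and `max_batch_delay` parameters. The helper below only assembles the request URL; the host name, .mar file name, and tuning values are illustrative:

```python
# Build a TorchServe management-API call that registers a .mar file.
# Host, file name, and the worker/batching values are illustrative.

def register_model_url(host, mar_file, initial_workers=2,
                       batch_size=8, max_batch_delay=50):
    return (f"http://{host}:8081/models"
            f"?url={mar_file}"
            f"&initial_workers={initial_workers}"
            f"&batch_size={batch_size}"
            f"&max_batch_delay={max_batch_delay}"
            f"&synchronous=true")

url = register_model_url("localhost", "your_model_name.mar")
print(url)
# In practice, issue requests.post(url) against a running TorchServe instance.
```

Setting `batch_size` and `max_batch_delay` at registration time is also how TorchServe's dynamic batching is enabled for a model.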
3.1.2 torch‑trt
torch‑trt converts PyTorch models to TensorRT for accelerated inference.
```python
import torch
import torch_tensorrt

# Load the trained PyTorch model and switch to inference mode
model = torch.load('path_to_your_model.pth')
model.eval()

# Compile the model to TensorRT via torch-tensorrt
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.float32},
)

# Save the compiled TorchScript module
torch.jit.save(trt_model, 'path_to_trt_model.ts')
```

Performance comparison shows that torch‑trt reduces GPU utilization by 10–50% while increasing QPS from 10 to 17 and cutting GPU memory usage from 2 GB to 680 MB.
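Taking the measured numbers above at face value (and treating 2 GB as 2048 MB), the relative gains work out as follows:

```python
# Relative improvement implied by the torch-trt measurements quoted above
# (QPS 10 -> 17, GPU memory 2 GB -> 680 MB).

qps_before, qps_after = 10, 17
mem_before_mb, mem_after_mb = 2048, 680

throughput_gain = qps_after / qps_before          # x-fold QPS increase
memory_saving = 1 - mem_after_mb / mem_before_mb  # fraction of memory freed

print(f"throughput: {throughput_gain:.1f}x, memory saved: {memory_saving:.1%}")
```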
3.2 Pre‑ and Post‑Processing Optimization
CPU‑heavy preprocessing (OpenCV, NumPy, pandas) was replaced with GPU‑accelerated equivalents (cvCuda, cuDF) to lower CPU load.
```python
import cv2

# Read the image on the CPU
img = cv2.imread('your_image.jpg')

# Upload it to GPU memory (requires OpenCV built with CUDA support)
gpu_img = cv2.cuda_GpuMat()
gpu_img.upload(img)

# Apply a Gaussian blur entirely on the GPU
gaussian_filter = cv2.cuda.createGaussianFilter(gpu_img.type(), -1, (5, 5), 1.5)
blurred_gpu = gaussian_filter.apply(gpu_img)

# Download the result back to CPU memory
blurred_img = blurred_gpu.download()

cv2.imshow('Original Image', img)
cv2.imshow('Blurred Image (GPU)', blurred_img)
cv2.waitKey(0)
cv2.destroyAllWindows()
```

Benchmarks demonstrated up to a 4× throughput increase and significantly higher GPU utilization while keeping CPU usage moderate.
3.3 TorchServe on Kubernetes
Using the official Helm chart, TorchServe was deployed on a Kubernetes cluster with Prometheus and Grafana monitoring, achieving high availability and elastic scaling.
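For health checking in this setup, TorchServe exposes a built‑in `GET /ping` endpoint on the inference port (8080 by default), which can be wired into the pod spec as a readiness probe. This is a minimal sketch; the probe timings are illustrative assumptions, not values from the deployment described here:

```yaml
# Readiness probe against TorchServe's built-in /ping endpoint
# (inference port 8080 by default). Timing values are illustrative.
readinessProbe:
  httpGet:
    path: /ping
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
```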
```shell
kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
grafana-cbd8775fd-6f8l5       1/1     Running   0          4h12m
model-store-pod               1/1     Running   0          4h35m
prometheus-alertmanager-...   2/2     Running   0          4h42m
... (other monitoring pods) ...
torchserve-7d468f9894-fvmpj   1/1     Running   0          4h33m
```

4 Future Work
The current solution balances development efficiency and system performance, but challenges remain, such as CPU saturation during heavy pre‑/post‑processing and the lack of a unified code path for online and offline pipelines.
Future plans include supporting more complex scenarios (multi‑model inference, LLM serving) and extending cloud‑native capabilities beyond the initial implementation.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.