
Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT

By isolating CPU pre- and post-processing from GPU inference in separate processes and applying TensorRT's FP16/INT8 optimizations, a custom Python framework lifts a Python vision inference service from roughly 4.5 to 27.4 QPS (a 5-10× speedup across workloads) while making better use of each GPU and reducing cost.

DeWu Technology

Background: Increasing use of computer‑vision algorithms in production creates a need to improve Python inference service performance to reduce cost.

Two key techniques are employed: separating CPU and GPU processes, and accelerating models with TensorRT, which together achieve a 5‑10× QPS improvement.

The article is organized into three parts: theory (CUDA architecture), framework & tools, and practical optimization tips.

Theory: CUDA provides a parallel computing platform with host (CPU) and device (GPU) concepts, kernel functions, streams, and a typical execution flow (host‑to‑device copy, kernel launch, device‑to‑host copy).
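The host/device execution flow described above can be sketched in Python. This is a hedged illustration using PyTorch (which wraps CUDA's memcpy/kernel/memcpy cycle); it falls back to CPU when no GPU is present, so the numbers are the same either way:

```python
import torch

# Pick the device: "cuda" triggers the host-to-device copy, kernel launch,
# and device-to-host copy cycle; without a GPU we fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.arange(4, dtype=torch.float32)  # tensor allocated on the host (CPU)
x_dev = x.to(device)                      # host -> device copy (cudaMemcpy H2D)
y_dev = x_dev * 2 + 1                     # elementwise kernels run on the device
y = y_dev.cpu()                           # device -> host copy (cudaMemcpy D2H)

print(y.tolist())  # [1.0, 3.0, 5.0, 7.0]
```

Each arithmetic operation on `x_dev` is dispatched as one or more CUDA kernels on a stream; the explicit `.to(device)` and `.cpu()` calls are the two memory transfers bracketing the computation.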

Traditional Python inference services often use Flask or KServe and run CPU preprocessing and GPU inference in the same thread/process. This design suffers from severe bottlenecks due to the Python GIL and limited GPU kernel scheduling, resulting in low QPS.

Solution: isolate CPU logic (pre‑/post‑processing) and GPU logic (model inference) into separate processes coordinated by a Proxy process. This eliminates GIL contention and allows independent scaling of CPU and GPU workers.
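A minimal sketch of this separation with the standard library's `multiprocessing` module (worker names and the toy transforms are illustrative, not the framework's actual code). CPU and GPU stages live in different processes connected by queues, so each has its own interpreter and its own GIL:

```python
import multiprocessing as mp

# POSIX "fork" start method so workers inherit module state directly.
ctx = mp.get_context("fork")

def cpu_worker(in_q, gpu_q):
    """CPU process: preprocessing runs free of the GPU process's GIL."""
    while (item := in_q.get()) is not None:
        gpu_q.put(item * 2)          # stand-in for image decode/resize
    gpu_q.put(None)                  # forward the shutdown sentinel

def gpu_worker(gpu_q, out_q):
    """GPU process: would hold the model and run inference."""
    while (item := gpu_q.get()) is not None:
        out_q.put(item + 1)          # stand-in for a model forward pass
    out_q.put(None)

def run_pipeline(values):
    in_q, gpu_q, out_q = ctx.Queue(), ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=cpu_worker, args=(in_q, gpu_q)),
             ctx.Process(target=gpu_worker, args=(gpu_q, out_q))]
    for p in procs:
        p.start()
    for v in values:
        in_q.put(v)
    in_q.put(None)                   # sentinel shuts the pipeline down
    results = []
    while (r := out_q.get()) is not None:
        results.append(r)
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

Because the stages only share queues, scaling CPU workers independently of GPU workers is a matter of starting more `cpu_worker` processes against the same queues, which is the independent-scaling property the article describes.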

A custom Python framework implements this separation, requiring only the implementation of preprocess, inference, and postprocess interfaces while handling inter‑process communication automatically.
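The three-interface contract might look like the following. The class and method names here are assumptions for illustration (the article does not publish the framework's actual API); the point is that user code only fills in the three hooks while the framework decides which process each one runs in:

```python
from typing import Any

class InferenceService:
    """Hypothetical user-facing base class: the framework would route
    preprocess/postprocess to CPU worker processes and inference to the
    GPU process, handling inter-process communication itself."""

    def preprocess(self, request: Any) -> Any:   # runs in a CPU process
        raise NotImplementedError

    def inference(self, batch: Any) -> Any:      # runs in the GPU process
        raise NotImplementedError

    def postprocess(self, output: Any) -> Any:   # runs in a CPU process
        raise NotImplementedError

class ToyVisionService(InferenceService):
    def preprocess(self, request):
        # e.g. decode and normalize an image; here just scale a pixel value
        return request / 255.0

    def inference(self, batch):
        # e.g. a TensorRT engine forward pass; here a stand-in transform
        return batch * 2

    def postprocess(self, output):
        # e.g. map logits to labels; here round for a stable result
        return round(output, 3)

svc = ToyVisionService()
result = svc.postprocess(svc.inference(svc.preprocess(127.5)))
print(result)  # 1.0
```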

TensorRT acceleration: TensorRT parses ONNX models and compiles them into high-performance inference engines, applying graph optimizations such as layer fusion and node elimination, with multi-precision support (FP32/FP16/INT8). The workflow covers model parsing, optimization, serialization, and runtime execution.
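That workflow can be sketched with the TensorRT Python API. This is a hedged sketch assuming the TensorRT 8.x bindings, guarded so it degrades gracefully when TensorRT is not installed; the function name and error handling are illustrative:

```python
try:
    import tensorrt as trt  # NVIDIA TensorRT Python bindings
except ImportError:
    trt = None

def build_serialized_engine(onnx_path, use_fp16=True):
    """Parse an ONNX model and build a serialized TensorRT engine.

    Returns the serialized engine bytes, or None when TensorRT is missing.
    """
    if trt is None:
        return None
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):            # 1. model parsing
            raise RuntimeError("ONNX parse failed")
    config = builder.create_builder_config()      # 2. optimization settings
    if use_fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)     #    enable half precision
    # 3. optimize the graph and serialize the resulting engine
    return builder.build_serialized_network(network, config)
```

At runtime the serialized bytes are written to disk once and then deserialized with `trt.Runtime(logger).deserialize_cuda_engine(...)` for execution, which is the serialization/runtime split the workflow describes.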

Performance results: the traditional multi‑threaded service achieves ~4.5 QPS with 2% GPU utilization, whereas the custom framework (6 CPU processes + 1 GPU process) reaches 27.4 QPS with 12% GPU utilization.

Practical tips: use CPU/GPU process separation, enable TensorRT FP16 with selective FP32 fallback, merge multiple models, and replicate GPU processes to fully utilize GPU memory and compute resources.
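The "FP16 with selective FP32 fallback" tip can be sketched with TensorRT's per-layer precision controls. A hedged sketch assuming TensorRT 8.x, again guarded for environments without TensorRT; `fp32_layer_names` is a hypothetical allow-list, not something the article specifies:

```python
try:
    import tensorrt as trt
except ImportError:
    trt = None

def apply_mixed_precision(builder, network, fp32_layer_names):
    """Enable FP16 globally, but pin numerically sensitive layers to FP32.

    `fp32_layer_names` is a hypothetical set of layer names that lose too
    much accuracy in half precision (e.g. softmax or normalization layers).
    Returns a builder config, or None when TensorRT is missing.
    """
    if trt is None:
        return None
    config = builder.create_builder_config()
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
        # Make TensorRT honor the per-layer precisions set below.
        config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name in fp32_layer_names:
            layer.precision = trt.float32
            layer.set_output_type(0, trt.float32)
    return config
```

Without `OBEY_PRECISION_CONSTRAINTS`, TensorRT treats per-layer precisions as hints only, so the flag is what makes the fallback actually selective rather than advisory.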

Conclusion: Combining CPU‑GPU process isolation with TensorRT optimization can deliver up to 10× QPS gains and substantial cost savings for large‑scale inference services.

performance optimization, Python, CPU-GPU Separation, CUDA, GPU inference, TensorRT
Written by

DeWu Technology

A platform for sharing and discussing technical knowledge.
