Optimizing Python GPU Inference Services with CPU/GPU Process Separation and TensorRT
By isolating CPU preprocessing and post‑processing from GPU inference in separate processes and applying TensorRT’s FP16/INT8 optimizations, the custom Python framework lifts a Python vision inference service from roughly 4.5 to 27.4 QPS—a 5‑10× class speedup—while making far better use of each GPU and cutting serving cost.
Background: with computer‑vision algorithms increasingly deployed in production, the performance of Python inference services must improve to keep serving costs down.
Two key techniques are employed: separating CPU and GPU processes, and accelerating models with TensorRT, which together achieve a 5‑10× QPS improvement.
The article is organized into three parts: theory (CUDA architecture), framework & tools, and practical optimization tips.
Theory: CUDA provides a parallel computing platform with host (CPU) and device (GPU) concepts, kernel functions, streams, and a typical execution flow (host‑to‑device copy, kernel launch, device‑to‑host copy).
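The host‑to‑device copy → kernel launch → device‑to‑host copy flow can be sketched in Python with PyCUDA (an assumption—the article names no particular CUDA binding; `double_on_gpu` and the kernel are illustrative). Imports are lazy, so the sketch only needs a CUDA‑capable machine when actually called:

```python
# Hedged sketch of the typical CUDA execution flow: allocate device
# memory, copy host -> device, launch a kernel, copy device -> host.
def double_on_gpu(host_in):
    import numpy as np
    import pycuda.autoinit          # noqa: F401  creates a CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    kernel = SourceModule("""
        __global__ void double_all(float *x, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) x[i] *= 2.0f;   // each thread handles one element
        }
    """).get_function("double_all")

    arr = np.asarray(host_in, dtype=np.float32)
    dev = cuda.mem_alloc(arr.nbytes)             # allocate device memory
    cuda.memcpy_htod(dev, arr)                   # 1. host -> device copy
    kernel(dev, np.int32(arr.size),              # 2. kernel launch
           block=(256, 1, 1), grid=((arr.size + 255) // 256, 1))
    cuda.memcpy_dtoh(arr, dev)                   # 3. device -> host copy
    return arr.tolist()
```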
Traditional Python inference services often use Flask or KServe and run CPU preprocessing and GPU inference in the same thread/process. This design suffers from severe bottlenecks due to the Python GIL and limited GPU kernel scheduling, resulting in low QPS.
Solution: isolate CPU logic (pre‑/post‑processing) and GPU logic (model inference) into separate processes coordinated by a Proxy process. This eliminates GIL contention and allows independent scaling of CPU and GPU workers.
A custom Python framework implements this separation, requiring only the implementation of preprocess, inference, and postprocess interfaces while handling inter‑process communication automatically.
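The three‑method contract might look like the following (class and example names are assumptions; only the `preprocess`/`inference`/`postprocess` method names come from the article):

```python
# Hypothetical sketch of the interface users implement; the framework
# itself would route each method to the right process.
class InferenceService:
    def preprocess(self, request):      # runs in a CPU process
        raise NotImplementedError

    def inference(self, batch):         # runs in the GPU process
        raise NotImplementedError

    def postprocess(self, output):      # runs in a CPU process
        raise NotImplementedError

class ToyVisionService(InferenceService):
    # Toy implementation standing in for decode -> model -> format.
    def preprocess(self, request):
        return [float(x) for x in request]

    def inference(self, batch):
        return [x * 0.5 for x in batch]

    def postprocess(self, output):
        return {"scores": output}

svc = ToyVisionService()
out = svc.postprocess(svc.inference(svc.preprocess([1, 2, 3])))
print(out)  # {'scores': [0.5, 1.0, 1.5]}
```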
TensorRT acceleration: TensorRT parses ONNX models and compiles them into high‑performance inference engines, applying graph optimizations and node elimination, with multi‑precision support (FP32/FP16/INT8). The workflow covers model parsing, optimization, serialization, and runtime execution.
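The parse → optimize → serialize steps can be sketched against the TensorRT 8.x Python API (the function name and file paths are assumptions; the import is lazy, and actually running it requires a GPU machine with `tensorrt` installed):

```python
# Hedged sketch of the ONNX -> TensorRT engine-build workflow.
def build_engine(onnx_path, engine_path, fp16=True):
    import tensorrt as trt  # imported lazily; needs a CUDA-capable host

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Explicit-batch network definition, as required for ONNX models.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:              # 1. parse the ONNX graph
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    if fp16 and builder.platform_has_fast_fp16:   # 2. enable FP16 kernels
        config.set_flag(trt.BuilderFlag.FP16)

    # 3. optimize and serialize the engine to disk for runtime use.
    engine = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(engine)
```

At runtime the serialized engine is deserialized with `trt.Runtime` and executed through an execution context.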
Performance results: the traditional multi‑threaded service achieves ~4.5 QPS at 2% GPU utilization, whereas the custom framework (6 CPU processes + 1 GPU process) reaches 27.4 QPS at 12% GPU utilization—roughly a 6× gain in this configuration.
Practical tips: use CPU/GPU process separation, enable TensorRT FP16 with selective FP32 fallback, merge multiple models, and replicate GPU processes to fully utilize GPU memory and compute resources.
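The "FP16 with selective FP32 fallback" tip could be implemented roughly as follows with the TensorRT 8.x Python API (the layer‑name filter and default `softmax` match are assumptions; the API calls themselves are standard):

```python
# Hedged sketch: enable FP16 globally, then pin numerically sensitive
# layers back to FP32 so accuracy-critical ops stay in full precision.
def enable_fp16_with_fallback(builder, network, fp32_layers=("softmax",)):
    import tensorrt as trt  # imported lazily; needs a CUDA-capable host

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)
    # Make the per-layer precision settings below binding.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if any(name in layer.name.lower() for name in fp32_layers):
            layer.precision = trt.float32         # keep this layer in FP32
            layer.set_output_type(0, trt.float32)
    return config
```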
Conclusion: Combining CPU‑GPU process isolation with TensorRT optimization can deliver up to 10× QPS gains and substantial cost savings for large‑scale inference services.
DeWu Technology