Artificial Intelligence · 19 min read

Optimizing GPU Utilization for Multimedia AI Services with high_service

This article presents high_service, a high-performance inference framework that raises GPU utilization in multimedia AI services. It separates CPU-heavy preprocessing from GPU inference to eliminate GIL bottlenecks, adds priority-based auto-scaling and multi-tenant GPU sharing to adapt to fluctuating traffic, and uses TensorRT-accelerated models to reduce waste. Future work targets automated bottleneck detection and further CPU-to-GPU offloading.

Alimama Tech

This article extends the previous discussion on deep‑learning advertising computation and focuses on the core problem of improving GPU utilization and avoiding GPU waste in multimedia AI service scenarios.

Challenges include low GPU usage caused by Python’s Global Interpreter Lock (GIL) in single‑process inference, high memory consumption of large models, and highly uneven traffic patterns that lead to either resource starvation during peaks or waste during low‑traffic periods.

The high_service framework is introduced as a high-performance inference solution designed for Alimama's multimedia online services. Its key components are:

Cluster‑level service scheduling with priority‑based auto‑scaling and multi‑tenant GPU sharing to handle traffic fluctuations and low‑traffic services.
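The priority-based scheduling described above can be sketched as a target-tracking policy: each service asks for replicas in proportion to its observed load, and GPUs are granted in priority order so that low-priority services absorb any shortage. This is an illustrative sketch, not the article's actual scheduler; the `Service` fields and the `schedule` function are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    priority: int        # lower number = scheduled first
    target_util: float   # desired GPU utilization per replica
    current_util: float  # observed utilization per replica
    replicas: int
    min_replicas: int = 1

def desired_replicas(svc: Service) -> int:
    """Target-tracking rule: scale replica count in proportion to load."""
    want = round(svc.replicas * svc.current_util / svc.target_util)
    return max(svc.min_replicas, want)

def schedule(services: list[Service], gpu_budget: int) -> dict[str, int]:
    """Grant GPUs in priority order; lower-priority services take the cuts."""
    plan, remaining = {}, gpu_budget
    for svc in sorted(services, key=lambda s: s.priority):
        grant = min(desired_replicas(svc), remaining)
        plan[svc.name] = grant
        remaining -= grant
    return plan
```

With a budget of 8 GPUs, a hot high-priority service scales up to 6 replicas and the low-priority one is held to the remaining 2, which mirrors the peak/off-peak behavior the article describes.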

Service‑architecture optimization using a CPU/GPU separated multi‑process design, which isolates the GPU inference process from CPU‑heavy preprocessing, thereby eliminating GIL‑related bottlenecks.
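A minimal sketch of the CPU/GPU separated multi-process design: several preprocessing processes feed a queue, and a single dedicated process consumes tensors and runs inference, so CPU work never contends with the inference process's GIL. The decode/resize and model-forward steps are stubbed with arithmetic so the sketch runs anywhere; the function names and queue layout are assumptions, not the framework's actual API.

```python
import multiprocessing as mp

def preprocess_worker(raw_q: mp.Queue, tensor_q: mp.Queue) -> None:
    """CPU-bound decode/resize runs in its own process."""
    while True:
        item = raw_q.get()
        if item is None:                 # poison pill: shut down
            tensor_q.put(None)
            break
        req_id, payload = item
        tensor = [x / 255.0 for x in payload]  # stand-in for decode/resize
        tensor_q.put((req_id, tensor))

def gpu_worker(tensor_q: mp.Queue, result_q: mp.Queue, n_producers: int) -> None:
    """Single dedicated process owns the GPU; inference stubbed with a sum."""
    done = 0
    while done < n_producers:
        item = tensor_q.get()
        if item is None:
            done += 1                    # one producer finished
            continue
        req_id, tensor = item
        result_q.put((req_id, sum(tensor)))    # stand-in for model forward

def run_pipeline(requests, n_cpu: int = 2):
    raw_q, tensor_q, result_q = mp.Queue(), mp.Queue(), mp.Queue()
    cpus = [mp.Process(target=preprocess_worker, args=(raw_q, tensor_q))
            for _ in range(n_cpu)]
    gpu = mp.Process(target=gpu_worker, args=(tensor_q, result_q, n_cpu))
    for p in cpus + [gpu]:
        p.start()
    for req in requests:
        raw_q.put(req)
    for _ in cpus:
        raw_q.put(None)                  # tell each CPU worker to stop
    results = {}
    for _ in requests:
        req_id, out = result_q.get()
        results[req_id] = out
    for p in cpus + [gpu]:
        p.join()
    return results
```

The key property is that `gpu_worker` is never blocked by Python-level preprocessing: the GIL only serializes threads within one process, so moving CPU work into separate processes sidesteps it entirely.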

Model‑level acceleration via TensorRT (for PyTorch) and optimized TensorFlow pipelines, including automatic model conversion, engine caching, and removal of inference‑irrelevant operations such as spectral normalization.

The system architecture routes requests through Nginx, distributes preprocessing to multiple CPU processes, and feeds tensors to a dedicated GPU process. GPU kernels are scheduled on separate CUDA streams to maximize parallelism.

Performance results show that dynamic auto‑scaling can adjust GPU card count from a few dozen during peaks to a minimal number during off‑hours, dramatically improving overall GPU utilization.

Future Plans aim to automate bottleneck detection, further optimize CPU workloads by offloading image/video processing to GPU, and continue kernel‑level and scheduling optimizations for emerging hardware.

High Performance Computing · inference optimization · TensorRT · auto scaling · GPU utilization · multimedia AI
Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.
