Artificial Intelligence · 19 min read

Optimizing GPU Utilization for Multimedia AI Services with high_service

This article presents high_service, a high-performance inference framework that raises GPU utilization in multimedia AI services. It separates CPU-heavy preprocessing from GPU inference to eliminate GIL bottlenecks, adds priority-based auto-scaling and multi-tenant GPU sharing to adapt to fluctuating traffic, and uses TensorRT-accelerated models to reduce waste. Future work targets automated bottleneck detection and further CPU-to-GPU offloading.

Alimama Tech

This article extends the previous discussion on deep‑learning advertising computation and focuses on the core problem of improving GPU utilization and avoiding GPU waste in multimedia AI service scenarios.

Challenges include low GPU usage caused by Python’s Global Interpreter Lock (GIL) in single‑process inference, high memory consumption of large models, and highly uneven traffic patterns that lead to either resource starvation during peaks or waste during low‑traffic periods.

The high_service framework is introduced as a high-performance inference solution designed for Alimama's multimedia online services. Its key components are:

Cluster‑level service scheduling with priority‑based auto‑scaling and multi‑tenant GPU sharing to handle traffic fluctuations and low‑traffic services.
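The priority-based scheduling described above can be sketched as a target-tracking policy: each service asks for replicas in proportion to its observed load, and GPUs are granted in priority order so that low-priority services absorb any shortage. This is an illustrative sketch, not the article's actual scheduler; the `Service` fields and the `schedule` function are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    priority: int        # lower number = scheduled first
    target_util: float   # desired GPU utilization per replica
    current_util: float  # observed utilization per replica
    replicas: int
    min_replicas: int = 1

def desired_replicas(svc: Service) -> int:
    """Target-tracking rule: scale replica count in proportion to load."""
    want = round(svc.replicas * svc.current_util / svc.target_util)
    return max(svc.min_replicas, want)

def schedule(services: list[Service], gpu_budget: int) -> dict[str, int]:
    """Grant GPUs in priority order; lower-priority services take the cuts."""
    plan, remaining = {}, gpu_budget
    for svc in sorted(services, key=lambda s: s.priority):
        grant = min(desired_replicas(svc), remaining)
        plan[svc.name] = grant
        remaining -= grant
    return plan
```

With a budget of 8 GPUs, a hot high-priority service scales up to 6 replicas and the low-priority one is held to the remaining 2, which mirrors the peak/off-peak behavior the article describes.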

Service‑architecture optimization using a CPU/GPU separated multi‑process design, which isolates the GPU inference process from CPU‑heavy preprocessing, thereby eliminating GIL‑related bottlenecks.
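A minimal sketch of the CPU/GPU separated multi-process design: several preprocessing processes feed a queue, and a single dedicated process consumes tensors and runs inference, so CPU work never contends with the inference process's GIL. The decode/resize and model-forward steps are stubbed with arithmetic so the sketch runs anywhere; the function names and queue layout are assumptions, not the framework's actual API.

```python
import multiprocessing as mp

def preprocess_worker(raw_q: mp.Queue, tensor_q: mp.Queue) -> None:
    """CPU-bound decode/resize runs in its own process."""
    while True:
        item = raw_q.get()
        if item is None:                 # poison pill: shut down
            tensor_q.put(None)
            break
        req_id, payload = item
        tensor = [x / 255.0 for x in payload]  # stand-in for decode/resize
        tensor_q.put((req_id, tensor))

def gpu_worker(tensor_q: mp.Queue, result_q: mp.Queue, n_producers: int) -> None:
    """Single dedicated process owns the GPU; inference stubbed with a sum."""
    done = 0
    while done < n_producers:
        item = tensor_q.get()
        if item is None:
            done += 1                    # one producer finished
            continue
        req_id, tensor = item
        result_q.put((req_id, sum(tensor)))    # stand-in for model forward

def run_pipeline(requests, n_cpu: int = 2):
    raw_q, tensor_q, result_q = mp.Queue(), mp.Queue(), mp.Queue()
    cpus = [mp.Process(target=preprocess_worker, args=(raw_q, tensor_q))
            for _ in range(n_cpu)]
    gpu = mp.Process(target=gpu_worker, args=(tensor_q, result_q, n_cpu))
    for p in cpus + [gpu]:
        p.start()
    for req in requests:
        raw_q.put(req)
    for _ in cpus:
        raw_q.put(None)                  # tell each CPU worker to stop
    results = {}
    for _ in requests:
        req_id, out = result_q.get()
        results[req_id] = out
    for p in cpus + [gpu]:
        p.join()
    return results
```

The key property is that `gpu_worker` is never blocked by Python-level preprocessing: the GIL only serializes threads within one process, so moving CPU work into separate processes sidesteps it entirely.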

Model‑level acceleration via TensorRT (for PyTorch) and optimized TensorFlow pipelines, including automatic model conversion, engine caching, and removal of inference‑irrelevant operations such as spectral normalization.

The system architecture routes requests through Nginx, distributes preprocessing to multiple CPU processes, and feeds tensors to a dedicated GPU process. GPU kernels are scheduled on separate CUDA streams to maximize parallelism.

Performance results show that dynamic auto‑scaling can adjust GPU card count from a few dozen during peaks to a minimal number during off‑hours, dramatically improving overall GPU utilization.

Future Plans aim to automate bottleneck detection, further optimize CPU workloads by offloading image/video processing to GPU, and continue kernel‑level and scheduling optimizations for emerging hardware.

High Performance Computing · inference optimization · TensorRT · auto scaling · GPU utilization · multimedia AI
Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.
