GPU Utilization Analysis and Optimization for Alibaba's Intelligent Creative Video Service
The paper analyzes why Alimama's intelligent creative video service suffers low GPU utilization (Python GIL blocking, lack of kernel fusion, and serialized CUDA streams) and details service-level changes (separate CPU/GPU processes, shared-memory queues, priority scheduling) and operator-level kernel-fusion techniques (channels-last layouts, custom pooling, TensorRT conversion) that raise utilization from ~30% to near 100% and boost throughput by 75%.
This article presents a comprehensive analysis of the GPU utilization problems in Alimama's intelligent creative service, which automatically generates short promotional videos for e-commerce items using deep-learning models.
Two major model families are employed: video models (3D ConvNet, Vision Transformer) and text models (Transformer/GPT-2). The service extracts a 5-second highlight clip from long videos based on its similarity to product images and descriptions.
Root causes of low GPU utilization are identified:
Python GIL blocks the kernel‑launch thread, causing frequent suspension and context switches.
PyTorch models lack kernel fusion, leading to many small kernels and high launch overhead.
Data transfer and computation are serialized because CUDA streams are not overlapped, leaving SM units idle during memcpy.
Service‑level optimizations include:
Separating CPU and GPU work into distinct processes. CPU processes handle video download and preprocessing, while a dedicated GPU process performs inference.
Using shared memory for large tensor transfer and a torch.multiprocessing.Queue for synchronization, achieving sub‑millisecond latency for 12 MB payloads.
Introducing a priority queue so that earlier requests receive higher scheduling priority, reducing contention.
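The shared-memory handoff between the CPU and GPU processes can be sketched as follows. This is a minimal illustration using the standard library's `multiprocessing.shared_memory` and NumPy as stand-ins for `torch.multiprocessing` and CUDA tensors; the function names (`producer_put`, `consumer_get`) and the segment name are assumptions, not from the paper.

```python
# Sketch of the shared-memory transfer described above: the large payload
# lives in a named shared-memory segment, and only a small metadata tuple
# travels on the synchronization queue. Illustrative names throughout.
import numpy as np
from multiprocessing import shared_memory

def producer_put(frames: np.ndarray, shm_name: str) -> tuple:
    """CPU process: copy a preprocessed tensor into shared memory and
    return the small metadata tuple that actually goes on the queue."""
    shm = shared_memory.SharedMemory(name=shm_name, create=True,
                                     size=frames.nbytes)
    view = np.ndarray(frames.shape, dtype=frames.dtype, buffer=shm.buf)
    view[:] = frames                      # one memcpy into shared memory
    shm.close()
    return (shm_name, frames.shape, str(frames.dtype))

def consumer_get(meta: tuple) -> np.ndarray:
    """GPU process: attach to the segment by name; the multi-megabyte
    payload is never serialized, only the metadata tuple crosses over."""
    name, shape, dtype = meta
    shm = shared_memory.SharedMemory(name=name)
    out = np.ndarray(shape, dtype=dtype, buffer=shm.buf).copy()
    shm.close()
    shm.unlink()
    return out
```

Because the queue carries only a name, shape, and dtype, queue latency is independent of payload size, which is what makes sub-millisecond handoff of a 12 MB tensor plausible.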
These changes raise GPU kernel-launch availability, increase GPU utilization from ~30% to near 100% under multi-concurrency, and improve overall throughput by 75%.
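The earlier-requests-first scheduling can be sketched with the standard library's `queue.PriorityQueue`, keyed on arrival timestamp; the `RequestScheduler` class below is an illustrative stand-in, not the service's actual implementation.

```python
# Minimal sketch of the priority scheduling described above: each request
# carries its arrival timestamp, so the request that has waited longest
# is always served next, even if it reached the queue after newer ones.
import queue

class RequestScheduler:
    def __init__(self):
        self._q = queue.PriorityQueue()

    def submit(self, arrival_ts: float, request):
        # Smaller timestamp == earlier arrival == higher priority.
        self._q.put((arrival_ts, request))

    def next_request(self):
        _, request = self._q.get()
        return request
```

With many concurrent producers feeding one GPU process, this keeps tail latency bounded: no request can be starved by a burst of newer arrivals.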
Operator‑level optimizations focus on kernel fusion:
Modifying Conv3D and BatchNorm to support channels‑last layout, eliminating costly format‑conversion kernels.
Replacing PyTorch MaxUnpool3D with a custom cuDNN pooling-backward implementation to retain index information without performance loss.
Converting PyTorch models to TensorRT, cutting GPT‑2 inference time from 1 s to 0.5 s.
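What the channels-last change does to memory layout can be illustrated in NumPy: the tensor keeps its logical NCDHW shape, but the bytes are reordered so the channel stride is 1 (NDHWC in memory), which is why no format-conversion kernels are needed between fused ops. This is an illustrative sketch; in PyTorch the equivalent is `x.to(memory_format=torch.channels_last_3d)`.

```python
# NumPy illustration of a channels-last 3D layout: same logical NCDHW
# indexing, different physical memory order (NDHWC), so all channel
# values of one voxel are contiguous in memory.
import numpy as np

def to_channels_last_3d(x: np.ndarray) -> np.ndarray:
    """Reorder an N,C,D,H,W array so the channel stride is 1, while
    keeping the logical NCDHW indexing intact."""
    assert x.ndim == 5, "expected an N,C,D,H,W tensor"
    # Physically lay the data out as N,D,H,W,C, then view it back as NCDHW.
    ndhwc = np.ascontiguousarray(x.transpose(0, 2, 3, 4, 1))
    return ndhwc.transpose(0, 4, 1, 2, 3)

x = np.arange(2 * 3 * 4 * 5 * 6, dtype=np.float32).reshape(2, 3, 4, 5, 6)
y = to_channels_last_3d(x)
# Logical values are unchanged; only the strides differ
# (channel stride is now 1 element).
```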
Additional improvements include using OpenCV's grab and retrieve functions for sparse frame sampling, and parallel video reading threads to further reduce I/O latency.
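The grab/retrieve split can be sketched as below: OpenCV's `VideoCapture.grab()` advances the stream without the expensive decode, and `retrieve()` decodes only the frames that are kept. A fake capture class stands in for `cv2.VideoCapture` so the sketch runs without OpenCV installed; its decode counter makes the saving visible.

```python
# Sketch of sparse frame sampling with the grab/retrieve pattern:
# grab() cheaply skips frames, retrieve() decodes only sampled ones.
def sample_frames(cap, wanted_indices):
    """Decode only the frames at wanted_indices; grab past the rest."""
    wanted = set(wanted_indices)
    frames, idx = {}, 0
    while cap.grab():                   # cheap: advance without decoding
        if idx in wanted:
            ok, frame = cap.retrieve()  # expensive decode, only when needed
            if ok:
                frames[idx] = frame
        idx += 1
    return frames

class FakeCapture:
    """Illustrative stand-in for cv2.VideoCapture that counts decodes."""
    def __init__(self, n_frames):
        self.n, self.pos, self.decodes = n_frames, -1, 0
    def grab(self):
        self.pos += 1
        return self.pos < self.n
    def retrieve(self):
        self.decodes += 1
        return True, f"frame-{self.pos}"
```

Sampling 10 frames out of 100 this way pays the decode cost only 10 times; the remaining 90 frames are skipped at grab speed, which is where the I/O-latency reduction comes from.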
Experimental results on a T4 GPU with 16‑core CPU show a 15 % latency reduction for the TASED‑Net model and a 41 % increase in GPU utilization after the full stack of optimizations.
The paper concludes with a discussion on future work, such as exploring lazy‑tensor frameworks (e.g., PyTorch‑XLA) and extending optimizations to newer hardware like A10 and NPU accelerators.
Alimama Tech