GPU Container Virtualization for AI Heterogeneous Computing: Architecture and Best Practices
This article provides a comprehensive overview of GPU container virtualization technology for AI heterogeneous computing, covering challenges, architecture, implementation details, and best practices. The content is based on InfoQ's Open Class and includes Q&A at the end.
The article begins by highlighting the growing demand for AI computing power, with model training requirements doubling every 3.4 months since 2012, while actual resource utilization in production environments remains below 30%. It identifies key constraints affecting GPU utilization including model characteristics, service SLA requirements, traffic patterns, optimization levels, and capacity redundancy.
The article then presents four typical utilization patterns observed in production: low average utilization, peak-valley fluctuations, short-term spikes, and periodic triggering. These patterns demonstrate the complexity of AI application scenarios and the need for flexible virtualization solutions.
A historical overview of GPU virtualization development is provided, tracing from NVIDIA's early Tesla (G80) architecture through the Kepler, Pascal, Volta, Turing, and Ampere generations. The article discusses various virtualization approaches, including API hooking (rCUDA), hardware-based solutions (NVIDIA GRID vGPU, MIG), and software-based implementations.
The core of the article focuses on Baidu's dual-engine GPU container virtualization architecture, which combines user-space and kernel-space isolation engines. The user-space engine uses API hooking to intercept CUDA calls and provide features like memory isolation, compute isolation, encoding/decoding isolation, priority preemption, memory oversubscription, and memory pooling. The kernel-space engine implements isolation through system call interception and provides memory isolation, compute isolation, and multiple scheduling algorithms (Fixed Share, Equal Share, Weight Share, Burst Weight Share).
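The user-space engine's memory isolation rests on a simple idea: because every CUDA allocation passes through the intercepted API, the hook can account for per-container usage and reject over-quota requests before they ever reach the driver. The sketch below illustrates that accounting logic only; all names (`GpuMemoryInterceptor`, `QuotaExceeded`) are hypothetical, and a real implementation would interpose the CUDA runtime/driver API (e.g. via `LD_PRELOAD`) rather than wrap Python calls.

```python
class QuotaExceeded(Exception):
    pass


class GpuMemoryInterceptor:
    """Tracks per-container GPU memory and rejects over-quota allocations.

    Illustrative stand-in for an API-hooking layer: the check runs *before*
    the call would be forwarded to the real driver, so an over-quota request
    never reaches the GPU.
    """

    def __init__(self, quota_bytes):
        self.quota = quota_bytes
        self.used = 0
        self.allocations = {}   # handle -> size
        self._next_handle = 0

    def malloc(self, size):
        if self.used + size > self.quota:
            raise QuotaExceeded(f"request {size} exceeds quota {self.quota}")
        handle = self._next_handle  # stands in for the device pointer
        self._next_handle += 1
        self.used += size
        self.allocations[handle] = size
        return handle

    def free(self, handle):
        self.used -= self.allocations.pop(handle)


# A container limited to 1 GiB of device memory:
gpu = GpuMemoryInterceptor(quota_bytes=1 << 30)
h = gpu.malloc(512 << 20)       # 512 MiB succeeds
try:
    gpu.malloc(768 << 20)       # would exceed 1 GiB -> rejected in user space
except QuotaExceeded:
    pass
gpu.free(h)
```

The same interception point is what enables the other user-space features the article lists: compute isolation (throttling kernel launches), oversubscription (deferring allocations), and pooling all hang off the same hook.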
Performance evaluation shows that user-space virtualization with process fusion achieves superior tail latency compared to bare-metal and kernel-space approaches, particularly under high load. The article also discusses advanced features like remote GPU access, MPS (Multi-Process Service) optimization, priority preemption for online/offline task mixing, and time-sharing with memory swapping.
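The Weight Share policy named above can be sketched as proportional apportionment of GPU time quanta among containers. This is a hedged illustration of the general technique, not Baidu's implementation; the Fixed Share, Equal Share, and Burst Weight Share variants are not shown.

```python
def weight_share(containers, total_quanta):
    """Apportion GPU time quanta to containers in proportion to their weights.

    containers: dict mapping container name -> integer weight.
    Uses largest-remainder rounding so the granted quanta sum exactly to
    total_quanta.
    """
    total_weight = sum(containers.values())
    shares = {n: total_quanta * w / total_weight for n, w in containers.items()}
    granted = {n: int(s) for n, s in shares.items()}
    leftover = total_quanta - sum(granted.values())
    # Hand the remaining quanta to the containers with the largest remainders.
    by_remainder = sorted(shares, key=lambda n: shares[n] - granted[n], reverse=True)
    for n in by_remainder[:leftover]:
        granted[n] += 1
    return granted


# Two containers with a 3:1 weight ratio splitting 100 time quanta:
print(weight_share({"training": 3, "inference": 1}, 100))
# -> {'training': 75, 'inference': 25}
```

A Burst variant would additionally let an underloaded container's unused quanta flow to busy peers, which is what makes weight-based sharing attractive for the peak-valley traffic patterns described earlier.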
Best practices are presented for three common scenarios: shared mixing for low-utilization tasks, priority preemption for fluctuating workloads with short spikes, and time-sharing with memory swapping for intermittent compute tasks. The article concludes by mentioning that all these technologies are available on Baidu's AI heterogeneous computing platform (Baidu Baige) and can be deployed in both public and private clouds.
The Q&A section addresses technical questions about resource control mechanisms, NPU virtualization, coexistence of different virtualization approaches, scheduling extensions, and deployment requirements.
Baidu Geek Talk