Considerations and Practices for Domesticating Large‑Model Inference Engines
This article examines the importance of domestic large‑model inference engines, compares Chinese and international chips, evaluates four architectural approaches, discusses practical challenges such as performance loss and model support, and outlines future expectations for high‑performance, heterogeneous‑chip inference solutions.
Overview
The presentation, titled "Domestic Large-Model Inference Engine: Thoughts and Practices," focuses on why inference engines should be domesticated, how hardware and software interact, and the challenges and opportunities of building on Chinese chips.
Background
The talk begins by comparing domestic chips (e.g., the Huawei 910B) with international GPUs (e.g., the Nvidia B100), highlighting differences in compute, memory capacity, and bandwidth that directly affect large-model training and inference.
Hardware Comparison
The speaker's tables show that the Huawei 910B lacks FP8 Tensor Core support and carries 64 GB of HBM2e, while the Nvidia B100 offers 192 GB of HBM3e; the tables also trace the rapid growth of memory capacity and bandwidth from A100 to H100 to B100.
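To give a rough sense of what these capacity gaps mean in practice, the back-of-envelope arithmetic below estimates how many cards are needed just to hold a model's FP16 weights. The 70B model size is a hypothetical example; the per-card capacities are the figures cited above, and KV cache, activations, and parallelism overhead are ignored.

```python
import math

def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights (no KV cache or activations)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def min_cards(model_gb: float, card_gb: float) -> int:
    """Minimum cards whose combined HBM can hold the weights."""
    return math.ceil(model_gb / card_gb)

fp16_70b = weight_memory_gb(70, 2)   # 140.0 GB of FP16 weights
print(min_cards(fp16_70b, 64))       # Huawei 910B, 64 GB HBM2e  -> 3
print(min_cards(fp16_70b, 192))      # Nvidia B100, 192 GB HBM3e -> 1
```

Even before any kernel-level differences, capacity alone dictates the minimum degree of tensor or pipeline parallelism on each platform.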
Inference Engine Comparison
The talk surveys engines including vLLM, Nvidia TRT-LLM, SGLang, LMDeploy, and Huawei MindIE: vLLM is widely compatible but slower than TRT-LLM, which in turn is not fully open source, while MindIE targets the Ascend NPU.
Architectural Approaches
1. Thin Bridge Layer – a lightweight wrapper over existing engines, suitable for small teams but limited by the underlying engine’s capabilities.
2. Deep Optimization of Open‑Source Engine – extensive Python‑level tuning of engines like vLLM; risks include divergence from upstream projects.
3. Hybrid GPU/NPU Support – deep integration of open‑source engines with NPU kernels (e.g., MindIE), requiring substantial manpower.
4. Fully Self‑Developed Engine – a ground‑up solution supporting both GPU and NPU, demanding a large, skilled team and close collaboration with hardware vendors.
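The four approaches differ mainly in where the abstraction boundary sits. A minimal sketch of approach 1, the thin bridge layer, might look like the following; every name here (`InferenceBackend`, `make_backend`, the two backend classes) is hypothetical, and the method bodies merely stand in for calls into the real vLLM or MindIE runtimes.

```python
from abc import ABC, abstractmethod

class InferenceBackend(ABC):
    """Uniform interface the bridge exposes, regardless of chip."""
    @abstractmethod
    def generate(self, prompt: str, max_tokens: int) -> str: ...

class VllmBackend(InferenceBackend):
    def generate(self, prompt: str, max_tokens: int) -> str:
        # A real bridge would delegate to the vLLM engine here.
        return f"[vllm] {prompt[:20]}"

class MindIEBackend(InferenceBackend):
    def generate(self, prompt: str, max_tokens: int) -> str:
        # A real bridge would delegate to the MindIE runtime here.
        return f"[mindie] {prompt[:20]}"

def make_backend(chip: str) -> InferenceBackend:
    # Route by hardware: Ascend NPUs -> MindIE, everything else -> vLLM.
    return MindIEBackend() if chip.startswith("ascend") else VllmBackend()
```

The appeal is the small surface area a small team must maintain; the cost, as noted above, is that the bridge can never exceed the capabilities of the engines underneath it.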
Domestic Practice
In practice, teams relying on external engines report performance degradation, low hardware utilization, and slow model-iteration cycles.
Future Outlook
A high-performance inference engine that supports NPUs and GPUs simultaneously is expected soon, making the construction of a domestic chip ecosystem crucial for competitiveness.
Q&A Highlights
Q1: Companies may use 910B with vLLM instead of MindIE, customizing the open‑source stack.
Q2: Building a basic engine can take a few months with a small team; a full‑featured framework requires significantly more resources.
Q3: Accuracy verification across heterogeneous chips relies on vendor‑provided tools and internal calibration methods.
Q4: Most inference frameworks expose a Python layer; reducing its thickness or rewriting in C++ can improve performance.
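The cross-chip accuracy verification mentioned in Q3 can be sketched as a numerical comparison of logits dumped from the GPU and NPU runs. This is a pure-Python illustration with made-up values; real pipelines would compare numpy/torch tensors produced by vendor dump tools rather than hand-written lists.

```python
import math

def max_rel_err(ref, test, eps=1e-6):
    """Largest element-wise relative error between two logit vectors."""
    return max(abs(a - b) / (abs(a) + eps) for a, b in zip(ref, test))

def cosine_sim(ref, test):
    """Cosine similarity between two logit vectors."""
    dot = sum(a * b for a, b in zip(ref, test))
    norm = math.sqrt(sum(a * a for a in ref)) * math.sqrt(sum(b * b for b in test))
    return dot / norm

gpu_logits = [1.02, -0.51, 3.30]   # illustrative values only
npu_logits = [1.01, -0.50, 3.28]

# Typical acceptance check: small relative error, near-identical direction.
assert max_rel_err(gpu_logits, npu_logits) < 0.03
assert cosine_sim(gpu_logits, npu_logits) > 0.999
```

Tolerances must be calibrated per model and precision (FP16 vs. BF16 vs. quantized), which is why the talk points to vendor tools plus internal calibration rather than a fixed threshold.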
Thank you for attending.
DataFunSummit