dl_inference: Open‑Source Deep Learning Inference Service with TensorRT and MKL Acceleration
dl_inference is an open‑source, production‑grade deep learning inference platform supporting TensorFlow, PyTorch, and Caffe models. It offers GPU and CPU deployment, TensorRT and MKL acceleration, and multi‑node load balancing, and this post collects Q&A on model conversion, hardware requirements, INT8 quantization, and performance gains.
dl_inference is an open‑source, general‑purpose deep learning inference tool released by 58.com, capable of quickly deploying models trained with TensorFlow, PyTorch, and Caffe in production environments. It provides both GPU and CPU deployment options and implements load‑balancing strategies for multi‑node deployments, handling over one billion inference requests per day.
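The post does not specify which load‑balancing strategy dl_inference uses across nodes; weighted round‑robin is one common choice for heterogeneous GPU/CPU fleets. A minimal sketch of that idea (node names and weights are illustrative, not dl_inference's actual algorithm):

```python
import itertools


class WeightedRoundRobin:
    """Hypothetical weighted round-robin balancer: a node with weight w
    receives w out of every sum-of-weights requests."""

    def __init__(self, nodes):
        # nodes: dict mapping node name -> integer weight
        schedule = [name for name, w in nodes.items() for _ in range(w)]
        self._cycle = itertools.cycle(schedule)

    def pick(self):
        # Return the next node in the repeating weighted schedule.
        return next(self._cycle)


# A GPU node weighted 3x relative to a CPU node (illustrative values).
lb = WeightedRoundRobin({"gpu-node-1": 3, "cpu-node-1": 1})
picks = [lb.pick() for _ in range(8)]
```

Over 8 requests the 3:1 weighting sends 6 to the GPU node and 2 to the CPU node; a production balancer would additionally track node health and in‑flight load.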
In November 2021 the tool was updated to integrate TensorRT acceleration and support the Intel Math Kernel Library (MKL) for TensorFlow Serving. A technical salon on December 10, 2021 presented detailed explanations of these enhancements.
The project is hosted at https://github.com/wuba/dl_inference, and users are encouraged to star the repository, file issues, and submit pull requests.
TensorRT Acceleration: The Q&A session explained that TensorRT removes redundant operators and merges compatible layers, which can significantly speed up inference when such patterns exist. Successful conversion depends on the model structure, the TensorFlow version (1.12–1.15 and 2.1–2.7 are supported), and the ONNX opset version (9–15). INT8 quantization is supported and can further improve performance, though unsupported operators may require custom plugins.
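On the INT8 point: quantization maps FP32 values to 8‑bit integers through a per‑tensor scale, trading a little precision for much cheaper arithmetic. A minimal numpy sketch of the symmetric scheme (TensorRT actually chooses scales via entropy calibration; the max‑abs scale here is a simplification for illustration):

```python
import numpy as np


def int8_quantize(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization (sketch).
    Scale chosen so the largest magnitude maps to 127."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale


x = np.array([0.5, -1.0, 0.25, 1.27], dtype=np.float32)
q, scale = int8_quantize(x)
x_hat = int8_dequantize(q, scale)
```

The reconstruction error per element is bounded by half a quantization step (scale/2), which is why calibration data that reflects the real activation range matters so much for INT8 accuracy.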
MKL Acceleration: MKL speeds up CPU inference by using mathematically equivalent operations and SIMD instruction sets. It is most effective for computer‑vision models; some recommendation models (e.g., Wide&Deep, DIEN, FM) see limited gains. The accuracy impact is minimal, typically affecting only the sixth or seventh decimal place of model outputs.
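The "mathematically equivalent operations" point can be seen with a toy dot product: a vectorized (SIMD‑style) path accumulates partial sums in a different order than a scalar loop, so the two results agree only up to low‑order floating‑point bits. An illustrative sketch (numpy standing in for MKL, which the source does not show directly):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1000).astype(np.float32)
b = rng.standard_normal(1000).astype(np.float32)

# Scalar reference: accumulate one product at a time, left to right.
scalar = np.float32(0.0)
for x, y in zip(a, b):
    scalar += x * y

# Vectorized path: same math, different summation order,
# as an SIMD/MKL kernel would use internally.
vectorized = float(np.dot(a, b))

# The two answers differ only in low-order bits.
diff = abs(float(scalar) - vectorized)
```

Both paths compute the same mathematical quantity; the tiny discrepancy from reassociated summation is the same class of effect behind MKL's sixth‑or‑seventh‑decimal differences in model outputs.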
Hardware requirements for TensorRT include matching GPU types between conversion and deployment and sufficient GPU memory. MKL can also run on ARM architectures, though the presented examples focus on x86/x64.
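On the CPU side, enabling MKL is largely a deployment concern: use an MKL build of TensorFlow Serving and tune the OpenMP threading knobs Intel usually recommends. A hedged sketch — the Docker image tag is illustrative and the exact env‑var values depend on your core count; check your TensorFlow Serving release for what it ships:

```shell
# Threading knobs commonly tuned for MKL builds (values are illustrative):
export OMP_NUM_THREADS=8                        # roughly one thread per physical core
export KMP_BLOCKTIME=1                          # ms a worker spins before sleeping
export KMP_AFFINITY=granularity=fine,compact,1,0

# Image tag below is an assumption, not confirmed by the source.
docker run -p 8500:8500 \
  -e OMP_NUM_THREADS -e KMP_BLOCKTIME -e KMP_AFFINITY \
  -e MODEL_NAME=resnet50 \
  -v "$(pwd)/resnet50:/models/resnet50" \
  tensorflow/serving:latest-mkl
```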
Performance results show that for ResNet‑50, MKL can increase QPS by over 60% and reduce latency by 40% on the same hardware. Similar improvements were observed for MobileNet and SSD models.
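QPS and latency numbers like these are easy to reproduce on your own hardware with a small harness. A minimal single‑threaded sketch (the `infer` callable is a hypothetical stand‑in for a real client call, not a dl_inference API):

```python
import time
import statistics


def benchmark(infer, n_requests=200):
    """Measure throughput (QPS) and latency percentiles for a callable.
    Single-threaded, so QPS here is simply 1 / mean latency."""
    latencies = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        infer()  # stand-in for one inference request
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    latencies.sort()
    return {
        "qps": n_requests / wall,
        "p50_ms": statistics.median(latencies) * 1e3,
        "p99_ms": latencies[int(0.99 * n_requests) - 1] * 1e3,
    }


# Simulate a ~1 ms model with a sleep; swap in a real client call.
stats = benchmark(lambda: time.sleep(0.001))
```

A production comparison would also drive concurrent clients and warm the model first, but the same three numbers (QPS, p50, p99) are what the ResNet‑50 results above report.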
Presentation materials (video recordings and PPTs) are available by following the “58技术” or “58AILab” WeChat public accounts and replying with the keyword “dl_inference”.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.