dl_inference: A General Deep Learning Inference Service with TensorRT and Intel MKL Acceleration
The article introduces dl_inference, an open‑source deep learning inference platform that integrates TensorRT GPU acceleration, Intel MKL CPU optimization, and Caffe support, detailing its features, model conversion workflow, deployment steps, performance gains, and how developers can contribute.
dl_inference is a general‑purpose deep learning inference service launched by 58.com on March 26, 2020, capable of deploying models trained with TensorFlow, PyTorch, Caffe and other frameworks.
With its 2021 update, dl_inference adds three major features: TensorRT‑based GPU acceleration for TensorFlow SavedModel and PyTorch .pth models, Intel Math Kernel Library (MKL) acceleration for TensorFlow Serving on CPUs, and native Caffe model inference with rich examples.
TensorRT Acceleration – TensorRT accelerates GPU inference dramatically by lowering numerical precision (FP16/INT8), fusing layers, auto‑selecting optimal kernels, managing memory dynamically, and executing multiple streams concurrently. dl_inference bundles TensorRT 7.1.3 and Triton Inference Server (TIS) 20.08 to convert and serve models.
Model Conversion Workflow – Developers provide a TensorFlow SavedModel.pb or PyTorch Model.pth and a JSON metadata file (config.txt). dl_inference first converts the model to ONNX, then optimizes it into a TensorRT engine. Example metadata:
{
"batch_size":0,
"input":[
{
"name":"image",
"data_type":"float",
"dims":[-1,224,224,3],
"node_name":"input_1:0"
}
],
"output":[
{
"name":"probs",
"data_type":"float",
"dims":[-1,19],
"node_name":"dense_1/Softmax:0"
}
]
}

After preparing the model and metadata, the conversion image is built:
cd DockerImage
docker build -t tis-model-convert:latest .

Then the conversion container is run:
cd $MODEL_PATH
docker run -v `pwd`:/workspace/source_model \
  -e SOURCE_MODEL_PATH=/workspace/source_model \
  -e TARGET_MODEL_PATH=/workspace/source_model \
  -e MODEL_NAME=tensorflow-666 \
  -e MODEL_TYPE=tensorflow \
  tis-model-convert:latest

Comments in the script explain each environment variable.
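Before running the converter, the metadata file can be sanity‑checked. The sketch below is a hypothetical helper (not part of dl_inference) that validates the fields shown in the example config.txt above:

```python
import json

# Fields used by the example metadata above; this checker is illustrative,
# not part of the dl_inference project.
REQUIRED_TENSOR_KEYS = {"name", "data_type", "dims", "node_name"}

def validate_config(text: str) -> dict:
    """Parse config.txt content and check the structure shown in the example."""
    cfg = json.loads(text)
    assert "batch_size" in cfg, "missing batch_size"
    for section in ("input", "output"):
        tensors = cfg.get(section)
        assert isinstance(tensors, list) and tensors, f"missing {section} list"
        for t in tensors:
            missing = REQUIRED_TENSOR_KEYS - t.keys()
            assert not missing, f"{section} tensor missing keys: {missing}"
            assert all(isinstance(d, int) for d in t["dims"]), "dims must be ints"
    return cfg

example = """
{"batch_size": 0,
 "input":  [{"name": "image", "data_type": "float",
             "dims": [-1, 224, 224, 3], "node_name": "input_1:0"}],
 "output": [{"name": "probs", "data_type": "float",
             "dims": [-1, 19], "node_name": "dense_1/Softmax:0"}]}
"""
cfg = validate_config(example)
print(cfg["input"][0]["node_name"])  # input_1:0
```

Catching a malformed config.txt here is cheaper than debugging a failed ONNX/TensorRT conversion inside the container.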
Model Deployment – The generated TensorRT model is served via Triton Inference Server:
docker pull nvcr.io/nvidia/tritonserver:20.08-py3
docker run -v ${TARGET_MODEL_PATH}:/workspace -p 8001:8001 nvcr.io/nvidia/tritonserver:20.08-py3 \
  /opt/tritonserver/bin/tritonserver --model-repository=/workspace

Clients can invoke the service through the provided RPC interface TisPredict and a Java demo client.
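Besides the gRPC port (8001) used by TisPredict, Triton also exposes a KFServing v2 HTTP/REST API. The sketch below assembles a v2‑style inference request body for the model above; the payload shape follows the v2 protocol, while the concrete values are illustrative:

```python
import json

def build_infer_request(input_name, dims, data):
    """Assemble a KFServing-v2 style request body for Triton's HTTP endpoint
    (POST /v2/models/<model_name>/infer)."""
    return {
        "inputs": [{
            "name": input_name,
            "shape": dims,
            "datatype": "FP32",   # matches the "float" data_type in config.txt
            "data": data,         # flattened row-major values
        }]
    }

# A single 224x224x3 image of zeros, purely for illustration.
body = build_infer_request("image", [1, 224, 224, 3], [0.0] * (224 * 224 * 3))
payload = json.dumps(body)
```

The payload would be POSTed to http://<host>:8000/v2/models/tensorflow-666/infer; note that 8000 is Triton's default HTTP port, and only 8001 (gRPC) is mapped in the docker run above.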
Intel MKL Acceleration – By using an MKL‑enabled TensorFlow Serving image, dl_inference speeds up CPU inference (e.g., 60% QPS increase, 40% latency reduction on an Intel Xeon E5‑2620). Key MKL environment variables such as KMP_BLOCKTIME, KMP_AFFINITY, KMP_SETTINGS, and OMP_NUM_THREADS are documented.
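These tuning knobs must be set in the environment before the serving process starts. The values below are commonly recommended starting points for MKL/OpenMP tuning, not figures from the article:

```python
import os

# Commonly recommended starting values; tune per host and workload.
mkl_env = {
    "KMP_BLOCKTIME": "0",    # threads sleep immediately after a parallel region
    "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",  # pin threads to cores
    "KMP_SETTINGS": "1",     # print effective OpenMP settings at startup
    "OMP_NUM_THREADS": str(os.cpu_count() or 1),  # one thread per logical core
}
os.environ.update(mkl_env)
```

In a Docker deployment the same variables would typically be passed with -e flags or an env file rather than from Python.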
Caffe Support – dl_inference v1.1 adds a Seldon‑wrapped Caffe RPC service with customizable pre‑process and post‑process hooks, enabling flexible data handling and multi‑stage inference.
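Seldon wraps a plain Python class and exposes its predict method over RPC; the pre‑/post‑process hook pattern might look like the sketch below (class name and method bodies are illustrative, not dl_inference's actual code):

```python
class CaffeModelWrapper:
    """Seldon-style wrapper sketch: predict() chains pre-process -> inference
    -> post-process, each stage overridable by the developer."""

    def pre_process(self, request):
        # e.g. decode, resize, normalize the raw input
        return [float(x) / 255.0 for x in request]

    def run_inference(self, features):
        # placeholder for the actual Caffe forward pass
        return [sum(features)]

    def post_process(self, raw_output):
        # e.g. map scores to labels, or feed a second model for
        # multi-stage inference
        return {"score": raw_output[0]}

    def predict(self, request, feature_names=None):
        features = self.pre_process(request)
        return self.post_process(self.run_inference(features))

print(CaffeModelWrapper().predict([255, 0]))  # {'score': 1.0}
```

Because each hook is an ordinary method, callers can subclass the wrapper to customize data handling without touching the serving layer.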
The project also provides numerous example models (e.g., qa_match, MMoE baseline) and a rich set of reference links. The source code is hosted at https://github.com/wuba/dl_inference, and contributors are encouraged to submit PRs or issues.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.