dl_inference: A General Deep Learning Inference Service with TensorRT and Intel MKL Acceleration
The article introduces dl_inference, an open‑source deep learning inference platform that integrates TensorRT GPU acceleration, Intel MKL CPU optimization, and Caffe support, detailing its features, model conversion workflow, deployment steps, performance gains, and how developers can contribute.
dl_inference is a general‑purpose deep learning inference service launched by 58.com on March 26, 2020, capable of deploying models trained with TensorFlow, PyTorch, Caffe and other frameworks.
With its 2021 update, dl_inference adds three major features: TensorRT‑based GPU acceleration for TensorFlow SavedModel and PyTorch .pth models, Intel Math Kernel Library (MKL) acceleration for TensorFlow Serving on CPUs, and native Caffe model inference with rich examples.
TensorRT Acceleration – TensorRT accelerates GPU inference dramatically by lowering numerical precision (FP16/INT8), fusing layers, auto‑selecting optimal kernels, managing memory dynamically, and executing multiple streams concurrently. dl_inference bundles TensorRT 7.1.3 and Triton Inference Server (TIS) 20.08 to convert and serve models.
Model Conversion Workflow – Developers provide a TensorFlow SavedModel.pb or PyTorch Model.pth and a JSON metadata file (config.txt). dl_inference first converts the model to ONNX, then optimizes it into a TensorRT engine. Example metadata:
{
"batch_size":0,
"input":[
{
"name":"image",
"data_type":"float",
"dims":[-1,224,224,3],
"node_name":"input_1:0"
}
],
"output":[
{
"name":"probs",
"data_type":"float",
"dims":[-1,19],
"node_name":"dense_1/Softmax:0"
}
]
}

After preparing the model and metadata, the conversion image is built:
cd DockerImage
docker build -t tis-model-convert:latest .

Then the conversion container is run:
cd $MODEL_PATH
docker run -v `pwd`:/workspace/source_model \
  -e SOURCE_MODEL_PATH=/workspace/source_model \
  -e TARGET_MODEL_PATH=/workspace/source_model \
  -e MODEL_NAME=tensorflow-666 \
  -e MODEL_TYPE=tensorflow \
  tis-model-convert:latest

Comments in the script explain each environment variable.
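Before running the converter, the metadata file can be sanity‑checked. The sketch below is a hypothetical helper (not part of dl_inference) that validates the fields shown in the example config.txt above:

```python
import json

# Fields used by the example metadata above; this checker is illustrative,
# not part of the dl_inference project.
REQUIRED_TENSOR_KEYS = {"name", "data_type", "dims", "node_name"}

def validate_config(text: str) -> dict:
    """Parse config.txt content and check the structure shown in the example."""
    cfg = json.loads(text)
    assert "batch_size" in cfg, "missing batch_size"
    for section in ("input", "output"):
        tensors = cfg.get(section)
        assert isinstance(tensors, list) and tensors, f"missing {section} list"
        for t in tensors:
            missing = REQUIRED_TENSOR_KEYS - t.keys()
            assert not missing, f"{section} tensor missing keys: {missing}"
            assert all(isinstance(d, int) for d in t["dims"]), "dims must be ints"
    return cfg

example = """
{"batch_size": 0,
 "input":  [{"name": "image", "data_type": "float",
             "dims": [-1, 224, 224, 3], "node_name": "input_1:0"}],
 "output": [{"name": "probs", "data_type": "float",
             "dims": [-1, 19], "node_name": "dense_1/Softmax:0"}]}
"""
cfg = validate_config(example)
print(cfg["input"][0]["node_name"])  # input_1:0
```

Catching a malformed config.txt here is cheaper than debugging a failed ONNX/TensorRT conversion inside the container.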
Model Deployment – The generated TensorRT model is served via Triton Inference Server:
docker pull nvcr.io/nvidia/tritonserver:20.08-py3
docker run -v ${TARGET_MODEL_PATH}:/workspace -p 8001:8001 nvcr.io/nvidia/tritonserver:20.08-py3 \
  /opt/tritonserver/bin/tritonserver --model-repository=/workspace

Clients can invoke the service through the provided RPC interface TisPredict and a Java demo client.
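Besides the gRPC port (8001) used by TisPredict, Triton also exposes a KFServing v2 HTTP/REST API. The sketch below assembles a v2‑style inference request body for the model above; the payload shape follows the v2 protocol, while the concrete values are illustrative:

```python
import json

def build_infer_request(input_name, dims, data):
    """Assemble a KFServing-v2 style request body for Triton's HTTP endpoint
    (POST /v2/models/<model_name>/infer)."""
    return {
        "inputs": [{
            "name": input_name,
            "shape": dims,
            "datatype": "FP32",   # matches the "float" data_type in config.txt
            "data": data,         # flattened row-major values
        }]
    }

# A single 224x224x3 image of zeros, purely for illustration.
body = build_infer_request("image", [1, 224, 224, 3], [0.0] * (224 * 224 * 3))
payload = json.dumps(body)
```

The payload would be POSTed to http://<host>:8000/v2/models/tensorflow-666/infer; note that 8000 is Triton's default HTTP port, and only 8001 (gRPC) is mapped in the docker run above.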
Intel MKL Acceleration – By using an MKL‑enabled TensorFlow Serving image, dl_inference speeds up CPU inference (e.g., 60% QPS increase, 40% latency reduction on an Intel Xeon E5‑2620). Key MKL environment variables such as KMP_BLOCKTIME, KMP_AFFINITY, KMP_SETTINGS, and OMP_NUM_THREADS are documented.
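These tuning knobs must be set in the environment before the serving process starts. The values below are commonly recommended starting points for MKL/OpenMP tuning, not figures from the article:

```python
import os

# Commonly recommended starting values; tune per host and workload.
mkl_env = {
    "KMP_BLOCKTIME": "0",    # threads sleep immediately after a parallel region
    "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",  # pin threads to cores
    "KMP_SETTINGS": "1",     # print effective OpenMP settings at startup
    "OMP_NUM_THREADS": str(os.cpu_count() or 1),  # one thread per logical core
}
os.environ.update(mkl_env)
```

In a Docker deployment the same variables would typically be passed with -e flags or an env file rather than from Python.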
Caffe Support – dl_inference v1.1 adds a Seldon‑wrapped Caffe RPC service with customizable pre‑process and post‑process hooks, enabling flexible data handling and multi‑stage inference.
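Seldon wraps a plain Python class and exposes its predict method over RPC; the pre‑/post‑process hook pattern might look like the sketch below (class name and method bodies are illustrative, not dl_inference's actual code):

```python
class CaffeModelWrapper:
    """Seldon-style wrapper sketch: predict() chains pre-process -> inference
    -> post-process, each stage overridable by the developer."""

    def pre_process(self, request):
        # e.g. decode, resize, normalize the raw input
        return [float(x) / 255.0 for x in request]

    def run_inference(self, features):
        # placeholder for the actual Caffe forward pass
        return [sum(features)]

    def post_process(self, raw_output):
        # e.g. map scores to labels, or feed a second model for
        # multi-stage inference
        return {"score": raw_output[0]}

    def predict(self, request, feature_names=None):
        features = self.pre_process(request)
        return self.post_process(self.run_inference(features))

print(CaffeModelWrapper().predict([255, 0]))  # {'score': 1.0}
```

Because each hook is an ordinary method, callers can subclass the wrapper to customize data handling without touching the serving layer.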
The project also provides numerous example models (e.g., qa_match, MMoE baseline) and a rich set of reference links. The source code is hosted at https://github.com/wuba/dl_inference, and contributors are encouraged to submit PRs or issues.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.