dl_inference: Open‑Source General Deep Learning Inference Service
dl_inference is an open‑source inference platform that simplifies deploying TensorFlow and PyTorch models in production. It offers a unified gRPC entry point, load‑balanced multi‑node serving, GPU and CPU deployment options, customizable pre‑ and post‑processing, and an architecture designed to extend to future AI workloads.
dl_inference is a general deep‑learning inference service launched by 58.com, built to bring TensorFlow and PyTorch models into production quickly.
Project details
GitHub repository: https://github.com/wuba/dl_inference
Supports both GPU and CPU deployment modes.
Handles multi‑node deployment with dynamic weighted round‑robin load balancing, serving over a billion online requests daily.
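The dynamic weighted round‑robin described above can be sketched with the smooth weighted round‑robin algorithm (the scheme nginx popularized). This is an illustrative sketch, not dl_inference's actual implementation; the node names and weights are hypothetical, and `set_weight` stands in for whatever health‑check signal adjusts a node's share of traffic.

```python
class SmoothWeightedRR:
    """Smooth weighted round-robin: each pick boosts every node's current
    score by its weight, selects the highest score, then subtracts the
    total weight from the winner. Over one cycle, each node is chosen
    exactly in proportion to its weight."""

    def __init__(self, nodes):
        # nodes: {name: weight}; every node starts with a current score of 0
        self.weights = dict(nodes)
        self.current = {name: 0 for name in nodes}

    def pick(self):
        total = sum(self.weights.values())
        for name, weight in self.weights.items():
            self.current[name] += weight
        best = max(self.current, key=self.current.get)
        self.current[best] -= total
        return best

    def set_weight(self, name, weight):
        # A health check can lower a degraded node's weight at runtime,
        # shifting traffic away without removing the node entirely.
        self.weights[name] = weight

# Hypothetical cluster: one strong GPU node and two lighter nodes.
balancer = SmoothWeightedRR({"gpu-node-1": 5, "gpu-node-2": 1, "cpu-node-1": 1})
picks = [balancer.pick() for _ in range(7)]
```

Over the seven picks of one full cycle, the heavier node receives five requests and each lighter node receives one, without ever sending long uninterrupted bursts to a single node.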
Key features
Simplifies deployment of deep‑learning model inference services.
Supports multi‑node deployment with built‑in load‑balancing.
Provides a unified RPC service interface.
Offers both GPU and CPU deployment options.
For PyTorch models, supports custom pre‑ and post‑processing and exposes the model‑invocation logic for customization.
Architecture
The system consists of three modules: a unified access service (gRPC entry point), a TensorFlow inference service, and a PyTorch inference service. The unified service defines common interfaces for both frameworks and performs dynamic load balancing based on node health.
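The unified access service's role can be illustrated as a single entry point that dispatches each request to the appropriate framework backend. The handler names and request shape below are assumptions for illustration, not dl_inference's actual gRPC interface.

```python
# Stand-ins for the TensorFlow and PyTorch inference backends.
def tf_infer(payload):
    return {"framework": "tensorflow", "result": payload}

def torch_infer(payload):
    return {"framework": "pytorch", "result": payload}

# The unified service maps a framework identifier in the request
# to the matching backend, so clients use one interface for both.
BACKENDS = {"tensorflow": tf_infer, "pytorch": torch_infer}

def infer(framework, payload):
    if framework not in BACKENDS:
        raise ValueError(f"unsupported framework: {framework}")
    return BACKENDS[framework](payload)
```

In the real system this dispatch sits behind gRPC and is combined with the health‑aware load balancing described above to choose a concrete node for the selected framework.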
TensorFlow inference
Uses TensorFlow Serving (Docker or bare‑metal) to serve SavedModel files, supports hot model updates, gRPC/REST APIs, and can be extended with custom operators by recompiling TensorFlow‑Serving.
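Alongside gRPC, TensorFlow Serving exposes a REST predict endpoint of the form `/v1/models/<name>:predict`. The sketch below only builds such a request (no network call is made); the model name `half_plus_two` is TensorFlow Serving's standard demo model, used here as an example.

```python
import json

def build_predict_request(model_name, instances, version=None):
    """Build the path and JSON body for a TensorFlow Serving REST
    predict call; a specific model version can optionally be pinned."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"
    path += ":predict"
    body = json.dumps({"instances": instances})
    return path, body

# Example: ask version 1 of the demo model to score two inputs.
path, body = build_predict_request("half_plus_two", [[1.0], [2.0]], version=1)
```

A client would POST this body to the serving container's REST port (8501 by default); the gRPC API on port 8500 carries the same request as a protobuf `PredictRequest` instead.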
PyTorch inference
Since PyTorch lacks a native serving component, dl_inference wraps PyTorch models with Seldon, exposing a gRPC SeldonMessage protocol. It provides optional pre‑ and post‑processing scripts and allows custom model execution logic.
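Seldon's Python wrapper convention is a user class whose `predict()` method the server invokes per request. The sketch below shows how pre‑ and post‑processing hooks fit around model invocation; the hooks, the class name, and the stand‑in model function are illustrative assumptions, not dl_inference's actual code.

```python
class PyTorchServer:
    """Seldon-style model wrapper: the serving layer calls predict(),
    which chains preprocessing, model invocation, and postprocessing."""

    def __init__(self):
        # A real deployment would load the model here, e.g.
        # torch.load("model.pth"); a plain function stands in so this
        # sketch runs without PyTorch installed.
        self.model = lambda xs: [x * 2 for x in xs]

    def preprocess(self, X):
        # Custom input handling, e.g. normalization or tokenization.
        return [float(x) for x in X]

    def postprocess(self, y):
        # Custom output handling, e.g. mapping raw scores to labels.
        return [round(v, 3) for v in y]

    def predict(self, X, feature_names=None):
        return self.postprocess(self.model(self.preprocess(X)))

server = PyTorchServer()
result = server.predict(["1", "2.5"])  # raw strings in, floats out
```

Overriding `predict()` itself is where the "custom model execution logic" hook comes in: users can replace the whole chain when a model needs nonstandard invocation.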
Deployment steps
Both TensorFlow and PyTorch models are deployed via Docker containers. For TensorFlow, prepare a SavedModel, pull the TensorFlow‑Serving image, mount the model directory, and run the container. For PyTorch, place the model file (model.pth) and custom interface scripts in a directory, build the image from the provided Dockerfile, and start the service with the supplied script.
Future roadmap
Support Caffe models on GPU and CPU.
Accelerate CPU inference using Intel MKL, OpenVINO, etc.
Accelerate GPU inference with NVIDIA TensorRT.
Contribution & feedback
Contributions are welcomed via pull requests or issues on the GitHub repository, or by emailing [email protected].
Authors
Feng Yu – Senior Backend Engineer, AI Lab, 58.com
Chen Xingzhen – Backend Architect, AI Lab, 58.com
Chen Zelong – Backend Engineer, AI Lab, 58.com
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.