
Design and Implementation of an Online Inference Service for Risk‑Control Algorithms

This article describes the architecture, key features, dynamic deployment, performance optimizations, and real‑world results of a high‑throughput online inference platform that serves deep‑learning models for JD.com’s risk‑control decision engine, achieving near‑hundred‑fold latency improvements.

JD Tech Talk

Background and Goals – The risk‑control intelligent system needs its many deep‑learning models deployed as low‑latency online services that the decision engine can call, targeting sub‑50 ms tp99 latency and large‑scale traffic without over‑provisioning hardware.

Platform Overview – A modular platform abstracts various model frameworks (Python, Groovy, PyTorch, TensorFlow, MXNet, XGBoost, PMML, TensorRT) behind a unified engine interface exposing load and predict methods, supporting dynamic engine extension.
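In sketch form, the unified engine contract could look like the following. This is a minimal Python illustration of the load/predict pair the article describes; the class names `InferenceEngine` and `EchoEngine`, and everything beyond those two methods, are assumptions rather than the platform's actual API:

```python
from abc import ABC, abstractmethod
from typing import Any


class InferenceEngine(ABC):
    """Unified interface every model engine implements,
    regardless of the underlying framework."""

    @abstractmethod
    def load(self, model_path: str) -> None:
        """Load model artifacts from storage into memory."""

    @abstractmethod
    def predict(self, inputs: Any) -> Any:
        """Run inference on already-preprocessed inputs."""


class EchoEngine(InferenceEngine):
    """Trivial stand-in engine used only to illustrate the contract."""

    def load(self, model_path: str) -> None:
        self.model_path = model_path

    def predict(self, inputs: Any) -> Any:
        return {"model": self.model_path, "output": inputs}
```

Under this scheme, adding support for a new framework means implementing this one pair of methods and registering the engine, which is what makes dynamic engine extension possible.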

Core Features

Multi‑engine support with custom script engines (Python, Groovy) and machine‑learning engines.

High‑performance native engine integration and Python GIL mitigation via multi‑process sockets.

Dynamic deployment via service‑gateway discovery, model registration, and automatic version rollout.

Flexible data‑source ingestion (Redis‑cluster r2m, HBase, streaming platforms).

Resource‑aware grouping, isolated thread‑pools, and CPU/GPU fine‑tuning.

Batch aggregation for deep models and asynchronous handling of non‑inference logic.
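The GIL mitigation listed above can be sketched with worker processes that each own their own interpreter and communicate over pipes. This is a simplified stand-in for the platform's multi-process socket workers; `worker`, `start_workers`, and the sum-of-squares "inference" are illustrative assumptions:

```python
import multiprocessing as mp


def worker(conn):
    """Runs in a separate process with its own interpreter and GIL,
    so CPU-bound inference here does not contend with the parent."""
    while True:
        req = conn.recv()
        if req is None:  # shutdown sentinel
            break
        # Stand-in for model inference on the request payload.
        conn.send(sum(x * x for x in req))


def start_workers(n):
    """Start n worker processes, each reachable over its own pipe."""
    workers = []
    for _ in range(n):
        parent_conn, child_conn = mp.Pipe()
        proc = mp.Process(target=worker, args=(child_conn,), daemon=True)
        proc.start()
        workers.append((proc, parent_conn))
    return workers
```

The parent process then load-balances requests across the pipes; because each worker is a full process, CPU-heavy Python preprocessing scales with cores instead of being serialized by a single GIL.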

Online Inference Module Design – Models are packaged as micro‑services registered through Spring Cloud components (Nacos, Ribbon, Feign). Each model ships a Translator that implements preProcess and postProcess for data preparation and result handling, with Groovy translator code compiled dynamically to bytecode.
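A Translator might look like the following sketch. The method names mirror the article's preProcess/postProcess convention (rendered in Python naming style); the feature parsing and the 0.5 risk threshold are made-up examples, not the platform's real logic:

```python
class Translator:
    """Adapts raw request data into engine inputs, and raw engine
    outputs into decision-engine responses. Each model ships its
    own translator alongside its artifacts."""

    def pre_process(self, request: dict) -> list:
        # Example: extract and normalize feature values from the request.
        return [float(v) for v in request["features"]]

    def post_process(self, raw_output: list) -> dict:
        # Example: map raw model scores onto a risk decision.
        score = max(raw_output)
        return {"score": score, "risk": score > 0.5}
```

Because translator code is configuration-driven and compiled dynamically, the pre/post logic can be updated per model without redeploying the serving nodes.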

Implementation Details

Gateway registration and routing based on model discovery.

Dynamic model deployment with configuration‑driven translator code.

Service nodes pull model files, start engines, and register themselves.

Model invocation flow: data preprocessing → engine inference → post‑processing.
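The invocation flow above can be sketched end to end. The `SquareEngine` and `PassthroughTranslator` stubs here are illustrative placeholders, not real platform classes; only the three-stage pipeline itself comes from the article:

```python
class SquareEngine:
    """Stub engine: 'inference' squares each input value."""

    def predict(self, inputs):
        return [x * x for x in inputs]


class PassthroughTranslator:
    """Stub translator with trivial pre/post steps."""

    def pre_process(self, request):
        return request["features"]

    def post_process(self, raw_output):
        return {"scores": raw_output}


def invoke(translator, engine, request):
    """Model invocation flow: preprocessing -> engine inference -> post-processing."""
    inputs = translator.pre_process(request)
    raw_output = engine.predict(inputs)
    return translator.post_process(raw_output)
```

Keeping the flow this small is what lets the same serving code host any engine/translator pair that a node pulls down at deployment time.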

Performance Optimizations

Native C++ inference libraries for low‑latency execution.

Multi‑process Python workers to bypass GIL.

CPU core limiting per inference to improve throughput.

Dedicated thread‑pools and queues to reduce context switches.

Batch processing of requests to exploit convolutional model efficiency.

GPU acceleration using containerized CUDA/cuDNN images.

Asynchronous handling of routing, logging, and MQ fallback.
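The batch-processing optimization above can be sketched as a small request batcher: concurrent single requests queue up, and a background thread drains them into one batched predict call, trading a few milliseconds of queuing delay for much better convolutional throughput. The class name, batch size, and wait time below are illustrative assumptions, not production values:

```python
import queue
import threading
import time


class Batcher:
    """Aggregates concurrent single requests into one batched
    inference call."""

    def __init__(self, predict_batch, max_batch=32, max_wait_ms=5):
        self._predict_batch = predict_batch
        self._max_batch = max_batch
        self._max_wait = max_wait_ms / 1000.0
        self._queue = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Block until this item's result is available."""
        slot = {"input": item, "done": threading.Event()}
        self._queue.put(slot)
        slot["done"].wait()
        return slot["output"]

    def _loop(self):
        while True:
            # Take one request, then gather more until the batch is
            # full or the wait deadline expires.
            batch = [self._queue.get()]
            deadline = time.monotonic() + self._max_wait
            while len(batch) < self._max_batch:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=timeout))
                except queue.Empty:
                    break
            # One batched inference call for the whole group.
            outputs = self._predict_batch([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["done"].set()
```

The `max_wait_ms` knob bounds the latency cost of waiting for a fuller batch, which is how a design like this can raise throughput while still meeting a tight tp99 target.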

Performance Comparison – Using a CNN‑based slider CAPTCHA model, the new platform achieved nearly a hundred‑fold latency reduction compared with the legacy system, sustaining >100 k TPS at 55 % CPU utilization with tp99 ≈ 11 ms during peak traffic.

Conclusion – The platform now powers real‑time model inference for multiple JD.com scenarios (risk‑control engine, CAPTCHA, insurance), delivering stable high‑throughput services during large promotions, and the design is open for further discussion and extension.

Written by JD Tech Talk, the official JD Tech public account delivering best practices and technology innovation.
