
Risk Detection Model Service Framework and Acceleration for Alibaba Content Risk Control

Alibaba’s new RiskDetection service framework replaces the bulky Inference‑kgb engine with a Triton‑based, Python‑driven kernel that unifies multiple back‑ends, standardizes tensor APIs, and accelerates image, text, and video risk models via HighService and EAS, delivering real‑time content risk control, scalable caching/batching, and significant GPU speedups for Double‑11 promotions.

Alimama Tech

1. Business background and problems

Content is a key carrier in advertising. High exposure amplifies the impact of risk leakage, making content risk‑control essential. The existing Inference‑kgb engine has become bulky due to the growing number of models, leading to capability, efficiency, quality, and cost issues.

2. Industry comparison and selection

We evaluated Alibaba Cloud EAS, DAMO‑Aquila, CRO Lingjing, Alimama HighService, and the open‑source Triton Server. The comparison table (language support, quantization, batching, model types, SDK, cloud deployment, etc.) shows that EAS and Triton provide the most comprehensive features for our needs.

3. Model service framework (RD)

The new RiskDetection (RD) kernel follows the NVIDIA Triton Server design and defines a standard business API (Model, Version, Tensor) with dynamic batching. It abstracts multiple back‑ends (EAS, HighService, Aquila) to provide unified model serving and acceleration.
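The dynamic batching the RD kernel borrows from Triton can be illustrated with a minimal sketch: requests accumulate until either a maximum batch size is reached or the oldest request has waited past a deadline, and only then does a batched inference call fire. The class, its parameter names, and the single‑threaded flush logic below are illustrative assumptions, not RD's actual implementation (Triton runs an equivalent scheduler on a dedicated thread per model).

```python
import time
from typing import Any, Callable, List


class DynamicBatcher:
    """Minimal dynamic-batching sketch: flush when the batch is full or
    the oldest queued request exceeds the wait deadline."""

    def __init__(self, infer_fn: Callable[[List[Any]], List[Any]],
                 max_batch: int = 8, max_wait_ms: float = 5.0):
        self.infer_fn = infer_fn          # batched inference callable
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self._pending: List[Any] = []
        self._oldest: float = 0.0         # arrival time of oldest pending request

    def submit(self, request: Any, now: float = None) -> List[Any]:
        """Queue a request; returns batch results when a flush fires, else []."""
        now = time.monotonic() if now is None else now
        if not self._pending:
            self._oldest = now
        self._pending.append(request)
        if (len(self._pending) >= self.max_batch
                or (now - self._oldest) * 1000.0 >= self.max_wait_ms):
            return self.flush()
        return []

    def flush(self) -> List[Any]:
        batch, self._pending = self._pending, []
        return self.infer_fn(batch)
```

Batching trades a few milliseconds of queueing latency for much higher GPU utilization, which is why it is configurable per model rather than globally.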

3.1 Standardized interfaces

Business‑side interfaces support image, text, and video inputs/outputs. Data‑side interfaces adopt the Tensor‑in‑Tensor‑out pattern compatible with KServe. Key data structures include TensorDataType, TensorShape, TensorDataContent, InferParameter, TensorEntity, PredictRequest, and PredictResponse.
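The key data structures can be sketched as plain dataclasses in the spirit of the KServe v2 inference protocol. The field names and the enum members below are illustrative assumptions based on the structure names listed above, not the framework's exact schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Union


class TensorDataType(Enum):
    """Subset of supported element types (illustrative)."""
    FP32 = "FP32"
    INT64 = "INT64"
    BYTES = "BYTES"


@dataclass
class TensorEntity:
    name: str
    shape: List[int]                            # TensorShape
    datatype: TensorDataType
    data: List[Union[float, int, bytes]]        # TensorDataContent, flattened row-major


@dataclass
class PredictRequest:
    model_name: str
    model_version: str
    inputs: List[TensorEntity]
    # InferParameter: per-request knobs such as timeouts or priority
    parameters: Dict[str, Union[bool, int, str]] = field(default_factory=dict)


@dataclass
class PredictResponse:
    model_name: str
    model_version: str
    outputs: List[TensorEntity]
```

Keeping the wire format Tensor‑in/Tensor‑out means every back‑end sees the same request shape, regardless of whether the payload started as an image, text, or video frame.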

3.2 RD kernel technical solution

RD consists of three logical components: Predictor (standard Tensor inference), Transformer (pre‑ and post‑processing), and Backends (actual serving engines). This modular design enables flexible deployment and scaling.
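The three components compose naturally as a small class hierarchy. The interfaces below are a hypothetical sketch of how a Predictor might delegate to a pluggable Backend with a Transformer wrapped around it; the actual RD APIs are not public, so method names here are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Backend(ABC):
    """Actual serving engine (EAS, HighService, Aquila, ...)."""

    @abstractmethod
    def infer(self, tensors: Dict[str, Any]) -> Dict[str, Any]: ...


class Transformer:
    """Pre- and post-processing around standard Tensor inference."""

    def preprocess(self, raw: Any) -> Dict[str, Any]:
        return {"input": raw}          # e.g. decode/resize an image into tensors

    def postprocess(self, tensors: Dict[str, Any]) -> Any:
        return tensors["output"]       # e.g. map logits to risk labels


class Predictor:
    """Standard Tensor-in/Tensor-out inference, delegating to a backend."""

    def __init__(self, backend: Backend, transformer: Transformer):
        self.backend = backend
        self.transformer = transformer

    def predict(self, raw: Any) -> Any:
        tensors = self.transformer.preprocess(raw)
        outputs = self.backend.infer(tensors)
        return self.transformer.postprocess(outputs)
```

Because the Backend is an interface, swapping EAS for HighService (or a CPU fallback) is a deployment decision rather than a code change, which is the point of the modular design.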

3.3 Data consistency guarantees

We ensure feature‑level consistency across online, near‑line, and offline scenarios by synchronizing model images, versions, and resources. Business‑level consistency is achieved by allowing different resource mixes (GPU vs. CPU) where strict numerical parity is not required.

4. Model inference acceleration

Previous Inference‑kgb relied on native TensorRT and required C++ re‑implementation for every model update. The new framework uses Python for model development and supports multiple acceleration back‑ends (HighService, EAS), dramatically reducing development cycles.

4.1 HighService backend integration

HighService is an internal heterogeneous‑computing framework that decouples GPU and CPU workloads, uses multi‑process CPU execution, and integrates TensorRT for PyTorch models, achieving large performance gains.
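The GPU/CPU decoupling pattern can be sketched as a parallel CPU preprocessing stage feeding a single GPU inference stage. The sketch below uses a thread pool as a stand‑in for HighService's CPU worker processes, and the `cpu_preprocess`/`gpu_infer` functions are hypothetical stubs, not HighService APIs.

```python
from multiprocessing.dummy import Pool  # thread pool standing in for CPU worker processes
from typing import List


def cpu_preprocess(raw: str) -> List[str]:
    # CPU-bound work (decode, resize, tokenize) runs in the worker pool;
    # tokenization stands in for real preprocessing here.
    return raw.lower().split()


def gpu_infer(batch: List[List[str]]) -> List[int]:
    # The GPU stage only ever sees preprocessed tensors; a stub scorer here.
    return [len(tokens) for tokens in batch]


def serve(requests: List[str], workers: int = 4) -> List[int]:
    with Pool(workers) as pool:
        batch = pool.map(cpu_preprocess, requests)  # parallel CPU stage
    return gpu_infer(batch)                          # single batched GPU stage
```

Keeping the GPU stage free of CPU-bound work is what lets TensorRT-compiled PyTorch models stay saturated instead of stalling on preprocessing.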

4.2 EAS backend integration

EAS provides seamless PAI integration, Blade acceleration, and comprehensive service/operation features. We adopted the Mediaflow SDK for lightweight DAG‑based model deployment. Example code:

# Pipeline construction: operators are chained into a DAG on the Mediaflow graph
with graph.as_default():
    mediaflow.MediaData() \
        .map(tensorflow_op.tensorflow, args=cfg) \
        .output("output")

# Invocation: the engine executes the DAG against one data frame
# (graph, engine, ctx, and cfg are provided by the Mediaflow SDK setup)
results = engine.run(data_frame, ctx, graph)

Mediaflow enables DAG‑style model composition and inherits Blade acceleration.

5. Service features and effects

Key service/operation features include Caching, Batching, and Scaling, configurable per model. The unified three‑tuple (image + model file + config) simplifies rapid service launch and version management.
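A per‑model configuration in the spirit of the three‑tuple, plus a content‑addressed result cache, might look like the sketch below. All URIs, field names, and defaults are hypothetical placeholders; only the structure (image + model file + config with caching/batching/scaling toggles) follows the text.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical per-model three-tuple: image URI + model file URI + service config.
MODEL_CONFIG = {
    "image": "registry.example.com/rd/image-risk:1.4",      # placeholder URI
    "model_file": "oss://models/image-risk/v3/model.pt",    # placeholder URI
    "config": {
        "cache":    {"enabled": True, "ttl_s": 3600},
        "batching": {"max_batch": 16, "max_wait_ms": 5},
        "scaling":  {"min_replicas": 2, "max_replicas": 20},
    },
}


class ResultCache:
    """Content-addressed result cache: identical payloads skip inference.
    Useful in risk control, where the same image/text is often re-submitted."""

    def __init__(self):
        self._store: Dict[str, object] = {}

    def key(self, payload: bytes) -> str:
        return hashlib.sha256(payload).hexdigest()

    def get_or_compute(self, payload: bytes, infer_fn: Callable[[bytes], object]):
        k = self.key(payload)
        if k not in self._store:
            self._store[k] = infer_fn(payload)
        return self._store[k]
```

In production a TTL and an eviction policy would bound the store; the sketch omits both for brevity.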

Offline support is provided via the Starling scheduler on the Drogo platform, leveraging Hippo resources and ODPS for large‑scale feature extraction.

Business impact: after RD rollout, GPU‑accelerated models achieved dozens‑fold speedup, enabling real‑time risk detection during Double‑11 promotions. The InferenceProxy layer adds QoS‑aware traffic steering based on business tags, ensuring stable service under high load.
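Tag‑based QoS steering of the kind InferenceProxy performs can be sketched as a priority queue that serves high‑priority business tags first and sheds the lowest‑priority traffic once a capacity bound is hit. The tag names, priority table, and shedding rule below are all illustrative assumptions, not InferenceProxy's actual policy.

```python
import heapq
from typing import Any, Optional


class QosRouter:
    """Sketch of tag-based QoS steering: lower priority number = served first;
    lowest-priority traffic is shed once the queue is at capacity."""

    PRIORITY = {"realtime-ad": 0, "nearline": 1, "offline-backfill": 2}  # hypothetical tags

    def __init__(self, capacity: int = 100):
        self.capacity = capacity
        self._queue = []   # heap of (priority, seq, request)
        self._seq = 0      # tie-breaker preserving FIFO within a priority

    def submit(self, tag: str, request: Any) -> bool:
        prio = self.PRIORITY.get(tag, 1)
        if len(self._queue) >= self.capacity and prio == max(self.PRIORITY.values()):
            return False   # shed lowest-priority traffic under load
        heapq.heappush(self._queue, (prio, self._seq, request))
        self._seq += 1
        return True

    def next_request(self) -> Optional[Any]:
        return heapq.heappop(self._queue)[2] if self._queue else None
```

The effect during a Double‑11 peak is that real‑time detection keeps its latency budget while backfill traffic absorbs the degradation.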

6. Future outlook

We plan to continue expanding GPU/CPU acceleration, explore PPU solutions, standardize caching/batching/scaling across back‑ends, streamline model lifecycle management, and improve cost efficiency through better resource utilization and ROI‑driven model selection.

Tags: AI · Backend Integration · Inference Engine · Model Serving · Risk Detection
Written by Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.