GPU Model Inference Optimization Practices in NetEase News Recommendation System
The article outlines practical GPU inference optimization for NetEase’s news recommendation, covering model analysis with Netron, multi‑GPU parallelism, memory‑copy reduction, batch sizing, TensorRT conversion and tuning, custom plugins, and the GRPS serving framework to achieve significant latency and utilization gains.
1. Model and Performance Analysis Tools
The article begins by emphasizing the need for proper tools to analyze and optimize deep‑learning models before inference. It recommends using model.summary() (TensorFlow/Keras) or print(model) (PyTorch) for a textual view, but suggests the netron visualizer for clearer network diagrams.
pip3 install netron
netron resnet50.onnx # view ONNX model
netron resnet50.pt # view PyTorch model
netron resnet50.pb # view TensorFlow SavedModel
1.1 Network Structure Analysis
Netron can display layers, connections, and parameters for models such as ResNet‑50, helping engineers quickly locate bottlenecks.
1.2 Complexity Analysis
Model complexity is measured by total parameter count and FLOPs (floating‑point operations). Profilers in TensorFlow, PyTorch, and ONNX can report these metrics, which are essential for memory and latency estimation.
# TensorFlow 1.x example
import numpy as np
import tensorflow as tf

# x: the model's input placeholder, y: the model's output tensor
with tf.Session() as sess:
    run_meta = tf.RunMetadata()
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    input_data = np.random.normal(size=(1, 256, 256))
    sess.run(y, feed_dict={x: input_data}, options=run_options, run_metadata=run_meta)
    # Parameter count over all trainable variables
    params_count = tf.profiler.profile(graph=sess.graph, run_meta=run_meta, cmd='op',
        options=tf.profiler.ProfileOptionBuilder.trainable_variables_parameter())
    # Total floating-point operations
    flops = tf.profiler.profile(graph=sess.graph, run_meta=run_meta, cmd='op',
        options=tf.profiler.ProfileOptionBuilder.float_operation())
    print('total parameters:', params_count.total_parameters)
    print('total FLOPs:', flops.total_float_ops)
Similar scripts are shown for TensorFlow 2.x (using keras_flops) and PyTorch (using torchstat).
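For TensorFlow 2.x and PyTorch the same metrics come from small helper packages; a minimal sketch, assuming a Keras model object and a vision-style PyTorch model with 3x224x224 input (shapes are illustrative):
# TensorFlow 2.x: FLOPs via the keras_flops package
from keras_flops import get_flops
flops = get_flops(model, batch_size=1)
print('total FLOPs:', flops)
model.summary()  # parameter count

# PyTorch: parameters, FLOPs, and memory via torchstat
from torchstat import stat
stat(model, (3, 224, 224))  # input shape without the batch dimension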
1.3 Inference Timeline
Timeline tracing records per‑operator execution time and tensor flow. The trace is saved in Chrome‑trace format and can be visualized with Perfetto or chrome://tracing.
# Export the tracing file (TensorFlow 1.x); CUPTI is required for GPU kernel timing
export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH

with tf.Session() as sess:
    run_meta = tf.RunMetadata()
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    sess.run(y, options=run_options, run_metadata=run_meta)
    with open('tracing', 'wb') as f:
        f.write(run_meta.step_stats.SerializeToString())

# Convert the dumped step stats to Chrome-trace JSON
import sys
from tensorflow.core.framework.step_stats_pb2 import StepStats
from tensorflow.python.client import timeline

step_stats = StepStats()
with open(sys.argv[1], 'rb') as f:
    step_stats.ParseFromString(f.read())
json_trace = timeline.Timeline(step_stats).generate_chrome_trace_format()
with open(f'{sys.argv[1]}-data.json', 'w') as out:
    out.write(json_trace)
1.4 Multi‑Device Operator Distribution
For large models, operators can be placed on different devices using tf.device (TensorFlow) or .to('cuda:x') (PyTorch). Scripts are provided to dump the SavedModel meta‑graph or to print a PyTorch tensor’s device.
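A minimal sketch of explicit placement and a device check (the tensor and layer names are illustrative, not from the original article):
import tensorflow as tf
import torch

# TensorFlow 1.x: pin a sub-graph to the second GPU
with tf.device('/gpu:1'):
    logits = tf.matmul(hidden, weights) + bias   # hidden/weights/bias are illustrative

# PyTorch: move a module and its input to a specific GPU, then verify placement
model = model.to('cuda:1')
x = torch.randn(1, 3, 224, 224, device='cuda:1')
print(next(model.parameters()).device)   # cuda:1
print(x.device)                          # cuda:1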
2. Multi‑GPU Inference Optimization
When a single GPU cannot meet latency or memory requirements, parallel strategies such as Data Parallel (DP), Tensor Parallel (TP), Pipeline Parallel (PP), and ZeRO are introduced. The article focuses on a practical TP‑like strategy that evenly splits operators across two GPUs, achieving ~15% latency reduction.
2.1 Practical Case
By distributing duplicated sub‑networks across two GPUs and ensuring fair load‑balancing, the overall inference time decreased by about 15%.
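A minimal sketch of this split in TensorFlow 1.x, assuming two duplicated sub‑networks and a hypothetical build_tower helper (none of these names come from the original article):
# Place the two duplicated sub-networks on different GPUs and merge the results
with tf.device('/gpu:0'):
    out_a = build_tower(features_a)
with tf.device('/gpu:1'):
    out_b = build_tower(features_b)
with tf.device('/gpu:0'):
    # Merging on one card costs a single D2D copy of out_b rather than many small copies
    logits = tf.concat([out_a, out_b], axis=-1)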
2.2 Considerations
Maintain high GPU utilization on all cards.
Minimize inter‑GPU (D2D) memory copies.
Avoid TensorFlow’s automatic device placement for ops that cannot run on GPU (e.g., HashTableLookup).
3. Memory‑Copy Optimization
Three types of memory movement are defined: H2D (host‑to‑device), D2D (device‑to‑device), and D2H (device‑to‑host). Reducing the number of small copy operations by aggregating tensors into larger buffers can cut copy overhead by up to 34%.
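A minimal PyCUDA sketch of the aggregation idea (tensor counts, sizes, and offsets are illustrative): instead of issuing one host‑to‑device copy per small tensor, pack them into a single pinned staging buffer and copy once.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

small_tensors = [np.random.rand(128).astype(np.float32) for _ in range(64)]

# Naive approach: 64 separate H2D copies, each paying per-copy overhead.
# Aggregated approach: one pinned staging buffer, one H2D copy.
total = sum(t.size for t in small_tensors)
staging = cuda.pagelocked_empty(total, dtype=np.float32)
offset = 0
for t in small_tensors:
    staging[offset:offset + t.size] = t
    offset += t.size
d_buf = cuda.mem_alloc(staging.nbytes)
cuda.memcpy_htod(d_buf, staging)   # single copy; kernels read at known offsets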
4. Batch Processing
Increasing batch size improves GPU utilization. The article warns that when GPU usage is already >90%, larger batches bring diminishing returns.
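A minimal sketch of the server‑side micro‑batching this implies (the queue type, max_batch, and timeout values are illustrative assumptions):
import queue, time

def collect_batch(request_queue, max_batch=64, timeout_ms=5):
    """Gather up to max_batch requests, or wait at most timeout_ms after the first one."""
    batch = [request_queue.get()]               # block for the first request
    deadline = time.monotonic() + timeout_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch   # run one forward pass over the whole batch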
5. Operator Optimization – TensorRT Practice
5.1 TensorRT Overview
TensorRT converts trained models into highly optimized inference engines, applying layer fusion, precision calibration, and kernel auto‑tuning.
5.2 Installation
# Download matching TensorRT tarball from NVIDIA
# Extract to /usr/local/TensorRT-x.x.x
export TRT_HOME=/usr/local/TensorRT-x.x.x
export LD_LIBRARY_PATH=$TRT_HOME/lib:$LD_LIBRARY_PATH
export PATH=$TRT_HOME/targets/x86_64-linux-gnu/bin:$PATH
pip install tensorrt-*.whl
pip install pycuda
5.3 TensorRT Workflow
Typical flow: train → export (ONNX) → build TensorRT engine → inference.
5.3.1 ONNX Conversion
TensorFlow models are converted with tf2onnx, PyTorch models with torch.onnx.export:
# TensorFlow → ONNX
import tf2onnx
model_proto, _ = tf2onnx.convert.from_keras(model, opset=12, custom_ops={'MatMul': 'zcc'}, output_path='model.onnx')
# PyTorch → ONNX
dummy_input = torch.randn(1,3,224,224, device='cuda')
torch.onnx.export(model, dummy_input, 'model.onnx', input_names=['input'], output_names=['output'], dynamic_axes={'input': [0], 'output': [0]})
5.3.2 Building a TensorRT Engine (Python API)
import tensorrt as trt
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
parser.parse_from_file('model.onnx')
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30 # 1 GB
config.set_flag(trt.BuilderFlag.FP16) # enable mixed‑precision
profile = builder.create_optimization_profile()
profile.set_shape(network.get_input(0).name, min=trt.Dims([1,3,224,224]), opt=trt.Dims([64,3,224,224]), max=trt.Dims([128,3,224,224]))
config.add_optimization_profile(profile)
engine_bytes = builder.build_serialized_network(network, config)
with open('model.trt', 'wb') as f:
    f.write(engine_bytes)
5.3.3 Inference with the Engine
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

runtime = trt.Runtime(trt.Logger(trt.Logger.ERROR))
with open('model.trt', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
# Illustrative host buffers for a 224x224 image model; adjust shapes to your engine
input = np.random.rand(1, 3, 224, 224).astype(np.float32)
output = np.empty((1, 1000), dtype=np.float32)
context.set_binding_shape(0, input.shape)  # required for engines built with dynamic shapes
# Allocate device buffers
d_input = cuda.mem_alloc(input.nbytes)
d_output = cuda.mem_alloc(output.nbytes)
stream = cuda.Stream()
# Transfer, execute, retrieve
cuda.memcpy_htod_async(d_input, input, stream)
context.execute_async_v2(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(output, d_output, stream)
stream.synchronize()
5.4 TensorRT Tuning
Batch size: set an optimal batch in the optimization profile (e.g., opt=trt.Dims([64,3,224,224])).
Mixed precision: enable FP16 via config.set_flag(trt.BuilderFlag.FP16), or INT8 with a calibrator (see the sketch after this list).
Layer fusion: ensure supported operator patterns; redesign the network if necessary.
Multiple CUDA streams: from TensorRT 8.6, config.max_aux_streams = 7 and bind them with context.set_aux_streams([...]) to overlap independent layers.
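A minimal sketch of an INT8 entropy calibrator, assuming a Python iterable of preprocessed NumPy batches and an illustrative cache-file name (neither comes from the original article):
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file='calib.cache'):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)      # iterable of NumPy arrays with a fixed batch size
        self.cache_file = cache_file
        self.d_input = None

    def get_batch_size(self):
        return 64                         # must match the calibration batch shape

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                   # signals the end of the calibration data
        if self.d_input is None:
            self.d_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.d_input, batch)
        return [int(self.d_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, 'rb') as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)

# During engine building:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = EntropyCalibrator(calibration_batches)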
5.5 Dealing with Precision Issues
When FP16 causes large output errors, selectively keep problematic layers in FP32:
# Force numerically sensitive layer types to stay in FP32 inside an FP16 engine
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type in [trt.LayerType.UNARY, trt.LayerType.REDUCE]:
        layer.precision = trt.DataType.FLOAT
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.DataType.FLOAT)
# Make the builder honor per-layer precision (TensorRT 8.x)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
5.6 Custom TensorRT Plugins
For unsupported operators, create a plugin by inheriting nvinfer1::IPluginV2DynamicExt (implementation) and nvinfer1::IPluginCreator (registration). The plugin name and namespace must match the ONNX custom op's type and domain (e.g., zcc::MyMatMul).
// Simplified C++ skeleton
class MyMatMulPlugin : public nvinfer1::IPluginV2DynamicExt {
public:
    // clone, getOutputDimensions, enqueue, getPluginType ("MyMatMul"), etc.
};
class MyMatMulPluginCreator : public nvinfer1::IPluginCreator {
    // getPluginName ("MyMatMul"), getPluginNamespace ("zcc"), createPlugin, etc.
};
REGISTER_TENSORRT_PLUGIN(MyMatMulPluginCreator);
Python/TensorFlow examples show how to rename existing ops (e.g., MatMul) to a custom domain so that the plugin is invoked during engine building.
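A minimal sketch of that renaming step on an exported ONNX file (the op and domain names follow the MatMul/zcc example above; whether the original article patches the graph this way or during export is an assumption):
import onnx

model = onnx.load('model.onnx')
for node in model.graph.node:
    if node.op_type == 'MatMul':          # op to replace with the custom plugin
        node.op_type = 'MyMatMul'         # must match the plugin's getPluginType
        node.domain = 'zcc'               # must match the plugin namespace
# Register the custom domain so the model still passes the ONNX checker
model.opset_import.append(onnx.helper.make_opsetid('zcc', 1))
onnx.save(model, 'model_custom.onnx')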
6. General Real‑Time Prediction Service (GRPS)
GRPS (Generic Realtime Prediction Service) is an internal framework designed to host various AI models (TF, PyTorch, TensorRT) behind unified REST/RPC APIs. Its goals are:
Generality – supports built‑in back‑ends and custom pre/post‑processing.
Performance – concurrency, GPU utilization, and profiling optimizations.
Scalability – single‑node and distributed deployment.
Resource control – limits on concurrency and GPU memory.
Observability – logging and metrics.
Deployments of >15 models using GRPS have shown 20‑30% average latency reduction (some >70%) and more stable response times.
7. Experience Summary
Beyond model‑level tricks, engineering improvements such as caching, I/O optimization, service‑logic refactoring, pre‑allocation of memory, moving critical services to C++, and multi‑process/multi‑thread concurrency contribute significantly to overall latency reductions.
8. References
Netron visualizer – https://github.com/lutzroeder/netron
TensorRT developer guide – https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html
Custom ONNX ops for PyTorch – https://cloud.tencent.com/developer/article/2010629
Amirstan TensorRT plugin library – https://github.com/grimoire/amirstan_plugin
TensorFlow custom op guide – https://www.tensorflow.org/guide/create_op
TensorRT FP16 debugging – https://oldpan.me/archives/tensorrt-fp16-debug
CUDA C++ programming guide – https://docs.nvidia.com/cuda/archive/11.0/cuda-c-programming-guide/index.html
PyCUDA tutorial – https://documen.tician.de/pycuda/tutorial.html
Seldon core – https://docs.seldon.io/projects/seldon-core/en/latest/
TorchServe – https://pytorch.org/serve/
NVIDIA Triton Inference Server – https://developer.nvidia.com/nvidia-triton-inference-server
NetEase Media Technology Team