GPU Model Inference Optimization Practices in NetEase News Recommendation System
The article outlines practical GPU inference optimization for NetEase’s news recommendation, covering model analysis with Netron, multi‑GPU parallelism, memory‑copy reduction, batch sizing, TensorRT conversion and tuning, custom plugins, and the GRPS serving framework to achieve significant latency and utilization gains.
1. Model and Performance Analysis Tools
The article begins by emphasizing the need for proper tools to analyze and optimize deep‑learning models before inference. It recommends using model.summary() (TensorFlow/Keras) or print(model) (PyTorch) for a textual view, but suggests the netron visualizer for clearer network diagrams.
pip3 install netron
netron resnet50.onnx # view ONNX model
netron resnet50.pt # view PyTorch model
netron resnet50.pb # view TensorFlow SavedModel
1.1 Network Structure Analysis
Netron can display layers, connections, and parameters for models such as ResNet‑50, helping engineers quickly locate bottlenecks.
1.2 Complexity Analysis
Model complexity is measured by total parameter count and FLOPs (floating‑point operations). Profilers in TensorFlow, PyTorch, and ONNX can report these metrics, which are essential for memory and latency estimation.
# TensorFlow 1.x example
import numpy as np
import tensorflow as tf

# x: the model's input placeholder, y: the model's output tensor
with tf.Session() as sess:
    run_meta = tf.RunMetadata()
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    input_data = np.random.normal(size=(1, 256, 256))
    sess.run(y, feed_dict={x: input_data}, options=run_options, run_metadata=run_meta)
    # Parameter count over all trainable variables
    params_count = tf.profiler.profile(graph=sess.graph, run_meta=run_meta, cmd='op',
        options=tf.profiler.ProfileOptionBuilder.trainable_variables_parameter())
    # Total floating-point operations
    flops = tf.profiler.profile(graph=sess.graph, run_meta=run_meta, cmd='op',
        options=tf.profiler.ProfileOptionBuilder.float_operation())
    print('total parameters:', params_count.total_parameters)
    print('total FLOPs:', flops.total_float_ops)
Similar scripts are shown for TensorFlow 2.x (using keras_flops) and PyTorch (using torchstat).
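For TensorFlow 2.x and PyTorch the same metrics come from small helper packages; a minimal sketch, assuming a Keras model object and a vision-style PyTorch model with 3x224x224 input (shapes are illustrative):
# TensorFlow 2.x: FLOPs via the keras_flops package
from keras_flops import get_flops
flops = get_flops(model, batch_size=1)
print('total FLOPs:', flops)
model.summary()  # parameter count

# PyTorch: parameters, FLOPs, and memory via torchstat
from torchstat import stat
stat(model, (3, 224, 224))  # input shape without the batch dimension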
1.3 Inference Timeline
Timeline tracing records per‑operator execution time and tensor flow. The trace is saved in Chrome‑trace format and can be visualized with Perfetto or chrome://tracing.
# Export the tracing file (TensorFlow 1.x); CUPTI is required for GPU kernel timing
export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH

with tf.Session() as sess:
    run_meta = tf.RunMetadata()
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    sess.run(y, options=run_options, run_metadata=run_meta)
    with open('tracing', 'wb') as f:
        f.write(run_meta.step_stats.SerializeToString())

# Convert the dumped step stats to Chrome-trace JSON
import sys
from tensorflow.core.framework.step_stats_pb2 import StepStats
from tensorflow.python.client import timeline

step_stats = StepStats()
with open(sys.argv[1], 'rb') as f:
    step_stats.ParseFromString(f.read())
json_trace = timeline.Timeline(step_stats).generate_chrome_trace_format()
with open(f'{sys.argv[1]}-data.json', 'w') as out:
    out.write(json_trace)
1.4 Multi‑Device Operator Distribution
For large models, operators can be placed on different devices using tf.device (TensorFlow) or .to('cuda:x') (PyTorch). Scripts are provided to dump the SavedModel meta‑graph or to print a PyTorch tensor’s device.
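A minimal sketch of explicit placement and a device check (the tensor and layer names are illustrative, not from the original article):
import tensorflow as tf
import torch

# TensorFlow 1.x: pin a sub-graph to the second GPU
with tf.device('/gpu:1'):
    logits = tf.matmul(hidden, weights) + bias   # hidden/weights/bias are illustrative

# PyTorch: move a module and its input to a specific GPU, then verify placement
model = model.to('cuda:1')
x = torch.randn(1, 3, 224, 224, device='cuda:1')
print(next(model.parameters()).device)   # cuda:1
print(x.device)                          # cuda:1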
2. Multi‑GPU Inference Optimization
When a single GPU cannot meet latency or memory requirements, parallel strategies such as Data Parallel (DP), Tensor Parallel (TP), Pipeline Parallel (PP), and ZeRO are introduced. The article focuses on a practical TP‑like strategy that evenly splits operators across two GPUs, achieving ~15% latency reduction.
2.1 Practical Case
By distributing duplicated sub‑networks across two GPUs and ensuring fair load‑balancing, the overall inference time decreased by about 15%.
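A minimal sketch of this split in TensorFlow 1.x, assuming two duplicated sub‑networks and a hypothetical build_tower helper (none of these names come from the original article):
# Place the two duplicated sub-networks on different GPUs and merge the results
with tf.device('/gpu:0'):
    out_a = build_tower(features_a)
with tf.device('/gpu:1'):
    out_b = build_tower(features_b)
with tf.device('/gpu:0'):
    # Merging on one card costs a single D2D copy of out_b rather than many small copies
    logits = tf.concat([out_a, out_b], axis=-1)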
2.2 Considerations
Maintain high GPU utilization on all cards.
Minimize inter‑GPU (D2D) memory copies.
Avoid TensorFlow’s automatic device placement for ops that cannot run on GPU (e.g., HashTableLookup).
3. Memory‑Copy Optimization
Three types of memory movement are defined: H2D (host‑to‑device), D2D (device‑to‑device), and D2H (device‑to‑host). Reducing the number of small copy operations by aggregating tensors into larger buffers can cut copy overhead by up to 34%.
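A minimal PyCUDA sketch of the aggregation idea (tensor counts, sizes, and offsets are illustrative): instead of issuing one host‑to‑device copy per small tensor, pack them into a single pinned staging buffer and copy once.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

small_tensors = [np.random.rand(128).astype(np.float32) for _ in range(64)]

# Naive approach: 64 separate H2D copies, each paying per-copy overhead.
# Aggregated approach: one pinned staging buffer, one H2D copy.
total = sum(t.size for t in small_tensors)
staging = cuda.pagelocked_empty(total, dtype=np.float32)
offset = 0
for t in small_tensors:
    staging[offset:offset + t.size] = t
    offset += t.size
d_buf = cuda.mem_alloc(staging.nbytes)
cuda.memcpy_htod(d_buf, staging)   # single copy; kernels read at known offsets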
4. Batch Processing
Increasing batch size improves GPU utilization. The article warns that when GPU usage is already >90%, larger batches bring diminishing returns.
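A minimal sketch of the server‑side micro‑batching this implies (the queue type, max_batch, and timeout values are illustrative assumptions):
import queue, time

def collect_batch(request_queue, max_batch=64, timeout_ms=5):
    """Gather up to max_batch requests, or wait at most timeout_ms after the first one."""
    batch = [request_queue.get()]               # block for the first request
    deadline = time.monotonic() + timeout_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch   # run one forward pass over the whole batch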
5. Operator Optimization – TensorRT Practice
5.1 TensorRT Overview
TensorRT converts trained models into highly optimized inference engines, applying layer fusion, precision calibration, and kernel auto‑tuning.
5.2 Installation
# Download matching TensorRT tarball from NVIDIA
# Extract to /usr/local/TensorRT-x.x.x
export TRT_HOME=/usr/local/TensorRT-x.x.x
export LD_LIBRARY_PATH=$TRT_HOME/lib:$LD_LIBRARY_PATH
export PATH=$TRT_HOME/targets/x86_64-linux-gnu/bin:$PATH
pip install tensorrt-*.whl
pip install pycuda
5.3 TensorRT Workflow
Typical flow: train → export (ONNX) → build TensorRT engine → inference.
5.3.1 ONNX Conversion
TensorFlow models are converted with tf2onnx, PyTorch models with torch.onnx.export:
# TensorFlow → ONNX
import tf2onnx
model_proto, _ = tf2onnx.convert.from_keras(model, opset=12, custom_ops={'MatMul': 'zcc'}, output_path='model.onnx')
# PyTorch → ONNX
dummy_input = torch.randn(1,3,224,224, device='cuda')
torch.onnx.export(model, dummy_input, 'model.onnx', input_names=['input'], output_names=['output'], dynamic_axes={'input': [0], 'output': [0]})
5.3.2 Building a TensorRT Engine (Python API)
import tensorrt as trt
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
parser.parse_from_file('model.onnx')
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30 # 1 GB
config.set_flag(trt.BuilderFlag.FP16) # enable mixed‑precision
profile = builder.create_optimization_profile()
profile.set_shape(network.get_input(0).name, min=trt.Dims([1,3,224,224]), opt=trt.Dims([64,3,224,224]), max=trt.Dims([128,3,224,224]))
config.add_optimization_profile(profile)
engine_bytes = builder.build_serialized_network(network, config)
with open('model.trt', 'wb') as f:
    f.write(engine_bytes)
5.3.3 Inference with the Engine
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

runtime = trt.Runtime(trt.Logger(trt.Logger.ERROR))
with open('model.trt', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
# Illustrative host buffers for a 224x224 image model; adjust shapes to your engine
input = np.random.rand(1, 3, 224, 224).astype(np.float32)
output = np.empty((1, 1000), dtype=np.float32)
context.set_binding_shape(0, input.shape)  # required for engines built with dynamic shapes
# Allocate device buffers
d_input = cuda.mem_alloc(input.nbytes)
d_output = cuda.mem_alloc(output.nbytes)
stream = cuda.Stream()
# Transfer, execute, retrieve
cuda.memcpy_htod_async(d_input, input, stream)
context.execute_async_v2(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(output, d_output, stream)
stream.synchronize()
5.4 TensorRT Tuning
Batch size: set an optimal batch in the optimization profile (e.g., opt=trt.Dims([64,3,224,224])).
Mixed precision: enable FP16 via config.set_flag(trt.BuilderFlag.FP16), or INT8 with a calibrator (see the sketch after this list).
Layer fusion: ensure supported operator patterns; redesign the network if necessary.
Multiple CUDA streams: from TensorRT 8.6, config.max_aux_streams = 7 and bind them with context.set_aux_streams([...]) to overlap independent layers.
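A minimal sketch of an INT8 entropy calibrator, assuming a Python iterable of preprocessed NumPy batches and an illustrative cache-file name (neither comes from the original article):
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file='calib.cache'):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)      # iterable of NumPy arrays with a fixed batch size
        self.cache_file = cache_file
        self.d_input = None

    def get_batch_size(self):
        return 64                         # must match the calibration batch shape

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                   # signals the end of the calibration data
        if self.d_input is None:
            self.d_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.d_input, batch)
        return [int(self.d_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, 'rb') as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)

# During engine building:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = EntropyCalibrator(calibration_batches)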
5.5 Dealing with Precision Issues
When FP16 causes large output errors, selectively keep problematic layers in FP32:
# Force numerically sensitive layer types to stay in FP32 inside an FP16 engine
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type in [trt.LayerType.UNARY, trt.LayerType.REDUCE]:
        layer.precision = trt.DataType.FLOAT
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.DataType.FLOAT)
# Make the builder honor per-layer precision (TensorRT 8.x)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
5.6 Custom TensorRT Plugins
For unsupported operators, create a plugin by inheriting nvinfer1::IPluginV2DynamicExt (implementation) and nvinfer1::IPluginCreator (registration). The plugin name and namespace must match the ONNX custom op's type and domain (e.g., zcc::MyMatMul).
// Simplified C++ skeleton
class MyMatMulPlugin : public nvinfer1::IPluginV2DynamicExt {
public:
    // clone, getOutputDimensions, enqueue, getPluginType ("MyMatMul"), etc.
};
class MyMatMulPluginCreator : public nvinfer1::IPluginCreator {
    // getPluginName ("MyMatMul"), getPluginNamespace ("zcc"), createPlugin, etc.
};
REGISTER_TENSORRT_PLUGIN(MyMatMulPluginCreator);
Python/TensorFlow examples show how to rename existing ops (e.g., MatMul) to a custom domain so that the plugin is invoked during engine building.
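A minimal sketch of that renaming step on an exported ONNX file (the op and domain names follow the MatMul/zcc example above; whether the original article patches the graph this way or during export is an assumption):
import onnx

model = onnx.load('model.onnx')
for node in model.graph.node:
    if node.op_type == 'MatMul':          # op to replace with the custom plugin
        node.op_type = 'MyMatMul'         # must match the plugin's getPluginType
        node.domain = 'zcc'               # must match the plugin namespace
# Register the custom domain so the model still passes the ONNX checker
model.opset_import.append(onnx.helper.make_opsetid('zcc', 1))
onnx.save(model, 'model_custom.onnx')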
6. General Real‑Time Prediction Service (GRPS)
GRPS (Generic Realtime Prediction Service) is an internal framework designed to host various AI models (TF, PyTorch, TensorRT) behind unified REST/RPC APIs. Its goals are:
Generality – supports built‑in back‑ends and custom pre/post‑processing.
Performance – concurrency, GPU utilization, and profiling optimizations.
Scalability – single‑node and distributed deployment.
Resource control – limits on concurrency and GPU memory.
Observability – logging and metrics.
Deployments of >15 models using GRPS have shown 20‑30% average latency reduction (some >70%) and more stable response times.
7. Experience Summary
Beyond model‑level tricks, engineering improvements such as caching, I/O optimization, service‑logic refactoring, pre‑allocation of memory, moving critical services to C++, and multi‑process/multi‑thread concurrency contribute significantly to overall latency reductions.
8. References
Netron visualizer – https://github.com/lutzroeder/netron
TensorRT developer guide – https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html
Custom ONNX ops for PyTorch – https://cloud.tencent.com/developer/article/2010629
Amirstan TensorRT plugin library – https://github.com/grimoire/amirstan_plugin
TensorFlow custom op guide – https://www.tensorflow.org/guide/create_op
TensorRT FP16 debugging – https://oldpan.me/archives/tensorrt-fp16-debug
CUDA C++ programming guide – https://docs.nvidia.com/cuda/archive/11.0/cuda-c-programming-guide/index.html
PyCUDA tutorial – https://documen.tician.de/pycuda/tutorial.html
Seldon core – https://docs.seldon.io/projects/seldon-core/en/latest/
TorchServe – https://pytorch.org/serve/
NVIDIA Triton Inference Server – https://developer.nvidia.com/nvidia-triton-inference-server
NetEase Media Technology Team