
Optimizing GPU Inference for CTR Models: Kernel Fusion, Multi‑Stream Execution, and Batch Merging

By fusing sparse‑feature operators, enabling multi‑stream execution, consolidating data copies, and merging inference batches, iQIYI reduced GPU CTR model latency to CPU‑level, boosted throughput over sixfold, and cut operational costs by more than 40%, overcoming launch‑overhead bottlenecks.

iQIYI Technical Product Team

GPUs are widely used in iQIYI's deep‑learning platform because their thousands of cores can execute massively parallel workloads, making them well suited to training and inference for computer‑vision and NLP models. CTR (click‑through‑rate) models, which predict the probability that a user clicks an ad or a video, also rely heavily on GPUs during training to reduce time and cost.

When the trained CTR model is deployed for online inference via TensorFlow Serving on a GPU, two issues appear: (1) high inference latency, which directly affects end‑user experience, and (2) low GPU utilization, because many kernels are launched but spend most of their time in launch overhead rather than computation.

Analysis Tools

TensorBoard – visualizes stage‑wise execution time and aggregates per‑operator costs.

Nsight – NVIDIA’s low‑level profiling suite for CUDA programs.

Analysis Conclusions

CTR models contain a large number of sparse features (e.g., device ID, recent video IDs). TensorFlow’s FeatureColumn transforms each feature through identity/hash, embedding lookup, and pooling, generating a separate CUDA kernel for each operation. Because these kernels perform very little computation, the launch overhead dominates the total execution time. With dozens to hundreds of sparse features, the model may launch hundreds of kernels per inference, which becomes the main performance bottleneck.
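To make the bottleneck concrete, here is an illustrative plain‑Python sketch (all names and numbers are hypothetical, not from the article) of the per‑feature FeatureColumn pipeline. On a GPU, each of the three steps below becomes its own tiny CUDA kernel, launched once per sparse feature.

```python
# Hypothetical sketch of the per-feature pipeline: hash -> lookup -> pool.
# Each function stands in for one small CUDA kernel launch.

def hash_bucket(raw_id: str, num_buckets: int) -> int:
    # "Kernel" 1: hash the raw id into a fixed bucket range.
    return hash(raw_id) % num_buckets

def embedding_lookup(table, bucket: int):
    # "Kernel" 2: gather the embedding row for the bucket.
    return table[bucket]

def mean_pool(vectors):
    # "Kernel" 3: pool variable-length lookups into one vector.
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

NUM_BUCKETS = 8
TABLE = [[float(b)] * 4 for b in range(NUM_BUCKETS)]  # 8 buckets, dim 4

def transform_feature(raw_ids):
    # Three launches per feature; with ~100 sparse features that is
    # ~300 kernel launches per request, dominated by launch overhead.
    buckets = [hash_bucket(r, NUM_BUCKETS) for r in raw_ids]
    rows = [embedding_lookup(TABLE, b) for b in buckets]
    return mean_pool(rows)

vec = transform_feature(["video_42", "video_7"])
print(len(vec))  # embedding dimension, 4
```

Each individual step does almost no work, which is exactly why the fixed cost of launching it dominates.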

During training this overhead is hidden by large batch sizes, but online inference must process a single request quickly, so the launch cost becomes unacceptable.
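A back‑of‑envelope calculation shows why batch size changes the picture. The numbers below are assumptions for illustration, not measurements from the article: a fixed per‑kernel launch overhead is paid once regardless of batch size, while compute scales with it.

```python
# Assumed numbers, for illustration only: per-kernel launch overhead,
# per-kernel compute time at batch size 1, and kernel count per request.
LAUNCH_US = 5.0      # assumed launch overhead per kernel, microseconds
COMPUTE_US = 1.0     # assumed compute time per kernel at batch size 1
NUM_KERNELS = 300    # e.g. ~100 sparse features x 3 ops each

def per_sample_cost(batch_size):
    # Launch overhead is paid once per kernel; compute scales with batch.
    total = NUM_KERNELS * (LAUNCH_US + COMPUTE_US * batch_size)
    return total / batch_size

online = per_sample_cost(1)      # 1800.0 us per sample, mostly launches
training = per_sample_cost(512)  # ~302.9 us per sample, mostly compute
print(online, training)
```

With these assumed figures, launch overhead is over 80% of the cost at batch size 1 but under 1% at batch size 512, which is why the problem only surfaces at serving time.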

Optimization Strategies

1. Operator Fusion

Automatic fusion – we evaluated TVM, TensorRT, and XLA. XLA, enabled via tf.ConfigProto(), fused a few consecutive dense ops (e.g., MatMul → Add → ReLU), but the sparse‑feature ops saw little benefit.

Manual fusion – created a custom operator BatchIdentityEmbeddingLookup that processes multiple identical FeatureColumns in one kernel. Wrapped it in a new FusedFeatureLayer that is used only during inference. The layer also sorts features, generates an index array for variable‑length inputs, and preserves the original output shape.
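The following plain‑Python sketch shows the idea behind such a fused op (the function name and data layout here are illustrative, not the actual custom‑op implementation): ids from all same‑shaped features are flattened into one buffer, and an index array of offsets lets a single pass gather and pool every feature at once.

```python
# Hypothetical sketch of a fused lookup: one pass replaces the per-feature
# kernel chains. `offsets` is the index array marking where each
# variable-length feature's ids begin and end in the flattened buffer.

TABLE = [[float(b)] * 4 for b in range(16)]  # shared toy table, dim 4

def fused_identity_embedding_lookup(all_ids, offsets):
    # One "kernel": gather + mean-pool every feature in a single loop.
    pooled = []
    for f in range(len(offsets) - 1):
        ids = all_ids[offsets[f]:offsets[f + 1]]
        rows = [TABLE[i] for i in ids]
        dim = len(rows[0])
        pooled.append([sum(r[j] for r in rows) / len(rows)
                       for j in range(dim)])
    return pooled  # one vector per feature; original output shape preserved

# Three variable-length features flattened into one id buffer:
ids = [3, 5, 7, 1, 2, 9]
offsets = [0, 2, 3, 6]   # feature 0: ids[0:2], feature 1: ids[2:3], ...
out = fused_identity_embedding_lookup(ids, offsets)
print(out[0])  # mean of table rows 3 and 5 -> [4.0, 4.0, 4.0, 4.0]
```

The key property is that the number of launches no longer grows with the number of features, only the buffer and offset array do.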

2. Multi‑Stream Execution

TensorFlow normally uses a single CUDA stream group, so kernels execute serially. By adopting NVIDIA's multi‑stream branch of TensorFlow and enabling NVIDIA MPS (Multi‑Process Service), multiple stream groups run concurrently, hiding part of the kernel launch latency.

3. Small Data‑Copy Optimization

Instead of copying each feature individually from host to device, features are concatenated on the host and transferred in a single cudaMemcpy call, which greatly reduces the number of small copies.
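The consolidation can be sketched in plain Python (the `memcpy` stand‑in and feature names below are illustrative): concatenate everything into one host buffer, record per‑feature offsets, issue one transfer, and slice features back out by offset on the device side.

```python
# Illustrative sketch of copy consolidation: N small host-to-device
# copies become one large copy plus an offsets table.

def memcpy(dst, src):
    # Stand-in for a single cudaMemcpy call.
    dst.extend(src)

features = {"device_id": [17], "recent_videos": [3, 5, 9], "region": [2]}

# Before: one transfer per feature -> len(features) memcpy calls.
# After: concatenate on the host, transfer once, keep the offsets.
host_buf, offsets, pos = [], {}, 0
for name, vals in features.items():
    offsets[name] = (pos, pos + len(vals))
    host_buf.extend(vals)
    pos += len(vals)

device_buf = []
memcpy(device_buf, host_buf)  # single transfer instead of three

start, end = offsets["recent_videos"]
print(device_buf[start:end])  # [3, 5, 9]
```

Each small `cudaMemcpy` carries its own fixed driver cost, so collapsing dozens of them into one transfer removes that cost wholesale.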

4. Batch Merging

Enabled TensorFlow‑Serving’s enable_batching option and tuned the batch configuration:

max_batch_size : set slightly larger than typical request bursts.

batch_timeout_micros : kept below 5 ms to meet latency requirements.

num_batch_threads : set to 1‑4 after enabling MPS.
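Putting those knobs together, a batching configuration along these lines (the values are illustrative, not iQIYI's production settings) is passed to TensorFlow Serving via `--enable_batching=true --batching_parameters_file=...`:

```
max_batch_size { value: 64 }
batch_timeout_micros { value: 2000 }
num_batch_threads { value: 2 }
max_enqueued_batches { value: 100 }
```

The timeout caps how long a request waits for batch‑mates, so it trades a small, bounded latency increase for much better GPU occupancy.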

Adjusted padding logic for variable‑length sparse features: instead of padding with zeros (which changes semantics), padding now uses -1 to represent missing values, preserving the meaning of the original data.
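A minimal sketch of that padding change (helper names here are hypothetical): rows in a merged batch are padded to a common width with -1, and the lookup filters the padding out rather than treating it as a real id 0.

```python
# Sketch of padding variable-length sparse features with -1 instead of 0,
# so padding cannot be confused with a legitimate id.

PAD = -1

def pad_batch(batch):
    # Pad every row to the width of the longest row in the merged batch.
    width = max(len(row) for row in batch)
    return [row + [PAD] * (width - len(row)) for row in batch]

def pooled_lookup(table, padded_row):
    ids = [i for i in padded_row if i != PAD]  # padding carries no meaning
    rows = [table[i] for i in ids]
    dim = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

TABLE = [[float(b)] * 2 for b in range(8)]
batch = pad_batch([[1, 3], [5], [2, 4, 6]])
print(batch[1])                        # [5, -1, -1]
print(pooled_lookup(TABLE, batch[1]))  # [5.0, 5.0] -- padding ignored
```

Padding with zeros would instead pull row 0's embedding into the pooled average, silently changing the prediction for every short row.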

Final Results

Throughput increased by more than 6× compared with the native TensorFlow GPU container.

Inference latency became comparable to CPU latency, satisfying business requirements.

For the same QPS, operational cost dropped by over 40%.

The optimizations have been deployed in iQIYI’s personalized push and waterfall‑flow recommendation services.

Tags: CTR · deep learning · inference optimization · TensorFlow · GPU · kernel fusion