
Optimizing TensorFlow Serving Model Hot‑Update to Eliminate Latency Spikes in CTR Recommendation Systems

By adding model warm‑up files, separating load/unload threads, switching to the Jemalloc allocator, and isolating TensorFlow’s parameter memory from RPC request buffers, iQIYI’s engineers reduced TensorFlow Serving hot‑update latency spikes in high‑throughput CTR recommendation services from over 120 ms to about 2 ms, eliminating jitter.

iQIYI Technical Product Team

In CTR recommendation scenarios, hot-updating large models with TensorFlow Serving caused brief latency spikes ("jitter"), leading to client timeouts and degraded algorithm performance.

The iQIYI deep‑learning platform investigated the root causes and applied a series of optimizations to the TF Serving and TensorFlow source code.

Background

TensorFlow Serving is an open‑source high‑performance inference system that supports gRPC/HTTP, multi‑model, versioning, and hot‑update. CTR services require uninterrupted service, so hot‑update is essential.
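For context, multi-model and versioned serving is typically declared in a model config file passed to the server; the model name, path, and version policy below are illustrative, not the production values:

```
model_config_list {
  config {
    name: "ctr_model"
    base_path: "/models/ctr_model"
    model_platform: "tensorflow"
    # Keep two versions loaded so a new version can warm up
    # while the old one keeps serving, then swap atomically.
    model_version_policy { latest { num_versions: 2 } }
  }
}
```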

Observed spike

During model update, the 99.9th percentile latency (p999) jumped from <30 ms to >120 ms for about 10 seconds, causing request failures.

Initial optimizations

1. Enable model warm-up by adding a tf_serving_warmup_requests file to the assets.extra directory of the model version.

2. Configure separate load/unload threads (num_load_threads, num_unload_threads) to isolate model loading from inference.

Reference: TensorFlow SavedModel Warmup documentation.
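Concretely, the warm-up file lives under the version's assets.extra directory, and the load/unload thread split is exposed as model-server flags; the paths and flag values below are illustrative:

```
/models/ctr_model/1/
├── saved_model.pb
├── variables/
└── assets.extra/
    └── tf_serving_warmup_requests   # TFRecord of PredictionLog entries

tensorflow_model_server \
  --model_base_path=/models/ctr_model \
  --num_load_threads=1 \
  --num_unload_threads=1
```

With the warm-up file present, the server replays the recorded requests against a newly loaded version before routing live traffic to it, so one-time initialization cost is paid off the serving path.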

Further analysis

The team examined two possible causes:

1. Computational overhead from model initialization and warm-up.

2. Memory allocation/deallocation contention during model load/unload.

Experiments showed that moving session‑run loading to an independent thread pool did not alleviate the spike.

Memory‑related optimization

By default, CPU memory allocation goes through glibc's ptmalloc2. Alternative allocators, Google's tcmalloc and Facebook's jemalloc, were evaluated.

Benchmarks indicated that Jemalloc reduced the spike to <50 ms while also improving normal latency.
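Swapping the allocator needs no code change: jemalloc can be preloaded when the server starts. The library path below is a common Debian/Ubuntu location and varies by distribution:

```
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  tensorflow_model_server --model_base_path=/models/ctr_model
```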

Deep memory‑isolation solution

The root cause was identified as contention between model-parameter memory allocation (during the Restore op) and RPC request memory allocation. The solution separates these two memory spaces:

1. Model parameters are allocated using TensorFlow's BFC allocator.

2. RPC request buffers continue to use ptmalloc2.

Code changes were made to ProcessState, ThreadPoolDevice, and Allocator in the TensorFlow source.
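The isolation idea can be sketched independently of the TensorFlow internals: reserve one large block up front for long-lived parameter tensors and bump-allocate inside it, so Restore-time allocations never touch the general-purpose heap that serves short-lived RPC buffers. The ParamArena class below is a hypothetical illustration of that idea, not TensorFlow code:

```python
class ParamArena:
    """Bump allocator over one pre-reserved block (BFC-style idea).

    Long-lived model parameters are placed here, so loading a new
    model version never contends with the process heap that backs
    short-lived RPC request buffers.
    """

    def __init__(self, size: int, alignment: int = 64):
        self.size = size          # total bytes reserved up front
        self.alignment = alignment
        self.used = 0             # high-water mark of the bump pointer

    def allocate(self, nbytes: int) -> int:
        # Round the request up to the arena's alignment.
        aligned = -(-nbytes // self.alignment) * self.alignment
        if self.used + aligned > self.size:
            raise MemoryError("parameter arena exhausted")
        offset = self.used        # offset into the reserved block
        self.used += aligned
        return offset


arena = ParamArena(size=1 << 20)   # 1 MiB reserved for parameters
w_off = arena.allocate(4096)       # weight tensor goes into the arena
b_off = arena.allocate(100)        # bias vector, rounded up to 128
print(w_off, b_off, arena.used)    # 0 4096 4224
```

In the actual fix, this role is played by TensorFlow's BFC allocator wired in through ProcessState and ThreadPoolDevice, while gRPC request buffers stay on ptmalloc2.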

The session_run_load_threadpool_index option used in the thread-pool experiment above is defined in TF Serving's SessionBundleConfig proto:

```
// If set, session run calls use a separate threadpool for restore and init
// ops as part of loading the session-bundle. The value of this field should
// correspond to the index of the tensorflow::ThreadPoolOptionProto defined as
// part of `session_config.session_inter_op_thread_pool`.
google.protobuf.Int32Value session_run_load_threadpool_index = 4;
```

Results

After applying warm‑up, Jemalloc, and memory‑isolation, latency spikes were reduced to ~2 ms (≈5 ms under peak load), effectively solving the hot‑update jitter problem.

Conclusion

The combined optimizations—model warm‑up, Jemalloc memory allocator, and separate memory pools for model parameters—eliminate the latency spikes caused by TensorFlow Serving model hot‑updates in high‑throughput CTR recommendation services.

Tags: latency optimization, AI inference, jemalloc, TensorFlow Serving, memory allocation, model hot update, warmup