
Performance Optimization of Advertising Coarse‑Ranking Training on the Light Framework

This article analyzes the bottlenecks of advertising coarse‑ranking training on the Light framework and presents a series of optimizations—including parallel data download, thread‑queue buffering, integer‑to‑string conversion with fmt, and zlib replacement with czlib—that together achieve up to 58% QPS improvement and notable CPU efficiency gains.

Tencent Architect

Advertising coarse‑ranking is a low‑latency, small‑model scenario. Using the Light training‑acceleration framework, the authors built a synchronous distributed data‑parallel training mode on GPUs, eliminating parameter servers and performing gradient reduction with LightCC.
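The synchronous step can be sketched in plain Python. LightCC itself is Tencent-internal, so the function below (`allreduce_mean`, a hypothetical name) only simulates the math of an allreduce-style gradient average across workers, not the actual GPU collective:

```python
# Minimal sketch of synchronous data-parallel gradient reduction.
# LightCC performs a real collective allreduce across GPUs; this stand-in
# reproduces the same arithmetic across in-process "workers".

def allreduce_mean(per_worker_grads):
    """Average each gradient slot across all workers (one synchronous step)."""
    num_workers = len(per_worker_grads)
    num_params = len(per_worker_grads[0])
    return [
        sum(grads[p] for grads in per_worker_grads) / num_workers
        for p in range(num_params)
    ]

# Each worker computed gradients on its own data shard; after the reduce,
# every worker applies the identical averaged update (no parameter server).
averaged = allreduce_mean([
    [1.0, 3.0],  # worker 0
    [3.0, 5.0],  # worker 1
])  # -> [2.0, 4.0]
```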

The pipeline’s main bottlenecks were identified as data download from HDFS and subsequent parsing, which in the baseline implementation involved a separate download process, temporary disk buffers, and sequential reading by the training process.

Data‑download optimization: The baseline was replaced with a tf.data.parallel_interleave‑based pipeline in which multiple worker threads read TFRecord chunks directly into memory buffers, with configurable cycle_length and buffer_size. This removed the disk‑write step, introduced a network buffer to absorb HDFS load fluctuations, and reduced download‑related latency.
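The interleaving pattern can be illustrated with a stdlib-only sketch. The real tf.data op overlaps I/O with worker threads; this stand-in (`parallel_interleave_sketch`, a hypothetical name) shows only the ordering semantics of cycle_length concurrently open readers:

```python
# Sketch of parallel_interleave ordering: round-robin records from up to
# `cycle_length` open readers; when one is exhausted, open the next file.
# The real op additionally runs the readers on parallel threads.

def parallel_interleave_sketch(file_readers, cycle_length):
    active = [iter(r) for r in file_readers[:cycle_length]]
    pending = list(file_readers[cycle_length:])
    out = []
    while active:
        next_round = []
        for it in active:
            try:
                out.append(next(it))
                next_round.append(it)
            except StopIteration:
                if pending:  # replace the exhausted reader with a new file
                    next_round.append(iter(pending.pop(0)))
        active = next_round
    return out

# Three "files", at most two open at once:
records = parallel_interleave_sketch([["a1", "a2"], ["b1"], ["c1"]], cycle_length=2)
# -> ['a1', 'b1', 'a2', 'c1']
```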

Thread‑queue buffer: Observing periodic QPS drops caused by workers stalling after filling their download buffers, the authors added a prefetch thread per worker. The prefetch thread and worker thread alternate in ping‑pong fashion, keeping both the download and buffer‑filling stages active. The implementation uses a wrapper function prefetch_inputbuffer_fn that calls .prefetch(num_examples_twig_prefetch) when enabled.
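A minimal sketch of this producer/consumer overlap, using only the stdlib (prefetch_inputbuffer_fn and num_examples_twig_prefetch are Light-internal names from the article; `run_with_prefetch` below is a generic stand-in):

```python
import queue
import threading

# A background thread keeps a bounded queue (the "input buffer") full while
# the worker consumes from it, so downloading and buffer-filling overlap
# instead of running in strict turns.

def run_with_prefetch(source, buffer_size, consume):
    buf = queue.Queue(maxsize=buffer_size)
    SENTINEL = object()

    def prefetch():
        for item in source:   # e.g. records streamed down from HDFS
            buf.put(item)     # blocks only when the buffer is already full
        buf.put(SENTINEL)

    threading.Thread(target=prefetch, daemon=True).start()
    results = []
    while True:
        item = buf.get()      # the worker drains while the prefetcher refills
        if item is SENTINEL:
            return results
        results.append(consume(item))

# The worker processes items while the next ones are already being buffered.
squares = run_with_prefetch(range(5), buffer_size=2, consume=lambda x: x * x)
# -> [0, 1, 4, 9, 16]
```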

Integer‑to‑string conversion: The original TensorFlow AsString op relied on vsnprintf, which was CPU‑intensive. Replacing it with the high‑performance fmt library made the operator roughly 3× faster (a ~200% speedup), yielding about a 10% overall QPS gain.
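The idea, sketched by analogy in Python: vsnprintf must re-parse its format string on every call, whereas fmt's integer fast path converts digits directly. The two hypothetical helpers below produce identical output; only the mechanism differs:

```python
# Analogy sketch (not the TensorFlow code): the printf-style path parses a
# format string per value, the direct path converts digits straight away.
# fmt replaces the former with the latter inside the AsString op.

def as_string_printf_style(values):
    return ["%d" % v for v in values]   # printf-style, like vsnprintf

def as_string_direct(values):
    return [str(v) for v in values]     # direct conversion, like fmt's fast path
```

Both yield the same strings; in the C++ op, skipping the per-call format-string parsing is where the speedup comes from.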

Decompression optimization: Switching TensorFlow’s zlib dependency to czlib cut inflate time by ~40%, though the impact on total QPS was modest because download time dominates.
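A small sketch of the path being optimized, using Python's stdlib zlib (which wraps the same C zlib API that czlib accelerates as a drop-in; the payload here is invented for illustration):

```python
import zlib

# TFRecord shards are commonly DEFLATE/GZIP-compressed, so inflate sits on
# the hot path of every read. czlib speeds up this same decompression step.

def inflate(compressed: bytes) -> bytes:
    return zlib.decompress(compressed)

payload = b"record-bytes " * 1000          # stand-in for serialized examples
compressed = zlib.compress(payload, level=6)
restored = inflate(compressed)             # the loop the czlib swap speeds up
```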

CPU resource scaling: Deploying machines with more CPU cores alleviated thread contention, further improving per‑GPU QPS and balancing CPU/GPU utilization.

Combined, these optimizations delivered up to 58% QPS improvement in some workloads, with typical gains ranging from 0% to 23% depending on model complexity, network conditions, and HDFS load.

Tags: performance optimization · advertising · TensorFlow · distributed training · data parallelism · CPU/GPU efficiency · training pipeline
Written by Tencent Architect

We share insights on storage, computing, networking and explore leading industry technologies together.