
Performance Optimization of Advertising Coarse‑Ranking Training on the Light Framework

This article analyzes the bottlenecks of advertising coarse‑ranking training on the Light framework and presents a series of optimizations—including parallel data download, thread‑queue buffering, integer‑to‑string conversion with fmt, and zlib replacement with czlib—that together achieve up to 58% QPS improvement and notable CPU efficiency gains.

Tencent Architect

Advertising coarse‑ranking is a low‑latency, small‑model scenario. Using the Light training‑acceleration framework, the authors built a synchronous distributed data‑parallel training mode on GPUs, eliminating parameter servers and performing gradient reduction with LightCC.
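The synchronous step can be sketched in plain Python. LightCC itself is Tencent-internal, so the function below (`allreduce_mean`, a hypothetical name) only simulates the math of an allreduce-style gradient average across workers, not the actual GPU collective:

```python
# Minimal sketch of synchronous data-parallel gradient reduction.
# LightCC performs a real collective allreduce across GPUs; this stand-in
# reproduces the same arithmetic across in-process "workers".

def allreduce_mean(per_worker_grads):
    """Average each gradient slot across all workers (one synchronous step)."""
    num_workers = len(per_worker_grads)
    num_params = len(per_worker_grads[0])
    return [
        sum(grads[p] for grads in per_worker_grads) / num_workers
        for p in range(num_params)
    ]

# Each worker computed gradients on its own data shard; after the reduce,
# every worker applies the identical averaged update (no parameter server).
averaged = allreduce_mean([
    [1.0, 3.0],  # worker 0
    [3.0, 5.0],  # worker 1
])  # -> [2.0, 4.0]
```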

The pipeline’s main bottlenecks were identified as data download from HDFS and subsequent parsing, which in the baseline implementation involved a separate download process, temporary disk buffers, and sequential reading by the training process.

Data‑download optimization: The baseline was replaced with a tf.data.parallel_interleave‑based pipeline in which multiple worker threads read TFRecord chunks directly into memory buffers, with configurable cycle_length and buffer_size. This removed the disk‑write step, introduced a network buffer to absorb HDFS load fluctuations, and reduced download‑related latency.
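The interleaving pattern can be illustrated with a stdlib-only sketch. The real tf.data op overlaps I/O with worker threads; this stand-in (`parallel_interleave_sketch`, a hypothetical name) shows only the ordering semantics of cycle_length concurrently open readers:

```python
# Sketch of parallel_interleave ordering: round-robin records from up to
# `cycle_length` open readers; when one is exhausted, open the next file.
# The real op additionally runs the readers on parallel threads.

def parallel_interleave_sketch(file_readers, cycle_length):
    active = [iter(r) for r in file_readers[:cycle_length]]
    pending = list(file_readers[cycle_length:])
    out = []
    while active:
        next_round = []
        for it in active:
            try:
                out.append(next(it))
                next_round.append(it)
            except StopIteration:
                if pending:  # replace the exhausted reader with a new file
                    next_round.append(iter(pending.pop(0)))
        active = next_round
    return out

# Three "files", at most two open at once:
records = parallel_interleave_sketch([["a1", "a2"], ["b1"], ["c1"]], cycle_length=2)
# -> ['a1', 'b1', 'a2', 'c1']
```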

Thread‑queue buffer: Observing periodic QPS drops caused by workers stalling after filling their download buffers, the authors added a prefetch thread per worker. The prefetch thread and worker thread alternate in ping‑pong fashion, keeping both the download and buffer‑filling stages active. The implementation uses a wrapper function prefetch_inputbuffer_fn that calls .prefetch(num_examples_twig_prefetch) when enabled.
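A minimal sketch of this producer/consumer overlap, using only the stdlib (prefetch_inputbuffer_fn and num_examples_twig_prefetch are Light-internal names from the article; `run_with_prefetch` below is a generic stand-in):

```python
import queue
import threading

# A background thread keeps a bounded queue (the "input buffer") full while
# the worker consumes from it, so downloading and buffer-filling overlap
# instead of running in strict turns.

def run_with_prefetch(source, buffer_size, consume):
    buf = queue.Queue(maxsize=buffer_size)
    SENTINEL = object()

    def prefetch():
        for item in source:   # e.g. records streamed down from HDFS
            buf.put(item)     # blocks only when the buffer is already full
        buf.put(SENTINEL)

    threading.Thread(target=prefetch, daemon=True).start()
    results = []
    while True:
        item = buf.get()      # the worker drains while the prefetcher refills
        if item is SENTINEL:
            return results
        results.append(consume(item))

# The worker processes items while the next ones are already being buffered.
squares = run_with_prefetch(range(5), buffer_size=2, consume=lambda x: x * x)
# -> [0, 1, 4, 9, 16]
```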

Integer‑to‑string conversion: The original TensorFlow AsString op relied on vsnprintf, which was CPU‑intensive. Replacing it with the high‑performance fmt library made the operator roughly 3× faster (a ~200% speedup), yielding about a 10% overall QPS gain.
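The idea, sketched by analogy in Python: vsnprintf must re-parse its format string on every call, whereas fmt's integer fast path converts digits directly. The two hypothetical helpers below produce identical output; only the mechanism differs:

```python
# Analogy sketch (not the TensorFlow code): the printf-style path parses a
# format string per value, the direct path converts digits straight away.
# fmt replaces the former with the latter inside the AsString op.

def as_string_printf_style(values):
    return ["%d" % v for v in values]   # printf-style, like vsnprintf

def as_string_direct(values):
    return [str(v) for v in values]     # direct conversion, like fmt's fast path
```

Both yield the same strings; in the C++ op, skipping the per-call format-string parsing is where the speedup comes from.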

Decompression optimization: Switching TensorFlow’s zlib dependency to czlib cut inflate time by ~40%, though the impact on total QPS was modest because download time dominates.
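A small sketch of the path being optimized, using Python's stdlib zlib (which wraps the same C zlib API that czlib accelerates as a drop-in; the payload here is invented for illustration):

```python
import zlib

# TFRecord shards are commonly DEFLATE/GZIP-compressed, so inflate sits on
# the hot path of every read. czlib speeds up this same decompression step.

def inflate(compressed: bytes) -> bytes:
    return zlib.decompress(compressed)

payload = b"record-bytes " * 1000          # stand-in for serialized examples
compressed = zlib.compress(payload, level=6)
restored = inflate(compressed)             # the loop the czlib swap speeds up
```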

CPU resource scaling: Deploying machines with more CPU cores alleviated thread contention, further improving per‑GPU QPS and balancing CPU/GPU utilization.

Combined, these optimizations delivered up to 58% QPS improvement in some workloads, with typical gains ranging from 0% to 23% depending on model complexity, network conditions, and HDFS load.

Tags: performance optimization · advertising · TensorFlow · distributed training · data parallelism · CPU/GPU efficiency · training pipeline
Written by Tencent Architect

We share insights on storage, computing, networking and explore leading industry technologies together.