
How We Accelerated Feature Hashing for Ad Ranking on GPUs

This article explains how Tencent's Light platform reduced the massive overhead of feature hashing in ad‑ranking by moving integer‑to‑string conversion and hash computation to the GPU, introducing custom contiguous string tensors, and achieving up to 12× speed‑up on V100 GPUs.

Tencent Architect

Background

In ad ranking (coarse ranking), the model is shallow but consumes a huge number of features, so feature parsing and embedding dominate latency. The baseline implementation assembled data on the CPU and then copied it to the GPU for hashing, with the CPU step accounting for over 90% of the total hash time.

Integer Feature Hash Optimization

Because integer features are first converted to strings (AsString) before hashing, the string Tensor created on the CPU is highly fragmented, causing costly host‑to‑device copies. To avoid this, we moved the entire int‑to‑string and hash pipeline to the GPU.

Scheme 1 – Parallel CPU Compute & Host‑to‑Device Copy

Copy the int64 tensor to GPU memory while the CPU simultaneously computes each element’s size and offset. The GPU kernel then converts integers to strings in a contiguous buffer and performs hashing using the pre‑computed offsets.
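The CPU side of this scheme boils down to a size-and-offset pass. A minimal sketch in plain C++ (in the real pipeline, the host-to-device copy would be an asynchronous `cudaMemcpyAsync` overlapped with this loop; the function and struct names here are illustrative, not from the article):

```cpp
#include <cstdint>
#include <vector>

// Decimal width of an int64, including a '-' sign for negatives.
static int DecimalWidth(int64_t v) {
  int w = (v < 0) ? 1 : 0;  // room for the sign
  uint64_t u = (v < 0) ? 0 - (uint64_t)v : (uint64_t)v;
  do { ++w; u /= 10; } while (u != 0);
  return w;
}

// While the int64 tensor is in flight to the GPU, the CPU computes each
// element's string size and an exclusive prefix sum of offsets, so the GPU
// kernel knows where every string lands in one contiguous buffer.
struct SizesAndOffsets {
  std::vector<int32_t> sizes;
  std::vector<int32_t> offsets;  // offsets[i] = start byte of element i
  int32_t total = 0;             // total bytes in the contiguous buffer
};

SizesAndOffsets ComputeOffsets(const std::vector<int64_t>& vals) {
  SizesAndOffsets out;
  out.sizes.reserve(vals.size());
  out.offsets.reserve(vals.size());
  for (int64_t v : vals) {
    out.offsets.push_back(out.total);
    int w = DecimalWidth(v);
    out.sizes.push_back(w);
    out.total += w;
  }
  return out;
}
```

For the input `{0, -5, 123}` this yields sizes `{1, 2, 3}` and offsets `{0, 1, 3}`, so the GPU kernel can write `"0-5123"` into a 6-byte buffer with no per-string allocations.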

Scheme 2 – Full GPU Computation

All size calculations and integer‑to‑string (itoa) conversions are performed on the GPU, eliminating CPU bottlenecks. Two methods were explored:

Remainder‑based conversion that writes characters into a thread‑local buffer.

Pre‑computing decimal length using logarithmic bounds to avoid per‑character loops.
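The second method's length precomputation could look like the following sketch: a bound check against a small table of powers of ten replaces the per-digit divide loop (the function name is illustrative).

```cpp
#include <cstdint>

// Decimal digit count from magnitude bounds instead of a divide loop.
// A thread compares its value against powers of ten until it finds the
// first bound the value falls under.
int DecimalDigits(uint64_t u) {
  static const uint64_t kPow10[] = {
      10ULL, 100ULL, 1000ULL, 10000ULL, 100000ULL, 1000000ULL,
      10000000ULL, 100000000ULL, 1000000000ULL, 10000000000ULL,
      100000000000ULL, 1000000000000ULL, 10000000000000ULL,
      100000000000000ULL, 1000000000000000ULL, 10000000000000000ULL,
      100000000000000000ULL, 1000000000000000000ULL,
      10000000000000000000ULL};
  int d = 1;
  for (uint64_t p : kPow10) {
    if (u < p) break;
    ++d;
  }
  return d;
}
```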

After benchmarking, the remainder‑based method (Scheme 2.1) was chosen for production.

Custom String Tensor

We introduced a custom <code>ConsecString</code> type stored in a <code>DynamicBuffer</code> and wrapped in TensorFlow's <code>Variant</code>. This yields a contiguous string tensor that can be copied to GPU memory in one shot, removing the fragmented copies required by the original string Tensor.

<code>struct ConsecString {
  DynamicBuffer* buf;
  explicit ConsecString();
  explicit ConsecString(DynamicBuffer* buf);
  void Encode(VariantTensorData* data) const;
  bool Decode(const VariantTensorData& data);
  string TypeName() const { return "ConsecString"; }
  string DebugString() const { return "DebugString: ConsecString"; }
};

struct DynamicBuffer {
  std::vector<string> buf_list;
  std::vector<std::vector<int32>> sizes_list;
  void merge();
  void reset();
};
</code>
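The article does not show the body of <code>merge()</code>. A plausible sketch, assuming it concatenates the per-shard string buffers so that a single memcpy can move the whole tensor's bytes to the GPU:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical merge(): collapse buf_list into one contiguous buffer.
// After this, buf_list[0].data() is a single host allocation that can be
// shipped to the device in one host-to-device copy.
struct DynamicBuffer {
  std::vector<std::string> buf_list;
  std::vector<std::vector<int32_t>> sizes_list;
  void merge() {
    if (buf_list.size() <= 1) return;
    size_t total = 0;
    for (const auto& b : buf_list) total += b.size();
    std::string merged;
    merged.reserve(total);  // one allocation instead of many
    for (const auto& b : buf_list) merged += b;
    buf_list.assign(1, std::move(merged));
  }
};
```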

Performance Results

Running the optimized pipeline on a V100 GPU showed dramatic reductions in end-to-end latency. For example, processing 1,000,000 features dropped from 189 s (baseline) to 15.14 s (optimized), a roughly 12.5× speed-up. Similar gains were observed across various batch sizes and feature counts.

Additional micro-benchmarks on a single feature's hash operation demonstrated speed-ups ranging from 103% to over 1,200% (roughly 2× to 12×), depending on the batch size.

Key Takeaways

Moving integer‑to‑string conversion and hash computation to the GPU, combined with a custom contiguous string tensor, eliminates fragmented memory copies and fully utilizes GPU parallelism, yielding order‑of‑magnitude performance improvements for large‑scale ad‑ranking workloads.

Tags: machine learning, performance tuning, TensorFlow, ad ranking, GPU optimization, feature hashing