Applying Large Language Models to Search Advertising: End‑to‑End Generative Recall and System Optimizations
This report details how large language models (LLMs) were integrated into Tencent's search advertising pipeline—from early extraction‑distillation experiments in 2023 to a 2024 end‑to‑end generative recall architecture—showing significant improvements in relevance, diversity, and revenue through knowledge injection, supervised fine‑tuning, constrained beam‑search decoding, and high‑performance inference services.
Abstract
Large language models (LLMs) possess extensive world knowledge and, when fine‑tuned with supervised instructions and human‑feedback reinforcement learning, achieve strong zero‑shot interaction capabilities that can be leveraged to enhance search advertising without altering existing user‑experience flows.
Initial Exploration (2023)
Early experiments applied LLMs via extraction‑distillation to the existing recall and relevance modules; A/B tests demonstrated notable performance gains while preserving the existing advertising chain.
System Redesign (2024)
Building on the 2023 experience, the recall system was re‑architected to replace multiple cascade modules with an end‑to‑end online generative recall model, directly generating ad titles from user queries and achieving substantial revenue uplift.
LLM‑Enhanced Recall and Relevance
A new LLM‑based recall path trains a domain‑specific LLM on search data, converts multi‑intent ad information into embeddings, and performs vector retrieval. For relevance, LLM‑derived query and ad intents augment a four‑layer BERT model, enriching feature sets and improving discrimination.
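The embedding‑then‑retrieval step above can be sketched minimally. This is not the production system: the toy 3‑dimensional vectors stand in for LLM‑derived query and ad embeddings, and the brute‑force cosine scan stands in for whatever approximate nearest‑neighbor index is actually used.

```python
import numpy as np

def top_k_ads(query_emb, ad_embs, k=2):
    """Return indices of the k ads most similar to the query,
    by cosine similarity (dot product of L2-normalized vectors)."""
    q = query_emb / np.linalg.norm(query_emb)
    a = ad_embs / np.linalg.norm(ad_embs, axis=1, keepdims=True)
    scores = a @ q
    return np.argsort(-scores)[:k]

# Toy 3-dim vectors standing in for LLM-derived embeddings.
ads = np.array([[1.0, 0.0, 0.0],
                [0.9, 0.1, 0.0],
                [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])
idx = top_k_ads(query, ads)  # indices of the most similar ads, best first
```

In production the ad embeddings would be computed offline and served from a vector index; only the query embedding is computed at request time.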
End‑to‑End Generative Recall Architecture
The redesigned pipeline moves title selection and relevance filtering upstream into the recall stage, using offline KV‑cached large models (13B) for head queries and real‑time small models (1B) for tail queries, enabling direct generation of ad titles.
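The head/tail split described above amounts to a cache‑or‑generate routing decision. The sketch below is hypothetical: `HEAD_CACHE` stands in for the offline store of large‑model outputs, and `small_model_generate` is a stub for the real‑time 1B‑model service; neither name comes from the source.

```python
# Hypothetical head/tail routing: head queries are served from an
# offline store of 13B-model outputs; unseen (tail) queries fall
# back to a real-time small-model call, stubbed out here.
HEAD_CACHE = {
    "running shoes": ["Lightweight Running Shoes - Free Shipping"],
}

def small_model_generate(query):
    """Stand-in for the real-time 1B-model generation service."""
    return [f"Ads related to: {query}"]

def recall_titles(query):
    cached = HEAD_CACHE.get(query)
    return cached if cached is not None else small_model_generate(query)
```

The design choice is the usual latency/quality trade‑off: head queries are frequent enough to precompute with the expensive model, while the long tail must be handled within online latency budgets.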
Knowledge Injection & Supervised Fine‑Tuning
High‑value commercial data and encyclopedic sources are injected into the base LLM to improve domain understanding, followed by DPO‑based fine‑tuning on curated query‑title pairs, allowing few‑shot adaptation and better generation quality.
Constrained Beam‑Search Decoding
A CUDA‑accelerated beam‑search with output constraints ensures generated titles belong to a predefined ad set; additional optimizations such as threshold truncation and diversity beam search improve accuracy and variety.
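A common way to enforce that generated titles belong to a predefined set is to walk a prefix trie over the ad titles and only expand tokens the trie allows at each step. The CPU sketch below assumes that approach and a generic `score_fn` giving per‑token log‑probabilities; the actual system runs this on CUDA with further optimizations the source only names.

```python
import heapq

def build_trie(sequences):
    """Prefix trie over token sequences; '<end>' marks a complete title."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node["<end>"] = {}
    return root

def constrained_beam_search(score_fn, trie, beam_width=2, max_len=8):
    """Beam search where every hypothesis must follow a trie path,
    so only titles from the predefined ad set can be produced."""
    beams = [(0.0, [], trie)]  # (cumulative log-prob, tokens, trie node)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, toks, node in beams:
            for tok, child in node.items():
                if tok == "<end>":
                    finished.append((logp, toks))
                else:
                    candidates.append((logp + score_fn(toks, tok), toks + [tok], child))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    finished.sort(key=lambda f: -f[0])
    return [toks for _, toks in finished]

# Toy ad set and a fixed per-token score table standing in for model log-probs.
titles = [["buy", "shoes"], ["buy", "hats"], ["red", "shoes"]]
scores = {"buy": -0.1, "red": -1.0, "shoes": -0.2, "hats": -0.5}
results = constrained_beam_search(lambda prefix, tok: scores[tok], build_trie(titles))
```

Because expansion is limited to trie children, an out‑of‑set title can never be emitted, which replaces a downstream exact‑match filter with a constraint applied during decoding.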
LLM Inference Service
Leveraging the AngelHCF framework, a real‑time inference service supports constrained decoding with performance enhancements: FP8/INT8 quantization, Softmax kernel vectorization, Prefill latency reduction, and InsertUnfinishedPath kernel acceleration, delivering over 50% throughput gains and sub‑millisecond latency improvements.
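Of the optimizations listed, quantization is the easiest to illustrate in isolation. The sketch below shows generic symmetric per‑tensor INT8 quantization, which conveys the memory‑ and bandwidth‑saving principle; it is not AngelHCF's actual kernel or quantization scheme.

```python
import numpy as np

def int8_quantize(x):
    """Symmetric per-tensor INT8 quantization: int8 values plus one
    float scale, so a tensor shrinks ~4x versus float32."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q, scale):
    """Approximate reconstruction of the original float tensor."""
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.0, 0.25], dtype=np.float32)
q, scale = int8_quantize(x)
x_hat = int8_dequantize(q, scale)  # close to x, within quantization error
```

In inference serving, the win comes from moving int8 weights and activations through memory and tensor cores, at the cost of the small reconstruction error visible here.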
Conclusion & Outlook
Continuous exploration of LLMs in search advertising has yielded significant matching improvements; future work will focus on domain‑specific models, reinforcement feedback loops, and integrated indexing‑generation techniques.
Tencent Advertising Technology
Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.