
Applying Large Language Models to Search Advertising: End‑to‑End Generative Recall and System Optimizations

This report details how large language models (LLMs) were integrated into Tencent's search advertising pipeline—from early extraction‑distillation experiments in 2023 to a 2024 end‑to‑end generative recall architecture—showing significant improvements in relevance, diversity, and revenue through knowledge injection, supervised fine‑tuning, constrained beam‑search decoding, and high‑performance inference services.

Tencent Advertising Technology

Abstract Large language models (LLMs) possess extensive world knowledge and, when fine‑tuned with supervised instruction data and reinforcement learning from human feedback (RLHF), achieve strong zero‑shot interaction capabilities that can be leveraged to enhance search advertising without altering existing user‑experience flows.

Initial Exploration (2023) Early experiments applied LLMs via extraction‑distillation to existing recall and relevance modules, with A/B tests demonstrating notable performance gains while preserving the current advertising chain.

System Redesign (2024) Building on the 2023 experience, the recall system was re‑architected to replace multiple cascade modules with an end‑to‑end online generative recall model, directly generating ad titles from user queries and achieving substantial revenue uplift.

LLM‑Enhanced Recall and Relevance A new LLM‑based recall path trains a domain‑specific LLM on search data, converts multi‑intent ad information into embeddings, and performs vector retrieval. For relevance, LLM‑derived query and ad intents augment a four‑layer BERT model, enriching feature sets and improving discrimination.
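The vector‑retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the production system: the embeddings here are random stand‑ins for the LLM‑derived multi‑intent ad embeddings, and a real deployment would use an approximate‑nearest‑neighbor index rather than a brute‑force dot product.

```python
import numpy as np

def normalize(v):
    # Unit-normalize so cosine similarity reduces to a dot product.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Stand-ins: one row per ad intent; in practice these come from the
# domain-tuned LLM encoder.
ad_embeddings = normalize(rng.normal(size=(1000, 64)))
query_embedding = normalize(rng.normal(size=(64,)))

def retrieve_top_k(query, ads, k=5):
    # Brute-force cosine retrieval; an ANN index replaces this at scale.
    scores = ads @ query
    top = np.argsort(-scores)[:k]
    return top, scores[top]

ids, scores = retrieve_top_k(query_embedding, ad_embeddings)
```

The retrieved ad intents then feed the downstream relevance stage alongside the BERT features.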

End‑to‑End Generative Recall Architecture The redesigned pipeline moves title selection and relevance filtering upstream into the recall stage, using offline KV‑cached large models (13B) for head queries and real‑time small models (1B) for tail queries, enabling direct generation of ad titles.
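The head/tail split above amounts to a routing decision at serving time. The sketch below is hypothetical: the cache contents and both "models" are stubs, standing in for the offline 13B generations and the real‑time 1B generator respectively.

```python
# Offline cache: titles pre-generated by the large (13B) model for head
# queries. Contents here are illustrative stubs.
offline_cache = {
    "running shoes": ["Lightweight Running Shoes - Free Shipping"],
}

def small_model_generate(query):
    # Placeholder for the real-time small (1B) generative model.
    return [f"Best deals for {query}"]

def generate_titles(query):
    cached = offline_cache.get(query)
    if cached is not None:
        # Head query: serve titles generated offline.
        return cached
    # Tail query: generate titles online with the small model.
    return small_model_generate(query)
```

Routing this way keeps expensive generation off the hot path for the high‑traffic queries that dominate volume, while tail queries still get fresh generations.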

Knowledge Injection & Supervised Fine‑Tuning High‑value commercial data and encyclopedic sources are injected into the base LLM to improve domain understanding, followed by DPO‑based fine‑tuning on curated query‑title pairs, allowing few‑shot adaptation and better generation quality.
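For reference, the DPO objective on a single preference pair can be written out directly from the published formula. The log‑probabilities passed in below are illustrative numbers, not real model outputs; in training they would be sequence log‑probs of the chosen and rejected titles under the policy and the frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Margin between the policy-vs-reference log-ratios of the preferred
    # and dispreferred completions.
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # DPO loss: -log sigmoid(beta * margin).
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When policy and reference agree (zero margin) the loss is log 2; it shrinks as the policy assigns relatively more mass to the chosen title.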

Constrained Beam‑Search Decoding A CUDA‑accelerated beam search with output constraints ensures generated titles belong to a predefined ad set; additional optimizations such as threshold truncation and diverse beam search improve accuracy and variety.
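The constraint mechanism can be sketched with a trie over the tokenized titles of the candidate ad set: at each decoding step, only tokens that extend some title in the set are allowed. This toy version uses a deterministic scoring function in place of real model logits, words in place of subword tokens, and omits the CUDA kernels; it illustrates the structure, not the production implementation.

```python
END = "<eos>"

def build_trie(titles):
    # Trie over allowed token sequences; END marks a complete title.
    trie = {}
    for tokens in titles:
        node = trie
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return trie

def constrained_beam_search(trie, score_fn, beam_width=2):
    # Each beam item: (cumulative score, tokens so far, current trie node).
    beams = [(0.0, [], trie)]
    finished = []
    while beams:
        candidates = []
        for score, toks, node in beams:
            for tok, child in node.items():
                if tok == END:
                    finished.append((score, toks))
                else:
                    # Only trie-legal continuations are ever scored.
                    candidates.append((score + score_fn(toks, tok),
                                       toks + [tok], child))
        candidates.sort(key=lambda b: -b[0])
        beams = candidates[:beam_width]
    finished.sort(key=lambda f: -f[0])
    return [" ".join(toks) for _, toks in finished[:beam_width]]

titles = [("cheap", "flights"), ("cheap", "hotels"), ("luxury", "hotels")]
trie = build_trie(titles)
# Toy scorer: shorter tokens score higher (stands in for model logits).
results = constrained_beam_search(trie, lambda prefix, tok: -0.1 * len(tok))
```

Because every expansion follows a trie edge, any sequence the search emits is guaranteed to be a title from the predefined ad set.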

LLM Inference Service Leveraging the AngelHCF framework, a real‑time inference service supports constrained decoding with performance enhancements: FP8/INT8 quantization, Softmax kernel vectorization, Prefill latency reduction, and InsertUnfinishedPath kernel acceleration, delivering over 50% throughput gains and sub‑millisecond latency improvements.
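Of the optimizations listed, weight quantization is the easiest to illustrate compactly. Below is a minimal symmetric per‑tensor INT8 scheme in NumPy; the kernel‑level work (FP8, Softmax vectorization, InsertUnfinishedPath) is beyond a short sketch, and this is a generic textbook scheme, not AngelHCF's actual quantizer.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map max |w| to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # reconstruction error bounded by scale / 2
```

Storing weights as INT8 quarters the memory traffic relative to FP32, which is where most of the throughput gain in memory‑bound decoding comes from.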

Conclusion & Outlook Continuous exploration of LLMs in search advertising has yielded significant matching improvements; future work will focus on domain‑specific models, reinforcement feedback loops, and integrated indexing‑generation techniques.

AI · LLM · beam search · Search Advertising · Online Inference · generative recall · Knowledge Injection
Written by

Tencent Advertising Technology

Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.
