Content Understanding for Personalized Feed Recommendation: Interest Graph and Techniques
This article explains how Tencent tackles content understanding for personalized feed recommendation by combining traditional classification, keyword, and entity methods with deep-learning embeddings. The core proposal is an interest graph with category, concept, entity, and event layers, which preserves the full context of an interest point and helps infer why users consume content.
In modern feed recommendation, content understanding draws on two lineages: legacy technologies from the portal and search eras (classification, keywords, knowledge graphs) and deep-learning-driven embeddings. Classification is too coarse and embeddings lack interpretability; Tencent's interest graph is designed to overcome both shortcomings.
1. Evolution of Content Understanding
The portal era (1995‑2002) relied on manually curated content types and later automated text classification. The search/social era (2003‑present) added keyword extraction and knowledge graphs to resolve entity ambiguity. The intelligent era (2012‑present) introduced personalized recommendation, demanding richer content understanding.
2. Recommendation vs. Search
Search sorts documents by the intersection of query terms, preserving full context. Recommendation sorts by the union of user interest terms, which can lose the contextual relationship between terms (e.g., "Wang Baoqiang" and "Ma Rong" become separate interests). Therefore, recommendation requires preserving the complete context of an interest point.
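The intersection-vs-union distinction can be made concrete with a toy sketch (all terms and documents below are hypothetical): a search query carries one intent whose terms must co-occur, while a user-interest profile is a bag of independent terms that can jointly match a document the user never intended.

```python
query_terms = {"Wang Baoqiang", "divorce"}             # one query, one intent
user_interests = {"Wang Baoqiang", "Ma Rong", "NBA"}   # accumulated interest points
doc_terms = {"Ma Rong", "NBA"}                         # terms extracted from a document

# Search: a match requires the document to cover the query terms together,
# so the intersection preserves the query's full context.
search_match = query_terms & doc_terms    # empty -> correctly no match

# Recommendation: any interest term can fire independently, so two unrelated
# interests ("Ma Rong" and "NBA") combine to recall an unintended document.
rec_match = user_interests & doc_terms    # {"Ma Rong", "NBA"}

print(search_match, rec_match)
```

This is why the article argues recommendation must keep the complete context of each interest point rather than flattening it into separate terms.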
3. Why Users Consume Content
Traditional methods answer "what the article is" but ignore "why a user consumes it". Understanding the underlying intent (e.g., brand preference, safety concerns) is essential for effective recommendation.
4. Limitations of Traditional NLP Techniques
Classification: coarse granularity, limited to thousands of categories.
Keyword extraction: massive scale but suffers from ambiguity.
Entity words: precise but can create filter bubbles.
LDA: similar granularity issues as classification.
Embedding: unlimited scale but hard to interpret.
5. Interest Graph
The interest graph consists of four layers:
Category layer – a strict tree built by product managers (~1,000 nodes).
Concept layer – groups of entities sharing attributes (e.g., "fuel‑efficient cars").
Entity layer – knowledge‑graph entities such as "Liu Dehua".
Event layer – specific events like "Wang Baoqiang divorce".
This structure captures both operational needs (category layer) and reasoning about user intent (concept layer), while entities and events provide fine‑grained recall.
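One way to picture the four layers is as typed nodes with cross-layer edges, coarse to fine. The sketch below is a minimal illustration with made-up contents; the real graph's node counts, edge types, and storage are not described at this level of detail.

```python
# Hypothetical four-layer interest graph: (layer, name) nodes with downward edges.
INTEREST_GRAPH = {
    # Category layer: strict tree maintained by product managers.
    ("category", "Entertainment"): [("concept", "celebrity divorce")],
    # Concept layer: attribute-sharing groups used to reason about intent.
    ("concept", "celebrity divorce"): [("event", "Wang Baoqiang divorce")],
    # Entity layer: knowledge-graph entities for fine-grained recall.
    ("entity", "Wang Baoqiang"): [("event", "Wang Baoqiang divorce")],
    # Event layer: concrete events, the finest recall unit.
    ("event", "Wang Baoqiang divorce"): [],
}

def expand(node, graph=INTEREST_GRAPH):
    """Walk downward from a node and collect every reachable tag."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(graph.get(cur, []))
    return seen

tags = expand(("category", "Entertainment"))
print(tags)
```

Expanding a coarse category node this way reaches the concepts and events beneath it, which is what lets the graph serve both operational grouping and fine-grained recall from one structure.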
6. Concept Mining
Concepts are short phrases lacking labeled training data, so a weak‑supervision approach is used: search click data provides semi‑supervised signals, and UGC data helps determine appropriate granularity.
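As a rough illustration of the search-click signal, one simple weak-supervision heuristic is to treat a query phrase that also appears in the clicked title as a noisy positive concept candidate. The click log and matching rule below are illustrative assumptions, not Tencent's actual pipeline.

```python
# Hypothetical search click log: (query, clicked title) pairs.
CLICK_LOG = [
    ("fuel-efficient cars under 100k", "Top 10 fuel-efficient cars of 2019"),
    ("movies about space", "Best movies about space exploration"),
]

def candidate_concepts(log):
    """Extract the longest query n-gram that also appears in the clicked title."""
    out = []
    for query, title in log:
        q_words, title_text = query.lower().split(), title.lower()
        for n in range(len(q_words), 0, -1):
            grams = [" ".join(q_words[i:i + n]) for i in range(len(q_words) - n + 1)]
            hit = next((g for g in grams if g in title_text), None)
            if hit:
                out.append(hit)
                break
    return out

print(candidate_concepts(CLICK_LOG))
```

Candidates mined this way would still need the UGC-based granularity check the talk mentions before becoming concept nodes.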
7. Hot Event Mining
Queries with bursty search volume indicate hot events. A DTW‑based similarity to a predefined trend template identifies bursts, followed by clustering similar queries into topics and filtering non‑event topics using URL‑based features.
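The DTW step can be sketched as follows. The spike-shaped template, the normalization, and the threshold are illustrative assumptions; only the idea of comparing a query's search-volume curve to a burst template via DTW comes from the talk.

```python
def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def is_bursty(volume, template=(0, 0, 1, 5, 10, 4, 1), threshold=2.0):
    """Flag a query's daily search volume as bursty if its shape is DTW-close
    to a predefined spike template (both series max-normalized)."""
    norm = lambda xs: [x / (max(xs) or 1) for x in xs]
    return dtw(norm(volume), norm(template)) < threshold

print(is_bursty([1, 1, 2, 20, 50, 18, 3]))   # spike-shaped -> True
print(is_bursty([5, 5, 5, 5, 5, 5, 5]))      # flat -> False
```

Queries that pass this filter would then be clustered into topics and screened with the URL-based features the article mentions.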
8. Association Relationships
Entity co‑occurrence and sequential search behavior provide positive samples; random negative sampling yields a 1:3 ratio. Pairwise loss trains entity embeddings, enabling association scoring even for rarely co‑occurring pairs.
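The training setup above can be sketched with a pairwise hinge loss: co-occurring entity pairs are positives, randomly drawn entities are negatives at a 1:3 positive-to-negative ratio, and updates push associated pairs to score higher. Entity names, dimensions, and hyperparameters below are made up for illustration.

```python
import random

random.seed(0)
entities = ["Wang Baoqiang", "Ma Rong", "NBA", "Toyota", "Prius"]
positives = [("Wang Baoqiang", "Ma Rong"), ("Toyota", "Prius")]  # co-occurrence pairs
DIM, MARGIN, LR = 8, 1.0, 0.05
emb = {e: [random.uniform(-0.1, 0.1) for _ in range(DIM)] for e in entities}

def score(a, b):
    """Association score as a dot product of entity embeddings."""
    return sum(x * y for x, y in zip(emb[a], emb[b]))

for _ in range(200):
    for a, p in positives:
        pool = [e for e in entities if e not in (a, p)]
        for n in random.sample(pool, 3):                # 1:3 negative sampling
            # Hinge: score(a, p) should beat score(a, n) by at least MARGIN.
            if MARGIN - score(a, p) + score(a, n) > 0:
                a_v, p_v, n_v = emb[a][:], emb[p][:], emb[n][:]
                for k in range(DIM):
                    emb[a][k] += LR * (p_v[k] - n_v[k])
                    emb[p][k] += LR * a_v[k]
                    emb[n][k] -= LR * a_v[k]

# Associated entities now score higher than unrelated ones.
print(score("Wang Baoqiang", "Ma Rong") > score("Wang Baoqiang", "NBA"))
```

Because the score is computed in embedding space rather than from raw counts, pairs that rarely co-occur can still receive a meaningful association score, which is the point of this component.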
9. Content Understanding Components
9.1 Text Classification
PM‑defined taxonomy is refined using user click clustering and subsequent PM labeling.
9.2 Keyword Extraction
Traditional features + GBRank are used, followed by a re‑ranking layer that incorporates association‑relationship embeddings to demote misleading high‑score terms.
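The re-ranking step can be illustrated by blending the first-stage score with an association-similarity signal, so that a term scoring high on surface features but weakly associated with the article's core entity is demoted. The weighting scheme and similarity table below are hypothetical stand-ins for the learned association embeddings.

```python
def rerank(candidates, core_entity, sim, alpha=0.5):
    """candidates: [(term, first_stage_score)]; sim(term, entity) in [0, 1].
    Blend the GBRank-style score with association similarity and re-sort."""
    return sorted(
        ((term, alpha * s + (1 - alpha) * sim(term, core_entity))
         for term, s in candidates),
        key=lambda pair: -pair[1],
    )

# Toy similarity table standing in for association-relationship embeddings.
SIM = {("divorce", "Wang Baoqiang"): 0.9, ("box office", "Wang Baoqiang"): 0.2}
sim = lambda term, ent: SIM.get((term, ent), 0.0)

ranked = rerank([("box office", 0.8), ("divorce", 0.6)], "Wang Baoqiang", sim)
print(ranked[0][0])  # "divorce" now outranks the misleading "box office"
```

The first-stage score alone would have put "box office" on top; the association signal corrects it.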
9.3 Semantic Matching
Concept and event tags are retrieved via a two‑stage recall (relationship recall from the interest graph and semantic vector recall) and then ranked using interaction‑based features.
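The two recall channels can be sketched as follows: relationship recall follows interest-graph edges from entities detected in the text, while semantic recall retrieves tags by vector similarity. The graph edges, vectors, and cutoff below are illustrative, and the interaction-based ranking stage is omitted.

```python
# Hypothetical interest-graph links from entities to candidate tags.
GRAPH_EDGES = {"Wang Baoqiang": ["Wang Baoqiang divorce"]}

def relationship_recall(doc_entities):
    """Stage 1: follow graph edges from entities found in the document."""
    return {tag for e in doc_entities for tag in GRAPH_EDGES.get(e, [])}

def semantic_recall(doc_vec, tag_vecs, k=1):
    """Stage 2: nearest tags by dot product (vectors assumed unit-normalized)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    ranked = sorted(tag_vecs.items(), key=lambda kv: -dot(doc_vec, kv[1]))
    return {t for t, _ in ranked[:k]}

doc_entities = ["Wang Baoqiang"]
doc_vec = [0.6, 0.8]
tag_vecs = {"celebrity divorce": [0.7, 0.71], "fuel-efficient cars": [1.0, 0.0]}

candidates = relationship_recall(doc_entities) | semantic_recall(doc_vec, tag_vecs)
print(candidates)
```

In the described system, the union of both candidate sets would then be ranked with interaction-based features before tags are attached to the document.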
10. Online Results
Adding concept and event layers to the baseline (which only used entities and categories) yields a significant lift in key metrics, confirming the effectiveness of the proposed interest‑graph‑based content understanding.
Overall, the talk demonstrates how a multi‑layer interest graph combined with weak‑supervision mining and embedding‑based association can overcome the shortcomings of traditional NLP techniques and substantially improve personalized recommendation performance.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.