Building a Comprehensive Tagging System for Real‑Estate Recommendation at Beike
This article explains how Beike, China's largest residential service platform, uses its house, client, and text data to design a multi-layered tag architecture. It covers the data sources, the tag construction methods (classification, keyword, geographic, anonymous topic, and temporal tags), and how the tags are applied to improve personalized house search and recommendation.
1. Data Background
Beike’s data assets, accumulated over more than a decade, include property attributes, community information, surrounding POI entities, user search logs, conversational queries, reading histories, and various textual contents such as articles, news, Q&A, and reviews.
Because user intents in house‑search are highly diverse and often vague, simple attribute matching is insufficient; a flexible tag system can capture multi‑dimensional user needs.
2. Tag System Architecture
The tag system is built bottom‑up across five layers:
Entity Layer : three core entities, namely people (clients and agents), houses (listings and communities), and content (text data).
Data Layer : sources such as DMP profiles, search logs, conversation logs, article and Q&A texts, and property dictionaries combined with Baidu POI data.
Computation Layer : structured data is processed with inductive reasoning; unstructured text is processed with regex extraction, topic induction, classification models, keyword extraction, and multi-pattern matching.
Output Layer : tags are stored offline in Hive tables and exported to Elasticsearch or Redis; text-tag models are exposed through API services (forward lookup, reverse lookup, and similarity expansion).
Application Layer : tags support consumer-facing (C-side) natural language search, smart assistants, feed recommendation, and SEO navigation.
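The forward, reverse, and similarity-expansion lookups in the Output Layer can be pictured with a small in-memory index. This is only a sketch under assumptions: it reads "forward" as item → tags and "reverse" as tag → items, and it uses Jaccard overlap of item sets as a stand-in for similarity expansion. The class name `TagIndex`, its methods, and the sample tags are hypothetical; the article does not describe how the production services on Hive, Elasticsearch, or Redis are implemented.

```python
from collections import defaultdict


class TagIndex:
    """In-memory stand-in for forward and reverse tag lookups (illustrative only)."""

    def __init__(self):
        self.forward = {}                 # item_id -> set of tags
        self.inverted = defaultdict(set)  # tag -> set of item_ids

    def add(self, item_id, tags):
        """Register an item under each of its tags in both directions."""
        self.forward.setdefault(item_id, set()).update(tags)
        for tag in tags:
            self.inverted[tag].add(item_id)

    def tags_of(self, item_id):
        """Forward lookup: which tags does this item carry?"""
        return sorted(self.forward.get(item_id, set()))

    def items_with(self, tag):
        """Reverse lookup: which items carry this tag?"""
        return sorted(self.inverted.get(tag, set()))

    def similar_tags(self, tag, top_n=3):
        """Similarity expansion (assumed): rank other tags by Jaccard overlap of item sets."""
        base = self.inverted.get(tag, set())
        scored = []
        for other, items in self.inverted.items():
            if other == tag:
                continue
            shared = base & items
            if shared:
                scored.append((len(shared) / len(base | items), other))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [other for _, other in scored[:top_n]]
```

In a deployment like the one described, the reverse lookup would more plausibly be an Elasticsearch term query or a Redis set read than a Python dictionary; the sketch only shows the shape of the three operations.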
3. Construction Methods
3.1 Content Tags
The text‑tag hierarchy covers coarse categories, city/region POI tags, dynamic keyword tags, anonymous topic clusters, and temporal tags. Keyword and geographic tags can be linked to property attributes for embedding, while classification and keyword tags align with user reading history for preference inference and recommendation.
3.1.1 Classification Tag Construction
We built a three‑level classification system using the following workflow:
Cluster mixed articles via LDA (preferred) or TF‑IDF + k‑means to discover topics.
Domain experts refine topics and map them to an initial category tree.
Generate an initial training set by regex-filtering articles on topics and keywords, then manually review the result and relabel only the incorrect samples.
Train a FastText model on the labeled set, produce pseudo‑labels for a larger corpus, evaluate, calibrate, and iterate.
Continuously merge low‑distinguishability categories and split overly coarse ones to improve the taxonomy.
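The regex-seeding step of this workflow can be sketched as follows. The categories and patterns in `SEED_RULES` are invented for illustration and are not Beike's taxonomy. The output lines use the `__label__<name> <text>` format that FastText expects for supervised training. Articles that match several categories are dropped here, which is one simple way to keep the seed set single-label.

```python
import re

# Hypothetical seed rules: category -> regex built from topic keywords.
SEED_RULES = {
    "loan": re.compile(r"mortgage|interest rate|down payment", re.IGNORECASE),
    "renovation": re.compile(r"renovat|flooring|paint", re.IGNORECASE),
    "school": re.compile(r"school district|enrol", re.IGNORECASE),
}


def weak_label(text):
    """Return the single matching category, or None if zero or several match."""
    hits = [cat for cat, pattern in SEED_RULES.items() if pattern.search(text)]
    return hits[0] if len(hits) == 1 else None


def build_seed_set(articles):
    """Keep unambiguous articles and emit FastText-style training lines."""
    lines = []
    for text in articles:
        label = weak_label(text)
        if label is not None:
            one_line = " ".join(text.split())  # FastText reads one sample per line
            lines.append(f"__label__{label} {one_line}")
    return lines
```

The resulting lines would then be reviewed by hand, with only the wrong labels corrected, before being written to a training file.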
Challenges include severe class imbalance (addressed by up/down‑sampling or merging/splitting categories) and multi‑label articles that blur boundaries; we therefore prioritize single‑label samples for training.
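A minimal sketch of the up/down-sampling idea, assuming samples are `(label, text)` pairs and that every class is brought to the same size. The function name `rebalance` and the fixed `target` are illustrative choices; the article does not say which sampling ratios were used.

```python
import random
from collections import defaultdict


def rebalance(samples, target, seed=0):
    """Down-sample large classes and up-sample small ones to `target` items each.

    `samples` is a list of (label, text) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append(text)

    balanced = []
    for label, texts in sorted(by_label.items()):
        if len(texts) >= target:
            # Down-sample without replacement.
            chosen = rng.sample(texts, target)
        else:
            # Up-sample by repeating randomly chosen items.
            chosen = texts + rng.choices(texts, k=target - len(texts))
        balanced.extend((label, text) for text in chosen)
    rng.shuffle(balanced)
    return balanced
```

Up-sampling by duplication adds no new information, so it is usually worth checking whether a very small class should instead be merged into a neighbor, as the taxonomy iteration step suggests.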
3.1.2 Other Text Tags
Keyword Tags : TF-IDF + TextRank; the top-N keywords are selected and then filtered through a dynamic whitelist.
City/Province Tags : Aho-Corasick multi-pattern matching, with title weighting, abbreviation expansion, and hierarchical matching (e.g., Haidian → Beijing); the highest-scoring candidate is selected.
POI Tags : the same Aho-Corasick approach, with matches restricted to POIs in the same city to reduce false positives.
Anonymous Topic Tags : LDA or doc2vec + k-means; the clusters are used directly for similarity-based retrieval without predefined categories.
Temporal Tags : rule-based detection of short-term content, followed by FastText classification of long-term vs. short-term content.
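The city tagging described above can be sketched with a small Aho-Corasick automaton in plain Python. The gazetteer `PLACE_TO_CITY`, the title weight of 3.0, and the function names are assumptions made for this example. A production system would more likely use an existing Aho-Corasick library and a full administrative dictionary with abbreviations.

```python
from collections import defaultdict, deque

# Hypothetical gazetteer: place name -> city it rolls up to.
PLACE_TO_CITY = {
    "Beijing": "Beijing",
    "Haidian": "Beijing",
    "Chaoyang": "Beijing",
    "Shanghai": "Shanghai",
    "Pudong": "Shanghai",
}


def build_automaton(patterns):
    """Build Aho-Corasick goto, failure, and output tables."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    queue = deque(goto[0].values())  # depth-1 states keep the root as failure link
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches ending at a suffix
    return goto, fail, out


def find_all(text, automaton):
    """Return every pattern occurrence in `text` in a single pass."""
    goto, fail, out = automaton
    state, hits = 0, []
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend(out[state])
    return hits


AUTOMATON = build_automaton(PLACE_TO_CITY)


def city_tag(title, body, title_weight=3.0):
    """Score cities by weighted mentions and return the highest-scoring one."""
    scores = defaultdict(float)
    for place in find_all(title, AUTOMATON):
        scores[PLACE_TO_CITY[place]] += title_weight  # title mentions count more
    for place in find_all(body, AUTOMATON):
        scores[PLACE_TO_CITY[place]] += 1.0
    if not scores:
        return None
    return max(sorted(scores), key=scores.get)  # sorted() makes ties deterministic
```

The same matcher could serve POI tagging by building one automaton per city and only running the automaton of the city already assigned to the article.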
3.2 Client & Property Inference Tags
For vague user intents (e.g., “quiet house for elderly”), we infer high‑level reasoning tags from DMP profiles, query history, and dialogue data for clients, and from property attributes, surrounding POIs, reviews, and community guides for listings. Scoring rules derived from external knowledge bases (e.g., articles about “houses suitable for seniors”) map these tags to both client and property sides, enabling direct ID‑based matching in Elasticsearch or Redis.
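A minimal sketch of how one inference tag could be scored and matched on both sides. The feature names, weights, threshold, and query phrases are all invented for illustration; the article only states that scoring rules are derived from external knowledge and that matching is done on tag IDs.

```python
# Hypothetical scoring rule for one inference tag on the listing side.
SENIOR_FRIENDLY_RULE = {
    "has_elevator": 2.0,
    "near_hospital": 2.0,
    "near_park": 1.0,
    "low_floor": 1.0,
    "near_main_road": -1.5,  # noise counts against the tag
}
THRESHOLD = 3.0

# Hypothetical phrase-to-tag mapping for the client side.
CLIENT_PHRASE_TO_TAG = {
    "elderly": "senior_friendly",
    "parents": "senior_friendly",
    "retire": "senior_friendly",
}


def infer_listing_tags(features):
    """Assign the tag when the weighted sum of present features reaches the threshold."""
    score = sum(w for name, w in SENIOR_FRIENDLY_RULE.items() if features.get(name))
    return {"senior_friendly"} if score >= THRESHOLD else set()


def infer_client_tags(queries):
    """Collect tags whose trigger phrase appears in any of the client's queries."""
    tags = set()
    for query in queries:
        lowered = query.lower()
        for phrase, tag in CLIENT_PHRASE_TO_TAG.items():
            if phrase in lowered:
                tags.add(tag)
    return tags


def match(client_tags, listing_tags_by_id):
    """Return IDs of listings that share at least one tag with the client."""
    return sorted(
        listing_id
        for listing_id, tags in listing_tags_by_id.items()
        if client_tags & tags
    )
```

Because both sides end up with the same tag identifiers, the final step reduces to an exact-match lookup, which is what makes storage in Elasticsearch or Redis practical.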
4. Summary and Outlook
This article presented a text-centric tag construction pipeline for real-estate recommendation. The approach keeps algorithmic complexity low in the early stage by relying on rules, then moves gradually to data-driven methods as data volume grows. Future work includes expanding the variety of tags, improving accuracy and coverage, and strengthening the relationships between tags to build a more robust tag ecosystem.
Author: Dongpo (company alias) graduated from Waseda University in Japan in 2018, joined Beike in September, and works mainly on text mining.
Beike Product & Technology