
Building a Comprehensive Tagging System for Real‑Estate Recommendation at Beike

This article explains how Beike, China’s largest residential service platform, uses its large stores of house, client, and text data to design a multi‑layered tag architecture. It covers the data sources, the tag construction methods (classification, keyword, geographic, anonymous topic, and temporal tags), and how the resulting tags improve personalized house search and recommendation.

Beike Product & Technology

Jun 28, 2019


1. Data Background

Beike’s data assets, accumulated over more than a decade, include property attributes, community information, surrounding POI entities, user search logs, conversational queries, reading histories, and various textual contents such as articles, news, Q&A, and reviews.

Because user intents in house‑search are highly diverse and often vague, simple attribute matching is insufficient; a flexible tag system can capture multi‑dimensional user needs.

2. Tag System Architecture

The tag system is built bottom‑up across five layers:

Entity Layer : three core entities – people (clients and agents), houses (listings and communities), and content (text data).

Data Layer : sources such as DMP profiles, search logs, conversation logs, article and Q&A texts, and property dictionaries combined with Baidu POI data.

Computation Layer : structured data uses inductive reasoning; unstructured text employs regex extraction, topic induction, classification models, keyword extraction, and multi‑pattern matching.

Output Layer : tags are stored in offline Hive tables and exported to Elasticsearch or Redis; text‑tag models are exposed via API services (forward, reverse, and similarity‑expansion).

Application Layer : tags support consumer‑facing (C‑side) natural language search, smart assistants, feed recommendation, and SEO navigation.
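As a rough illustration of how the output layer might serve tags, the sketch below keeps a forward index (entity to tags) and a reverse index (tag to entities) in memory. It is only a stand-in for the Hive, Elasticsearch, or Redis storage named above: the `Tag` record layout, the `TagIndex` class, and the reading of "forward" and "reverse" as these two lookups are all assumptions, not details from the source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """One tag attached to an entity (hypothetical record layout)."""
    entity_type: str   # "person", "house", or "content"
    entity_id: str
    name: str
    score: float


class TagIndex:
    """In-memory stand-in for an exported tag store (assumed design)."""

    def __init__(self):
        self._forward = defaultdict(dict)   # (type, id) -> {tag name: score}
        self._reverse = defaultdict(set)    # tag name -> {(type, id)}

    def add(self, tag):
        key = (tag.entity_type, tag.entity_id)
        self._forward[key][tag.name] = tag.score
        self._reverse[tag.name].add(key)

    def tags_of(self, entity_type, entity_id):
        """Forward lookup: tags of one entity, highest score first."""
        tags = self._forward.get((entity_type, entity_id), {})
        return sorted(tags.items(), key=lambda kv: (-kv[1], kv[0]))

    def entities_with(self, tag_name, entity_type=None):
        """Reverse lookup: entities carrying a tag, optionally filtered by type."""
        hits = self._reverse.get(tag_name, set())
        return sorted(k for k in hits if entity_type is None or k[0] == entity_type)
```

Keeping both directions lets the same tag data answer "what describes this listing?" and "which listings fit this need?" without a scan.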

3. Construction Methods

3.1 Content Tags

The text‑tag hierarchy covers coarse categories, city/region POI tags, dynamic keyword tags, anonymous topic clusters, and temporal tags. Keyword and geographic tags can be linked to property attributes for embedding, while classification and keyword tags align with user reading history for preference inference and recommendation.

3.1.1 Classification Tag Construction

We built a three‑level classification system using the following workflow:

Cluster mixed articles via LDA (preferred) or TF‑IDF + k‑means to discover topics.

Domain experts refine topics and map them to an initial category tree.

Generate an initial training set by regex‑filtering articles on topics and keywords, then manually review the result and relabel only the samples the rules got wrong.

Train a FastText model on the labeled set, produce pseudo‑labels for a larger corpus, evaluate, calibrate, and iterate.

Continuously merge low‑distinguishability categories and split overly coarse ones to improve the taxonomy.
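The regex‑filtering step of this workflow could look roughly like the sketch below. The categories, patterns, and English sample vocabulary in `SEED_RULES` are invented for illustration (the real rules would come from the expert‑refined topics), and the manual correction pass is left out. The only external convention relied on is fastText's supervised input format, where each line starts with `__label__<name>`.

```python
import re

# Hypothetical category -> pattern rules, invented for this example.
SEED_RULES = {
    "mortgage": re.compile(r"\b(mortgage|loan|interest rate)\b", re.I),
    "renovation": re.compile(r"\b(renovation|decorat\w+|flooring)\b", re.I),
    "policy": re.compile(r"\b(policy|regulation|purchase restriction)\b", re.I),
}


def seed_label(text):
    """Return the single category whose rule matches, else None.

    Articles matching zero or several categories are skipped so the seed
    set contains only unambiguous samples.
    """
    matched = [cat for cat, pat in SEED_RULES.items() if pat.search(text)]
    return matched[0] if len(matched) == 1 else None


def to_fasttext_line(label, text):
    """Format one sample as '__label__X text' on a single line."""
    return f"__label__{label} {' '.join(text.split())}"


def build_seed_set(articles):
    """Label every article the rules can decide and format it for training."""
    lines = []
    for text in articles:
        label = seed_label(text)
        if label is not None:
            lines.append(to_fasttext_line(label, text))
    return lines
```

Writing these lines to a file yields a training set a fastText supervised model can consume; the trained model can then pseudo‑label the larger corpus for the next iteration.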

Challenges include severe class imbalance (addressed by up/down‑sampling or merging/splitting categories) and multi‑label articles that blur boundaries; we therefore prioritize single‑label samples for training.
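A minimal sketch of the two mitigations mentioned here, single‑label filtering and up/down‑sampling, is shown below. The function names, the fixed per‑class `target`, and the choice of random resampling are assumptions made for the example; the source does not say which sampling scheme was used.

```python
import random
from collections import defaultdict


def single_label_only(samples):
    """Keep samples carrying exactly one label; input is (text, [labels])."""
    return [(text, labels[0]) for text, labels in samples if len(labels) == 1]


def rebalance(samples, target, seed=0):
    """Resample (text, label) pairs so each label has exactly `target` samples.

    Labels with at least `target` samples are down-sampled without
    replacement; smaller labels keep all samples and are topped up by
    sampling with replacement.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    balanced = []
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) >= target:
            chosen = rng.sample(texts, target)
        else:
            chosen = texts + rng.choices(texts, k=target - len(texts))
        balanced.extend((text, label) for text in chosen)
    rng.shuffle(balanced)
    return balanced
```

Up‑sampling by duplication adds no new information, so merging or splitting categories, as the workflow suggests, remains the more durable fix when a class stays tiny.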

3.1.2 Other Text Tags

Keyword Tags : TF‑IDF + TextRank; top‑N keywords are selected and filtered through a dynamic whitelist.

City/Province Tags : Aho‑Corasick multi‑pattern matching, combined with title weighting, abbreviation expansion, hierarchical matching (e.g., Haidian → Beijing), and selection of the highest‑scoring candidate.

POI Tags : the same Aho‑Corasick approach, with matches restricted to POIs in the same city to reduce false positives.

Anonymous Topic Tags : LDA or doc2vec + k‑means; used directly for similarity‑based retrieval without predefined categories.

Temporal Tags : Rule‑based short‑term detection followed by FastText classification for long‑term vs. short‑term content.
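To make the city tagging item above concrete, here is a small pure‑Python sketch: an Aho‑Corasick automaton finds every gazetteer entry in one pass, districts roll up to their city, title hits are weighted more heavily, and the highest‑scoring city wins. The gazetteer `ALIAS_TO_CITY` and the weights are invented for the example, and matching is plain substring matching, which suits unsegmented Chinese text better than the English placeholders used here.

```python
from collections import defaultdict, deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton for multi-pattern substring search."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-node transitions: char -> node id
        self.fail = [0]    # per-node failure link
        self.out = [[]]    # per-node patterns that end at this node
        for pattern in dict.fromkeys(patterns):   # drop duplicate patterns
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pattern)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())   # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find_all(self, text):
        """Yield every pattern occurrence in text, in scan order."""
        node = 0
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            yield from self.out[node]


# Hypothetical gazetteer: surface form -> city it rolls up to.
ALIAS_TO_CITY = {
    "Beijing": "Beijing",
    "Haidian": "Beijing",
    "Chaoyang": "Beijing",
    "Shanghai": "Shanghai",
    "Pudong": "Shanghai",
}
TITLE_WEIGHT = 3.0   # assumed: a title hit counts more than a body hit
BODY_WEIGHT = 1.0

_matcher = AhoCorasick(ALIAS_TO_CITY)


def city_tag(title, body):
    """Return the highest-scoring city for an article, or None if no match."""
    scores = defaultdict(float)
    for text, weight in ((title, TITLE_WEIGHT), (body, BODY_WEIGHT)):
        for alias in _matcher.find_all(text):
            scores[ALIAS_TO_CITY[alias]] += weight
    if not scores:
        return None
    return max(sorted(scores), key=scores.get)   # ties break alphabetically
```

The same matcher could serve POI tags by building one automaton per city and consulting only the automaton for the city already assigned to the article.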

3.2 Client & Property Inference Tags

For vague user intents (e.g., “quiet house for elderly”), we derive high‑level inference tags on both sides. Client tags come from DMP profiles, query history, and dialogue data; listing tags come from property attributes, surrounding POIs, reviews, and community guides. Scoring rules derived from external knowledge bases (e.g., articles about “houses suitable for seniors”) assign these tags to both the client and property sides, so that clients and listings can be matched directly by tag ID in Elasticsearch or Redis.
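One way such scoring rules might be expressed is sketched below. The feature names, point values, threshold, and the `senior_friendly` tag are all hypothetical, and a plain dictionary stands in for the Elasticsearch or Redis lookup.

```python
# Hypothetical rules for one inference tag: (feature, predicate, points).
# Real rules would be distilled from knowledge-base articles and tuned.
SENIOR_FRIENDLY_RULES = [
    ("floor", lambda v: v <= 3, 2.0),
    ("has_elevator", lambda v: v is True, 2.0),
    ("hospital_distance_m", lambda v: v <= 1000, 3.0),
    ("main_road_distance_m", lambda v: v >= 200, 1.0),
]
# tag name -> (rules, minimum score needed to assign the tag)
TAG_RULES = {"senior_friendly": (SENIOR_FRIENDLY_RULES, 5.0)}


def infer_tags(features):
    """Return {tag: score} for every tag whose score reaches its threshold."""
    tags = {}
    for tag, (rules, threshold) in TAG_RULES.items():
        score = sum(
            points
            for name, predicate, points in rules
            if name in features and predicate(features[name])
        )
        if score >= threshold:
            tags[tag] = score
    return tags


def match_listings(client_tags, listing_tags):
    """Rank listing IDs by the summed score of tags shared with the client."""
    ranked = []
    for listing_id, tags in listing_tags.items():
        shared = set(client_tags) & set(tags)
        if shared:
            ranked.append((sum(tags[t] for t in shared), listing_id))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [listing_id for _, listing_id in ranked]
```

Because both sides end up holding the same tag identifiers, a query such as "quiet house for elderly" reduces to a tag lookup once the client's intent has been mapped to `senior_friendly`.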

4. Summary and Outlook

This article presented a text‑centric tag construction pipeline for real‑estate recommendation. It favors simple, rule‑based methods in the early stage and moves gradually to data‑driven methods as data volume grows. Future work includes expanding tag varieties, improving accuracy and coverage, and strengthening inter‑tag relationships to build a more robust tag ecosystem.

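About the author: Dongpo (company alias) graduated from Waseda University in Japan in 2018, joined Beike in September, and works mainly on text mining.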



