Streaming Data Platform Practices and Challenges at Beike Real Estate
This article presents an in‑depth overview of Beike's four‑layer streaming data platform, covering the foundational infrastructure, capability aggregation, data content, and output layers, as well as the challenges of metadata management, real‑time processing, and productization through the Ark and Tianyan systems.
Today we share Beike's platform-wide streaming data practices and challenges, describing how a streaming data platform is built to meet business needs.
Overall Architecture
The Beike big‑data architecture consists of four layers from bottom to top:
1. Foundation Platform Layer – Uses common technologies such as HDFS for distributed storage, YARN for resource scheduling, and HBase for low‑latency storage, together with compute engines like Hive, Tez, Spark, Presto, Kylin, ClickHouse, and SparkML to satisfy a range of basic requirements, all supported by high‑performance clusters.
Key work of this layer includes integrated monitoring, strong security guarantees, high stability and efficiency, and low‑cost solutions.
2. Capability Aggregation Layer – Consolidates open‑source components for business consumption. It provides:
QueryEngine: unified query access to underlying storage via Hive, Tez, Spark, Presto.
OLAP Platform: wraps Kylin, HBase, Phoenix, Presto, ClickHouse to serve diverse analytical needs.
Streaming Compute Platform: includes the Tianyan (formerly "Second X") streaming product and the DataBus data‑ingress platform for real‑time and batch ingestion.
Accelerated Compute Platform: leverages high‑performance clusters for machine‑learning and deep‑learning workloads.
Summarized work of this layer: unified entry, flexible analysis, and capability efficiency.
3. Data Content Layer – Implements a unified data lake and other data‑governance mechanisms to build the data warehouse that serves upper‑level business applications.
4. Capability Output Layer – Provides ad hoc query capabilities (e.g., SQL via QueryEngine), Tableau for BI, unified metric platforms, and domain‑specific tools such as "Turing" for agents and "Compass" for traffic analysis.
Summarized work of this layer: basic support, data composition ability, and business empowerment.
Data Flow Process
Streaming data processing follows three steps: data ingestion, real‑time ETL, and data output.
Data ingestion faces source variability (heterogeneous origins such as MySQL, Oracle, and Redis), anomalies (e.g., bulk night‑time data refreshes), and efficiency demands for both offline and real‑time streams. Beike addresses these with DataBus, which reliably transports behavior and business data into the compute layer.
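One way a DataBus-style transport can tame source variability is to normalize every upstream record into a common envelope before it reaches the compute layer. The sketch below is a minimal illustration of that idea; the `Envelope` fields and the shape of the input binlog event are assumptions for the example, not Beike's actual wire format.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class Envelope:
    """Unified record a DataBus-style transport might emit downstream."""
    source: str   # e.g. "mysql-binlog", "redis", "app-log"
    table: str    # logical table or topic name
    op: str       # "insert" / "update" / "delete" / "event"
    ts: float     # event time, epoch seconds
    payload: dict = field(default_factory=dict)

def normalize_binlog(event: dict) -> Envelope:
    """Map a (hypothetical) binlog change event into the unified envelope."""
    return Envelope(
        source="mysql-binlog",
        table=f'{event["db"]}.{event["table"]}',
        op=event["type"],
        ts=event.get("ts", time.time()),
        payload=event["row"],
    )

# Example change event in an assumed binlog-parser output shape.
raw = {"db": "trade", "table": "orders", "type": "insert",
       "ts": 1700000000.0, "row": {"order_id": 42, "amount": 199.0}}
env = normalize_binlog(raw)
print(json.dumps(asdict(env)))
```

With every source mapped into one envelope, downstream ETL jobs can be written once against a stable schema instead of per source.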
Real‑time ETL is handled by the Ark platform, which separates data governance (upper layer) from compute (lower layer) and uses Spark Streaming and Flink for processing.
Data output includes products like the Tianyan log‑analysis platform, which consolidates log ingestion, processing, and visualization.
Challenges
Streaming metadata management – a unified catalog is needed for diverse streams (Kafka logs, MySQL binlog, etc.), covering each stream's source, type, and schema.
Streaming processing platform – simplify configuration for Flink/Spark Streaming via the Ark platform.
Streaming application products – address varied scenarios such as log analysis, data mining, AI‑driven real‑time profiling.
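The metadata challenge above boils down to one registry that answers "where does this stream come from, what format is it, and what fields does it carry?" The following is a minimal sketch of such a catalog; the class and field names are illustrative assumptions, not Beike's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamMeta:
    """Catalog entry: source, type, and schema for one stream."""
    name: str     # logical stream name
    source: str   # "kafka" | "mysql-binlog" | ...
    fmt: str      # "json" | "avro" | ...
    schema: tuple # ordered (field, type) pairs

class StreamCatalog:
    """Minimal unified catalog: register and look up stream metadata."""
    def __init__(self):
        self._streams = {}

    def register(self, meta: StreamMeta) -> None:
        self._streams[meta.name] = meta

    def lookup(self, name: str) -> StreamMeta:
        return self._streams[name]

catalog = StreamCatalog()
catalog.register(StreamMeta(
    name="app.click_log", source="kafka", fmt="json",
    schema=(("user_id", "string"), ("url", "string"), ("ts", "bigint"))))
print(catalog.lookup("app.click_log").source)  # kafka
```

In practice such a catalog would persist entries and validate incoming records against the registered schema, which is what makes the downstream SQL-style processing possible.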
Ark Streaming Processing Platform
Architecture consists of three layers:
Engine Layer – sources (Kafka, binlog, dig) feed data; processing via Stream SQL, real‑time rule matching, or templates built on Spark Streaming and Flink; results are written to sinks like Druid, ES, Kafka, HBase.
Compute Platform Layer – provides task scheduling, auto‑tuning diagnostics, integration of Databus data, a SQL IDE that auto‑generates most queries, and monitoring/alerting.
Application Layer – exposes capabilities via APIs for data cleaning, real‑time dashboards, feature extraction, recommendation, and risk control.
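The engine layer's "real-time rule matching" can be pictured as a set of named predicates evaluated against each event, with matches routed to the appropriate sink. Below is a toy sketch of that pattern; the rule names and event fields are invented for illustration and stand in for what would run inside a Spark Streaming or Flink job.

```python
# Tiny rule-matching sketch: each rule pairs a name (sink tag) with a
# predicate; an event is routed to every rule it satisfies.
from typing import Callable, Dict, List, Tuple

Rule = Tuple[str, Callable[[Dict], bool]]

RULES: List[Rule] = [
    ("high_value", lambda e: e.get("amount", 0) >= 1000),  # e.g. risk control
    ("error_log",  lambda e: e.get("level") == "ERROR"),   # e.g. alerting
]

def match(event: Dict) -> List[str]:
    """Return the names of all rules the event satisfies."""
    return [name for name, pred in RULES if pred(event)]

print(match({"amount": 2500}))                  # ['high_value']
print(match({"level": "ERROR", "amount": 5}))   # ['error_log']
```

A production engine would additionally support hot-reloading rules from configuration and stateful rules (counts over windows), but the routing core is the same shape.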
Tianyan Product
Collects business, access, device, and third‑party logs via Databus, processes them, and stores results in Druid, HDFS, ES, Hive, or MySQL. It offers visual analytics, full‑link monitoring, business monitoring, fault management, and open APIs for diverse use cases.
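A typical aggregate a log-analysis product like Tianyan computes before writing to a store such as Druid is a per-window count by status. The sketch below models a tumbling-window count in plain Python as an illustration of that processing step; the event shape and window size are assumptions.

```python
from collections import defaultdict

def tumbling_counts(events, window_sec=60):
    """Count events per (window start, status) pair — the kind of
    aggregate a log-analysis pipeline might emit to Druid or ES."""
    counts = defaultdict(int)
    for ts, status in events:
        window = int(ts // window_sec) * window_sec  # align to window start
        counts[(window, status)] += 1
    return dict(counts)

# (timestamp, http status) pairs spanning two one-minute windows.
events = [(0, "200"), (10, "500"), (70, "200"), (75, "200")]
print(tumbling_counts(events))
# {(0, '200'): 1, (0, '500'): 1, (60, '200'): 2}
```

The real pipeline would compute this incrementally over an unbounded stream (with watermarks for late data) rather than over a finished list, but the grouping logic is the same.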
Summary
Building a clear, complete, visual streaming data dictionary is essential for effective stream data applications.
Productized streaming capabilities (e.g., Spark Streaming, Flink) wrapped by Ark simplify configuration and usage.
Multi‑scenario adaptability enables applications ranging from log analysis to AI‑driven real‑time user profiling.
Thank you for attending; the talk covered Beike's three platforms (DataBus, Ark, and Tianyan) and practical experience with streaming data processing.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.