Streaming Data Platform Practices and Challenges at Beike Real Estate
This article presents an in‑depth overview of Beike's four‑layer streaming data platform, covering the foundational infrastructure, capability aggregation, data content, and output layers, as well as the challenges of metadata management, real‑time processing, and productization through the Ark and Tianyan systems.
Today we share Beike's platform-wide streaming data practices and challenges, describing how a streaming data platform is built to meet business needs.
Overall Architecture
The Beike big‑data architecture consists of four layers from bottom to top:
1. Foundation Platform Layer – Uses common technologies such as HDFS for distributed storage, YARN for resource scheduling, and HBase for low‑latency storage, together with compute engines like Hive, Tez, Spark, Presto, Kylin, ClickHouse, and SparkML to satisfy a range of basic requirements, all supported by high‑performance clusters.
Key work of this layer includes integrated monitoring, strong security guarantees, high stability and efficiency, and low‑cost solutions.
2. Capability Aggregation Layer – Consolidates open‑source components for business consumption. It provides:
QueryEngine: unified query access to underlying storage via Hive, Tez, Spark, Presto.
OLAP Platform: wraps Kylin, HBase, Phoenix, Presto, ClickHouse to serve diverse analytical needs.
Streaming Compute Platform: includes the Tianyan (formerly "Second X") streaming product and the DataBus data‑ingress platform for real‑time and batch ingestion.
Accelerated Compute Platform: leverages high‑performance clusters for machine‑learning and deep‑learning workloads.
Summarized work of this layer: unified entry, flexible analysis, and capability efficiency.
3. Data Content Layer – Implements a unified data lake and other data‑governance mechanisms to build the data warehouse that serves upper‑level business applications.
4. Capability Output Layer – Provides ad hoc query capabilities (e.g., SQL via QueryEngine), Tableau for BI, unified metric platforms, and domain‑specific tools such as "Turing" for agents and "Compass" for traffic analysis.
Summarized work of this layer: basic support, data composition ability, and business empowerment.
Data Flow Process
Streaming data processing follows three steps: data ingestion, real‑time ETL, and data output.
Data ingestion faces source variability (heterogeneous origins such as MySQL, Oracle, and Redis), anomalies (e.g., bulk night‑time data refreshes), and efficiency demands for both offline and real‑time streams. Beike addresses these with DataBus, which reliably transports behavior and business data into the compute layer.
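One way a DataBus-style transport can tame source variability is to normalize every upstream record into a common envelope before it reaches the compute layer. The sketch below is a minimal illustration of that idea; the `Envelope` fields and the shape of the input binlog event are assumptions for the example, not Beike's actual wire format.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class Envelope:
    """Unified record a DataBus-style transport might emit downstream."""
    source: str   # e.g. "mysql-binlog", "redis", "app-log"
    table: str    # logical table or topic name
    op: str       # "insert" / "update" / "delete" / "event"
    ts: float     # event time, epoch seconds
    payload: dict = field(default_factory=dict)

def normalize_binlog(event: dict) -> Envelope:
    """Map a (hypothetical) binlog change event into the unified envelope."""
    return Envelope(
        source="mysql-binlog",
        table=f'{event["db"]}.{event["table"]}',
        op=event["type"],
        ts=event.get("ts", time.time()),
        payload=event["row"],
    )

# Example change event in an assumed binlog-parser output shape.
raw = {"db": "trade", "table": "orders", "type": "insert",
       "ts": 1700000000.0, "row": {"order_id": 42, "amount": 199.0}}
env = normalize_binlog(raw)
print(json.dumps(asdict(env)))
```

With every source mapped into one envelope, downstream ETL jobs can be written once against a stable schema instead of per source.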
Real‑time ETL is handled by the Ark platform, which separates data governance (upper layer) from compute (lower layer) and uses Spark Streaming and Flink for processing.
Data output includes products like the Tianyan log‑analysis platform, which consolidates log ingestion, processing, and visualization.
Challenges
Streaming metadata management – a unified catalog is needed for diverse streams (Kafka logs, MySQL binlog, etc.), covering each stream's source, type, and schema.
Streaming processing platform – simplify configuration for Flink/Spark Streaming via the Ark platform.
Streaming application products – address varied scenarios such as log analysis, data mining, AI‑driven real‑time profiling.
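The metadata challenge above boils down to one registry that answers "where does this stream come from, what format is it, and what fields does it carry?" The following is a minimal sketch of such a catalog; the class and field names are illustrative assumptions, not Beike's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamMeta:
    """Catalog entry: source, type, and schema for one stream."""
    name: str     # logical stream name
    source: str   # "kafka" | "mysql-binlog" | ...
    fmt: str      # "json" | "avro" | ...
    schema: tuple # ordered (field, type) pairs

class StreamCatalog:
    """Minimal unified catalog: register and look up stream metadata."""
    def __init__(self):
        self._streams = {}

    def register(self, meta: StreamMeta) -> None:
        self._streams[meta.name] = meta

    def lookup(self, name: str) -> StreamMeta:
        return self._streams[name]

catalog = StreamCatalog()
catalog.register(StreamMeta(
    name="app.click_log", source="kafka", fmt="json",
    schema=(("user_id", "string"), ("url", "string"), ("ts", "bigint"))))
print(catalog.lookup("app.click_log").source)  # kafka
```

In practice such a catalog would persist entries and validate incoming records against the registered schema, which is what makes the downstream SQL-style processing possible.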
Ark Streaming Processing Platform
Architecture consists of three layers:
Engine Layer – sources (Kafka, binlog, dig) feed data; processing via Stream SQL, real‑time rule matching, or templates built on Spark Streaming and Flink; results are written to sinks like Druid, ES, Kafka, HBase.
Compute Platform Layer – provides task scheduling, auto‑tuning diagnostics, integration of Databus data, a SQL IDE that auto‑generates most queries, and monitoring/alerting.
Application Layer – exposes capabilities via APIs for data cleaning, real‑time dashboards, feature extraction, recommendation, and risk control.
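The engine layer's "real-time rule matching" can be pictured as a set of named predicates evaluated against each event, with matches routed to the appropriate sink. Below is a toy sketch of that pattern; the rule names and event fields are invented for illustration and stand in for what would run inside a Spark Streaming or Flink job.

```python
# Tiny rule-matching sketch: each rule pairs a name (sink tag) with a
# predicate; an event is routed to every rule it satisfies.
from typing import Callable, Dict, List, Tuple

Rule = Tuple[str, Callable[[Dict], bool]]

RULES: List[Rule] = [
    ("high_value", lambda e: e.get("amount", 0) >= 1000),  # e.g. risk control
    ("error_log",  lambda e: e.get("level") == "ERROR"),   # e.g. alerting
]

def match(event: Dict) -> List[str]:
    """Return the names of all rules the event satisfies."""
    return [name for name, pred in RULES if pred(event)]

print(match({"amount": 2500}))                  # ['high_value']
print(match({"level": "ERROR", "amount": 5}))   # ['error_log']
```

A production engine would additionally support hot-reloading rules from configuration and stateful rules (counts over windows), but the routing core is the same shape.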
Tianyan Product
Collects business, access, device, and third‑party logs via Databus, processes them, and stores results in Druid, HDFS, ES, Hive, or MySQL. It offers visual analytics, full‑link monitoring, business monitoring, fault management, and open APIs for diverse use cases.
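A typical aggregate a log-analysis product like Tianyan computes before writing to a store such as Druid is a per-window count by status. The sketch below models a tumbling-window count in plain Python as an illustration of that processing step; the event shape and window size are assumptions.

```python
from collections import defaultdict

def tumbling_counts(events, window_sec=60):
    """Count events per (window start, status) pair — the kind of
    aggregate a log-analysis pipeline might emit to Druid or ES."""
    counts = defaultdict(int)
    for ts, status in events:
        window = int(ts // window_sec) * window_sec  # align to window start
        counts[(window, status)] += 1
    return dict(counts)

# (timestamp, http status) pairs spanning two one-minute windows.
events = [(0, "200"), (10, "500"), (70, "200"), (75, "200")]
print(tumbling_counts(events))
# {(0, '200'): 1, (0, '500'): 1, (60, '200'): 2}
```

The real pipeline would compute this incrementally over an unbounded stream (with watermarks for late data) rather than over a finished list, but the grouping logic is the same.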
Summary
Building a clear, complete, visual streaming data dictionary is essential for effective stream data applications.
Productized streaming capabilities (e.g., Spark Streaming, Flink) wrapped by Ark simplify configuration and usage.
Multi‑scenario adaptability enables applications ranging from log analysis to AI‑driven real‑time user profiling.
Thank you for attending; the talk covered Beike's three platforms (DataBus, Ark, and Tianyan) and practical experience with streaming data processing.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.