
Comprehensive Overview of Data Middle Platform Architecture, Components, and Practices

This article provides an extensive summary of data middle platform concepts, covering data aggregation, collection tools, offline and real‑time development, data governance, service layers, warehouse construction, and operational practices, illustrating how enterprises build and manage a unified data ecosystem.

Architects' Tech Alliance
This article summarizes the theory of the data middle platform, referencing "Data Middle Platform".

Data Middle Platform

Data Aggregation

Data aggregation is a core capability of the data middle platform: it collects data from heterogeneous sources across the network into centralized storage for downstream processing and modeling. Aggregation methods include database synchronization, embedded tracking (event instrumentation), web crawling, and message queues; by timeliness, they can be classified as offline batch aggregation or real‑time collection.

Data Collection Tools

• Canal – parses MySQL binlogs to capture incremental changes for real‑time database synchronization.
• DataX – Alibaba's batch synchronization tool for moving data between heterogeneous data stores.
• Sqoop – transfers bulk data between Hadoop and relational databases.

Data Development

The data development module serves developers and analysts, offering offline, real‑time, and algorithm development tools.

Offline Development

Job Scheduling

• Dependency scheduling: a job starts only after all of its parent jobs have completed. For example, Job B can be scheduled only after Jobs A and C finish.
• Time scheduling: a job can be set to start at a specific time, e.g., Job B starts after 05:00.
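The two scheduling modes can be combined: a job is runnable only when every parent has finished and its time gate has passed. A minimal sketch, with illustrative job names and a hypothetical `runnable` check:

```python
# Hypothetical sketch: a job becomes runnable only when all parent jobs
# have completed and (optionally) its scheduled start time has passed.
from datetime import time

# parents[job] lists the jobs it depends on (illustrative names).
parents = {"A": [], "C": [], "B": ["A", "C"]}
start_after = {"B": time(5, 0)}  # B may only start after 05:00

def runnable(job, completed, now):
    """Return True if every parent finished and the time gate has passed."""
    deps_done = all(p in completed for p in parents[job])
    gate = start_after.get(job)
    return deps_done and (gate is None or now >= gate)

# B waits until both A and C are done and it is past 05:00.
print(runnable("B", {"A"}, time(6, 0)))        # False: C not finished
print(runnable("B", {"A", "C"}, time(4, 0)))   # False: before 05:00
print(runnable("B", {"A", "C"}, time(6, 0)))   # True
```

A real scheduler would persist job states and evaluate this check on every state change rather than polling.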

Baseline Control

In long‑running big‑data offline jobs, predictive algorithms estimate completion times; when a job cannot finish on time, the scheduler alerts operations staff for early intervention.
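The simplest form of such a prediction is linear extrapolation from observed progress; production systems use richer models, but the alerting logic looks roughly like this sketch (function names and the minute-based clock are assumptions):

```python
# Hypothetical sketch of baseline control: project a job's finish time from
# its observed progress rate and alert if it would miss the baseline deadline.
def projected_finish(start_min, now_min, progress):
    """Linear extrapolation: elapsed / progress = expected total runtime."""
    elapsed = now_min - start_min
    return start_min + elapsed / progress

def needs_alert(start_min, now_min, progress, deadline_min):
    return projected_finish(start_min, now_min, progress) > deadline_min

# Job started at minute 0; at minute 120 it is 40% done, deadline is minute 240.
# Projected finish = 300 > 240, so operations staff are alerted early.
print(needs_alert(0, 120, 0.4, 240))  # True
```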

Heterogeneous Storage

Enterprise storage engines are diverse. The offline development center builds specific plugins for each engine (e.g., Oracle plugin, Hive/Spark/MR plugins for Hadoop). Users create jobs via the UI, and the system automatically selects the appropriate plugin at execution time.

Code Validation

SQL tasks undergo strict pre‑execution checks to detect issues early.

Multi‑Environment Cascading

Supports various environment needs with isolated Hive databases, YARN queues, or even separate Hadoop clusters. Environments include:

• Single environment: only one production environment.
• Classic environment: development with masked data, production with real data.
• Complex environment: external users work against a masked environment; after testing, models are promoted to internal development.

Recommended Dependencies

As the business deepens, developers must manage a growing number of jobs. The system helps locate upstream jobs and avoid circular dependencies by analyzing table‑level lineage graphs, performing loop detection, and returning lists of suitable candidate nodes.
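The loop detection mentioned above is a standard graph problem. A minimal sketch using depth‑first search with three‑color marking (table names are illustrative):

```python
# Hypothetical sketch: detect circular dependencies in a table-level lineage
# graph with a depth-first search (white/grey/black coloring).
def has_cycle(graph):
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY                  # on the current DFS path
        for nxt in graph.get(node, []):
            if color[nxt] == GREY:          # back edge: cycle found
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK                 # fully explored, no cycle below
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

lineage = {"ods_orders": ["dwd_orders"], "dwd_orders": ["dws_sales"], "dws_sales": []}
print(has_cycle(lineage))                   # False
lineage["dws_sales"] = ["ods_orders"]       # introduce a loop
print(has_cycle(lineage))                   # True
```

Before recommending a new dependency, the system can tentatively add the edge, run this check, and reject the suggestion if a cycle would appear.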

Data Permissions

Multiple engines have separate permission systems (e.g., Oracle, HANA, LibrA), making permission requests cumbersome. Strategies include:

• RBAC (Role‑Based Access Control) – e.g., Cloudera Sentry, Huawei FI.
• PBAC (Policy‑Based Access Control) – e.g., Hortonworks Ranger.

Permissions are usually managed by big‑data or database ops staff; developers request access through a centralized permission‑management portal, which records approvals for auditing.
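In the RBAC model, a developer's effective privileges are the union of the privileges granted to each of their roles, and every decision is recorded for auditing. A minimal sketch with illustrative roles and tables (all names are assumptions, not any vendor's API):

```python
# Hypothetical RBAC sketch: effective privileges = union over the user's roles;
# each access decision is appended to an audit trail.
role_privs = {
    "analyst": {("dws_sales", "SELECT")},
    "etl_dev": {("ods_orders", "SELECT"), ("dwd_orders", "INSERT")},
}
user_roles = {"alice": {"analyst", "etl_dev"}}
audit_log = []

def allowed(user, table, action):
    granted = set().union(*(role_privs[r] for r in user_roles.get(user, set())))
    decision = (table, action) in granted
    audit_log.append((user, table, action, decision))  # auditable trail
    return decision

print(allowed("alice", "dwd_orders", "INSERT"))  # True
print(allowed("alice", "dwd_orders", "DROP"))    # False
```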

Real‑Time Development

• Metadata management
• SQL‑driven development
• Component‑based development

Intelligent Operations

Integrated tools for job management, code deployment, operations, monitoring, and alerting improve efficiency. Features include re‑run, downstream re‑run, and data back‑fill.

Data System

With data aggregation and development modules, the middle platform provides core data‑warehouse capabilities, enabling a comprehensive enterprise data system characterized by full‑domain coverage, clear hierarchical structure, consistent accuracy, performance improvement, cost reduction, and ease of use across industries (real estate, securities, retail, manufacturing, media, etc.).

ODS Layer (Raw Data)

Collects source system data, preserving original business process information with minimal transformation; supports incremental sync with delta tables for large datasets.
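The core of incremental sync is merging a day's delta table into the full snapshot by primary key, with the newer record winning. A minimal in‑memory sketch of that merge (in practice this is done in SQL, e.g. a Hive full‑outer join or `MERGE`):

```python
# Hypothetical sketch of incremental sync: merge today's delta table into the
# full snapshot by primary key, letting the newer record win.
def merge_delta(base, delta, key="id"):
    merged = {row[key]: row for row in base}
    merged.update({row[key]: row for row in delta})  # delta overwrites base
    return sorted(merged.values(), key=lambda r: r[key])

base = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
delta = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]  # update + insert
print(merge_delta(base, delta))
# [{'id': 1, 'amount': 10}, {'id': 2, 'amount': 25}, {'id': 3, 'amount': 30}]
```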

Unified Data Warehouse Layer (DW)

Includes detailed (DWD) and summary (DWS) layers, reorganizing source data into standardized metrics and dimensions for unified business reporting.

Application Data Layer (ADS)

Extracts data from DW/TDM to serve specific business needs, providing tailored datasets for downstream applications.

Data Asset Management

Manages catalogs, metadata, quality, lineage, and lifecycle, presenting assets visually to enhance data awareness and support value‑driven applications.

Data Governance

Covers standard management, metadata, quality, security, and lifecycle governance.

Data Service System

Transforms data into service capabilities, exposing APIs for query and analysis.

Query Service

Accepts query conditions and returns data via API, supporting indexed identifiers, filter items, sorting, and pagination.
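The filter/sort/paginate pipeline behind such an API can be sketched in a few lines (the `query` function and its parameters are illustrative, not the platform's actual interface):

```python
# Hypothetical sketch of a query service: apply equality filters, sorting,
# and pagination to a result set before returning it through the API.
def query(rows, filters=None, sort_by=None, desc=False, page=1, size=10):
    out = [r for r in rows if all(r.get(k) == v for k, v in (filters or {}).items())]
    if sort_by:
        out.sort(key=lambda r: r[sort_by], reverse=desc)
    start = (page - 1) * size           # pages are 1-indexed
    return out[start:start + size]

rows = [{"city": "SH", "uv": 5}, {"city": "BJ", "uv": 9}, {"city": "SH", "uv": 7}]
print(query(rows, filters={"city": "SH"}, sort_by="uv", desc=True, page=1, size=1))
# [{'city': 'SH', 'uv': 7}]
```

A production service would push the filtering and sorting down to an indexed store rather than scanning rows in memory.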

Analysis Service

Provides high‑performance multi‑source analysis (Hive, ES, Greenplum, MySQL, Oracle, files) with instant queries, multi‑dimensional analysis, and flexible business integration.

Recommendation Service

Delivers personalized recommendations by mining user‑item behavior, supporting industry‑specific logic and various scenarios (cold start, active browsing), with continuous model optimization.
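One simple form of user‑item behavior mining is item co‑occurrence: items frequently seen together in user histories are recommended alongside what the user already interacted with. A tiny sketch under that assumption (the article does not specify the platform's actual algorithm):

```python
# Hypothetical sketch of item-based recommendation: count item co-occurrence
# across user histories, then recommend the items most often seen together
# with what the user already interacted with.
from collections import Counter
from itertools import combinations

histories = [["tv", "soundbar"], ["tv", "soundbar", "hdmi"], ["tv", "hdmi"]]

cooc = Counter()
for items in histories:
    for a, b in combinations(sorted(set(items)), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(seen, k=2):
    scores = Counter()
    for s in seen:
        for (a, b), c in cooc.items():
            if a == s and b not in seen:    # never re-recommend seen items
                scores[b] += c
    return [item for item, _ in scores.most_common(k)]

print(recommend({"tv"}))  # e.g. ['soundbar', 'hdmi']
```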

Crowd‑Targeting Service

Filters users based on tag combinations, supports audience sizing, and integrates with multiple channels (SMS, email, marketing platforms).
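Tag‑combination filtering is set algebra over per‑tag audiences; audience sizing is just the cardinality of the result. A minimal sketch with illustrative tags and user IDs:

```python
# Hypothetical sketch of crowd targeting: combine tag audiences with set
# algebra (AND of included tags, minus excluded tags) and report the size.
tag_index = {
    "new_user":   {"u1", "u2", "u3"},
    "high_value": {"u2", "u3", "u4"},
    "churn_risk": {"u3", "u5"},
}

def audience(include_all, exclude=()):
    # include_all must name at least one tag
    users = set.intersection(*(tag_index[t] for t in include_all))
    for t in exclude:
        users -= tag_index[t]
    return users

crowd = audience(include_all=["new_user", "high_value"], exclude=["churn_risk"])
print(sorted(crowd), len(crowd))  # ['u2'] 1
```

At scale the same algebra is typically run over bitmap indexes (e.g. roaring bitmaps) rather than Python sets.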

Offline Platform

Includes product function diagram, scheduling module, overall architecture, FTP‑based task dependency, and diagnostic platform "Huatuo" for task analysis.

Real‑Time Platform

Meituan‑Dianping

Uses Grafana for embedded monitoring.

Bilibili

Features SQL‑based programming, DAG drag‑and‑drop, integrated operations; built on BSQL, YARN, Flink, Kafka, HBase, Redis, RocksDB, and supports AI, search, recommendation, and real‑time ETL scenarios.

NetEase

Real‑time stream processing covers advertising, e‑commerce, search, and recommendation workloads.

Event Management

Coordinates Server (request initiator), Kernel (executor), and Admin (result confirmer) modules to ensure reliable distributed task execution with high availability.

Platform Task State Management

Server handles initial state; Admin manages YARN‑related interactions.

Task Debugging

SQL tasks support debugging with custom CSV inputs; sloth‑server assembles requests, invokes kernels, and collects logs.

Log Retrieval

Filebeat ships logs to Kafka, Logstash parses them, and Elasticsearch stores them for Kibana visualization and search.

Monitoring

Uses InfluxDB metric‑report component and NetEase‑built NTSDB for metric collection, viewable via Grafana and alerting modules.

Alerting

Sloth stream platform supports failure, latency, and custom rule alerts, delivering notifications through internal chat, email, phone, or SMS.

Real‑Time Data Warehouse

Collects logs and event data into Kafka, processes them in real‑time, extracts ODS details, aggregates into Redis, Kudu, etc., and serves data via APIs to front‑end applications.
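The aggregation step of that pipeline can be sketched with an in‑memory stream standing in for Kafka and a dict standing in for Redis (both stand‑ins are assumptions for illustration; real code would use consumer and client libraries):

```python
# Hypothetical sketch of the real-time path: consume events (an in-memory
# list standing in for a Kafka topic), roll them up into per-minute counters,
# and keep the counters in a dict standing in for Redis.
from collections import defaultdict

redis_like = defaultdict(int)  # key: (metric, minute) -> count

def consume(events):
    for e in events:                      # each event is an ODS-level detail
        minute = e["ts"] // 60            # truncate timestamp to the minute
        redis_like[(e["metric"], minute)] += 1

consume([
    {"metric": "page_view", "ts": 61},
    {"metric": "page_view", "ts": 65},
    {"metric": "order", "ts": 62},
])
print(redis_like[("page_view", 1)])  # 2
```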

E‑Commerce Applications – Data Analysis

Real‑time activity analysis, homepage resource analysis, funnel metrics, and profit calculations.

E‑Commerce Applications – Search & Recommendation

Handles real‑time user footprints, features, CTR/CVR modeling, homepage carousel, and activity selection with UV/PV statistics.

Offline vs. Real‑Time Data Warehouse

Building an Offline Warehouse

Defines a data warehouse, following Inmon, as a subject‑oriented, integrated, time‑variant, non‑volatile collection of data in support of decision making. Goals include building data assets and supplying decision information. ETL bridges offline and real‑time pipelines, enabling data flow across layers.

ETL

Supports diverse sources (text, logs, RDBMS, NoSQL) using tools like DataX, Sqoop, Kettle, Informatica, ensuring scheduled, non‑blocking data sync.

Layered Architecture

ODS (raw), Stage (buffer), DWD (detail), DIM (dimension), DW (fact), DM (application) layers each serve specific processing and storage purposes.

Code Standards

Enforces script header comments, naming conventions, and field naming consistency across models.

Differences Between Offline and Real‑Time Warehouses

Offline warehouses use Sqoop/DataX/Hive for T+1 data with daily batch jobs; real‑time warehouses ingest raw data via Canal into Kafka, store in OLAP systems like HBase, and provide minute‑level or sub‑second query capabilities.

Data Middle Platform Solutions

Industry‑specific implementations (retail, securities, etc.) with metrics such as RPS (Revenue Per Search) and ROI (Return on Investment).

Disclaimer: Thanks to the original author for the content. If there are copyright issues, please contact us.

Recommended Reading: For more architecture‑related knowledge, refer to the "Architect Technical Alliance Library" (32 books) and obtain the original article.
