Architecture and Design of the Home Data Integration Governance Platform

The article describes the background, architecture, and design principles of a unified big‑data scheduling and data‑exchange platform, detailing its data ingestion “direct‑train”, centralized scheduling engine, and DataX‑based data‑exchange components along with monitoring, alerting, and security features.

HomeTech

Dec 12, 2019

The rapid growth of the Home business and the variety of big-data frameworks and task types in use (MapReduce, Hive, Java, Shell, Python) created a need for a unified platform that can handle diverse tasks, flexible dependencies, and heterogeneous data sources.

Before the unified scheduling platform existed, tasks were triggered on timers by Azkaban and Crontab, and data synchronization used Sqoop. Sqoop could not accommodate the many source systems in use (MySQL, SQL Server, Oracle, MongoDB, Hive, Elasticsearch, HBase, FTP, and others), which led to high manual development costs and low efficiency.

The Home Data Integration Governance Platform was built to address these issues. It consists of three main modules: Data Direct-Train, Data Exchange, and Distributed Task Scheduling.

Data Direct‑Train

This module is the entry point for business data ingestion. From table metadata it automatically generates data-exchange task configurations and scheduling jobs, and it creates the corresponding Hive tables and FDM processing scripts.
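As a rough illustration of metadata-driven generation, the sketch below derives a Hive DDL statement and a task configuration from a table description. The metadata layout, the type mapping, the `ods` database name, and every configuration key are assumptions made for this example; they are not the platform's actual schema or DataX's job format.

```python
# Hypothetical table metadata; field names are illustrative only.
TABLE_META = {
    "source_db": "shop",
    "table": "orders",
    "columns": [
        {"name": "id", "type": "bigint"},
        {"name": "amount", "type": "decimal(10,2)"},
        {"name": "created_at", "type": "datetime"},
    ],
}

# Simplified, assumed mapping from MySQL types to Hive types.
HIVE_TYPES = {"bigint": "BIGINT", "int": "INT", "datetime": "TIMESTAMP"}


def hive_type(mysql_type: str) -> str:
    """Map a MySQL column type to a Hive type, defaulting to STRING."""
    base = mysql_type.split("(")[0].lower()
    if base == "decimal":
        return mysql_type.upper()  # keep precision and scale
    return HIVE_TYPES.get(base, "STRING")


def build_hive_ddl(meta: dict, hive_db: str = "ods") -> str:
    """Render a CREATE TABLE statement for the target Hive table."""
    cols = ",\n".join(
        f"  `{c['name']}` {hive_type(c['type'])}" for c in meta["columns"]
    )
    return (
        f"CREATE TABLE IF NOT EXISTS {hive_db}.{meta['table']} (\n{cols}\n)\n"
        "PARTITIONED BY (dt STRING)\nSTORED AS ORC"
    )


def build_exchange_task(meta: dict, hive_db: str = "ods") -> dict:
    """Build a platform-level task description (hypothetical keys)."""
    return {
        "task_type": "DataX",
        "source": {
            "db": meta["source_db"],
            "table": meta["table"],
            "columns": [c["name"] for c in meta["columns"]],
        },
        "target": {"hive_table": f"{hive_db}.{meta['table']}"},
        "trigger": "timed",
    }


ddl = build_hive_ddl(TABLE_META)
task = build_exchange_task(TABLE_META)
```

Keeping the DDL and the task description derived from the same metadata means a new source table only has to be described once.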

Unified Scheduling Platform

Acting as the central hub of the big-data pipeline, the scheduler orchestrates data ingestion, processing, and exchange according to task dependencies. It ensures correct execution order and efficient resource utilization while sustaining high-throughput processing of hundreds of thousands of jobs.
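Dependency-driven ordering can be sketched with a topological sort. The example below uses Python's standard `graphlib` module (Python 3.9+) and groups jobs into waves that could run in parallel; the job names are hypothetical, and this is not the platform's scheduling engine.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each key maps to the set of upstream jobs it depends on.
DEPENDENCIES = {
    "ingest_orders": set(),
    "ingest_users": set(),
    "build_order_facts": {"ingest_orders", "ingest_users"},
    "export_report": {"build_order_facts"},
}


def execution_waves(dependencies: dict) -> list:
    """Group jobs into waves; every job in a wave has all upstreams done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(DEPENDENCIES)
# waves == [["ingest_orders", "ingest_users"],
#           ["build_order_facts"], ["export_report"]]
```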

Key features include a simple visual task-configuration interface, job history, log viewing, and version management. Jobs can be triggered in several modes (dependency, timed, manual, recovery, and self-dependency), with batch backfill and API callbacks. Supported task types include Shell, Python, Java, DataX, and Jar. The scheduler runs as a high-availability distributed cluster with real-time node monitoring, automatic failover, horizontal scaling, and resource isolation.
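One way to picture batch backfill combined with self-dependency is a loop over business dates that stops at the first failure, so a run never starts before the previous date's run has succeeded. This is a minimal sketch under that assumption; `run_job` is a hypothetical callback, and the article does not describe how the platform implements backfill.

```python
from datetime import date, timedelta
from typing import Callable, List


def backfill_dates(start: date, end: date) -> List[date]:
    """Return every business date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


def run_backfill(start: date, end: date,
                 run_job: Callable[[date], bool]) -> List[date]:
    """Run the job date by date; stop at the first failed date.

    Stopping early models self-dependency: the run for day N is only
    attempted after the run for day N-1 has succeeded.
    """
    completed = []
    for business_date in backfill_dates(start, end):
        if not run_job(business_date):
            break
        completed.append(business_date)
    return completed


# Example: a stand-in job that fails on the 3rd of the month.
done = run_backfill(date(2019, 12, 1), date(2019, 12, 4),
                    lambda d: d.day != 3)
```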

The platform also provides monitoring dashboards, SLA task boards, and alerting via DingTalk, phone, email, and SMS. Integrated data-source management hides sensitive information and supports data subscription, workflow integration, and unified permission control across tasks, tables, and data sources.

Data Exchange Platform

Built on DataX, this component acts as a bridge for heterogeneous data sources across departments, offering plug‑in extensions and seamless integration with the scheduler. It transforms complex mesh‑like synchronization topologies into a star‑shaped architecture, allowing new sources to be added by simply connecting them to DataX.
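The benefit of the star topology can be put in numbers. The sketch below assumes the worst case, in which every source type must be able to sync to every other one, and counts directed point-to-point connectors against the one reader plus one writer per source type that a hub needs. The function names are illustrative.

```python
def mesh_connectors(n_sources: int) -> int:
    """Directed point-to-point sync paths when every pair is connected."""
    return n_sources * (n_sources - 1)


def star_plugins(n_sources: int) -> int:
    """One reader and one writer per source type around a central hub."""
    return 2 * n_sources


# The eight source types named in the article:
# MySQL, SQL Server, Oracle, MongoDB, Hive, Elasticsearch, HBase, FTP.
n = 8
mesh = mesh_connectors(n)  # 56
star = star_plugins(n)     # 16
```

Adding a ninth source type would add 16 connectors to the full mesh but only 2 plugins to the star.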

DataX follows a Framework + Plugin architecture, abstracting source reading and target writing into Reader and Writer plugins that communicate through a central Framework handling buffering, flow control, concurrency, and data transformation.
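The Reader, Framework, and Writer split can be illustrated with a small producer and consumer sketch. DataX itself is a Java framework with its own plugin interfaces; the class and function names below are hypothetical Python stand-ins that only mirror the idea of a bounded channel between a reader and a writer.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the record stream


class ListReader:
    """Hypothetical Reader plugin: emits records from an in-memory list."""

    def __init__(self, rows):
        self.rows = rows

    def read(self):
        yield from self.rows


class ListWriter:
    """Hypothetical Writer plugin: collects records in memory."""

    def __init__(self):
        self.rows = []

    def write(self, record):
        self.rows.append(record)


def run_job(reader, writer, transform=lambda r: r, buffer_size=2):
    """Move records from reader to writer through a bounded channel."""
    # The bounded queue buffers records and blocks the reader when full,
    # which acts as a simple form of flow control.
    channel = queue.Queue(maxsize=buffer_size)

    def produce():
        try:
            for record in reader.read():
                channel.put(record)
        finally:
            channel.put(_END)  # always signal completion to the consumer

    producer = threading.Thread(target=produce)
    producer.start()
    count = 0
    while True:
        record = channel.get()
        if record is _END:
            break
        writer.write(transform(record))  # transformation step in the middle
        count += 1
    producer.join()
    return count


writer = ListWriter()
moved = run_job(ListReader([1, 2, 3]), writer, transform=lambda r: r * 10)
```

Because the reader and writer only see the channel, either side can be swapped for another source or target without touching the other.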

Overall, the platform delivers a unified, automated, and scalable solution for big‑data task orchestration, data exchange, and governance within the Home ecosystem.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
