Architecture and Design of the Home Data Integration Governance Platform
This article describes the background, architecture, and design principles of a unified big-data scheduling and data-exchange platform. It covers the data ingestion module (Data Direct-Train), the centralized scheduling engine, and the DataX-based data-exchange component, along with the platform's monitoring, alerting, and security features.
The rapid growth of the Home business and the variety of big-data frameworks and task types in use (MapReduce, Hive, Java, Shell, Python) created a need for a unified platform that can handle diverse tasks, flexible dependencies, and heterogeneous data sources.
Before the unified scheduling platform existed, tasks were scheduled with Azkaban and Crontab, and data was synchronized with Sqoop. Sqoop could not cover the many source systems in use (MySQL, SQL Server, Oracle, MongoDB, Hive, Elasticsearch, HBase, FTP, and others), which led to high manual development costs and low efficiency.
The Home Data Integration Governance Platform was built to address these issues. It consists of three main modules: Data Direct-Train, Data Exchange, and Distributed Task Scheduling.
Data Direct‑Train
This module serves as the entry point for business data ingestion, automatically generating data‑exchange task configurations and scheduling jobs based on table metadata, and creating Hive tables and FDM processing scripts.
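The metadata-driven generation described above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the metadata layout, the type mapping, the `ods` database, the HDFS address, the warehouse path, and the `${username}`, `${password}`, and `${bizdate}` placeholders are all assumptions. The `mysqlreader` and `hdfswriter` parameter names follow DataX's public plugin documentation.

```python
import json

# Hypothetical MySQL -> Hive type mapping; unknown types fall back to STRING.
MYSQL_TO_HIVE = {
    "int": "INT",
    "bigint": "BIGINT",
    "decimal": "DOUBLE",
    "varchar": "STRING",
    "datetime": "STRING",
}


def hive_type(mysql_type):
    return MYSQL_TO_HIVE.get(mysql_type.lower(), "STRING")


def build_hive_ddl(meta, hive_db="ods"):
    """Render a CREATE TABLE statement for the Hive landing table."""
    cols = ",\n  ".join(
        f"`{c['name']}` {hive_type(c['type'])}" for c in meta["columns"]
    )
    return (
        f"CREATE TABLE IF NOT EXISTS {hive_db}.{meta['table']} (\n  {cols}\n)\n"
        "PARTITIONED BY (dt STRING)\n"
        "STORED AS ORC"
    )


def build_datax_job(meta, hive_db="ods", channels=3):
    """Build a DataX job dict (MySQL -> HDFS) from table metadata."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            # Credentials are placeholders, assumed to be resolved at run time.
            "username": "${username}",
            "password": "${password}",
            "column": [c["name"] for c in meta["columns"]],
            "connection": [
                {"table": [meta["table"]], "jdbcUrl": [meta["jdbc_url"]]}
            ],
        },
    }
    writer = {
        "name": "hdfswriter",
        "parameter": {
            "defaultFS": "hdfs://namenode:8020",  # assumed address
            "fileType": "orc",
            "path": f"/warehouse/{hive_db}.db/{meta['table']}/dt=${{bizdate}}",
            "fileName": meta["table"],
            "column": [
                {"name": c["name"], "type": hive_type(c["type"])}
                for c in meta["columns"]
            ],
            "writeMode": "append",
            "fieldDelimiter": "\t",
        },
    }
    return {
        "job": {
            "setting": {"speed": {"channel": channels}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


# Hypothetical table metadata, as it might come from a metadata service.
orders_meta = {
    "table": "orders",
    "jdbc_url": "jdbc:mysql://db.example.internal:3306/shop",
    "columns": [
        {"name": "id", "type": "bigint"},
        {"name": "amount", "type": "decimal"},
        {"name": "created_at", "type": "datetime"},
    ],
}

print(build_hive_ddl(orders_meta))
print(json.dumps(build_datax_job(orders_meta), indent=2))
```

Because both the DDL and the job configuration derive from the same metadata, the Hive columns and the writer columns cannot drift apart, which is the main benefit of generating them rather than writing them by hand.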
Unified Scheduling Platform
Acting as the central hub of the big-data pipeline, the scheduler orchestrates data ingestion, processing, and exchange according to task dependencies. It ensures correct execution order, efficient resource utilization, and high-throughput processing of hundreds of thousands of jobs.
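Dependency-driven execution order can be illustrated with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) and runs tasks sequentially in one process. The task names and the `run_task` callback are hypothetical, and a real distributed scheduler would dispatch ready tasks to worker nodes instead.

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(deps, run_task):
    """Run every task only after all of its upstream tasks have finished.

    deps maps each task name to the set of tasks it depends on.
    Returns the order in which tasks were run.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are cyclic
    finished = []
    while sorter.is_active():
        # get_ready() returns the tasks whose upstream tasks are all done.
        for task in sorter.get_ready():
            run_task(task)
            sorter.done(task)
            finished.append(task)
    return finished


# Hypothetical pipeline: ingestion -> FDM processing -> report -> export.
deps = {
    "ingest_orders": set(),
    "ingest_users": set(),
    "fdm_orders": {"ingest_orders"},
    "report": {"fdm_orders", "ingest_users"},
    "export_report": {"report"},
}

order = run_in_dependency_order(deps, lambda task: print("running", task))
```

The tasks returned together by `get_ready()` have no unmet dependencies on each other, so they are the natural unit to run in parallel across a cluster.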
Key features include a simple visual task configuration interface, job history, and log viewing. Tasks can be triggered in multiple modes (dependency, timed, manual, recovery, self-dependency), with batch backfill and API callbacks. The scheduler supports various task types (Shell, Python, Java, DataX, Jar) with version management, and runs as a high-availability distributed cluster with real-time node monitoring, automatic failover, horizontal scaling, and resource isolation.
The platform also provides monitoring dashboards, SLA task boards, and alerting via DingTalk, phone, email, and SMS. Integrated data-source management hides sensitive information and supports data subscription, workflow integration, and unified permission control across tasks, tables, and data sources.
Data Exchange Platform
Built on DataX, this component acts as a bridge for heterogeneous data sources across departments, offering plug‑in extensions and seamless integration with the scheduler. It transforms complex mesh‑like synchronization topologies into a star‑shaped architecture, allowing new sources to be added by simply connecting them to DataX.
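The benefit of the star topology can be made concrete with a little arithmetic. In a mesh, every ordered pair of sources needs its own synchronization path. In a star, each source needs only one reader and one writer. The sketch below assumes every source acts as both origin and destination, which is a simplification.

```python
def mesh_links(n_sources):
    """Point-to-point sync paths: one per ordered (source, target) pair."""
    return n_sources * (n_sources - 1)


def star_plugins(n_sources):
    """Star topology: one Reader plus one Writer per source."""
    return 2 * n_sources


sources = ["MySQL", "SQL Server", "Oracle", "MongoDB",
           "Hive", "Elasticsearch", "HBase", "FTP"]

n = len(sources)
print(f"{n} sources: mesh needs {mesh_links(n)} paths, "
      f"star needs {star_plugins(n)} plugins")
# Adding a ninth source costs 16 new paths in the mesh, but only 2 plugins in the star.
print(mesh_links(n + 1) - mesh_links(n), star_plugins(n + 1) - star_plugins(n))
```

The mesh grows quadratically with the number of sources while the star grows linearly, so the gap widens with every system that is onboarded.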
DataX follows a Framework + Plugin architecture, abstracting source reading and target writing into Reader and Writer plugins that communicate through a central Framework handling buffering, flow control, concurrency, and data transformation.
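The Framework + Plugin idea can be sketched in a few lines. DataX itself is written in Java, so the Python below is only a conceptual analogy: the class names, `run_job`, and the in-memory reader and writer are hypothetical. A bounded queue stands in for the Framework's buffering and flow control, and an optional function stands in for data transformation.

```python
import queue
import threading
from abc import ABC, abstractmethod

_DONE = object()  # sentinel telling the writer side that reading has ended


class Reader(ABC):
    @abstractmethod
    def read(self):
        """Yield records from the source."""


class Writer(ABC):
    @abstractmethod
    def write(self, record):
        """Persist one record to the target."""


class ListReader(Reader):
    def __init__(self, rows):
        self.rows = rows

    def read(self):
        yield from self.rows


class ListWriter(Writer):
    def __init__(self):
        self.rows = []

    def write(self, record):
        self.rows.append(record)


def run_job(reader, writer, transform=lambda r: r, buffer_size=2):
    """Move records from reader to writer through a bounded channel."""
    # put() blocks when the queue is full, so a fast reader cannot
    # overwhelm a slow writer.
    channel = queue.Queue(maxsize=buffer_size)

    def produce():
        try:
            for record in reader.read():
                channel.put(transform(record))
        finally:
            channel.put(_DONE)  # always unblock the consumer

    producer = threading.Thread(target=produce)
    producer.start()
    while True:
        record = channel.get()
        if record is _DONE:
            break
        writer.write(record)
    producer.join()


sink = ListWriter()
run_job(ListReader([1, 2, 3]), sink, transform=lambda r: r * 10)
print(sink.rows)  # [10, 20, 30]
```

The reader and writer never reference each other, only the channel. That decoupling is what lets any Reader be paired with any Writer.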
Overall, the platform delivers a unified, automated, and scalable solution for big‑data task orchestration, data exchange, and governance within the Home ecosystem.
HomeTech
HomeTech tech sharing