
Overview of Core Technologies in a Big Data Platform Architecture

This article explains the main layers of a typical big data platform—data collection, storage and analysis, sharing, and application—detailing common tools such as Flume, DataX, Hive, Spark, SparkSQL, Impala, and Spark Streaming, and discusses task scheduling and monitoring in the ecosystem.

Architecture Digest

We start by looking at a typical big data platform architecture diagram used by many companies.

The core layers of a big data platform are, in order: the data collection layer, the data storage and analysis layer, the data sharing layer, and the data application layer. Names vary from company to company, but the roles are essentially the same.

1. Data Collection

The task of data collection is to gather data from various sources and store it in the data storage, possibly performing simple cleaning.

Data source types are numerous:

Website logs: In the internet industry, website logs occupy the largest share. Logs are stored on multiple log servers, each running a Flume agent to collect logs in real time and store them in HDFS.
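As an illustration, a minimal Flume agent of this kind might be configured as follows (the agent name, log path, and HDFS address are all hypothetical; the source/channel/sink keys are standard Flume configuration):

```properties
# Hypothetical agent "a1": tail the web access log and write events to HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow the access log as new lines are appended
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/nginx/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: roll files into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/website/%Y%m%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

One such agent runs on each log server; all of them can point at the same date-partitioned HDFS path.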

Business databases: MySQL, Oracle, SQL Server, and the like. A tool is needed to sync data from these databases to HDFS. Sqoop works but is heavyweight: it launches a MapReduce job for every transfer regardless of data volume. DataX (open-sourced by Taobao) is a lighter alternative and can be extended with custom plugins if development resources allow.
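A DataX job is described by a single JSON file that pairs a reader with a writer. A sketch for syncing a MySQL table to HDFS might look like this (connection details, table, and column names are hypothetical; `mysqlreader` and `hdfswriter` are standard DataX plugins):

```json
{
  "job": {
    "setting": { "speed": { "channel": 3 } },
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "etl",
          "password": "******",
          "column": ["id", "user_id", "amount", "created_at"],
          "connection": [{
            "table": ["orders"],
            "jdbcUrl": ["jdbc:mysql://db-host:3306/shop"]
          }]
        }
      },
      "writer": {
        "name": "hdfswriter",
        "parameter": {
          "defaultFS": "hdfs://namenode:8020",
          "path": "/warehouse/ods/orders",
          "fileName": "orders",
          "fileType": "orc",
          "writeMode": "append",
          "fieldDelimiter": "\t",
          "column": [
            {"name": "id", "type": "BIGINT"},
            {"name": "user_id", "type": "BIGINT"},
            {"name": "amount", "type": "DOUBLE"},
            {"name": "created_at", "type": "STRING"}
          ]
        }
      }
    }]
  }
}
```

Swapping the reader or writer plugin (e.g. an FTP reader, or an RDBMS writer for the sharing layer) reuses the same job structure.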

Ftp/Http data sources: Some partners provide data via FTP/HTTP that needs periodic fetching; DataX can also handle this.

Other data sources: Manually entered data can be provided through a simple API or mini‑program.

Flume can also be configured or developed to sync data from databases to HDFS in real time.

2. Data Storage and Analysis

Undoubtedly, HDFS is the most suitable storage solution for a data warehouse/platform in a big data environment.

For offline data analysis and computation (where real‑time requirements are low), Hive is the primary choice because of its rich data types, built‑in functions, high‑compression ORC file format, and convenient SQL support, which makes it far more efficient than writing raw MapReduce jobs.
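To make this concrete, here is a sketch of the kind of Hive table and batch query involved (the database, table, and column names are made up for illustration):

```sql
-- Hypothetical fact table stored as ORC, partitioned by day
CREATE TABLE IF NOT EXISTS dw.page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Daily PV/UV per URL: the kind of aggregation that would
-- otherwise require a hand-written MapReduce job
SELECT url,
       COUNT(*)                AS pv,
       COUNT(DISTINCT user_id) AS uv
FROM dw.page_views
WHERE dt = '2015-08-01'
GROUP BY url;
```

The partition column keeps daily queries from scanning the whole table, and ORC with compression keeps storage costs down.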

Hadoop also provides a MapReduce interface for those who prefer Java programming over SQL.

Spark, which has become very popular in recent years, offers significantly better performance than MapReduce and integrates well with Hive and YARN, allowing Spark or SparkSQL to be used for analysis without deploying a separate Spark cluster.

3. Data Sharing

Data sharing refers to the storage location for results after analysis and computation, typically relational databases or NoSQL databases.

Since business applications cannot directly read from HDFS, a data sharing layer is needed to synchronize results from HDFS to other target data stores; DataX can also fulfill this role.

Some real‑time computation results may be written directly to the sharing layer by the real‑time computation module.

4. Data Application

Business products (CRM, ERP, etc.) – consume data from the sharing layer directly.

Reports (FineReport, business reports) – also use pre‑aggregated data stored in the sharing layer.

Ad‑hoc queries – users such as developers, operations staff, analysts, or managers need to query raw data directly from the storage layer when existing reports do not satisfy their needs.

Ad‑hoc queries are usually performed with SQL. Hive can be slow, so SparkSQL is preferred for its faster response while remaining compatible with Hive.

Impala is another option if adding another framework to the platform is acceptable.

OLAP – Many OLAP tools cannot read directly from HDFS and rely on syncing data to relational databases, which does not scale to massive data volumes. Custom development is therefore required to fetch data from HDFS or HBase for OLAP purposes.

Other data interfaces – generic or customized. For example, a generic interface to fetch user attributes from Redis can be used by all business services.

5. Real‑Time Computing

Business demands for real‑time data are increasing (e.g., real‑time website traffic, ad exposure/click statistics). Traditional databases cannot handle the required throughput and latency; a distributed, high‑throughput, low‑latency, reliable framework is needed. Storm is mature, but Spark Streaming is chosen here to avoid adding another framework, and its latency is comparable.

Our current implementation uses Spark Streaming to provide real‑time website traffic statistics and real‑time ad effectiveness statistics.

Flume collects website and ad logs from front‑end log servers, streams them to Spark Streaming, which processes the data and stores the results in Redis; business services retrieve real‑time data from Redis.
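Spark Streaming processes the incoming logs in small batches. The per-batch aggregation it performs can be sketched in plain Python (the log format, the key naming, and the dict standing in for Redis are all illustrative, not the production code):

```python
from collections import Counter

def process_batch(log_lines, store):
    """Aggregate one micro-batch of ad logs and merge the counts
    into a Redis-like key/value store (a plain dict here)."""
    exposures = Counter()
    clicks = Counter()
    for line in log_lines:
        # Illustrative log format: "<timestamp>\t<ad_id>\t<event>"
        _, ad_id, event = line.split("\t")
        if event == "exposure":
            exposures[ad_id] += 1
        elif event == "click":
            clicks[ad_id] += 1
    # Merge batch counts into running totals, as HINCRBY would in Redis
    for ad_id, n in exposures.items():
        store[f"ad:{ad_id}:exposure"] = store.get(f"ad:{ad_id}:exposure", 0) + n
    for ad_id, n in clicks.items():
        store[f"ad:{ad_id}:click"] = store.get(f"ad:{ad_id}:click", 0) + n
    return store

store = {}
batch = [
    "1438819200\tad42\texposure",
    "1438819201\tad42\texposure",
    "1438819202\tad42\tclick",
    "1438819203\tad7\texposure",
]
process_batch(batch, store)
```

Because totals are merged incrementally, business services reading Redis always see counts that are at most one micro-batch behind real time.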

6. Task Scheduling and Monitoring

A data warehouse/platform contains many tasks such as data collection, synchronization, and analysis.

These tasks have complex dependencies (e.g., analysis must wait for collection to finish). A comprehensive scheduling and monitoring system is required to orchestrate and supervise all tasks.
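At its core, the dependency handling in such a scheduler amounts to topologically ordering a task DAG. A minimal sketch in Python (the task names are made up; a real scheduler adds triggers, retries, and alerting on top of this):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical task DAG: each task maps to the set of tasks it depends on
deps = {
    "collect_logs": set(),
    "sync_mysql": set(),
    "hive_analysis": {"collect_logs", "sync_mysql"},
    "export_to_sharing_layer": {"hive_analysis"},
}

# static_order() yields a valid execution order: every task appears
# only after all of its dependencies
order = list(TopologicalSorter(deps).static_order())
print(order)
```

A monitoring component then tracks each task's state along this order, so a failure in `collect_logs` blocks `hive_analysis` and raises an alert instead of producing incomplete results downstream.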

Source: http://lxw1234.com/archives/2015/08/471.htm

Tags: big data, data collection, real-time processing, data platform, DataX, Spark, Hadoop
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
