
Big Data Platform Architecture: Core Layers, Technologies, and Practices

This article outlines a typical big data platform architecture. It covers the four core layers (data acquisition, storage and analysis, sharing, and application) together with real-time computation and task scheduling, and introduces key technologies such as Flume, HDFS, Hive, Spark, and DataX, along with monitoring considerations.

IT Architects Alliance

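We start with a typical big data platform architecture used by many companies. It consists of four core layers: data acquisition, data storage and analysis, data sharing, and data application. Real-time computation and task scheduling and monitoring sit alongside these layers and are covered in sections 5 and 6.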

1. Data Acquisition

The acquisition layer collects data from various sources and stores it in the data storage layer, often performing light cleaning. Common sources include website logs, business databases (MySQL, Oracle, SQL Server), FTP/HTTP feeds, and manually entered data via simple interfaces.

Website logs are usually collected by deploying Flume agents on the log servers, which stream the logs into HDFS. For relational databases, Sqoop was the traditional choice, but it is heavyweight: every transfer runs as a MapReduce job, whatever the data volume. Alibaba's open-source DataX is a lighter option for syncing data to HDFS. DataX can also pull data from FTP/HTTP sources or custom APIs.
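
To make the database-to-HDFS sync more concrete, here is a minimal sketch that builds a DataX-style job description in Python and prints it as JSON. The plugin names `mysqlreader` and `hdfswriter` and the parameter keys follow the layout of DataX's published job templates as best I recall them; check them against the plugin documentation for your DataX version. The host names, credentials, table, and paths are all hypothetical placeholders.

```python
import json


def build_mysql_to_hdfs_job(jdbc_url, table, columns, hdfs_path, channels=3):
    """Return a DataX-style job description as a plain dict.

    `columns` is a list of (name, hive_type) pairs. Every connection value
    here is a placeholder chosen for illustration.
    """
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": "etl_user",   # hypothetical credentials
            "password": "********",
            "column": [name for name, _ in columns],
            "connection": [{"table": [table], "jdbcUrl": [jdbc_url]}],
        },
    }
    writer = {
        "name": "hdfswriter",
        "parameter": {
            "defaultFS": "hdfs://namenode:8020",  # hypothetical NameNode address
            "fileType": "orc",
            "path": hdfs_path,
            "fileName": table,
            "column": [{"name": n, "type": t} for n, t in columns],
            "writeMode": "append",
            "fieldDelimiter": "\t",
        },
    }
    return {
        "job": {
            # "channel" controls how many parallel transfer channels are used.
            "setting": {"speed": {"channel": channels}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


job = build_mysql_to_hdfs_job(
    jdbc_url="jdbc:mysql://db-host:3306/shop",
    table="orders",
    columns=[("id", "bigint"), ("amount", "double"), ("created_at", "string")],
    hdfs_path="/warehouse/ods/orders",
)
job_json = json.dumps(job, indent=2)
print(job_json)
```

The printed JSON would be saved to a file and handed to the DataX launcher. Generating the file from code, rather than writing it by hand, keeps the reader and writer column lists in step when many tables are synced.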

2. Data Storage and Analysis

HDFS is the de facto storage layer for big data platforms. Offline analysis that does not need low latency is typically done with Hive, which offers rich data types, built-in functions, the highly compressed ORC file format, and convenient SQL support. An aggregation that would take hundreds of lines of hand-written MapReduce code can often be expressed as a single query.
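
The gap between hand-written MapReduce and SQL can be illustrated on a small scale. The sketch below counts page views per URL twice: once with explicit map, shuffle, and reduce phases, and once with a single `GROUP BY` query. It uses Python's built-in `sqlite3` purely as a stand-in for Hive so that it runs anywhere; the table name `page_views` and the sample rows are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (url, user_id) pairs.
page_views = [
    ("/home", "u1"),
    ("/home", "u2"),
    ("/cart", "u1"),
    ("/home", "u3"),
    ("/cart", "u2"),
]


def mapreduce_count(records):
    """Count views per URL with explicit map, shuffle, and reduce phases."""
    # Map: emit (url, 1) for every record.
    mapped = [(url, 1) for url, _user in records]
    # Shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for url, one in mapped:
        grouped[url].append(one)
    # Reduce: sum the values of each key.
    return {url: sum(ones) for url, ones in grouped.items()}


def sql_count(records):
    """Compute the same result with one SQL query (sqlite3 stands in for Hive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (url TEXT, user_id TEXT)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT url, COUNT(*) AS pv FROM page_views GROUP BY url"
    ).fetchall()
    conn.close()
    return dict(rows)


mr_result = mapreduce_count(page_views)
sql_result = sql_count(page_views)
print(mr_result == sql_result, sql_result)
```

Both functions return the same counts. In a real Hadoop job the map, shuffle, and reduce phases would also need job setup, serialization, and input/output handling, which is where the extra code volume comes from.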

For faster processing, Spark (with Spark SQL) is preferred over MapReduce. Spark runs on YARN, so it can share the existing Hadoop cluster's resources instead of requiring a separate Spark cluster. When low-latency SQL queries are needed, Spark SQL or Impala can be used.

3. Data Sharing

After analysis, results reside in HDFS but need to be accessible to downstream applications. Data sharing is achieved by synchronizing processed data from HDFS to relational or NoSQL stores (e.g., MySQL, HBase, Redis) using tools like DataX.

4. Data Application

Business systems (CRM, ERP), reporting tools (FineReport), and ad-hoc queries consume data from the sharing layer. Ad-hoc queries often need direct access to the storage layer instead; Spark SQL serves them better than Hive because it responds faster.

5. Real‑Time Computation

Growing business demand for real-time insights (site traffic, ad impressions) leads to the adoption of streaming frameworks. Although Storm is mature, Spark Streaming is chosen here because Spark is already part of the platform, which avoids introducing another framework. Its micro-batch latency is higher than Storm's but acceptable for these use cases.

Log data is collected by Flume, streamed to Spark Streaming, processed in real time, and the results are stored in Redis for fast retrieval by business services.
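
Spark Streaming processes a stream as a sequence of small batches. The following plain-Python sketch imitates that micro-batch pattern without using the Spark API: events are grouped into fixed-length batches, each batch is aggregated, and running totals are written to a dictionary that stands in for Redis. The event format `(timestamp, url)`, the `pv:` key prefix, and the 5-second interval are assumptions made for illustration.

```python
from collections import Counter


def micro_batches(events, batch_seconds):
    """Group (timestamp, url) events into consecutive fixed-length batches.

    Events are assumed to arrive sorted by integer timestamp. Yields
    (batch_start, [urls]) for every batch that contains at least one event.
    """
    batch_start, batch = None, []
    for ts, url in events:
        start = ts - ts % batch_seconds  # start of the batch this event falls in
        if batch_start is not None and start != batch_start:
            yield batch_start, batch
            batch = []
        batch_start = start
        batch.append(url)
    if batch:
        yield batch_start, batch


def run_stream(events, batch_seconds, store):
    """Aggregate each batch and add the counts to running totals in `store`."""
    for _batch_start, urls in micro_batches(events, batch_seconds):
        for url, n in Counter(urls).items():
            key = f"pv:{url}"  # hypothetical key layout
            store[key] = store.get(key, 0) + n
    return store


events = [(0, "/home"), (1, "/home"), (4, "/cart"),
          (5, "/home"), (9, "/cart"), (10, "/home")]
store = run_stream(events, batch_seconds=5, store={})
print(store)
```

With a real Redis instance, the dictionary update would become an atomic increment on the same key, so that business services always read a consistent running total.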

6. Task Scheduling and Monitoring

A comprehensive scheduling and monitoring system is essential for orchestrating the many jobs in a data platform: data ingestion, synchronization, analysis, and real-time processing. These jobs have complex dependencies; for example, an analysis job must wait for its ingestion job to finish.

Such a system acts as the central hub, ensuring tasks are dispatched, monitored, and retried as needed.
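
The core of such a scheduler, dependency ordering plus retries, can be sketched in a few lines. This example uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The task names, the retry limit, and the simulated failure are hypothetical; a production scheduler would add persistence, time-based triggers, parallel execution, and alerting.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each failed task.

    tasks: dict mapping task name -> zero-argument callable.
    dependencies: dict mapping task name -> set of tasks it must wait for.
    Returns the task names in the order they completed.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        # One initial attempt plus up to max_retries retries.
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception as exc:
                print(f"{name} failed on attempt {attempt}: {exc}")
        else:
            # All attempts failed: stop so downstream tasks never run.
            raise RuntimeError(f"{name} failed after {max_retries} retries")
    return completed


attempts = {"sync": 0}


def flaky_sync():
    """Simulated sync job that fails once and then succeeds."""
    attempts["sync"] += 1
    if attempts["sync"] == 1:
        raise IOError("target database unavailable")


tasks = {
    "ingest": lambda: None,
    "analyze": lambda: None,
    "sync": flaky_sync,
}
# analyze waits for ingest; sync waits for analyze.
dependencies = {"ingest": set(), "analyze": {"ingest"}, "sync": {"analyze"}}
order = run_pipeline(tasks, dependencies)
print(order)
```

Raising on exhausted retries is a deliberate choice in this sketch: it guarantees that a job never runs on top of missing or partial upstream data.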

Original source: https://blog.csdn.net/yuanziok/article/details/117030031
