Evolution of Next‑Generation Cloud Data Platform Architecture
This technical presentation reviews the historical development of big data platforms, outlines the four generations of cloud data platform architectures, details the modern cloud‑native stack—including unified metadata, scheduling, and integration systems—and showcases a real‑world industrial manufacturing case with a Q&A session.
The talk, titled "The Road to Evolution of Next-Generation Cloud Data Platform Architecture" and organized by Zhejiang Shuxin Network Co., is divided into four parts: a review of big-data development, trends in cloud data platform evolution, the technical architecture of a modern cloud data platform, and practical case studies.
Big Data Development Review: Data platforms provide end-to-end lifecycle capabilities for data integration, storage, processing, analysis, and services. Their evolution is described in three stages: the traditional data era (1980s–2000, dominated by Oracle/Teradata and BI tools), the big-data era (post-2000, driven by Hadoop and vendor-specific data middle-platforms), and the cloud data era (cloud-native services such as Redshift, Snowflake, Databricks, and Alibaba MaxCompute, PAI, EMR, and DataWorks). The presenter also introduces the domestic open-source cloud data platform DataCyber.
Cloud Data Platform Evolution Trends: Four architectural generations are identified—shared-storage, massively parallel processing (MPP), Hadoop/Spark, and cloud-native. Key trends include multi-engine support (storage, stream-batch, real-time analytics), stream-batch integration, lake-warehouse convergence, cloud-native design with storage-compute separation, and multi-cloud/hybrid-cloud capabilities.
Technical Architecture: The overall stack is layered from data sources, lake-warehouse storage engines (HDFS, object storage), resource scheduling frameworks (YARN, Kubernetes), and compute engines (Hive, Flink, Spark, TensorFlow, MPP, federated query), up to a cloud data operating-system kernel that provides unified metadata, engine gateway, task scheduler, data integration, and cross-network transmission services. A data governance platform sits on top, offering full-lifecycle data development, quality, security, and management, while tenant, account, and permission services support multi-tenant operation.
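The engine gateway layer described above can be pictured as a routing table from workload type to compute engine. The following is a minimal illustrative sketch, not the actual DataCyber implementation; the engine names come from the stack above, while the workload categories, routing rules, and function names are assumptions for the example.

```python
from dataclasses import dataclass

# Hypothetical routing table: maps a workload type to one of the compute
# engines named in the architecture (routing rules are illustrative).
ROUTING_TABLE = {
    "batch_sql": "Hive",
    "stream": "Flink",
    "batch_etl": "Spark",
    "ml_training": "TensorFlow",
    "interactive_query": "MPP",
}

@dataclass
class Task:
    name: str
    workload: str

def route(task: Task) -> str:
    """Pick a compute engine for a task; unknown workloads fall back to federated query."""
    return ROUTING_TABLE.get(task.workload, "FederatedQuery")

print(route(Task("daily_report", "batch_sql")))   # -> Hive
print(route(Task("clickstream_agg", "stream")))   # -> Flink
print(route(Task("adhoc_join", "unknown")))       # -> FederatedQuery
```

A real gateway would also handle authentication, quota checks, and SQL dialect translation before dispatching to the engine, but the routing decision itself reduces to a lookup like this.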
Core Technical Components:
1. Unified Metadata System – integrates with Hive Metastore and extends catalog support for Spark/Flink, offering metadata management, permission control, and governance (lake-table management, lineage, lifecycle).
2. Unified Scheduling System – consists of a Coordinator cluster (job, resource, and API management) and Worker clusters (execution), designed for high stability, concurrency, and horizontal scalability, with support for both YARN and Kubernetes.
3. Data Integration System – enables high-speed heterogeneous source integration using Spark/Flink, supporting batch, stream, full-load, and incremental sync, with elastic scaling and cross-network transmission for hybrid-cloud scenarios.
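The full-load versus incremental sync distinction in the data integration system can be sketched with a watermark: the first run takes everything, and each later run pulls only rows newer than the last saved watermark. This is a simplified stand-in for what Spark/Flink connectors do; the row layout, field names, and function below are assumptions for illustration, not part of the platform's API.

```python
# Incremental sync via a high-watermark, sketched over plain dicts.
# A real pipeline would read from a source connector and persist the
# watermark in a state store between runs.

def incremental_sync(source_rows, watermark):
    """Return rows newer than the watermark, plus the advanced watermark."""
    fresh = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 230},
]

# First run with watermark 0 behaves as a full load.
batch, wm = incremental_sync(rows, watermark=0)
print(len(batch), wm)  # 3 230

# A later run only picks up rows written after the saved watermark.
batch, wm = incremental_sync(rows + [{"id": 4, "updated_at": 300}], watermark=wm)
print([r["id"] for r in batch], wm)  # [4] 300
```

Stream sync follows the same idea continuously (e.g., reading a change log) instead of in discrete batch runs.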
Practice Case: An industrial manufacturing data governance platform built on the CyberMeta cloud data platform demonstrates a one-stop solution for data development, stream-batch integration, and customized pipelines across offline, real-time, and lake-warehouse use cases, improving production efficiency and data asset management.
Q&A Highlights: Topics covered include decentralized scheduling design, VPC connectivity across clouds, metadata service capabilities, cloud-native versus traditional architectures, hybrid-cloud security (LDAP, Kerberos, Ranger, encryption), stream-batch handling (Lambda/Kappa), Delta Lake readiness, combined YARN/K8s deployments, analysis-ready schemas, metadata-driven development, cloud-native storage options, lake-warehouse sharing models, the distinction between cloud data platforms and cloud-native lakes, Spark/Flink on Kubernetes, intelligent scheduling, and differences between StarRocks and Doris.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.