Evolution of Next‑Generation Cloud Data Platform Architecture
This technical presentation reviews the historical development of big data platforms, outlines the four generations of cloud data platform architectures, details the modern cloud‑native stack—including unified metadata, scheduling, and integration systems—and showcases a real‑world industrial manufacturing case with a Q&A session.
The talk, titled "The Road to Evolution of Next-Generation Cloud Data Platform Architecture" and organized by Zhejiang Shuxin Network Co., is divided into four parts: a review of big-data development, trends in cloud data platform evolution, the technical architecture of a modern cloud data platform, and practical case studies.
Big Data Development Review: Data platforms provide end-to-end lifecycle capabilities for data integration, storage, processing, analysis, and services. Their evolution is described in three stages: the traditional data era (1980s–2000, dominated by Oracle/Teradata and BI tools), the big-data era (post-2000, driven by Hadoop and vendor-specific data middle-platforms), and the cloud data era (cloud-native services such as Redshift, Snowflake, Databricks, and Alibaba MaxCompute, PAI, EMR, and DataWorks). The presenter also introduces the domestic open-source cloud data platform DataCyber.
Cloud Data Platform Evolution Trends: Four architectural generations are identified—shared-storage, massively parallel processing (MPP), Hadoop/Spark, and cloud-native. Key trends include multi-engine support (storage, stream-batch, real-time analytics), stream-batch integration, lake-warehouse convergence, cloud-native design with storage-compute separation, and multi-cloud/hybrid-cloud capabilities.
Technical Architecture: The overall stack is layered from data sources, lake-warehouse storage engines (HDFS, object storage), resource scheduling frameworks (YARN, Kubernetes), and compute engines (Hive, Flink, Spark, TensorFlow, MPP, federated query), up to a cloud data operating-system kernel that provides unified metadata, engine gateway, task scheduler, data integration, and cross-network transmission services. A data governance platform sits on top, offering full-lifecycle data development, quality, security, and management, while tenant, account, and permission services support multi-tenant operation.
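The engine gateway layer described above can be pictured as a routing table from workload type to compute engine. The following is a minimal illustrative sketch, not the actual DataCyber implementation; the engine names come from the stack above, while the workload categories, routing rules, and function names are assumptions for the example.

```python
from dataclasses import dataclass

# Hypothetical routing table: maps a workload type to one of the compute
# engines named in the architecture (routing rules are illustrative).
ROUTING_TABLE = {
    "batch_sql": "Hive",
    "stream": "Flink",
    "batch_etl": "Spark",
    "ml_training": "TensorFlow",
    "interactive_query": "MPP",
}

@dataclass
class Task:
    name: str
    workload: str

def route(task: Task) -> str:
    """Pick a compute engine for a task; unknown workloads fall back to federated query."""
    return ROUTING_TABLE.get(task.workload, "FederatedQuery")

print(route(Task("daily_report", "batch_sql")))   # -> Hive
print(route(Task("clickstream_agg", "stream")))   # -> Flink
print(route(Task("adhoc_join", "unknown")))       # -> FederatedQuery
```

A real gateway would also handle authentication, quota checks, and SQL dialect translation before dispatching to the engine, but the routing decision itself reduces to a lookup like this.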
Core Technical Components:
1. Unified Metadata System – integrates with Hive Metastore and extends catalog support for Spark/Flink, offering metadata management, permission control, and governance (lake-table management, lineage, lifecycle).
2. Unified Scheduling System – consists of a Coordinator cluster (job, resource, and API management) and Worker clusters (execution), designed for high stability, concurrency, and horizontal scalability, with support for both YARN and Kubernetes.
3. Data Integration System – enables high-speed heterogeneous source integration using Spark/Flink, supporting batch, stream, full-load, and incremental sync, with elastic scaling and cross-network transmission for hybrid-cloud scenarios.
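The full-load versus incremental sync distinction in the data integration system can be sketched with a watermark: the first run takes everything, and each later run pulls only rows newer than the last saved watermark. This is a simplified stand-in for what Spark/Flink connectors do; the row layout, field names, and function below are assumptions for illustration, not part of the platform's API.

```python
# Incremental sync via a high-watermark, sketched over plain dicts.
# A real pipeline would read from a source connector and persist the
# watermark in a state store between runs.

def incremental_sync(source_rows, watermark):
    """Return rows newer than the watermark, plus the advanced watermark."""
    fresh = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 230},
]

# First run with watermark 0 behaves as a full load.
batch, wm = incremental_sync(rows, watermark=0)
print(len(batch), wm)  # 3 230

# A later run only picks up rows written after the saved watermark.
batch, wm = incremental_sync(rows + [{"id": 4, "updated_at": 300}], watermark=wm)
print([r["id"] for r in batch], wm)  # [4] 300
```

Stream sync follows the same idea continuously (e.g., reading a change log) instead of in discrete batch runs.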
Practice Case: An industrial manufacturing data governance platform built on the CyberMeta cloud data platform demonstrates a one-stop solution for data development, stream-batch integration, and customized pipelines across offline, real-time, and lake-warehouse use cases, improving production efficiency and data asset management.
Q&A Highlights: Topics covered include decentralized scheduling design, VPC connectivity across clouds, metadata service capabilities, cloud-native versus traditional architectures, hybrid-cloud security (LDAP, Kerberos, Ranger, encryption), stream-batch handling (Lambda/Kappa), Delta Lake readiness, combined YARN/K8s deployments, analysis-ready schemas, metadata-driven development, cloud-native storage options, lake-warehouse sharing models, the distinction between cloud data platforms and cloud-native lakes, Spark/Flink on Kubernetes, intelligent scheduling, and differences between StarRocks and Doris.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.