Tagged articles

3697 articles

Apr 9, 2022 · Big Data

Optimizing Apache Iceberg Query Performance with Z‑Order Data Organization

This talk explains how Apache Iceberg’s DataSkipping can lose efficiency with many filter columns, and presents a data‑organization redesign using space‑filling curves and Z‑Order to improve query I/O, detailing the OPTIMIZE syntax, implementation steps, performance benchmarks, and future roadmap.

Apache Iceberg · Big Data · Data Skipping

0 likes · 12 min read

Bilibili Tech

Apr 9, 2022 · Big Data

Bilibili Presto on Hadoop: Architecture, Scaling, and Performance Enhancements

Bilibili’s Presto on Hadoop combines a multi‑engine offline platform with Kubernetes‑managed YARN scheduling, Ranger security, and a custom dispatcher, scaling to over 400 nodes handling 160 k daily queries on 10 PB. The team added coordinator HA, resource‑group punishment, query limits, Alluxio caching, dynamic filtering, and numerous SQL‑level enhancements, and plans auto‑scaling and materialized‑view automation as future work.

Big Data · Hadoop · cluster scaling

0 likes · 30 min read

ByteDance Data Platform

Apr 8, 2022 · Operations

How Baseline Monitoring Transforms Data Pipeline Reliability at ByteDance

This article explains ByteDance's baseline monitoring system for data pipelines, detailing its motivation, core concepts, architecture, instance generation, alert types, and handling of complex task dependencies to reduce operational costs and improve SLA compliance across hundreds of projects.

Big Data · alerting · baseline monitoring

0 likes · 21 min read

Big Data Technology & Architecture

Apr 7, 2022 · Big Data

Understanding Kafka Producer Idempotence: Mechanisms and Implementation Details

This article explains how Kafka achieves producer idempotence by assigning unique producer IDs and sequence numbers, describes the broker’s validation process, and walks through the relevant producer‑side and broker‑side code paths, highlighting configuration considerations and limitations.

Big Data · Broker · Idempotence

0 likes · 13 min read

DataFunTalk

Apr 7, 2022 · Big Data

Apache Kyuubi: Architecture, Use Cases, Community, and Mobile Cloud Deployment

This article introduces Apache Kyuubi—a multi‑tenant Thrift JDBC/ODBC service built on Spark—detailing its architecture, advantages over Spark Thrift Server, real‑world use cases, open‑source community progress, and practical deployment strategies on mobile cloud, Kubernetes, and with Trino.

Apache Spark · Big Data · Kyuubi

0 likes · 16 min read

DataFunSummit

Apr 6, 2022 · Big Data

Real-time Dimension Modeling with Flink SQL: Challenges and Solutions

This article presents a JD.com case study on applying Flink SQL for real‑time dimension modeling, detailing two complex streaming scenarios—full‑join of multiple streams and full‑group aggregation—along with the associated challenges of historical data handling, state management, and performance optimization, and proposes component‑based architectural solutions.

Big Data · Flink · Streaming

0 likes · 14 min read

MaGe Linux Operations

Apr 5, 2022 · Big Data

Recreating Google Ngram Trends with Python: Analyzing 1.4 Billion Rows Efficiently

This article demonstrates how to use Python, the PyTubes library, and NumPy to load, process, and visualize the massive Google Ngram 1‑gram dataset—over 1.4 billion records—showing performance considerations, data‑cleaning steps, and comparative language trends for Python, Pascal, and Perl.

Big Data · Data Analysis · NGram

0 likes · 10 min read

Big Data Technology & Architecture

Apr 5, 2022 · Big Data

Using ElasticsearchSink with Apache Flink: Configuration, Retry Strategies, and Failure Handling

This article introduces the ElasticsearchSink for Apache Flink, explains how to add Maven dependencies, implement the sink with configuration and retry settings, details failure handlers, and highlights important considerations such as exception handling and checkpoint requirements for reliable streaming pipelines.

Big Data · Elasticsearch · Failure Handling

0 likes · 9 min read

DataFunTalk

Apr 4, 2022 · Big Data

Impala Deployment and Optimization in Sensors Data's Multi-Dimensional Analytics Platform

This article details the architecture of Sensors Data's analytics platform, the implementation of a real‑time Impala query engine, multiple query‑performance optimizations—including storage redesign, user‑behavior sequence tuning, join elimination and expression push‑down—and a resource‑estimation framework that dramatically reduces query failures and latency.

Big Data · Data Platform · Impala

0 likes · 16 min read

DataFunTalk

Apr 2, 2022 · Big Data

SuperSQL: A High‑Performance Cross‑Engine, Cross‑Data‑Center SQL Middleware for Big Data

The article introduces SuperSQL, a federated SQL middleware that unifies heterogeneous data sources across multiple data centers, leverages Apache Calcite for cost‑based optimization, pushes down operators to various engines, manages metadata with a Trie model, and demonstrates significant performance gains over traditional solutions.

Big Data · Cross‑Data‑Center · Distributed computing

0 likes · 27 min read

DataFunTalk

Apr 1, 2022 · Operations

Integrated Digital Supply Chain: JD Logistics' Intelligent Planning, Algorithm Platform, and Digital Twin Practices

This article explores JD Logistics' integrated digital supply chain, detailing its evolution, the construction of an algorithm middle‑platform, engineering platforms, digital twin system, real‑world case studies, and future talent and ecosystem directions, illustrating how AI and big‑data technologies drive end‑to‑end logistics optimization.

Algorithm Platform · Big Data · Digital Twin

0 likes · 16 min read

Big Data Technology & Architecture

Mar 31, 2022 · Big Data

Bilibili’s Lakehouse Architecture: Integrating Data Lake and Warehouse with Apache Iceberg

To address the high cost and low efficiency of traditional Hadoop‑based data pipelines, Bilibili designed a lakehouse solution using Apache Iceberg, integrating Spark, Flink, Trino, and Alluxio to unify flexible data lake storage with warehouse‑level query performance, reducing data duplication and improving interactive analytics.

Big Data · Data Warehouse · Iceberg

0 likes · 17 min read

DataFunTalk

Mar 30, 2022 · Big Data

NetEase Big Data Platform: HDFS Optimization and Practice

This article presents NetEase's big data platform architecture, detailing multi‑layer storage and compute design, HDFS deployment challenges, NameNode and NameSpace performance optimizations, cluster scaling strategies, data tiering, hardware upgrades, and real‑world business use cases, illustrating practical large‑scale big data engineering.

Big Data · Cluster Optimization · Data Management

0 likes · 23 min read

21CTO

Mar 30, 2022 · Big Data

What Drives Taobao App Users? Insights from AARRR and RFM Analyses

This article analyzes 2 million Taobao app user‑behavior records using AARRR funnel metrics and RFM segmentation, revealing daily and hourly usage patterns, conversion bottlenecks, product‑search mismatches, and offering data‑driven marketing recommendations to boost retention and sales.

AARRR · Big Data · RFM

0 likes · 25 min read

Bilibili Tech

Mar 30, 2022 · Big Data

HDFS Architecture, Optimizations, and Future Plans at Bilibili

Bilibili’s HDFS now runs a three‑tier architecture—access, metadata, and data layers—enhanced with a custom MergeFS router, observer NameNode, dynamic load balancing, fast‑failover pipelines, and storage‑aware policies, while future work targets transparent erasure coding, tiered data routing, lock refinements, and a Hadoop 3.x migration.

Big Data · Distributed File System · HDFS

0 likes · 22 min read

Efficient Ops

Mar 29, 2022 · Big Data

How Tencent Cloud Boosted APM Metric Computation Speed 2‑3× with Flink Optimizations

This article explains how Tencent Cloud's APM metric calculation, which transforms massive Span data into aggregated metrics using Flink, faced performance bottlenecks and was optimized through job splitting, batch merging, and dimension pruning, ultimately achieving a 2‑3× speed increase and cutting resource usage to about 30% of the original.

APM · Big Data · Flink

0 likes · 10 min read

DataFunTalk

Mar 29, 2022 · Big Data

FlinkX Multi-Source Heterogeneous Data Synchronization Framework: Architecture, Features, and Cloud‑Native Enhancements

This article introduces the FlinkX framework for multi‑source heterogeneous data synchronization, detailing its background, core functions such as checkpoint‑based resume, metric monitoring, rate limiting, plugin architecture, cloud‑native K8s deployment, Hudi integration, and future roadmap, while also addressing common Q&A topics.

Batch · Big Data · Data Lake

0 likes · 14 min read

58 Tech

Mar 29, 2022 · Big Data

Design and Implementation of the 58 Group Penalty Data Center

This article presents the design, architecture, and implementation of a unified penalty data center for 58 Group, detailing the challenges of heterogeneous data sources, the selection of Flink for real‑time ETL, the use of a DSL and LRU aggregation, and the adoption of MVEL for feature recognition to achieve standardized, high‑performance penalty data processing.

Big Data · Data engineering · ETL

0 likes · 13 min read

NetEase Smart Enterprise Tech+

Mar 29, 2022 · Big Data

Automating Consumer Insight Testing with Spark, Hive, and ClickHouse

This article explains how to build a big‑data consumer insight platform using Spark applications, Hive, MySQL and ClickHouse, and how to automate data validation and algorithm testing to improve coverage, efficiency, and reliability of insight services.

Big Data · ClickHouse · Spark

0 likes · 8 min read

Big Data Technology & Architecture

Mar 28, 2022 · Big Data

Real-time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents JD's real-time dimension modeling case using Flink SQL, detailing two complex streaming scenarios, the difficulties of handling historical data and state management, and a component‑based solution that leverages external KV stores and optimized Flink operators to improve performance and scalability.

Big Data · Flink · Streaming

0 likes · 13 min read

Architects' Tech Alliance

Mar 28, 2022 · Artificial Intelligence

Digital Twin: Ten Fundamental Questions and Insights for Researchers, Decision‑Makers, and Practitioners

This article analyzes ten fundamental questions about digital twins, covering definitions, stakeholders, global interest, relationship with smart manufacturing, integration with New IT, scientific challenges, standards, and commercial tools, aiming to guide researchers, policymakers, and practitioners in understanding and applying digital twin technology.

AI · Big Data · Digital Twin

0 likes · 22 min read

Bilibili Tech

Mar 25, 2022 · Big Data

Bilibili's YARN Scheduling Optimization Practice: From Heartbeat-Driven to Global Scheduling

Bilibili transformed its YARN CapacityScheduler from a heartbeat‑driven design to a multi‑threaded global scheduler by separating lock handling, adopting Weighted Round‑Robin with DRF, adding batch node selection, fixing proposal inconsistencies, tuning GC and logging, and thereby reduced application allocation time by about 38 % on clusters of up to 8,000 nodes.

Big Data · CapacityScheduler · Hadoop

0 likes · 15 min read

GuanYuan Data Tech Team

Mar 24, 2022 · Big Data

Why Do Spark Card Queries Take 10 Seconds? Uncovering a NAS Mount Issue

A customer’s Spark card queries were consistently taking around 10 seconds, prompting a step‑by‑step investigation that revealed a misconfigured NAS mount option (lookupcache=none) as the root cause of the severe slowdown.

Arthas · Big Data · Debugging

0 likes · 7 min read

DataFunTalk

Mar 24, 2022 · Big Data

Real‑time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents a JD.com BI engineer's case study on applying Flink SQL to real‑time dimension modeling, detailing two complex streaming scenarios, the technical difficulties of handling historical data and performance, and a component‑based solution architecture with future roadmap considerations.

Big Data · Flink · dimension modeling

0 likes · 13 min read

IT Architects Alliance

Mar 23, 2022 · Big Data

How Elasticsearch’s Cluster Architecture Powers Scalable Search and Analytics

This article explains Elasticsearch’s distributed cluster design, covering core concepts such as nodes, indices, shards, and replicas, compares mixed and tiered deployment models, examines data‑layer storage options, and discusses two typical distributed system architectures with their trade‑offs.

Big Data · Cluster Architecture · Data Storage

0 likes · 15 min read

StarRocks

Mar 23, 2022 · Databases

Accelerating Zepp Health’s Analytics with StarRocks: An OLAP Case Study

Facing inflexible point‑lookup limits and slow query times on HBase, Zepp Health redesigned its massive event‑tracking data pipeline—migrating ingestion through Kafka, Flink, and Hudi to a StarRocks‑based OLAP layer—achieving sub‑100 ms average query latency, 20 % storage savings, and dramatically faster multi‑dimensional analytics.

Big Data · Flink · Hudi

0 likes · 9 min read

DataFunTalk

Mar 23, 2022 · Big Data

Iceberg Data Lake Query Optimization Practices and Governance

This talk by Tencent senior engineer Chen Liang covers Iceberg table format fundamentals, data lake ingestion, query processing, hidden partitioning, time‑travel, major features, optimization techniques such as compaction, bin‑packing, sorting and Z‑ordering, and outlines a future roadmap for improving performance and governance in big‑data environments.

Big Data · Data Lake · Flink

0 likes · 12 min read

Big Data Technology & Architecture

Mar 22, 2022 · Big Data

Integrating Hive Data Warehouse with ClickHouse Using Seatunnel: A Step‑by‑Step Guide

This article provides a comprehensive, hands‑on tutorial for connecting a Hive data warehouse to ClickHouse via Seatunnel, covering environment setup, Hive and ClickHouse table creation, full and incremental data import scripts, execution examples, and practical troubleshooting tips.

Big Data · ClickHouse · Data Integration

0 likes · 10 min read

Tencent Tech

Mar 21, 2022 · R&D Management

Inside Tencent’s 2021 R&D Report: Coding Trends, AI Advances & Innovation

Tencent’s 2021 R&D Report details a 41% rise in engineering staff, 32 billion new code lines, Go becoming the top language, massive growth in open‑source contributions, breakthroughs in cloud OS, databases, AI, and a commitment to carbon‑neutral technology‑driven social impact.

AI · Big Data · R&D

0 likes · 8 min read

DataFunTalk

Mar 18, 2022 · Big Data

Scaling LinkedIn’s Hadoop YARN Cluster Beyond 10,000 Nodes: Challenges and Solutions

This article examines how LinkedIn tackled severe scheduling slowdowns when its Hadoop YARN cluster grew to nearly 10,000 nodes, analyzes the root causes of resource‑manager bottlenecks, and describes the fairness‑redefinition and scheduling‑logic patches that restored throughput and scalability.

Big Data · Hadoop · Scheduling

0 likes · 13 min read

Big Data Technology & Architecture

Mar 16, 2022 · Big Data

End‑to‑End Streaming Data Pipeline with Kafka, Flink, and Apache Griffin

This tutorial demonstrates how to build a complete streaming data pipeline by configuring JDK, MySQL, Hadoop, Hive, Spark, Kafka, and Griffin, generating test data with shell scripts, processing it with Flink, and validating data quality using Apache Griffin in a Spark‑based deployment.

Apache Griffin · Big Data · Data Quality

0 likes · 13 min read

Big Data Technology & Architecture

Mar 15, 2022 · Big Data

Using Flink CDC to Capture MySQL Changes and Sync Them to ClickHouse

This article introduces Change Data Capture (CDC), compares query‑based and log‑based CDC, explains Debezium and ClickHouse, and provides step‑by‑step Flink CDC and Flink SQL CDC examples—including full Java code—to stream MySQL binlog changes into ClickHouse for real‑time analytics.

Big Data · CDC · ClickHouse

0 likes · 17 min read

Alibaba Cloud Developer

Mar 15, 2022 · Big Data

How Modern Data Lake Engines Accelerate Analytics: Inside StarRocks Architecture

This article explains why data lakes are essential for today’s analytics, outlines the three main user demands, defines data lakes, compares rule‑based and cost‑based optimizers, explores record‑oriented versus block‑oriented processing, and details StarRocks’ frontend‑backend architecture and benchmark results.

Analytics Engine · Big Data · Data Lake

0 likes · 17 min read

Volcano Engine Developer Services

Mar 15, 2022 · Big Data

How ByteDance Designs Scalable Data Lineage for Big Data Governance

This article explains ByteDance's data lineage architecture, covering data sources, processing pipelines, graph‑based modeling, key application scenarios, quality metrics such as accuracy, coverage and timeliness, and future directions for improving and standardizing lineage across its massive data platform.

Big Data · Data Lineage · Metadata

0 likes · 14 min read

DataFunTalk

Mar 15, 2022 · Big Data

Bilibili's Billion‑Scale Data Synchronization Using Apache SeaTunnel

This article details Bilibili's implementation of a hundred‑terabyte‑per‑day data synchronization pipeline, covering tool selection between DataX‑based Rider and SeaTunnel‑based AlterEgo, architecture design, performance tuning, logging optimization, rate‑limiting strategies, and comprehensive monitoring for large‑scale offline data ingestion and export.

Apache SeaTunnel · Big Data · ClickHouse

0 likes · 13 min read

IT Architects Alliance

Mar 14, 2022 · Big Data

Comprehensive Guide to Kafka Architecture, Core Concepts, and Production Deployment

This article provides an in‑depth overview of Kafka, covering why messaging systems are needed, core concepts, cluster architecture, performance optimizations such as sequential disk writes and zero‑copy, hardware sizing, replication, consumer groups, offset management, rebalance strategies, and practical deployment and operational guidelines.

Big Data · Cluster Deployment · Distributed Messaging

0 likes · 35 min read

BaiPing Technology

Mar 14, 2022 · Big Data

Mastering DataWorks & MaxCompute: A Complete Guide to Big Data Architecture and Governance

DataWorks, Alibaba Cloud’s comprehensive PaaS platform, combined with the serverless MaxCompute data warehouse, offers an integrated solution for data integration, development, quality, and services, while detailed naming and layer conventions ensure scalable, maintainable big‑data architectures and effective governance across ODS, CDM, DWD, DWS, and ADS layers.

Big Data · DataWorks · MaxCompute

0 likes · 8 min read

DataFunTalk

Mar 13, 2022 · Big Data

Tencent Data Lake Metadata Governance Practice and Architecture

This article presents Tencent's data lake metadata governance practice, covering data lake fundamentals, the 3+2 architecture of storage, compute and unified metadata, multi‑tenant design, the re‑implemented Hive Metastore for online catalog, performance optimizations, and offline data‑governance capabilities.

Big Data · Cloud Computing · Data Lake

0 likes · 18 min read

DevOps

Mar 11, 2022 · Cloud Computing

Informationization vs. Digital Transformation: Definitions, Differences, and Their Impact on Chinese Enterprises

The article explains the definitions of informationization and digital transformation, compares their technical, demand, core‑goal, and ecosystem differences, and analyzes how digital technologies such as cloud, big data and AI are reshaping industries, enterprise strategies, talent needs, and overall competitiveness in China.

Big Data · China · Digital Transformation

0 likes · 14 min read

vivo Internet Technology

Mar 9, 2022 · Big Data

Incremental Synchronization of Massive HBase Data to a Data Warehouse: Solution Overview and Performance Evaluation

The paper proposes a generic, timeRange‑based incremental extraction method for synchronizing tens of billions of HBase rows to a data warehouse. It shows that the method avoids full‑table scans, automatically detects schema changes, and delivers significantly lower latency than Hive mapping or timestamp‑based approaches; the method has been integrated into a unified big‑data platform.

Big Data · HBase · Performance evaluation

0 likes · 8 min read

Big Data Technology & Architecture

Mar 7, 2022 · Big Data

Apache Griffin: An Overview of the Big Data Data‑Quality Monitoring Tool

This article introduces Apache Griffin, a model‑driven big‑data data‑quality monitoring platform, explains its key features, architecture, installation requirements, and provides step‑by‑step usage examples with Hive, Kafka and Spark integration.

Apache Griffin · Big Data · Data Quality

0 likes · 9 min read

Python Programming Learning Circle

Mar 7, 2022 · Big Data

Analyzing 1.4 Billion N‑gram Rows with Python, NumPy and PyTubes

This article demonstrates how to download Google’s massive N‑gram dataset, load the 1.4 billion 1‑gram records with Python and the PyTubes library, use NumPy to efficiently compute yearly word frequencies, and reproduce Google Ngram Viewer charts for Python and other programming languages.

Big Data · Data Analysis · NGram

0 likes · 7 min read

Big Data Technology & Architecture

Mar 5, 2022 · Databases

Understanding ClickHouse Distributed Tables, Replication, and Sharding

This article explains the concepts of ClickHouse local and distributed tables, why writing directly to distributed tables can be problematic, and how replication, sharding, and the ReplicatedMergeTree engine work together with ZooKeeper to provide high‑availability and scalable query processing.

Big Data · ClickHouse · Database Architecture

0 likes · 9 min read

Big Data Technology & Architecture

Mar 4, 2022 · Big Data

Managing Small Files in Apache Hudi and Spark Optimization Guide

The article explains how Apache Hudi automatically manages file sizes to avoid small‑file issues, details key configuration parameters, provides a step‑by‑step example of file merging, and offers practical Spark tuning recommendations for optimal performance in data‑lake workloads.

Apache Hudi · Big Data · Data Lake

0 likes · 11 min read

DataFunTalk

Mar 3, 2022 · Big Data

Youzan Data Platform and DP Data Development Platform: Architecture, Core Modules, and Scheduling System Upgrade

This article presents an in‑depth overview of Youzan's data platform, introduces the DP data development platform with its key features and workflow, details the core module architecture—including service, scheduling, and component layers—and explains the migration from Airflow to DolphinScheduler to improve performance, stability, and scalability.

Big Data · Data Development · Data Platform

0 likes · 14 min read

IT Xianyu

Mar 3, 2022 · Databases

Introducing SPL: An Open‑Source Structured Data Processing Language with Full SQL‑92 Capabilities

SPL is an open‑source structured data processing language that extends full SQL‑92 functionality to a wide range of data sources—including CSV, Excel, JSON, NoSQL and Hadoop—allowing developers to perform complex queries, multi‑step calculations, and mixed‑source analytics without a traditional relational database.

Big Data · Data Integration · SPL

0 likes · 14 min read

AntTech

Mar 1, 2022 · Big Data

Graph Computing at Ant Group: From Fraud Prevention to Industry‑Wide Impact

The article explains how Ant Group leverages large‑scale graph computing—through its GeaBase and TuGraph platforms and a dedicated research team—to enhance real‑time fraud detection, drive industry standards, and explore future applications across finance, energy, and public services.

Ant Group · Big Data · TuGraph

0 likes · 7 min read

DataFunTalk

Mar 1, 2022 · Cloud Native

Alibaba Cloud Native Data Lake with Apache Iceberg: Architecture, Challenges, and Solutions

The presentation outlines Alibaba Cloud's native data lake solution built on Apache Iceberg, covering data lake fundamentals, cloud migration challenges, Iceberg's architecture and features, real‑time ingestion with Flink, unified metadata management, security guarantees, and testing practices to ensure reliable, scalable big‑data analytics.

Apache Iceberg · Big Data · Data Lake

0 likes · 16 min read

Big Data Technology & Architecture

Feb 28, 2022 · Big Data

Integrating Apache Hudi with Hive, Presto, and Spark SQL: Installation, Operations, and Query Examples

This article provides a step‑by‑step guide on integrating Apache Hudi with Hive and Presto, demonstrates core Hudi operations such as insert, upsert, delete, query, and Hive synchronization using Scala code, and shows how to manage Hudi tables through Spark SQL DDL/DML commands.

Apache Hudi · Big Data · Data Lake

0 likes · 16 min read

Architects Research Society

Feb 26, 2022 · Big Data

Introduction to Azure Data Lake Analytics (ADLA) and Its Architecture

This article introduces Azure Data Lake Analytics, explains how data lakes differ from traditional warehouses, outlines the ETL process, highlights the benefits of schema‑on‑read storage, and describes the four‑stage Azure data platform architecture for ingesting, storing, processing, and analyzing massive datasets.

Azure · Big Data · U-SQL

0 likes · 5 min read

Kuaishou Big Data

Feb 25, 2022 · Big Data

How Kuaishou Scales Data Sync: Architecture, Challenges, and Future Plans

This article details the design, evolution, and optimization of Kuaishou's data synchronization platform, covering business overview, architecture, key technologies, performance tuning, data source protection, incremental data lake integration, and future roadmap for a unified data fabric.

Big Data · Real-time Processing · architecture

0 likes · 15 min read

DataFunTalk

Feb 25, 2022 · Big Data

Tencent's Application of Apache Iceberg for Real‑Time Data Lake Ingestion, Governance, and Query Optimization

This article explains how Tencent leverages Apache Iceberg together with Flink to build a real‑time data lake pipeline, covering data ingestion, Iceberg's snapshot‑based read/write model, compaction and governance services, Z‑order based query optimization, performance results, and future roadmap.

Apache Iceberg · Big Data · Data Lake

0 likes · 24 min read

Big Data Technology & Architecture

Feb 23, 2022 · Big Data

Understanding Mini‑Batch Streaming Aggregation in Flink SQL

This article explains Flink SQL’s streaming aggregation Mini‑Batch feature, covering its purpose, configuration parameters, underlying optimizer rules, operator implementations, watermark handling, buffer processing, and the optional Local‑Global two‑phase aggregation optimization for improving throughput and reducing state overhead in large‑scale data pipelines.

Big Data · Flink · Mini-Batch

0 likes · 10 min read

DataFunTalk

Feb 23, 2022 · Big Data

NetEase Data Platform DataOps Practices for Improving Data Quality

This DataFunTalk presentation from NetEase examines the growing data‑quality challenges in data development and shows how DataOps principles (pre‑ and post‑control mechanisms, sandbox environments, and automated governance tools) are applied to systematically reduce defects, optimize resources, and ensure reliable data delivery across the company's diverse business lines.

Big Data · Data Platform · DataOps

0 likes · 14 min read

Architects' Tech Alliance

Feb 22, 2022 · Cloud Computing

Understanding China's “East Data West Computing” Initiative: Goals, Rationale, and Implementation

The “East Data West Computing” program is a national strategy that relocates computing workloads from data‑intensive eastern regions to resource‑rich western areas by building a network of data‑center hubs and clusters, aiming to balance supply and demand, improve energy efficiency, and boost overall computing capacity.

Big Data · Data Centers · East Data West Computing

0 likes · 7 min read

IT Architects Alliance

Feb 22, 2022 · Big Data

Understanding Kafka's Core Design: Topics, Partitions, Consumer Groups, and Cluster Architecture

This article explains Kafka's fundamental concepts (topics, partitions, producers, consumers, replication, consumer groups, and the role of ZooKeeper) and covers performance optimizations such as sequential writes, zero‑copy, log segmentation, and its reactor‑style network design.

Big Data · Kafka · Streaming

0 likes · 11 min read

ByteDance Data Platform

Feb 21, 2022 · Big Data

Choosing the Right Components for Enterprise Data Warehouses: Hive vs SparkSQL

This article examines how to design enterprise‑grade data warehouses by evaluating development convenience, ecosystem, decoupling, performance and security, compares Hive and SparkSQL along with other engines such as Presto, Doris and ClickHouse, and outlines best‑practice component selections for long‑running batch and interactive analytics.

Big Data · Data Warehouse · ETL

0 likes · 19 min read

DataFunTalk

Feb 19, 2022 · Big Data

Fundamentals of Data Middle Platform: Logic, Principles, and Practice

This article explains what a data middle platform is, why organizations need it, its core principles, technical architecture, and practical implementation guidelines, highlighting how it solves issues like inconsistent metrics, duplicate data construction, low query efficiency, poor data quality, and high development costs.

Big Data · Data Architecture · Data Middle Platform

0 likes · 14 min read

Big Data Technology & Architecture

Feb 19, 2022 · Big Data

Apache Flink 1.13.6 Release: Bug Fixes, Improvements, and Updated Maven Dependencies

Apache Flink 1.13.6, the latest patch release, addresses 99 bugs and vulnerabilities, upgrades Log4j to 2.17.1, provides new Maven dependencies, and introduces numerous fixes and enhancements across SQL, checkpointing, state backend, and Kubernetes integration; users are urged to upgrade promptly.

Apache Flink · Big Data · Bug Fixes

0 likes · 10 min read

Bilibili Tech

Feb 18, 2022 · Big Data

Evolution of Bilibili's Data Retrieval Services and Lakehouse Architecture

Bilibili’s data retrieval journey progressed from a fragmented, chimney‑style pipeline to a unified Flink‑based service layer with the Ark construction system and Akuya SQL engine, and finally to an Iceberg‑driven lakehouse that eliminates data duplication, streamlines cross‑engine optimization, and offers platformized, low‑latency analytics.

Big Data · Bilibili · Data Retrieval

0 likes · 14 min read

Big Data Technology & Architecture

Feb 17, 2022 · Big Data

Comprehensive Guide to Installing and Configuring Apache Atlas with Hive and Sqoop Hooks

This article provides a step‑by‑step tutorial on using Apache Atlas for data lineage, including SQL execution, custom data maps, tagging, field search, detailed installation procedures, runtime commands, and the configuration of Hive and Sqoop hooks for a complete big‑data governance solution.

Apache Atlas · Big Data · Hive Hook

0 likes · 18 min read

Alimama Tech

Feb 16, 2022 · Big Data

Target Group Discovery: Framework, Models, and Case Study

The article presents a target‑group discovery framework that combines goal definition, rule‑ or model‑based segmentation, tiered metrics, benchmarking and quadrant analysis to identify and characterize advantageous, problematic, or weak consumer, product, or merchant sub‑groups, illustrated by an FMCG e‑commerce case study diagnosing high‑share, low‑growth categories.

Benchmarking · Big Data · Marketing Analytics

0 likes · 13 min read

Big Data Technology & Architecture

Feb 16, 2022 · Big Data

Using Flink CDC to Capture MySQL Changes and Sync Them to ClickHouse

This article introduces Change Data Capture (CDC), compares query‑based and log‑based approaches, explains Debezium and ClickHouse, and provides detailed Flink CDC and Flink SQL CDC examples—including Java source code, custom deserialization schema, ClickHouse sink implementation, and required Maven dependencies—to synchronize MySQL data into ClickHouse in real time.

Big Data · CDC · ClickHouse

0 likes · 17 min read

dbaplus Community

Feb 15, 2022 · Big Data

Mastering Data Warehouse Architecture: Concepts, Modeling Techniques, and Real‑Time Strategies

This comprehensive guide explains data warehouse fundamentals, architecture layers, modeling methods such as dimensional and entity modeling, metadata management, and the transition from offline to real‑time processing with Lambda and Kappa architectures, providing practical steps, best practices, and key terminology for building robust analytical platforms.

Big Data · Data Warehouse · ETL

0 likes · 63 min read

Big Data Technology & Architecture

Feb 15, 2022 · Big Data

Understanding Flink TaskManager Memory Model (Post‑1.10)

This article explains the official Flink memory model diagram, shows real‑world TaskManager memory parameters, and breaks down the major memory components (process, Flink, JVM heap, off‑heap, Metaspace, and overhead), with configuration guidance for optimal resource allocation.

Big Data · Flink · Java

0 likes · 8 min read

DataFunTalk

Feb 15, 2022 · Big Data

SeaTunnel Multi‑Dimensional Practice at Vipshop: ClickHouse‑Hive Integration and Data Platform Integration

The article details Vipshop's multi‑dimensional use of SeaTunnel to integrate Hive and ClickHouse, describing data import/export challenges, tool selection among DataX, SeaTunnel and Spark, custom configurations, platform integration, and future improvements for high‑performance OLAP pipelines.

Big Data · ClickHouse · Data Integration

0 likes · 15 min read

IT Architects Alliance

Feb 15, 2022 · Artificial Intelligence

How a Scalable Recommendation Engine Evolved: From V1.0 to V3.0

This article details the evolution of an e‑commerce recommendation system through three architectural versions, highlighting the initial simple design, the challenges that prompted vertical and horizontal splits, the introduction of a configurable pipeline and A/B testing, and the final micro‑service‑based, dynamically configurable V3.0 architecture.

AI · Big Data · Scalability

0 likes · 14 min read

Big Data Technology & Architecture

Feb 14, 2022 · Big Data

Real-Time Advertising Data Warehouse Architecture Based on Flink

This article presents a comprehensive design of a real-time advertising data warehouse powered by Flink, covering construction background, technical and data‑warehouse architecture, real‑time OLAP, stability and data‑quality guarantees, future plans, and the integration of Hologres for simplified processing.

Big Data · Data Quality · Flink

0 likes · 10 min read

DataFunTalk

Feb 13, 2022 · Big Data

How Kuaishou Built a Standardized Data Governance Evaluation System

This article outlines Kuaishou’s approach to establishing a standardized data governance evaluation framework, detailing the challenges of large‑scale data management, the design of assessment metrics across model, quality, and cost dimensions, and the practical strategies and operational mechanisms used to improve data asset health and business value.

Big Data · Evaluation Framework · Kuaishou

0 likes · 21 min read

Big Data Technology & Architecture

Feb 13, 2022 · Big Data

What's New in Elasticsearch 8.0 – Key Features and Changes

The article provides a comprehensive overview of Elasticsearch 8.0, highlighting major updates such as 7.x REST API compatibility headers, default-enabled security, system‑index protection, a new KNN search API, storage and indexing optimizations, PyTorch model support, and numerous deprecations and feature removals across the stack.

8.0 · API · Big Data

0 likes · 10 min read

DataFunTalk

Feb 12, 2022 · Big Data

NetEase Internal Data Lake Project Arctic: Architecture, Requirements, and Future Roadmap

This article introduces NetEase's internally incubated data lake project Arctic, explains the concept of data lakes, outlines NetEase's specific requirements for a unified streaming‑batch platform, details Arctic's core architecture, storage strategy, data‑merge mechanisms, current achievements, and future development plans.

Apache Iceberg · Arctic · Big Data

0 likes · 10 min read

Programmer DD

Feb 12, 2022 · Databases

What’s New in Elasticsearch 8.0? Key Features and Migration Tips

Elasticsearch 8.0 introduces major changes such as 7.x REST API compatibility headers, default‑enabled security with registration tokens, protected system indices, a technical preview of KNN search, storage‑saving field encodings, faster geo‑point indexing, PyTorch model support for NLP, and numerous deprecations and improvements across aggregations, allocation, analysis, authentication, cluster coordination, and packaging.

API · Big Data · Elasticsearch

0 likes · 10 min read

21CTO

Feb 11, 2022 · Cloud Computing

What Will Shape Software Development in 2022? 20 Key Trends Revealed

The article surveys 2022 software‑development forecasts, covering centralized and edge cloud infrastructure, multi‑cloud adoption, containers, security, blockchain, AI, low‑code, databases, big‑data engines, streaming, DevOps observability, programming languages, front‑end frameworks, and mobile development, offering a comprehensive outlook for practitioners and decision‑makers.

2022 trends · Big Data · software development

0 likes · 21 min read

DataFunSummit

Feb 9, 2022 · Big Data

Practical Reflections on OneID: Origins, Scenarios, Challenges, and Data Platform Practices

This article reviews OneID as a core data‑identity infrastructure for enterprise digital transformation, detailing its definition, origins, key use cases, technical and engineering challenges, and emerging trends such as CDP adoption, enterprise‑wide deployment, and weak‑ID intelligent association.

Big Data · Data Identity · OneID

0 likes · 13 min read

Big Data Technology & Architecture

Feb 9, 2022 · Big Data

Apache Ambari Project Retired: End of an Era for Hadoop Management Tool

The Apache Ambari project, once a leading web‑based management and monitoring tool for Hadoop clusters, has been officially retired and moved to the Apache Attic after a unanimous community vote, marking the end of its development despite continued access to its website, source code, and JIRA.

Apache Ambari · Big Data · Hadoop

0 likes · 4 min read

Zhengcaiyun Technology

Feb 8, 2022 · Industry Insights

Unlocking Enterprise Value with a Data Middle Platform: Architecture & Indicators

This article traces the evolution from traditional data warehouses to modern data lakes and data middle platforms, explains why siloed data development hampers efficiency, and details the architecture and indicator‑library design used by Zhengcaiyun to achieve unified, reusable data services.

Big Data · Data Lakehouse · Data Middle Platform

0 likes · 14 min read

Big Data Technology & Architecture

Feb 8, 2022 · Big Data

Apache Hudi Overview: Design Principles, Table Architecture, and Read/Write Processes

This article provides a comprehensive overview of Apache Hudi, covering its storage reliance on HDFS, core design principles, table architecture, timeline management, file and index structures, as well as detailed read and write workflows for both Copy‑On‑Write and Merge‑On‑Read table types.

Apache Hudi · Big Data · Copy-on-Write

0 likes · 16 min read

IT Architects Alliance

Feb 8, 2022 · Backend Development

Designing a Daily Million-Transaction Payment Reconciliation System

This article explains how to architect a payment reconciliation system that can reliably process tens of millions of transactions per day, covering the underlying logic, scalability challenges, data collection methods, big‑data integration, and step‑by‑step processing flows to ensure accurate financial matching.

Backend Architecture · Big Data · Hive

0 likes · 32 min read

DataFunTalk

Feb 3, 2022 · Big Data

Improving Data Processing Efficiency at Kuaishou with Apache Hudi

This article explains how Kuaishou tackled latency and efficiency problems in large‑scale data pipelines by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, its architecture, model design, handling of bursty updates, back‑fill scenarios, and operational safeguards.

Big Data · Data Lake · Flink

0 likes · 13 min read

DataFunTalk

Jan 28, 2022 · Big Data

Real-Time Customer Data Platform (RT‑CDP) Architecture and Implementation at iFanFan

This article explains the concept, challenges, and key business goals of a real‑time Customer Data Platform, details the technology stack selection—including Nebula Graph, Apache Flink, Apache Beam, Kudu, and Doris—and describes the modular architecture, data model, identity service, streaming computation, storage layers, rule engine, operational results, and future directions.

Big Data · CDP · Data Integration

0 likes · 43 min read

IT Xianyu

Jan 28, 2022 · Big Data

Step-by-Step Guide to Installing and Configuring Hue on CentOS 7 with Hadoop, Hive, and YARN

This tutorial explains how to set up the Hue web UI on a CentOS 7 machine by installing required dependencies, compiling Hue, configuring HDFS, YARN and Hive integration files, starting Hive services, launching Hue, and accessing the interface, with all commands and configuration snippets provided.

Big Data · CentOS · Hadoop

0 likes · 6 min read

JD Retail Technology

Jan 27, 2022 · Big Data

How JD’s Custom Spark Engine Tackles Data Skew for Massive Offline Jobs

This article explains JD’s self‑developed data‑skew mitigation solution for Spark, detailing the problem of uneven key distribution, the limitations of the open‑source AQE implementation, and JD’s OptimizeSkewedJoinV2 algorithm that dramatically reduces stage latency in large‑scale join workloads.

Adaptive Query Execution · Big Data · Data Skew

0 likes · 13 min read

DataFunTalk

Jan 27, 2022 · Big Data

Kyuubi: NetEase’s Open‑Source Multi‑Tenant SQL Engine for Large‑Scale Data Processing

This article introduces Kyuubi, the first NetEase project contributed to the Apache Foundation, describing its core features, multi‑tenant architecture, Spark‑based execution engine, cloud‑native capabilities, and real‑world use cases within NetEase’s data‑warehouse, ad‑hoc, and internal systems, along with performance gains and community resources.

Apache · Big Data · Kyuubi

0 likes · 23 min read

IT Xianyu

Jan 27, 2022 · Big Data

Installing Apache Hive on macOS with Hadoop and MySQL Metastore

This tutorial provides step‑by‑step instructions for installing Hadoop 3.1.1, Homebrew, Hive, and configuring MySQL as Hive's metastore on macOS, including environment variable setup, hive‑site.xml configuration, MySQL connector placement, schema initialization, and verification commands.

Big Data · Hadoop · Hive

0 likes · 6 min read

dbaplus Community

Jan 26, 2022 · Big Data

Why Does Elasticsearch Aggregate Faster with Fewer Terms? Uncover the Secrets

This article examines a real‑world Elasticsearch cluster handling hundreds of terabytes, explains why high‑cardinality aggregations can be slower, and shows how setting execution_hint=map and tuning doc_values dramatically improves aggregation performance for ultra‑high‑concurrency workloads.

Big Data · Elasticsearch · Performance Optimization

0 likes · 12 min read

Alibaba Cloud Native

Jan 26, 2022 · Big Data

How to Build a Lakehouse with RocketMQ and Apache Hudi: A Step‑by‑Step Guide

This article explains the Lakehouse architecture, its required features, the evolution of big‑data stacks, and provides a detailed, hands‑on guide for constructing a Lakehouse using RocketMQ (Connector & Stream) and Apache Hudi, including configuration, deployment, and sample code.

Apache Hudi · Big Data · Data Lake

0 likes · 18 min read

Java High-Performance Architecture

Jan 26, 2022 · Big Data

How Elasticsearch’s Cluster Architecture Powers Scalable Search and Analytics

This article explains Elasticsearch’s distributed cluster design, covering nodes, indices, shards, replicas, deployment models, data storage options, and the trade‑offs of different distributed system architectures for search and analytics workloads.

Big Data · Cluster Architecture · Elasticsearch

0 likes · 14 min read

IT Architects Alliance

Jan 26, 2022 · Big Data

Why Combine Data Lakes and Warehouses? Understanding Lakehouse Architecture

This article explains the concepts of data warehouses, data marts, and data lakes, illustrates why the lakehouse model emerged to bridge storage and compute, and outlines its key benefits such as flexibility, scalability, reduced redundancy, and unified analytics for modern enterprises.

Analytics · Big Data · Data Architecture

0 likes · 12 min read

Architects Research Society

Jan 25, 2022 · Big Data

Azure Data Lake Storage Gen2: Design Guide, Best Practices, and Operational Considerations

This guide provides a comprehensive overview of Azure Data Lake Storage Gen2, covering when to use it, key design considerations, data organization strategies, access control models, file formats, cost‑optimization techniques, monitoring approaches, and performance‑tuning tips for large‑scale big‑data workloads.

ADLS Gen2 · Azure · Big Data

0 likes · 41 min read

DataFunTalk

Jan 25, 2022 · Big Data

Summary of Flink Forward Asia 2021: Community Growth, Cloud‑Native Deployment, Streaming‑Batch Integration, and Machine Learning

The article provides a comprehensive English summary of the 2021 Flink Forward Asia conference, covering community statistics, cloud‑native deployment modes, fault‑tolerance checkpoint advances, the evolution of streaming‑batch integration, the introduction of Streaming Warehouse, Flink ML 2.0, real‑time use cases at ByteDance and ICBC, Pravega storage innovations, and concluding reflections on the future of real‑time big data processing.

Apache Flink · Big Data

0 likes · 25 min read

Qunar Tech Salon

Jan 25, 2022 · Fundamentals

Curated Collection of Qunar Technical Articles on Architecture Design, Big Data, Frontend, and Cloud Native (2021)

This article compiles a selection of Qunar's 2021 technical writings covering architecture design, big data processing, front‑end engineering, and cloud‑native practices, providing titles, authors, brief abstracts, and direct links for readers seeking in‑depth engineering insights.

Big Data · Qunar · architecture

0 likes · 8 min read

IT Architects Alliance

Jan 25, 2022 · Operations

Design and Architecture of a Shared Resource Platform and Its Technical System

This document outlines the logical and technical architecture of a government shared resource platform, describing application system upgrades, data collection and analysis, multi‑layer system design, standards compliance, interface management, and overall system integration for improved service quality and decision support.

Big Data · Data Integration · Government IT

0 likes · 23 min read

IT Architects Alliance

Jan 24, 2022 · Big Data

How to Build a Scalable Big Data Access Control System with Hive, Presto, and Ranger

This article details the design and implementation of a comprehensive big data permission system that integrates Hive, Presto, Hadoop, and Metabase, covering data access methods, authentication choices, Ranger-based authorization, policy management, and automated workflow integration to balance security and efficiency.

Apache Ranger · Big Data · Hive

0 likes · 16 min read

DataFunSummit

Jan 23, 2022 · Big Data

MobTech's Integrated Data Governance Practices and Architecture

This article presents MobTech's comprehensive data governance and security practices, covering the necessity of governance, challenges in large‑scale data environments, the full‑link governance chain, modular architecture, and specific implementations for financial risk‑control scenarios.

Big Data · Data Architecture · Data Management

0 likes · 19 min read

DataFunTalk

Jan 22, 2022 · Big Data

Alibaba Cloud Data Integration (DataX) Architecture, Design Principles, and Solution Overview

This presentation details Alibaba Cloud DataWorks Data Integration (DataX), covering its architecture, core design principles, offline and real‑time synchronization mechanisms, deployment modes, product positioning, use‑case scenarios, and its role within the broader DataWorks ecosystem, highlighting its capabilities for large‑scale data movement and processing.

Alibaba Cloud · Big Data · Data Integration

0 likes · 19 min read

Big Data Technology & Architecture

Jan 19, 2022 · Big Data

Understanding Flink End-to-End Latency Measurement with LatencyMarker

This article explains the background, source‑code analysis, implementation details, metric granularity, and practical considerations of Flink's LatencyMarker feature for measuring full‑link job latency in streaming applications.

Big Data · Flink · Java

0 likes · 12 min read

Big Data Technology & Architecture

Jan 18, 2022 · Big Data

Data Warehouse Data Quality Measurement Standards

The article outlines four key dimensions for evaluating data warehouse data quality—correctness, completeness, timeliness, and consistency—explains common consistency issues such as differing metric values across models, cross‑dimensional aggregations, and real‑time versus batch calculations, and proposes organizational and review mechanisms to mitigate these problems.

Big Data · Consistency · Data Quality

0 likes · 9 min read

DataFunTalk

Jan 16, 2022 · Big Data

Time Series Database Capabilities and Application Scenarios in IoT, Smart Cities, and Edge Computing

This article explains the fundamentals of time‑series data, outlines the architecture and core technical advantages of Baidu Cloud's TSDB, and demonstrates how the database powers IoT, smart‑city, industrial, power‑grid, and autonomous‑driving use cases through multi‑level storage, distributed query optimization, and edge‑cloud integration.

Big Data · Cloud Computing · IoT

0 likes · 11 min read

21CTO

Jan 13, 2022 · Fundamentals

How to Achieve Data Maturity: Turning Data into a Strategic Product

The article explains why data maturity is essential for modern enterprises, defines its three pillars (people, tools, and readiness), shows how treating data as a product follows the same principles as building great products, and outlines the four S's (Speed, Scale, Simplicity, SQL) that guide a mature data ecosystem.

Big Data · Data Product · data governance

0 likes · 6 min read

TAL Education Technology

Jan 13, 2022 · Cloud Native

Offline Mixed Deployment with Kubernetes: Architecture, Implementation, and Performance Evaluation for Big Data Workloads

This article describes a cloud‑native offline mixed‑deployment solution that leverages Kubernetes to share resources between big‑data clusters and business services, outlines its implementation steps, presents detailed performance comparisons between YARN and Kubernetes using TPC‑DS, Spark, and TeraSort workloads, and discusses production experience and future plans.

Big Data · YARN · cloud-native

0 likes · 8 min read
