Tagged articles
3697 articles
DataFunTalk
Apr 9, 2022 · Big Data

Optimizing Apache Iceberg Query Performance with Z‑Order Data Organization

This talk explains how Apache Iceberg’s DataSkipping can lose efficiency with many filter columns, and presents a data‑organization redesign using space‑filling curves and Z‑Order to improve query I/O, detailing the OPTIMIZE syntax, implementation steps, performance benchmarks, and future roadmap.

Apache Iceberg · Big Data · Data Skipping
0 likes · 12 min read
Bilibili Tech
Apr 9, 2022 · Big Data

Bilibili Presto on Hadoop: Architecture, Scaling, and Performance Enhancements

Bilibili’s Presto on Hadoop combines a multi‑engine offline platform with Kubernetes‑managed YARN scheduling, Ranger security, and a custom dispatcher, scaling to over 400 nodes handling 160 k daily queries on 10 PB, while adding coordinator HA, resource‑group punishment, query limits, Alluxio caching, dynamic filtering, and numerous SQL‑level enhancements, with future auto‑scaling and materialized‑view automation.

Big Data · Hadoop · cluster scaling
0 likes · 30 min read
DataFunTalk
Apr 7, 2022 · Big Data

Apache Kyuubi: Architecture, Use Cases, Community, and Mobile Cloud Deployment

This article introduces Apache Kyuubi—a multi‑tenant Thrift JDBC/ODBC service built on Spark—detailing its architecture, advantages over Spark Thrift Server, real‑world use cases, open‑source community progress, and practical deployment strategies on mobile cloud, Kubernetes, and with Trino.

Apache Spark · Big Data · Kyuubi
0 likes · 16 min read
DataFunSummit
Apr 6, 2022 · Big Data

Real-time Dimension Modeling with Flink SQL: Challenges and Solutions

This article presents a JD.com case study on applying Flink SQL for real‑time dimension modeling, detailing two complex streaming scenarios—full‑join of multiple streams and full‑group aggregation—along with the associated challenges of historical data handling, state management, and performance optimization, and proposes component‑based architectural solutions.

Big Data · Flink · Streaming
0 likes · 14 min read
Big Data Technology & Architecture
Apr 5, 2022 · Big Data

Using ElasticsearchSink with Apache Flink: Configuration, Retry Strategies, and Failure Handling

This article introduces the ElasticsearchSink for Apache Flink, explains how to add Maven dependencies, implement the sink with configuration and retry settings, details failure handlers, and highlights important considerations such as exception handling and checkpoint requirements for reliable streaming pipelines.

Big Data · Elasticsearch · Failure Handling
0 likes · 9 min read
DataFunTalk
Apr 4, 2022 · Big Data

Impala Deployment and Optimization in Sensors Data's Multi-Dimensional Analytics Platform

This article details the architecture of Sensors Data's analytics platform, the implementation of a real‑time Impala query engine, multiple query‑performance optimizations—including storage redesign, user‑behavior sequence tuning, join elimination and expression push‑down—and a resource‑estimation framework that dramatically reduces query failures and latency.

Big Data · Data Platform · Impala
0 likes · 16 min read
DataFunTalk
Apr 2, 2022 · Big Data

SuperSQL: A High‑Performance Cross‑Engine, Cross‑Data‑Center SQL Middleware for Big Data

The article introduces SuperSQL, a federated SQL middleware that unifies heterogeneous data sources across multiple data centers, leverages Apache Calcite for cost‑based optimization, pushes down operators to various engines, manages metadata with a Trie model, and demonstrates significant performance gains over traditional solutions.

Big Data · Cross‑Data‑Center · Distributed computing
0 likes · 27 min read
DataFunTalk
Apr 1, 2022 · Operations

Integrated Digital Supply Chain: JD Logistics' Intelligent Planning, Algorithm Platform, and Digital Twin Practices

This article explores JD Logistics' integrated digital supply chain, detailing its evolution, the construction of an algorithm middle‑platform, engineering platforms, digital twin system, real‑world case studies, and future talent and ecosystem directions, illustrating how AI and big‑data technologies drive end‑to‑end logistics optimization.

Algorithm Platform · Big Data · Digital Twin
0 likes · 16 min read
Big Data Technology & Architecture
Mar 31, 2022 · Big Data

Bilibili’s Lakehouse Architecture: Integrating Data Lake and Warehouse with Apache Iceberg

To address the high cost and low efficiency of traditional Hadoop‑based data pipelines, Bilibili designed a lakehouse solution using Apache Iceberg, integrating Spark, Flink, Trino, and Alluxio to unify flexible data lake storage with warehouse‑level query performance, reducing data duplication and improving interactive analytics.

Big Data · Data Warehouse · Iceberg
0 likes · 17 min read
DataFunTalk
Mar 30, 2022 · Big Data

NetEase Big Data Platform: HDFS Optimization and Practice

This article presents NetEase's big data platform architecture, detailing multi‑layer storage and compute design, HDFS deployment challenges, NameNode and NameSpace performance optimizations, cluster scaling strategies, data tiering, hardware upgrades, and real‑world business use cases, illustrating practical large‑scale big data engineering.

Big Data · Cluster Optimization · Data Management
0 likes · 23 min read
21CTO
Mar 30, 2022 · Big Data

What Drives Taobao App Users? Insights from AARRR and RFM Analyses

This article analyzes 2 million Taobao app user‑behavior records using AARRR funnel metrics and RFM segmentation, revealing daily and hourly usage patterns, conversion bottlenecks, product‑search mismatches, and offering data‑driven marketing recommendations to boost retention and sales.

AARRR · Big Data · RFM
0 likes · 25 min read
Bilibili Tech
Mar 30, 2022 · Big Data

HDFS Architecture, Optimizations, and Future Plans at Bilibili

Bilibili’s HDFS now runs a three‑tier architecture—access, metadata, and data layers—enhanced with a custom MergeFS router, observer NameNode, dynamic load balancing, fast‑failover pipelines, and storage‑aware policies, while future work targets transparent erasure coding, tiered data routing, lock refinements, and a Hadoop 3.x migration.

Big Data · Distributed File System · HDFS
0 likes · 22 min read
Efficient Ops
Mar 29, 2022 · Big Data

How Tencent Cloud Boosted APM Metric Computation Speed 2‑3× with Flink Optimizations

This article explains how Tencent Cloud's APM metric calculation, which transforms massive Span data into aggregated metrics using Flink, faced performance bottlenecks and was optimized through job splitting, batch merging, and dimension pruning, ultimately achieving a 2‑3× speed increase and cutting resource usage to about 30% of the original.

APM · Big Data · Flink
0 likes · 10 min read
DataFunTalk
Mar 29, 2022 · Big Data

FlinkX Multi-Source Heterogeneous Data Synchronization Framework: Architecture, Features, and Cloud‑Native Enhancements

This article introduces the FlinkX framework for multi‑source heterogeneous data synchronization, detailing its background, core functions such as checkpoint‑based resume, metric monitoring, rate limiting, plugin architecture, cloud‑native K8s deployment, Hudi integration, and future roadmap, while also addressing common Q&A topics.

Batch · Big Data · Data Lake
0 likes · 14 min read
58 Tech
Mar 29, 2022 · Big Data

Design and Implementation of the 58 Group Penalty Data Center

This article presents the design, architecture, and implementation of a unified penalty data center for 58 Group, detailing the challenges of heterogeneous data sources, the selection of Flink for real‑time ETL, the use of a DSL and LRU aggregation, and the adoption of MVEL for feature recognition to achieve standardized, high‑performance penalty data processing.

Big Data · Data engineering · ETL
0 likes · 13 min read
Big Data Technology & Architecture
Mar 28, 2022 · Big Data

Real-time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents JD's real-time dimension modeling case using Flink SQL, detailing two complex streaming scenarios, the difficulties of handling historical data and state management, and a component‑based solution that leverages external KV stores and optimized Flink operators to improve performance and scalability.

Big Data · Flink · Streaming
0 likes · 13 min read
Architects' Tech Alliance
Mar 28, 2022 · Artificial Intelligence

Digital Twin: Ten Fundamental Questions and Insights for Researchers, Decision‑Makers, and Practitioners

This article analyzes ten fundamental questions about digital twins, covering definitions, stakeholders, global interest, relationship with smart manufacturing, integration with New IT, scientific challenges, standards, and commercial tools, aiming to guide researchers, policymakers, and practitioners in understanding and applying digital twin technology.

AI · Big Data · Digital Twin
0 likes · 22 min read
Bilibili Tech
Mar 25, 2022 · Big Data

Bilibili's YARN Scheduling Optimization Practice: From Heartbeat-Driven to Global Scheduling

Bilibili transformed its YARN CapacityScheduler from a heartbeat‑driven design to a multi‑threaded global scheduler by separating lock handling, adopting Weighted Round‑Robin with DRF, adding batch node selection, fixing proposal inconsistencies, tuning GC and logging, and thereby reduced application allocation time by about 38 % on clusters of up to 8,000 nodes.

Big Data · CapacityScheduler · Hadoop
0 likes · 15 min read
DataFunTalk
Mar 24, 2022 · Big Data

Real‑time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents a JD.com BI engineer's case study on applying Flink SQL to real‑time dimension modeling, detailing two complex streaming scenarios, the technical difficulties of handling historical data and performance, and a component‑based solution architecture with future roadmap considerations.

Big Data · Flink · dimension modeling
0 likes · 13 min read
StarRocks
Mar 23, 2022 · Databases

Accelerating Zepp Health’s Analytics with StarRocks: An OLAP Case Study

Facing inflexible point‑lookup limits and slow query times on HBase, Zepp Health redesigned its massive event‑tracking data pipeline—migrating ingestion through Kafka, Flink, and Hudi to a StarRocks‑based OLAP layer—achieving sub‑100 ms average query latency, 20 % storage savings, and dramatically faster multi‑dimensional analytics.

Big Data · Flink · Hudi
0 likes · 9 min read
DataFunTalk
Mar 23, 2022 · Big Data

Iceberg Data Lake Query Optimization Practices and Governance

This talk by Tencent senior engineer Chen Liang covers Iceberg table format fundamentals, data lake ingestion, query processing, hidden partitioning, time‑travel, major features, optimization techniques such as compaction, bin‑packing, sorting and Z‑ordering, and outlines a future roadmap for improving performance and governance in big‑data environments.

Big Data · Data Lake · Flink
0 likes · 12 min read
Tencent Tech
Mar 21, 2022 · R&D Management

Inside Tencent’s 2021 R&D Report: Coding Trends, AI Advances & Innovation

Tencent’s 2021 R&D Report details a 41% rise in engineering staff, 32 billion new code lines, Go becoming the top language, massive growth in open‑source contributions, breakthroughs in cloud OS, databases, AI, and a commitment to carbon‑neutral technology‑driven social impact.

AI · Big Data · R&D
0 likes · 8 min read
Alibaba Cloud Developer
Mar 15, 2022 · Big Data

How Modern Data Lake Engines Accelerate Analytics: Inside StarRocks Architecture

This article explains why data lakes are essential for today’s analytics, outlines the three main user demands, defines data lakes, compares rule‑based and cost‑based optimizers, explores record‑oriented versus block‑oriented processing, and details StarRocks’ frontend‑backend architecture and benchmark results.

Analytics Engine · Big Data · Data Lake
0 likes · 17 min read
DataFunTalk
Mar 15, 2022 · Big Data

Bilibili's Billion‑Scale Data Synchronization Using Apache SeaTunnel

This article details Bilibili's implementation of a hundred‑terabyte‑per‑day data synchronization pipeline, covering tool selection between DataX‑based Rider and SeaTunnel‑based AlterEgo, architecture design, performance tuning, logging optimization, rate‑limiting strategies, and comprehensive monitoring for large‑scale offline data ingestion and export.

Apache SeaTunnel · Big Data · ClickHouse
0 likes · 13 min read
IT Architects Alliance
Mar 14, 2022 · Big Data

Comprehensive Guide to Kafka Architecture, Core Concepts, and Production Deployment

This article provides an in‑depth overview of Kafka, covering why messaging systems are needed, core concepts, cluster architecture, performance optimizations such as sequential disk writes and zero‑copy, hardware sizing, replication, consumer groups, offset management, rebalance strategies, and practical deployment and operational guidelines.

Big Data · Cluster Deployment · Distributed Messaging
0 likes · 35 min read
BaiPing Technology
Mar 14, 2022 · Big Data

Mastering DataWorks & MaxCompute: A Complete Guide to Big Data Architecture and Governance

DataWorks, Alibaba Cloud’s comprehensive PaaS platform, combined with the serverless MaxCompute data warehouse, offers an integrated solution for data integration, development, quality, and services, while detailed naming and layer conventions ensure scalable, maintainable big‑data architectures and effective governance across ODS, CDM, DWD, DWS, and ADS layers.

Big Data · DataWorks · MaxCompute
0 likes · 8 min read
DataFunTalk
Mar 13, 2022 · Big Data

Tencent Data Lake Metadata Governance Practice and Architecture

This article presents Tencent's data lake metadata governance practice, covering data lake fundamentals, the 3+2 architecture of storage, compute and unified metadata, multi‑tenant design, the re‑implemented Hive Metastore for online catalog, performance optimizations, and offline data‑governance capabilities.

Big Data · Cloud Computing · Data Lake
0 likes · 18 min read
DevOps
Mar 11, 2022 · Cloud Computing

Informationization vs. Digital Transformation: Definitions, Differences, and Their Impact on Chinese Enterprises

The article explains the definitions of informationization and digital transformation, compares their technical, demand, core‑goal, and ecosystem differences, and analyzes how digital technologies such as cloud, big data and AI are reshaping industries, enterprise strategies, talent needs, and overall competitiveness in China.

Big Data · China · Digital Transformation
0 likes · 14 min read
vivo Internet Technology
Mar 9, 2022 · Big Data

Incremental Synchronization of Massive HBase Data to a Data Warehouse: Solution Overview and Performance Evaluation

The paper proposes a generic, timeRange‑based incremental extraction method for synchronizing tens of billions of HBase rows to a data warehouse, demonstrating that it avoids full‑table scans, automatically detects schema changes, and delivers significantly lower latency than Hive mapping or timestamp‑based approaches, and has been integrated into a unified big‑data platform.

Big Data · HBase · Performance evaluation
0 likes · 8 min read
DataFunTalk
Mar 3, 2022 · Big Data

Youzan Data Platform and DP Data Development Platform: Architecture, Core Modules, and Scheduling System Upgrade

This article presents an in‑depth overview of Youzan's data platform, introduces the DP data development platform with its key features and workflow, details the core module architecture—including service, scheduling, and component layers—and explains the migration from Airflow to DolphinScheduler to improve performance, stability, and scalability.

Big Data · Data Development · Data Platform
0 likes · 14 min read
IT Xianyu
Mar 3, 2022 · Databases

Introducing SPL: An Open‑Source Structured Data Processing Language with Full SQL‑92 Capabilities

SPL is an open‑source structured data processing language that extends full SQL‑92 functionality to a wide range of data sources—including CSV, Excel, JSON, NoSQL and Hadoop—allowing developers to perform complex queries, multi‑step calculations, and mixed‑source analytics without a traditional relational database.

Big Data · Data Integration · SPL
0 likes · 14 min read
AntTech
Mar 1, 2022 · Big Data

Graph Computing at Ant Group: From Fraud Prevention to Industry‑Wide Impact

The article explains how Ant Group leverages large‑scale graph computing—through its GeaBase and TuGraph platforms and a dedicated research team—to enhance real‑time fraud detection, drive industry standards, and explore future applications across finance, energy, and public services.

Ant Group · Big Data · TuGraph
0 likes · 7 min read
DataFunTalk
Mar 1, 2022 · Cloud Native

Alibaba Cloud Native Data Lake with Apache Iceberg: Architecture, Challenges, and Solutions

The presentation outlines Alibaba Cloud's native data lake solution built on Apache Iceberg, covering data lake fundamentals, cloud migration challenges, Iceberg's architecture and features, real‑time ingestion with Flink, unified metadata management, security guarantees, and testing practices to ensure reliable, scalable big‑data analytics.

Apache Iceberg · Big Data · Data Lake
0 likes · 16 min read
Architects Research Society
Feb 26, 2022 · Big Data

Introduction to Azure Data Lake Analytics (ADLA) and Its Architecture

This article introduces Azure Data Lake Analytics, explains how data lakes differ from traditional warehouses, outlines the ETL process, highlights the benefits of schema‑on‑read storage, and describes the four‑stage Azure data platform architecture for ingesting, storing, processing, and analyzing massive datasets.

Azure · Big Data · U-SQL
0 likes · 5 min read
Kuaishou Big Data
Feb 25, 2022 · Big Data

How Kuaishou Scales Data Sync: Architecture, Challenges, and Future Plans

This article details the design, evolution, and optimization of Kuaishou's data synchronization platform, covering business overview, architecture, key technologies, performance tuning, data source protection, incremental data lake integration, and future roadmap for a unified data fabric.

Big Data · Real-time Processing · architecture
0 likes · 15 min read
DataFunTalk
Feb 25, 2022 · Big Data

Tencent's Application of Apache Iceberg for Real‑Time Data Lake Ingestion, Governance, and Query Optimization

This article explains how Tencent leverages Apache Iceberg together with Flink to build a real‑time data lake pipeline, covering data ingestion, Iceberg's snapshot‑based read/write model, compaction and governance services, Z‑order based query optimization, performance results, and future roadmap.

Apache Iceberg · Big Data · Data Lake
0 likes · 24 min read
Big Data Technology & Architecture
Feb 23, 2022 · Big Data

Understanding Mini‑Batch Streaming Aggregation in Flink SQL

This article explains Flink SQL’s streaming aggregation Mini‑Batch feature, covering its purpose, configuration parameters, underlying optimizer rules, operator implementations, watermark handling, buffer processing, and the optional Local‑Global two‑phase aggregation optimization for improving throughput and reducing state overhead in large‑scale data pipelines.

Big Data · Flink · Mini-Batch
0 likes · 10 min read
DataFunTalk
Feb 23, 2022 · Big Data

NetEase Data Platform DataOps Practices for Improving Data Quality

This DataFunTalk presentation from NetEase examines the growing data‑quality challenges in data development and shows how DataOps principles, including pre‑ and post‑control mechanisms, sandbox environments, and automated governance tools, can systematically reduce defects, optimize resources, and ensure reliable data delivery across the company's diverse business lines.

Big Data · Data Platform · DataOps
0 likes · 14 min read
Architects' Tech Alliance
Feb 22, 2022 · Cloud Computing

Understanding China's “East Data West Computing” Initiative: Goals, Rationale, and Implementation

The “East Data West Computing” program is a national strategy that relocates computing workloads from data‑intensive eastern regions to resource‑rich western areas by building a network of data‑center hubs and clusters, aiming to balance supply and demand, improve energy efficiency, and boost overall computing capacity.

Big Data · Data Centers · East Data West Computing
0 likes · 7 min read
ByteDance Data Platform
Feb 21, 2022 · Big Data

Choosing the Right Components for Enterprise Data Warehouses: Hive vs SparkSQL

This article examines how to design enterprise‑grade data warehouses by evaluating development convenience, ecosystem, decoupling, performance and security, compares Hive and SparkSQL along with other engines such as Presto, Doris and ClickHouse, and outlines best‑practice component selections for long‑running batch and interactive analytics.

Big Data · Data Warehouse · ETL
0 likes · 19 min read
DataFunTalk
Feb 19, 2022 · Big Data

Fundamentals of Data Middle Platform: Logic, Principles, and Practice

This article explains what a data middle platform is, why organizations need it, its core principles, technical architecture, and practical implementation guidelines, highlighting how it solves issues like inconsistent metrics, duplicate data construction, low query efficiency, poor data quality, and high development costs.

Big Data · Data Architecture · Data Middle Platform
0 likes · 14 min read
Bilibili Tech
Feb 18, 2022 · Big Data

Evolution of Bilibili's Data Retrieval Services and Lakehouse Architecture

Bilibili’s data retrieval journey progressed from a fragmented, chimney‑style pipeline to a unified Flink‑based service layer with the Ark construction system and Akuya SQL engine, and finally to an Iceberg‑driven lakehouse that eliminates data duplication, streamlines cross‑engine optimization, and offers platformized, low‑latency analytics.

Big Data · Bilibili · Data Retrieval
0 likes · 14 min read
Alimama Tech
Feb 16, 2022 · Big Data

Target Group Discovery: Framework, Models, and Case Study

The article presents a target‑group discovery framework that combines goal definition, rule‑ or model‑based segmentation, tiered metrics, benchmarking and quadrant analysis to identify and characterize advantageous, problematic, or weak consumer, product, or merchant sub‑groups, illustrated by an FMCG e‑commerce case study diagnosing high‑share, low‑growth categories.

Benchmarking · Big Data · Marketing Analytics
0 likes · 13 min read
Big Data Technology & Architecture
Feb 16, 2022 · Big Data

Using Flink CDC to Capture MySQL Changes and Sync Them to ClickHouse

This article introduces Change Data Capture (CDC), compares query‑based and log‑based approaches, explains Debezium and ClickHouse, and provides detailed Flink CDC and Flink SQL CDC examples—including Java source code, custom deserialization schema, ClickHouse sink implementation, and required Maven dependencies—to synchronize MySQL data into ClickHouse in real time.

Big Data · CDC · ClickHouse
0 likes · 17 min read
dbaplus Community
Feb 15, 2022 · Big Data

Mastering Data Warehouse Architecture: Concepts, Modeling Techniques, and Real‑Time Strategies

This comprehensive guide explains data warehouse fundamentals, architecture layers, modeling methods such as dimensional and entity modeling, metadata management, and the transition from offline to real‑time processing with Lambda and Kappa architectures, providing practical steps, best practices, and key terminology for building robust analytical platforms.

Big Data · Data Warehouse · ETL
0 likes · 63 min read
IT Architects Alliance
Feb 15, 2022 · Artificial Intelligence

How a Scalable Recommendation Engine Evolved: From V1.0 to V3.0

This article details the evolution of an e‑commerce recommendation system through three architectural versions, highlighting the initial simple design, the challenges that prompted vertical and horizontal splits, the introduction of a configurable pipeline and AB testing, and the final micro‑service‑based, dynamically configurable V3.0 architecture.

AI · Big Data · Scalability
0 likes · 14 min read
DataFunTalk
Feb 13, 2022 · Big Data

How Kuaishou Built a Standardized Data Governance Evaluation System

This article outlines Kuaishou’s approach to establishing a standardized data governance evaluation framework, detailing the challenges of large‑scale data management, the design of assessment metrics across model, quality, and cost dimensions, and the practical strategies and operational mechanisms used to improve data asset health and business value.

Big Data · Evaluation Framework · Kuaishou
0 likes · 21 min read
Big Data Technology & Architecture
Feb 13, 2022 · Big Data

What's New in Elasticsearch 8.0 – Key Features and Changes

The article provides a comprehensive overview of Elasticsearch 8.0, highlighting major updates such as 7.x REST API compatibility headers, default-enabled security, system‑index protection, a new KNN search API, storage and indexing optimizations, PyTorch model support, and numerous deprecations and feature removals across the stack.

8.0 · API · Big Data
0 likes · 10 min read
DataFunTalk
Feb 12, 2022 · Big Data

NetEase Internal Data Lake Project Arctic: Architecture, Requirements, and Future Roadmap

This article introduces NetEase's internally incubated data lake project Arctic, explains the concept of data lakes, outlines NetEase's specific requirements for a unified streaming‑batch platform, details Arctic's core architecture, storage strategy, data‑merge mechanisms, current achievements, and future development plans.

Apache Iceberg · Arctic · Big Data
0 likes · 10 min read
Programmer DD
Feb 12, 2022 · Databases

What’s New in Elasticsearch 8.0? Key Features and Migration Tips

Elasticsearch 8.0 introduces major changes such as 7.x REST API compatibility headers, default‑enabled security with registration tokens, protected system indices, a technical preview of KNN search, storage‑saving field encodings, faster geo‑point indexing, PyTorch model support for NLP, and numerous deprecations and improvements across aggregations, allocation, analysis, authentication, cluster coordination, and packaging.

API · Big Data · Elasticsearch
0 likes · 10 min read
21CTO
Feb 11, 2022 · Cloud Computing

What Will Shape Software Development in 2022? 20 Key Trends Revealed

The article surveys 2022 software‑development forecasts, covering centralized and edge cloud infrastructure, multi‑cloud adoption, containers, security, blockchain, AI, low‑code, databases, big‑data engines, streaming, DevOps observability, programming languages, front‑end frameworks, and mobile development, offering a comprehensive outlook for practitioners and decision‑makers.

2022 trends · Big Data · software development
0 likes · 21 min read
Zhengcaiyun Tech (政采云技术)
Feb 8, 2022 · Industry Insights

Unlocking Enterprise Value with a Data Middle Platform: Architecture & Indicators

This article traces the evolution from traditional data warehouses to modern data lakes and data middle platforms, explains why siloed data development hampers efficiency, and details the architecture and indicator‑library design used by Zhengcaiyun to achieve unified, reusable data services.

Big DataData LakehouseData Middle Platform
0 likes · 14 min read
IT Architects Alliance
Feb 8, 2022 · Backend Development

Designing a Daily Million-Transaction Payment Reconciliation System

This article explains how to architect a payment reconciliation system that can reliably process tens of millions of transactions per day, covering the underlying logic, scalability challenges, data collection methods, big‑data integration, and step‑by‑step processing flows to ensure accurate financial matching.

Backend Architecture · Big Data · Hive
0 likes · 32 min read
DataFunTalk
Feb 3, 2022 · Big Data

Improving Data Processing Efficiency at Kuaishou with Apache Hudi

This article explains how Kuaishou tackled latency and efficiency problems in large‑scale data pipelines by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, its architecture, model design, handling of bursty updates, back‑fill scenarios, and operational safeguards.

Big Data · Data Lake · Flink
0 likes · 13 min read
DataFunTalk
Jan 28, 2022 · Big Data

Real-Time Customer Data Platform (RT‑CDP) Architecture and Implementation at iFanFan

This article explains the concept, challenges, and key business goals of a real‑time Customer Data Platform, details the technology stack selection—including Nebula Graph, Apache Flink, Apache Beam, Kudu, and Doris—and describes the modular architecture, data model, identity service, streaming computation, storage layers, rule engine, operational results, and future directions.

Big Data · CDP · Data Integration
0 likes · 43 min read
JD Retail Technology
Jan 27, 2022 · Big Data

How JD’s Custom Spark Engine Tackles Data Skew for Massive Offline Jobs

This article explains JD’s self‑developed data‑skew mitigation solution for Spark, detailing the problem of uneven key distribution, the limitations of the open‑source AQE implementation, and JD’s OptimizeSkewedJoinV2 algorithm that dramatically reduces stage latency in large‑scale join workloads.

Adaptive Query Execution · Big Data · Data Skew
0 likes · 13 min read
DataFunTalk
Jan 27, 2022 · Big Data

Kyuubi: NetEase’s Open‑Source Multi‑Tenant SQL Engine for Large‑Scale Data Processing

This article introduces Kyuubi, the first NetEase project contributed to the Apache Foundation, describing its core features, multi‑tenant architecture, Spark‑based execution engine, cloud‑native capabilities, and real‑world use cases within NetEase’s data‑warehouse, ad‑hoc, and internal systems, along with performance gains and community resources.

Apache · Big Data · Kyuubi
0 likes · 23 min read
IT Xianyu
Jan 27, 2022 · Big Data

Installing Apache Hive on macOS with Hadoop and MySQL Metastore

This tutorial provides step‑by‑step instructions for installing Hadoop 3.1.1, Homebrew, Hive, and configuring MySQL as Hive's metastore on macOS, including environment variable setup, hive‑site.xml configuration, MySQL connector placement, schema initialization, and verification commands.

Big Data · Hadoop · Hive
0 likes · 6 min read
dbaplus Community
Jan 26, 2022 · Big Data

Why Does Elasticsearch Aggregate Faster with Fewer Terms? Uncover the Secrets

This article examines a real‑world Elasticsearch cluster handling hundreds of terabytes, explains why high‑cardinality aggregations can be slower, and shows how setting execution_hint=map and tuning doc_values dramatically improves aggregation performance for ultra‑high‑concurrency workloads.

Big Data · Elasticsearch · Performance Optimization
0 likes · 12 min read
Architects Research Society
Jan 25, 2022 · Big Data

Azure Data Lake Storage Gen2: Design Guide, Best Practices, and Operational Considerations

This guide provides a comprehensive overview of Azure Data Lake Storage Gen2, covering when to use it, key design considerations, data organization strategies, access control models, file formats, cost‑optimization techniques, monitoring approaches, and performance‑tuning tips for large‑scale big‑data workloads.

ADLS Gen2 · Azure · Big Data
0 likes · 41 min read
DataFunTalk
Jan 25, 2022 · Big Data

Summary of Flink Forward Asia 2021: Community Growth, Cloud‑Native Deployment, Streaming‑Batch Integration, and Machine Learning

The article provides a comprehensive English summary of the 2021 Flink Forward Asia conference, covering community statistics, cloud‑native deployment modes, fault‑tolerance checkpoint advances, the evolution of streaming‑batch integration, the introduction of Streaming Warehouse, Flink ML 2.0, real‑time use cases at ByteDance and ICBC, Pravega storage innovations, and concluding reflections on the future of real‑time big data processing.

Apache Flink · Big Data
0 likes · 25 min read
IT Architects Alliance
Jan 25, 2022 · Operations

Design and Architecture of a Shared Resource Platform and Its Technical System

This document outlines the logical and technical architecture of a government shared resource platform, describing application system upgrades, data collection and analysis, multi‑layer system design, standards compliance, interface management, and overall system integration for improved service quality and decision support.

Big Data · Data Integration · Government IT
0 likes · 23 min read
DataFunSummit
Jan 23, 2022 · Big Data

MobTech's Integrated Data Governance Practices and Architecture

This article presents MobTech's comprehensive data governance and security practices, covering the necessity of governance, challenges in large‑scale data environments, the full‑link governance chain, modular architecture, and specific implementations for financial risk‑control scenarios.

Big Data · Data Architecture · Data Management
0 likes · 19 min read
DataFunTalk
Jan 22, 2022 · Big Data

Alibaba Cloud Data Integration (DataX) Architecture, Design Principles, and Solution Overview

This presentation details Alibaba Cloud DataWorks Data Integration (DataX), covering its architecture, core design principles, offline and real‑time synchronization mechanisms, deployment modes, product positioning, use‑case scenarios, and its role within the broader DataWorks ecosystem, highlighting its capabilities for large‑scale data movement and processing.

Alibaba Cloud · Big Data · Data Integration
0 likes · 19 min read
Big Data Technology & Architecture
Jan 18, 2022 · Big Data

Data Warehouse Data Quality Measurement Standards

The article outlines four key dimensions for evaluating data warehouse data quality—correctness, completeness, timeliness, and consistency—explains common consistency issues such as differing metric values across models, cross‑dimensional aggregations, and real‑time versus batch calculations, and proposes organizational and review mechanisms to mitigate these problems.

Big Data · Consistency · Data Quality
0 likes · 9 min read
DataFunTalk
Jan 16, 2022 · Big Data

Time Series Database Capabilities and Application Scenarios in IoT, Smart Cities, and Edge Computing

This article explains the fundamentals of time‑series data, outlines the architecture and core technical advantages of Baidu Cloud's TSDB, and demonstrates how the database powers IoT, smart‑city, industrial, power‑grid, and autonomous‑driving use cases through multi‑level storage, distributed query optimization, and edge‑cloud integration.

Big Data · Cloud Computing · IoT
0 likes · 11 min read
21CTO
Jan 13, 2022 · Fundamentals

How to Achieve Data Maturity: Turning Data into a Strategic Product

The article explains why data maturity is essential for modern enterprises, defines its three pillars (people, tools, and readiness), shows how treating data as a product follows the same principles that make great products, and outlines the four S's (Speed, Scale, Simplicity, SQL) that guide a mature data ecosystem.

Big Data · Data Product · data governance
0 likes · 6 min read
TAL Education Technology
Jan 13, 2022 · Cloud Native

Offline Mixed Deployment with Kubernetes: Architecture, Implementation, and Performance Evaluation for Big Data Workloads

This article describes a cloud‑native offline mixed‑deployment solution that leverages Kubernetes to share resources between big‑data clusters and business services, outlines its implementation steps, presents detailed performance comparisons between YARN and Kubernetes using TPC‑DS, Spark, and TeraSort workloads, and discusses production experience and future plans.

Big Data · YARN · cloud-native
0 likes · 8 min read