Tagged articles
3697 articles
DataFunSummit
Jan 10, 2023 · Big Data

Exploring Iceberg in Huawei Terminal Cloud: Architecture, Features, and Future Plans

This article gives an overview of Iceberg's adoption in Huawei Terminal Cloud, covering its architecture, key features such as Git‑style data management, real‑time processing, and acceleration layers, and future development directions, along with a Q&A session addressing performance and implementation details.

Big Data · Data Lake · Flink
0 likes · 15 min read
Alibaba Cloud Big Data AI Platform
Jan 10, 2023 · Big Data

How Alibaba’s Dolphin Engine Uses Flink + Hologres for Real‑Time Big Data

The Dolphin engine, built by Alibaba’s Data Engine team, combines Flink and Hologres to deliver ultra‑large‑scale OLAP, streaming, batch, and AI capabilities for real‑time advertising analytics, offering smart materialization, intelligent indexing, and vector recall while supporting millions of advertisers and petabyte‑level data.

AI · Big Data · Flink
0 likes · 13 min read
DataFunSummit
Jan 9, 2023 · Big Data

JD Data‑Driven Business Development: Building a Business Metric Data System and Marketplace Governance

The article outlines JD's data‑driven business development strategy, describing the current challenges of its business data marketplace, the governance framework—including layered architecture, standardization, ClickHouse dictionary refresh, and optimization measures—and the resulting performance improvements and future outlook.

Big Data · ClickHouse · JD.com
0 likes · 13 min read
DataFunTalk
Jan 8, 2023 · Big Data

ByteDance Event‑Tracking Data Cost Governance Practices

This article describes ByteDance's comprehensive approach to managing its massive volume of event‑tracking data, detailing the background, cost‑reduction strategies, experience review, future plans, and a Q&A session that together illustrate how systematic data governance can dramatically cut storage and processing expenses.

Big Data · ByteDance · Sampling
0 likes · 18 min read
DataFunSummit
Jan 7, 2023 · Big Data

Redefining the Customer Data Platform (CDP) for New Energy Vehicle Companies

This article explores why the automotive industry's shift to new energy vehicles necessitates a redefinition of the Customer Data Platform (CDP), detailing the changing traffic structure, varied departmental demands, CDP typologies, implementation strategies, and the benefits of a unified, extensible CDP architecture for marketing, sales, and after‑sales.

Automotive · Big Data · CDP
0 likes · 13 min read
Data Thinking Notes
Jan 5, 2023 · Big Data

Why Data Lakes Are Outshining Traditional Data Warehouses: A Deep Dive

This comprehensive guide explains the evolution from traditional data warehouses to modern data lakes, detailing concepts, architectures, differences, implementation steps, and real‑world case studies, while also comparing major cloud providers' solutions and highlighting how data platforms support digital transformation and analytics.

Analytics · Big Data · Data Lake
0 likes · 97 min read
StarRing Big Data Open Lab
Jan 4, 2023 · Big Data

Choosing the Right Data Architecture: Warehouse, Mart, or Lake?

Understanding enterprise data platforms requires grasping the differences between data warehouses, data marts, and data lakes, their architectures, use cases, and key capabilities such as integration, real‑time processing, governance, and cost control, to guide organizations in building scalable, flexible data solutions.

Big Data · Data Mart
0 likes · 15 min read
DataFunSummit
Jan 4, 2023 · Big Data

Data Intelligence Expert Interview – Maturity, Trends, and Practices of Data Middle Platforms

The interview gathers insights from data‑platform experts on the maturity stages, technology trends, implementation methodologies, open‑source ecosystems, system architectures, governance, security, and assessment criteria of modern data middle platforms, offering a comprehensive guide for practitioners.

Big Data · Data Observability · Data Platform
0 likes · 28 min read
Data Thinking Notes
Jan 3, 2023 · Big Data

How a Scalable Data Service Platform Transforms Big Data into APIs

This article outlines the design and implementation of a unified data service platform that standardizes data access, accelerates model processing, provides flexible API construction, and ensures high availability through gateway, caching, and monitoring, ultimately reducing cost and improving efficiency for both C‑end and B‑end applications.

Big Data · Data Platform · Service Architecture
0 likes · 25 min read
Tencent Cloud Developer
Jan 3, 2023 · Big Data

How Tencent’s Cloud‑Native Lakehouse Tackles PB‑Scale Performance Challenges

This article analyzes Tencent Cloud’s DLC lakehouse solution, explaining the unified data lake‑warehouse architecture, the performance hurdles of object‑storage‑based analytics, and the multi‑dimensional caching, virtual‑cluster elasticity, and advanced filter techniques that enable second‑level analysis on petabyte‑scale data while reducing costs.

Big Data · Caching · DLC
0 likes · 13 min read
ITPUB
Jan 3, 2023 · Databases

How DragonF MPP DB Redefines Cloud‑Native Data Warehousing at Massive Scale

The article details the design, core features, and real‑world performance of the DragonF MPP DB, a cloud‑native, compute‑storage‑separated database that overcomes traditional MPP limitations, supports millions of daily jobs, and outlines its future roadmap for ultra‑large‑scale data platforms.

Big Data · Data Warehouse · MPP
0 likes · 11 min read
Big Data Technology & Architecture
Jan 3, 2023 · Big Data

Migrating Hive SQL Jobs to Flink Using the SQL Gateway

This article explains how to use Apache Flink 1.16's SQL Gateway to migrate Hive SQL tasks to Flink, covering the underlying Hive‑on‑Flink architecture, dialect compatibility, streaming and batch demos, configuration details, and practical tips for developers and platform engineers.

Batch processing · Big Data · Flink
0 likes · 19 min read
DataFunTalk
Jan 3, 2023 · Big Data

Tencent Unified Big Data Scheduling Platform – Architecture, Design, and Operations

The article presents an in‑depth overview of Tencent's self‑developed Unified Scheduling Platform, detailing its system architecture, design challenges, performance optimizations, resource‑fair scheduling mechanisms, operational metrics, future roadmap, and a Q&A session that together illustrate how the platform enables massive offline data processing at scale.

Big Data · Distributed Systems · Performance Optimization
0 likes · 18 min read
Code Ape Tech Column
Jan 3, 2023 · Big Data

Elasticsearch vs ClickHouse: Performance, Cost, and Deployment Guide

This article compares Elasticsearch and ClickHouse in terms of write throughput, query speed, and server cost, then provides a step‑by‑step deployment guide for a private data pipeline using Zookeeper, Kafka, FileBeat, and ClickHouse, along with common issues and their solutions.

Big Data · ClickHouse · Elasticsearch
0 likes · 15 min read
Top Architect
Jan 2, 2023 · Big Data

Optimizing Kafka at Meituan: Challenges and Solutions for a Large‑Scale Data Platform

This article details Meituan's use of Kafka as a unified data cache and distribution layer, outlines the challenges of massive scale and latency, and presents comprehensive optimizations across application, system, and cluster management layers, including disk balancing, migration acceleration, fetcher isolation, and full‑link monitoring.

Big Data · Distributed Systems · Kafka
0 likes · 22 min read
ITPUB
Dec 31, 2022 · Databases

Why HBase? Strengths, Weaknesses, Real‑World Scenarios, and Architecture Explained

This article examines HBase’s high reliability and performance as a column‑oriented NoSQL store, outlines its advantages and limitations, presents two practical use cases from e‑commerce, and details its data model, architecture components, and design considerations for effective deployment.

Big Data · Data Storage · HBase
0 likes · 12 min read
Aikesheng Open Source Community
Dec 31, 2022 · Databases

Understanding ClickHouse Performance: Storage Engine and Compute Engine Perspectives

This article explains why ClickHouse delivers high query speed by detailing storage‑engine optimizations such as pre‑sorting, columnar layout and compression, and compute‑engine techniques like vectorized execution, built‑in functions and minimal join usage, while also promoting the related book and giveaway.

Big Data · ClickHouse · OLAP
0 likes · 9 min read
Architect's Tech Stack
Dec 30, 2022 · Big Data

Distributed Computing Is Not a Panacea for Big Data: Prioritize Single‑Node Performance First

While distributed clusters are popular for big‑data processing, they are not a universal solution; tasks that are hard to partition or involve heavy cross‑node communication often perform better on a well‑optimized single machine, making a careful analysis of workload characteristics essential before scaling out.

Algorithm Optimization · Big Data · Distributed computing
0 likes · 14 min read
DataFunTalk
Dec 29, 2022 · Big Data

Design and Implementation of OPPO's Big Data Diagnostic Platform (Compass)

This article presents the background, requirements, architecture, key modules, and practical impact of OPPO's non‑intrusive big‑data diagnostic platform—named Compass—designed to quickly locate issues, provide optimization suggestions, and achieve cost‑saving and efficiency gains for large‑scale Spark and Hadoop workloads.

Big Data · Cost Reduction · Hadoop
0 likes · 17 min read
ByteDance Data Platform
Dec 28, 2022 · Big Data

How Cloud Data Warehouses Are Shaping the Future of Big Data and DataOps

This article examines the four‑stage evolution of data warehouses, highlights the cost‑effective, scalable advantages of cloud‑native warehouses, explores the rapid growth of data‑management infrastructure, and discusses the emerging practices of DataOps and AI integration that are redefining modern data stacks.

AI · Big Data · Data Management
0 likes · 15 min read
Big Data Technology & Architecture
Dec 28, 2022 · Big Data

Flink 1.16 Highlights: Adaptive Batch Scheduling, Speculative Execution, Hybrid Shuffle, Dynamic Partition Pruning, Hive SQL Migration, Checkpoint Enhancements, CDC Integration, and Table Store

Flink 1.16 introduces adaptive batch scheduling, speculative execution, hybrid shuffle, dynamic partition pruning, improved Hive SQL compatibility, advanced checkpoint mechanisms including changelog backend, and integrates CDC with Kafka and Table Store, offering faster, more stable, and easier-to-use stream‑batch processing capabilities.

Big Data · CDC · Checkpoint
0 likes · 8 min read
High Availability Architecture
Dec 27, 2022 · Big Data

Design and Implementation of a Data Service Middle Platform for Scalable Data SaaS

This article presents a comprehensive overview of a data service middle platform, detailing its background, architectural design, data construction, model definition and acceleration, API creation, query processing, service gateway, common solutions for standardization and cost reduction, as well as achieved results and future plans.

API · Big Data · Data Platform
0 likes · 22 min read
Tencent Advertising Technology
Dec 27, 2022 · Big Data

Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Big Data · Data Lake · Flink
0 likes · 20 min read
DataFunTalk
Dec 24, 2022 · Big Data

Evolution of Data Platforms: From Early Computers to the Modern Data Stack

This article traces the history of data platforms—from the first general‑purpose computers and traditional BI, through the rise of data warehouses, big‑data frameworks like Hadoop, Spark and Flink, to the modern data‑stack era with cloud‑native architectures, Lambda/Kappa models, and emerging tools—highlighting key technologies, architectural shifts, and future prospects.

Big Data · Cloud Computing · Data Warehouse
0 likes · 26 min read
DataFunSummit
Dec 24, 2022 · Operations

Understanding DataOps: Evolution, Technology Stacks, and Industry Applications

This article explores DataOps from its historical evolution through the digital 3.0 era, outlines its core technology stacks such as Data Fabric, Data Mesh, and Modern Data Stack, and demonstrates practical applications across finance, manufacturing, telecom, and public services, highlighting its role in agile, cloud‑native data management.

Big Data · DataOps · data governance
0 likes · 18 min read
Bilibili Tech
Dec 23, 2022 · Big Data

Data Service Platform Architecture and Design

The article outlines a standardized data‑service platform built atop a warehouse, detailing its construction, query, and gateway layers—supporting model definition, acceleration, reusable APIs, unified DSL/SQL interfaces, and observability—to solve ingestion, definition, and lineage issues, achieving 500+ APIs, sub‑day creation, and 18% cost reduction.

Big Data · Data Service · api-gateway
0 likes · 22 min read
DataFunSummit
Dec 22, 2022 · Big Data

SeaTunnel: An Open‑Source Ultra‑Scale Data Integration Platform – Design Goals, Architecture, and Future Roadmap

This article introduces SeaTunnel, an open‑source ultra‑large‑scale data integration platform, covering its design objectives, current status with over 50 connectors and multi‑engine support, overall architecture, execution flow, connector translation, source and sink APIs, global commit strategies, table & catalog APIs, and the upcoming roadmap for connector expansion, a web UI, and a dedicated engine.

Big Data · Connector · SeaTunnel
0 likes · 10 min read
ITPUB
Dec 21, 2022 · Big Data

How Bilibili Optimized Flink Runtime for Massive Real‑Time Jobs

This article details Bilibili's extensive enhancements to the Flink runtime—including checkpoint recoverability, max‑parallelism calculations, State Processor API extensions, Full and Regional Checkpoints, hybrid HA, task‑level recovery, load‑balanced partitioners, and large‑scale cluster maintenance—to improve reliability and performance of its billion‑scale streaming workloads.

Big Data · Checkpoint · Flink
0 likes · 33 min read
DataFunSummit
Dec 21, 2022 · Big Data

Big Data Platform Architecture: Expert Insights on Components, Challenges, and Trends

An expert interview series examines the architecture of big data platforms, detailing core modules such as data integration, storage, computation, scheduling, and query analysis, while highlighting current challenges, best‑practice tools, and future trends like cloud‑native, object storage, and real‑time processing.

Big Data · Distributed computing · Query Engines
0 likes · 12 min read
Xianyu Technology
Dec 21, 2022 · Artificial Intelligence

Xianyu Recommendation System: Architecture, Challenges, and Deployment

The Xianyu recommendation system, built by backend expert Wan Xiaoyong, evolved from offline scoring to a full‑graph, serverless recall‑ranking pipeline that tackles C2C uncertainties through centralized feature engineering, model compression, staged deployment, flexible experimentation, robust governance, and plans for automated attribution and interpretability.

AI · Big Data · Feature Engineering
0 likes · 10 min read
DataFunSummit
Dec 20, 2022 · Big Data

JD Retail Big Data OLAP Application and Practice

This talk presents JD Retail’s big‑data OLAP solution, covering the massive, variable and complex traffic data challenges, the custom data‑ingestion and versioned update tools, ClickHouse query‑architecture upgrades, optimization techniques, and future plans for multi‑cluster querying and pre‑computation.

Big Data · ClickHouse · JD Retail
0 likes · 21 min read
Top Architect
Dec 20, 2022 · Databases

Elasticsearch DSL Query Syntax Overview (Version 7.x)

This article provides a comprehensive beginner-friendly guide to Elasticsearch 7.x DSL query syntax, covering core keywords, mapping types, query examples, boolean logic, and code snippets to help readers understand and construct effective search queries.

Big Data · DSL · Database
0 likes · 8 min read
Data Thinking Notes
Dec 19, 2022 · Big Data

Data Quality Mastery: From Expectations to Operational Assurance

This article outlines a comprehensive data quality management framework, covering expectations, measurement, assurance, and operational practices, and provides concrete templates, rule designs, and governance processes to help data teams systematically assess, monitor, and improve data reliability throughout the lifecycle.

Big Data · Data Quality · data governance
0 likes · 18 min read
ITPUB
Dec 18, 2022 · Databases

Why ClickHouse Is So Fast: Deep Dive into Storage and Compute Engine Optimizations

This article explains how ClickHouse achieves high query performance by leveraging storage‑engine designs such as pre‑sorting, columnar layout, and block‑level compression, and by exploiting a vectorized compute engine while avoiding joins and using built‑in functions.

Big Data · ClickHouse · Columnar Storage
0 likes · 9 min read
Big Data Technology & Architecture
Dec 15, 2022 · Big Data

Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans

This technical article presents a comprehensive overview of migrating Hive SQL to Flink SQL, covering the motivations behind the migration, key challenges such as compatibility, stability and performance, practical implementation steps, a detailed demo, future development directions, and a Q&A session addressing common concerns.

Batch processing · Big Data · Data Lake
0 likes · 13 min read
DataFunTalk
Dec 14, 2022 · Big Data

Cloud‑Native Big Data Solutions for the Financial Industry: Architecture, Deployment, Scheduling, and Resource Management

This article explains why the financial sector is moving its big‑data workloads to cloud‑native platforms, compares cloud‑native systems with traditional Hadoop, describes deployment options such as Serverless YARN and Arcee Operator, and details the high‑performance GRO scheduler, agent, and ResLake resource‑lake architecture that together improve resource utilization, reduce costs, and ensure reliable, low‑latency processing for finance workloads.

Big Data · cloud-native · resource scheduling
0 likes · 19 min read
dbaplus Community
Dec 13, 2022 · Big Data

How ClickHouse Powers Real-Time Self-Service Analytics at Scale

Facing massive daily data volumes and complex, ad‑hoc analytical needs, Zhaozhuan’s engineering team evaluated multiple OLAP engines and chose ClickHouse, then built a four‑layer self‑service analytics platform, detailing architecture, use‑cases, performance tuning, large‑scale joins, and future roadmap challenges.

Big Data · ClickHouse · Data Architecture
0 likes · 14 min read
DataFunSummit
Dec 13, 2022 · Big Data

Introducing the Star River Big Data Development Platform: Architecture, Core Capabilities, and Future Plans

This article presents an in‑depth overview of 58.com’s self‑built Star River big data platform, covering its evolution across three eras, resource management hierarchy, core technical capabilities such as metadata services, data maps and lineage, governance practices, and the roadmap for further enhancements.

Big Data · Data Platform · architecture
0 likes · 14 min read
DataFunTalk
Dec 12, 2022 · Big Data

Cloud‑Native and Intelligent Fusion: Key Trends Shaping the Future of Big Data

The article explains how cloud‑native architectures, data governance, intelligent fusion, and privacy computing are driving the evolution of big data, recounting the history from Google’s early papers and Hadoop to modern managed services, compute‑storage separation, AI‑powered recommendation platforms, and real‑world success cases.

Big Data · Cloud Computing · cloud-native
0 likes · 10 min read
AntTech
Dec 11, 2022 · Information Security

Occlum v1.0: Open‑Source Trusted Execution Environment OS with Major Performance Gains and Spark Big Data Integration

Occlum v1.0, the open‑source trusted execution environment operating system released by Ant Group, delivers up to five‑fold performance improvements, supports over 150 Linux syscalls, introduces async I/O, dynamic memory management, and a Spark‑BigDL big‑data analysis solution, while outlining future GPU and TDX extensions.

Big Data · Confidential Computing · Occlum
0 likes · 11 min read
DataFunSummit
Dec 10, 2022 · Big Data

Applying Apache Spark in Guanyuan Self-Service Analytics System: Architecture, Challenges, and Solutions

This presentation details how Guanyuan Data leverages Apache Spark within its self‑service analytics platform, covering product features, flexible deployment, resource isolation, performance challenges, architectural solutions, and future cloud‑native enhancements to support thousands of users and massive query workloads.

Apache Spark · Big Data · Data Platform
0 likes · 14 min read
ITPUB
Dec 10, 2022 · Big Data

How ClickHouse Powers Real-Time Self-Service Analytics at Scale

This article examines why ClickHouse was chosen as the OLAP engine for a massive self‑service analytics platform, describes the system architecture, shares concrete memory and performance tuning parameters, and outlines current challenges and future roadmap for large‑scale real‑time data analysis.

Big Data · ClickHouse · Data Architecture
0 likes · 14 min read
php Courses
Dec 9, 2022 · Databases

Elasticsearch Index and Document Operations Tutorial

This tutorial explains how to create, query, update, and delete Elasticsearch indices and documents using RESTful HTTP requests, covering basic CRUD operations, various query types, pagination, sorting, aggregations, highlighting, and mapping definitions with practical JSON examples.

Big Data · Elasticsearch · JSON
0 likes · 8 min read
DataFunSummit
Dec 7, 2022 · Big Data

Modern Data Governance at NetEase DataFan: Evolution, Challenges, and Solutions

This article details NetEase DataFan's journey in building a full‑stack big‑data platform, explains the design‑first data‑mid‑platform approach, analyzes cost, quality, and security problems encountered, and presents the modern data‑governance framework that integrates development, governance, and consumption into a closed loop.

Big Data · Cost Management · Data Platform
0 likes · 22 min read
Alibaba Cloud Developer
Dec 7, 2022 · Databases

How Lindorm Cut Costs and Boosted Performance for Alibaba’s Massive Data Workloads

This article reviews Lindorm’s evolution from its HBase‑based 1.0 architecture to the cloud‑native 2.0 version, outlines 2022’s cost‑saving and efficiency challenges, details compression, storage, time‑series and SQL enhancements, and shares real‑world case studies demonstrating significant cost reductions and performance gains.

Big Data · Cost Reduction · Lindorm
0 likes · 24 min read
Data Thinking Notes
Dec 5, 2022 · Big Data

How NetEase Cloud Music Cut Storage Costs by 30% Through Data Governance

This article details NetEase Cloud Music's year‑long data governance initiative, covering data background, governance strategy, project plan, practical actions, results, and future outlook, and shows how metadata‑driven management reduced storage by over 30% while improving reliability and efficiency.

Big Data · Hadoop · cloud music
0 likes · 17 min read
DataFunSummit
Dec 5, 2022 · Big Data

Impala Cluster Performance Optimization Based on Historical Queries: Practices and Solutions

This article presents a comprehensive overview of Impala cluster performance optimization using historical query analysis, covering background, high‑performance data‑warehouse construction principles, identified pain points, HBO implementation details, optimization techniques, and future development plans for the Impala ecosystem.

Big Data · HBO · Historical Queries
0 likes · 16 min read
Top Architect
Dec 4, 2022 · Databases

Deep Dive into Elasticsearch Pagination: from/size, Scroll, and Search After

This article explains how Elasticsearch handles deep pagination, compares the traditional from/size method with Scroll and Search After techniques, details their internal query and fetch phases, provides practical code examples, and offers guidance on choosing the right approach for large‑scale search workloads.

Big Data · pagination · scroll
0 likes · 15 min read
Architects Research Society
Dec 3, 2022 · Databases

Solr vs Elasticsearch: Choosing the Right Search Engine for Your Organization

This article compares Solr and Elasticsearch, examining their cloud, analytics, and cognitive search capabilities, and provides guidance on selecting the most suitable engine based on factors such as deployment complexity, resource requirements, scalability, integration with Hadoop ecosystems, and specific organizational use cases.

Big Data · Elasticsearch · Solr
0 likes · 9 min read
DataFunSummit
Dec 2, 2022 · Big Data

BitSail: ByteDance’s Open‑Source Unified Data Integration Engine – Architecture, Evolution, and Capabilities

BitSail, ByteDance’s open‑source data integration engine, unifies batch, streaming, and incremental data synchronization across heterogeneous sources, detailing its evolution from early Flink‑based prototypes to a mature, plugin‑driven architecture with multi‑engine support, low‑cost co‑development, and robust CDC lakehouse capabilities.

Big Data · CDC · Flink
0 likes · 19 min read
DataFunSummit
Dec 1, 2022 · Big Data

City Data Acquisition Platform: Architecture, Core Technologies, and Incremental Synchronization Strategies

This article presents an overview of a smart city unified perception platform, detailing its modular architecture, solutions for multi-source heterogeneity, incremental synchronization strategies, and real-time API data collection, while discussing extensibility and practical implementation considerations.

API integration · Big Data · Data Platform
0 likes · 20 min read
Architecture Digest
Dec 1, 2022 · Big Data

Understanding Data Warehouse Architecture and Layered Design

This article explains the concepts, architecture, and layered design of data warehouses, covering data flow, ETL processes, ODS, DWD, DWM, DWS, ADS layers, their characteristics, differences from databases, and the role of data marts in supporting OLAP and decision‑making.

Analytics · Big Data · Data Layers
0 likes · 13 min read
21CTO
Nov 30, 2022 · Big Data

Mastering Data Sharding: Hash, Range, and Consistent Hash Techniques

This article explains core data sharding concepts and models—including hash‑based, range‑based, and consistent hashing—detailing their mappings, routing strategies, scalability considerations, and practical implementation examples for handling massive datasets in distributed systems.

Big Data · Consistent Hashing · Hashing
0 likes · 11 min read
DeWu Technology
Nov 30, 2022 · Big Data

Fundamentals and Implementation of Data Lineage in Big Data Environments

Data lineage in big‑data environments tracks how data moves and transforms—from source tables through SQL processing to final storage—enabling management tasks such as domain segmentation, performance tuning, anomaly detection, and dependency verification, with implementations ranging from simple regex extraction to robust AST parsing and optimization, as used by tools like Alibaba DataWorks and Apache Atlas.

ASTBig DataData Lineage
0 likes · 7 min read
Alibaba Cloud Big Data AI Platform
Nov 30, 2022 · Big Data

What’s New in Apache Flink 2022? Highlights from the Flink Forward Asia Summit

The 2022 Flink Forward Asia summit showcased Apache Flink’s rapid community growth, key technical breakthroughs such as distributed snapshot upgrades, cloud‑native state storage, hybrid shuffle, Flink CDC 2.0, and Flink ML 2.0, and real‑world deployments at companies like Midea, miHoYo and Disney.

Apache FlinkBig DataFlink Forward Asia
0 likes · 25 min read
Bilibili Tech
Nov 29, 2022 · Big Data

How Bilibili Supercharged Flink: Checkpoint, HA, and Runtime Optimizations

This article details Bilibili's extensive enhancements to Flink's runtime—including checkpoint recoverability, operator ID stability, state processor extensions, hybrid high‑availability, regional checkpointing, and load‑based channel selection—to improve scalability, reliability, and operational efficiency of large‑scale streaming jobs.

Big DataCheckpointFlink
0 likes · 32 min read
Alibaba Cloud Big Data AI Platform
Nov 29, 2022 · Big Data

How Flink’s Stream‑Batch Fusion Is Transforming Real‑Time Big Data

The article explores Apache Flink’s eight years as a top‑level Apache project, Alibaba’s extensive contributions, the rise of stream‑batch unified computing, its impact on real‑time data integration, cloud‑native deployment, and the emerging Flink‑based data‑warehouse and serverless solutions.

Apache FlinkBig DataData Integration
0 likes · 15 min read
Data Thinking Notes
Nov 28, 2022 · Big Data

Unlocking Data Value: How Metadata Drives Efficient Data Management and Quality

This comprehensive guide explains how metadata connects source data, warehouses, and applications, outlines its technical and business classifications, demonstrates its value for data management, profiling, portals, and ETL development, and details optimization, storage, lifecycle, and quality practices essential for robust big‑data operations.

Big DataData QualityData Warehouse
0 likes · 35 min read
Big Data Technology & Architecture
Nov 28, 2022 · Big Data

Comprehensive Guide to Big Data Interview Topics: Log Collection, Data Synchronization, Offline Development, Real‑time Technology, Data Services, and Data Mining

This article provides an extensive overview of big‑data interview subjects, covering browser and mobile log collection methods, data synchronization techniques (batch, real‑time, sharding), offline data development platforms, streaming architectures, data service evolution, performance optimization, and data‑mining layers and applications.

Big DataStreamingdata mining
0 likes · 17 min read
Volcano Engine Developer Services
Nov 28, 2022 · Cloud Native

How ByteDance Built a Massive Cloud‑Native Big Data Platform to Power TikTok

ByteDance’s cloud‑native computing team, led by Li Yakun, details how they transformed a Hadoop‑centric big‑data stack into a Kubernetes‑driven platform—customizing storage, middleware, and scheduling—to support petabyte‑scale workloads, achieve over 40% resource utilization, and sustain rapid product growth.

Big DataSparkcloud-native
0 likes · 17 min read
DataFunTalk
Nov 25, 2022 · Operations

Overview of Volcano Engine A/B Experiment System Platform

This article presents a comprehensive overview of Volcano Engine's A/B testing platform, detailing its four core stages—reliable experiment system, efficient data construction, scientific statistical analysis, and fine-grained governance—while explaining execution components, data pipelines, statistical methods, and operational best practices for large‑scale experimentation.

A/B testingBig DataExperiment Platform
0 likes · 16 min read
Data Thinking Notes
Nov 23, 2022 · Big Data

Mastering Fact Table Design: From Basics to Advanced Strategies

This comprehensive guide explains the fundamentals, design rules, and various types of fact tables—including transaction, snapshot, and aggregate tables—while detailing Kimball's four-step modeling process, grain declaration, handling of additive measures, and practical examples for effective data warehouse implementation.

Big DataData WarehouseFact Table
0 likes · 16 min read
Data Thinking Notes
Nov 22, 2022 · Big Data

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.

Big DataData SkewHive
0 likes · 5 min read
DataFunSummit
Nov 22, 2022 · Big Data

BI Platform Practice at Xiaomi: Evolution, Architecture, and Future Directions

This article details Xiaomi's multi‑year journey in building a group‑wide Business Intelligence platform, covering its historical evolution, technical challenges in performance, modeling, visualization and permissions, the current four‑layer architecture, and future plans to make the platform more business‑centric and simpler.

AnalyticsBIBig Data
0 likes · 15 min read
Top Architect
Nov 22, 2022 · Big Data

Efficient Massive Excel Import/Export with POI and EasyExcel in Java

This article explains how to efficiently import and export massive datasets (up to millions of rows) between Excel and databases using Apache POI, SXSSF, and Alibaba's EasyExcel, comparing workbook types, outlining performance considerations, and providing Java code examples for batch processing, paging, and transaction management.

Batch processingBig DataExcel
0 likes · 23 min read
Bilibili Tech
Nov 22, 2022 · Big Data

Overview of the Berserker Big Data Platform and Its Data Development Architecture

The Berserker big‑data platform provides a one‑stop data development and governance solution built on over 40 micro‑services, featuring the Archer scheduler with CN and EN nodes, Raft‑based state management, Docker‑isolated task execution, smart routing, and plans to make EN stateless, migrate to Kubernetes, and unify batch and streaming services.

ArcherBig DataDocker
0 likes · 17 min read
DevOps Cloud Academy
Nov 22, 2022 · Big Data

Components and Key Terminology in Apache Airflow

Apache Airflow’s architecture consists of schedulers, executors, workers, a web server, and a metadata database, enabling scalable workflow orchestration, while essential terminology such as DAGs, operators, and sensors defines how tasks are organized, executed, and monitored within data pipelines.

Apache AirflowBig DataDAG
0 likes · 8 min read
Architects' Tech Alliance
Nov 20, 2022 · Databases

Columnar Storage vs Row Storage: Overview, Write/Read Comparison, Pros, Cons, and Use Cases

This article explains the differences between row-based and column-based storage, comparing their write and read performance, outlining advantages and disadvantages, and describing suitable scenarios such as OLAP queries, column families, compression, and indexing, to help choose the appropriate storage model.

Big DataColumnar StorageDatabase
0 likes · 10 min read
ITPUB
Nov 18, 2022 · Big Data

How Xiaomi Uses Iceberg for Real‑Time Streaming and Batch Data Lakes

This article introduces Iceberg’s table‑format fundamentals, details Xiaomi’s large‑scale deployment of Iceberg for CDC and log ingestion, explores their streaming‑batch integration experiments, outlines future roadmap items, and provides a comprehensive Q&A covering practical challenges and solutions.

Batch processingBig DataData Lake
0 likes · 23 min read
ByteDance Terminal Technology
Nov 18, 2022 · Big Data

Practices and Techniques for Large‑Scale Distributed Trace Data Analysis at ByteDance

This article presents ByteDance’s experience building a massive trace‑data analysis platform, covering observability fundamentals, the evolution of its distributed tracing system, various aggregation computation models, technical architecture choices, and concrete use‑cases such as precise topology, traffic estimation, dependency analysis, performance anti‑patterns, bottleneck detection, and error propagation.

Big DataGraph DatabaseObservability
0 likes · 21 min read
360 Smart Cloud
Nov 17, 2022 · Databases

Exploring StarRocks Applications, Performance Tests, and Cloud‑Native Integration at 360

This article reviews the practical applications and experimental explorations of StarRocks at 360, describing the cloud‑native lake‑warehouse product Yunzhou, its three‑tier architecture, performance comparisons with Trino using TPCH 100 GB, challenges of Kubernetes integration, and future directions for storage‑compute separation.

Big DataData WarehouseOLAP
0 likes · 7 min read
Xiaohongshu Tech REDtech
Nov 16, 2022 · Operations

Design and Implementation of a Continuous Performance Optimization and Tracking Platform for Xiaohongshu Services

To curb rising resource costs as Xiaohongshu scales, engineers built a Continuous Performance Optimization & Tracking Platform that continuously profiles services, stores diff‑analyzed data in ClickHouse, automatically detects tiny regressions, links them to code changes, and has already saved and flagged roughly 20,000 CPU cores across search, recommendation and advertising workloads.

Big DataContinuous Monitoringcloud-native
0 likes · 16 min read
DataFunSummit
Nov 15, 2022 · Big Data

Industrial Data Governance: Challenges, Practices, and Insights

Industrial data governance, essential for digital transformation, faces challenges such as data heterogeneity, volume, quality, and integration across the value chain, and the presentation outlines background, practical approaches, strategic thinking, and a phased, demand‑driven model to enhance data quality, assetization, and business value.

Big DataDigital Transformationdata assetization
0 likes · 24 min read
Past Memory Big Data
Nov 15, 2022 · Big Data

How Uber Accelerated Presto Queries with Alluxio Local Cache

Uber processes over 500,000 daily Presto queries across 20 clusters handling more than 50 PB of data, and by deploying Alluxio Local Cache on NVMe disks they raised cache‑hit rates from roughly 65% to over 90% while addressing real‑time partition updates, node churn, and cache‑size constraints.

AlluxioBig DataConsistent Hashing
0 likes · 15 min read
Java Architect Essentials
Nov 14, 2022 · Big Data

Efficient Import and Export of Millions of Records Using Apache POI and EasyExcel

This article explains how to handle massive Excel import and export tasks in Java by comparing traditional POI implementations, selecting the appropriate Workbook type based on data volume, and leveraging Alibaba's EasyExcel library together with batch JDBC operations to process over three million rows with minimal memory usage and high performance.

Apache POIBig DataData Export
0 likes · 22 min read
Huolala Tech
Nov 11, 2022 · Big Data

How Huolala Boosted Offline Scheduling Performance: Strategies & Lessons

Huolala’s big‑data offline platform, built from scratch, faced escalating scheduling delays as task instances grew, prompting a series of short‑ and mid‑term optimizations—including zombie task cleanup, retention policies, memory caching, algorithmic tweaks, and high‑availability enhancements—to dramatically reduce dependency computation time and sustain million‑scale daily workloads.

Big DataDistributed Systemsoffline scheduling
0 likes · 12 min read
Meituan Technology Team
Nov 10, 2022 · Big Data

Optimizing Spark mapPartitions: Memory Management and Best Practices

The article details how Meituan’s Turing machine‑learning platform cut offline resource use by 80% and task time by 63% through memory‑level techniques such as column pruning, adaptive caching, and a deep dive into Spark’s mapPartitions operator, including source‑code analysis, GC behavior, and a low‑memory batch‑iterator best practice.

Big DataMemory OptimizationPerformance tuning
0 likes · 19 min read