Tagged articles

3697 articles

Jan 10, 2023 · Big Data

Exploring Iceberg in Huawei Terminal Cloud: Architecture, Features, and Future Plans

This article surveys Iceberg's adoption in Huawei Terminal Cloud, covering its architecture, key features such as Git‑style data management, real‑time processing, and acceleration layers, future development directions, and a Q&A session addressing performance and implementation details.

Big Data · Data Lake · Flink

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Jan 10, 2023 · Big Data

How Alibaba’s Dolphin Engine Uses Flink + Hologres for Real‑Time Big Data

The Dolphin engine, built by Alibaba’s Data Engine team, combines Flink and Hologres to deliver ultra‑large‑scale OLAP, streaming, batch, and AI capabilities for real‑time advertising analytics, offering smart materialization, intelligent indexing, and vector recall while supporting millions of advertisers and petabyte‑level data.

AI · Big Data · Flink

0 likes · 13 min read

DataFunSummit

Jan 9, 2023 · Big Data

JD Data‑Driven Business Development: Building a Business Metric Data System and Marketplace Governance

The article outlines JD's data‑driven business development strategy, describing the current challenges of its business data marketplace, the governance framework—including layered architecture, standardization, ClickHouse dictionary refresh, and optimization measures—and the resulting performance improvements and future outlook.

Big Data · ClickHouse · JD.com

0 likes · 13 min read

DataFunSummit

Jan 8, 2023 · Big Data

Apache InLong SPI Refactoring: Reducing Maintenance Costs and Boosting Extensibility

This article explains how Apache InLong's manager service applied SPI‑based refactoring to simplify code, lower maintenance overhead, and dramatically improve extensibility for a rapidly growing variety of data sources and sinks in large‑scale data integration scenarios.

Apache InLong · Big Data · SPI

0 likes · 9 min read

DataFunTalk

Jan 8, 2023 · Big Data

ByteDance Event‑Tracking Data Cost Governance Practices

This article describes ByteDance's comprehensive approach to managing the massive volume of event‑tracking data, detailing the background, cost‑reduction strategies, experience review, future plans, and a Q&A session that together illustrate how systematic data governance can dramatically cut storage and processing expenses.

Big Data · ByteDance · Sampling

0 likes · 18 min read

Architects Research Society

Jan 7, 2023 · Big Data

Enterprise Data Strategy: Aligning Tactical Steps with Strategic Success

The article uses a dating analogy to illustrate how enterprise data strategy must combine clean, high‑quality data, governance, and analytics with clear tactical components to support strategic goals, drive market advantage, and enable reliable, mission‑focused outcomes in the experience economy.

Big Data · data governance · data-science

0 likes · 9 min read

DataFunSummit

Jan 7, 2023 · Big Data

Redefining the Customer Data Platform (CDP) for New Energy Vehicle Companies

This article explores why the automotive industry's shift to new energy vehicles necessitates a redefinition of the Customer Data Platform (CDP), detailing the changing traffic structure, varied departmental demands, CDP typologies, implementation strategies, and the benefits of a unified, extensible CDP architecture for marketing, sales, and after‑sales.

Automotive · Big Data · CDP

0 likes · 13 min read

Data Thinking Notes

Jan 5, 2023 · Big Data

Why Data Lakes Are Outshining Traditional Data Warehouses: A Deep Dive

This comprehensive guide explains the evolution from traditional data warehouses to modern data lakes, detailing concepts, architectures, differences, implementation steps, and real‑world case studies, while also comparing major cloud providers' solutions and highlighting how data platforms support digital transformation and analytics.

Analytics · Big Data · Data Lake

0 likes · 97 min read

JD Tech

Jan 4, 2023 · Big Data

Implementing Data Cubes in Hive Using WITH CUBE, GROUPING SETS, and WITH ROLLUP

This article demonstrates how to build multi‑dimensional data cubes on JD's big‑data platform using Hive, comparing UNION ALL with the more concise WITH CUBE, GROUPING SETS, and WITH ROLLUP clauses of GROUP BY, and discusses practical pitfalls and optimization tips.

Big Data · Grouping Sets · Hive

0 likes · 10 min read

StarRing Big Data Open Lab

Jan 4, 2023 · Big Data

Choosing the Right Data Architecture: Warehouse, Mart, or Lake?

Understanding enterprise data platforms requires grasping the differences between data warehouses, data marts, and data lakes, their architectures, use cases, and key capabilities such as integration, real‑time processing, governance, and cost control, to guide organizations in building scalable, flexible data solutions.

Big Data · Data Mart

0 likes · 15 min read

DataFunSummit

Jan 4, 2023 · Big Data

Data Intelligence Expert Interview – Maturity, Trends, and Practices of Data Middle Platforms

The interview gathers insights from data‑platform experts on the maturity stages, technology trends, implementation methodologies, open‑source ecosystems, system architectures, governance, security, and assessment criteria of modern data middle platforms, offering a comprehensive guide for practitioners.

Big Data · Data Observability · Data Platform

0 likes · 28 min read

Data Thinking Notes

Jan 3, 2023 · Big Data

How a Scalable Data Service Platform Transforms Big Data into APIs

This article outlines the design and implementation of a unified data service platform that standardizes data access, accelerates model processing, provides flexible API construction, and ensures high availability through gateway, caching, and monitoring, ultimately reducing cost and improving efficiency for both C‑end and B‑end applications.

Big Data · Data Platform · Service Architecture

0 likes · 25 min read

Tencent Cloud Developer

Jan 3, 2023 · Big Data

How Tencent’s Cloud‑Native Lakehouse Tackles PB‑Scale Performance Challenges

This article analyzes Tencent Cloud’s DLC lakehouse solution, explaining the unified data lake‑warehouse architecture, the performance hurdles of object‑storage‑based analytics, and the multi‑dimensional caching, virtual‑cluster elasticity, and advanced filter techniques that enable second‑level analysis on petabyte‑scale data while reducing costs.

Big Data · Caching · DLC

0 likes · 13 min read

ITPUB

Jan 3, 2023 · Databases

How DragonF MPP DB Redefines Cloud‑Native Data Warehousing at Massive Scale

The article details the design, core features, and real‑world performance of the DragonF MPP DB, a cloud‑native, compute‑storage‑separated database that overcomes traditional MPP limitations, supports millions of daily jobs, and outlines its future roadmap for ultra‑large‑scale data platforms.

Big Data · Data Warehouse · MPP

0 likes · 11 min read

Big Data Technology & Architecture

Jan 3, 2023 · Big Data

Migrating Hive SQL Jobs to Flink Using the SQL Gateway

This article explains how to use Apache Flink 1.16's SQL Gateway to migrate Hive SQL tasks to Flink, covering the underlying Hive‑on‑Flink architecture, dialect compatibility, streaming and batch demos, configuration details, and practical tips for developers and platform engineers.

Batch processing · Big Data · Flink

0 likes · 19 min read

DataFunTalk

Jan 3, 2023 · Big Data

Tencent Unified Big Data Scheduling Platform – Architecture, Design, and Operations

The article presents an in‑depth overview of Tencent's self‑developed Unified Scheduling Platform, detailing its system architecture, design challenges, performance optimizations, resource‑fair scheduling mechanisms, operational metrics, future roadmap, and a Q&A session that together illustrate how the platform enables massive offline data processing at scale.

Big Data · Distributed Systems · Performance Optimization

0 likes · 18 min read

Code Ape Tech Column

Jan 3, 2023 · Big Data

Elasticsearch vs ClickHouse: Performance, Cost, and Deployment Guide

This article compares Elasticsearch and ClickHouse in terms of write throughput, query speed, and server cost, then provides a step‑by‑step deployment guide for a private data pipeline using Zookeeper, Kafka, FileBeat, and ClickHouse, along with common issues and their solutions.

Big Data · ClickHouse · Elasticsearch

0 likes · 15 min read

dbaplus Community

Jan 2, 2023 · Operations

How to Build a Scalable Prometheus Monitoring System for Big Data on Kubernetes

This article explains how to design and implement a Prometheus‑based monitoring solution for big‑data components running on Kubernetes, covering metric exposure methods, scrape configurations, alerting architecture, exporter development, and practical code examples for a production‑ready setup.

Big Data · Exporter · Prometheus

0 likes · 18 min read

Top Architect

Jan 2, 2023 · Big Data

Optimizing Kafka at Meituan: Challenges and Solutions for a Large‑Scale Data Platform

This article details Meituan's use of Kafka as a unified data cache and distribution layer, outlines the challenges of massive scale and latency, and presents comprehensive optimizations across application, system, and cluster management layers, including disk balancing, migration acceleration, fetcher isolation, and full‑link monitoring.

Big Data · Distributed Systems · Kafka

0 likes · 22 min read

ITPUB

Dec 31, 2022 · Databases

Why HBase? Strengths, Weaknesses, Real‑World Scenarios, and Architecture Explained

This article examines HBase’s high reliability and performance as a column‑oriented NoSQL store, outlines its advantages and limitations, presents two practical use cases from e‑commerce, and details its data model, architecture components, and design considerations for effective deployment.

Big Data · Data Storage · HBase

0 likes · 12 min read

DataFunSummit

Dec 31, 2022 · Big Data

The Evolution of Data Platforms: From Early Computing to the Modern Big Data Stack

This article reviews the history of data platforms—from the first general‑purpose computers and early relational databases through traditional BI, agile BI, and big‑data technologies like Hadoop, Spark, and Flink, up to today’s cloud‑native modern data stack and its future outlook.

Big Data · Data Platform · Flink

0 likes · 26 min read

DataFunTalk

Dec 31, 2022 · Cloud Native

Design Philosophy and Architecture of JuiceFS: A Cloud‑Native Distributed File System

This article reviews the evolution of file storage, outlines challenges of cloud‑native data management, and details JuiceFS’s cloud‑native design philosophy, architecture, and key use cases such as Kubernetes, AI, and big‑data workloads.

AI · Big Data · Distributed File System

0 likes · 23 min read

Aikesheng Open Source Community

Dec 31, 2022 · Databases

Understanding ClickHouse Performance: Storage Engine and Compute Engine Perspectives

This article explains why ClickHouse delivers high query speed by detailing storage‑engine optimizations such as pre‑sorting, columnar layout and compression, and compute‑engine techniques like vectorized execution, built‑in functions and minimal join usage, while also promoting the related book and giveaway.

Big Data · ClickHouse · OLAP

0 likes · 9 min read

Architect's Tech Stack

Dec 30, 2022 · Big Data

Distributed Computing Is Not a Panacea for Big Data: Prioritize Single‑Node Performance First

While distributed clusters are popular for big‑data processing, they are not a universal solution; tasks that are hard to partition or involve heavy cross‑node communication often perform better on a well‑optimized single machine, making a careful analysis of workload characteristics essential before scaling out.

Algorithm Optimization · Big Data · Distributed computing

0 likes · 14 min read

DataFunTalk

Dec 29, 2022 · Big Data

Design and Implementation of OPPO's Big Data Diagnostic Platform (Compass)

This article presents the background, requirements, architecture, key modules, and practical impact of OPPO's non‑intrusive big‑data diagnostic platform—named Compass—designed to quickly locate issues, provide optimization suggestions, and achieve cost‑saving and efficiency gains for large‑scale Spark and Hadoop workloads.

Big Data · Cost Reduction · Hadoop

0 likes · 17 min read

ByteDance Data Platform

Dec 28, 2022 · Big Data

How Cloud Data Warehouses Are Shaping the Future of Big Data and DataOps

This article examines the four‑stage evolution of data warehouses, highlights the cost‑effective, scalable advantages of cloud‑native warehouses, explores the rapid growth of data‑management infrastructure, and discusses the emerging practices of DataOps and AI integration that are redefining modern data stacks.

AI · Big Data · Data Management

0 likes · 15 min read

Big Data Technology & Architecture

Dec 28, 2022 · Big Data

Flink 1.16 Highlights: Adaptive Batch Scheduling, Speculative Execution, Hybrid Shuffle, Dynamic Partition Pruning, Hive SQL Migration, Checkpoint Enhancements, CDC Integration, and Table Store

Flink 1.16 introduces adaptive batch scheduling, speculative execution, hybrid shuffle, dynamic partition pruning, improved Hive SQL compatibility, advanced checkpoint mechanisms including changelog backend, and integrates CDC with Kafka and Table Store, offering faster, more stable, and easier-to-use stream‑batch processing capabilities.

Big Data · CDC · Checkpoint

0 likes · 8 min read

High Availability Architecture

Dec 27, 2022 · Big Data

Design and Implementation of a Data Service Middle Platform for Scalable Data SaaS

This article presents a comprehensive overview of a data service middle platform, detailing its background, architectural design, data construction, model definition and acceleration, API creation, query processing, service gateway, common solutions for standardization and cost reduction, as well as achieved results and future plans.

API · Big Data · Data Platform

0 likes · 22 min read

Tencent Advertising Technology

Dec 27, 2022 · Big Data

Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Big Data · Data Lake · Flink

0 likes · 20 min read

DataFunTalk

Dec 25, 2022 · Big Data

Maintaining Wide Tables: Resource Impact, Evaluation, Granularity, Timeliness, and Automatic Expansion

The article explains how wide tables are maintained without excessive resource consumption, outlines criteria for deciding which metrics belong in a wide table, describes their granularity and timeliness considerations, and clarifies that they do not automatically expand when tracking points change.

Analytics · Big Data · Data Warehouse

0 likes · 4 min read

DataFunTalk

Dec 24, 2022 · Big Data

Evolution of Data Platforms: From Early Computers to the Modern Data Stack

This article traces the history of data platforms—from the first general‑purpose computers and traditional BI, through the rise of data warehouses, big‑data frameworks like Hadoop, Spark and Flink, to the modern data‑stack era with cloud‑native architectures, Lambda/Kappa models, and emerging tools—highlighting key technologies, architectural shifts, and future prospects.

Big Data · Cloud Computing · Data Warehouse

0 likes · 26 min read

DataFunSummit

Dec 24, 2022 · Operations

Understanding DataOps: Evolution, Technology Stacks, and Industry Applications

This article explores DataOps from its historical evolution through the digital 3.0 era, outlines its core technology stacks such as Data Fabric, Data Mesh, and Modern Data Stack, and demonstrates practical applications across finance, manufacturing, telecom, and public services, highlighting its role in agile, cloud‑native data management.

Big Data · DataOps · data governance

0 likes · 18 min read

Big Data Technology & Architecture

Dec 23, 2022 · Big Data

Understanding Spark SQL CacheManager: Caching Mechanism, Triggering, Uncaching, and Canonicalization

This article explains Spark SQL's CacheManager, how it stores cached query results using InMemoryRelation, the ways to trigger and release caches, the internal data structures like IndexedSeq and CachedData, and the role of canonicalization in determining cache reuse.

Big Data · CacheManager · Caching

0 likes · 8 min read

Bilibili Tech

Dec 23, 2022 · Big Data

Data Service Platform Architecture and Design

The article outlines a standardized data‑service platform built atop a warehouse, detailing its construction, query, and gateway layers—supporting model definition, acceleration, reusable APIs, unified DSL/SQL interfaces, and observability—to solve ingestion, definition, and lineage issues, achieving 500+ APIs, sub‑day creation, and 18% cost reduction.

Big Data · Data Service · api-gateway

0 likes · 22 min read

DataFunSummit

Dec 22, 2022 · Big Data

SeaTunnel: An Open‑Source Ultra‑Scale Data Integration Platform – Design Goals, Architecture, and Future Roadmap

This article introduces SeaTunnel, an open‑source ultra‑large‑scale data integration platform, covering its design objectives, current status with over 50 connectors and multi‑engine support, overall architecture, execution flow, connector translation, source and sink APIs, global commit strategies, table & catalog APIs, and the upcoming roadmap for connector expansion, a web UI, and a dedicated engine.

Big Data · Connector · SeaTunnel

0 likes · 10 min read

ITPUB

Dec 21, 2022 · Big Data

How Bilibili Optimized Flink Runtime for Massive Real‑Time Jobs

This article details Bilibili's extensive enhancements to the Flink runtime—including checkpoint recoverability, max‑parallelism calculations, State Processor API extensions, Full and Regional Checkpoints, hybrid HA, task‑level recovery, load‑balanced partitioners, and large‑scale cluster maintenance—to improve reliability and performance of its billion‑scale streaming workloads.

Big Data · Checkpoint · Flink

0 likes · 33 min read

DataFunSummit

Dec 21, 2022 · Big Data

Big Data Platform Architecture: Expert Insights on Components, Challenges, and Trends

An expert interview series examines the architecture of big data platforms, detailing core modules such as data integration, storage, computation, scheduling, and query analysis, while highlighting current challenges, best‑practice tools, and future trends like cloud‑native, object storage, and real‑time processing.

Big Data · Distributed computing · Query Engines

0 likes · 12 min read

Xianyu Technology

Dec 21, 2022 · Artificial Intelligence

Xianyu Recommendation System: Architecture, Challenges, and Deployment

The Xianyu recommendation system, built by backend expert Wan Xiaoyong, evolved from offline scoring to a full‑graph, serverless recall‑ranking pipeline that tackles C2C uncertainties through centralized feature engineering, model compression, staged deployment, flexible experimentation, robust governance, and plans for automated attribution and interpretability.

AI · Big Data · Feature Engineering

0 likes · 10 min read

DataFunSummit

Dec 20, 2022 · Big Data

JD Retail Big Data OLAP Application and Practice

This talk presents JD Retail’s big‑data OLAP solution, covering the massive, variable and complex traffic data challenges, the custom data‑ingestion and versioned update tools, ClickHouse query‑architecture upgrades, optimization techniques, and future plans for multi‑cluster querying and pre‑computation.

Big Data · ClickHouse · JD Retail

0 likes · 21 min read

Top Architect

Dec 20, 2022 · Databases

Elasticsearch DSL Query Syntax Overview (Version 7.x)

This article provides a comprehensive beginner-friendly guide to Elasticsearch 7.x DSL query syntax, covering core keywords, mapping types, query examples, boolean logic, and code snippets to help readers understand and construct effective search queries.

Big Data · DSL · Database

0 likes · 8 min read

Data Thinking Notes

Dec 19, 2022 · Big Data

Data Quality Mastery: From Expectations to Operational Assurance

This article outlines a comprehensive data quality management framework, covering expectations, measurement, assurance, and operational practices, and provides concrete templates, rule designs, and governance processes to help data teams systematically assess, monitor, and improve data reliability throughout the lifecycle.

Big Data · Data Quality · data governance

0 likes · 18 min read

Big Data Technology & Architecture

Dec 19, 2022 · Big Data

Near Real-Time Data Lake Practices in TikTok E-commerce: Architecture, Techniques, and Case Studies

This article presents a comprehensive overview of TikTok e-commerce's near‑real‑time data lake implementation, detailing data lake characteristics, architecture choices, practical use cases across analysis and operations, and future challenges and plans.

Apache Hudi · Big Data · Data Lake

0 likes · 16 min read

ITPUB

Dec 18, 2022 · Databases

Why ClickHouse Is So Fast: Deep Dive into Storage and Compute Engine Optimizations

This article explains how ClickHouse achieves high query performance by leveraging storage‑engine designs such as pre‑sorting, columnar layout, and block‑level compression, and by exploiting a vectorized compute engine while avoiding joins and using built‑in functions.

Big Data · ClickHouse · Columnar Storage

0 likes · 9 min read

DataFunTalk

Dec 18, 2022 · Big Data

Expert Interview: Architecture, Components, and Future Trends of Big Data Platforms

DataFun interviewed leading big‑data experts to outline the core components of modern big‑data platform architectures, discuss integration, storage, computation, scheduling, and query technologies, and share their perspectives on current challenges and future cloud‑native trends.

Big Data · OLAP · expert interview

0 likes · 11 min read

DataFunSummit

Dec 17, 2022 · Big Data

Douyu Live's Digitalization Journey: Data Platform Challenges, Practices, and Future Outlook

This article presents Douyu Live's experience in building a data middle platform, outlining the challenges of data application, the four‑stage evolution of their data tools, current achievements, and future goals to empower every employee as a data analyst.

Big Data · Data Platform · Digital Transformation

0 likes · 15 min read

Data Thinking Notes

Dec 15, 2022 · Big Data

Why 80% of Data Analysis Time Is Spent on Data Preparation—and How to Master It

Data preparation consumes about 80% of the entire analytics workflow, making data collection, quality assurance, and governance critical pillars—spanning metadata, master data, storage layers like data lakes and warehouses, and rigorous preprocessing—to turn raw information into reliable insights.

Big Data · Data Management · ETL

0 likes · 12 min read

Big Data Technology & Architecture

Dec 15, 2022 · Big Data

Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans

This technical article presents a comprehensive overview of migrating Hive SQL to Flink SQL, covering the motivations behind the migration, key challenges such as compatibility, stability and performance, practical implementation steps, a detailed demo, future development directions, and a Q&A session addressing common concerns.

Batch processing · Big Data · Data Lake

0 likes · 13 min read

Zhuanzhuan Tech

Dec 15, 2022 · Big Data

Zhuanzhuan User Profile Platform: Architecture, Tag Construction, Storage, and User Segmentation Practices

This article details Zhuanzhuan's user profile platform, covering its business-driven motivation, tag taxonomy, system architecture, data pipelines using Hive, ClickHouse and Spark, storage design, per‑user insight, segmentation techniques, ID‑mapping, and future plans for real‑time tagging.

Big Data · Data engineering · Hive

0 likes · 17 min read

DataFunTalk

Dec 14, 2022 · Big Data

Cloud‑Native Big Data Solutions for the Financial Industry: Architecture, Deployment, Scheduling, and Resource Management

This article explains why the financial sector is moving its big‑data workloads to cloud‑native platforms, compares cloud‑native systems with traditional Hadoop, describes deployment options such as Serverless YARN and Arcee Operator, and details the high‑performance GRO scheduler, agent, and ResLake resource‑lake architecture that together improve resource utilization, reduce costs, and ensure reliable, low‑latency processing for finance workloads.

Big Data · cloud-native · resource scheduling

0 likes · 19 min read

dbaplus Community

Dec 13, 2022 · Big Data

How ClickHouse Powers Real-Time Self-Service Analytics at Scale

Facing massive daily data volumes and complex, ad‑hoc analytical needs, Zhuanzhuan’s engineering team evaluated multiple OLAP engines and chose ClickHouse, then built a four‑layer self‑service analytics platform, detailing architecture, use‑cases, performance tuning, large‑scale joins, and future roadmap challenges.

Big Data · ClickHouse · Data Architecture

0 likes · 14 min read

DataFunSummit

Dec 13, 2022 · Big Data

Introducing the Star River Big Data Development Platform: Architecture, Core Capabilities, and Future Plans

This article presents an in‑depth overview of 58.com’s self‑built Star River big data platform, covering its evolution across three eras, resource management hierarchy, core technical capabilities such as metadata services, data maps and lineage, governance practices, and the roadmap for further enhancements.

Big Data · Data Platform · architecture

0 likes · 14 min read

DataFunTalk

Dec 12, 2022 · Big Data

Cloud‑Native and Intelligent Fusion: Key Trends Shaping the Future of Big Data

The article explains how cloud‑native architectures, data governance, intelligent fusion, and privacy computing are driving the evolution of big data, recounting the history from Google’s early papers and Hadoop to modern managed services, compute‑storage separation, AI‑powered recommendation platforms, and real‑world success cases.

Big Data · Cloud Computing · cloud-native

0 likes · 10 min read

DataFunTalk

Dec 12, 2022 · Artificial Intelligence

Graph Algorithms in Risk Control: Fundamentals, Evolution, Platforms, and Future Outlook

This article presents a comprehensive overview of how graph algorithms and graph neural networks are applied to internet risk control, covering basic concepts, evolutionary trends, platform implementations, future challenges, and a Q&A session that bridges theory and practice.

Big Data · Platform Engineering · graph algorithms

0 likes · 19 min read

AntTech

Dec 11, 2022 · Information Security

Occlum v1.0: Open‑Source Trusted Execution Environment OS with Major Performance Gains and Spark Big Data Integration

Occlum v1.0, the open‑source trusted execution environment operating system released by Ant Group, delivers up to five‑fold performance improvements, supports over 150 Linux syscalls, introduces async I/O, dynamic memory management, and a Spark‑BigDL big‑data analysis solution, while outlining future GPU and TDX extensions.

Big Data · Confidential Computing · Occlum

0 likes · 11 min read

DataFunSummit

Dec 10, 2022 · Big Data

Applying Apache Spark in Guanyuan Self-Service Analytics System: Architecture, Challenges, and Solutions

This presentation details how Guanyuan Data leverages Apache Spark within its self‑service analytics platform, covering product features, flexible deployment, resource isolation, performance challenges, architectural solutions, and future cloud‑native enhancements to support thousands of users and massive query workloads.

Apache SparkBig DataData Platform

0 likes · 14 min read

ITPUB

Dec 10, 2022 · Big Data

How ClickHouse Powers Real-Time Self-Service Analytics at Scale

This article examines why ClickHouse was chosen as the OLAP engine for a massive self‑service analytics platform, describes the system architecture, shares concrete memory and performance tuning parameters, and outlines current challenges and future roadmap for large‑scale real‑time data analysis.

Big DataClickHouseData Architecture

0 likes · 14 min read

php Courses

Dec 9, 2022 · Databases

Elasticsearch Index and Document Operations Tutorial

This tutorial explains how to create, query, update, and delete Elasticsearch indices and documents using RESTful HTTP requests, covering basic CRUD operations, various query types, pagination, sorting, aggregations, highlighting, and mapping definitions with practical JSON examples.

Big DataElasticsearchJSON

0 likes · 8 min read

DataFunSummit

Dec 8, 2022 · Databases

Understanding ClickHouse Distributed DDL Execution: Cases, Principles, and Mitigation Guide

This article analyzes ClickHouse distributed DDL execution by presenting typical failure scenarios, dissecting the underlying ZooKeeper‑based workflow, and offering practical mitigation steps to avoid DDL timeouts and improve cluster stability for large‑scale data operations.

Big DataClickHouseDatabase operations

0 likes · 12 min read

Data Thinking Notes

Dec 8, 2022 · Big Data

Why Layer Your Data Warehouse? Unlock Performance, Cost Savings, and Maintainability

This article explains the purpose and benefits of data‑warehouse layering, outlines the four ETL steps, describes each architectural layer from ODS to ADS, presents modeling principles, naming conventions, and includes sample DDL to illustrate how layered design improves data quality, reuse, and operational efficiency.

Big DataData WarehouseETL

0 likes · 36 min read

Thoughts on Knowledge and Action

Dec 7, 2022 · Big Data

Mastering Elasticsearch: Core Concepts, Cluster Architecture, and Indexing Mechanics

This article explains Elasticsearch’s fundamental building blocks, cluster roles, shard and replica strategies, master election, split‑brain prevention, inverted index structure, and the complete search and indexing lifecycle for handling large‑scale data efficiently.

Big DataCluster ManagementDistributed Systems

0 likes · 10 min read

DataFunSummit

Dec 7, 2022 · Big Data

Modern Data Governance at NetEase DataFan: Evolution, Challenges, and Solutions

This article details NetEase DataFan's journey in building a full‑stack big‑data platform, explains the design‑first data‑mid‑platform approach, analyzes cost, quality, and security problems encountered, and presents the modern data‑governance framework that integrates development, governance, and consumption into a closed loop.

Big DataCost ManagementData Platform

0 likes · 22 min read

Alibaba Cloud Developer

Dec 7, 2022 · Databases

How Lindorm Cut Costs and Boosted Performance for Alibaba’s Massive Data Workloads

This article reviews Lindorm’s evolution from its HBase‑based 1.0 architecture to the cloud‑native 2.0 version, outlines 2022’s cost‑saving and efficiency challenges, details compression, storage, time‑series and SQL enhancements, and shares real‑world case studies demonstrating significant cost reductions and performance gains.

Big DataCost ReductionLindorm

0 likes · 24 min read

Zhengtong Technical Team

Dec 6, 2022 · Big Data

Beidou Grid Code: Theory, Implementation, and Urban Management Applications

This article introduces the Beidou Grid Code, its theoretical foundation in GeoSOT, detailed hierarchical encoding rules, implementation challenges using MySQL and JPA, and showcases practical urban management applications such as case reporting, hotspot analysis, indoor positioning, and data security.

BeidouBig DataGIS

0 likes · 16 min read

Data Thinking Notes

Dec 5, 2022 · Big Data

How NetEase Cloud Music Cut Storage Costs by 30% Through Data Governance

This article details NetEase Cloud Music's year‑long data governance initiative, covering data background, governance strategy, project plan, practical actions, results, and future outlook, and shows how metadata‑driven management reduced storage by over 30% while improving reliability and efficiency.

Big DataHadoopcloud music

0 likes · 17 min read

DataFunSummit

Dec 5, 2022 · Big Data

Impala Cluster Performance Optimization Based on Historical Queries: Practices and Solutions

This article presents a comprehensive overview of Impala cluster performance optimization using historical query analysis, covering background, high‑performance data‑warehouse construction principles, identified pain points, HBO implementation details, optimization techniques, and future development plans for the Impala ecosystem.

Big DataHBOHistorical Queries

0 likes · 16 min read

Top Architect

Dec 4, 2022 · Databases

Deep Dive into Elasticsearch Pagination: from/size, Scroll, and Search After

This article explains how Elasticsearch handles deep pagination, compares the traditional from/size method with Scroll and Search After techniques, details their internal query and fetch phases, provides practical code examples, and offers guidance on choosing the right approach for large‑scale search workloads.

Big Datapaginationscroll

0 likes · 15 min read

Architects Research Society

Dec 3, 2022 · Databases

Solr vs Elasticsearch: Choosing the Right Search Engine for Your Organization

This article compares Solr and Elasticsearch, examining their cloud, analytics, and cognitive search capabilities, and provides guidance on selecting the most suitable engine based on factors such as deployment complexity, resource requirements, scalability, integration with Hadoop ecosystems, and specific organizational use cases.

Big DataElasticsearchSolr

0 likes · 9 min read

DataFunSummit

Dec 2, 2022 · Big Data

BitSail: ByteDance’s Open‑Source Unified Data Integration Engine – Architecture, Evolution, and Capabilities

BitSail, ByteDance’s open‑source data integration engine, unifies batch, streaming, and incremental data synchronization across heterogeneous sources, detailing its evolution from early Flink‑based prototypes to a mature, plugin‑driven architecture with multi‑engine support, low‑cost co‑development, and robust CDC lakehouse capabilities.

Big DataCDCFlink

0 likes · 19 min read

DataFunSummit

Dec 1, 2022 · Big Data

City Data Acquisition Platform: Architecture, Core Technologies, and Incremental Synchronization Strategies

This article presents an overview of a smart city unified perception platform, detailing its modular architecture, solutions for multi-source heterogeneity, incremental synchronization strategies, and real-time API data collection, while discussing extensibility and practical implementation considerations.

API integrationBig DataData Platform

0 likes · 20 min read

Architecture Digest

Dec 1, 2022 · Big Data

Understanding Data Warehouse Architecture and Layered Design

This article explains the concepts, architecture, and layered design of data warehouses, covering data flow, ETL processes, ODS, DWD, DWM, DWS, ADS layers, their characteristics, differences from databases, and the role of data marts in supporting OLAP and decision‑making.

AnalyticsBig DataData Layers

0 likes · 13 min read

21CTO

Nov 30, 2022 · Big Data

Mastering Data Sharding: Hash, Range, and Consistent Hash Techniques

This article explains core data sharding concepts and models—including hash‑based, range‑based, and consistent hashing—detailing their mappings, routing strategies, scalability considerations, and practical implementation examples for handling massive datasets in distributed systems.

Big DataConsistent HashingHashing

0 likes · 11 min read

DeWu Technology

Nov 30, 2022 · Big Data

Fundamentals and Implementation of Data Lineage in Big Data Environments

Data lineage in big‑data environments tracks how data moves and transforms, from source tables through SQL processing to final storage. It enables management tasks such as domain segmentation, performance tuning, anomaly detection, and dependency verification. Implementations range from simple regex extraction to robust AST parsing and optimization, as used by tools like Alibaba DataWorks and Apache Atlas.

ASTBig DataData Lineage

0 likes · 7 min read

JD Tech Talk

Nov 30, 2022 · Databases

Risk Insight Platform Architecture and ClickHouse Implementation for Real-Time Risk Monitoring

The article presents a comprehensive risk insight platform built on ClickHouse, Flink, and intelligent algorithms, detailing its architecture, technical challenges, solutions, real-time data modeling, practical applications in fraud detection and user behavior analysis, and future optimization directions.

Big DataData engineeringOLAP

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Nov 30, 2022 · Big Data

What’s New in Apache Flink 2022? Highlights from the Flink Forward Asia Summit

The 2022 Flink Forward Asia summit showcased Apache Flink’s rapid community growth, key technical breakthroughs such as distributed snapshot upgrades, cloud‑native state storage, hybrid shuffle, Flink CDC 2.0, and Flink ML 2.0, and real‑world deployments at companies like Midea, miHoYo and Disney.

Apache FlinkBig DataFlink Forward Asia

0 likes · 25 min read

Bilibili Tech

Nov 29, 2022 · Big Data

How Bilibili Supercharged Flink: Checkpoint, HA, and Runtime Optimizations

This article details Bilibili's extensive enhancements to Flink's runtime—including checkpoint recoverability, operator ID stability, state processor extensions, hybrid high‑availability, regional checkpointing, and load‑based channel selection—to improve scalability, reliability, and operational efficiency of large‑scale streaming jobs.

Big DataCheckpointFlink

0 likes · 32 min read

Alibaba Cloud Big Data AI Platform

Nov 29, 2022 · Big Data

How Flink’s Stream‑Batch Fusion Is Transforming Real‑Time Big Data

The article explores Apache Flink’s eight‑year journey to becoming a top‑level Apache project, Alibaba’s extensive contributions, the rise of stream‑batch unified computing, its impact on real‑time data integration, cloud‑native deployment, and the emerging Flink‑based data‑warehouse and serverless solutions.

Apache FlinkBig DataData Integration

0 likes · 15 min read

Data Thinking Notes

Nov 28, 2022 · Big Data

Unlocking Data Value: How Metadata Drives Efficient Data Management and Quality

This comprehensive guide explains how metadata connects source data, warehouses, and applications, outlines its technical and business classifications, demonstrates its value for data management, profiling, portals, and ETL development, and details optimization, storage, lifecycle, and quality practices essential for robust big‑data operations.

Big DataData QualityData Warehouse

0 likes · 35 min read

Big Data Technology & Architecture

Nov 28, 2022 · Big Data

Comprehensive Guide to Big Data Interview Topics: Log Collection, Data Synchronization, Offline Development, Real‑time Technology, Data Services, and Data Mining

This article provides an extensive overview of big‑data interview subjects, covering browser and mobile log collection methods, data synchronization techniques (batch, real‑time, sharding), offline data development platforms, streaming architectures, data service evolution, performance optimization, and data‑mining layers and applications.

Big DataStreamingdata mining

0 likes · 17 min read

Volcano Engine Developer Services

Nov 28, 2022 · Cloud Native

How ByteDance Built a Massive Cloud‑Native Big Data Platform to Power TikTok

ByteDance’s cloud‑native computing team, led by Li Yakun, details how they transformed a Hadoop‑centric big‑data stack into a Kubernetes‑driven platform—customizing storage, middleware, and scheduling—to support petabyte‑scale workloads, achieve over 40% resource utilization, and sustain rapid product growth.

Big DataSparkcloud-native

0 likes · 17 min read

DataFunTalk

Nov 26, 2022 · Big Data

Data Governance: Concepts, Evaluation Methods, and Observability with GuanCe Cloud

This article explains data governance fundamentals, outlines common evaluation shortcomings, and introduces observability concepts and the GuanCe Cloud platform as a way to objectively measure and improve governance outcomes across the entire data lifecycle.

Big DataData QualityObservability

0 likes · 10 min read

Programmer DD

Nov 26, 2022 · Big Data

How Flink Became the Real‑Time Big Data Standard – Insights from Alibaba’s Wang Feng

This interview with Alibaba researcher Wang Feng (aka Mo Wen) explores Apache Flink’s eight‑year journey to top‑level Apache status, its unified stream‑batch architecture, the rise of Flink Table Store and CDC, and how cloud‑native deployments are reshaping real‑time big data processing.

Apache FlinkBig DataData Integration

0 likes · 16 min read

DataFunTalk

Nov 25, 2022 · Operations

Overview of Volcano Engine A/B Experiment System Platform

This article presents a comprehensive overview of Volcano Engine's A/B testing platform, detailing its four core stages—reliable experiment system, efficient data construction, scientific statistical analysis, and fine-grained governance—while explaining execution components, data pipelines, statistical methods, and operational best practices for large‑scale experimentation.

A/B testingBig DataExperiment Platform

0 likes · 16 min read

Alibaba Cloud Big Data AI Platform

Nov 25, 2022 · Big Data

How EMR‑StarRocks & Flink CDC Simplify Real‑Time Data Warehousing

This article explains how Alibaba Cloud EMR‑StarRocks integrates with Flink CDC, outlines common real‑time ingestion pain points, and introduces the CTAS/CDAS and Connector‑V2 features that streamline table creation, schema evolution, and resource‑efficient streaming for large‑scale analytics.

Big DataCDASCTAS

0 likes · 14 min read

Data Thinking Notes

Nov 23, 2022 · Big Data

Mastering Fact Table Design: From Basics to Advanced Strategies

This comprehensive guide explains the fundamentals, design rules, and various types of fact tables—including transaction, snapshot, and aggregate tables—while detailing Kimball's four-step modeling process, grain declaration, handling of additive measures, and practical examples for effective data warehouse implementation.

Big DataData WarehouseFact Table

0 likes · 16 min read

Data Thinking Notes

Nov 22, 2022 · Big Data

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.

Big DataData SkewHive

0 likes · 5 min read

DataFunSummit

Nov 22, 2022 · Big Data

BI Platform Practice at Xiaomi: Evolution, Architecture, and Future Directions

This article details Xiaomi's multi‑year journey in building a group‑wide Business Intelligence platform, covering its historical evolution, technical challenges in performance, modeling, visualization and permissions, the current four‑layer architecture, and future plans to make the platform more business‑centric and simpler.

AnalyticsBIBig Data

0 likes · 15 min read

Top Architect

Nov 22, 2022 · Big Data

Efficient Massive Excel Import/Export with POI and EasyExcel in Java

This article explains how to efficiently import and export massive datasets (up to millions of rows) between Excel and databases using Apache POI, SXSSF, and Alibaba's EasyExcel, comparing workbook types, outlining performance considerations, and providing Java code examples for batch processing, paging, and transaction management.

Batch processingBig DataExcel

0 likes · 23 min read

Bilibili Tech

Nov 22, 2022 · Big Data

Overview of the Berserker Big Data Platform and Its Data Development Architecture

The Berserker big‑data platform provides a one‑stop data development and governance solution built on over 40 micro‑services, featuring the Archer scheduler with CN and EN nodes, Raft‑based state management, Docker‑isolated task execution, smart routing, and plans to make EN stateless, migrate to Kubernetes, and unify batch and streaming services.

ArcherBig DataDocker

0 likes · 17 min read

DevOps Cloud Academy

Nov 22, 2022 · Big Data

Components and Key Terminology in Apache Airflow

Apache Airflow’s architecture consists of schedulers, executors, workers, a web server, and a metadata database, enabling scalable workflow orchestration, while essential terminology such as DAGs, operators, and sensors defines how tasks are organized, executed, and monitored within data pipelines.

Apache AirflowBig DataDAG

0 likes · 8 min read

Architects' Tech Alliance

Nov 20, 2022 · Databases

Columnar Storage vs Row Storage: Overview, Write/Read Comparison, Pros, Cons, and Use Cases

This article explains the differences between row-based and column-based storage, comparing their write and read performance, outlining advantages and disadvantages, and describing suitable scenarios such as OLAP queries, column families, compression, and indexing, to help choose the appropriate storage model.

Big DataColumnar StorageDatabase

0 likes · 10 min read

ITPUB

Nov 18, 2022 · Big Data

How Xiaomi Uses Iceberg for Real‑Time Streaming and Batch Data Lakes

This article introduces Iceberg’s table‑format fundamentals, details Xiaomi’s large‑scale deployment of Iceberg for CDC and log ingestion, explores their streaming‑batch integration experiments, outlines future roadmap items, and provides a comprehensive Q&A covering practical challenges and solutions.

Batch processingBig DataData Lake

0 likes · 23 min read

ByteDance Terminal Technology

Nov 18, 2022 · Big Data

Practices and Techniques for Large‑Scale Distributed Trace Data Analysis at ByteDance

This article presents ByteDance’s experience building a massive trace‑data analysis platform, covering observability fundamentals, the evolution of its distributed tracing system, various aggregation computation models, technical architecture choices, and concrete use‑cases such as precise topology, traffic estimation, dependency analysis, performance anti‑patterns, bottleneck detection, and error propagation.

Big DataGraph DatabaseObservability

0 likes · 21 min read

360 Smart Cloud

Nov 17, 2022 · Databases

Exploring StarRocks Applications, Performance Tests, and Cloud‑Native Integration at 360

This article reviews the practical applications and experimental explorations of StarRocks at 360, describing the cloud‑native lake‑warehouse product Yunzhou, its three‑tier architecture, performance comparisons with Trino using the TPC‑H 100 GB benchmark, challenges of Kubernetes integration, and future directions for storage‑compute separation.

Big DataData WarehouseOLAP

0 likes · 7 min read

Xiaohongshu Tech REDtech

Nov 16, 2022 · Operations

Design and Implementation of a Continuous Performance Optimization and Tracking Platform for Xiaohongshu Services

To curb rising resource costs as Xiaohongshu scales, engineers built a Continuous Performance Optimization & Tracking Platform that continuously profiles services, stores diff‑analyzed data in ClickHouse, automatically detects tiny regressions, links them to code changes, and has already saved and flagged roughly 20,000 CPU cores across search, recommendation and advertising workloads.

Big DataContinuous Monitoringcloud-native

0 likes · 16 min read

DataFunSummit

Nov 15, 2022 · Big Data

Industrial Data Governance: Challenges, Practices, and Insights

Industrial data governance is essential for digital transformation but faces challenges such as data heterogeneity, volume, quality, and integration across the value chain. The presentation outlines the background, practical approaches, strategic thinking, and a phased, demand‑driven model for improving data quality, assetization, and business value.

Big DataDigital Transformationdata assetization

0 likes · 24 min read

Past Memory Big Data

Nov 15, 2022 · Big Data

How Uber Accelerated Presto Queries with Alluxio Local Cache

Uber processes over 500,000 Presto queries a day across 20 clusters handling more than 50 PB of data. By deploying Alluxio Local Cache on NVMe disks, it raised cache‑hit rates from roughly 65% to over 90% while addressing real‑time partition updates, node churn, and cache‑size constraints.

AlluxioBig DataConsistent Hashing

0 likes · 15 min read

Java Architect Essentials

Nov 14, 2022 · Big Data

Efficient Import and Export of Millions of Records Using Apache POI and EasyExcel

This article explains how to handle massive Excel import and export tasks in Java by comparing traditional POI implementations, selecting the appropriate Workbook type based on data volume, and leveraging Alibaba's EasyExcel library together with batch JDBC operations to process over three million rows with minimal memory usage and high performance.

Apache POIBig DataData Export

0 likes · 22 min read

Huolala Tech

Nov 11, 2022 · Big Data

How Huolala Boosted Offline Scheduling Performance: Strategies & Lessons

Huolala’s big‑data offline platform, built from scratch, faced escalating scheduling delays as task instances grew, prompting a series of short‑ and mid‑term optimizations—including zombie task cleanup, retention policies, memory caching, algorithmic tweaks, and high‑availability enhancements—to dramatically reduce dependency computation time and sustain million‑scale daily workloads.

Big DataDistributed Systemsoffline scheduling

0 likes · 12 min read

Open Source Linux

Nov 11, 2022 · Big Data

Deploy Hadoop on Kubernetes with Helm: A Complete Step‑by‑Step Guide

This guide walks through deploying Hadoop 3.x on a Kubernetes cluster using Helm, covering repository addition, Docker image creation, Helm chart configuration, service adjustments, installation, verification commands, and clean uninstallation, complete with code snippets and screenshots.

Big DataDockerHadoop

0 likes · 14 min read

Meituan Technology Team

Nov 10, 2022 · Big Data

Optimizing Spark mapPartitions: Memory Management and Best Practices

The article details how Meituan’s Turing machine‑learning platform cut offline resource use by 80% and task time by 63% through memory‑level techniques such as column pruning, adaptive caching, and a deep dive into Spark’s mapPartitions operator, including source‑code analysis, GC behavior, and a low‑memory batch‑iterator best practice.

Big DataMemory OptimizationPerformance tuning

0 likes · 19 min read
