Tagged articles
3697 articles
Data Thinking Notes
Sep 12, 2024 · Information Security

How to Overcome the Top 3 Data Flow Challenges and Secure Your Data Assets

This article outlines the framework for data element circulation, identifies three major security and compliance challenges in data flow, and presents five practical measures plus a six‑step method for incorporating data assets into financial statements to enhance transparency and value.

Big Data, Data Flow, Data Security
10 min read
Sohu Tech Products
Sep 11, 2024 · Big Data

Tencent Real-time Lakehouse Intelligent Optimization Practice

Tencent’s real‑time lakehouse combines Spark, Flink, StarRocks and Presto compute layers with Iceberg‑based management and HDFS/COS storage, and its Intelligent Optimize Service—comprising Compaction, Expiration, Cleaning, Clustering, Index and Auto‑Engine modules—automatically reduces merge time, improves query performance, enables secondary indexing, and dynamically routes hot partitions, while future plans target cold/hot separation, materialized view acceleration, and AI‑driven optimizations.

Big Data, Lakehouse, PyIceberg
12 min read
AntTech
Sep 10, 2024 · Big Data

From DATA for AI to AI for DATA: Evolution of Ant Group’s Intelligent Data System

The talk reviews the rapid evolution of data technologies—from early database foundations and big‑data breakthroughs to the rise of generative AI—highlighting how Ant Group’s data platform is shifting from a cost‑efficiency focus to a value‑centric, multimodal, AI‑driven ecosystem.

Artificial Intelligence, Big Data, Data Engineering
17 min read
AntData
Sep 9, 2024 · Big Data

From Cost‑Efficiency to Value‑Centric: The Evolution of Data Systems in the Data+AI Era

The article reviews the rapid advances in generative AI and big‑data technologies, traces the historical development of data infrastructure, and argues that modern data systems are shifting from a cost‑efficiency focus to a value‑centric paradigm driven by multimodal, unstructured data, vector search and machine‑oriented services.

@Data, Artificial Intelligence, Big Data
18 min read
Baidu Geek Talk
Sep 9, 2024 · Big Data

TDS Platform Overview: Architecture, Modules, and Features of Baidu MEG's Turing 3.0 Data Ecosystem

The TDS platform, central to Baidu MEG’s Turing 3.0 ecosystem, unifies data development, warehouse management, monitoring, and resource control through Spark‑based TDE, a visual studio, and AI‑enhanced tools like Smart Diagnosis and Text2SQL, enabling standardized workflows and scalable scheduling while handling over 30k daily tasks.

AI, Big Data, Data Development
21 min read
360 Zhihui Cloud Developer
Sep 9, 2024 · Big Data

Why DataFusion is Revolutionizing Big Data Queries with Rust and Arrow

This article introduces DataFusion, a high‑performance, Rust‑based query engine that leverages Apache Arrow’s columnar memory format to enable fast, extensible data processing across multiple storage formats and cloud sources. It explains the engine's architecture and execution model and provides practical Rust code examples for custom extensions.

Apache Arrow, Big Data, DataFusion
16 min read
DataFunSummit
Sep 8, 2024 · Big Data

Building and Optimizing a Cross‑Border E‑Commerce Data Platform: Architecture, Challenges, and Protonbase‑Based Solutions

This article presents Xide International's cross‑border e‑commerce data platform, detailing its multi‑layer business architecture, the scalability and data‑access problems encountered, and how a Protonbase‑driven data‑warehouse and micro‑service redesign dramatically improved query speed, operational efficiency, and cost.

Big Data, Data Platform, Data Warehouse
11 min read
Didi Tech
Sep 5, 2024 · Industry Insights

How Didi Built a Multi‑Protocol, Petabyte‑Scale Storage System for AI Training

Facing petabyte‑level data, billions of small files, and the need for POSIX, S3, and HDFS compatibility, Didi designed a new generation of unstructured storage, OrangeFS, by analyzing internal systems, combining multiple storage solutions, reusing GIFT technology, and implementing a high‑performance metadata service, multi‑protocol fusion, and robust scalability features.

AI storage, Big Data, Distributed File System
27 min read
dbaplus Community
Sep 4, 2024 · Big Data

How Ctrip Scaled Its Data Platform to Multi‑IDC Architecture with Spark 3, Kyuubi, and Celeborn

This article details how Ctrip’s data platform evolved from a single‑IDC design to a multi‑IDC, tiered storage and scheduling architecture, covering the challenges of rapid data growth, the migration to Spark 3 via Kyuubi, the introduction of Celeborn shuffle service, and the resulting performance and reliability gains.

Big Data, HDFS, Kyuubi
23 min read
DataFunTalk
Sep 4, 2024 · Artificial Intelligence

Data+AI Data Lake Technologies: Challenges, Apache Iceberg Overview, and Vector Table Implementations with PyIceberg

This article explores the evolution of data lakes for AI, discusses the challenges of AI-era data management, introduces Apache Iceberg and its architecture, demonstrates PyIceberg-based AI training and inference pipelines, and presents vector table designs with LSH indexing and performance optimizations.

AI, Apache Iceberg, Big Data
22 min read
DataFunSummit
Aug 31, 2024 · Big Data

Apache Hudi Clustering: Workflow and Layout Optimization Strategies (Part 6)

This article explains Apache Hudi's clustering service, detailing its workflow, three execution modes, and layout optimization strategies—including linear, Z‑order, and Hilbert space‑filling curves—to improve storage locality and query performance in large‑scale data lake environments.

Apache Hudi, Big Data, Data Storage
8 min read
Data Thinking Notes
Aug 29, 2024 · Big Data

How ICBC Evolved Its Data Intelligence Architecture for Real‑Time Insights

At the 2024 Data Intelligence Conference, ICBC's Big Data and AI Lab detailed the evolution of its data intelligence platform, covering architectural redesign, real‑time data warehouse technology, unified intelligent data tools, and future development directions to boost efficiency and innovation.

Big Data, Data Platform, architecture evolution
3 min read
Zhuanzhuan Tech
Aug 28, 2024 · Big Data

Quality Inspection Data Collection: Design, Architecture, and Applications

This article outlines the design, architecture, and practical applications of a quality inspection data collection system, covering data point structures, reporting mechanisms, compliance analysis, intelligent strategy iteration, and BI dashboards, illustrating how big‑data techniques enable digital transformation of inspection processes.

BI, Big Data, Compliance
10 min read
Architecture Digest
Aug 27, 2024 · Big Data

Curated List of Free API Interfaces for Various Services

This article provides a comprehensive collection of free, unlimited-use API endpoints covering diverse services such as phone number lookup, historical events, stock data, weather forecasts, identity verification, jokes, maps, and many others, offering developers ready-to-use resources for building data-driven applications.

Backend, Big Data, data services
5 min read
DataFunTalk
Aug 27, 2024 · Big Data

Kuaishou's Year-Long White‑Box Cost Governance in Big Data: Engine, Data‑Warehouse, and Tool Optimizations

This article presents Kuaishou's comprehensive white‑box cost governance practice over the past year, detailing the data‑governance framework, engine and data‑warehouse white‑boxing techniques, compression algorithm replacement, HBO automatic tuning, operator analysis, and the resulting performance and cost benefits, as well as future plans.

Big Data, Data Warehouse, cost optimization
29 min read
DataFunSummit
Aug 26, 2024 · Big Data

Building a Doris‑Based Lakehouse Integrated Analytics System at Kuaishou

This article presents Kuaishou's experience of designing and implementing a Doris‑driven lakehouse integrated analytics system, covering the current OLAP landscape, challenges of data duplication and governance, the new architecture with caching and auto‑materialization, implementation details, performance impact, and future work.

Auto Materialization, Big Data, Data Warehouse
24 min read
Bilibili Tech
Aug 23, 2024 · Big Data

Accelerating Multi‑Dimensional OLAP Queries in ClickHouse with Grouping Sets, RBM, and Dense Dictionary Encoding

To achieve sub‑second, multi‑dimensional analytics on Bilibili’s hundred‑million‑row datasets, the team built a ClickHouse‑based acceleration layer that combines grouping‑set pre‑aggregation, bitmap (RBM) distinct handling, and a dense dictionary encoding service, dramatically cutting CPU, memory and query latency versus traditional OLAP pipelines.

Big Data, ClickHouse, Data Warehouse
28 min read
ByteDance Data Platform
Aug 20, 2024 · Big Data

How FlinkSQL Optimizations Cut CPU Usage by Up to 60% in Streaming Jobs

This article details the FlinkSQL performance enhancements implemented by the streaming team, covering view reuse, redundant shuffle removal, multiple‑input operator redesign, long sliding‑window optimizations, and native JSON format improvements, which together deliver up to 60% CPU savings and massive core‑hour reductions.

Big Data, CPU Reduction, FlinkSQL
13 min read
Big Data Technology & Architecture
Aug 20, 2024 · Big Data

Practical Insights on Using Apache Paimon for Real-World Data Lake Scenarios

This article shares a personal, experience‑driven overview of Apache Paimon, highlighting its design simplicity, key capabilities such as schema evolution, stream‑batch unified processing, primary‑key support, and closed‑loop data handling, while discussing when its features are appropriate for production environments.

Apache Paimon, Batch processing, Big Data
5 min read
Su San Talks Tech
Aug 18, 2024 · Big Data

How to Crush the One Billion Row Java Challenge: From 14 Minutes to Sub‑2‑Second Runtime

This article walks through the One Billion Row Challenge, explaining the problem, baseline solution, and a series of performance optimizations—from JVM selection and parallel I/O to custom hash tables, unsafe memory access, and SIMD techniques—that shrink execution time from minutes to under two seconds.

Big Data, One Billion Row Challenge, Optimization
20 min read
DataFunSummit
Aug 17, 2024 · Big Data

AnalyticDB Spark Architecture and Vectorized Engine Performance Overview

This article introduces the AnalyticDB Spark architecture, explains the need for Spark vectorization, surveys industry vectorized solutions, details ADB Spark's own vectorized implementation with Gluten and Velox, and presents performance test results showing a 6.98‑fold speedup over open‑source Spark.

AnalyticDB, Big Data, Gluten
9 min read
Mike Chen's Internet Architecture
Aug 16, 2024 · Big Data

Understanding the Lambda Architecture for Big Data Processing

This article explains the Lambda architecture—a three‑layer model combining batch and real‑time processing for large‑scale data, outlines its components, advantages, disadvantages, common tools, and compares it with the Kappa alternative while providing practical insights for data engineers.

Batch processing, Big Data, Data Engineering
5 min read
High Availability Architecture
Aug 16, 2024 · Big Data

Introduction to Elasticsearch: Core Concepts, Query Types, Pagination, and Data Synchronization

This article provides a comprehensive overview of Elasticsearch, covering its distributed storage architecture, core data model concepts, analysis and query capabilities, practical next‑token pagination techniques, join strategies, and various data synchronization methods for integrating Elasticsearch with other systems.

Big Data, Elasticsearch, Query DSL
13 min read
DataFunSummit
Aug 15, 2024 · Artificial Intelligence

Building an LLM‑Driven Metric Platform for Data Democratization

This article explains how large language models (LLMs) can launch data democratization by constructing a metric platform that combines LLM agents, semantic layers, NL2SQL/NL2API pipelines, warehouse‑internal and external semantics, and showcases SwiftAgent/SwiftMetrics innovations, real‑world case studies, and future directions.

Big Data, Data Democratization, LLM
13 min read
Mike Chen's Internet Architecture
Aug 14, 2024 · Big Data

Understanding Data Middle Platform: Value, Architecture, and Real‑World Cases

This article explains the concept, value, three‑layer architecture, and practical implementations of a data middle platform, illustrating how it standardizes data, forms a middle‑office organization, and drives cost‑effective business empowerment through examples from Alibaba, NetEase, and other enterprises.

Big Data, Data Platform, architecture
9 min read
DataFunSummit
Aug 14, 2024 · Big Data

Solving Typical Issues in Migrating to Spark 3.1: Multiple Catalog, Hive‑SQL to Spark‑SQL Migration, and Performance & Stability Optimizations at Xiaomi

This article shares Xiaomi's experience building a next‑generation one‑stop data development platform on Spark 3.1, covering typical challenges such as Multiple Catalog implementation, Hive‑SQL to Spark‑SQL migration, offline Spark performance and stability optimizations, and future roadmap plans.

Apache Spark, Big Data, Data Platform
18 min read
DataFunSummit
Aug 13, 2024 · Big Data

Data Cost Reduction and Efficiency: Qichacha's Data Architecture and Multi‑Cloud Unified Design

This article presents Qichacha's comprehensive data‑cost‑reduction strategy, detailing its Hadoop‑based three‑pillar architecture, layered data warehouse, Hive upgrades, unified metadata across multi‑cloud clusters, middleware choices such as Alluxio and JuiceFS, version‑compatible hybrid clouds, and Kubernetes‑driven resource orchestration to achieve scalable, low‑cost data processing.

Big Data, Data Warehouse, Hadoop
16 min read
Bilibili Tech
Aug 13, 2024 · Big Data

How Bilibili Re‑engineered Its Search Indexing with Distributed Storage and Spark

This article details Bilibili's transformation of its search offline indexing pipeline, moving from manual MySQL‑based processes to a high‑capacity, distributed KV store and Spark‑driven builds, addressing performance, maintenance, and scalability challenges while improving resource efficiency and iteration speed.

Big Data, Bilibili, KV Store
24 min read

How Hudi MetaServer Transforms Metadata Management and Performance in Data Lakes

This article examines the challenges of Hudi metadata stored on HDFS, introduces the independently developed Hudi MetaServer for centralized metadata, visual management, unified permission control, TTL, expression payloads, and multi‑active scaling, and outlines future enhancements such as LLS, multi‑table fusion, and JDBC support.

Big Data, Data Lake, Hudi
11 min read
Top Architect
Aug 10, 2024 · Big Data

Design and Implementation of a Scalable Real-Time Log Monitoring Platform at Baidu

This article introduces Baidu's log platform that handles billions of daily events, explains UBC logging concepts and monitoring requirements, and details a low‑cost, high‑accuracy architecture using real‑time streaming, dimension mapping, watermarking, and time‑window aggregation to achieve reliable, scalable event monitoring.

Big Data, Log Monitoring, Real-time Streaming
14 min read
DataFunSummit
Aug 9, 2024 · Big Data

Design and Practice of Ant Group's Metric System

This article presents a comprehensive overview of Ant Group's metric system, covering its definition, three-layer architecture, common challenges, concept consensus methods, semantic layer options, mechanism design, productization capabilities, platform improvements, business outcomes, future directions, and a detailed Q&A session.

Big Data, Data Platform, data modeling
28 min read
360 Zhihui Cloud Developer
Aug 8, 2024 · Big Data

How to Migrate HBase and HDFS Clusters Safely Without Downtime

This guide details a step‑by‑step migration plan for HBase and HDFS clusters, covering background, high‑availability architecture, role assignments, expansion and shrinkage of ZooKeeper and JournalNode, NameNode and DataNode migration, rolling restarts, and common upgrade pitfalls.

Big Data, Cluster Migration, HBase
12 min read
DataFunSummit
Aug 6, 2024 · Big Data

Implementing a Multi‑Tenant Lakehouse Data Platform for Real‑Time Analytics at a SaaS CRM Company

This article details how a SaaS CRM provider built a cloud‑native Lakehouse platform to support multi‑tenant real‑time analytics, describing data challenges, metadata‑driven architecture, virtual database design, query optimization, BI integration, AI readiness, migration steps, and the resulting performance and scalability gains.

Big Data, Data Platform, Lakehouse
19 min read
DataFunSummit
Aug 5, 2024 · Big Data

Velox Memory Management and Execution Engine Overview

This article presents a comprehensive overview of Meta's open‑source Velox query execution engine, detailing its architecture, vectorized execution model, memory‑pool hierarchy, arbitrator and allocator designs, spilling techniques, and future development plans for large‑scale data processing.

Big Data, Query Execution, Spilling
24 min read
NewBeeNLP
Aug 5, 2024 · Industry Insights

How Alibaba Cloud Scales Search Recommendations with Big Data, AI, and LLMs

This article details Alibaba Cloud's end‑to‑end architecture for search and advertising recommendation, covering the data platform, AI services, feature‑store design, training and inference optimizations, and the integration of large language models for new recommendation scenarios.

AI platform, Alibaba Cloud, Big Data
17 min read
Big Data Technology & Architecture
Aug 5, 2024 · Big Data

Key Features of Apache Flink 1.20: Materialized Tables, DISTRIBUTED BY, and State/Checkpoint Optimizations

The article reviews Apache Flink 1.20, highlighting the new Materialized Table concept, the DISTRIBUTED BY support for load‑balanced storage and join performance, and state/checkpoint file merging improvements, while providing code examples and practical insights for users.

Apache Flink, Big Data, Checkpoint Optimization
7 min read
DataFunSummit
Aug 4, 2024 · Big Data

Apache Hudi from Zero to One: Comprehensive Guide to Write Indexing (Part 4)

This article explains Apache Hudi’s write‑side indexing, detailing the indexing API, various index types—including simple, Bloom, bucket, HBase, and record‑level indexes—and their mechanisms, helping readers understand how Hudi validates record existence and optimizes updates and deletions.

Apache Hudi, Big Data, Data Lake
9 min read
DataFunTalk
Aug 2, 2024 · Artificial Intelligence

From Big Data to Large Models: Alibaba Cloud AI Platform Architecture and Practices for Search Recommendation

This presentation details Alibaba Cloud's AI platform, covering the end‑to‑end pipeline from big‑data processing and feature engineering to large‑model training, inference optimization, recommendation system architecture, and RAG applications, highlighting practical engineering solutions and performance gains.

AI platform, Big Data, Feature Store
18 min read
DataFunSummit
Aug 1, 2024 · Big Data

Deep Dive into Apache Spark SQL: Concepts, Core Components, and API

This article provides a comprehensive overview of Apache Spark SQL, covering its fundamental concepts such as TreeNode, AST, and QueryPlan, the distinction between logical and physical plans, the rule‑execution framework, core components like SparkSqlParser and Analyzer, as well as the Spark Session, Dataset/DataFrame, and various writer APIs, supplemented by a detailed Q&A session.

Apache Spark, Big Data, Data Processing
19 min read
StarRocks
Aug 1, 2024 · Big Data

How Kingsoft Office Boosted Query Speed 2.3× with StarRocks 3.0

Kingsoft Office migrated its reporting platform from a multi‑engine stack to StarRocks 3.0, achieving a 48.84% performance gain, halving query latency, reducing operational costs, and improving resource utilization while supporting storage‑compute separation and seamless Trino SQL compatibility.

Big Data, StarRocks, Storage-Compute Separation
14 min read
Data Thinking Notes
Jul 29, 2024 · Big Data

What Is a Data Middle Platform and How Does It Transform Enterprise Data Management?

This article explains the concept, design principles, and core components of a data middle platform, detailing its overall, functional, layered, logical, and data architectures, as well as the specific platforms for data collection, processing, organization, governance, quality, sharing, and visualization, illustrated with diagrams.

Big Data, Data Architecture, Data Integration
27 min read
58 Tech
Jul 29, 2024 · Databases

HBase Cloud Migration: Architecture, Challenges, and Solutions

This technical report details the background, architecture, construction, core issues, migration plans, and future roadmap of moving 58's HBase clusters to a cloud‑native environment, highlighting cost reduction, operational automation, and performance optimizations.

Big Data, Databases, HBase
22 min read
DataFunTalk
Jul 27, 2024 · Big Data

Design and Implementation of Kuaishou's Metric Middle Platform

This article presents Kuaishou's metric middle platform, detailing its background, design principles, metric management and service architecture, including headless BI concepts, unified analysis language OAX, query engine OCTO, data modeling layers, acceleration strategies, and future directions toward intelligence and high performance.

Big Data, Headless BI, Kuaishou
19 min read
DataFunSummit
Jul 26, 2024 · Big Data

Understanding Power Law Distributions in Content Ecosystems: Data Science Insights and Applications

This article explores how power‑law and other heavy‑tailed distributions appear in content ecosystems, explains their statistical foundations, discusses why they are common, and presents data‑driven strategies—including integer programming, graph‑based creator analysis, and causal inference—to optimize content production, recommendation, and settlement policies.

Big Data, Power Law, Statistical Modeling
18 min read
Big Data Technology & Architecture
Jul 26, 2024 · Databases

Apache Doris Architecture and Common Q&A: Read/Write Flow, Replication Consistency, Storage, and High Availability

This article provides a comprehensive overview of Apache Doris, explaining its frontend and backend nodes, storage structures such as tablets, rowsets, and segments, replication mechanisms, partitioning versus bucketing, indexing types, compaction processes, and high‑availability strategies through a detailed Q&A format.

Apache Doris, Big Data, Database Architecture
22 min read
Big Data Technology & Architecture
Jul 25, 2024 · Big Data

Fundamental Concepts and File Layout of Paimon: Snapshots, Partitions, Buckets, Consistency, and Compaction

This article explains Paimon's core concepts—including snapshots, partitions, buckets, consistency guarantees, file layout, LSM‑tree organization, and compaction strategies—while also covering table management tasks such as snapshot expiration, rollback, partition expiration, and small‑file mitigation techniques.

Big Data, Buckets, LSM‑Tree
12 min read
StarRocks
Jul 24, 2024 · Big Data

Why Lakehouse Architecture Is Redefining Big Data Infrastructure in the AI Era

The article examines the rapid rise of lakehouse architecture, its market momentum, core components—including storage, metadata, table formats, and compute layers—compares Iceberg, Hudi, and Delta Lake, discusses the shift from HDFS to object storage, and outlines the strategic importance of lakehouses for AI-driven data management and future data infrastructure trends.

AI, Apache Iceberg, Big Data
28 min read
DataFunSummit
Jul 23, 2024 · Big Data

Multi-Cloud Unified Data Acceleration Layer at Xiaohongshu: Challenges, Alluxio Solution, and Performance Gains

This article presents Xiaohongshu's multi‑cloud unified data acceleration layer built with Alluxio, detailing the challenges of multi‑cloud architectures, the design goals, Alluxio's architecture and features, real‑world case studies in AI training and recommendation indexing, performance improvements, and future plans.

AI training, Alluxio, Big Data
22 min read
DataFunTalk
Jul 23, 2024 · Big Data

Practical Experience with Apache Kyuubi and Apache Celeborn in Big Data Platforms

This article shares detailed practical experiences from DingXiangYuan's big‑data platform on using Apache Kyuubi and Apache Celeborn, covering architecture, flexible configuration, AuthZ fine‑grained permissions, small‑file and Z‑Order optimizations, Arrow‑based large result transmission, and operational tips such as connection‑level issues and Netty cache handling.

Apache Celeborn, Apache Kyuubi, Arrow
17 min read
JD Tech
Jul 23, 2024 · Big Data

Design and Architecture of JD's Buffalo Distributed Workflow Scheduling System

This article examines JD's self‑developed Buffalo distributed workflow scheduling system for big‑data ETL, detailing its two‑layer entity model, instance‑based scheduling, high‑availability three‑layer architecture, performance optimizations, cold‑hot data separation, and open APIs to support massive, complex data pipelines.

Big Data, Scheduling, high availability
11 min read
Rare Earth Juejin Tech Community
Jul 22, 2024 · Big Data

Comprehensive Guide to Kafka: Architecture, Core Concepts, and Configuration

This article provides an in‑depth overview of Apache Kafka, covering its use cases, comparison with other message queues, versioning, performance mechanisms, core concepts such as topics, partitions, offsets, consumer groups, rebalancing, replication, leader election, idempotence, transactions, compression, interceptors, request handling, and practical configuration tips for reliable streaming applications.

Big Data, Consumer, Kafka
25 min read
Mike Chen's Internet Architecture
Jul 15, 2024 · Big Data

Master Distributed Computing: Hadoop, Spark, and Flink Explained

This article introduces the fundamentals of distributed computing, compares major frameworks such as Hadoop, Spark, and Flink, and outlines their key components, performance characteristics, and typical application scenarios including big‑data analytics, cloud services, real‑time streaming, and scientific computing.

Big Data, Distributed computing, Flink
7 min read
21CTO
Jul 15, 2024 · Big Data

Twitter’s Kappa Architecture: Scaling Real-Time Processing of Billions of Events

Twitter migrated from a Lambda-based dual‑pipeline system to a Kappa architecture that relies on a single real‑time stream using Kafka, Google Pub/Sub, Dataflow, and Bigtable, dramatically reducing latency, increasing throughput, and improving data accuracy for processing billions of daily events.

Big Data · Cloud Computing · DataFlow
0 likes · 9 min read
DataFunTalk
Jul 15, 2024 · Big Data

Douyin Group E‑commerce Data Tracking Evolution, Solutions, and Attribution Practices

This article examines Douyin Group's e‑commerce data‑tracking journey, detailing the progression from early log collection to Log 3.0, the challenges posed by rapidly evolving user flows, and the comprehensive solution framework—including BTM/BCM management, SDK capabilities, and an attribution platform—that improves data quality, development efficiency, and attribution accuracy.

Big Data · Data Tracking · SDK
0 likes · 20 min read
DataFunSummit
Jul 13, 2024 · Big Data

Blaze: A Native Vectorized Execution Engine for Spark – Architecture, Production Optimizations, and Future Plans

Blaze is Kuaishou's self‑developed native execution engine that uses Rust, DataFusion, and SIMD vectorization to accelerate Spark workloads, delivering a compute boost of more than 30%. The article details its architectural components, production‑grade optimizations, and roadmap for broader adoption.

Big Data · DataFusion · Performance Optimization
0 likes · 13 min read
Alibaba Cloud Big Data AI Platform
Jul 12, 2024 · Big Data

How Flink + Hologres Power Real‑Time Streaming Warehouses

This article explains how combining Flink with Hologres creates a unified, real‑time streaming warehouse, detailing traditional layering approaches, the advantages of the Hologres‑based solution, core capabilities like Binlog and resource isolation, and a practical e‑commerce case study demonstrating performance gains.

Big Data · Flink · Hologres
0 likes · 21 min read
Data Thinking Notes
Jul 11, 2024 · Big Data

How to Build a Robust Data Lineage Foundation for Scalable Business Insights

This article explains how to construct a full‑chain data lineage system, covering its overall architecture, quality measurement framework, and application layer, and demonstrates practical use cases such as handling data growth, monitoring warehouse changes, accelerating development, ensuring consistency, and automating metric decomposition in real‑world business scenarios.

Big Data · Data Lineage · Data Warehouse
0 likes · 14 min read
Baidu Tech Salon
Jul 11, 2024 · Industry Insights

How Baidu Feed Evolved Its Data Warehouse with Multi‑Version Wide Tables

This article outlines the step‑by‑step evolution of Baidu's Feed data warehouse—from traditional layered modeling to hour‑level core tables, then real‑time wide tables, and finally a flow‑batch integrated multi‑version wide‑table architecture—highlighting the motivations, design choices, challenges, and resulting benefits.

Big Data · Data Warehouse · Real-time Analytics
0 likes · 10 min read
DataFunSummit
Jul 11, 2024 · Big Data

Design Principles of the Spark Core – DataFun Introduction to Apache Spark (Part 1)

This article provides a comprehensive overview of Apache Spark, covering its origins, key characteristics, core concepts such as RDD, DAG, partitioning and dependencies, the internal architecture including SparkConf, SparkContext, SparkEnv, storage and scheduling systems, as well as deployment models and the company behind the product.

Apache Spark · Big Data · Data Processing
0 likes · 16 min read
Python Programming Learning Circle
Jul 10, 2024 · Big Data

Using the TransBigData Python Library for Mobile Signaling Data Processing, Analysis, and Visualization

This article introduces the open‑source Python package TransBigData, explains how to install it, and demonstrates step‑by‑step methods for reading mobile signaling data, preprocessing, identifying stays and moves, extracting home and work locations, and visualizing individual activity patterns using Jupyter notebooks.

Big Data · Data Analysis · Geospatial
0 likes · 8 min read
DataFunTalk
Jul 10, 2024 · Big Data

Apache SeaTunnel: A Next‑Generation Data Integration Platform for ETL/ELT and OLAP

This article introduces Apache SeaTunnel, a modern data integration platform designed for the EtLT era, detailing its architecture, core connector APIs, checkpoint mechanism, model inference, multi‑table synchronization, the high‑performance SeaTunnel Zeta engine, OLAP use cases, community roadmap, and the commercial WhaleTunnel product.

Apache SeaTunnel · Big Data · ELT
0 likes · 22 min read
Data Thinking Notes
Jul 9, 2024 · Big Data

How to Build a Robust Enterprise Data Asset Catalog for Better Governance

This article explains why a comprehensive data asset catalog is essential for modern enterprises, outlines its core components such as inventory, metadata, data lineage, standards and access control, details step‑by‑step construction methods, and highlights key applications in governance, quality, compliance, architecture and valuation.

Big Data · Data Catalog · Data Lineage
0 likes · 13 min read
DataFunSummit
Jul 9, 2024 · Big Data

Materialized Views in MaxCompute: Design, Implementation, and Best Practices

This article explains the concept, advantages, and drawbacks of materialized views, describes how MaxCompute implements them—including creation syntax, maintenance properties, automatic query rewrite, smart recommendation, and auto‑materialization—and shares performance results and future improvement plans.

Automatic Refresh · Big Data · MaxCompute
0 likes · 13 min read
360 Smart Cloud
Jul 9, 2024 · Big Data

Understanding Shuffle in Spark: From Native Shuffle to External and Remote Shuffle Services (Uniffle)

This article examines the critical role of shuffle in big‑data processing, compares Spark's native shuffle with the External Shuffle Service (ESS) and Remote Shuffle Service (RSS) solutions, introduces Uniffle's architecture and configuration, and shares practical deployment experiences and performance results within a 360 internal environment.

Big Data · External Shuffle Service · Remote Shuffle Service
0 likes · 15 min read
DataFunTalk
Jul 6, 2024 · Big Data

StarRocks and Paimon Data Lake Capabilities, Migration Solutions, and Future Roadmap

This article presents a practical overview of StarRocks and Apache Paimon data‑lake capabilities, explains their performance advantages, details migration strategies from Trino/Presto and other engines, describes cluster‑to‑cluster migration, and outlines future roadmap for integration and optimization.

Big Data · Cloud Computing · Data Lake
0 likes · 13 min read
DataFunSummit
Jul 6, 2024 · Artificial Intelligence

Highlights of DataFunCon 2024 Beijing: Big Data, AI, and Large‑Model Trends

The two‑day DataFunCon 2024 Beijing conference gathered hundreds of big‑data and AI experts to discuss the evolution from data lakes to lake‑warehouses, large‑model development, practical applications, and future strategies for enterprises, while showcasing partner exhibitions and a vibrant community spirit.

Artificial Intelligence · Big Data · China
0 likes · 9 min read
DataFunSummit
Jul 5, 2024 · Big Data

Highlights of DataFunCon 2024 Beijing: Big Data, Large Models, and AI Integration

The DataFunCon 2024 Beijing conference opened with keynote speeches on the evolution of Alibaba Cloud's big data platform, explored distributed data warehousing, large model research, and practical AI applications, and concluded with a round‑table discussing future trends and enterprise strategies for big data and AI integration.

Artificial Intelligence · Big Data · conference
0 likes · 8 min read
iQIYI Technical Product Team
Jul 5, 2024 · Big Data

RiskFactor: An Integrated Real‑Time and Offline Feature Platform for Risk Control

RiskFactor unifies iQIYI’s legacy real‑time and offline feature platforms onto Opal’s DAG‑plus‑SQL engine, accelerating feature production fifteen‑fold, cutting latency from hours to minutes, streamlining development, lowering costs, and delivering more reliable, versioned risk‑control capabilities against sophisticated online threats.

Big Data · DAG · Feature Engineering
0 likes · 14 min read
Data Thinking Notes
Jul 4, 2024 · Big Data

How Active Metadata Revolutionizes Data Governance and Cuts Costs

This article examines the growing challenges of data management, such as asset discoverability, architectural rigidity, development quality, and rising resource costs. It then presents a comprehensive data‑governance framework that combines standards, agile architecture, development isolation, and active‑metadata‑driven lifecycle evaluation to improve efficiency, reduce expenses, and enable intelligent, automated data back‑filling.

Big Data · Storage Optimization · active metadata
0 likes · 17 min read
JD Cloud Developers
Jul 3, 2024 · Big Data

How to Build a High‑Availability Real‑Time Logistics Dashboard with Flink and ClickHouse

This article details the design and implementation of a high‑availability, real‑time logistics supply‑chain dashboard, covering Flink‑based data pipelines, ClickHouse OLAP storage, metric consistency, stability measures, extensible configuration, and comprehensive monitoring to ensure accurate, scalable performance during major promotions.

Big Data · ClickHouse · Flink
0 likes · 9 min read
StarRocks
Jul 2, 2024 · Big Data

What’s New in StarRocks 3.3? Deep Dive into Lakehouse‑Optimized Performance and Features

StarRocks 3.3 introduces a comprehensive set of enhancements—including maturity levels, ARM‑optimized performance, advanced caching, materialized‑view rewrites, storage optimizations, and expanded lakehouse ecosystem support—that together boost stability, query speed, and usability for large‑scale analytics workloads.

Big Data · Cache Optimization · Lakehouse
0 likes · 15 min read
DataFunSummit
Jul 2, 2024 · Cloud Computing

Global Perspective on Multi-Cloud Data Architecture

The forum presents a series of technical talks on multi‑cloud data architecture, covering Xiaomi’s lake‑warehouse practice, cross‑border e‑commerce data platforms, Alluxio‑based machine‑learning acceleration, Qichacha’s cost‑effective data solutions, and Kuaishou’s Flink on Kubernetes migration, highlighting strategies, implementations, and audience benefits.

Big Data · Cloud Computing · Data Architecture
0 likes · 8 min read
JD Tech
Jul 2, 2024 · Big Data

Real‑Time Monitoring Dashboard for Logistics Supply Chain: Architecture, Data Modeling, and Stability Design

This article presents the design and implementation of a high‑availability, real‑time logistics supply‑chain monitoring dashboard, covering its data processing pipeline with Flink, storage choices between Elasticsearch and ClickHouse, multi‑layer architecture, metric consistency, stability mechanisms, extensibility configurations, and monitoring practices.

Big Data · ClickHouse · Elasticsearch
0 likes · 11 min read