Jan 29, 2026 · Big Data

How to Sync MySQL ALTER DDL to Doris Using Flink CDC (Step‑by‑Step)

This guide explains how to extend a Flink CDC pipeline so that, in addition to real‑time data replication, DDL ALTER statements from MySQL are captured, split from the data stream, and applied to Doris using side‑outputs and a custom JDBC sink.

DDL synchronization · Flink CDC

0 likes · 8 min read

DataFunSummit

Jan 29, 2026 · Big Data

How to Slash Web Scraping Costs by 60%: Proven Strategies from a Bright Data Expert

In the era of massive AI model training, this article presents a step‑by‑step technical guide—covering the full data‑collection pipeline, three acquisition modes, IP‑type choices, bandwidth savings, path and mixed‑request optimizations, and business‑level cost controls—to reduce web‑scraping expenses by more than 60% while maintaining data quality.

AI · automation · data collection

0 likes · 24 min read

Data Party THU

Jan 29, 2026 · Big Data

How a Tsinghua Big Data Program Turned a Chemistry PhD into an AI‑Powered Process Engineer

This article recounts a Tsinghua University PhD student's journey through a multidisciplinary big‑data training program, detailing the acquisition of AI and data‑science skills, the creation of novel algorithms like MicroFlowSAM and ImageRAG, and their successful application to chemical engineering research and industry projects.

Big Data · Chemical Engineering · Industrial Application

0 likes · 8 min read

Big Data Tech Team

Jan 28, 2026 · Big Data

30-Item Data Warehouse Development Checklist for Trustworthy, Efficient Data

This checklist compiles 30 actionable items covering model design, data consistency, performance, quality, metadata governance, cost efficiency, and collaboration to help data warehouse teams build trustworthy, high‑performance, and maintainable data pipelines.

Data Warehouse · Governance · Performance

0 likes · 8 min read

Big Data Tech Team

Jan 26, 2026 · Big Data

Master DWD, DWS, and Wide‑Table Modeling for Scalable Data Warehouses

This guide explains the DWD (detail) and DWS (summary) layered modeling approach combined with a wide‑table‑driven design, covering model positioning, design principles, concrete schema examples, implementation techniques, performance tips, and common pitfalls to help build clean, reusable, high‑performance enterprise data warehouses.

DWD · DWS · Data Warehouse

0 likes · 9 min read

Data Party THU

Jan 25, 2026 · Big Data

How Tsinghua’s Big Data Initiative Boosted Refinery Energy Forecasts with GRU

The Tsinghua University Big Data Capability Project applied GRU‑based deep learning, pulse‑event encoding, and advanced feature engineering to transform discrete refinery energy data into continuous sequences, achieving prediction accuracies of 84.2%, 82.7% and 81.6% for fuel gas, medium‑pressure and low‑pressure steam respectively.

GRU · energy prediction · feature engineering

0 likes · 9 min read

Ray's Galactic Tech

Jan 22, 2026 · Big Data

Export 1 Billion Elasticsearch Docs in 3 Hours Using PIT + Slice

This guide explains how to reliably export over a billion Elasticsearch documents within a few hours by using Point‑In‑Time (PIT) snapshots combined with parallel Slice processing, covering diagnostics, performance modeling, consistency levels, failure recovery, and resource isolation.

Big Data · Data Export · Elasticsearch

0 likes · 7 min read

StarRocks

Jan 22, 2026 · Big Data

How Paimon + StarRocks Accelerates Double‑11 OLAP Queries by 80% Refresh Speed

This article explains how Taotian Group unified real‑time and offline data using Paimon as lake storage and StarRocks for high‑performance OLAP, which eliminated costly sync pipelines, cut refresh time by about 80%, and saved nearly ten million yuan annually. It also details the architecture, cluster safeguards, configuration tweaks, monitoring, and future roadmap for large‑scale promotional events.

Big Data · Data Architecture · OLAP

0 likes · 24 min read

Architect's Guide

Jan 22, 2026 · Big Data

Unlock Kafka’s Power: Core Concepts, High‑Performance Architecture & Real‑World Scaling Tips

This comprehensive guide explores Kafka’s core value as a message queue, explains producers, consumers, topics, partitions, and replication, dives into cluster architecture, zero‑copy I/O, resource planning for disks, memory, CPU and network, and provides practical configuration, consumer‑group management, and operational tooling tips for building high‑throughput, highly available Kafka deployments.

Distributed Systems · Kafka · Message Queue

0 likes · 31 min read

Big Data Technology Tribe

Jan 20, 2026 · Big Data

Extending Spark SQL with LanceSparkSessionExtensions: A Complete Guide

This article explains how to inject the LanceSpark plugin into Spark, covering the core LanceSparkSessionExtensions class, various ways to register extensions, the custom parser and planner strategy implementations, and the underlying Spark mechanisms such as injectParser, injectPlannerStrategy, and PredicateHelper.

DataSourceV2 · LanceSpark · PlannerStrategy

0 likes · 14 min read

Big Data Tech Team

Jan 19, 2026 · Big Data

What Is Data Fabric and How It Can Eliminate Data Silos Today

This article explains the concept of Data Fabric, debunks common misconceptions, outlines the three key drivers behind its rise, and provides a practical four‑step roadmap—including metadata, semantic layers, policy engines, and AI—to help teams of any size adopt the technology.

AI · Data Fabric · Metadata Management

0 likes · 7 min read

DeWu Technology

Jan 19, 2026 · Big Data

How to Speed Up Full‑Scale Data Comparison for Massive Migration Projects

This article details the challenges of comparing billions of rows during large‑scale data migrations, presents a multi‑step solution using union‑all grouping, hash‑based aggregation, and intelligent primary‑key detection, and explains platform features, performance optimizations, and future enhancements that reduced comparison time by up to 70%.

data comparison · hash aggregation · primary key detection

0 likes · 16 min read

DataFunSummit

Jan 18, 2026 · Big Data

How Ray Reinvents AI Data Pipelines for Massive Multimodal Inference

This article examines the shortcomings of traditional big‑data engines for AI workloads, presents a Ray‑based heterogeneous fusion architecture that unifies CPU/GPU scheduling, Python ecosystems, and streaming‑batch processing, and details fault‑tolerance, checkpointing, compute‑storage separation, resource‑utilization, scalability, and observability improvements that enable thousands of nodes and dramatically higher GPU efficiency.

Big Data · Cloud Native · Distributed computing

0 likes · 31 min read

Mike Chen's Internet Architecture

Jan 18, 2026 · Big Data

Mastering Kafka High Availability: Replication, Leader‑Follower, ISR, and Ack Strategies

This article explains Kafka's high‑availability architecture, covering multi‑replica replication, leader‑follower election and failover, the role of In‑Sync Replicas, and producer acknowledgment settings with min.insync.replicas for reliable, zero‑data‑loss streaming.

Ack Strategy · Big Data · ISR

0 likes · 4 min read

Big Data Tech Team

Jan 15, 2026 · Big Data

Mastering Data Warehousing: Core Concepts, Tools, and Future Trends

This article outlines a comprehensive roadmap for data warehousing, covering fundamental concepts, essential big‑data tools, practical implementation steps, advanced architectural topics, and emerging trends such as cloud‑native warehouses and machine‑learning integration, helping readers build a solid knowledge base.

Data Warehouse · ETL · OLAP

0 likes · 9 min read

Mingyi World Elasticsearch

Jan 15, 2026 · Big Data

Why Elasticsearch Tokenizers Are on the Soft Exam and How to Master Them

The article breaks down the four Elasticsearch tokenizers tested in the latest Soft Exam, explains their behavior with concrete examples, discusses why search technology is now essential for architects, and predicts future exam trends, offering practical study guidance.

Distributed Systems · Elasticsearch · Exam Preparation

0 likes · 9 min read

Big Data Tech Team

Jan 12, 2026 · Big Data

Avoid the 5 Fatal DWS Design Traps and Build Scalable Data Warehouses

This article dissects the five most common pitfalls when transitioning from DWD to DWS aggregation tables—such as chimney‑style designs, over‑wide tables, grain mismatches, missing drill‑down keys, and performance neglect—and offers concrete, production‑ready solutions to create reusable, efficient, and cost‑effective data‑warehouse layers.

DWS Design · Data Warehouse · ETL

0 likes · 9 min read

Instant Consumer Technology Team

Jan 8, 2026 · Big Data

How Vintage Cohort Analysis Transforms Financial Risk Management

This article explains the concept, key terminology, and practical implementation of Vintage (cohort) analysis in financial services, detailing how to build tables and curves, integrate data pipelines, and use the insights to optimize marketing strategies, credit risk assessment, and operational efficiency.

Risk Management · Vintage analysis · cohort analysis

0 likes · 18 min read

Alibaba Cloud Big Data AI Platform

Jan 8, 2026 · Big Data

How Gaode Maps Built a Real‑Time Lakehouse for Billion‑Scale Trajectory Data

This article details Gaode Maps' end‑to‑end lakehouse solution for massive, high‑frequency trajectory data, covering the challenges of real‑time visibility, query performance, and storage cost, and explaining how a hot‑warm‑cold tiering architecture built on Apache Flink, Paimon, StarRocks, Redis and Lindorm delivers millisecond‑level queries while cutting storage expenses.

Apache Flink · Apache Paimon · Data Tiering

0 likes · 19 min read

iQIYI Technical Product Team

Jan 8, 2026 · Big Data

How iQIYI Cut Stream Data Costs by 70%: From Private‑Cloud Kafka to AutoMQ

This article details iQIYI's evolution from a tightly coupled private‑cloud Kafka setup to a cloud‑native AutoMQ architecture, describing the challenges of scaling, the development of the Stream platform and Stream‑SDK, the migration to hybrid and public‑cloud Kafka, and the resulting cost and elasticity improvements.

AutoMQ · Data Architecture · Kafka

0 likes · 12 min read
