ITPUB
Jan 29, 2026 · Big Data

How to Sync MySQL ALTER DDL to Doris Using Flink CDC (Step‑by‑Step)

This guide explains how to extend a Flink CDC pipeline so that, in addition to real‑time data replication, DDL ALTER statements from MySQL are captured, split from the data stream, and applied to Doris using side‑outputs and a custom JDBC sink.

DDL synchronization · Flink CDC
0 likes · 8 min read
DataFunSummit
Jan 29, 2026 · Big Data

How to Slash Web Scraping Costs by 60%: Proven Strategies from a Bright Data Expert

In the era of massive AI model training, this article presents a step‑by‑step technical guide—covering the full data‑collection pipeline, three acquisition modes, IP‑type choices, bandwidth savings, path and mixed‑request optimizations, and business‑level cost controls—to reduce web‑scraping expenses by more than 60% while maintaining data quality.

AI · automation · data collection
0 likes · 24 min read
Data Party THU
Jan 29, 2026 · Big Data

How a Tsinghua Big Data Program Turned a Chemistry PhD into an AI‑Powered Process Engineer

This article recounts a Tsinghua University PhD student's journey through a multidisciplinary big‑data training program, detailing the acquisition of AI and data‑science skills, the creation of novel algorithms like MicroFlowSAM and ImageRAG, and their successful application to chemical engineering research and industry projects.

Big Data · Chemical Engineering · Industrial Application
0 likes · 8 min read
Big Data Tech Team
Jan 26, 2026 · Big Data

Master DWD, DWS, and Wide‑Table Modeling for Scalable Data Warehouses

This guide explains the layered DWD (detail) and DWS (summary) modeling approach combined with wide‑table‑driven design, covering model positioning, design principles, concrete schema examples, implementation techniques, performance tips, and common pitfalls to help build clean, reusable, high‑performance enterprise data warehouses.

DWD · DWS · Data Warehouse
0 likes · 9 min read
Data Party THU
Jan 25, 2026 · Big Data

How Tsinghua’s Big Data Initiative Boosted Refinery Energy Forecasts with GRU

The Tsinghua University Big Data Capability Project applied GRU‑based deep learning, pulse‑event encoding, and advanced feature engineering to transform discrete refinery energy data into continuous sequences, achieving prediction accuracies of 84.2%, 82.7%, and 81.6% for fuel gas, medium‑pressure steam, and low‑pressure steam, respectively.

GRU · energy prediction · feature engineering
0 likes · 9 min read
Ray's Galactic Tech
Jan 22, 2026 · Big Data

Export 1 Billion Elasticsearch Docs in 3 Hours Using PIT + Slice

This guide explains how to reliably export over a billion Elasticsearch documents within a few hours by using Point‑In‑Time (PIT) snapshots combined with parallel Slice processing, covering diagnostics, performance modeling, consistency levels, failure recovery, and resource isolation.

Big Data · Data Export · Elasticsearch
0 likes · 7 min read
StarRocks
Jan 22, 2026 · Big Data

How Paimon + StarRocks Accelerates Double‑11 OLAP Queries by 80% Refresh Speed

This article explains how Taotian Group unified real‑time and offline data using Paimon as lake storage and StarRocks for high‑performance OLAP, eliminating costly sync pipelines, cutting refresh time by about 80%, saving nearly ten million yuan annually, and detailing the architecture, cluster safeguards, configuration tweaks, monitoring, and future roadmap for large‑scale promotional events.

Big Data · Data Architecture · OLAP
0 likes · 24 min read
Architect's Guide
Jan 22, 2026 · Big Data

Unlock Kafka’s Power: Core Concepts, High‑Performance Architecture & Real‑World Scaling Tips

This comprehensive guide explores Kafka’s core value as a message queue, explains producers, consumers, topics, partitions, and replication, dives into cluster architecture, zero‑copy I/O, resource planning for disks, memory, CPU and network, and provides practical configuration, consumer‑group management, and operational tooling tips for building high‑throughput, highly available Kafka deployments.

Distributed Systems · Kafka · Message Queue
0 likes · 31 min read
Big Data Technology Tribe
Jan 20, 2026 · Big Data

Extending Spark SQL with LanceSparkSessionExtensions: A Complete Guide

This article explains how to inject the LanceSpark plugin into Spark, covering the core LanceSparkSessionExtensions class, various ways to register extensions, the custom parser and planner strategy implementations, and the underlying Spark mechanisms such as injectParser, injectPlannerStrategy, and PredicateHelper.

DataSourceV2 · LanceSpark · PlannerStrategy
0 likes · 14 min read
Big Data Tech Team
Jan 19, 2026 · Big Data

What Is Data Fabric and How It Can Eliminate Data Silos Today

This article explains the concept of Data Fabric, debunks common misconceptions, outlines the three key drivers behind its rise, and provides a practical four‑step roadmap—including metadata, semantic layers, policy engines, and AI—to help teams of any size adopt the technology.

AI · Data Fabric · Metadata Management
0 likes · 7 min read
DeWu Technology
Jan 19, 2026 · Big Data

How to Speed Up Full‑Scale Data Comparison for Massive Migration Projects

This article details the challenges of comparing billions of rows during large‑scale data migrations, presents a multi‑step solution using union‑all grouping, hash‑based aggregation, and intelligent primary‑key detection, and explains platform features, performance optimizations, and future enhancements that reduced comparison time by up to 70%.

data comparison · hash aggregation · primary key detection
0 likes · 16 min read
DataFunSummit
Jan 18, 2026 · Big Data

How Ray Reinvents AI Data Pipelines for Massive Multimodal Inference

This article examines the shortcomings of traditional big‑data engines for AI workloads, presents a Ray‑based heterogeneous fusion architecture that unifies CPU/GPU scheduling, Python ecosystems, and streaming‑batch processing, and details fault‑tolerance, checkpointing, compute‑storage separation, resource‑utilization, scalability, and observability improvements that enable thousands of nodes and dramatically higher GPU efficiency.

Big Data · Cloud Native · Distributed computing
0 likes · 31 min read
Big Data Tech Team
Jan 15, 2026 · Big Data

Mastering Data Warehousing: Core Concepts, Tools, and Future Trends

This article outlines a comprehensive roadmap for data warehousing, covering fundamental concepts, essential big‑data tools, practical implementation steps, advanced architectural topics, and emerging trends such as cloud‑native warehouses and machine‑learning integration, helping readers build a solid knowledge base.

Data Warehouse · ETL · OLAP
0 likes · 9 min read
Big Data Tech Team
Jan 12, 2026 · Big Data

Avoid the 5 Fatal DWS Design Traps and Build Scalable Data Warehouses

This article dissects the five most common pitfalls when transitioning from DWD to DWS aggregation tables—such as chimney‑style designs, over‑wide tables, grain mismatches, missing drill‑down keys, and performance neglect—and offers concrete, production‑ready solutions to create reusable, efficient, and cost‑effective data‑warehouse layers.

DWS Design · Data Warehouse · ETL
0 likes · 9 min read
Instant Consumer Technology Team
Jan 8, 2026 · Big Data

How Vintage Cohort Analysis Transforms Financial Risk Management

This article explains the concept, key terminology, and practical implementation of Vintage (cohort) analysis in financial services, detailing how to build tables and curves, integrate data pipelines, and use the insights to optimize marketing strategies, credit risk assessment, and operational efficiency.

Risk Management · Vintage analysis · cohort analysis
0 likes · 18 min read
Alibaba Cloud Big Data AI Platform
Jan 8, 2026 · Big Data

How Gaode Maps Built a Real‑Time Lakehouse for Billion‑Scale Trajectory Data

This article details Gaode Maps' end‑to‑end lakehouse solution for massive, high‑frequency trajectory data, covering the challenges of real‑time visibility, query performance, and storage cost, and explaining how a hot‑warm‑cold tiering architecture built on Apache Flink, Paimon, StarRocks, Redis and Lindorm delivers millisecond‑level queries while cutting storage expenses.

Apache Flink · Apache Paimon · Data Tiering
0 likes · 19 min read
iQIYI Technical Product Team
Jan 8, 2026 · Big Data

How iQIYI Cut Stream Data Costs by 70%: From Private‑Cloud Kafka to AutoMQ

This article details iQIYI's evolution from a tightly coupled private‑cloud Kafka setup to a cloud‑native AutoMQ architecture, describing the challenges of scaling, the development of the Stream platform and Stream‑SDK, the migration to hybrid and public‑cloud Kafka, and the resulting cost and elasticity improvements.

AutoMQ · Data Architecture · Kafka
0 likes · 12 min read