Tag: big data

Articles collected around this technical topic.

Architect's Guide
Jun 14, 2025 · Big Data

Mastering Data Warehouse Design: From Fact Tables to Dimensional Modeling

This article explains the core components of a data warehouse ecosystem, distinguishes fact and dimension tables, outlines synchronization strategies, introduces star, snowflake, and constellation schemas, and details the layered architecture from ODS to data marts for effective big‑data analytics.
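
The fact/dimension split this article covers can be sketched as a minimal star schema, here in SQLite (table and column names are illustrative, not taken from the article):

```python
import sqlite3

# Minimal star schema: one fact table referencing two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL            -- additive measure
);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_date VALUES (10, '2025-06-14');
INSERT INTO fact_sales VALUES (1, 10, 9.5), (2, 10, 20.0), (1, 10, 3.0);
""")

# A typical dimensional query: aggregate the fact table, group by a dimension.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('gadget', 20.0), ('widget', 12.5)]
```

A snowflake schema would further normalize the dimension tables; a constellation shares dimensions across several fact tables.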

ETL · big data · data warehouse
0 likes · 15 min read
Sohu Tech Products
Jun 11, 2025 · Big Data

How We Transformed a Microservice Finance System into a Scalable Big Data Warehouse

This article details the evolution of a fast‑growing finance reporting system from a monolithic microservice architecture plagued by data inconsistency, low efficiency, and scalability limits to a robust, high‑performance big‑data warehouse built with layered data models, SparkSQL processing, and unified scheduling, highlighting design decisions, technical trade‑offs, and measurable performance gains.

Microservices · architecture evolution · big data
0 likes · 23 min read
vivo Internet Technology
Jun 11, 2025 · Big Data

How Vivo Built a Scalable Pulsar Monitoring System for Trillion‑Message Workloads

This article details Vivo's end‑to‑end Pulsar observability solution, covering the challenges of Prometheus‑based monitoring, the architecture of the alerting pipeline, adaptor development, metric optimizations for subscription backlog and bundle load, and fixes for KoP lag reporting issues.

Metrics · Observability · Prometheus
0 likes · 12 min read
JD Retail Technology
Jun 10, 2025 · Artificial Intelligence

How JD Builds a Scalable AI‑Powered Recommendation Data System with Flink

This article explains JD's complex recommendation system data pipeline—from indexing, sampling, and feature engineering to explainability and real‑time metrics—highlighting challenges such as data consistency, latency, and the use of Flink for massive, low‑latency processing.

big data · explainability · feature engineering
0 likes · 23 min read
DataFunSummit
Jun 10, 2025 · Big Data

How OpenLake Redefines Data Lake Infrastructure for the AI Era

This article explores OpenLake's evolution as a data lake platform for AI, covering the transition from Hive to modern lake formats like Iceberg and Paimon, performance benchmarks, metadata management advances, intelligent storage optimization, and the integration of multimodal support with the Lance file format.

AI · Data Lake · OpenLake
0 likes · 22 min read
Lobster Programming
Jun 9, 2025 · Databases

How to Add a Column to Billion‑Row Tables Without Downtime

This article explains a metadata‑driven approach for extending massive tables—using a separate extension table, sharding, and Elasticsearch sync—to add new fields to billion‑row databases without locking the primary table or disrupting online services.
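
The extension‑table idea summarized above looks roughly like this (a schematic SQLite sketch with invented names; the article's actual design also involves sharding and an Elasticsearch sync):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The huge primary table is never ALTERed.
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
-- New fields land in a separate extension table keyed by the same id.
CREATE TABLE orders_ext (order_id INTEGER PRIMARY KEY, coupon_code TEXT);
INSERT INTO orders VALUES (1, 50.0), (2, 75.0);
INSERT INTO orders_ext VALUES (2, 'SAVE10');  -- only rows that need the new field
""")

# Reads merge the two tables; rows without an extension row simply get NULL.
row = conn.execute("""
    SELECT o.id, o.amount, e.coupon_code
    FROM orders o LEFT JOIN orders_ext e ON e.order_id = o.id
    WHERE o.id = 2
""").fetchone()
print(row)  # (2, 75.0, 'SAVE10')
```

Because no `ALTER TABLE` ever touches the billion‑row table, the primary table stays unlocked and online traffic is unaffected.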

Sharding · big data · database schema
0 likes · 6 min read
DataFunSummit
Jun 6, 2025 · Big Data

How Unicom Digital’s Integrated Data Platform Revolutionizes Metadata Management

This article details Unicom Digital’s metadata management practice on its integrated data platform, covering the strategic background, key challenges, award-winning capabilities, and a three-pronged solution of automation, linking+, and AI, along with practical implementations, full‑chain lineage, data responsibility, lifecycle management, and future AI‑driven enhancements.

AI · automation · big data
0 likes · 18 min read
DataFunSummit
Jun 3, 2025 · Big Data

BiFang: A Unified Lake‑Stream Storage Engine for Real‑Time and Batch Data Processing

BiFang is a lake‑stream integrated storage engine that merges Apache Pulsar message‑queue capabilities with Iceberg data‑lake features, providing a single unified data store with full‑incremental queries, sub‑second visibility, exactly‑once semantics, and seamless integration with Flink, Spark, and StarRocks for both real‑time analytics and batch processing.

Apache Iceberg · Apache Pulsar · Lakehouse
0 likes · 13 min read
DataFunSummit
Jun 1, 2025 · Big Data

Scaling WeChat’s Big Data and AI Workloads on Kubernetes: Challenges and Optimizations

This article details WeChat's migration of large‑scale big data and AI workloads to a cloud‑native Kubernetes platform, discussing performance bottlenecks, API server and etcd overload protection, scheduler enhancements, observability solutions, resource utilization gains, and future serverless directions.

AI · Kubernetes · Observability
0 likes · 11 min read
Kuaishou Tech
May 28, 2025 · Databases

Optimizing Kuaishou's Photo Object Storage: Reducing Size and Boosting Cache Hit Rate

This article details how Kuaishou dramatically cut storage costs and improved cache efficiency for its core Photo data object by cleaning up redundant JSON fields, applying selective serialization, and performing large‑scale data cleaning, achieving a 25% size reduction, a 2% cache‑hit increase, and multi‑hundred‑TB savings.

Cache Hit Rate · Kuaishou · Photo Object
0 likes · 20 min read
Full-Stack Internet Architecture
May 27, 2025 · Big Data

Understanding Event Streaming in Kafka: Core Concepts, Architecture, and Use Cases

This article explains Kafka's event streaming model: what events and streams are, core components such as producers, topics, partitions, consumers, and persistence, and typical use cases including real‑time data pipelines, event‑driven architecture, stream processing, and log aggregation, highlighting Kafka's role as foundational big‑data infrastructure.
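
The concepts the article names fit in a few lines of code. The toy model below is not the Kafka API, just an illustration of topics, keyed partitioning, and offset-based consumption:

```python
# Toy model of Kafka's core abstractions (illustrative only, not the Kafka API):
# a topic is a set of partitions, each an append-only log consumed by offset.
class Topic:
    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Keyed messages hash to a fixed partition, preserving per-key order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Consumers track their own offsets; the log itself is immutable.
        return self.partitions[partition][offset:]

clicks = Topic("page-clicks")
p, off = clicks.produce("user-1", "/home")
clicks.produce("user-1", "/cart")
print(clicks.consume(p, off))  # [('user-1', '/home'), ('user-1', '/cart')]
```

Persistence and replication, which real Kafka layers on top of exactly this log abstraction, are what the article's use cases depend on.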

Event Streaming · Kafka · Message Queues
0 likes · 7 min read
Alibaba Cloud Infrastructure
May 26, 2025 · Big Data

Comparative Guide to Apache Airflow and Argo Workflows for Distributed Task Scheduling

This article provides a comprehensive comparison of Apache Airflow and Argo Workflows, covering their core features, architectures, use cases, code examples, and recommendations for selecting the appropriate distributed workflow engine in data engineering, big‑data, and AI pipelines.

Apache Airflow · Argo Workflows · Data Engineering
0 likes · 23 min read
Full-Stack Internet Architecture
May 23, 2025 · Big Data

Step-by-Step Guide to Installing and Using Apache Kafka 3.8.1 on Linux

This tutorial walks through installing Apache Kafka 3.8.1 on a Linux system: downloading and extracting the release, configuring and starting the broker, creating topics, producing and consuming messages, and finally shutting everything down, with all the necessary command‑line instructions.

Installation · Kafka · Linux
0 likes · 4 min read
DataFunSummit
May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

DataOps · automation · big data
0 likes · 12 min read
Python Programming Learning Circle
May 22, 2025 · Big Data

Introduction to PySpark: Features, Core Components, Sample Code, and Use Cases

This article introduces PySpark as the Python API for Apache Spark, explains Spark's core concepts and advantages, details PySpark's main components and a simple code example, compares it with Pandas, and outlines typical big‑data scenarios and further learning directions.

Apache Spark · DataFrames · PySpark
0 likes · 5 min read
Full-Stack Internet Architecture
May 20, 2025 · Big Data

Why Learn Kafka? Core Benefits, Use Cases, and a Summary

This article explains why Kafka is widely adopted by top companies, outlines its high throughput, scalability, and durability, and describes key real‑time data pipeline, stream processing, and big‑data integration scenarios, concluding that mastering Kafka is essential for modern backend and data engineering roles.

Data Engineering · Kafka · Real-time Processing
0 likes · 4 min read
iQIYI Technical Product Team
May 15, 2025 · Big Data

Introducing AMD and ARM Bare‑Metal Instances for iQIYI Big Data Computing: Cloud Selection, Performance Evaluation, and Heterogeneous Scheduling

To reduce costs and boost compute density, iQIYI's big data team migrated from aging private‑cloud Intel servers to public‑cloud AMD and ARM bare‑metal instances, establishing a systematic machine‑selection process, performance testing framework, and YARN‑based heterogeneous scheduling to fully leverage the new hardware.

AMD · ARM · Cloud Computing
0 likes · 16 min read
Bilibili Tech
May 13, 2025 · Big Data

Live Streaming Ecosystem Governance Architecture and Data Mining Engine Design

The article outlines a comprehensive live‑streaming ecosystem governance framework that combines data‑mining engines, tagging platforms, rule‑based disposal mechanisms, and multi‑stage user touchpoints to improve content quality, compliance, and platform sustainability.

Live Streaming · Tagging System · big data
0 likes · 14 min read
macrozheng
May 12, 2025 · Big Data

Master DataX: Efficient Data Synchronization for Massive MySQL Datasets

Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full- and incremental sync, and practical command-line examples.
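
The incremental-sync pattern such DataX jobs implement reduces to a watermark query. The sketch below shows the idea in plain Python with SQLite (in DataX itself this is expressed as a JSON job config; table and column names here are invented):

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT);
INSERT INTO orders VALUES (1, '2025-05-01'), (2, '2025-05-10'), (3, '2025-05-12');
""")

def incremental_sync(conn, watermark):
    # Pull only rows modified since the last successful sync, then
    # advance the watermark to the newest timestamp seen.
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

rows, wm = incremental_sync(source, "2025-05-05")
print(rows, wm)  # [(2, '2025-05-10'), (3, '2025-05-12')] 2025-05-12
```

A full sync is the degenerate case with the watermark set to the minimum value; persisting the watermark between runs is what makes the job restartable.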

Data Synchronization · DataX · ETL
0 likes · 15 min read
360 Zhihui Cloud Developer
May 9, 2025 · Big Data

Mastering Multi‑AZ Replication in HDFS with AZ Mover

This article introduces AZ Mover, a lightweight HDFS client‑side tool that intelligently scans, schedules, and migrates block replicas across multiple availability zones, detailing its design goals, core workflow, command‑line options, concurrency controls, and future enhancements for robust big‑data disaster recovery.

AZ Mover · Cluster Operations · HDFS
0 likes · 9 min read