DataFunSummit
May 10, 2026 · Big Data

How Lance File Format v2.2 Accelerates, Cuts Costs, and Governs Multimodal Data

Lance File Format v2.2 tackles the AI data explosion with hundred‑fold faster random reads, advanced two‑layer compression, zero‑cost schema evolution, Git‑style versioning, external blob handling, and a roadmap toward native media support and intelligent encoding, positioning it as core infrastructure for large‑scale multimodal workloads.

File Format · IO performance · Lance
0 likes · 14 min read
DataFunTalk
May 8, 2026 · Big Data

How MaxCompute Evolves into a Data+AI Platform: Architecture, Core Capabilities, and Real-World Cases

The article explains how Alibaba Cloud's MaxCompute has been transformed into a cloud‑native Data+AI platform, detailing its layered architecture, multimodal storage, model management, hybrid compute scheduling, SQL AI functions, the MaxFrame Python framework, and several enterprise case studies that demonstrate performance gains and flexible resource orchestration.

AI integration · Big Data · Cloud Native
0 likes · 11 min read
DataFunTalk
May 6, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's four‑stage data‑platform evolution, from a simple ClickHouse ad‑hoc setup through a Lambda‑based 2.0 design to a lakehouse‑driven 3.0 architecture, highlighting the adoption of general incremental compute, costs cut to one‑third, performance gains of up to ten‑fold, and the SPOT standards that guide the new system.

Big Data · ClickHouse · Data Architecture
0 likes · 21 min read
DataFunSummit
May 5, 2026 · Big Data

A New Data Lake Paradigm: Volcano Engine’s Multi‑Modal Data Lake Built on Lance

The article presents Volcano Engine’s AI‑focused data lake built on the Lance format, detailing why traditional lakes fall short for multimodal data, the engineering enhancements such as Binary Copy Compaction, Lance Insight, distributed vector indexing, JSON‑based tagging, Row‑ID shuffle optimization, and real‑world case studies that demonstrate significant performance and cost gains.

AI · Binary Copy Compaction · Distributed Vector Index
0 likes · 18 min read
DataFunTalk
May 2, 2026 · Big Data

Building a One-Person Data Team: Core Skills of a Full‑Stack Data Engineer

The article examines why a single data engineer can run an end‑to‑end data team, outlines the essential abilities—semantic ownership, building an agentic data stack, and leveraging historical context—while discussing ChatBI’s limits, validation loops, and the open‑source Datus 0.3 harness for practical implementation.

ChatBI · Datus · Full-Stack Data Engineer
0 likes · 14 min read
DataFunTalk
Apr 29, 2026 · Big Data

How Xiaohongshu Revamped Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based analytics stack to a unified lakehouse with generic incremental compute, cutting architecture complexity, resource cost, and development effort each to roughly one‑third while supporting petabyte‑scale, sub‑second queries across its 350 million‑user app.

Big Data · ClickHouse · Data Architecture
0 likes · 22 min read
Lao Guo's Learning Space
Apr 29, 2026 · Big Data

Designing a Full-Stack Credit Data System: From Ingestion to Real-Time Decision

The article dissects a credit data system architecture, detailing six logical layers—from multi-source data collection and feature engineering (including graph features and feature stores) to model training, real‑time stream processing, decision engine integration, and privacy‑preserving computation—while explaining the trade‑offs, tools, and performance targets needed for accurate, low‑latency risk assessment.

Credit Scoring · Feature Store · Flink
0 likes · 16 min read
Model Perspective
Apr 28, 2026 · Big Data

How a Taiwan Ban Became Free Advertising for Amap’s Map App

A recent Taiwan government warning against Amap turned into a viral boost, exposing the app’s superior traffic‑light countdown, massive data‑driven network effects, and the underlying reverse‑propagation model that explains why the ban accelerated downloads rather than suppressing them.

Amap · Big Data · Network Effects
0 likes · 11 min read
DataFunSummit
Apr 28, 2026 · Big Data

Dynamic Table: A Next‑Generation Data Processing Architecture Powered by Incremental Computing

The article examines the limitations of traditional batch and stream processing, explains how Hologres Dynamic Table combines declarative freshness settings with stateful incremental computation to bridge the gap between low‑cost batch jobs and low‑latency streaming, and presents benchmark results and real‑world case studies.

Dynamic Table · Hologres · benchmark
0 likes · 13 min read
Big Data Technology & Architecture
Apr 28, 2026 · Big Data

Inside Apache Paimon 1.4: Core Principles and Design of an AI Multimodal Data Lake

Apache Paimon 1.4 redefines itself as an AI multimodal data lake by introducing row tracking, data evolution, Blob and Vector tables, Variant shredding, and Lumina‑BTree global indexing, each explained with concrete examples, configuration flags, and storage layouts that illustrate how the new capabilities enable unified storage and efficient retrieval of diverse data types.

Apache Paimon · Blob Table · Data Evolution
0 likes · 8 min read
DataFunSummit
Apr 27, 2026 · Big Data

How MaxCompute Evolves Big Data Platforms for AI: Architecture, Core Capabilities, and Real‑World Cases

The article details MaxCompute's AI‑driven evolution, covering its multilayer architecture, multimodal storage management, SQL AI functions, the Python‑based MaxFrame framework, and several industry case studies that demonstrate performance gains and flexible resource scheduling for large‑scale AI workloads.

Data+AI · Distributed computing · MaxCompute
0 likes · 12 min read
DataFunSummit
Apr 25, 2026 · Big Data

AI‑Era Multimodal Data Lake Infrastructure: TBDS Design, Storage, Compute, and Governance

The article analyzes how Tencent Cloud's TBDS platform tackles the AI era's multimodal data lake challenges through a native storage format (Lance), elastic Ray‑based compute, standardized metadata with Gravitino, and automated governance via Lakekeeper, citing architecture details, performance numbers, and real‑world deployments.

AI infrastructure · Big Data · Gravitino
0 likes · 13 min read
Big Data Tech Team
Apr 22, 2026 · Big Data

Inside Big Tech: Full Breakdown of AI Agents for Data Warehouse Governance

The article analyzes how leading internet companies embed AI agents across the entire data‑warehouse lifecycle to automate governance, presenting real‑world case studies from Alibaba, ByteDance, JD.com and Tencent, and quantifies benefits such as a reduction of over 65% in manual effort, a 50% drop in metric duplication, and a 40% boost in resource utilization.

AI agents · Big Data · Data Warehouse
0 likes · 10 min read
DataFunSummit
Apr 19, 2026 · Big Data

How OPPO Built a Multi‑Modal Data Lake with Gravitino and Curvine

OPPO’s data‑lake team, led by David, detailed their transition from Hive‑Spark to a unified multi‑modal lake, leveraging Gravitino for cross‑engine metadata management and the open‑source Curvine cache to eliminate data silos, boost I/O performance, and support massive image, recommendation, and AI‑Agent workloads.

Big Data · Distributed Cache · Metadata Management
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Apr 17, 2026 · Big Data

What Spark 4.0 Brings: VARIANT Type, Native SQL UDFs, and Serverless Enhancements

Apache Spark 4.0 introduces a high‑performance VARIANT data type for semi‑structured JSON, native SQL UDFs that eliminate Python UDF bottlenecks, a richer Python DataSource API, a new pipeline syntax, upgraded Structured Streaming state management, and Alibaba Cloud EMR Serverless optimizations that together deliver up to 30% speed gains and seamless migration from Spark 3.x.

Apache Spark · Python API · SQL UDF
0 likes · 12 min read
Ctrip Technology
Apr 16, 2026 · Big Data

How Ray + DuckDB Cut 9B-Row Attribution Queries from 40s to 15s

When attribution analysis on over 900 million rows slowed to more than 40 seconds and threatened cluster stability, Ctrip's smart attribution team rebuilt the architecture with Ray and DuckDB, achieving sub‑15‑second query times, a 160% performance gain, and complete resource isolation.

Attribution Analysis · Big Data · Distributed computing
0 likes · 22 min read
DataFunTalk
Apr 16, 2026 · Big Data

How Xiaohongshu Cut Data Architecture Costs by Two‑Thirds with Incremental Computing

This article details Xiaohongshu's data platform evolution from a simple ClickHouse‑based ad‑hoc system to a Lambda‑style architecture and finally a lakehouse solution, highlighting how the adoption of a new incremental computing model reduced architectural complexity, resource consumption and development effort each to roughly one‑third while delivering sub‑second query performance on petabyte‑scale data.

Big Data · Data Architecture · Lakehouse
0 likes · 21 min read
Architect Chen
Apr 16, 2026 · Big Data

Supercharge Kafka Consumer Performance: Parallelism, Batching, and Multithreading

This guide explains practical techniques to dramatically increase Kafka consumer throughput, including scaling consumer instances or partitions, tuning fetch and poll parameters, and implementing a multithreaded consumer model, while also covering hardware, JVM, and OS optimizations and monitoring recommendations.

Batch Fetch · Consumer Parallelism · Kafka
0 likes · 5 min read