May 10, 2026 · Big Data

How Lance File Format v2.2 Accelerates, Cuts Costs, and Governs Multimodal Data

Lance File Format v2.2 tackles the AI data explosion with hundred‑fold faster random reads, two‑layer compression, zero‑cost schema evolution, Git‑style versioning, and external blob handling, plus a roadmap toward native media support and intelligent encoding, positioning it as core infrastructure for large‑scale multimodal workloads.

File Format · IO performance · Lance

0 likes · 14 min read

Zhihu Tech Column

May 9, 2026 · Big Data

How Zhihu Built a Unified OneID System to Consolidate Fragmented User Identities

Zhihu created a unified OneID framework that merges scattered account, device, and behavior data into a globally unique identifier, using strong and weak IDs, graph‑based connectivity, device governance, and a device half‑life model to improve recommendation, push, and advertising effectiveness.

Big Data · Device Governance · Graph Computation

0 likes · 11 min read

StarRocks

May 8, 2026 · Big Data

Scaling Real‑Time Analytics at KaptureCX: Best Practices with RisingWave and StarRocks

KaptureCX migrated its core analytics from ClickHouse to StarRocks, introduced RisingWave and Kafka for CDC, and achieved millisecond‑level query latency, a reporting cycle cut from weeks to one day, and a solid data foundation for AI‑driven services.

CDC · Kafka · MVP

0 likes · 11 min read

DataFunTalk

May 8, 2026 · Big Data

How MaxCompute Evolves into a Data+AI Platform: Architecture, Core Capabilities, and Real-World Cases

The article explains how Alibaba Cloud's MaxCompute has been transformed into a cloud‑native Data+AI platform, detailing its layered architecture, multimodal storage, model management, hybrid compute scheduling, SQL AI functions, the MaxFrame Python framework, and several enterprise case studies that demonstrate performance gains and flexible resource orchestration.

AI integration · Big Data · Cloud Native

0 likes · 11 min read

DataFunTalk

May 6, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's four‑stage data‑platform evolution, from a simple ClickHouse ad‑hoc setup through a Lambda‑based 2.0 design to a lakehouse‑driven 3.0 architecture, highlighting the adoption of general incremental compute, costs reduced to one‑third, performance gains of up to tenfold, and the SPOT standards that guide the new system.

Big Data · ClickHouse · Data Architecture

0 likes · 21 min read

DataFunSummit

May 5, 2026 · Big Data

A New Data Lake Paradigm: Volcano Engine’s Multi‑Modal Data Lake Built on Lance

The article presents Volcano Engine’s AI‑focused data lake built on the Lance format, detailing why traditional lakes fall short for multimodal data, the engineering enhancements such as Binary Copy Compaction, Lance Insight, distributed vector indexing, JSON‑based tagging, Row‑ID shuffle optimization, and real‑world case studies that demonstrate significant performance and cost gains.

AI · Binary Copy Compaction · Distributed Vector Index

0 likes · 18 min read

DataFunTalk

May 2, 2026 · Big Data

Building a One-Person Data Team: Core Skills of a Full‑Stack Data Engineer

The article examines why a single data engineer can run an end‑to‑end data team, outlines the essential abilities—semantic ownership, building an agentic data stack, and leveraging historical context—while discussing ChatBI’s limits, validation loops, and the open‑source Datus 0.3 harness for practical implementation.

ChatBI · Datus · Full-Stack Data Engineer

0 likes · 14 min read

DataFunTalk

Apr 29, 2026 · Big Data

How Xiaohongshu Revamped Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based analytics stack to a unified lakehouse with generic incremental compute, cutting architecture complexity, resource cost, and development effort to roughly one‑third each while supporting petabyte‑scale, sub‑second queries across its 350‑million‑user app.

Big Data · ClickHouse · Data Architecture

0 likes · 22 min read

Lao Guo's Learning Space

Apr 29, 2026 · Big Data

Designing a Full-Stack Credit Data System: From Ingestion to Real-Time Decision

The article dissects a credit data system architecture, detailing six logical layers—from multi-source data collection and feature engineering (including graph features and feature stores) to model training, real‑time stream processing, decision engine integration, and privacy‑preserving computation—while explaining the trade‑offs, tools, and performance targets needed for accurate, low‑latency risk assessment.

Credit Scoring · Feature Store · Flink

0 likes · 16 min read

Model Perspective

Apr 28, 2026 · Big Data

How a Taiwan Ban Became Free Advertising for Amap’s Map App

A recent Taiwan government warning against Amap turned into a viral boost, exposing the app’s superior traffic‑light countdown, massive data‑driven network effects, and the underlying reverse‑propagation model that explains why the ban accelerated downloads rather than suppressing them.

Amap · Big Data · Network Effects

0 likes · 11 min read

DataFunSummit

Apr 28, 2026 · Big Data

Dynamic Table: A Next‑Generation Data Processing Architecture Powered by Incremental Computing

The article examines the limitations of traditional batch and stream processing, explains how Hologres Dynamic Table combines declarative freshness settings with stateful incremental computation to bridge the gap between low‑cost batch jobs and low‑latency streaming, and presents benchmark results and real‑world case studies.

Dynamic Table · Hologres · benchmark

0 likes · 13 min read

Big Data Technology & Architecture

Apr 28, 2026 · Big Data

Inside Apache Paimon 1.4: Core Principles and Design of an AI Multimodal Data Lake

Apache Paimon 1.4 redefines itself as an AI multimodal data lake by introducing row tracking, data evolution, Blob and Vector tables, Variant shredding, and Lumina‑BTree global indexing, each explained with concrete examples, configuration flags, and storage layouts that illustrate how the new capabilities enable unified storage and efficient retrieval of diverse data types.

Apache Paimon · Blob Table · Data Evolution

0 likes · 8 min read

DataFunSummit

Apr 27, 2026 · Big Data

How MaxCompute Evolves Big Data Platforms for AI: Architecture, Core Capabilities, and Real‑World Cases

The article details MaxCompute's AI‑driven evolution, covering its multilayer architecture, multimodal storage management, SQL AI functions, the Python‑based MaxFrame framework, and several industry case studies that demonstrate performance gains and flexible resource scheduling for large‑scale AI workloads.

Data+AI · Distributed computing · MaxCompute

0 likes · 12 min read

DataFunSummit

Apr 25, 2026 · Big Data

AI‑Era Multimodal Data Lake Infrastructure: TBDS Design, Storage, Compute, and Governance

The article analyzes how Tencent Cloud's TBDS platform tackles the AI era's multimodal data lake challenges through a native storage format (Lance), elastic Ray‑based compute, standardized metadata with Gravitino, and automated governance via Lakekeeper, citing architecture details, performance numbers, and real‑world deployments.

AI infrastructure · Big Data · Gravitino

0 likes · 13 min read

Big Data Tech Team

Apr 22, 2026 · Big Data

Inside Big Tech: Full Breakdown of AI Agents for Data Warehouse Governance

The article analyzes how leading internet companies embed AI agents across the entire data‑warehouse lifecycle to automate governance, presenting real‑world case studies from Alibaba, ByteDance, JD.com and Tencent, and quantifies benefits such as over 65% reduction in manual effort, 50% drop in metric duplication, and a 40% boost in resource utilization.

AI agents · Big Data · Data Warehouse

0 likes · 10 min read

DataFunSummit

Apr 19, 2026 · Big Data

How OPPO Built a Multi‑Modal Data Lake with Gravitino and Curvine

OPPO’s data‑lake team, led by David, detailed their transition from Hive‑Spark to a unified multi‑modal lake, leveraging Gravitino for cross‑engine metadata management and the open‑source Curvine cache to eliminate data silos, boost I/O performance, and support massive image, recommendation, and AI‑Agent workloads.

Big Data · Distributed Cache · Metadata Management

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Apr 17, 2026 · Big Data

What Spark 4.0 Brings: VARIANT Type, Native SQL UDFs, and Serverless Enhancements

Apache Spark 4.0 introduces a high‑performance VARIANT data type for semi‑structured JSON, native SQL UDFs that eliminate Python UDF bottlenecks, a richer Python DataSource API, a new pipeline syntax, upgraded Structured Streaming state management, and Alibaba Cloud EMR Serverless optimizations that together deliver up to 30% speed gains and seamless migration from Spark 3.x.

Apache Spark · Python API · SQL UDF

0 likes · 12 min read

Ctrip Technology

Apr 16, 2026 · Big Data

How Ray + DuckDB Cut 900M-Row Attribution Queries from 40s to 15s

When attribution analysis on over 900 million rows slowed to more than 40 seconds and threatened cluster stability, Ctrip's smart attribution team rebuilt the architecture with Ray and DuckDB, achieving sub‑15‑second query times, a 160% performance gain, and complete resource isolation.

Attribution Analysis · Big Data · Distributed computing

0 likes · 22 min read

DataFunTalk

Apr 16, 2026 · Big Data

How Xiaohongshu Cut Data Architecture Costs by Two‑Thirds with Incremental Computing

This article details Xiaohongshu's data platform evolution from a simple ClickHouse‑based ad‑hoc system to a Lambda‑style architecture and finally a lakehouse solution, highlighting how the adoption of a new incremental computing model reduced architectural complexity, resource consumption and development effort each to roughly one‑third while delivering sub‑second query performance on petabyte‑scale data.

Big Data · Data Architecture · Lakehouse

0 likes · 21 min read

Architect Chen

Apr 16, 2026 · Big Data

Supercharge Kafka Consumer Performance: Parallelism, Batching, and Multithreading

This guide explains practical techniques to dramatically increase Kafka consumer throughput, including scaling consumer instances or partitions, tuning fetch and poll parameters, and implementing a multithreaded consumer model, while also covering hardware, JVM, and OS optimizations and monitoring recommendations.

Batch Fetch · Consumer Parallelism · Kafka

0 likes · 5 min read
