Tagged articles
3697 articles
DataFunTalk
Jun 28, 2024 · Big Data

Accelerating Spark with ClickHouse: Native Optimization Techniques and Performance Evaluation

This article presents a comprehensive technical overview of using ClickHouse as a native backend to accelerate Spark SQL execution, covering Spark performance bottlenecks, ClickHouse's CPU‑level optimizations, the design and implementation of the Spark‑Native integration, and detailed TPC‑DS benchmark results demonstrating up to 3.5× speedup.

Big Data · ClickHouse · Performance Optimization
33 min read
Tencent Cloud Developer
Jun 28, 2024 · Big Data

Capacity-Constrained Influence Maximization: Algorithms and Applications

The paper introduces Capacity‑Constrained Influence Maximization (CIM), a framework that selects up to k neighbors per active user to maximize spread under node capacity limits, proposes MG‑Greedy and RR‑Greedy algorithms with ≥½ approximation, and demonstrates the near‑linear RR‑OPIM+ method’s superior accuracy and speed on large social networks and a Tencent game recommendation system.

Big Data · Capacity Constraint · KDD 2023
8 min read
DevOps
Jun 27, 2024 · Big Data

Agile Data Engineering: Code‑as‑Infrastructure, Reuse Strategies, and ETL‑Level Continuous Integration

This article explores agile data engineering, advocating code‑as‑infrastructure practices such as code‑everything, data and code reuse, and ETL‑level continuous integration, while discussing the trade‑offs between data‑centric and code‑centric reuse, cost considerations, and practical implementation tips for modern data projects.

Agile Development · Big Data · Code as Infrastructure
22 min read
Past Memory Big Data
Jun 27, 2024 · Big Data

Inside Presto 2.0: The Native C++ Query Engine Explained

This article provides a detailed technical overview of Presto 2.0, the native C++ query engine built on the Velox library, covering its motivation, vectorized architecture, memory management, performance benchmarks from Meta and IBM, and deployment practices for large‑scale data warehouses.

Big Data · C++ · Data Warehouse
15 min read
DataFunTalk
Jun 27, 2024 · Big Data

Data Warehouse Construction and Data Governance Practices at Wing Payment

This presentation by senior data warehouse engineer Huang Luo details Wing Payment’s end‑to‑end data warehouse build, covering background challenges, governance framework, platform architecture, layered modeling, naming standards, asset management, monitoring, and future plans, illustrating how systematic data governance drives cost reduction, efficiency, and security.

Analytics · Big Data · Data Security
14 min read
DataFunTalk
Jun 26, 2024 · Big Data

Evolution of the Big Data + AI Development Paradigm and Alibaba Cloud’s Integrated Architecture

This article examines how the big‑data AI development paradigm has shifted from model‑centric to data‑centric workflows, outlines the challenges of integrating data and AI teams, and details Alibaba Cloud’s end‑to‑end, serverless big‑data platform—including MaxCompute, Hologres, MaxFrame, Object Table, and vector search—designed to accelerate large‑scale AI applications.

AI Integration · Big Data · Data Platform
20 min read
Baidu Geek Talk
Jun 24, 2024 · Big Data

Accelerating Spark with ClickHouse Native Techniques: Design, Implementation, and Performance Evaluation

The paper presents a Spark acceleration framework that replaces Java‑based task operators with a ClickHouse native library, converting plans via Protobuf and JNI, leveraging columnar storage, SIMD and JIT to achieve up to 3× speed‑up on TPC‑DS workloads while providing fallback mechanisms to ensure no performance loss.

Big Data · ClickHouse · Native Acceleration
31 min read
DataFunTalk
Jun 22, 2024 · Big Data

Migrating Spark Shuffle Service from ESS to RSS (Celeborn) at Zhihu: Design, Implementation, and Benefits

This article details Zhihu's migration of massive Spark and MapReduce shuffle workloads from the External Shuffle Service (ESS) to a push‑based Remote Shuffle Service (RSS) powered by Celeborn, covering background problems, evaluation of open‑source implementations, deployment architecture, encountered issues, solutions, performance gains, and future plans.

Big Data · Performance · RSS
19 min read
DataFunSummit
Jun 21, 2024 · Big Data

Building a Complete Data System with Apache Arrow: Architecture, Dynamic Schema Modeling, and Practical Tips

This article explains why new data systems are needed, introduces Apache Arrow and its columnar in‑memory format, describes dynamic read‑time modeling, outlines the system’s execution flow, storage and indexing strategies, and shares practical tips and extensions for building scalable big‑data solutions.

Acero · Apache Arrow · Big Data
20 min read
Past Memory Big Data
Jun 20, 2024 · Big Data

How Meituan Scaled Spark with Vectorized Execution Using Gluten + Velox

This article details Meituan's production‑grade adoption of Spark vectorized execution via the open‑source Gluten and Velox stack, explaining SIMD fundamentals, performance motivations, the end‑to‑end integration workflow, staged rollout, encountered challenges, and the resulting resource savings and speedups.

Big Data · Gluten · ORC
33 min read
Meituan Technology Team
Jun 20, 2024 · Big Data

Vectorized Execution in Apache Spark: Meituan’s Practice with Gluten and Velox

Meituan enhances Apache Spark by integrating the Gluten‑Velox vectorized execution engine, converting row‑wise operations to columnar SIMD processing, which yields over 40 % memory savings and up to 13 % faster runtimes across thousands of ETL jobs, while addressing stability, ORC support, shuffle redesign, and off‑heap memory optimization.

Apache Spark · Big Data · C++
30 min read
DataFunSummit
Jun 20, 2024 · Big Data

Data+AI Data Lake Technologies: Apache Iceberg, PyIceberg, and Vector Table Solutions

This article presents a comprehensive overview of modern Data+AI data lake challenges and solutions, covering the evolution of data lakes, an introduction to Apache Iceberg, practical use of PyIceberg for AI training and inference pipelines, and advanced vector table and indexing techniques for efficient similarity search.

AI training · Apache Iceberg · Big Data
22 min read
AI Architecture Hub
Jun 20, 2024 · Big Data

How GeoHash Powers Efficient Large-Scale Location Queries Without Pagination

This article explains the GeoHash algorithm, shows how it converts latitude‑longitude pairs into compact binary strings, demonstrates the encoding process with a concrete example, and discusses how the resulting prefixes can be used to quickly locate nearby users in massive datasets while highlighting remaining edge‑case challenges.

Big Data · GeoHash · Location Query
7 min read
vivo Internet Technology
Jun 19, 2024 · Big Data

Understanding BitMap and Roaring BitMap: Principles, Containers, and Java API Usage

The article explains BitMap fundamentals and introduces Roaring BitMap’s compressed container architecture—Array, BitMap, and Run containers—detailing their conversion logic, Java implementation snippets, performance advantages over traditional BitSets, and practical API usage for high‑performance, memory‑efficient big‑data applications.

Big Data · Containers · Roaring Bitmap
18 min read
DataFunTalk
Jun 19, 2024 · Big Data

Evolution and Practices of E‑commerce Data Warehouse Governance

This article analyzes the current state, development stages, and comprehensive solutions of e‑commerce data‑warehouse governance, covering data quality, cost, security, and efficiency requirements, and presents a roadmap from early‑stage standardization to mature tool‑driven governance with future outlooks.

Big Data · Cost Management · Data Warehouse
13 min read
Architect
Jun 18, 2024 · Big Data

How GeoHash Powers Real‑Time Ride‑Hailing: From Theory to Practice

This article explains the GeoHash algorithm, demonstrates how binary subdivision of latitude and longitude yields compact base‑32 strings, and shows how these hashes can efficiently locate nearby ride‑hailing drivers while highlighting precision limitations and edge cases.

Big Data · GeoHash · Location Services
8 min read
Big Data Technology & Architecture
Jun 16, 2024 · Big Data

Real-time Big Data Analytics with Apache Paimon and the Streaming Lakehouse Architecture

This article summarizes Wang Feng's presentation on the next‑generation Lakehouse architecture, explaining how Apache Paimon provides a unified, real‑time data lake format that bridges batch and streaming workloads, enabling low‑latency analytics and AI integration for modern big‑data applications.

Apache Paimon · Big Data · Real-time Analytics
9 min read
DataFunSummit
Jun 14, 2024 · Big Data

JD Logistics One‑Stop Agile BI Solution: Architecture, Challenges, and Product Evolution

This article presents JD Logistics' one‑stop agile BI platform, detailing the complex data sources, rapid business demands, the UData solution architecture, performance and usability improvements, and future upgrade plans that together enable faster data integration, self‑service reporting, and enhanced decision‑making across the organization.

Agile Analytics · BI · Big Data
25 min read
DataFunTalk
Jun 12, 2024 · Big Data

Technical Maturity Curve of Indicator Systems: Framework, Requirements, and the Role of Large Models

This article explores the technical maturity curve of indicator systems, covering data collection, modeling, production, management, governance, and application, while analyzing the security, stability, and usability requirements and discussing how large language models can enhance certain clear and complicated scenarios.

AI Integration · Big Data · Maturity Model
10 min read
ZhongAn Tech Team
Jun 11, 2024 · Artificial Intelligence

AI and Big Data Developments in Tech News

This article covers recent AI developments, big data challenges, and industry insights including AI course expansions, regulatory discussions, and tech company updates.

AI · AI Developments · Big Data
9 min read
DataFunTalk
Jun 9, 2024 · Big Data

Optimizing ClickHouse Performance in WeChat: Observation Tools, Lakehouse Reading, Bitmap Acceleration, and AI Integration

This article details how the WeChat team leverages ClickHouse at massive scale, introduces a suite of performance observation tools, describes lakehouse reading and bitmap optimizations, and explains the integration of AI workloads, demonstrating overall query speedups of up to tenfold across diverse scenarios.

Big Data · ClickHouse · Lakehouse
10 min read
StarRocks
Jun 6, 2024 · Big Data

Why StarRocks Beats Trino: A Deep Technical Comparison

This article provides a detailed technical comparison between StarRocks and Trino, covering their shared MPP architecture, cost‑based optimizer, pipeline execution, ANSI SQL support, differences in vectorized execution, materialized view capabilities, caching systems, data source connectors, benchmark results, high‑availability designs, join algorithms, and real‑world user case studies.

Big Data · Cache · MPP
20 min read
Alibaba Cloud Big Data AI Platform
Jun 6, 2024 · Databases

How StarRocks Redefines Lakehouse Architecture with Ultra-Fast Unified Analytics

StarRocks combines extreme query speed and a unified architecture to deliver a lakehouse solution that separates storage and compute, supports multi‑warehouse resource isolation, offers Trino compatibility, materialized‑view acceleration, and cost‑effective scaling, making it suitable for real‑time analytics, data‑lake queries, and traditional OLAP workloads.

Big Data · Lakehouse · Real-time Analytics
23 min read
Sohu Tech Products
Jun 5, 2024 · Big Data

Why Kafka Is the Backbone of Modern Data Pipelines: Core Architecture and Use Cases

This article explains Kafka's role as a high‑throughput distributed message queue, detailing its core components, topic‑partition model, consumer groups, storage mechanisms, fault‑tolerance features, delivery guarantees, ZooKeeper coordination, and scalability strategies for building reliable real‑time data pipelines.

Big Data · Distributed Systems · Kafka
14 min read
DataFunTalk
Jun 4, 2024 · Databases

From Lambda Architecture to an All‑in‑One Apache Doris Real‑Time/Offline Data Platform for 5G Connected Factories

The article explains how China Unicom transformed its 5G fully‑connected factory data pipeline from a complex Lambda architecture into a streamlined, real‑time and offline‑integrated solution built on Apache Doris, detailing system requirements, architectural redesign, performance gains, and future plans.

5G · Apache Doris · Big Data
15 min read
Big Data Technology & Architecture
Jun 4, 2024 · Big Data

Ant Group's Data Governance Practices: Quality, Storage, and Future Directions

This article presents Ant Group's comprehensive data governance experience, covering data quality management, storage governance, architectural design, operational strategies, case studies, and forward‑looking thoughts on integrated lake‑warehouse governance, data value realization, and AI‑driven automation.

Ant Group · Big Data · Data Quality
19 min read
Data Thinking Notes
Jun 2, 2024 · Big Data

How JD Retail’s Data Platform Boosts Efficiency with Unified Modeling and AI‑Driven Insights

This article details JD Retail’s end‑to‑end data platform, covering data asset certification, 5W2H modeling, unified query DSL, intelligent acceleration, robust governance, visualization components, low‑code orchestration, and large‑model AI applications that together reduce query latency, cut development costs, and empower analysts across the retail business.

AI · Big Data · Data Platform
39 min read
Su San Talks Tech
Jun 2, 2024 · Big Data

Mastering Kafka: Core Architecture, Use Cases, and Design Principles

This article provides a comprehensive overview of Apache Kafka, covering its role as a message queue, core components, topic and partition design, consumer groups, storage mechanisms, high‑availability features, delivery guarantees, ZooKeeper coordination, and scalability strategies for building robust real‑time data pipelines.

Big Data · Kafka · Streaming
15 min read
Data Thinking Notes
May 30, 2024 · Databases

Why Your Data Team Is Drowning in Requests—and How OLAP Can Save You

This article examines why data departments get overwhelmed by massive data‑retrieval requests, identifies root causes such as mindset, requirement handling, and lack of tools, and presents a technical solution centered on dimensional modeling and OLAP multi‑dimensional reporting to streamline data access and empower teams.

Big Data · Data Engineering · Data Warehouse
12 min read
DataFunTalk
May 28, 2024 · Big Data

Building and Managing a Metric System in Data Warehouse: Practices from Dongchedi

This article details how the Dongchedi business team designs, implements, and monitors a comprehensive metric system within its data warehouse, covering metric standards, model construction, metadata management, quality monitoring, application scenarios, and future directions using the DataLeap platform.

Big Data · Data Warehouse · data governance
18 min read
DataFunTalk
May 27, 2024 · Big Data

JD Retail’s Unified HDFS Storage: Cross‑Region and Hierarchical Storage Practices

This article details JD Retail’s large‑scale HDFS deployment, describing how cross‑region storage challenges were solved with a full‑copy topology, asynchronous block replication, flow‑control mechanisms, and a tiered storage strategy that automatically moves hot, warm, and cold data among SSD, HDD, and high‑density HDD nodes to improve performance and cut costs.

Big Data · Data Management · HDFS
20 min read
Big Data Technology & Architecture
May 27, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform – Architecture, Features, and Impact

The Athena Data Factory, built by Spark Thinking, is a comprehensive one‑stop data development and governance platform that integrates data integration, development, analysis, and services, offering offline, real‑time, and AI pipelines, modular architecture, extensive monitoring, and cost‑optimisation to empower thousands of users across the company.

Airflow · Big Data · Cloud Computing
26 min read
DataFunSummit
May 24, 2024 · Big Data

Ctrip's Experience with Alluxio in Its Big Data Platform: Architecture, Transparent Access, Custom Authentication, CallerContext, and Dynamic Configuration

This article details how Ctrip, a leading travel company, leverages Alluxio as a distributed cache within its extensive big‑data infrastructure to improve data access speed, implement transparent storage access, support custom authentication and multi‑tenant features, enhance audit logging with CallerContext, and dynamically distribute client configurations via Kyuubi.

Alluxio · Big Data · CallerContext
14 min read
Alibaba Cloud Infrastructure
May 24, 2024 · Cloud Computing

Exploring Arm Neoverse: Business Innovation with Yitian Arm Architecture – Insights from the Feitian Technology Salon

The Feitian Technology Salon held on May 16 in Shanghai showcased Arm Neoverse's core advantages and demonstrated how Yitian 710‑based ECS instances deliver significant cost‑effective performance gains for big‑data and video workloads through cloud‑native optimizations and software acceleration techniques.

Big Data · Video Encoding
5 min read
DataFunTalk
May 23, 2024 · Big Data

Berserker Big Data Platform: Architecture, Development Practices, and Operational Enhancements

This article presents a comprehensive overview of the Berserker big‑data platform, detailing its overall design, its data‑development components, and key architectural challenges such as state management, release processes, two‑phase commit, RPC duplication, task routing, message handling, execution isolation, and dependency model redesign. It also outlines future work, including stateless execution nodes, Kubernetes integration, and unified stream‑batch processing.

Big Data · Data Platform · Distributed Scheduling
15 min read
DataFunTalk
May 19, 2024 · Big Data

Tencent's Multi-Engine Unified Metadata and Permission Management for Big Data

This article introduces Tencent's Big Data Processing Suite (TBDS), discusses challenges of data silos, and presents Gravitino's open‑source unified metadata service and permission model, detailing how it integrates Hadoop, MPP, and various catalog plugins to provide consistent access control across heterogeneous data platforms.

Big Data · Gravitino · Hadoop
12 min read
DataFunSummit
May 17, 2024 · Big Data

Comprehensive Hudi Real-Time Data Lake Ingestion Solutions

This article presents a complete guide to Hudi-based real-time data lake ingestion, covering overall data integration architecture, batch and streaming ingestion strategies, advanced table design, and practical recommendations for handling challenges such as deduplication, latency, partitioning, and performance optimization.

Batch processing · Big Data · Data Lake
12 min read
Data Thinking Notes
May 16, 2024 · Information Security

How a Data Security Governance Platform Secures the Full Data Lifecycle

This article explains how a data security governance platform protects data across its entire lifecycle—from warehouse construction and collection to application—by implementing fine‑grained permission controls, encryption, masking, authentication, and comprehensive auditing, while addressing scalability, high availability, and regulatory compliance challenges.

Authentication · Authorization · Big Data
13 min read
Didi Tech
May 14, 2024 · Databases

Didi Elasticsearch Overview: Architecture, Deployment, Performance, and Operations

Didi’s Elasticsearch platform, built on ES 7.6 and deployed on physical machines with containerized gateway and control layers, provides a multi‑tenant, high‑performance search service—featuring a user console, operational controls, ZGC‑based latency reductions, cost‑saving compression, custom security, real‑time cross‑datacenter replication, and a roadmap toward ES 8.13.

Big Data · Didi · Elasticsearch
17 min read
DataFunTalk
May 14, 2024 · Cloud Computing

Hybrid Cloud Architecture and AI Storage Evolution at Zhihu: From UnionStore to Alluxio

This article describes Zhihu's hybrid cloud architecture—including offline, online, and GPU data centers—its self‑built UnionStore cache, the performance and latency challenges faced during large‑scale AI model training, and the subsequent evaluation and migration to Alluxio community and enterprise editions to achieve higher throughput, stability, and lower operational overhead.

AI storage · Alluxio · Big Data
14 min read
DataFunTalk
May 13, 2024 · Big Data

Data Integration Maturity Model: From ETL to EtLT

The article examines the evolution of data integration architectures—from traditional ETL through ELT to the emerging EtLT model—highlighting their advantages, disadvantages, industry trends, maturity stages, and practical guidance for enterprises and professionals navigating modern big‑data pipelines.

Big Data · Data Integration · DataOps
31 min read
DaTaobao Tech
May 13, 2024 · Big Data

Interview Algorithms and System Design: Bloom Filter, TopK, Median, and Concurrency Implementations

The article presents a suite of interview‑style algorithm and system‑design solutions, including Bloom‑filter URL blacklists, hash‑partitioned word frequencies, missing‑number bit arrays, top‑K min‑heap, low‑memory median, short‑URL encoding, and Redis user counting, along with extensive Java implementations of sorting, singleton, LRU cache, custom thread pools, producer‑consumer models, and various FooBar synchronization techniques.

Big Data · Concurrency · Data Structures
35 min read
Big Data Technology & Architecture
May 13, 2024 · Big Data

Apache Paimon 0.8 Release: Deletion Vectors, File Index, Performance Boosts, and Flink/Spark Integration Enhancements

The article introduces Apache Paimon 0.8, highlighting new Deletion Vectors, a universal file index, memory and I/O optimizations, record‑level TTL, and integration improvements with Flink and Spark, while also discussing broader lake‑house performance trends and future directions.

Apache Paimon · Big Data · Deletion Vectors
8 min read
DataFunSummit
May 12, 2024 · Big Data

Practice of Lakehouse‑Integrated Data Platform Architecture in the Financial Innovation Sector

This article presents the evolution of data platform architectures, the specific challenges of financial‑sector information‑technology innovation, and the design, core components, deployment path, and real‑world case studies of the cloud‑native lakehouse solution DataCyber developed by Shuxin Network.

Big Data · Data Platform · Financial Innovation
21 min read
Mike Chen's Internet Architecture
May 11, 2024 · Big Data

Comprehensive Introduction to Apache Kafka: Architecture, Features, and Use Cases

This article provides a detailed overview of Apache Kafka, covering its core characteristics, distributed architecture, key components such as topics, partitions, brokers, producers, consumers, ZooKeeper, and common application scenarios like log collection, event‑driven architecture, real‑time analytics, and monitoring.

Big Data · Distributed Systems · Kafka
7 min read
Data Thinking Notes
May 9, 2024 · Big Data

How to Build an Effective Indicator System: From Concept to Productization

This article explores the complete lifecycle of an indicator system—from defining metrics and addressing common ambiguities, through designing concept consensus, semantic layers, mechanisms, and governance, to productizing platforms, optimizing development, and envisioning future AI‑driven enhancements.

Big Data · Data Platform · Indicator System
22 min read
Rare Earth Juejin Tech Community
May 9, 2024 · Artificial Intelligence

On‑Device AI and Federated Learning: Era Background, Theory, and Practical Applications

This article outlines the evolution from 1G to 6G communications, explains the third AI wave driven by big data, theory, and compute, introduces federated learning (horizontal, vertical, transfer), and details on‑device AI architectures, decision tree and neural network models, and real‑world use cases such as video preloading and autonomous driving.

Artificial Intelligence · Big Data · edge computing
13 min read
Baidu MEUX
May 8, 2024 · Big Data

Why KNIME Is a Powerful Open‑Source Solution for Big Data Analytics

In the data‑driven era, KNIME offers a free, visual, and highly scalable platform that streamlines massive data ingestion, preprocessing, analysis, automation, and visualization, enabling researchers to handle millions of records efficiently without extensive coding or costly software.

Big Data · Data Analysis · KNIME
9 min read
DataFunTalk
May 8, 2024 · Big Data

Risk Control and Data Application in the Bulk Commodity Industry: Challenges, Solutions, and Core Capabilities

The article presents Ant Group's exploration of applying its data‑driven risk control and credit assessment capabilities to the traditional bulk commodity sector, detailing industry background, data pain points, core technical solutions, and the construction of a secure, explainable data‑model platform for digital transformation.

AI · Big Data · Bulk Industry
13 min read
DataFunTalk
May 6, 2024 · Big Data

OPPO Next‑Generation Big Data & AI Integrated Architecture on Functional Cloud

This article presents OPPO’s next‑generation big‑data and AI integrated architecture on functional cloud, detailing a cloud‑native elastic compute framework, a unified data‑lake solution, real‑time feature platforms, machine‑learning data acceleration, and hybrid‑cloud deployments, highlighting performance gains and cost reductions.

Big Data · cloud-native · elastic computing
11 min read
DataFunSummit
May 5, 2024 · Big Data

Alluxio in Lakehouse Architecture: Benefits, Challenges, and Real‑World Use Cases

This article explains how Alluxio enables a unified lake‑warehouse architecture by decoupling compute and storage, outlines its core capabilities, evaluates the cost‑saving and performance benefits, discusses the technical challenges, and presents several practical deployment scenarios in finance and AI workloads.

Alluxio · Big Data · Data Orchestration
15 min read
DataFunTalk
May 4, 2024 · Big Data

JD Retail Data Visualization Platform: Product Practice and Insights

This article presents an in‑depth overview of JD.com’s retail data visualization platform, detailing its product matrix—including EasyBI, a low‑code platform, and JDV large‑screen tool—its architectural layers, key capabilities, business case studies, challenges faced, and future development directions.

Analytics, Big Data, Data visualization
0 likes · 14 min read
DataFunSummit
May 2, 2024 · Big Data

Building an Attribution System for NetEase Cloud Music Data Warehouse: Challenges and Solutions

This article presents the problems faced by NetEase Cloud Music's data warehouse attribution system and details a comprehensive solution that includes upgrading the event‑tracking framework, redesigning the attribution model, and launching a unified management platform to improve stability, accuracy, and scalability.

Analytics, Big Data, Data Warehouse
0 likes · 13 min read
Big Data Technology & Architecture
Apr 30, 2024 · Big Data

Apache Paimon Becomes a Top-Level Project: A Comprehensive Overview of Lakehouse Framework Capabilities and Future Trends

The article reviews Apache Paimon's graduation to an Apache Top-Level Project, outlines the essential capabilities of modern lakehouse frameworks—including streaming and batch I/O, multi‑engine integration, and advanced features—and discusses the problems they solve and the promising direction of the lakehouse ecosystem.

Apache Paimon, Batch processing, Big Data
0 likes · 5 min read
Alibaba Cloud Developer
Apr 30, 2024 · Big Data

Mastering ODPS SQL: Proven Tips to Slash Query Time and Tackle Data Skew

This article explores practical SQL optimization techniques for Alibaba's ODPS platform, covering fundamentals, common pitfalls like null handling and select *, advanced strategies such as multi‑insert, partition limiting, UDF placement, data‑skew mitigation, parameter tuning, and real‑world case studies that dramatically reduce query runtimes.

Big Data, Data Skew, Hive
0 likes · 23 min read
Mike Chen's Internet Architecture
Apr 27, 2024 · Cloud Computing

Understanding Cloud Computing: Types, Benefits, and Core Technologies

This article provides a comprehensive overview of cloud computing, explaining its definition, major service models (IaaS, PaaS, SaaS), key advantages and challenges, and the essential technologies such as virtualization, distributed systems, automation, security, storage, and big data that enable modern cloud solutions.

Big Data, Cloud Computing, IaaS
0 likes · 6 min read
Bilibili Tech
Apr 26, 2024 · Big Data

Fine-Grained Lock Optimization for HDFS NameNode to Improve Metadata Read/Write Performance

To overcome the NameNode write bottleneck caused by a single global read/write lock in Bilibili’s massive HDFS deployment, the team introduced hierarchical fine‑grained locking—splitting the lock into Namespace, BlockPool, and per‑INode levels—which yielded up to three‑fold write throughput gains, a 90 % drop in RPC queue time, and shifted performance limits from lock contention to log synchronization.

Big Data, HDFS, Metadata
0 likes · 15 min read
AntTech
Apr 26, 2024 · Databases

Data Processing Technologies in the AI Era: Trends and Integration of Vector and Relational Databases

The talk explores how the rapid growth of multimodal data and large language models is reshaping data processing, highlighting three key trends—online‑offline integration, vector‑relational database convergence, and the fusion of data processing with AI computation—while presenting practical solutions and future visions for unified data‑AI ecosystems.

AI, Big Data, Data Processing
0 likes · 12 min read
DataFunSummit
Apr 25, 2024 · Big Data

Paimon Project Overview: Recent Developments, Core Capabilities, and Future Roadmap

This article presents a comprehensive overview of the Apache‑incubated Paimon project, covering its evolution from Flink Table Store, the current features of primary‑key and log tables, management tools such as snapshots, tags and branches, performance optimizations for Flink and Spark, and a detailed roadmap of upcoming functionalities.

Big Data, Data Management, Flink
0 likes · 23 min read
DataFunTalk
Apr 25, 2024 · Big Data

Apache Hudi 1.0: Design Reconsiderations and Key New Features

This article provides a comprehensive overview of Apache Hudi 1.0, detailing its architectural redesign, five major development directions, and the most important new capabilities such as LSM‑tree timeline, function indexes, file‑group readers/writers, partial updates, and non‑blocking concurrency control, along with performance evaluations and resource links.

Apache Hudi, Big Data, Function Index
0 likes · 14 min read
Python Programming Learning Circle
Apr 24, 2024 · Big Data

Using the TransBigData Python Library for Mobile Signaling Data Processing, Analysis, and Visualization

This article introduces the TransBigData Python package, explains how to install it, read mobile signaling data with pandas, preprocess and grid the data, identify stay and move events, determine home and work locations, and visualize individual user activity using built‑in functions.

Big Data, Data visualization, Python
0 likes · 7 min read
Efficient Ops
Apr 23, 2024 · Big Data

How to Plan, Configure, and Launch a Hadoop 3.3.5 Cluster on Three Nodes

This guide walks through planning a three‑node Hadoop 3.3.5 cluster, explains default and custom configuration files, details core‑site, hdfs‑site, yarn‑site, and mapred‑site settings, shows how to distribute configs, start HDFS and YARN, and perform basic file‑system tests.

Big Data, Cluster Setup, HDFS
0 likes · 11 min read
DataFunSummit
Apr 23, 2024 · Big Data

Building a Data System with Apache Arrow: Design, Implementation, and Practical Tips

This article explains why new data systems are needed, introduces Apache Arrow’s columnar in‑memory format and its zero‑copy advantages, describes how to model data at read time, outlines the execution flow with Acero and SQL planning, and shares practical tips and extensions for building robust, dynamic‑schema data platforms.

Acero, Apache Arrow, Big Data
0 likes · 20 min read
DataFunTalk
Apr 23, 2024 · Big Data

Apache Paimon Graduates to Top‑Level Project – Milestones, Core Capabilities, and Community Highlights

Apache Paimon, originally launched as Flink Table Store, has graduated to an Apache Top‑Level Project after a year of incubation, showcasing real‑time lakehouse capabilities, extensive ecosystem integration, and strong adoption by major enterprises, marking a significant milestone for streaming and batch data processing.

Apache Paimon, Big Data, Lakehouse
0 likes · 9 min read
21CTO
Apr 22, 2024 · Big Data

Inside Uber’s Real‑Time Data Infrastructure: How They Scale Streaming at Massive Scale

This article explores Uber’s sophisticated real‑time data infrastructure, detailing how the company leverages open‑source technologies such as Apache Kafka, Flink, Pinot, and Presto, and describing the architectural components, scaling challenges, multi‑region resilience, data back‑filling, and operational practices that enable low‑latency analytics for millions of daily rides and deliveries.

Big Data, Flink, Kafka
0 likes · 25 min read
DataFunTalk
Apr 20, 2024 · Big Data

Tencent Video Metrics Middle Platform and Lakehouse Integration: Architecture, Governance, and Practices

This article details Tencent Video’s data business, describing the design and implementation of its metrics middle platform and lake‑warehouse integration, covering architecture, governance, consistency, timeliness, usability, cost optimization, and future plans, with insights into technology choices such as Iceberg, StarRocks, and MQL.

Big Data, Data Engineering, Lakehouse
0 likes · 18 min read
DataFunSummit
Apr 19, 2024 · Big Data

Design Insights of Bilibili's Big Data Development Governance Platform

This article outlines Bilibili's data‑driven approach, describing the five‑year development of its big‑data development governance platform, its user segmentation, product positioning, data‑map and governance product designs, operational methods, value evaluation, and future roadmap, highlighting significant efficiency gains and user impact.

Big Data, Bilibili, Data Platform
0 likes · 10 min read
DataFunTalk
Apr 19, 2024 · Artificial Intelligence

Technology Maturity Curve – Financial Risk Control Overview

This article provides a comprehensive overview of the evolution, current state, and future trends of financial risk control technologies, covering data, feature engineering, modeling, decision-making, product development, challenges, and the impact of large AI models on the industry.

Big Data, Technology Maturity, financial risk
0 likes · 29 min read
Python Programming Learning Circle
Apr 17, 2024 · Big Data

Comparative Analysis of Starbucks and Luckin Coffee Store Distribution in China Using Python Data Visualization

Using Python data visualization and geospatial analysis, this article compares the nationwide distribution of Starbucks and Luckin Coffee stores in China, revealing differences in regional concentration, proximity patterns, and statistics such as the average number of Luckin stores within 500 m of each Starbucks location.

Big Data, Python, Store Distribution
0 likes · 11 min read
DataFunTalk
Apr 16, 2024 · Big Data

Materialized Views in MaxCompute: Design, Implementation, and Best Practices

This article explains how MaxCompute leverages materialized views as a query accelerator, covering their history, advantages and drawbacks, creation and maintenance details, automatic query rewriting, intelligent recommendation, auto‑materialization, and future enhancements for large‑scale data warehousing.

Automatic Refresh, Big Data, Intelligent Recommendation
0 likes · 13 min read
Alibaba Cloud Big Data AI Platform
Apr 16, 2024 · Big Data

MaxCompute’s Integrated Offline & Near‑Real‑Time Architecture: Transaction Table 2.0 Explained

This article explains MaxCompute’s new integrated offline‑and‑near‑real‑time architecture, Transaction Table 2.0, detailing its unified storage and compute design, automatic data governance, schema evolution, upsert and time‑travel capabilities, and how it simplifies complex big‑data pipelines while delivering minute‑level latency and lower costs.

Big Data, MaxCompute, Transaction Table
0 likes · 27 min read
Architect
Apr 15, 2024 · Big Data

Understanding the Underlying Working Principles of ElasticSearch

This article explains ElasticSearch’s architecture and core mechanisms, including its reliance on Lucene segments, inverted indexes, stored fields, doc values, caching, shard routing, and scaling strategies, and answers common questions about wildcard matching, index compression, and memory usage.

Big Data, lucene, search engine
0 likes · 11 min read
DataFunTalk
Apr 14, 2024 · Big Data

Third‑Generation Metric Platform: Enabling a Light Data Warehouse with NoETL

This article explains how a third‑generation metric platform replaces traditional ETL‑heavy data‑warehouse pipelines with a semantic‑driven NoETL approach, reducing cost, improving quality and efficiency, and delivering automated, self‑service analytics for both IT and business users.

Big Data, Data Engineering, Data Warehouse
0 likes · 16 min read
DataFunTalk
Apr 12, 2024 · Big Data

Building and Managing an Indicator System in a Data Warehouse: Practices from the Dongchedi Business

This article explains how the Dongchedi team designed, implemented, and monitored a comprehensive indicator system within a petabyte‑scale data warehouse, covering standards, metadata management, model construction, quality monitoring, and diverse application scenarios to improve data reliability and business insight.

Big Data, Data Warehouse, Indicator Management
0 likes · 18 min read
ITPUB
Apr 11, 2024 · Big Data

Query 100K Items from 10M+ Records: CK, ES Scroll, HBase, RediSearch

When faced with a business requirement to filter up to 100 000 records from a pool of tens of millions and then sort and de‑duplicate them, this article explores four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, a combined Elasticsearch‑HBase approach, and RediSearch with RedisJSON—detailing their design, implementation, performance testing, and trade‑offs.

Big Data, ClickHouse, Elasticsearch
0 likes · 12 min read
DataFunSummit
Apr 11, 2024 · Big Data

Building Integrated Data Governance and R&D Operations with DataOps: Practices and Insights from China Unicom Digital Technology

This article shares how China Unicom Digital Technology leverages DataOps to build an integrated data governance, research and development, and operations capability, outlining challenges, methodological considerations, a seven-step governance framework, and a multi-center collaborative mechanism to achieve sustainable data-driven value.

Big Data, data operations
0 likes · 15 min read
Sohu Tech Products
Apr 10, 2024 · Big Data

Bloom Filter: Principles, False Positive Rate, and Implementations with Guava and Redis

Bloom filters are space‑efficient probabilistic structures that answer “definitely not” or “maybe” membership queries, with a controllable false‑positive rate derived from bit array size, element count, and hash functions, and can be implemented via Guava’s Java library, Redisson’s Redis wrapper, native Redis modules, or custom bitmap code, dramatically reducing memory usage and latency in large‑scale systems such as URL deduplication or user‑product checks.

Big Data, Guava, Redis
0 likes · 21 min read
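
The Bloom filter summary above says the false‑positive rate follows from the bit array size, the element count, and the number of hash functions. A minimal Python sketch of that relationship, using the standard textbook approximation rather than anything taken from the article itself (the function names `bloom_false_positive_rate` and `optimal_parameters` are hypothetical, chosen for illustration):

```python
import math
from typing import Tuple


def bloom_false_positive_rate(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Standard approximation of the false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def optimal_parameters(n_items: int, target_fpp: float) -> Tuple[int, int]:
    """Return (bit array size, hash count) for a target false-positive rate.

    Uses the usual closed forms m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    m_bits = math.ceil(-n_items * math.log(target_fpp) / (math.log(2) ** 2))
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes


if __name__ == "__main__":
    # Size a filter for one million elements at roughly a 1% false-positive rate.
    m, k = optimal_parameters(1_000_000, 0.01)
    rate = bloom_false_positive_rate(m, 1_000_000, k)
    print(f"bits={m}, hashes={k}, estimated rate={rate:.4f}")
```

For one million elements at a 1% target this works out to a little under 10 bits per element, which is where the memory savings over storing the elements themselves come from.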