Tagged articles

Jun 28, 2024 · Big Data

Accelerating Spark with ClickHouse: Native Optimization Techniques and Performance Evaluation

This article presents a comprehensive technical overview of using ClickHouse as a native backend to accelerate Spark SQL execution, covering Spark performance bottlenecks, ClickHouse's CPU‑level optimizations, the design and implementation of the Spark‑Native integration, and detailed TPC‑DS benchmark results demonstrating up to 3.5× speedup.

Big Data · ClickHouse · Performance Optimization

0 likes · 33 min read

Tencent Cloud Developer

Jun 28, 2024 · Big Data

Capacity-Constrained Influence Maximization: Algorithms and Applications

The paper introduces Capacity‑Constrained Influence Maximization (CIM), a framework that selects up to k neighbors per active user to maximize spread under node capacity limits, proposes MG‑Greedy and RR‑Greedy algorithms with ≥½ approximation, and demonstrates the near‑linear RR‑OPIM+ method’s superior accuracy and speed on large social networks and a Tencent game recommendation system.

Big Data · Capacity Constraint · KDD 2023

0 likes · 8 min read

DevOps

Jun 27, 2024 · Big Data

Agile Data Engineering: Code‑as‑Infrastructure, Reuse Strategies, and ETL‑Level Continuous Integration

This article explores agile data engineering, advocating code‑as‑infrastructure practices such as code‑everything, data and code reuse, and ETL‑level continuous integration, while discussing the trade‑offs between data‑centric and code‑centric reuse, cost considerations, and practical implementation tips for modern data projects.

Agile Development · Big Data · Code as Infrastructure

0 likes · 22 min read

Past Memory Big Data

Jun 27, 2024 · Big Data

Inside Presto 2.0: The Native C++ Query Engine Explained

This article provides a detailed technical overview of Presto 2.0, the native C++ query engine built on the Velox library, covering its motivation, vectorized architecture, memory management, performance benchmarks from Meta and IBM, and deployment practices for large‑scale data warehouses.

Big Data · C++ · Data Warehouse

0 likes · 15 min read

DataFunTalk

Jun 27, 2024 · Big Data

Data Warehouse Construction and Data Governance Practices at Wing Payment

This presentation by senior data warehouse engineer Huang Luo details Wing Payment’s end‑to‑end data warehouse build, covering background challenges, governance framework, platform architecture, layered modeling, naming standards, asset management, monitoring, and future plans, illustrating how systematic data governance drives cost reduction, efficiency, and security.

Analytics · Big Data · Data Security

0 likes · 14 min read

DataFunTalk

Jun 26, 2024 · Big Data

Evolution of the Big Data + AI Development Paradigm and Alibaba Cloud’s Integrated Architecture

This article examines how the big‑data AI development paradigm has shifted from model‑centric to data‑centric workflows, outlines the challenges of integrating data and AI teams, and details Alibaba Cloud’s end‑to‑end, serverless big‑data platform—including MaxCompute, Hologres, MaxFrame, Object Table, and vector search—designed to accelerate large‑scale AI applications.

AI Integration · Big Data · Data Platform

0 likes · 20 min read

Alibaba Cloud Big Data AI Platform

Jun 25, 2024 · Big Data

Build Real-Time Data Lake Analytics with Flink, Paimon, and EMR Serverless Spark

This guide demonstrates how to use Alibaba Cloud's EMR Serverless Spark and Flink Serverless services together with Apache Paimon to ingest streaming data, perform interactive queries, and schedule offline compaction jobs, creating a unified real‑time and batch data lake solution.

Big Data · Data Lake · EMR Serverless

0 likes · 6 min read

Baidu Geek Talk

Jun 24, 2024 · Big Data

Accelerating Spark with ClickHouse Native Techniques: Design, Implementation, and Performance Evaluation

The paper presents a Spark acceleration framework that replaces Java‑based task operators with a ClickHouse native library, converting plans via Protobuf and JNI, leveraging columnar storage, SIMD and JIT to achieve up to 3× speed‑up on TPC‑DS workloads while providing fallback mechanisms to ensure no performance loss.

Big Data · ClickHouse · Native Acceleration

0 likes · 31 min read

DataFunTalk

Jun 22, 2024 · Big Data

Migrating Spark Shuffle Service from ESS to RSS (Celeborn) at Zhihu: Design, Implementation, and Benefits

This article details Zhihu's migration of massive Spark and MapReduce shuffle workloads from the External Shuffle Service (ESS) to a push‑based Remote Shuffle Service (RSS) powered by Celeborn, covering background problems, evaluation of open‑source implementations, deployment architecture, encountered issues, solutions, performance gains, and future plans.

Big Data · Performance · RSS

0 likes · 19 min read

DataFunSummit

Jun 21, 2024 · Big Data

Building a Complete Data System with Apache Arrow: Architecture, Dynamic Schema Modeling, and Practical Tips

This article explains why new data systems are needed, introduces Apache Arrow and its columnar in‑memory format, describes dynamic read‑time modeling, outlines the system’s execution flow, storage and indexing strategies, and shares practical tips and extensions for building scalable big‑data solutions.

Acero · Apache Arrow · Big Data

0 likes · 20 min read

Past Memory Big Data

Jun 20, 2024 · Big Data

How Meituan Scaled Spark with Vectorized Execution Using Gluten + Velox

This article details Meituan's production‑grade adoption of Spark vectorized execution via the open‑source Gluten and Velox stack, explaining SIMD fundamentals, performance motivations, the end‑to‑end integration workflow, staged rollout, encountered challenges, and the resulting resource savings and speedups.

Big Data · Gluten · ORC

0 likes · 33 min read

Meituan Technology Team

Jun 20, 2024 · Big Data

Vectorized Execution in Apache Spark: Meituan’s Practice with Gluten and Velox

Meituan enhances Apache Spark by integrating the Gluten‑Velox vectorized execution engine, converting row‑wise operations to columnar SIMD processing, which yields over 40 % memory savings and up to 13 % faster runtimes across thousands of ETL jobs, while addressing stability, ORC support, shuffle redesign, and off‑heap memory optimization.

Apache Spark · Big Data · C++

0 likes · 30 min read

DataFunSummit

Jun 20, 2024 · Big Data

Data+AI Data Lake Technologies: Apache Iceberg, PyIceberg, and Vector Table Solutions

This article presents a comprehensive overview of modern Data+AI data lake challenges and solutions, covering the evolution of data lakes, an introduction to Apache Iceberg, practical use of PyIceberg for AI training and inference pipelines, and advanced vector table and indexing techniques for efficient similarity search.

AI training · Apache Iceberg · Big Data

0 likes · 22 min read

IT Architects Alliance

Jun 20, 2024 · Fundamentals

Understanding GeoHash: Principles, Encoding Process, and Application in Ride‑Hailing

This article introduces the GeoHash algorithm, explains how latitude and longitude are recursively bisected into binary strings, compressed with Base32, and demonstrates its use for efficiently locating nearby drivers in ride‑hailing services while discussing precision trade‑offs and edge cases.

Big Data · GeoHash · Ride Hailing

0 likes · 8 min read
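
The encoding process summarized above (alternately bisecting the longitude and latitude ranges, then packing the resulting bits five at a time into Base32) can be sketched in Python. This is a minimal illustration of the standard GeoHash scheme, not code from the article; the function name `geohash_encode` and its `precision` parameter are assumptions chosen for this sketch.

```python
# Standard GeoHash alphabet: digits plus lowercase letters without a, i, l, o.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a latitude/longitude pair as a GeoHash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # bits alternate, starting with longitude

    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid  # keep the upper half of the range
            else:
                bits.append(0)
                lon_hi = mid  # keep the lower half of the range
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon

    # Pack every group of 5 bits into one Base32 character.
    chars = []
    for i in range(0, len(bits), 5):
        value = 0
        for bit in bits[i:i + 5]:
            value = (value << 1) | bit
        chars.append(BASE32[value])
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, precision=5))
```

Because each extra character only refines the same cell, a shorter hash is always a prefix of a longer one for the same point, which is what makes prefix matching usable for nearby-driver lookups. Points that sit on opposite sides of a cell boundary can still have very different hashes, which is the edge case the summary mentions.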

AI Architecture Hub

Jun 20, 2024 · Big Data

How GeoHash Powers Efficient Large-Scale Location Queries Without Pagination

This article explains the GeoHash algorithm, shows how it converts latitude‑longitude pairs into compact binary strings, demonstrates the encoding process with a concrete example, and discusses how the resulting prefixes can be used to quickly locate nearby users in massive datasets while highlighting remaining edge‑case challenges.

Big Data · GeoHash · Location Query

0 likes · 7 min read

vivo Internet Technology

Jun 19, 2024 · Big Data

Understanding BitMap and Roaring BitMap: Principles, Containers, and Java API Usage

The article explains BitMap fundamentals and introduces Roaring BitMap’s compressed container architecture—Array, BitMap, and Run containers—detailing their conversion logic, Java implementation snippets, performance advantages over traditional BitSets, and practical API usage for high‑performance, memory‑efficient big‑data applications.

Big Data · Containers · Roaring Bitmap

0 likes · 18 min read
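
The container idea in the summary above can be illustrated with a toy model in Python. It is a sketch under stated assumptions, not the article's Java code and not the RoaringBitmap library API: the class `ToyRoaring` and its methods are hypothetical, Run containers are omitted, and only the array-to-bitmap conversion is shown. It relies on the well-known Roaring layout in which the high 16 bits of a 32-bit value select a container and the low 16 bits are stored inside it.

```python
from bisect import bisect_left

# A sorted array of 16-bit values costs 2 bytes per entry, so at 4096 entries
# it reaches 8 KiB, the fixed size of a bitmap covering all 65536 low values.
ARRAY_LIMIT = 4096
BITMAP_BYTES = 65536 // 8


class ToyRoaring:
    """Toy model: one container per distinct high-16-bit key (hypothetical)."""

    def __init__(self):
        # key (high 16 bits) -> sorted list (array container)
        #                       or bytearray (bitmap container)
        self.containers = {}

    def add(self, value):
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.setdefault(high, [])
        if isinstance(container, bytearray):
            container[low >> 3] |= 1 << (low & 7)
            return
        pos = bisect_left(container, low)
        if pos < len(container) and container[pos] == low:
            return  # already present
        container.insert(pos, low)
        if len(container) > ARRAY_LIMIT:
            # Convert the array container into a bitmap container.
            bitmap = bytearray(BITMAP_BYTES)
            for v in container:
                bitmap[v >> 3] |= 1 << (v & 7)
            self.containers[high] = bitmap

    def __contains__(self, value):
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.get(high)
        if container is None:
            return False
        if isinstance(container, bytearray):
            return bool(container[low >> 3] & (1 << (low & 7)))
        pos = bisect_left(container, low)
        return pos < len(container) and container[pos] == low

    def kind(self, high):
        """Report which container type currently holds the given key."""
        return "bitmap" if isinstance(self.containers[high], bytearray) else "array"
```

Sparse keys stay in small sorted arrays while dense keys pay a flat 8 KiB, which is where the memory savings over a plain bitset come from.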

DataFunSummit

Jun 19, 2024 · Big Data

Apache Hudi from Zero to One: Introduction to Hudi’s Storage Format (Part 1)

This article introduces Apache Hudi’s storage format, explaining the table layout, metadata and data file organization, the naming conventions of timeline actions, and the trade‑offs between Copy‑on‑Write and Merge‑on‑Read table types for transactional data lakes.

Apache Hudi · Big Data · Data Lake

0 likes · 8 min read

DataFunTalk

Jun 19, 2024 · Big Data

Evolution and Practices of E‑commerce Data Warehouse Governance

This article analyzes the current state, development stages, and comprehensive solutions of e‑commerce data‑warehouse governance, covering data quality, cost, security, and efficiency requirements, and presents a roadmap from early‑stage standardization to mature tool‑driven governance with future outlooks.

Big Data · Cost Management · Data Warehouse

0 likes · 13 min read

Architect

Jun 18, 2024 · Big Data

How GeoHash Powers Real‑Time Ride‑Hailing: From Theory to Practice

This article explains the GeoHash algorithm, demonstrates how binary subdivision of latitude and longitude yields compact base‑32 strings, and shows how these hashes can efficiently locate nearby ride‑hailing drivers while highlighting precision limitations and edge cases.

Big Data · GeoHash · Location Services

0 likes · 8 min read

Beijing SF i-TECH City Technology Team

Jun 18, 2024 · Big Data

Apache Kylin in Logistics: Optimizing OLAP for Big Data Analytics

This article discusses the implementation of Apache Kylin as an OLAP engine for logistics data, focusing on optimizing cube building and query performance to handle large-scale, high-dimensional data analytics.

Apache Kylin · Big Data · Cube Building

0 likes · 15 min read

Big Data Technology & Architecture

Jun 16, 2024 · Big Data

Real-time Big Data Analytics with Apache Paimon and the Streaming Lakehouse Architecture

This article summarizes Wang Feng's presentation on the next‑generation Lakehouse architecture, explaining how Apache Paimon provides a unified, real‑time data lake format that bridges batch and streaming workloads, enabling low‑latency analytics and AI integration for modern big‑data applications.

Apache Paimon · Big Data · Real-time Analytics

0 likes · 9 min read

DataFunSummit

Jun 14, 2024 · Big Data

JD Logistics One‑Stop Agile BI Solution: Architecture, Challenges, and Product Evolution

This article presents JD Logistics' one‑stop agile BI platform, detailing the complex data sources, rapid business demands, the UData solution architecture, performance and usability improvements, and future upgrade plans that together enable faster data integration, self‑service reporting, and enhanced decision‑making across the organization.

Agile Analytics · BI · Big Data

0 likes · 25 min read

Test Development Learning Exchange

Jun 12, 2024 · Big Data

Getting Started with PySpark: Install, Code, and Performance Tips

This guide introduces Apache Spark's Python API, showing how to install PySpark, launch an interactive shell, create a SparkSession, read and write data from various sources, perform transformations, and apply key performance‑tuning practices for efficient big‑data processing.

Apache Spark · Big Data · Data Processing

0 likes · 5 min read

DataFunTalk

Jun 12, 2024 · Big Data

Technical Maturity Curve of Indicator Systems: Framework, Requirements, and the Role of Large Models

This article explores the technical maturity curve of indicator systems, covering data collection, modeling, production, management, governance, and application, while analyzing the security, stability, and usability requirements and discussing how large language models can enhance certain clear and complicated scenarios.

AI Integration · Big Data · Maturity Model

0 likes · 10 min read

ZhongAn Tech Team

Jun 11, 2024 · Artificial Intelligence

AI and Big Data Developments in Tech News

This article covers recent AI developments, big data challenges, and industry insights including AI course expansions, regulatory discussions, and tech company updates.

AI · AI Developments · Big Data

0 likes · 9 min read

DataFunTalk

Jun 9, 2024 · Big Data

Optimizing ClickHouse Performance in WeChat: Observation Tools, Lakehouse Reading, Bitmap Acceleration, and AI Integration

This article details how the WeChat team leverages ClickHouse at massive scale, introduces a suite of performance observation tools, describes lakehouse reading and bitmap optimizations, and explains the integration of AI workloads, demonstrating overall query speedups of up to tenfold across diverse scenarios.

Big Data · ClickHouse · Lakehouse

0 likes · 10 min read

DataFunSummit

Jun 8, 2024 · Big Data

Case Study: Building a High‑Performance Advertising Platform with ClickHouse Enterprise

This article presents a detailed case study of how EasyPoint built a scalable, stable advertising platform using ClickHouse Enterprise, covering company background, data architecture with Kafka and Druid, ClickHouse advantages, serverless resource scaling, and extensive performance benchmarks.

Big Data · ClickHouse · Data Architecture

0 likes · 11 min read

Data Thinking Notes

Jun 6, 2024 · Big Data

How to Build a Robust Data Indicator System: From Design to Future AI Integration

This article explains how to construct a comprehensive data indicator system by outlining its background, design, standardization, metadata management, and future applications, while addressing business, technical, and product challenges and showcasing practical examples and visual workflows.

Big Data · Indicator System · data governance

0 likes · 9 min read

StarRocks

Jun 6, 2024 · Big Data

Why StarRocks Beats Trino: A Deep Technical Comparison

This article provides a detailed technical comparison between StarRocks and Trino, covering their shared MPP architecture, cost‑based optimizer, pipeline execution, ANSI SQL support, differences in vectorized execution, materialized view capabilities, caching systems, data source connectors, benchmark results, high‑availability designs, join algorithms, and real‑world user case studies.

Big Data · Cache · MPP

0 likes · 20 min read

Alibaba Cloud Big Data AI Platform

Jun 6, 2024 · Databases

How StarRocks Redefines Lakehouse Architecture with Ultra-Fast Unified Analytics

StarRocks combines extreme query speed and a unified architecture to deliver a lakehouse solution that separates storage and compute, supports multi‑warehouse resource isolation, offers Trino compatibility, materialized‑view acceleration, and cost‑effective scaling, making it suitable for real‑time analytics, data‑lake queries, and traditional OLAP workloads.

Big Data · Lakehouse · Real-time Analytics

0 likes · 23 min read

Sohu Tech Products

Jun 5, 2024 · Big Data

Why Kafka Is the Backbone of Modern Data Pipelines: Core Architecture and Use Cases

This article explains Kafka's role as a high‑throughput distributed message queue, detailing its core components, topic‑partition model, consumer groups, storage mechanisms, fault‑tolerance features, delivery guarantees, ZooKeeper coordination, and scalability strategies for building reliable real‑time data pipelines.

Big Data · Distributed Systems · Kafka

0 likes · 14 min read

DataFunSummit

Jun 5, 2024 · Big Data

Databricks Acquires Tabular to Unite Delta Lake and Apache Iceberg for an Open Lakehouse

Databricks announced the acquisition of Tabular, the company founded by the original creators of Apache Iceberg, aiming to integrate Delta Lake and Iceberg into a unified, open lakehouse architecture that enhances format compatibility, reduces data silos, and supports AI workloads.

Apache Iceberg · Big Data · Databricks

0 likes · 5 min read

DataFunTalk

Jun 4, 2024 · Databases

From Lambda Architecture to an All‑in‑One Apache Doris Real‑Time/Offline Data Platform for 5G Connected Factories

The article explains how China Unicom transformed its 5G fully‑connected factory data pipeline from a complex Lambda architecture into a streamlined, real‑time and offline‑integrated solution built on Apache Doris, detailing system requirements, architectural redesign, performance gains, and future plans.

5G · Apache Doris · Big Data

0 likes · 15 min read

Big Data Technology & Architecture

Jun 4, 2024 · Big Data

Ant Group's Data Governance Practices: Quality, Storage, and Future Directions

This article presents Ant Group's comprehensive data governance experience, covering data quality management, storage governance, architectural design, operational strategies, case studies, and forward‑looking thoughts on integrated lake‑warehouse governance, data value realization, and AI‑driven automation.

Ant Group · Big Data · Data Quality

0 likes · 19 min read

Data Thinking Notes

Jun 2, 2024 · Big Data

How JD Retail’s Data Platform Boosts Efficiency with Unified Modeling and AI‑Driven Insights

This article details JD Retail’s end‑to‑end data platform, covering data asset certification, 5W2H modeling, unified query DSL, intelligent acceleration, robust governance, visualization components, low‑code orchestration, and large‑model AI applications that together reduce query latency, cut development costs, and empower analysts across the retail business.

AI · Big Data · Data Platform

0 likes · 39 min read

DataFunTalk

Jun 2, 2024 · Big Data

Applying Data Lake (Hudi) at Kuaishou: Architecture Evolution, Use Cases, and Lessons Learned

This article shares Kuaishou's practical experience with data lake technology (Hudi), detailing the challenges of growing data warehouses, the migration from Hive to Hudi, the promotion strategy, real-world use cases such as CDC sync and batch‑stream integration, and key takeaways for future deployments.

Big Data · Data Warehouse · Hudi

0 likes · 12 min read

Su San Talks Tech

Jun 2, 2024 · Big Data

Mastering Kafka: Core Architecture, Use Cases, and Design Principles

This article provides a comprehensive overview of Apache Kafka, covering its role as a message queue, core components, topic and partition design, consumer groups, storage mechanisms, high‑availability features, delivery guarantees, ZooKeeper coordination, and scalability strategies for building robust real‑time data pipelines.

Big Data · Kafka · Streaming

0 likes · 15 min read

Data Thinking Notes

May 30, 2024 · Databases

Why Your Data Team Is Drowning in Requests—and How OLAP Can Save You

This article examines why data departments get overwhelmed by massive data‑retrieval requests, identifies root causes such as mindset, requirement handling, and lack of tools, and presents a technical solution centered on dimensional modeling and OLAP multi‑dimensional reporting to streamline data access and empower teams.

Big Data · Data Engineering · Data Warehouse

0 likes · 12 min read

DataFunTalk

May 28, 2024 · Big Data

Building and Managing a Metric System in Data Warehouse: Practices from Dongchedi

This article details how the Dongchedi business team designs, implements, and monitors a comprehensive metric system within its data warehouse, covering metric standards, model construction, metadata management, quality monitoring, application scenarios, and future directions using the DataLeap platform.

Big Data · Data Warehouse · data governance

0 likes · 18 min read

DataFunTalk

May 27, 2024 · Big Data

JD Retail’s Unified HDFS Storage: Cross‑Region and Hierarchical Storage Practices

This article details JD Retail’s large‑scale HDFS deployment, describing how cross‑region storage challenges were solved with a full‑copy topology, asynchronous block replication, flow‑control mechanisms, and a tiered storage strategy that automatically moves hot, warm, and cold data among SSD, HDD, and high‑density HDD nodes to improve performance and cut costs.

Big Data · Data Management · HDFS

0 likes · 20 min read

Big Data Technology & Architecture

May 27, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform – Architecture, Features, and Impact

The Athena Data Factory, built by Spark Thinking, is a comprehensive one‑stop data development and governance platform that integrates data integration, development, analysis, and services, offering offline, real‑time, and AI pipelines, modular architecture, extensive monitoring, and cost‑optimisation to empower thousands of users across the company.

Airflow · Big Data · Cloud Computing

0 likes · 26 min read

DataFunSummit

May 24, 2024 · Big Data

Ctrip's Experience with Alluxio in Its Big Data Platform: Architecture, Transparent Access, Custom Authentication, CallerContext, and Dynamic Configuration

This article details how Ctrip, a leading travel company, leverages Alluxio as a distributed cache within its extensive big‑data infrastructure to improve data access speed, implement transparent storage access, support custom authentication and multi‑tenant features, enhance audit logging with CallerContext, and dynamically distribute client configurations via Kyuubi.

Alluxio · Big Data · CallerContext

0 likes · 14 min read

Alibaba Cloud Infrastructure

May 24, 2024 · Cloud Computing

Exploring Arm Neoverse: Business Innovation with Yitian Arm Architecture – Insights from the Feitian Technology Salon

The Feitian Technology Salon held on May 16 in Shanghai showcased Arm Neoverse's core advantages and demonstrated how Yitian 710‑based ECS instances deliver significant cost‑effective performance gains for big‑data and video workloads through cloud‑native optimizations and software acceleration techniques.

Big Data · Video Encoding

0 likes · 5 min read

DevOps Operations Practice

May 23, 2024 · Big Data

Understanding Elasticsearch: Architecture, Core Concepts, and How It Works

This article introduces Elasticsearch, an open‑source distributed search and analytics engine, explaining its architecture, core concepts such as clusters, nodes, shards, replicas, indices, inverted indexes, documents and fields, and how these components enable fast, scalable searching and data analysis.

Big Data · Distributed Systems · Elasticsearch

0 likes · 7 min read

Data Thinking Notes

May 23, 2024 · Big Data

How to Ensure Data Quality During System Rebuild with Automated Data Comparison

This article explains common data‑quality challenges when rebuilding business systems, compares manual SQL‑based validation with a dedicated data‑comparison product, and walks through practical steps for configuring, executing, and reviewing automated data‑matching tasks in a big‑data environment.

Big Data · Data Migration · Data Quality

0 likes · 9 min read

360 Smart Cloud

May 23, 2024 · Big Data

Archer Engine: Integrating Inverted Index with Iceberg for Scalable Big Data Log Analytics

The article introduces Archer, a new big‑data warehouse engine built on Iceberg that adds an inverted‑index mechanism using Tantivy to provide full‑text and JSON search, storage‑compute separation, and significant performance gains over traditional Elasticsearch and Iceberg connectors.

Archer Engine · Big Data · Parquet

0 likes · 9 min read

DataFunTalk

May 23, 2024 · Big Data

Berserker Big Data Platform: Architecture, Development Practices, and Operational Enhancements

This article presents a comprehensive overview of the Berserker big‑data platform, detailing its overall design, data‑development components, key architectural challenges such as state management, release processes, two‑phase commit, RPC duplication, task routing, message handling, execution isolation, dependency model redesign, and outlines future work including stateless execution nodes, Kubernetes integration, and unified stream‑batch processing.

Big Data · Data Platform · Distributed Scheduling

0 likes · 15 min read

Rare Earth Juejin Tech Community

May 20, 2024 · Big Data

Why Use Message Queues and an Introduction to Kafka with Practical Examples

This article explains the motivations for adopting message queues, outlines core concepts and protocols, compares mainstream MQ products, and provides a detailed walkthrough of Kafka architecture, cluster setup, native Java APIs, and Spring Boot integration with extensive code examples.

Big Data · Distributed Systems · Kafka

0 likes · 23 min read

DataFunTalk

May 19, 2024 · Big Data

Tencent's Multi-Engine Unified Metadata and Permission Management for Big Data

This article introduces Tencent's Big Data Processing Suite (TBDS), discusses challenges of data silos, and presents Gravitino's open‑source unified metadata service and permission model, detailing how it integrates Hadoop, MPP, and various catalog plugins to provide consistent access control across heterogeneous data platforms.

Big Data · Gravitino · Hadoop

0 likes · 12 min read

DataFunSummit

May 18, 2024 · Big Data

Building a User Profile Platform with ClickHouse at 58.com: Architecture and Optimization

This article describes how 58.com designed and implemented a large‑scale user profiling platform using ClickHouse, covering system overview, core modules, major challenges of scale, complexity and performance, and the detailed storage, query, and optimization techniques applied to meet business needs.

Big Data · ClickHouse · Data Architecture

0 likes · 11 min read

21CTO

May 17, 2024 · Big Data

Why Polars Beats Pandas and PySpark in Single‑Node Benchmarks – A Deep Dive

This article compares Pandas, Polars, and PySpark across five dataset sizes, showing how Polars' eager and lazy modes dramatically outperform the other tools, and discusses when each framework is the most suitable choice for data processing workloads.

Benchmark · Big Data · Data Processing

0 likes · 9 min read

DataFunSummit

May 17, 2024 · Big Data

Comprehensive Hudi Real-Time Data Lake Ingestion Solutions

This article presents a complete guide to Hudi-based real-time data lake ingestion, covering overall data integration architecture, batch and streaming ingestion strategies, advanced table design, and practical recommendations for handling challenges such as deduplication, latency, partitioning, and performance optimization.

Batch processing · Big Data · Data Lake

0 likes · 12 min read

Data Thinking Notes

May 16, 2024 · Information Security

How a Data Security Governance Platform Secures the Full Data Lifecycle

This article explains how a data security governance platform protects data across its entire lifecycle—from warehouse construction and collection to application—by implementing fine‑grained permission controls, encryption, masking, authentication, and comprehensive auditing, while addressing scalability, high availability, and regulatory compliance challenges.

Authentication · Authorization · Big Data

0 likes · 13 min read

DataFunSummit

May 15, 2024 · Big Data

Xiaomi Sales Data Warehouse: Architecture, Construction Theory, and Capability Evolution

This article details Xiaomi's sales data warehouse development, covering its history, architecture, dimensional modeling, layer design, streaming‑batch integration, governance, security, and future directions, while also addressing practical Q&A on implementation challenges and best practices.

Big Data · Data Warehouse · Flink

0 likes · 15 min read

Didi Tech

May 14, 2024 · Databases

Didi Elasticsearch Overview: Architecture, Deployment, Performance, and Operations

Didi’s Elasticsearch platform, built on ES 7.6 and deployed on physical machines with containerized gateway and control layers, provides a multi‑tenant, high‑performance search service—featuring a user console, operational controls, ZGC‑based latency reductions, cost‑saving compression, custom security, real‑time cross‑datacenter replication, and a roadmap toward ES 8.13.

Big Data · Didi · Elasticsearch

0 likes · 17 min read

DataFunTalk

May 14, 2024 · Cloud Computing

Hybrid Cloud Architecture and AI Storage Evolution at Zhihu: From UnionStore to Alluxio

This article describes Zhihu's hybrid cloud architecture—including offline, online, and GPU data centers—its self‑built UnionStore cache, the performance and latency challenges faced during large‑scale AI model training, and the subsequent evaluation and migration to Alluxio community and enterprise editions to achieve higher throughput, stability, and lower operational overhead.

AI storage · Alluxio · Big Data

0 likes · 14 min read

DataFunTalk

May 13, 2024 · Big Data

Data Integration Maturity Model: From ETL to EtLT

The article examines the evolution of data integration architectures—from traditional ETL through ELT to the emerging EtLT model—highlighting their advantages, disadvantages, industry trends, maturity stages, and practical guidance for enterprises and professionals navigating modern big‑data pipelines.

Big Data · Data Integration · DataOps

0 likes · 31 min read

DaTaobao Tech

May 13, 2024 · Big Data

Interview Algorithms and System Design: Bloom Filter, TopK, Median, and Concurrency Implementations

The article presents a suite of interview‑style algorithm and system‑design solutions—including Bloom‑filter URL blacklists, hash‑partitioned word frequencies, missing‑number bit arrays, top‑K min‑heap, low‑memory median, short‑URL encoding, Redis user counting, and extensive Java implementations of sorting, singleton, LRU cache, custom thread pools, producer‑consumer models and various FooBar synchronization techniques.

Big Data · Concurrency · Data Structures

0 likes · 35 min read

Big Data Technology & Architecture

May 13, 2024 · Big Data

Apache Paimon 0.8 Release: Deletion Vectors, File Index, Performance Boosts, and Flink/Spark Integration Enhancements

The article introduces Apache Paimon 0.8, highlighting new Deletion Vectors, a universal file index, memory and I/O optimizations, record‑level TTL, and integration improvements with Flink and Spark, while also discussing broader lake‑house performance trends and future directions.

Apache Paimon · Big Data · Deletion Vectors

0 likes · 8 min read

DataFunSummit

May 12, 2024 · Big Data

Practice of Lakehouse‑Integrated Data Platform Architecture in the Financial Innovation Sector

This article presents the evolution of data platform architectures, the specific challenges of financial‑sector information‑technology innovation, and the design, core components, deployment path, and real‑world case studies of the cloud‑native lakehouse solution DataCyber developed by Shuxin Network.

Big Data · Data Platform · Financial Innovation

0 likes · 21 min read

Mike Chen's Internet Architecture

May 11, 2024 · Big Data

Comprehensive Introduction to Apache Kafka: Architecture, Features, and Use Cases

This article provides a detailed overview of Apache Kafka, covering its core characteristics, distributed architecture, key components such as topics, partitions, brokers, producers, consumers, ZooKeeper, and common application scenarios like log collection, event‑driven architecture, real‑time analytics, and monitoring.

Big Data · Distributed Systems · Kafka

0 likes · 7 min read

Data Thinking Notes

May 9, 2024 · Big Data

How to Build an Effective Indicator System: From Concept to Productization

This article explores the complete lifecycle of an indicator system—from defining metrics and addressing common ambiguities, through designing concept consensus, semantic layers, mechanisms, and governance, to productizing platforms, optimizing development, and envisioning future AI‑driven enhancements.

Big Data · Data Platform · Indicator System

0 likes · 22 min read

Rare Earth Juejin Tech Community

May 9, 2024 · Artificial Intelligence

On‑Device AI and Federated Learning: Era Background, Theory, and Practical Applications

This article outlines the evolution from 1G to 6G communications, explains the third AI wave driven by big data, theory, and compute, introduces federated learning (horizontal, vertical, transfer), and details on‑device AI architectures, decision tree and neural network models, and real‑world use cases such as video preloading and autonomous driving.

Artificial Intelligence · Big Data · edge computing

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

May 9, 2024 · Big Data

How RoaringBitmap Supercharged Lazada’s Selection Platform and Cut Processing Time by 99%

This article explains how Lazada’s internal selection platform leveraged Hologres and the RoaringBitmap compression algorithm to dramatically reduce storage costs, accelerate set operations, and break the 200,000‑item pool limit, achieving up to a 99% speed improvement in scheduling.

Big Data · Bitmap Compression · Data Warehouse

0 likes · 16 min read

Baidu MEUX

May 8, 2024 · Big Data

Why KNIME Is a Powerful Open‑Source Solution for Big Data Analytics

In the data‑driven era, KNIME offers a free, visual, and highly scalable platform that streamlines massive data ingestion, preprocessing, analysis, automation, and visualization, enabling researchers to handle millions of records efficiently without extensive coding or costly software.

Big Data · Data Analysis · KNIME

0 likes · 9 min read

DataFunTalk

May 8, 2024 · Big Data

Risk Control and Data Application in the Bulk Commodity Industry: Challenges, Solutions, and Core Capabilities

The article presents Ant Group's exploration of applying its data‑driven risk control and credit assessment capabilities to the traditional bulk commodity sector, detailing industry background, data pain points, core technical solutions, and the construction of a secure, explainable data‑model platform for digital transformation.

AI · Big Data · Bulk Industry

0 likes · 13 min read

DataFunTalk

May 6, 2024 · Big Data

OPPO Next‑Generation Big Data & AI Integrated Architecture on Functional Cloud

This article presents OPPO’s next‑generation big‑data and AI integrated architecture on functional cloud, detailing a cloud‑native elastic compute framework, a unified data‑lake solution, real‑time feature platforms, machine‑learning data acceleration, and hybrid‑cloud deployments, highlighting performance gains and cost reductions.

Big Data · cloud-native · elastic computing

0 likes · 11 min read

DataFunSummit

May 5, 2024 · Big Data

Alluxio in Lakehouse Architecture: Benefits, Challenges, and Real‑World Use Cases

This article explains how Alluxio enables a unified lake‑warehouse architecture by decoupling compute and storage, outlines its core capabilities, evaluates the cost‑saving and performance benefits, discusses the technical challenges, and presents several practical deployment scenarios in finance and AI workloads.

Alluxio · Big Data · Data Orchestration

0 likes · 15 min read

DataFunTalk

May 4, 2024 · Big Data

JD Retail Data Visualization Platform: Product Practice and Insights

This article presents an in‑depth overview of JD.com’s retail data visualization platform, detailing its product matrix—including EasyBI, a low‑code platform, and JDV large‑screen tool—its architectural layers, key capabilities, business case studies, challenges faced, and future development directions.

Analytics · Big Data · Data visualization

0 likes · 14 min read

DataFunSummit

May 2, 2024 · Big Data

Building an Attribution System for NetEase Cloud Music Data Warehouse: Challenges and Solutions

This article presents the problems faced by NetEase Cloud Music's data warehouse attribution system and details a comprehensive solution that includes upgrading the event‑tracking framework, redesigning the attribution model, and launching a unified management platform to improve stability, accuracy, and scalability.

Analytics · Big Data · Data Warehouse

0 likes · 13 min read

Big Data Technology & Architecture

Apr 30, 2024 · Big Data

Apache Paimon Becomes a Top-Level Project: A Comprehensive Overview of Lakehouse Framework Capabilities and Future Trends

The article reviews Apache Paimon's graduation to an Apache Top-Level Project, outlines the essential capabilities of modern lakehouse frameworks—including streaming and batch I/O, multi‑engine integration, and advanced features—and discusses the problems they solve and the promising direction of the lakehouse ecosystem.

Apache Paimon · Batch processing · Big Data

0 likes · 5 min read

Alibaba Cloud Developer

Apr 30, 2024 · Big Data

Mastering ODPS SQL: Proven Tips to Slash Query Time and Tackle Data Skew

This article explores practical SQL optimization techniques for Alibaba's ODPS platform, covering fundamentals, common pitfalls like null handling and select *, advanced strategies such as multi‑insert, partition limiting, UDF placement, data‑skew mitigation, parameter tuning, and real‑world case studies that dramatically reduce query runtimes.

Big Data · Data Skew · Hive

0 likes · 23 min read

DataFunTalk

Apr 28, 2024 · Big Data

Ant Group’s Data Governance Practices: Overview, Data Quality, and Data Storage Governance

This article shares Ant Group's extensive experience in big data governance, detailing the overall data governance framework, data quality management, data storage governance, and future considerations, illustrated with practical cases and strategies for ensuring compliance, reliability, and cost efficiency.

Ant Group · Big Data · Data Architecture

0 likes · 17 min read

DataFunSummit

Apr 27, 2024 · Big Data

Delta Lake 3.1: New Features, Metadata Optimization, and Universal Format Overview

This article introduces Delta Lake 3.1, detailing its release background, the addition of Deletion Vector to Update and Merge commands, metadata‑driven count/min/max optimizations, the Universal Format for cross‑engine compatibility, and a comparative evaluation with Iceberg and Hudi.

Big Data · Data Lake · Deletion Vector

0 likes · 8 min read

Mike Chen's Internet Architecture

Apr 27, 2024 · Cloud Computing

Understanding Cloud Computing: Types, Benefits, and Core Technologies

This article provides a comprehensive overview of cloud computing, explaining its definition, major service models (IaaS, PaaS, SaaS), key advantages and challenges, and the essential technologies such as virtualization, distributed systems, automation, security, storage, and big data that enable modern cloud solutions.

Big Data · Cloud Computing · IaaS

0 likes · 6 min read

Bilibili Tech

Apr 26, 2024 · Big Data

Fine-Grained Lock Optimization for HDFS NameNode to Improve Metadata Read/Write Performance

To overcome the NameNode write bottleneck caused by a single global read/write lock in Bilibili’s massive HDFS deployment, the team introduced hierarchical fine‑grained locking—splitting the lock into Namespace, BlockPool, and per‑INode levels—which yielded up to three‑fold write throughput gains, a 90 % drop in RPC queue time, and shifted performance limits from lock contention to log synchronization.

Big Data · HDFS · Metadata

0 likes · 15 min read

AntTech

Apr 26, 2024 · Databases

Data Processing Technologies in the AI Era: Trends and Integration of Vector and Relational Databases

The talk explores how the rapid growth of multimodal data and large language models is reshaping data processing, highlighting three key trends—online‑offline integration, vector‑relational database convergence, and the fusion of data processing with AI computation—while presenting practical solutions and future visions for unified data‑AI ecosystems.

AI · Big Data · Data Processing

0 likes · 12 min read

DataFunSummit

Apr 25, 2024 · Big Data

Paimon Project Overview: Recent Developments, Core Capabilities, and Future Roadmap

This article presents a comprehensive overview of the Apache‑incubated Paimon project, covering its evolution from Flink Table Store, the current features of primary‑key and log tables, management tools such as snapshots, tags and branches, performance optimizations for Flink and Spark, and a detailed roadmap of upcoming functionalities.

Big Data · Data Management · Flink

0 likes · 23 min read

DataFunTalk

Apr 25, 2024 · Big Data

Apache Hudi 1.0: Design Reconsiderations and Key New Features

This article provides a comprehensive overview of Apache Hudi 1.0, detailing its architectural redesign, five major development directions, and the most important new capabilities such as LSM‑tree timeline, function indexes, file‑group readers/writers, partial updates, and non‑blocking concurrency control, along with performance evaluations and resource links.

Apache Hudi · Big Data · Function Index

0 likes · 14 min read

Sohu Tech Products

Apr 24, 2024 · Big Data

How to Build a ClickHouse‑Powered Retention Analysis Model for User Behavior

This article explains the concepts, formulas, and step‑by‑step implementation of a user‑retention analysis model, covering both Hive‑based offline processing and ClickHouse‑accelerated real‑time queries, complete with SQL examples, architecture diagrams, and practical optimization tips.

Big Data · ClickHouse · Data visualization

0 likes · 19 min read

Python Programming Learning Circle

Apr 24, 2024 · Big Data

Using the TransBigData Python Library for Mobile Signaling Data Processing, Analysis, and Visualization

This article introduces the TransBigData Python package, explains how to install it, read mobile signaling data with pandas, preprocess and grid the data, identify stay and move events, determine home and work locations, and visualize individual user activity using built‑in functions.

Big Data · Data visualization · Python

0 likes · 7 min read

Efficient Ops

Apr 23, 2024 · Big Data

How to Plan, Configure, and Launch a Hadoop 3.3.5 Cluster on Three Nodes

This guide walks through planning a three‑node Hadoop 3.3.5 cluster, explains default and custom configuration files, details core‑site, hdfs‑site, yarn‑site, and mapred‑site settings, shows how to distribute configs, start HDFS and YARN, and perform basic file‑system tests.

Big Data · Cluster Setup · HDFS

0 likes · 11 min read

DataFunSummit

Apr 23, 2024 · Big Data

Building a Data System with Apache Arrow: Design, Implementation, and Practical Tips

This article explains why new data systems are needed, introduces Apache Arrow’s columnar in‑memory format and its zero‑copy advantages, describes how to model data at read time, outlines the execution flow with Acero and SQL planning, and shares practical tips and extensions for building robust, dynamic‑schema data platforms.

Acero · Apache Arrow · Big Data

0 likes · 20 min read

DataFunTalk

Apr 23, 2024 · Big Data

Apache Paimon Graduates to Top‑Level Project – Milestones, Core Capabilities, and Community Highlights

Apache Paimon, originally launched as Flink Table Store, has graduated to an Apache Top‑Level Project after a year of incubation, showcasing real‑time lakehouse capabilities, extensive ecosystem integration, and strong adoption by major enterprises, marking a significant milestone for streaming and batch data processing.

Apache Paimon · Big Data · Lakehouse

0 likes · 9 min read

DataFunSummit

Apr 22, 2024 · Big Data

Intelligent Optimization of Bilibili’s Iceberg‑Based Lakehouse for Query Acceleration

This article describes Bilibili’s intelligent optimization project that automatically analyzes historical query workloads to configure multi‑dimensional sorting, various indexes, and pre‑aggregation on Iceberg tables, thereby reducing scan volume by 28% across dozens of tables and improving OLAP query latency.

Big Data · Data Warehouse · Iceberg

0 likes · 15 min read

DataFunTalk

Apr 22, 2024 · Big Data

Construction and Application of a Metric System: Business, Technical, and Product Perspectives

This article explains how to build and apply a comprehensive metric system by addressing business, technical, and product challenges, outlining design, standardization, metadata management, and future AI‑driven use cases to support data‑driven decision making.

AI Integration · Big Data · data governance

0 likes · 9 min read

21CTO

Apr 22, 2024 · Big Data

Inside Uber’s Real‑Time Data Infrastructure: How They Scale Streaming at Massive Scale

This article explores Uber’s sophisticated real‑time data infrastructure, detailing how the company leverages open‑source technologies such as Apache Kafka, Flink, Pinot, and Presto, and describing the architectural components, scaling challenges, multi‑region resilience, data back‑filling, and operational practices that enable low‑latency analytics for millions of daily rides and deliveries.

Big Data · Flink · Kafka

0 likes · 25 min read

DataFunTalk

Apr 20, 2024 · Big Data

Tencent Video Metrics Middle Platform and Lakehouse Integration: Architecture, Governance, and Practices

This article details Tencent Video’s data business, describing the design and implementation of its metrics middle platform and lake‑warehouse integration, covering architecture, governance, consistency, timeliness, usability, cost optimization, and future plans, with insights into technology choices such as Iceberg, StarRocks, and MQL.

Big Data · Data Engineering · Lakehouse

0 likes · 18 min read

DataFunSummit

Apr 19, 2024 · Big Data

Design Insights of Bilibili's Big Data Development Governance Platform

This article outlines Bilibili's data‑driven approach, describing the five‑year development of its big‑data development governance platform, its user segmentation, product positioning, data‑map and governance product designs, operational methods, value evaluation, and future roadmap, highlighting significant efficiency gains and user impact.

Big Data · Bilibili · Data Platform

0 likes · 10 min read

DataFunTalk

Apr 19, 2024 · Artificial Intelligence

Technology Maturity Curve – Financial Risk Control Overview

This article provides a comprehensive overview of the evolution, current state, and future trends of financial risk control technologies, covering data, feature engineering, modeling, decision-making, product development, challenges, and the impact of large AI models on the industry.

Big Data · Technology Maturity · financial risk

0 likes · 29 min read

Python Programming Learning Circle

Apr 17, 2024 · Big Data

Comparative Analysis of Starbucks and Luckin Coffee Store Distribution in China Using Python Data Visualization

Using Python data visualization and geospatial analysis, this article compares the nationwide distribution of Starbucks and Luckin Coffee stores in China, revealing differences in regional concentration and proximity patterns, along with statistics such as the average number of Luckin stores within 500 m of each Starbucks location.

Big Data · Python · Store Distribution

0 likes · 11 min read

DataFunTalk

Apr 16, 2024 · Big Data

Materialized Views in MaxCompute: Design, Implementation, and Best Practices

This article explains how MaxCompute leverages materialized views as a query accelerator, covering their history, advantages and drawbacks, creation and maintenance details, automatic query rewriting, intelligent recommendation, auto‑materialization, and future enhancements for large‑scale data warehousing.

Automatic Refresh · Big Data · Intelligent Recommendation

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Apr 16, 2024 · Big Data

MaxCompute’s Integrated Offline & Near‑Real‑Time Architecture: Transaction Table 2.0 Explained

This article explains MaxCompute’s new integrated offline‑and‑near‑real‑time architecture, Transaction Table 2.0, detailing its unified storage and compute design, automatic data governance, schema evolution, upsert and time‑travel capabilities, and how it simplifies complex big‑data pipelines while delivering minute‑level latency and lower costs.

Big Data · MaxCompute · Transaction Table

0 likes · 27 min read

Data Thinking Notes

Apr 15, 2024 · Big Data

How This Company Built a Powerful Data Governance Platform: A Visual Case Study

This article presents a visual case study of a company's data governance and data middle‑platform implementation, outlining the project background, solution architecture, and the resulting business value and effects through a series of illustrative images.

Big Data · Data Platform · data governance

0 likes · 2 min read

Architect

Apr 15, 2024 · Big Data

Understanding the Underlying Working Principles of ElasticSearch

This article explains ElasticSearch’s architecture and core mechanisms, including its reliance on Lucene segments, inverted indexes, stored fields, doc values, caching, shard routing, and scaling strategies, while answering common questions about wildcard matching, index compression, and memory usage.

Big Data · lucene · search engine

0 likes · 11 min read

DataFunTalk

Apr 14, 2024 · Big Data

Third‑Generation Metric Platform: Enabling a Light Data Warehouse with NoETL

This article explains how a third‑generation metric platform replaces traditional ETL‑heavy data‑warehouse pipelines with a semantic‑driven NoETL approach, reducing cost, improving quality and efficiency, and delivering automated, self‑service analytics for both IT and business users.

Big Data · Data Engineering · Data Warehouse

0 likes · 16 min read

DataFunTalk

Apr 12, 2024 · Big Data

Building and Managing an Indicator System in a Data Warehouse: Practices from the Dongchedi Business

This article explains how the Dongchedi team designed, implemented, and monitored a comprehensive indicator system within a petabyte‑scale data warehouse, covering standards, metadata management, model construction, quality monitoring, and diverse application scenarios to improve data reliability and business insight.

Big Data · Data Warehouse · Indicator Management

0 likes · 18 min read

ITPUB

Apr 11, 2024 · Big Data

Query 100K Items from 10M+ Records: CK, ES Scroll, HBase, RediSearch

When faced with a business requirement to filter up to 100 000 records from a pool of tens of millions and then sort and de‑duplicate them, this article explores four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, a combined Elasticsearch‑HBase approach, and RediSearch with RedisJSON—detailing their design, implementation, performance testing, and trade‑offs.

Big Data · ClickHouse · Elasticsearch

0 likes · 12 min read

DataFunSummit

Apr 11, 2024 · Big Data

Building Integrated Data Governance and R&D Operations with DataOps: Practices and Insights from China Unicom Digital Technology

This article shares how China Unicom Digital Technology leverages DataOps to build an integrated data governance, research and development, and operations capability, outlining challenges, methodological considerations, a seven-step governance framework, and a multi-center collaborative mechanism to achieve sustainable data-driven value.

Big Data · data operations

0 likes · 15 min read

Sohu Tech Products

Apr 10, 2024 · Big Data

Bloom Filter: Principles, False Positive Rate, and Implementations with Guava and Redis

Bloom filters are space‑efficient probabilistic structures that answer membership queries with either “definitely not” or “maybe”. Their false‑positive rate is controllable and is determined by the bit array size, the element count, and the number of hash functions. They can be implemented via Guava’s Java library, Redisson’s Redis wrapper, native Redis modules, or custom bitmap code, and they dramatically reduce memory usage and latency in large‑scale systems such as URL deduplication or user‑product checks.

Big Data · Guava · Redis

0 likes · 21 min read
