Tagged articles

3697 articles

Nov 9, 2022 · Operations

How Ctrip Handles Billions of Logs Daily: Real‑Time Monitoring, Clog, CAT & TSDB

This article details Ctrip’s large‑scale log monitoring architecture, covering the overall design, the Clog log system, the CAT tracing platform, and the internal TSDB solution, and explains how billions of logs are processed in real time with low latency, high reliability, and efficient querying.

Big Data · Distributed Systems · Log Monitoring

0 likes · 12 min read

360 Smart Cloud

Nov 9, 2022 · Databases

StarRocks Adoption and Application Practices at 360: Performance Comparison and Use Cases

This article details why 360 selected StarRocks as its OLAP engine, compares its performance and resource usage against MySQL, Hive, Spark, Druid, ClickHouse and Doris, and describes the concrete deployment scenarios and data products built on StarRocks within the company.

Big Data · Data Warehouse · OLAP

0 likes · 12 min read

Zhengcaiyun Technology (政采云技术)

Nov 8, 2022 · Industry Insights

How Small Big‑Data Frontend Teams Can Thrive: A Survival Guide

This guide outlines the essential concepts of big data, the roles of a front‑end data team, practical workflow steps, platform architecture, industry benchmarks, and actionable strategies for small teams to improve efficiency, visualization capabilities, and digital operations.

Big Data · Data Platform · Data visualization

0 likes · 14 min read

Architecture & Thinking

Nov 8, 2022 · Databases

Mastering Redis HyperLogLog: Efficient Cardinality Estimation for Big Data

This article explains Redis HyperLogLog, its underlying principles, memory efficiency, typical use cases like UV/PV counting, and provides practical command examples (PFADD, PFCOUNT, PFMERGE) to perform high‑performance cardinality estimation on massive datasets.

Big Data · Cardinality · HyperLogLog

0 likes · 9 min read

DataFunSummit

Nov 8, 2022 · Big Data

Building YiPay's Big Data BI Analysis Platform: Architecture, OLAP Engine Practices, and Future Plans

This article details YiPay's big data BI analysis platform construction, covering its financial data use cases, platform architecture, OLAP engine implementations with ClickHouse, Presto, and Kylin, as well as identified challenges and future development directions.

Analytics · BI platform · Big Data

0 likes · 11 min read

Zhengcaiyun Technology (政采云技术)

Nov 8, 2022 · Big Data

User Path Analysis in the Hunyi System: Design, Computation Logic, and StarRocks Implementation

This article explains user path analysis as a method to visualize and optimize user flow, describes its productization in the Hunyi analytics platform, details the underlying computation logic, presents a complex StarRocks SQL solution, discusses performance challenges, and suggests future improvements; it closes with a recruitment notice.

Big Data · Performance Optimization · StarRocks

0 likes · 21 min read

DataFunSummit

Nov 7, 2022 · Big Data

Huolala's Data Governance Practices: Data Quality, Metadata, and Cost Management Platforms

This article details Huolala's end‑to‑end data governance practice, covering the construction of a data governance framework, the implementation of a zero‑code data quality platform, a metadata management platform, and a cost‑governance system that together improve data reliability, reduce waste, and support scalable big‑data operations.

Big Data · Cost Management · data governance

0 likes · 14 min read

Tencent Cloud Developer

Nov 7, 2022 · Big Data

Data Engineering and Data Warehouse Design: Principles, Practices, and Governance

The article outlines data‑engineering and warehouse‑design principles, covering collection (the four Ws and methods such as SDK, point‑code, and binlog), reporting strategies, source selection, modeling with fact, aggregation, dimension, and model tables, quality checks, and governance practices such as standardized SDKs, metric libraries, automated lineage, and cost optimization, with the aim of sharing experience that any organization can apply.

Big Data · Data Warehouse · Data engineering

0 likes · 32 min read

DataFunSummit

Nov 6, 2022 · Artificial Intelligence

Guangfa Group’s Federated Learning Exploration, Platform Construction, and the Book “Federated Learning Principles and Applications”

This article outlines Guangfa Group’s initiatives in privacy computing and federated learning, detailing the development of its federated learning platform, its contributions to the open‑source FATE project and to industry standards, and application scenarios such as joint statistics, precise marketing, risk control, and cross‑domain verification; it also introduces the group’s newly published book on federated learning principles and applications.

Artificial Intelligence · Big Data · FATE

0 likes · 23 min read

Architects' Tech Alliance

Nov 5, 2022 · Databases

Data Replication: Fundamentals, Technologies, and Future Trends

This article explains the concept of data replication, its three-stage process, key principles of compliance, timeliness, and diversity, various replication methods, layered technologies across storage, operating system, and database levels, emerging cloud and big‑data solutions, and heterogeneous use‑case scenarios.

Big Data · Databases · data replication

0 likes · 15 min read

StarRocks

Nov 4, 2022 · Big Data

Building a High‑Performance, Cost‑Effective Cloud Lakehouse with StarRocks and EMR

This article explains how to design and implement a cloud‑native Lakehouse using StarRocks and Tencent Cloud EMR, covering core technical requirements, a five‑layer architecture, data ingestion with Iceberg/Hudi, performance tricks like Z‑order clustering, cost‑control through elastic scaling, and the key product features of EMR StarRocks.

Big Data · Cloud Computing · EMR

0 likes · 24 min read

dbaplus Community

Nov 3, 2022 · Big Data

Why Kafka Stores Data the Way It Does: A Deep Dive into Its Log Architecture

This article thoroughly examines Kafka's storage system, explaining why it uses sequential log writes combined with sparse indexing, how different log formats evolved, and the mechanisms for log retention and compaction that enable high‑throughput, fault‑tolerant streaming at massive scale.

Big Data · Distributed Systems · Kafka

0 likes · 22 min read

Alibaba Cloud Big Data AI Platform

Nov 3, 2022 · Big Data

How Alibaba Cloud’s ODPS Upgrade Redefines Big Data Processing and AI Integration

Alibaba Cloud announced that its ODPS platform has been upgraded into an integrated big‑data solution that supports massive batch jobs, real‑time analytics, and AI workloads, delivering record‑breaking performance and enabling use cases from smart city traffic optimization to accelerated autonomous‑driving model training.

AI · Big Data · Performance Benchmark

0 likes · 5 min read

Zhongtong Tech

Nov 3, 2022 · Databases

How ZTO’s Database Operations Platform Evolved from Manual to Intelligent Automation

The article recounts Chen Jianhua’s presentation at the GOPS Global Operations Conference, detailing ZTO’s three‑stage journey in building a database operations platform—from initial automation to self‑service and finally to fine‑grained, data‑driven intelligent management—while sharing lessons and future plans.

Big Data · Database operations · Platform Engineering

0 likes · 4 min read

DataFunSummit

Nov 2, 2022 · Big Data

Evolution and Construction of Huolala's Doris‑Based OLAP System

This article details Huolala's journey from a MySQL‑centric analytics pipeline to a multi‑engine OLAP platform built on Doris, covering system architecture, data flow, stage‑wise evolution, engine selection, POC validation, performance tuning, stability measures, and future roadmap for self‑service analytics.

Big Data · Doris · OLAP

0 likes · 15 min read

Data Thinking Notes

Nov 1, 2022 · Big Data

Mastering Spark Task Performance: A Deep Dive into JVM GC Optimization

This article explains how JVM memory management and various garbage collection algorithms affect Spark task performance, covering JVM fundamentals, GC concepts, common collectors, and practical tuning strategies to avoid full GC pauses and improve throughput.

Big Data · Garbage Collection · JVM

0 likes · 14 min read

DataFunSummit

Nov 1, 2022 · Big Data

Case Study of DCMM Standard Implementation at State Grid Tianjin Electric Power

This article details State Grid Tianjin Electric Power's early adoption and successful certification of the national DCMM data management maturity model, outlining background, certification milestones, systematic practices, and lessons learned that illustrate how data governance, architecture, and application strategies drive digital transformation.

Big Data · DCMM · Data Management

0 likes · 11 min read

Java Architect Essentials

Oct 31, 2022 · Big Data

How to Process 10 GB of Age Data on a 4 GB Machine Using Java

This article walks through generating a 10 GB file of age values, reading it line‑by‑line on a 4 GB RAM, 2‑core machine, measuring single‑thread performance, then redesigning the pipeline with a producer‑consumer model, blocking queues and multithreaded string splitting to dramatically boost CPU utilization and cut processing time while managing memory consumption.

Big Data · File Processing · Java

0 likes · 12 min read

Architects' Tech Alliance

Oct 31, 2022 · Industry Insights

What Drives Distributed Storage: Product Forms, Ecosystem, and Key Use Cases

Distributed storage encompasses integrated appliances and pure‑software solutions, each with distinct hardware strategies, and forms a multi‑dimensional industry ecosystem that spans commercial and open‑source software, specialized and generic hardware, serving critical scenarios such as virtualization/cloud, high‑performance computing, and big‑data analytics.

Big Data · Cloud Computing · High Performance Computing

0 likes · 15 min read

Data Thinking Notes

Oct 31, 2022 · Big Data

Mastering Spark’s Unified Memory Management: A Deep Dive into On‑Heap & Off‑Heap Tuning

This article explains Spark's unified memory manager, detailing on‑heap and off‑heap memory regions, dynamic memory sharing, task memory allocation, and practical tuning techniques to optimize performance and avoid common out‑of‑memory errors.

Big Data · Performance tuning · Unified Memory

0 likes · 13 min read

21CTO

Oct 30, 2022 · Fundamentals

Top 10 IoT Trends That Will Transform Industries

This article explores the rapid growth of the Internet of Things, outlines the key drivers behind its expansion, highlights major challenges such as chip shortages and bandwidth limits, and presents ten emerging trends—including AI integration, 5G, edge computing, and security—that will shape multiple sectors in the coming years.

5G · AI · Big Data

0 likes · 9 min read

DataFunSummit

Oct 30, 2022 · Big Data

Integrating Apache Spark with Cloud‑Native Technologies: Principles, Kubernetes Deployments, EMR on ACK, and Serverless Spark on DLF

This article examines the challenges of traditional Spark clusters and explains how integrating Spark with cloud‑native platforms—through Kubernetes deployment modes, EMR on ACK practices, Remote Shuffle Service, and serverless Spark on DLF—provides elastic scaling, lower operational costs, and advanced features such as executor rolling and custom scheduler support.

Big Data · DLF · Serverless

0 likes · 18 min read

Python Crawling & Data Mining

Oct 30, 2022 · Big Data

Why Ozone Is the Next‑Generation Distributed Object Store for Big Data

This article explains how Ozone, the Hadoop community’s new distributed object‑storage system, overcomes HDFS’s small‑file limitations with a hierarchical Volume‑Bucket‑Object model, detailing its architecture, components, data flow for creating and reading objects, and the benefits of its scalable, fault‑tolerant design.

Big Data · Hadoop · Object Storage

0 likes · 12 min read

DataFunSummit

Oct 29, 2022 · Big Data

Apache Iceberg in Tencent: Architecture, Spark Read/Write, Production Practices, and Data Governance

This article presents an in‑depth overview of Apache Iceberg as used at Tencent, covering its table format architecture, Spark read/write mechanisms, production challenges and optimizations such as schema evolution, file filtering, upsert strategies, and the surrounding data‑governance services.

Apache Iceberg · Big Data · Data Lake

0 likes · 19 min read

Past Memory Big Data

Oct 29, 2022 · Big Data

How to Adapt Hadoop for Domestic Big Data Requirements

The article analyzes Hadoop’s declining relevance, the dominance of CDH/HDP, security pressures from vulnerabilities, and outlines ten technical steps—including hardware adaptation, component selection, dependency resolution, compilation, Ambari integration, packaging, testing, and functional verification—required to create a domestic ARM‑based Hadoop distribution, which the authors have released as a free HDP 3.3.1 build.

ARM · Ambari · Big Data

0 likes · 15 min read

DevOps Cloud Academy

Oct 27, 2022 · Big Data

Understanding DataOps: Concepts, Standards, and Enterprise Practices

This article explains DataOps as a methodology for improving data analysis quality and efficiency, outlines its origins, standards, and maturity model, and presents practical insights and case studies from Chinese enterprises on how DataOps addresses common data engineering challenges and drives digital transformation.

Big Data · Data Management · DataOps

0 likes · 12 min read

Huolala Tech

Oct 27, 2022 · Big Data

Turning Big Data into Valuable Assets: The Business Case for Data Governance

Amid the explosive growth of big data, this article explains why systematic data governance—covering metadata, quality, lifecycle, and security—is essential for turning raw data into measurable business assets, reducing costs, and enhancing operational efficiency.

Big Data · Data Lifecycle · Data Quality

0 likes · 11 min read

Data Thinking Notes

Oct 27, 2022 · Big Data

Boost Spark Performance: Proven Code Optimizations & Tuning Tips

This article outlines practical Spark job optimization techniques—from code-level improvements and resource tuning to data skew handling, persistence strategies, shuffle reduction, broadcast variables, Kryo serialization, and efficient data structures—demonstrating how each can dramatically cut execution time.

Big Data · Kryo Serialization · Performance tuning

0 likes · 19 min read

Practical DevOps Architecture

Oct 27, 2022 · Big Data

Introduction to the ELK Stack and Kafka with Docker Compose

This article introduces Elasticsearch, Logstash, Kibana, and Kafka, explains their roles in data collection, analysis, and visualization, and provides a complete Docker‑Compose configuration to deploy these components together for scalable log processing and search solutions.

Big Data · Kafka · Kibana

0 likes · 4 min read

ITPUB

Oct 26, 2022 · Big Data

Why Kafka Stores Data the Way It Does: Inside Its Architecture

This article provides an in‑depth technical analysis of Kafka’s storage architecture, covering its design goals, storage mechanisms, log segment layout, sparse indexing, log cleanup policies, and the performance techniques such as sequential writes, page cache, and zero‑copy that enable high‑throughput streaming.

Big Data · Log Segments · Sparse Index

0 likes · 22 min read

DataFunTalk

Oct 26, 2022 · Big Data

Metadata Management and Governance Practices at Wing Payment: Architecture, Techniques, and Future Outlook

This article explains how metadata serves as the foundation of enterprise data governance, outlines common data governance challenges, describes Wing Payment's metadata governance framework and platform architecture, and presents future directions such as multi‑source management, cross‑cluster disaster recovery, and intelligent recommendation.

Big Data · Data Lineage · data governance

0 likes · 18 min read

DataFunSummit

Oct 25, 2022 · Databases

Design and Implementation of Meituan's Database Autonomy Service (DAS)

This article presents the background, challenges, architectural design, technical solutions, and future roadmap of Meituan's Database Autonomy Service (DAS), a platform that leverages big‑data collection, AI‑assisted root‑cause analysis, and automated operations to improve database performance, reliability, and self‑service capabilities.

AI · Big Data · Database Autonomy

0 likes · 18 min read

Kuaishou Big Data

Oct 25, 2022 · Big Data

How Kuaishou Built a Scalable Big Data Platform with Unified Data Quality and Metric Services

This article details Kuaishou's end‑to‑end big data platform, describing its organizational model, unified data governance framework, comprehensive data‑quality solution, the design of a headless metric platform, key technologies such as automatic modeling and code generation, and future directions toward a decentralized, smart data fabric.

Big Data · Data Quality · data governance

0 likes · 21 min read

dbaplus Community

Oct 24, 2022 · Big Data

Mastering Data Warehouse Modeling: From ER to Data Vault

This article explains what a data warehouse is, why modeling it matters, and compares four major modeling approaches—ER, dimensional, Data Vault, and Anchor—detailing their structures, steps, advantages, and typical use cases, while also offering guidance on selecting tools and designing models.

Big Data · Data Vault · Data Warehouse

0 likes · 15 min read

DataFunSummit

Oct 24, 2022 · Databases

Intelligent Operations: Challenges and Solutions with the IoTDB Time‑Series Database

This article examines the data challenges faced by intelligent operations (AIOps), evaluates IoTDB against other time‑series databases through performance benchmarks, outlines Cloudwise's architecture and open‑source contributions, and presents real‑world case studies demonstrating anomaly detection and root‑cause analysis in industrial settings.

Big Data · IoTDB · Performance Benchmark

0 likes · 15 min read

Data Thinking Notes

Oct 24, 2022 · Big Data

How to Diagnose and Fix Spark Data Skew: Practical Optimization Techniques

This article explains the causes of Spark data skew, how to locate skewed tasks using the Web UI, and presents six optimization methods—including increasing shuffle parallelism, filtering abnormal keys, two‑stage aggregation, map‑join, key sampling, and random‑prefix joins—plus a real‑world case study.

Big Data · Data Skew · Join

0 likes · 21 min read

Selected Java Interview Questions

Oct 23, 2022 · Big Data

Building a Cost‑Effective Data Analysis Platform: ClickHouse vs Elasticsearch and Deployment Guide for Zookeeper, Kafka, Filebeat, and ClickHouse

This article compares Elasticsearch and ClickHouse for log analytics, presents cost‑benefit calculations, and provides a step‑by‑step deployment guide for Zookeeper, Kafka, Filebeat, and ClickHouse to build a scalable, low‑cost data analysis platform for SaaS services.

Big Data · ClickHouse · Elasticsearch

0 likes · 12 min read

Architecture Digest

Oct 23, 2022 · Big Data

Implementing an SQL Parser: Core Concepts, ANTLR vs. Calcite Comparison, and Practical Code Samples

This article explains the motivation for an SQL parser in big‑data ecosystems, describes lexical, syntactic and semantic analysis, compares ANTLR and Apache Calcite as parser solutions, and provides complete code examples and deployment steps for building a functional SQL parsing engine.

ANTLR · Big Data · Calcite

0 likes · 19 min read

DataFunSummit

Oct 22, 2022 · Big Data

Tencent Music's Data Asset Management and Governance Practices

The article details Tencent Music's data governance journey, describing the background of rapid resource growth, challenges in cost management, a multi‑layered governance methodology—including metadata, tiered storage, and a Lego metadata platform—and the resulting improvements in resource utilization and data quality.

Big Data · Tencent Music · data governance

0 likes · 14 min read

DataFunTalk

Oct 22, 2022 · Big Data

Design and Practice of a Risk Control Experiment Platform at Du Xiaoman

This article explains the background, architecture, challenges, and step‑by‑step design of a big‑data‑driven risk control experiment platform used for online and offline strategy testing in internet finance.

Big Data · Experiment Platform · FinTech

0 likes · 12 min read

Architect's Guide

Oct 22, 2022 · Big Data

Meituan’s Kafka Optimizations: Reducing Read/Write Latency and Managing Large‑Scale Clusters

This article describes how Meituan’s data platform tackles the growing challenges of a 15,000‑plus‑node Kafka deployment by detailing current bottlenecks, latency‑reduction techniques across application and system layers, large‑scale cluster management strategies, and future directions for robustness and cloud‑native migration.

Big Data · Kafka · Large-Scale Clusters

0 likes · 21 min read

ITPUB

Oct 21, 2022 · Big Data

Hadoop Explained: Architecture, Core Components, and Real-World Applications

This article provides a comprehensive overview of Hadoop, covering its historical development, key characteristics, the HDFS storage framework, the MapReduce processing engine, YARN resource manager, and a wide range of real-world application scenarios, as well as the broader Hadoop ecosystem and its major components.

Big Data · Distributed computing · Ecosystem

0 likes · 20 min read

DataFunSummit

Oct 21, 2022 · Big Data

Exploring Real‑Time Data Lake Practices at Xiaohongshu Using Apache Iceberg

This article details Xiaohongshu's data platform architecture and three real‑time lake initiatives—log ingestion, CDC ingestion, and lake analysis—showcasing how Apache Iceberg, Flink, and custom shuffling algorithms solve small‑file and cross‑cloud challenges while enabling schema evolution and future multi‑cloud optimizations.

Apache Iceberg · Big Data · CDC

0 likes · 16 min read

Bilibili Tech

Oct 21, 2022 · Big Data

Kyuubi at Bilibili: Architecture, Enhancements, and Production Practices for Large‑Scale Data Processing

Bilibili adopted the open‑source Kyuubi proxy to replace its unstable STS layer, enabling multi‑tenant, multi‑engine (Spark, Presto, Flink) SQL/Scala processing with Hive Thrift compatibility, fine‑grained queue isolation, UI monitoring, stability safeguards, and Kubernetes/YARN deployment, while planning further cloud‑native extensions.

Big Data · Kyuubi · Spark

0 likes · 20 min read

Hulu Beijing

Oct 21, 2022 · Big Data

How Hulu Scales Spark on Kubernetes: Cloud‑Native Big Data at Disney‑Scale

Hulu’s data platform team describes how they migrated large‑scale Spark workloads from YARN to native Spark on Kubernetes, leveraging AWS services such as EKS, S3, and custom operators to achieve dynamic scaling, unified monitoring, cost‑effective resource management, and improved stability for search, recommendation, and advertising pipelines.

Big Data · Data engineering · Spark

0 likes · 18 min read

Kuaishou Big Data

Oct 20, 2022 · Big Data

How Kuaishou Scaled Metadata Management for Big Data: Architecture & Lessons

This article outlines Kuaishou's evolution of metadata management from its early Hive‑centric stage to a unified 2.0 platform, detailing system architecture, key technologies, challenges, and future 3.0 vision for low‑code, automated, and intelligent data governance.

Big Data · Data Lineage · Metadata

0 likes · 15 min read

ITPUB

Oct 20, 2022 · Big Data

Will HDFS Be Replaced? Analyzing Its Drawbacks and Future Alternatives

The article examines why the Hadoop Distributed File System (HDFS) may become obsolete by detailing its three main shortcomings (deployment complexity, metadata memory limits, and high replication overhead) and explores how newer architectures and erasure coding could address these issues.

Big Data · Distributed File System · HDFS

0 likes · 8 min read

Top Architect

Oct 19, 2022 · Big Data

Elasticsearch Architecture Overview and Core Concepts

This article provides a comprehensive overview of Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, shard allocation, indexing mechanisms, storage strategies, refresh and translog processes, segment merging, performance tuning, and JVM optimization for building scalable, near‑real‑time search solutions.

Big Data · Cluster · Elasticsearch

0 likes · 37 min read

DataFunSummit

Oct 18, 2022 · Big Data

Feature Overview of Apache Kyuubi (Incubating) v1.5.0

The article presents a detailed technical walkthrough of Apache Kyuubi 1.5.0, covering its service‑oriented architecture, high‑availability design, multi‑engine extensions for Spark, Flink, Trino and Hive, enhanced engine‑sharing policies, POOL mode configuration, and the project’s future roadmap.

Apache Kyuubi · Big Data · Engine Architecture

0 likes · 13 min read

DataFunTalk

Oct 17, 2022 · Big Data

How Data Empowers the Fast‑Moving Consumer Goods Industry: Baicaowei’s End‑to‑End Data Platform Evolution

This article details Baicaowei’s journey from a Hadoop‑based data platform to a modern StarRocks‑driven architecture, illustrating how digitalization, evolving business needs, and streamlined data pipelines empower the fast‑moving consumer goods sector through efficient data collection, modeling, and analytics.

Big Data · Data Architecture · Digital Transformation

0 likes · 10 min read

ITPUB

Oct 15, 2022 · Big Data

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

This talk introduces the evolution of data lakes, outlines Apache Hudi’s core features, details the Flink‑Hudi integration architecture—including write pipelines, small‑file handling, and read strategies—covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.

Apache Hudi · Big Data · Data Lake

0 likes · 21 min read

Model Perspective

Oct 14, 2022 · Artificial Intelligence

How SimRank Leverages Graph Theory for Powerful Recommendations

SimRank, a graph‑theoretic recommendation algorithm, models users and items as a bipartite graph and computes similarity through iterative matrix operations, with extensions like SimRank++ incorporating edge weights and evidence, while scalable solutions use big‑data frameworks or Monte‑Carlo simulations.

Big Data · Matrix Computation · Recommendation Systems

0 likes · 8 min read

21CTO

Oct 14, 2022 · Big Data

Top 12 Data Visualization Tools in 2022: Features, Pricing, and How to Choose

This guide reviews the most popular data visualization tools of 2022, explaining their key features, pricing plans, and how they help organizations turn complex data into clear, actionable insights for better decision‑making.

Big Data · Data visualization · features

0 likes · 14 min read

Shopee Tech Team

Oct 13, 2022 · Big Data

Improving Flink Unaligned Checkpoint: Problems, Principles, Optimizations, and Production Practices at Shopee

Shopee tackled frequent Flink checkpoint failures caused by back‑pressure by adopting and extending the community’s Unaligned Checkpoint mechanism—adding overdraft buffers, improving legacy sources, introducing an aligned‑checkpoint timeout, enabling output‑buffer switching, merging small HDFS files, and fixing network‑buffer deadlocks—now running hundreds of jobs with stable UC deployment and plans to enable it universally.

Big Data · Checkpoint Optimization · Flink

0 likes · 18 min read

Big Data Technology & Architecture

Oct 13, 2022 · Big Data

Hudi Clustering After Batch Processing: Merging Small Files Before Streaming

This guide details how to execute Apache Hudi file clustering after a batch job and before streaming, using Spark commands to merge numerous small HDFS files into larger ones, configure clustering and cleaning policies, and verify the results with HDFS counts.

Apache Hudi · Big Data · Data Lake

0 likes · 15 min read

DataFunSummit

Oct 12, 2022 · Big Data

Practical Application of Kyuubi in Xiaomi’s Big Data Platform

This article details how Xiaomi integrated the open‑source Kyuubi SQL gateway into its evolving big‑data platform, describing the challenges of multiple SQL services, the architectural redesign for a unified, high‑availability service, performance gains, new features such as engine pooling and Z‑ordering, and future roadmap plans.

Big Data · Data Platform · Kyuubi

0 likes · 15 min read

dbaplus Community

Oct 11, 2022 · Big Data

How We Replaced Elasticsearch with ClickHouse for Faster, Cheaper Log Storage

Facing growing log volumes and compliance needs, we evaluated ClickHouse’s hot‑cold‑archive storage to replace Elasticsearch, detailing configuration of storage policies, partitioning strategies, table creation, TTL handling, and cost‑effective OSS integration, ultimately achieving higher write performance and over 50% storage cost reduction.

Big Data · ClickHouse · Cold Hot Architecture

0 likes · 22 min read

DataFunSummit

Oct 11, 2022 · Big Data

Building Lakehouse Architecture with Delta Lake: Core Concepts, Technologies, Ecosystem, and Use Cases

This article explains how to construct a lakehouse architecture using Delta Lake by covering its basic concepts, version‑2 features, internal kernel and key technologies, ecosystem integrations, and classic data‑warehouse use cases such as G‑SCD and change‑data‑capture, providing practical guidance for modern big‑data engineering.

ACID Transactions · Big Data · Change Data Capture

0 likes · 27 min read

DataFunSummit

Oct 10, 2022 · Big Data

Stability Optimization Practices for Flink Jobs at Tencent

This article presents Tencent's practical experience in improving Flink job stability, covering the Oceanus platform, stability challenges, and concrete optimization techniques such as reducing failures, minimizing impact, accelerating recovery, and proactive issue detection, followed by a summary and future outlook.

Big Data · Flink · Real-Time Computing

0 likes · 12 min read

MaGe Linux Operations

Oct 9, 2022 · Big Data

Master Flink on Kubernetes: Step‑by‑Step Deployment Guide

This guide walks you through deploying Apache Flink on Kubernetes, covering runtime modes, building Docker images, creating ConfigMaps and Services, launching session and application clusters, submitting jobs, monitoring the Web UI, and cleaning up resources, all with practical code snippets and commands.

Big Data · Docker · Flink

0 likes · 26 min read

DataFunTalk

Oct 9, 2022 · Big Data

Software Localization and the Future of Big Data Platforms in China

The article examines why software localization is essential for China’s data technology, outlines the challenges and current state of domestic operating systems, databases and big‑data platforms, discusses migration and upgrade strategies, and introduces NetEase DataFun’s self‑developed big‑data platform with its features and support.

Big Data · China · Platform Migration

0 likes · 11 min read

Xingsheng Youxuan Technology Community

Oct 8, 2022 · Big Data

Solving Real‑World Data Quality Challenges with X‑Select’s DQC Platform

This article explains how X‑Select’s Data Quality Platform (DQC) addresses common data quality problems in large‑scale data development by defining six quality dimensions, leveraging open‑source solutions such as Apache Griffin and Qualitis, and implementing rule definition, execution, alerting, and workflow interruption within a Spark‑based architecture.

Big Data · Data Platform · Data Quality

0 likes · 15 min read

DataFunSummit

Oct 5, 2022 · Big Data

Serverless Technologies Empowering Big Data Analytics: An Overview of Amazon EMR Serverless

This article explains how Amazon EMR Serverless leverages serverless architecture to simplify, scale, and reduce the cost of big data analytics by providing managed Hadoop‑based services, flexible resource allocation, built‑in security, and seamless integration with the AWS data lake ecosystem.

Amazon EMR Serverless · Big Data · Data Lake

0 likes · 16 min read

ITPUB

Oct 4, 2022 · Big Data

How Kafka Achieves Million‑TPS with Sequential I/O, MMAP, and Zero‑Copy

This article explains how Kafka attains million‑level transactions per second by leveraging sequential disk writes, memory‑mapped files, zero‑copy data transfer, and batch processing, detailing each technique's mechanics and performance impact.

Big Data · High Throughput · Sequential I/O

0 likes · 10 min read

DataFunSummit

Oct 3, 2022 · Big Data

Optimizing Point‑Query Performance in Presto with Apache Hudi Data Skipping and Layout Techniques

This article explains how Huawei Cloud leverages Apache Hudi and HetuEngine (Presto) to improve point‑query performance on Lakehouse architectures through data layout optimization, file‑skipping techniques, metadata tables, and extensive benchmark results demonstrating multi‑fold speedups.

Apache Hudi · Big Data · Data Skipping

0 likes · 11 min read

DataFunTalk

Oct 3, 2022 · Artificial Intelligence

Building Real‑World Medical Knowledge Graphs and Clinical Event Graphs: Methods, Pipelines, and Applications

This article explains how YiduCore processes heterogeneous hospital data (EMR, HIS, LIS, RIS, literature) to construct real‑world medical knowledge graphs and clinical event graphs, detailing pipelines for entity extraction, normalization, graph cleaning, PSR scoring, graph embedding, and showcasing applications such as intelligent diagnosis, question answering, automated medical record generation, and clinical trial patient recruitment.

AI · Big Data · Medical Knowledge Graph

0 likes · 21 min read

DataFunTalk

Oct 2, 2022 · Big Data

Real-time Data Warehouse Architecture and Hologres Technology Overview

This article explains the evolving requirements of real‑time data warehouses, analyzes Alibaba's Hologres technology principles, presents recommended architectures for various latency scenarios, and discusses practical case studies, performance, security, and cost‑optimization strategies for modern big‑data platforms.

Big Data · Cloud Computing · Hologres

0 likes · 24 min read

DataFunSummit

Sep 30, 2022 · Big Data

MercsDB: Architecture, Storage, Computation, and Optimization of Tencent's MPP Data Warehouse Engine

The article presents a comprehensive technical overview of MercsDB—formerly HermesDB—including its background, storage and indexing designs, native and Presto computation engines, vectorization optimizations, benchmark results, real‑world applications, and future development plans.

Big Data · Columnar Storage · MPP

0 likes · 20 min read

Bilibili Tech

Sep 30, 2022 · Big Data

Bilibili's Efficient Lakehouse Platform Built on Trino and Iceberg

Bilibili’s new lake‑house platform, built on Trino and Iceberg, replaces Hive‑based pipelines by ingesting logs and DB data into Iceberg tables, applying advanced sorting, Z‑order/Hilbert clustering, bitmap and bloom indexes, virtual join columns and pre‑aggregation, enabling 70 000 daily queries on 2 PB with average scans of 2 GB and sub‑2‑second response times.

Big Data · Data Skipping · Iceberg

0 likes · 15 min read

Bilibili Tech

Sep 30, 2022 · Big Data

From BitMap to RoaringBitmap: Principles, Performance, and Big Data Applications

RoaringBitmap improves traditional BitMap by lazily allocating four container types, compressing sparse data, and dynamically switching between array, bitmap, and run containers, enabling fast exact set operations that power big‑data systems such as Kylin, ClickHouse, and Bilibili’s user‑visit and crowd‑package pipelines, dramatically reducing memory use and processing latency.

Big Data · Bitmap Compression · ClickHouse

0 likes · 16 min read

Youzan Coder

Sep 29, 2022 · Big Data

Implementing Spark Data Lineage with Spline: A Step‑by‑Step Guide

This article explains the growing importance of data lineage in large data warehouses, evaluates three Spark lineage extraction approaches, and provides a detailed, step‑by‑step guide to integrating the open‑source Spline agent—including codeless and programmatic initialization, configuration, dispatcher setup, post‑processing, and known limitations.

Apache Spark · Big Data · Data Lineage

0 likes · 16 min read

Huolala Tech

Sep 29, 2022 · Big Data

How Huolala Cuts Big Data Costs with Hybrid Cloud Strategies

This article details Huolala's comprehensive big‑data cost‑control system—covering data‑asset measurement, budgeting, auxiliary governance, storage tiering, and elastic compute management—to dramatically reduce both storage and compute expenses while maintaining service quality across diverse workloads.

Big Data · elastic scaling · resource budgeting

0 likes · 21 min read

MaGe Linux Operations

Sep 28, 2022 · Big Data

Master TransBigData: Python Toolkit for Transportation Big Data

TransBigData is a Python library that streamlines the preprocessing, gridding, visualization, and OD extraction of transportation spatiotemporal datasets such as taxi GPS, bike sharing, and bus data, offering concise, efficient functions for data cleaning, rasterization, interactive mapping, and analytical workflows.

Big Data · Data visualization · GIS

0 likes · 13 min read

DataFunSummit

Sep 28, 2022 · Big Data

Elasticsearch Time Series Engine: Practices, Challenges, and Alibaba Cloud TimeStream

This article presents a comprehensive overview of using Elasticsearch as a time series engine, covering its motivations, challenges, key features, Alibaba Cloud TimeStream optimizations such as columnar storage, LSM structures, downsampling, and integration with Prometheus and Grafana, while also discussing performance and cost considerations.

Big Data · Downsampling · Elasticsearch

0 likes · 15 min read

DataFunTalk

Sep 28, 2022 · Big Data

Privacy Computing in Big Data AI: Challenges, Solutions, and PPML Case Studies

This presentation explores the background and current state of privacy computing, its relevance to big data and AI, discusses SGX and LibOS technologies, introduces the BigDL PPML solution for secure Spark/Flink workloads, and reviews real-world applications and future outlook.

AI · Big Data · Flink

0 likes · 13 min read

MaGe Linux Operations

Sep 26, 2022 · Big Data

Deploy Hadoop on Kubernetes with Helm: A Complete Step‑by‑Step Guide

This tutorial walks you through deploying Hadoop 3.x on a Kubernetes cluster using Helm, covering repository setup, Docker image creation, Helm chart customization, service configuration, installation, verification, and clean‑up, with all necessary commands and YAML snippets.

Big Data · Docker · Hadoop

0 likes · 14 min read

DataFunSummit

Sep 26, 2022 · Databases

StarRocks Deployment and Practice at 360: Performance Evaluation, Use Cases, and Future Directions

This article details why 360 chose StarRocks as its OLAP engine, presents performance and operational comparisons with MySQL, Hive, Spark, Druid, Doris and ClickHouse, describes three major production use cases, and outlines ongoing explorations such as cloud‑native integration and Kubernetes support.

Big Data · OLAP · Performance Benchmark

0 likes · 17 min read

DataFunSummit

Sep 25, 2022 · Big Data

Practical Optimizations and Resource Management of Hadoop YARN at Xiaomi

This article shares Xiaomi's internal practices of Hadoop YARN, covering scheduling and resource optimization, elastic scheduling, node overcommit handling, federation architecture, metadata warehouse construction, and future plans to improve cluster utilization and cost efficiency.

Big Data · Hadoop · YARN

0 likes · 20 min read

Aikesheng Open Source Community

Sep 24, 2022 · Databases

Weekly Database and Big Data Article Highlights

This weekly roundup presents a curated selection of high‑quality technical articles and resources on MySQL, database error‑log analysis, big‑data task optimization, SQL injection case studies, and upcoming SQLE development plans, offering readers up‑to‑date insights into database engineering and performance best practices.

Big Data · Database · MySQL

0 likes · 4 min read

Xiaohongshu Tech REDtech

Sep 22, 2022 · Big Data

Graph Computing Algorithms for E‑commerce Anti‑Fraud and Reselling Bot Detection

The Xiaohongshu anti‑fraud team combats sophisticated same‑group and crowdsourced reselling bots by ingesting real‑time transaction streams into a Nebula Graph, using multi‑hop sub‑graph sampling, label propagation, and modularity‑based community detection to identify suspicious clusters, update risk pools, and enforce personalized purchase‑limit rules.

Big Data · anti-fraud · bot detection

0 likes · 9 min read

DataFunSummit

Sep 21, 2022 · Big Data

Practical Implementation of NetEase Yanxuan DMP Tag System: Architecture, Tag Production, Storage, and High‑Performance Query

This article details NetEase Yanxuan's DMP tag system, covering platform overview, tag definitions, production pipelines, multi‑layer storage architecture, high‑performance query techniques, and future roadmap, illustrating how data from various sources is transformed into actionable user tags for refined operations.

Apache Doris · Big Data · DMP

0 likes · 10 min read

Tencent Cloud Developer

Sep 20, 2022 · Information Security

Data Classification and Grading Architecture for Enterprise Data Security

The article details a practical, reusable enterprise architecture for data classification and grading that combines scanning tools, a rule‑engine with hot‑updates, a high‑performance identification service, and a security enforcement platform, addressing massive real‑time data volumes, diverse storage types, cross‑department isolation, and compliance with China’s data security laws.

Big Data · Data Security · Kafka

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Sep 20, 2022 · Big Data

How Alibaba Cloud’s Data Lake Metadata Warehouse Transforms Big Data Management

This article explains the challenges of data lake adoption and details Alibaba Cloud’s metadata warehouse architecture, construction, search capabilities, asset analysis, fine‑grained profiling, and lifecycle management that together enable efficient, cloud‑native big data management.

Alibaba Cloud · Big Data · Data Lake

0 likes · 13 min read

Big Data Technology & Architecture

Sep 19, 2022 · Big Data

Apache Iceberg Table and Catalog Configuration Guide for Hadoop

This article outlines the configuration settings for Apache Iceberg tables and catalogs on Hadoop, covering read and write properties, combine behavior for small HDFS files, reserved table properties, catalog lock options, and Hive Metastore connector Hadoop settings, supplemented with illustrative screenshots.

Big Data · Catalog · Hadoop

0 likes · 3 min read

Top Architect

Sep 16, 2022 · Big Data

Understanding ElasticSearch: Distributed Search, Full‑Text Retrieval, and Inverted Index

This article explains the fundamentals of search, why traditional databases struggle with large‑scale text queries, introduces full‑text search and inverted indexes, describes Lucene as the core library, and details ElasticSearch's distributed architecture, features, and common use cases.

Big Data · Full-Text Search · inverted index

0 likes · 7 min read

DataFunSummit

Sep 15, 2022 · Big Data

Amazon Real-Time Data Warehouse Architecture and Services Overview

This article reviews the evolution of data warehouse architectures, explains Amazon's serverless real-time data lake design and its key services, and details Amazon Redshift's cloud-native real-time data warehouse features, streaming ingestion, and integrated machine learning capabilities.

Amazon Redshift · Big Data · Data Lake

0 likes · 10 min read

Huolala Tech

Sep 15, 2022 · Big Data

Unlocking Massive Data Efficiency: How Bitmap and RoaringBitmap Transform Big Data Storage

This article explains the principles, Java implementation, and performance benefits of Bitmap and RoaringBitmap, demonstrating how they dramatically reduce storage costs, enable fast deduplication and set operations, and optimize large‑scale data warehouse queries in real‑world scenarios.

Big Data · Data Structures · Optimization

0 likes · 18 min read

NetEase Media Technology Team

Sep 15, 2022 · Big Data

SparkSQL on Kubernetes: NetEase Media's Cloud-Native Big Data Infrastructure Practice

NetEase Media migrated SparkSQL to Kubernetes in 2021, using storage‑compute decoupling, hybrid deployment, custom scripts, Kyuubi failover, and extensive monitoring and resource governance, which cut cluster size by over 30% while keeping CPU utilization above 80% and GC throughput above 95%.

Big Data · K8s migration · Spark on K8S

0 likes · 13 min read

dbaplus Community

Sep 14, 2022 · Databases

How Apache Doris Enables Real‑Time Analysis of Hudi Data Lakes

This article explains the architecture of Apache Doris, introduces Apache Hudi as a data‑lake format, compares Lambda and Kappa approaches, and details the design, implementation steps, and future roadmap for querying Hudi tables directly from Doris.

Apache Doris · Apache Hudi · Big Data

0 likes · 10 min read

vivo Internet Technology

Sep 14, 2022 · Big Data

Exploring and Practicing Apache Pulsar at vivo: Cluster Management, Monitoring, and Optimization

The vivo big‑data team details how they migrated massive real‑time workloads from Kafka to Apache Pulsar, describing cluster‑level bundle and ledger management, retention policies, a Prometheus‑Kafka‑Druid monitoring pipeline, load‑balancing tweaks, client tuning, rapid broker‑failure recovery, and future cloud‑native tracing and migration plans.

Apache Pulsar · Big Data · Cluster Management

0 likes · 19 min read

ByteDance Data Platform

Sep 14, 2022 · Fundamentals

Mastering Enterprise Data Tracking: A Step‑by‑Step Design Blueprint

This guide details how to plan, design, and manage enterprise‑level data tracking projects, covering role responsibilities, initial and iterative construction phases, event and attribute specifications, best‑practice tips, and common pitfalls to ensure accurate, maintainable analytics.

Analytics · Big Data · Data Tracking

0 likes · 16 min read

HomeTech

Sep 13, 2022 · Big Data

Integrating Heterogeneous Data Sources with openLooKeng and Upgrading the Apache Kylin Connector at AutoHome

This article describes how AutoHome tackled the complexity of managing multiple relational, NoSQL, and Hive data stores by adopting openLooKeng for unified, cross‑source SQL queries, outlines its key features such as ANSI‑SQL support, diverse connectors, and query optimizations, and details the custom enhancements made to the Apache Kylin connector to better serve their commercial data analysis workloads.

Big Data · Connectors · Data Integration

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Sep 13, 2022 · Big Data

From Hadoop to Cloud‑Native: The Evolution of Data Lakes and Modern Architecture

This article traces the history of data lakes from their 2010 inception with Hadoop through cloud‑native object storage, lakehouse formats like Delta Lake, and Alibaba Cloud's multi‑layer solution, outlining key architectural stages and practical construction challenges for enterprise‑grade implementations.

Alibaba Cloud · Big Data · Data Architecture

0 likes · 9 min read

DataFunSummit

Sep 12, 2022 · Big Data

DataFun Summit 2022: Data Integration Platform – SeaTunnel V2 Architecture Evolution and DataOps Practices

The DataFun Summit 2022, held on September 17, gathered leading experts from Baiji Whale Open Source, NetEase, Tapdata, and Alibaba Cloud to share deep technical insights on SeaTunnel V2 architecture, DataOps implementations, and open‑source big‑data studio tools, offering attendees practical guidance for modern data platforms.

Apache · Big Data · Data Platform

0 likes · 8 min read

21CTO

Sep 9, 2022 · Big Data

How Big Data Is Revolutionizing HR Analytics for Better Retention and Performance

This article explains how the rapid growth of big data—characterized by volume, velocity, and variety—is reshaping human‑resource analytics, enabling companies to identify employee trends, boost engagement, improve performance, and make smarter hiring decisions.

Big Data · HR analytics · HRIS

0 likes · 8 min read

Tencent Cloud Developer

Sep 9, 2022 · Big Data

Data Lake, Data Warehouse, and Lakehouse: Concepts, Architectures, and Industry Practices

The article explains how data lakes excel at ingesting massive, varied data, data warehouses optimize storage and query performance, and lake‑house architectures combine both strengths—offering scalable, low‑cost storage with high‑speed analytics—highlighting industry solutions from Snowflake, Databricks, and major cloud providers.

Analytics · Big Data · Data Lake

0 likes · 8 min read

Selected Java Interview Questions

Sep 9, 2022 · Databases

Performance Testing and Optimization of ClickHouse and Elasticsearch for High-Concurrency Scenarios

This technical report details the requirement analysis, environment setup, monitoring tools, load‑test scripts, data design, execution results, and optimization recommendations for stress‑testing ClickHouse and Elasticsearch to ensure they can handle high‑concurrency business peaks.

Big Data · ClickHouse · Database Optimization

0 likes · 11 min read

Programmer DD

Sep 9, 2022 · Big Data

Why Kafka and Pulsar Lead the Distributed Streaming Landscape

This article introduces Apache Kafka and Apache Pulsar, compares their core features such as publish/subscribe messaging, storage, real‑time pipelines, and stream processing, outlines key characteristics like high throughput, scalability and fault tolerance, and explains fundamental concepts and architecture components unique to each platform.

Big Data · Distributed Streaming · Kafka

0 likes · 14 min read

JavaEdge

Sep 7, 2022 · Databases

Understanding HBase: Architecture, Data Model, and Read/Write Mechanics

This article provides a comprehensive overview of HBase, covering its column‑oriented design, core components such as HMaster, RegionServer and ZooKeeper, the data model with column families and row keys, and detailed step‑by‑step write and read processes for distributed storage.

Big Data · HBase · NoSQL

0 likes · 16 min read

DataFunSummit

Sep 7, 2022 · Big Data

Integrating Apache Doris with Hudi: Architecture, Design, and Implementation

This article explains the background, architecture, design choices, and step‑by‑step implementation for enabling Apache Doris to query Hudi data lake tables, covering Doris features, Hudi formats, Lambda/Kappa architectures, solution alternatives, and future roadmap for real‑time analytics.

Apache Doris · Big Data · Data Lake

0 likes · 10 min read
