Tagged articles
21CTO
Nov 9, 2022 · Operations

How Ctrip Handles Billions of Logs Daily: Real‑Time Monitoring, Clog, CAT & TSDB

This article details Ctrip’s large‑scale log monitoring architecture, covering a system overview, the Clog log system, the CAT tracing platform, and the internal TSDB solution, and explains how billions of logs are processed in real time with low latency, high reliability, and efficient querying.

Big Data · Distributed Systems · Log Monitoring
0 likes · 12 min read
Zhengcaiyun Tech (政采云技术)
Nov 8, 2022 · Industry Insights

How Small Big‑Data Frontend Teams Can Thrive: A Survival Guide

This guide outlines the essential concepts of big data, the roles of a front‑end data team, practical workflow steps, platform architecture, industry benchmarks, and actionable strategies for small teams to improve efficiency, visualization capabilities, and digital operations.

Big Data · Data Platform · Data visualization
0 likes · 14 min read
Zhengcaiyun Tech (政采云技术)
Nov 8, 2022 · Big Data

User Path Analysis in the Hunyi System: Design, Computation Logic, and StarRocks Implementation

This article explains user path analysis as a method to visualize and optimize user flow, describes its productization in the Hunyi analytics platform, details the underlying computation logic, presents a complex StarRocks SQL solution, discusses performance challenges, and suggests future improvements and recruitment opportunities.

Big Data · Performance Optimization · StarRocks
0 likes · 21 min read
DataFunSummit
Nov 7, 2022 · Big Data

Huolala's Data Governance Practices: Data Quality, Metadata, and Cost Management Platforms

This article details Huolala's end‑to‑end data governance practice, covering the construction of a data governance framework, the implementation of a zero‑code data quality platform, a metadata management platform, and a cost‑governance system that together improve data reliability, reduce waste, and support scalable big‑data operations.

Big Data · Cost Management · data governance
0 likes · 14 min read
Tencent Cloud Developer
Nov 7, 2022 · Big Data

Data Engineering and Data Warehouse Design: Principles, Practices, and Governance

The article outlines comprehensive data‑engineering and warehouse‑design principles—covering collection (four Ws and methods like SDK, point‑code, binlog), reporting strategies, source selection, modeling with fact, aggregation, dimension and model tables, quality checks, and governance practices such as standardized SDKs, metric libraries, automated lineage, and cost optimization—to share actionable experience for any organization.

Big Data · Data Warehouse · Data engineering
0 likes · 32 min read
DataFunSummit
Nov 6, 2022 · Artificial Intelligence

Guangfa Group’s Federated Learning Exploration, Platform Construction, and the Book “Federated Learning Principles and Applications”

This article outlines Guangfa Group’s initiatives in privacy computing and federated learning. It details the development of the group’s federated learning platform, its contributions to the open‑source FATE project and to industry standards, and application scenarios such as joint statistics, precise marketing, risk control, and cross‑domain verification, and it introduces the group’s newly published book on federated learning principles and applications.

Artificial Intelligence · Big Data · FATE
0 likes · 23 min read
Architects' Tech Alliance
Nov 5, 2022 · Databases

Data Replication: Fundamentals, Technologies, and Future Trends

This article explains the concept of data replication, its three-stage process, key principles of compliance, timeliness, and diversity, various replication methods, layered technologies across storage, operating system, and database levels, emerging cloud and big‑data solutions, and heterogeneous use‑case scenarios.

Big Data · Databases · data replication
0 likes · 15 min read
StarRocks
Nov 4, 2022 · Big Data

Building a High‑Performance, Cost‑Effective Cloud Lakehouse with StarRocks and EMR

This article explains how to design and implement a cloud‑native Lakehouse using StarRocks and Tencent Cloud EMR, covering core technical requirements, a five‑layer architecture, data ingestion with Iceberg/Hudi, performance tricks like Z‑order clustering, cost‑control through elastic scaling, and the key product features of EMR StarRocks.

Big Data · Cloud Computing · EMR
0 likes · 24 min read
dbaplus Community
Nov 3, 2022 · Big Data

Why Kafka Stores Data the Way It Does: A Deep Dive into Its Log Architecture

This article thoroughly examines Kafka's storage system, explaining why it uses sequential log writes combined with sparse indexing, how different log formats evolved, and the mechanisms for log retention and compaction that enable high‑throughput, fault‑tolerant streaming at massive scale.

Big Data · Distributed Systems · Kafka
0 likes · 22 min read
Alibaba Cloud Big Data AI Platform
Nov 3, 2022 · Big Data

How Alibaba Cloud’s ODPS Upgrade Redefines Big Data Processing and AI Integration

Alibaba Cloud announced that its ODPS platform has been upgraded into an integrated big‑data solution that supports massive batch jobs, real‑time analytics, and AI workloads, delivering record‑breaking performance and enabling use cases from smart city traffic optimization to accelerated autonomous‑driving model training.

AI · Big Data · Performance Benchmark
0 likes · 5 min read
Zhongtong Tech
Nov 3, 2022 · Databases

How ZTO’s Database Operations Platform Evolved from Manual to Intelligent Automation

The article recounts Chen Jianhua’s presentation at the GOPS Global Operations Conference, detailing ZTO’s three‑stage journey in building a database operations platform—from initial automation to self‑service and finally to fine‑grained, data‑driven intelligent management—while sharing lessons and future plans.

Big Data · Database operations · Platform Engineering
0 likes · 4 min read
DataFunSummit
Nov 2, 2022 · Big Data

Evolution and Construction of Huolala's Doris‑Based OLAP System

This article details Huolala's journey from a MySQL‑centric analytics pipeline to a multi‑engine OLAP platform built on Doris, covering system architecture, data flow, stage‑wise evolution, engine selection, POC validation, performance tuning, stability measures, and future roadmap for self‑service analytics.

Big Data · Doris · OLAP
0 likes · 15 min read
DataFunSummit
Nov 1, 2022 · Big Data

Case Study of DCMM Standard Implementation at State Grid Tianjin Electric Power

This article details State Grid Tianjin Electric Power's early adoption and successful certification of the national DCMM data management maturity model, outlining background, certification milestones, systematic practices, and lessons learned that illustrate how data governance, architecture, and application strategies drive digital transformation.

Big Data · DCMM · Data Management
0 likes · 11 min read
Java Architect Essentials
Oct 31, 2022 · Big Data

How to Process 10 GB of Age Data on a 4 GB Machine Using Java

This article walks through generating a 10 GB file of age values, reading it line‑by‑line on a 4 GB RAM, 2‑core machine, measuring single‑thread performance, then redesigning the pipeline with a producer‑consumer model, blocking queues and multithreaded string splitting to dramatically boost CPU utilization and cut processing time while managing memory consumption.

Big Data · File Processing · Java
0 likes · 12 min read
Architects' Tech Alliance
Oct 31, 2022 · Industry Insights

What Drives Distributed Storage: Product Forms, Ecosystem, and Key Use Cases

Distributed storage encompasses integrated appliances and pure‑software solutions, each with distinct hardware strategies, and forms a multi‑dimensional industry ecosystem that spans commercial and open‑source software, specialized and generic hardware, serving critical scenarios such as virtualization/cloud, high‑performance computing, and big‑data analytics.

Big Data · Cloud Computing · High Performance Computing
0 likes · 15 min read
21CTO
Oct 30, 2022 · Fundamentals

Top 10 IoT Trends That Will Transform Industries

This article explores the rapid growth of the Internet of Things, outlines the key drivers behind its expansion, highlights major challenges such as chip shortages and bandwidth limits, and presents ten emerging trends—including AI integration, 5G, edge computing, and security—that will shape multiple sectors in the coming years.

5G · AI · Big Data
0 likes · 9 min read
DataFunSummit
Oct 30, 2022 · Big Data

Integrating Apache Spark with Cloud‑Native Technologies: Principles, Kubernetes Deployments, EMR on ACK, and Serverless Spark on DLF

This article examines the challenges of traditional Spark clusters and explains how integrating Spark with cloud‑native platforms—through Kubernetes deployment modes, EMR on ACK practices, Remote Shuffle Service, and serverless Spark on DLF—provides elastic scaling, lower operational costs, and advanced features such as executor rolling and custom scheduler support.

Big Data · DLF · Serverless
0 likes · 18 min read
Python Crawling & Data Mining
Oct 30, 2022 · Big Data

Why Ozone Is the Next‑Generation Distributed Object Store for Big Data

This article explains how Ozone, the Hadoop community’s new distributed object‑storage system, overcomes HDFS’s small‑file limitations with a hierarchical Volume‑Bucket‑Object model, detailing its architecture, components, data flow for creating and reading objects, and the benefits of its scalable, fault‑tolerant design.

Big Data · Hadoop · Object Storage
0 likes · 12 min read
Past Memory Big Data
Oct 29, 2022 · Big Data

How to Adapt Hadoop for Domestic Big Data Requirements

The article analyzes Hadoop’s declining relevance, the dominance of CDH/HDP, security pressures from vulnerabilities, and outlines ten technical steps—including hardware adaptation, component selection, dependency resolution, compilation, Ambari integration, packaging, testing, and functional verification—required to create a domestic ARM‑based Hadoop distribution, which the authors have released as a free HDP 3.3.1 build.

ARM · Ambari · Big Data
0 likes · 15 min read
DevOps Cloud Academy
Oct 27, 2022 · Big Data

Understanding DataOps: Concepts, Standards, and Enterprise Practices

This article explains DataOps as a methodology for improving data analysis quality and efficiency, outlines its origins, standards, and maturity model, and presents practical insights and case studies from Chinese enterprises on how DataOps addresses common data engineering challenges and drives digital transformation.

Big Data · Data Management · DataOps
0 likes · 12 min read
Data Thinking Notes
Oct 27, 2022 · Big Data

Boost Spark Performance: Proven Code Optimizations & Tuning Tips

This article outlines practical Spark job optimization techniques—from code-level improvements and resource tuning to data skew handling, persistence strategies, shuffle reduction, broadcast variables, Kryo serialization, and efficient data structures—demonstrating how each can dramatically cut execution time.

Big Data · Kryo Serialization · Performance tuning
0 likes · 19 min read
ITPUB
Oct 26, 2022 · Big Data

Why Kafka Stores Data the Way It Does: Inside Its Architecture

This article provides an in‑depth technical analysis of Kafka’s storage architecture, covering its design goals, storage mechanisms, log segment layout, sparse indexing, log cleanup policies, and the performance techniques such as sequential writes, page cache, and zero‑copy that enable high‑throughput streaming.

Big Data · Log Segments · Sparse Index
0 likes · 22 min read
DataFunTalk
Oct 26, 2022 · Big Data

Metadata Management and Governance Practices at Wing Payment: Architecture, Techniques, and Future Outlook

This article explains how metadata serves as the foundation of enterprise data governance, outlines common data governance challenges, describes Wing Payment's metadata governance framework and platform architecture, and presents future directions such as multi‑source management, cross‑cluster disaster recovery, and intelligent recommendation.

Big Data · Data Lineage · data governance
0 likes · 18 min read
DataFunSummit
Oct 25, 2022 · Databases

Design and Implementation of Meituan's Database Autonomy Service (DAS)

This article presents the background, challenges, architectural design, technical solutions, and future roadmap of Meituan's Database Autonomy Service (DAS), a platform that leverages big‑data collection, AI‑assisted root‑cause analysis, and automated operations to improve database performance, reliability, and self‑service capabilities.

AI · Big Data · Database Autonomy
0 likes · 18 min read
Kuaishou Big Data
Oct 25, 2022 · Big Data

How Kuaishou Built a Scalable Big Data Platform with Unified Data Quality and Metric Services

This article details Kuaishou's end‑to‑end big data platform, describing its organizational model, unified data governance framework, comprehensive data‑quality solution, the design of a headless metric platform, key technologies such as automatic modeling and code generation, and future directions toward a decentralized, smart data fabric.

Big Data · Data Quality · data governance
0 likes · 21 min read
dbaplus Community
Oct 24, 2022 · Big Data

Mastering Data Warehouse Modeling: From ER to Data Vault

This article explains what a data warehouse is, why modeling it matters, and compares four major modeling approaches—ER, dimensional, Data Vault, and Anchor—detailing their structures, steps, advantages, and typical use cases, while also offering guidance on selecting tools and designing models.

Big Data · Data Vault · Data Warehouse
0 likes · 15 min read
DataFunSummit
Oct 24, 2022 · Databases

Intelligent Operations: Challenges and Solutions with the IoTDB Time‑Series Database

This article examines the data challenges faced by intelligent operations (AIOps), evaluates IoTDB against other time‑series databases through performance benchmarks, outlines Cloudwise's architecture and open‑source contributions, and presents real‑world case studies demonstrating anomaly detection and root‑cause analysis in industrial settings.

Big Data · IoTDB · Performance Benchmark
0 likes · 15 min read
Data Thinking Notes
Oct 24, 2022 · Big Data

How to Diagnose and Fix Spark Data Skew: Practical Optimization Techniques

This article explains the causes of Spark data skew, how to locate skewed tasks using the Web UI, and presents six optimization methods—including increasing shuffle parallelism, filtering abnormal keys, two‑stage aggregation, map‑join, key sampling, and random‑prefix joins—plus a real‑world case study.

Big Data · Data Skew · Join
0 likes · 21 min read
Selected Java Interview Questions
Oct 23, 2022 · Big Data

Building a Cost‑Effective Data Analysis Platform: ClickHouse vs Elasticsearch and Deployment Guide for Zookeeper, Kafka, Filebeat, and ClickHouse

This article compares Elasticsearch and ClickHouse for log analytics, presents cost‑benefit calculations, and provides a step‑by‑step deployment guide for Zookeeper, Kafka, Filebeat, and ClickHouse to build a scalable, low‑cost data analysis platform for SaaS services.

Big Data · ClickHouse · Elasticsearch
0 likes · 12 min read
DataFunSummit
Oct 22, 2022 · Big Data

Tencent Music's Data Asset Management and Governance Practices

The article details Tencent Music's data governance journey, describing the background of rapid resource growth, challenges in cost management, a multi‑layered governance methodology—including metadata, tiered storage, and a Lego metadata platform—and the resulting improvements in resource utilization and data quality.

Big Data · Tencent Music · data governance
0 likes · 14 min read
Architect's Guide
Oct 22, 2022 · Big Data

Meituan’s Kafka Optimizations: Reducing Read/Write Latency and Managing Large‑Scale Clusters

This article describes how Meituan’s data platform tackles the growing challenges of a 15,000‑plus‑node Kafka deployment by detailing current bottlenecks, latency‑reduction techniques across application and system layers, large‑scale cluster management strategies, and future directions for robustness and cloud‑native migration.

Big Data · Kafka · Large-Scale Clusters
0 likes · 21 min read
ITPUB
Oct 21, 2022 · Big Data

Hadoop Explained: Architecture, Core Components, and Real-World Applications

This article provides a comprehensive overview of Hadoop, covering its historical development, key characteristics, the HDFS storage framework, the MapReduce processing engine, YARN resource manager, and a wide range of real-world application scenarios, as well as the broader Hadoop ecosystem and its major components.

Big Data · Distributed computing · Ecosystem
0 likes · 20 min read
DataFunSummit
Oct 21, 2022 · Big Data

Exploring Real‑Time Data Lake Practices at Xiaohongshu Using Apache Iceberg

This article details Xiaohongshu's data platform architecture and three real‑time lake initiatives—log ingestion, CDC ingestion, and lake analysis—showcasing how Apache Iceberg, Flink, and custom shuffling algorithms solve small‑file and cross‑cloud challenges while enabling schema evolution and future multi‑cloud optimizations.

Apache Iceberg · Big Data · CDC
0 likes · 16 min read
Bilibili Tech
Oct 21, 2022 · Big Data

Kyuubi at Bilibili: Architecture, Enhancements, and Production Practices for Large‑Scale Data Processing

Bilibili adopted the open‑source Kyuubi proxy to replace its unstable STS layer, enabling multi‑tenant, multi‑engine (Spark, Presto, Flink) SQL/Scala processing with Hive Thrift compatibility, fine‑grained queue isolation, UI monitoring, stability safeguards, and Kubernetes/YARN deployment, while planning further cloud‑native extensions.

Big Data · Kyuubi · Spark
0 likes · 20 min read
Hulu Beijing
Oct 21, 2022 · Big Data

How Hulu Scales Spark on Kubernetes: Cloud‑Native Big Data at Disney‑Scale

Hulu’s data platform team describes how they migrated large‑scale Spark workloads from Yarn to native Spark on Kubernetes, leveraging AWS services such as EKS, S3, and custom operators to achieve dynamic scaling, unified monitoring, cost‑effective resource management, and improved stability for search, recommendation, and advertising pipelines.

Big Data · Data engineering · Spark
0 likes · 18 min read
ITPUB
Oct 20, 2022 · Big Data

Will HDFS Be Replaced? Analyzing Its Drawbacks and Future Alternatives

The article examines why the Hadoop Distributed File System (HDFS) may become obsolete, detailing three main shortcomings (deployment complexity, metadata memory limits, and high replication overhead) and exploring how newer architectures and erasure coding could address them.

Big Data · Distributed File System · HDFS
0 likes · 8 min read
Top Architect
Oct 19, 2022 · Big Data

Elasticsearch Architecture Overview and Core Concepts

This article provides a comprehensive overview of Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, shard allocation, indexing mechanisms, storage strategies, refresh and translog processes, segment merging, performance tuning, and JVM optimization for building scalable, near‑real‑time search solutions.

Big Data · Cluster · Elasticsearch
0 likes · 37 min read
DataFunSummit
Oct 18, 2022 · Big Data

Feature Overview of Apache Kyuubi (Incubating) v1.5.0

The article presents a detailed technical walkthrough of Apache Kyuubi 1.5.0, covering its service‑oriented architecture, high‑availability design, multi‑engine extensions for Spark, Flink, Trino and Hive, enhanced engine‑sharing policies, POOL mode configuration, and the project’s future roadmap.

Apache Kyuubi · Big Data · Engine Architecture
0 likes · 13 min read
DataFunTalk
Oct 17, 2022 · Big Data

How Data Empowers the Fast‑Moving Consumer Goods Industry: Baicaowei’s End‑to‑End Data Platform Evolution

This article details Baicaowei’s journey from a Hadoop‑based data platform to a modern StarRocks‑driven architecture, illustrating how digitalization, evolving business needs, and streamlined data pipelines empower the fast‑moving consumer goods sector through efficient data collection, modeling, and analytics.

Big Data · Data Architecture · Digital Transformation
0 likes · 10 min read
ITPUB
Oct 15, 2022 · Big Data

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

This talk introduces the evolution of data lakes, outlines Apache Hudi’s core features, details the Flink‑Hudi integration architecture—including write pipelines, small‑file handling, and read strategies—covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.

Apache Hudi · Big Data · Data Lake
0 likes · 21 min read
Model Perspective
Oct 14, 2022 · Artificial Intelligence

How SimRank Leverages Graph Theory for Powerful Recommendations

SimRank, a graph‑theoretic recommendation algorithm, models users and items as a bipartite graph and computes similarity through iterative matrix operations, with extensions like SimRank++ incorporating edge weights and evidence, while scalable solutions use big‑data frameworks or Monte‑Carlo simulations.

Big Data · Matrix Computation · Recommendation Systems
0 likes · 8 min read
Shopee Tech Team
Oct 13, 2022 · Big Data

Improving Flink Unaligned Checkpoint: Problems, Principles, Optimizations, and Production Practices at Shopee

Shopee tackled frequent Flink checkpoint failures caused by back‑pressure by adopting and extending the community’s Unaligned Checkpoint mechanism—adding overdraft buffers, improving legacy sources, introducing an aligned‑checkpoint timeout, enabling output‑buffer switching, merging small HDFS files, and fixing network‑buffer deadlocks—now running hundreds of jobs with stable UC deployment and plans to enable it universally.

Big Data · Checkpoint Optimization · Flink
0 likes · 18 min read
DataFunSummit
Oct 12, 2022 · Big Data

Practical Application of Kyuubi in Xiaomi’s Big Data Platform

This article details how Xiaomi integrated the open‑source Kyuubi SQL gateway into its evolving big‑data platform, describing the challenges of multiple SQL services, the architectural redesign for a unified, high‑availability service, performance gains, new features such as engine pooling and Z‑ordering, and future roadmap plans.

Big Data · Data Platform · Kyuubi
0 likes · 15 min read
dbaplus Community
Oct 11, 2022 · Big Data

How We Replaced Elasticsearch with ClickHouse for Faster, Cheaper Log Storage

Facing growing log volumes and compliance needs, we evaluated ClickHouse’s hot‑cold‑archive storage to replace Elasticsearch, detailing configuration of storage policies, partitioning strategies, table creation, TTL handling, and cost‑effective OSS integration, ultimately achieving higher write performance and over 50% storage cost reduction.

Big Data · ClickHouse · Cold Hot Architecture
0 likes · 22 min read
DataFunSummit
Oct 11, 2022 · Big Data

Building Lakehouse Architecture with Delta Lake: Core Concepts, Technologies, Ecosystem, and Use Cases

This article explains how to construct a lakehouse architecture using Delta Lake by covering its basic concepts, version‑2 features, internal kernel and key technologies, ecosystem integrations, and classic data‑warehouse use cases such as G‑SCD and change‑data‑capture, providing practical guidance for modern big‑data engineering.

ACID Transactions · Big Data · Change Data Capture
0 likes · 27 min read
DataFunSummit
Oct 10, 2022 · Big Data

Stability Optimization Practices for Flink Jobs at Tencent

This article presents Tencent's practical experience in improving Flink job stability, covering the Oceanus platform, stability challenges, and concrete optimization techniques such as reducing failures, minimizing impact, accelerating recovery, and proactive issue detection, followed by a summary and future outlook.

Big Data · Flink · Real-Time Computing
0 likes · 12 min read
MaGe Linux Operations
Oct 9, 2022 · Big Data

Master Flink on Kubernetes: Step‑by‑Step Deployment Guide

This guide walks you through deploying Apache Flink on Kubernetes, covering runtime modes, building Docker images, creating ConfigMaps and Services, launching session and application clusters, submitting jobs, monitoring the Web UI, and cleaning up resources, all with practical code snippets and commands.

Big Data · Docker · Flink
0 likes · 26 min read
DataFunTalk
Oct 9, 2022 · Big Data

Software Localization and the Future of Big Data Platforms in China

The article examines why software localization is essential for China’s data technology, outlines the challenges and current state of domestic operating systems, databases and big‑data platforms, discusses migration and upgrade strategies, and introduces NetEase DataFun’s self‑developed big‑data platform with its features and support.

Big Data · China · Platform Migration
0 likes · 11 min read

Solving Real‑World Data Quality Challenges with X‑Select’s DQC Platform

This article explains how X‑Select’s Data Quality Platform (DQC) addresses common data quality problems in large‑scale data development by defining six quality dimensions, leveraging open‑source solutions such as Apache Griffin and Qualitis, and implementing rule definition, execution, alerting, and workflow interruption within a Spark‑based architecture.

Big Data · Data Platform · Data Quality
0 likes · 15 min read
ITPUB
Oct 4, 2022 · Big Data

How Kafka Achieves Million‑TPS with Sequential I/O, MMAP, and Zero‑Copy

This article explains how Kafka attains million‑level transactions per second by leveraging sequential disk writes, memory‑mapped files, zero‑copy data transfer, and batch processing, detailing each technique's mechanics and performance impact.

Big Data · High Throughput · Sequential I/O
0 likes · 10 min read
DataFunTalk
Oct 3, 2022 · Artificial Intelligence

Building Real‑World Medical Knowledge Graphs and Clinical Event Graphs: Methods, Pipelines, and Applications

This article explains how YiduCore processes heterogeneous hospital data (EMR, HIS, LIS, RIS, literature) to construct real‑world medical knowledge graphs and clinical event graphs, detailing pipelines for entity extraction, normalization, graph cleaning, PSR scoring, graph embedding, and showcasing applications such as intelligent diagnosis, question answering, automated medical record generation, and clinical trial patient recruitment.

AI · Big Data · Medical Knowledge Graph
0 likes · 21 min read
DataFunTalk
Oct 2, 2022 · Big Data

Real-time Data Warehouse Architecture and Hologres Technology Overview

This article explains the evolving requirements of real‑time data warehouses, analyzes Alibaba's Hologres technology principles, presents recommended architectures for various latency scenarios, and discusses practical case studies, performance, security, and cost‑optimization strategies for modern big‑data platforms.

Big Data · Cloud Computing · Hologres
0 likes · 24 min read
Bilibili Tech
Sep 30, 2022 · Big Data

Bilibili's Efficient Lakehouse Platform Built on Trino and Iceberg

Bilibili’s new lake‑house platform, built on Trino and Iceberg, replaces Hive‑based pipelines by ingesting logs and DB data into Iceberg tables, applying advanced sorting, Z‑order/Hilbert clustering, bitmap and bloom indexes, virtual join columns and pre‑aggregation, enabling 70 000 daily queries on 2 PB with average scans of 2 GB and sub‑2‑second response times.

Big Data · Data Skipping · Iceberg
0 likes · 15 min read
Bilibili Tech
Sep 30, 2022 · Big Data

From BitMap to RoaringBitmap: Principles, Performance, and Big Data Applications

RoaringBitmap improves on the traditional BitMap by lazily allocating four container types, compressing sparse data, and dynamically switching between array, bitmap, and run containers, enabling fast exact set operations that power big‑data systems such as Kylin, ClickHouse, and Bilibili's user‑visit and crowd‑package pipelines, dramatically reducing memory use and processing latency.

Big Data · Bitmap Compression · ClickHouse
0 likes · 16 min read
Youzan Coder
Sep 29, 2022 · Big Data

Implementing Spark Data Lineage with Spline: A Step‑by‑Step Guide

This article explains the growing importance of data lineage in large data warehouses, evaluates three Spark lineage extraction approaches, and provides a detailed, step‑by‑step guide to integrating the open‑source Spline agent—including codeless and programmatic initialization, configuration, dispatcher setup, post‑processing, and known limitations.

Apache Spark · Big Data · Data Lineage
0 likes · 16 min read
Huolala Tech
Sep 29, 2022 · Big Data

How Huolala Cuts Big Data Costs with Hybrid Cloud Strategies

This article details Huolala's comprehensive big‑data cost‑control system—covering data‑asset measurement, budgeting, auxiliary governance, storage tiering, and elastic compute management—to dramatically reduce both storage and compute expenses while maintaining service quality across diverse workloads.

Big Data · elastic scaling · resource budgeting
0 likes · 21 min read
MaGe Linux Operations
Sep 28, 2022 · Big Data

Master TransBigData: Python Toolkit for Transportation Big Data

TransBigData is a Python library that streamlines the preprocessing, gridding, visualization, and OD extraction of transportation spatiotemporal datasets such as taxi GPS, bike sharing, and bus data, offering concise, efficient functions for data cleaning, rasterization, interactive mapping, and analytical workflows.

Big Data · Data visualization · GIS
0 likes · 13 min read
DataFunSummit
Sep 28, 2022 · Big Data

Elasticsearch Time Series Engine: Practices, Challenges, and Alibaba Cloud TimeStream

This article gives a comprehensive overview of using Elasticsearch as a time series engine, covering the motivations, challenges, and key features, the Alibaba Cloud TimeStream optimizations (columnar storage, LSM structures, and downsampling), integration with Prometheus and Grafana, and performance and cost considerations.

Big Data · Downsampling · Elasticsearch
0 likes · 15 min read
DataFunSummit
Sep 25, 2022 · Big Data

Practical Optimizations and Resource Management of Hadoop YARN at Xiaomi

This article shares how Xiaomi runs Hadoop YARN internally, covering scheduling and resource optimization, elastic scheduling, node overcommit handling, federation architecture, metadata warehouse construction, and future plans to improve cluster utilization and cost efficiency.

Big Data · Hadoop · YARN
0 likes · 20 min read
Aikesheng Open Source Community
Sep 24, 2022 · Databases

Weekly Database and Big Data Article Highlights

This weekly roundup presents a curated selection of high‑quality technical articles and resources on MySQL, database error‑log analysis, big‑data task optimization, SQL injection case studies, and upcoming SQLE development plans, offering readers up‑to‑date insights into database engineering and performance best practices.

Big Data · Database · MySQL
0 likes · 4 min read
Xiaohongshu Tech REDtech
Sep 22, 2022 · Big Data

Graph Computing Algorithms for E‑commerce Anti‑Fraud and Reselling Bot Detection

The Xiaohongshu anti‑fraud team combats sophisticated same‑group and crowdsourced reselling bots by ingesting real‑time transaction streams into a Nebula Graph, using multi‑hop sub‑graph sampling, label propagation, and modularity‑based community detection to identify suspicious clusters, update risk pools, and enforce personalized purchase‑limit rules.

Big Data · anti-fraud · bot detection
0 likes · 9 min read
DataFunSummit
Sep 21, 2022 · Big Data

Practical Implementation of NetEase Yanxuan DMP Tag System: Architecture, Tag Production, Storage, and High‑Performance Query

This article details NetEase Yanxuan's DMP tag system, covering platform overview, tag definitions, production pipelines, multi‑layer storage architecture, high‑performance query techniques, and future roadmap, illustrating how data from various sources is transformed into actionable user tags for refined operations.

Apache Doris · Big Data · DMP
0 likes · 10 min read
Tencent Cloud Developer
Sep 20, 2022 · Information Security

Data Classification and Grading Architecture for Enterprise Data Security

The article details a practical, reusable enterprise architecture for data classification and grading that combines scanning tools, a rule‑engine with hot‑updates, a high‑performance identification service, and a security enforcement platform, addressing massive real‑time data volumes, diverse storage types, cross‑department isolation, and compliance with China’s data security laws.

Big Data · Data Security · Kafka
0 likes · 14 min read
DataFunSummit
Sep 15, 2022 · Big Data

Amazon Real-Time Data Warehouse Architecture and Services Overview

This article reviews the evolution of data warehouse architectures, explains Amazon's serverless real-time data lake design and its key services, and details Amazon Redshift's cloud-native real-time data warehouse features, streaming ingestion, and integrated machine learning capabilities.

Amazon Redshift · Big Data · Data Lake
0 likes · 10 min read
dbaplus Community
Sep 14, 2022 · Databases

How Apache Doris Enables Real‑Time Analysis of Hudi Data Lakes

This article explains the architecture of Apache Doris, introduces Apache Hudi as a data‑lake format, compares Lambda and Kappa approaches, and details the design, implementation steps, and future roadmap for querying Hudi tables directly from Doris.

Apache Doris · Apache Hudi · Big Data
0 likes · 10 min read
vivo Internet Technology
Sep 14, 2022 · Big Data

Exploring and Practicing Apache Pulsar at vivo: Cluster Management, Monitoring, and Optimization

The vivo big‑data team details how they migrated massive real‑time workloads from Kafka to Apache Pulsar, describing cluster‑level bundle and ledger management, retention policies, a Prometheus‑Kafka‑Druid monitoring pipeline, load‑balancing tweaks, client tuning, rapid broker‑failure recovery, and future cloud‑native tracing and migration plans.

Apache Pulsar · Big Data · Cluster Management
0 likes · 19 min read
HomeTech
Sep 13, 2022 · Big Data

Integrating Heterogeneous Data Sources with openLooKeng and Upgrading the Apache Kylin Connector at AutoHome

This article describes how AutoHome tackled the complexity of managing multiple relational, NoSQL, and Hive data stores by adopting openLooKeng for unified, cross‑source SQL queries, outlines its key features such as ANSI‑SQL support, diverse connectors, and query optimizations, and details the custom enhancements made to the Apache Kylin connector to better serve their commercial data analysis workloads.

Big Data · Connectors · Data Integration
0 likes · 13 min read
Alibaba Cloud Big Data AI Platform
Sep 13, 2022 · Big Data

From Hadoop to Cloud‑Native: The Evolution of Data Lakes and Modern Architecture

This article traces the history of data lakes from their 2010 inception with Hadoop through cloud‑native object storage, lakehouse formats like Delta Lake, and Alibaba Cloud's multi‑layer solution, outlining key architectural stages and practical construction challenges for enterprise‑grade implementations.

Alibaba Cloud · Big Data · Data Architecture
0 likes · 9 min read
DataFunSummit
Sep 12, 2022 · Big Data

DataFun Summit 2022: Data Integration Platform – SeaTunnel V2 Architecture Evolution and DataOps Practices

The DataFun Summit 2022, held on September 17, gathered leading experts from Baiji Whale Open Source, NetEase, Tapdata, and Alibaba Cloud to share deep technical insights on SeaTunnel V2 architecture, DataOps implementations, and open‑source big‑data studio tools, offering attendees practical guidance for modern data platforms.

Apache · Big Data · Data Platform
0 likes · 8 min read
Tencent Cloud Developer
Sep 9, 2022 · Big Data

Data Lake, Data Warehouse, and Lakehouse: Concepts, Architectures, and Industry Practices

The article explains how data lakes excel at ingesting massive, varied data, data warehouses optimize storage and query performance, and lake‑house architectures combine both strengths—offering scalable, low‑cost storage with high‑speed analytics—highlighting industry solutions from Snowflake, Databricks, and major cloud providers.

Analytics · Big Data · Data Lake
0 likes · 8 min read
Selected Java Interview Questions
Sep 9, 2022 · Databases

Performance Testing and Optimization of ClickHouse and Elasticsearch for High-Concurrency Scenarios

This technical report details the requirement analysis, environment setup, monitoring tools, load‑test scripts, data design, execution results, and optimization recommendations for stress‑testing ClickHouse and Elasticsearch to ensure they can handle high‑concurrency business peaks.

Big Data · ClickHouse · Database Optimization
0 likes · 11 min read
Programmer DD
Sep 9, 2022 · Big Data

Why Kafka and Pulsar Lead the Distributed Streaming Landscape

This article introduces Apache Kafka and Apache Pulsar, compares their core features such as publish/subscribe messaging, storage, real‑time pipelines, and stream processing, outlines key characteristics like high throughput, scalability and fault tolerance, and explains fundamental concepts and architecture components unique to each platform.

Big Data · Distributed Streaming · Kafka
0 likes · 14 min read
JavaEdge
Sep 7, 2022 · Databases

Understanding HBase: Architecture, Data Model, and Read/Write Mechanics

This article provides a comprehensive overview of HBase, covering its column‑oriented design, core components such as HMaster, RegionServer and ZooKeeper, the data model with column families and row keys, and detailed step‑by‑step write and read processes for distributed storage.

Big Data · HBase · NoSQL
0 likes · 16 min read
DataFunSummit
Sep 7, 2022 · Big Data

Integrating Apache Doris with Hudi: Architecture, Design, and Implementation

This article explains the background, architecture, design choices, and step‑by‑step implementation for enabling Apache Doris to query Hudi data lake tables, covering Doris features, Hudi formats, Lambda/Kappa architectures, solution alternatives, and future roadmap for real‑time analytics.

Apache Doris · Big Data · Data Lake
0 likes · 10 min read