Tagged articles
3697 articles
DataFunTalk
Feb 16, 2021 · Big Data

Understanding Presto: Architecture, Query Execution, and Youzan’s Practical Experience

This article explains Presto’s core architecture and low‑latency query execution process, describes how Youzan adopts Presto for various data‑platform scenarios, discusses the evolution of its deployment, and outlines the performance challenges and future enhancements such as Alluxio integration and session property management.

Big Data · Performance Optimization · Youzan
0 likes · 13 min read
Architect
Feb 15, 2021 · Big Data

Elasticsearch Optimization Practices for Large-Scale Data Queries

This article explains how to optimize Elasticsearch for cross‑month and multi‑year queries on billions of records, covering Lucene fundamentals, index and search performance tweaks, configuration settings, and practical testing results to achieve sub‑second response times.

Big Data · Elasticsearch · Optimization
0 likes · 14 min read
Architecture Digest
Feb 15, 2021 · Operations

ELK Stack Overview, Architecture, Installation and Configuration Guide (Version 7.7.0)

This article provides a comprehensive introduction to the ELK stack—Elasticsearch, Logstash, Kibana, and Filebeat—including its components, why it’s used for centralized log management, detailed architecture diagrams, step‑by‑step installation commands, configuration examples, and a practical Kafka‑based data pipeline demonstration.

Big Data · ELK · Elasticsearch
0 likes · 22 min read
DataFunTalk
Feb 14, 2021 · Big Data

Impala at NetEase: Architecture, Iceberg Integration, Management System, Optimizations and Future Roadmap

This talk presents NetEase's practical experience with Impala, covering its core architecture, new features in version 3.x, integration with Apache Iceberg, a custom management platform, profiling and statistics enhancements, as well as future plans involving Kubernetes, Alluxio caching and pre‑computation strategies.

Apache Iceberg · Big Data · Cluster Management
0 likes · 13 min read
DataFunTalk
Feb 13, 2021 · Databases

Improving HBase Availability and Reducing Latency Spikes with Replication‑Based Multi‑Path Reads and ZGC

This article describes how the Didi HBase team tackled HBase’s weak availability and GC‑induced latency spikes by introducing a replication‑based client multi‑path read mechanism, configuring hedged reads, and adopting the Z Garbage Collector, and presents the resulting performance improvements and remaining challenges.

Big Data · HBase · Multi-Path Read
0 likes · 11 min read
DataFunTalk
Feb 12, 2021 · Big Data

Apache Flink at Kuaishou: Past, Present, and Future

Zhao Jianbo, head of Kuaishou's big data architecture team, presents an in‑depth overview of Apache Flink's adoption at Kuaishou, covering reasons for selection, development history, business data flows, technical innovations such as the Slimbase state engine, stability improvements, and future roadmap.

Apache Flink · Big Data · Kuaishou
0 likes · 16 min read
DataFunTalk
Feb 10, 2021 · Big Data

AirWorks Data Intelligence Platform: Architecture, Cloud‑Native Ingestion, and Financial Asset Management Use Case

The article presents Entropy Simplify's AirWorks data intelligence platform, detailing its three‑layer architecture, cloud‑native multi‑source data ingestion system, low‑code ETL capabilities, technical features such as multi‑engine cooperation and data‑skew handling, and a financial asset‑management case study.

Big Data · ETL · Financial Services
0 likes · 16 min read
Alibaba Cloud Native
Feb 10, 2021 · Cloud Native

Accelerate AI and Big Data Workloads on Kubernetes with Fluid’s JindoRuntime

Fluid is an open‑source Kubernetes‑native engine that orchestrates and accelerates distributed datasets for AI and big‑data workloads, and this guide explains its core concepts, the JindoRuntime implementation, performance benefits, and step‑by‑step instructions to deploy and test JindoRuntime on a K8s cluster.

AI · Big Data · Data Acceleration
0 likes · 14 min read
DataFunTalk
Feb 9, 2021 · Big Data

Design and Implementation of a Full‑Chain Marketing Data Product at NetEase Yanxuan

This article details NetEase Yanxuan's business background, market characteristics, data product requirements, and the end‑to‑end design of a full‑chain marketing data product, covering attribution, metric evaluation, analysis frameworks, scenario‑based recommendations, and practical Q&A for data‑driven growth.

Big Data · Data Product · Marketing Analytics
0 likes · 18 min read
dbaplus Community
Feb 9, 2021 · Operations

How Suning Integrated ClickHouse into a Full‑Link Monitoring Platform for Real‑Time OLAP Insights

This article explains how Suning's big‑data team incorporated ClickHouse into their end‑to‑end monitoring ecosystem, detailing the architecture, trace‑ID propagation, slow‑query tracking, MergeTree health checks, replica delay analysis, and the role of Chproxy in delivering comprehensive observability for high‑performance OLAP workloads.

Big Data · ClickHouse · OLAP
0 likes · 15 min read
DataFunTalk
Feb 8, 2021 · Big Data

Ozone: The Next‑Generation Distributed Storage System Aiming to Replace HDFS

This article explains how Apache Ozone, built on the HDDS layer, addresses the scalability, memory, and performance limitations of HDFS by splitting metadata services, using RocksDB, implementing fine‑grained locking, RAFT‑based HA, and offering rich APIs, while outlining current challenges and future roadmap.

Big Data · HDDS · HDFS
0 likes · 29 min read
Efficient Ops
Feb 7, 2021 · Artificial Intelligence

How NLP Transforms Big Data Operations: Real-World AIOps Case Studies

This article explores the intersection of natural language processing and operations, outlines common text‑handling challenges, and presents three concrete AIOps case studies—log Q&A, anomaly detection, and ticket recommendation—while reflecting on a closed‑loop AI workflow and future research directions.

Big Data · NLP · aiops
0 likes · 9 min read
Architects' Tech Alliance
Feb 7, 2021 · Operations

Understanding the Essence and Implementation of Enterprise Digital Transformation

The article explains what digital transformation truly means for enterprises, outlines its three development stages, describes the core connection‑data‑intelligence framework, compares internal capability rebuilding with external ecosystem integration, and offers practical guidance on why and how companies should embark on digital transformation.

Big Data · Digital Transformation · Operations
0 likes · 24 min read
DataFunTalk
Feb 7, 2021 · Big Data

Optimizations and Extensions for Flink SQL in Tencent Real‑Time Computing Platform

This article, presented by Tencent senior engineer Du Li, details the current state of Flink SQL, compares Jar, Canvas, and SQL modes, introduces window‑function extensions, retract‑stream optimizations, and outlines future roadmap plans for cost‑based optimization and new features in the real‑time computing platform.

Big Data · Flink · Retract Stream
0 likes · 19 min read
Open Source Linux
Feb 7, 2021 · Big Data

Mastering Kafka: Core Concepts, Architecture, and High‑Performance Deployment

This comprehensive guide explains Kafka's role as a message system, detailing topics, partitions, producers, consumers, replication, controller, ZooKeeper coordination, performance optimizations like sequential writes and zero‑copy, and practical recommendations for hardware, configuration, and cluster deployment.

Big Data · Cluster Deployment · Kafka
0 likes · 22 min read
DataFunTalk
Feb 5, 2021 · Big Data

Design and Implementation of Beike's Data Management Platform (DMP)

This article details how Beike built a comprehensive Data Management Platform (DMP) that integrates user behavior and business data across multiple apps, outlines its five‑layer architecture, discusses data collection, processing, storage, real‑time profiling, and presents performance results and future optimization directions.

Big Data · DMP · Data engineering
0 likes · 20 min read

NetEase Yanxuan Data Task Governance Practice: Pre‑, In‑, and Post‑Operation Strategies

NetEase Yanxuan tackled data‑task governance by establishing pre‑operation guarantees, baseline‑driven in‑operation controls, and post‑operation interventions, delivering stable task output, reduced alarms, lineage awareness, rapid incident recovery, and reusable best‑practice products that earned the 2020 Technology Sharing Co‑building Award.

Baseline Management · Big Data · Task Operation
0 likes · 25 min read
ITFLY8 Architecture Home
Feb 4, 2021 · Big Data

Unlocking Data Middle Platform: From Ingestion to Real‑Time Analytics

This article provides a comprehensive overview of data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, scheduling, baseline control, heterogeneous storage, recommendation dependencies, data permissions, layered data architecture (ODS, DW, DWD, DWS, TDM, ADS), asset management, governance, service APIs, query and analysis services, as well as monitoring, alerting, and operational best practices for building robust big‑data solutions.

Big Data · Data Warehouse · ETL
0 likes · 25 min read
Full-Stack Internet Architecture
Feb 1, 2021 · Big Data

Kafka Overview: Architecture, Advantages, Disadvantages, and Core Concepts

This article provides a comprehensive introduction to Apache Kafka, covering its distributed publish‑subscribe architecture, its key components such as brokers, topics, partitions, producers, consumers, and ZooKeeper, as well as its advantages, drawbacks, storage mechanisms, partition assignment strategies, and reliability guarantees for high‑throughput big‑data streaming.

Big Data · Distributed Systems · Message queue
0 likes · 20 min read
DataFunTalk
Feb 1, 2021 · Big Data

Building a Real-Time Data Warehouse with Apache Flink and Apache Iceberg: Architecture, Challenges, and Best Practices

This article presents Tencent's experience of constructing a real‑time data warehouse by integrating Apache Flink with Apache Iceberg, covering background pain points, Iceberg's table format and capabilities, Flink‑Iceberg streaming and batch processing, practical implementations, and future roadmap for data‑lake acceleration.

Apache Flink · Apache Iceberg · Big Data
0 likes · 21 min read
Architects' Tech Alliance
Jan 29, 2021 · Artificial Intelligence

Comprehensive Overview of Machine Learning: Types, Industry Chain, and Key Technologies

This article provides a detailed introduction to machine learning, covering its definition, learning modes such as supervised, unsupervised and reinforcement learning, shallow versus deep learning, the full industry chain from AI chips to cloud and big‑data services, and the major open‑source frameworks and platforms driving the field.

AI chips · Big Data · Unsupervised Learning
0 likes · 11 min read
Big Data Technology & Architecture
Jan 28, 2021 · Big Data

Understanding Data Lakes: Definitions, Benefits, Architectures, and Technology Choices

Data lakes, emerging since 2020, are centralized repositories that store structured and unstructured data at any scale, offering flexible analytics, but require robust management to avoid becoming data swamps; this article explains definitions, advantages, typical architectures, and compares cloud and open‑source solutions such as AWS Lake Formation, Alibaba Cloud, Delta, Iceberg, and Hudi.

Analytics · Big Data · cloud storage
0 likes · 13 min read
dbaplus Community
Jan 27, 2021 · Big Data

How We Upgraded a 1500-Node Flink Cluster to 1.10: Challenges and Solutions

Facing a massive 1500‑node Flink 1.4.2 cluster handling over 12,000 tasks and 30 trillion daily events, we migrated to Flink 1.10, detailing new DDL/Catalog support, SQL enhancements, memory tuning, compatibility patches, extensive testing, and engine optimizations such as task‑load metrics and balanced sub‑task scheduling.

Big Data · Flink · Performance Optimization
0 likes · 13 min read
Full-Stack Internet Architecture
Jan 27, 2021 · Big Data

Introduction to Hadoop: Architecture, HDFS, MapReduce, and YARN Overview

This article provides a comprehensive overview of Hadoop, covering its origins, core components such as HDFS, MapReduce, and YARN, their architectures, data storage and processing mechanisms, fault‑tolerance features, scheduling strategies, and practical optimization techniques for large‑scale distributed computing.

Big Data · Distributed computing · HDFS
0 likes · 33 min read
Alibaba Cloud Developer
Jan 25, 2021 · Big Data

Why 2020 Was the Breakthrough Year for Apache Flink’s Ecosystem

In 2020, Apache Flink surged to become the most active Apache project, releasing three major versions that advanced its unified stream‑batch engine, introduced cloud‑native K8s support, expanded AI capabilities with PyFlink, and fostered a thriving Chinese community, solidifying its role as the de‑facto standard for real‑time computing.

AI Integration · Apache Flink · Big Data
0 likes · 21 min read
Didi Tech
Jan 22, 2021 · Big Data

Erasure Coding Practice in HDFS at Didi: Principles, Implementation, and Lessons Learned

Didi migrated HDFS to Hadoop 3.2 and implemented erasure coding—using XOR and Reed‑Solomon RS(6,3) striping—to replace three‑replica storage for cold data, building back‑ported clients, automated conversion tools, and cross‑datacenter backup pipelines, while addressing operational bugs and noting performance trade‑offs.

Big Data · Didi · HDFS
0 likes · 11 min read
DataFunTalk
Jan 22, 2021 · Big Data

Practical Experience of Apache Flink at ByteDance: Architecture, Optimizations, and Future Directions

This article presents ByteDance's real‑world use of Apache Flink, covering the platform's overall architecture, SQL extensions, custom connectors, UI‑driven SQL platform, performance optimizations such as window mini‑batch and custom windows, dimension‑table enhancements, checkpoint recovery improvements, stream‑batch integration, and upcoming roadmap items.

Apache Flink · Big Data · ByteDance
0 likes · 15 min read
Top Architect
Jan 18, 2021 · Big Data

Migrating Over 2 Billion MySQL Records to Google BigQuery Using Kafka

This article details a real‑world solution for migrating more than two billion MySQL records to Google BigQuery by streaming data through Kafka, employing partitioned tables, data filtering, and incremental migration to avoid downtime and reduce storage costs.

Big Data · BigQuery · Data Migration
0 likes · 7 min read
Efficient Ops
Jan 17, 2021 · Big Data

Understanding Kafka: Core Concepts, Architecture, and Performance Secrets

This article introduces Kafka’s fundamental role as a messaging system, explains topics, partitions, producers, consumers, replicas, consumer groups, and the controller, and explores its cluster architecture, performance optimizations like sequential writes and zero-copy, providing a comprehensive overview for building scalable data pipelines.

Big Data · Distributed Systems · Message queue
0 likes · 11 min read
Programmer DD
Jan 16, 2021 · Artificial Intelligence

Can AI Really Predict Employee Work Status? Inside Baidu’s New Patent

The article examines Baidu’s newly filed patent for predicting employee work status, explaining its big‑data‑driven methodology, the company’s claim it’s a talent‑management tool, and the broader debate over workplace surveillance amid the ongoing 996 controversy.

AI prediction · Baidu patent · Big Data
0 likes · 4 min read
Big Data Technology & Architecture
Jan 15, 2021 · Big Data

Evolution and Architecture of Major Chinese Big Data Platforms: Taobao, Didi, Meituan, 360, Kuaishou, and JD

This article reviews the evolution, architecture, and key components of major Chinese big‑data platforms—including those of Taobao, Didi, Meituan, 360, Kuaishou, and JD—highlighting data ingestion, storage, processing engines, scheduling systems, and service‑oriented designs that underpin their large‑scale data operations.

Big Data · Data Platform · Hadoop
0 likes · 14 min read
DataFunTalk
Jan 15, 2021 · Big Data

Optimizing Apache Kylin for Meituan's Sales OLAP: From MapReduce to Spark and Resource Tuning

This article presents a detailed case study of how Meituan's in‑store dining sales team identified severe efficiency issues in their Apache Kylin‑based OLAP system, dissected the construction process, and applied a step‑by‑step optimization roadmap—including engine migration, dimension pruning, resource configuration, and Spark‑based layered building—to boost query performance and achieve near‑perfect SLA.

Apache Kylin · Big Data · Meituan
0 likes · 16 min read
Didi Tech
Jan 14, 2021 · Cloud Computing

Design and Implementation of Didi's Logi‑KafkaManager Multi‑tenant Kafka Cloud Platform

Didi’s Logi‑KafkaManager is a multi‑tenant Kafka cloud platform that consolidates dozens of clusters into a secure, isolated gateway‑driven service offering intuitive web‑based topic management, real‑time metrics visualization, automated diagnostics, quota governance and safe scaling, delivering high internal satisfaction and enterprise commercialization.

Big Data · Data Security · Kafka
0 likes · 17 min read
Meituan Technology Team
Jan 14, 2021 · Big Data

Design and Implementation of an SSD‑Based Application‑Layer Cache Architecture for Kafka in Meituan Data Platform

Meituan built an SSD‑based application‑layer cache for Kafka that bypasses PageCache contention between real‑time and delayed jobs, classifies log segments across SSD and HDD, limits flush rates, and achieves up to 80% latency reduction while guaranteeing stable real‑time consumption.

Big Data · Kafka · LogSegment
0 likes · 19 min read
NetEase Smart Enterprise Tech+
Jan 14, 2021 · Big Data

How Yidun Achieves Real-Time, High-Performance Public-Opinion Data Cleaning with Groovy and JVM

Yidun’s public-opinion monitoring platform transforms massive raw web data into a unified format by separating dynamic Groovy-script-driven cleaning from static processing, achieving real-time source integration, high throughput, scalability, and high availability while addressing format diversity, team coordination, and performance-flexibility trade-offs.

Big Data · Data cleaning · ETL
0 likes · 5 min read
Architects Research Society
Jan 13, 2021 · Fundamentals

Master Data Management (MDM): Concepts, Business Value, Technical Challenges, and Architectural Considerations

The article explains master data management (MDM) as a framework for creating a single, reliable source of truth, outlines its growing business relevance, discusses key technical challenges such as data governance and scalability, and explores next‑generation architectures involving graph databases, big data, and machine learning.

Big Data · Graph Database · Master Data Management
0 likes · 10 min read
vivo Internet Technology
Jan 13, 2021 · Big Data

Statistical Monitoring Using Normal Distribution and Boxplot: Theory, Implementation, and API Design

The article explains the origin of the normal distribution, the central limit theorem, and how boxplots identify anomalies, then describes a Java‑based API that partitions data into five median‑centered levels using same‑period and year‑over‑year ratios to automatically detect and classify abnormal trends in daily metrics.

Anomaly Detection · Big Data · Boxplot
0 likes · 11 min read
dbaplus Community
Jan 11, 2021 · Databases

Why eBay Switched Its Ad Analytics from Druid to ClickHouse – A Deep Dive

eBay’s ad data platform, originally built on a custom SQL engine and later migrated to Druid, was re‑engineered to use ClickHouse, highlighting challenges such as massive data volume, atomic offline replacements, schema design, compression, and operational simplifications, and demonstrating performance and scalability gains for advertisers.

Ad Analytics · Big Data · ClickHouse
0 likes · 18 min read
DataFunSummit
Jan 10, 2021 · Big Data

Business Model and Digital Transformation of Internet Consumer Finance: A Case Study of CMB’s Flash Loan

The article analyzes the business architecture, value proposition, channels, revenue model, core resources, and digital transformation of internet consumer finance using China Merchants Bank’s fast‑approval “Flash Loan” as a case study, highlighting the role of big data, AI, and cloud computing in modern retail lending.

Big Data · Business Model · Digital Transformation
0 likes · 13 min read
21CTO
Jan 7, 2021 · Big Data

How Kuaishou Built a Scalable Big Data Service Platform to Eliminate Redundant Development

This article explains Kuaishou's data service platform, detailing the background challenges of high development barriers and duplicated work, the platform's architecture and key technologies such as configuration‑driven development, multi‑mode APIs, data acceleration, and high‑availability mechanisms, and concludes with future directions.

Big Data · Data Acceleration · Data Platform
0 likes · 12 min read
360 Tech Engineering
Jan 7, 2021 · Big Data

Overview of the Qirin Big Data Platform Architecture and Core Modules

The article introduces the Qirin big data platform—a one‑stop solution covering resource management, metadata, data ingestion, task development, interactive querying, and self‑service analysis—detailing its modular architecture, typical processing workflow, and future development plans for enterprise‑wide data services.

Big Data · Data Platform · Metadata
0 likes · 11 min read
vivo Internet Technology
Jan 6, 2021 · Big Data

How HyperLogLog Estimates Cardinality in Massive Data Sets

This article explains the cardinality‑counting problem behind DAU/MAU and unique visitor metrics, compares naïve solutions like Set, Bitmap and Bloom filter, introduces big‑data algorithms such as Linear Counting, LogLog and HyperLogLog, and shows how Redis implements HyperLogLog with dense and sparse storage optimizations.

Big Data · Cardinality · HyperLogLog
0 likes · 17 min read
DataFunTalk
Jan 6, 2021 · Big Data

Didi's Presto Engine: Architecture, Optimizations, and Operational Practices

This article presents Didi's three‑year experience with Presto, detailing its architecture, low‑latency design, large‑scale deployment, extensive Hive compatibility work, resource isolation, Druid connector integration, usability enhancements, stability engineering, performance tuning, and future directions for the ad‑hoc query engine.

Big Data · Distributed Systems · Druid Connector
0 likes · 17 min read
dbaplus Community
Jan 5, 2021 · Big Data

How Ctrip Built a Scalable Unified Log Framework for Payment Data

Facing massive, heterogeneous logs from numerous payment services, Ctrip’s data team designed a unified logging framework that extends log4j2, streams logs via Kafka to HDFS using a customized Camus pipeline, partitions and stores data in ORC for efficient Hive analysis, while addressing format, storage, and performance challenges.

Big Data · Camus · Hadoop
0 likes · 16 min read
DataFunTalk
Jan 3, 2021 · Artificial Intelligence

iQIYI Machine Learning Platform: Development History, Features, and Practical Experience

This article details the evolution of iQIYI's machine learning platform—from its early Javis‑based deep‑learning system to three major versions that introduced visual workflow, distributed scheduling, auto‑tuning, large‑scale training support, model management, and online prediction—while sharing practical lessons and a real anti‑cheat use case.

Big Data · Model Management · Platform
0 likes · 13 min read
Tencent Cloud Developer
Dec 30, 2020 · Big Data

How Alluxio Boosts Tencent Cloud EMR: Cutting Bandwidth by 50% and Accelerating IO‑Intensive Workloads

This article analyzes the challenges of traditional monolithic big‑data architectures, explains how Tencent Cloud EMR integrates Alluxio for compute‑storage separation, presents detailed performance benchmarks showing 20‑50% bandwidth reduction and 5‑40% query speedup, and outlines the specific tuning measures applied.

Alluxio · Big Data · Cloud Computing
0 likes · 10 min read
JD Tech Talk
Dec 30, 2020 · Databases

Architecture and Application Practice of JD Urban Spatio-Temporal Data Engine (JUST)

The presentation details the design, implementation, and real‑world applications of the JD Urban Spatio‑Temporal Data Engine (JUST), a distributed, scalable database that handles massive, complex spatio‑temporal data with novel storage, indexing, and query techniques, demonstrating high performance and ease of use across smart‑city scenarios.

Big Data · Database · GIS
0 likes · 26 min read
Alibaba Cloud Developer
Dec 29, 2020 · Fundamentals

What Are the 10 Tech Trends Shaping the Post-Pandemic Era?

Alibaba DAMO Academy outlines ten pivotal technology trends for 2021, ranging from third‑generation semiconductors and quantum computing to AI‑driven drug discovery, cloud‑native IT, data‑intelligent agriculture, and smart city operation centers, highlighting how these innovations will drive post‑pandemic growth.

Artificial Intelligence · Big Data · Quantum Computing
0 likes · 9 min read
Alibaba Terminal Technology
Dec 28, 2020 · Big Data

Unlocking Massive-Scale User Behavior Analysis: From Funnels to Intelligent Links

This talk explores how to conduct user behavior analysis on massive data sets, compares existing analytics tools, and presents Alibaba Dataworks' end‑to‑end solution—including funnel and link visualizations, a big‑data processing architecture, and future intelligent link capabilities—to uncover and resolve user‑experience issues efficiently.

Alibaba Cloud · Big Data · Data visualization
0 likes · 16 min read
Big Data Technology & Architecture
Dec 28, 2020 · Big Data

Implementing Historical Slowly Changing Dimension (Chain) Tables with PL/pgSQL

This article explains the concept of historical chain (slowly changing dimension) tables in data warehousing, demonstrates how to create source and target tables, provides a PL/pgSQL stored procedure to handle inserts, updates, and deletions, and shows step‑by‑step testing with sample SQL scripts.

Big Data · PL/pgSQL · Slowly Changing Dimension
0 likes · 10 min read
dbaplus Community
Dec 27, 2020 · Big Data

How ClickHouse Powers a 700 B‑Row Real‑Time Data Platform at Ctrip

This article details how Ctrip's senior engineering manager leveraged ClickHouse to build a high‑availability, sub‑second response data platform handling nearly 700 billion rows, describing the motivations, architecture, data synchronization processes, performance gains, challenges, and practical recommendations for large‑scale analytics.

Big Data · ClickHouse · Data Architecture
0 likes · 28 min read
Architect
Dec 27, 2020 · Big Data

Optimizing Billion‑Scale Hive Queries: Partitioning, Indexing, Bucketing, Active‑User Segmentation, and Data Structure Refactoring

This article walks through the challenges of querying a 300‑billion‑row Hive table, analyzes why traditional partitioning, indexing, and bucketing fall short, and presents a practical solution that combines active‑user segmentation and a redesigned array‑based data model to cut query time from hours to minutes.

Big Data · Data Partitioning · Hive
0 likes · 10 min read
Youzan Coder
Dec 25, 2020 · Big Data

Metadata Governance and Collection in a Data Asset Platform

The platform implements comprehensive metadata governance by extracting, standardizing, and ingesting basic, trend, resource, lineage, and task metadata from offline and real‑time systems via a Kafka‑based SDK, enabling unified storage, monitoring, alerts, and future automation to improve data asset visibility and quality.

Big Data · Data Collection · Metadata
0 likes · 18 min read
Big Data Technology & Architecture
Dec 24, 2020 · Big Data

Common Techniques for Processing Massive Data Sets

This article summarizes a range of practical methods, including Bloom filters, hashing, bitmaps, heaps, bucket partitioning, database indexes, inverted indexes, external sorting, tries, and MapReduce, that are commonly used to handle, deduplicate, and query extremely large data volumes in big‑data applications.

Big Data · Hashing · Heap
0 likes · 11 min read
Code Ape Tech Column
Dec 23, 2020 · Fundamentals

Technical Concepts Illustrated Through Relationship Analogies

The article humorously maps various relationship scenarios to core IT concepts such as backup strategies, high‑availability mechanisms, scaling methods, security measures, cloud services, and big‑data techniques, providing an engaging overview of fundamental system design principles.

Big Data · Cloud Computing · Scaling
0 likes · 8 min read
dbaplus Community
Dec 22, 2020 · Big Data

How eBay Migrated 10 PB of HDFS Data Across Namespaces in Just 2 Hours

This article details how eBay's ADI Hadoop team tackled a massive 10 PB, 10‑million‑file migration by optimizing DistCp with Fastcopy, load‑balancing, ACL handling, and failure recovery, ultimately completing the transfer within a two‑hour window while preserving cluster stability and performance.

Big Data · Distcp · HDFS
0 likes · 16 min read
Architect
Dec 22, 2020 · Big Data

Dimensional Modeling in Data Warehousing: Concepts, Theory, and Practical Example

This article explains data warehouse fundamentals, reviews classic warehouse models such as ER, dimensional, Data Vault and Anchor, then dives deep into dimensional modeling concepts, star and snowflake schemas, and demonstrates a practical e‑commerce scenario with SQL examples and trade‑offs.

Big Data · Data Warehouse · ETL
0 likes · 11 min read
21CTO
Dec 21, 2020 · Big Data

5 Emerging Big Data Trends Shaping Business, Health, and Climate in 2021

This article outlines five key big‑data trends for 2021—including the rise of augmented analytics, the convergence of big data with blockchain, the growing importance of knowledge graphs, data‑driven health innovations, and climate‑focused analytics—highlighting their impact on organizations and future technological landscapes.

Big Data · Blockchain · Knowledge Graph
0 likes · 8 min read
Didi Tech
Dec 21, 2020 · Big Data

HBase Availability and Latency Optimizations: Replication‑Based Multi‑Read and ZGC Adoption

To overcome HBase’s weak availability and GC‑induced latency spikes, the DiDi team introduced a replication‑based client multi‑read (hedged‑read) mechanism and migrated to the Z Garbage Collector, which together dramatically cut maximum and 99.9th‑percentile latencies while keeping services online during region disruptions.

Big Data · HBase · Low latency
0 likes · 12 min read
Youzan Coder
Dec 18, 2020 · Big Data

Design and Implementation of a Configurable Real-Time Rule Engine for Live‑Streaming Product Audits

The paper presents a configurable real‑time rule engine for live‑streaming product audits that decouples data aggregation from rule execution, uses QLExpress for dynamic conditions, supports Dubbo and HTTP sources, and enables safe gray‑release updates, cutting the rule‑change cycle from weeks to near‑real‑time.

Big Data · QLExpress · Real-time Data
0 likes · 8 min read
Laiye Technology Team
Dec 18, 2020 · Big Data

Comprehensive Overview of Laiye Technology's Business Intelligence Ecosystem

This article provides a detailed, end‑to‑end description of Laiye Technology's BI ecosystem, covering its background, development stages, data acquisition, transmission, transformation, loading, modeling, storage layers, statistical analysis, real‑time metrics, visualization, and future challenges, illustrating how the company builds a scalable, cloud‑native data‑driven platform.

Analytics · BI · Big Data
0 likes · 22 min read
Alibaba Cloud Developer
Dec 17, 2020 · Big Data

Why GraphScope is Revolutionizing Large-Scale Graph Computing for AI and Big Data

GraphScope, an open‑source one‑stop platform from Alibaba DAMO Academy, unifies interactive queries, graph analytics, and graph learning on massive, rapidly evolving graphs, offering high‑performance distributed memory management, Gremlin optimization, and seamless Python integration to tackle real‑world AI and big‑data challenges.

Big Data · Distributed Systems · Python
0 likes · 21 min read
macrozheng
Dec 15, 2020 · Big Data

How Kafka Achieves Million‑TPS Through Sequential I/O, MMAP, and Zero‑Copy

Kafka can sustain millions of transactions per second by writing data sequentially to disk, leveraging memory‑mapped files, employing zero‑copy DMA transfers, and batching messages, each technique reducing I/O overhead and CPU involvement, which together enable its high‑throughput performance in big‑data pipelines.

Big Data · High Throughput · Kafka
0 likes · 11 min read
Youzan Coder
Dec 15, 2020 · Industry Insights

How Youzan Built a Full‑Scale Data Cost Billing System: From SDK to Multi‑Dimensional Analysis

This article details Youzan's end‑to‑end construction of a unified data‑center cost billing system, covering background goals, multi‑type cost support, SDK‑based information collection, cost quantification for offline, real‑time and platform tools, full‑business coverage, multi‑dimensional analysis models, operational rollout, and future plans.

Big Data · Data Platform · Industry Insights
0 likes · 19 min read
Programmer DD
Dec 10, 2020 · Artificial Intelligence

Discover Didi’s 40+ Open‑Source Projects in AI, Big Data & Cloud

DiDi’s open‑source portfolio, now exceeding 40 projects, spans AI runtimes, speech recognition, traffic analytics, middleware, big‑data loaders, monitoring tools, mobile frameworks, and frontend libraries, offering developers ready‑to‑use solutions for edge AI, intelligent transportation, data processing, and system reliability.

Artificial Intelligence · Big Data · Mobile Development
0 likes · 23 min read
Youzan Coder
Dec 9, 2020 · Big Data

Youzan Big Data Technology Salon: Practices in Data Cost Governance, Apache Iceberg, Flink, and Data-Driven Growth

The Youzan Big Data Technology Salon brought together Youzan, NetEase and Didi to share practical approaches for cutting data‑infrastructure costs, building an Apache Iceberg‑based data lake, scaling Flink real‑time workloads, and creating a data‑driven growth platform that leverages tracking, A/B testing and analytics.

Apache Iceberg · Big Data · Data Cost Governance
0 likes · 5 min read
DataFunTalk
Dec 8, 2020 · Artificial Intelligence

Financial Big Data Risk Control Models: Techniques, Applications, and COVID‑19 Challenges

This article presents a comprehensive overview of financial big‑data risk control models at Du Xiaoman, covering traditional scoring cards, AI‑driven time‑series and text processing, graph‑based networks, model interpretability, probability calibration, stability analysis, and the specific challenges introduced by the COVID‑19 pandemic.

Artificial Intelligence · Big Data · Credit Scoring
0 likes · 14 min read
Xianyu Technology
Dec 8, 2020 · Big Data

Supply-Demand Modeling and Category Optimization for the Idle Second-Hand Market

The article describes a supply‑demand modeling framework for the idle second‑hand market that extracts and structures product attributes, builds a decision‑tree‑based index from price, inventory, search‑hotspot and demand‑activation sub‑models, and uses the index to optimize category allocation, boost scarce supply, and drive overall growth.

Big Data · Product Modeling · category optimization
0 likes · 7 min read
Tencent Cloud Developer
Dec 7, 2020 · Big Data

Searchable Snapshots in Elasticsearch 7.10: Features, Usage, and Future Outlook

Elasticsearch 7.10 adds searchable snapshots, letting users query indices stored directly in remote repositories such as S3 or COS, which halves storage costs, decouples storage from compute, supports manual mounting and ILM cold‑phase policies, and promises future full storage‑compute separation without local caching.

Big Data · Data Tiering · Elasticsearch
0 likes · 12 min read
DataFunSummit
Dec 1, 2020 · Artificial Intelligence

Building an AI Ecosystem with Flink: AI Flow Architecture, Components, and Applications

This article explains how Flink enables end‑to‑end AI workflows through the AI Flow platform, covering the Lambda architecture background, AI task pipeline stages, the reasons for choosing Flink, AI Flow’s graph model, core services, integration with ML pipelines, and real‑world advertising recommendation use cases.

AI Flow · AI Pipeline · Big Data
0 likes · 12 min read
DataFunTalk
Nov 30, 2020 · Fundamentals

DataFunTalk Annual Conference – Full Program and Speaker Details

The DataFunTalk year‑end conference will be held online on December 19‑20, featuring over 90 speakers across multiple forums covering recommendation algorithms, knowledge graphs, AI, big data, security, and product development, with detailed session schedules, speaker bios, and registration information.

AI · Big Data · Knowledge Graph
0 likes · 76 min read
JD Tech Talk
Nov 30, 2020 · Big Data

Scalable Time Series Similarity Search in Big Data: Partitioning, Dimensionality Reduction, and LSH Approaches

This article examines the challenges of performing time‑series similarity queries on massive datasets and presents three scalable solutions—partition‑based indexing, dimensionality‑reduction using MinHash, and a combined approach with Locality Sensitive Hashing—to reduce computation while preserving similarity accuracy.

Big Data · LSH · Minhash
0 likes · 10 min read
ITFLY8 Architecture Home
Nov 28, 2020 · Fundamentals

What 19 Core Topics Every Software Architect Must Master

This article outlines a comprehensive knowledge framework for software architects, covering nineteen essential areas such as responsibilities, foundational concepts, internet system challenges, distributed caching, messaging, load balancing, performance testing, operating systems, algorithms, networking, database design, JVM internals, flash-sale systems, microservices, domain‑driven design, security, high‑availability, big data, and blockchain.

Big Data · Software Architecture · System Design
0 likes · 6 min read
dbaplus Community
Nov 28, 2020 · Operations

How a Chinese City Bank Integrated DevOps, AI, and Big Data to Transform Operations

This case study details how a city‑bank leveraged DevOps and ITIL integration, AI‑driven monitoring, and Spark‑based big‑data analytics to build a unified development‑testing‑operations platform, improve service availability, shorten deployment cycles, and achieve near‑99.99% system uptime across its core banking services.

AI · Big Data · DevOps
0 likes · 17 min read