Tagged articles

3697 articles

Nov 27, 2020 · Artificial Intelligence

Mining User Housing Preference Schemes with Supply‑Filtered Tree‑Based Methods

The article proposes a supply‑filtered, tree‑based approach to discovering multi‑dimensional user housing preference schemes, contrasts it with fixed‑length preference mining methods, and details algorithmic modules such as split‑point search, similarity calculation, split suppression, and user clustering to improve interpretability and offline applicability.

AI · Big Data · housing recommendation

0 likes · 13 min read

Practical DevOps Architecture

Nov 27, 2020 · Big Data

Step-by-Step Guide to Install and Configure a Hadoop 2.8.2 Cluster

This tutorial provides a complete walkthrough for downloading Hadoop 2.8.2, setting up a three‑node master‑slave cluster, configuring core, HDFS, MapReduce and YARN settings, creating required directories, distributing the installation, starting the services, verifying the cluster status, and finally shutting it down.

Big Data · Cluster Setup · HDFS

0 likes · 5 min read

dbaplus Community

Nov 26, 2020 · Big Data

Silicon Valley's Data Middle Platform Secrets: EA, Twitter, Airbnb, Uber

This article examines how leading Silicon Valley companies such as EA, Twitter, Airbnb, and Uber design and operate data middle platforms—detailing their architectures, data collection pipelines, standardization efforts, real‑time and batch processing, and the business impact of shared data capabilities.

Big Data · Data Architecture · Data Platform

0 likes · 25 min read

DataFunTalk

Nov 26, 2020 · Big Data

Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Architecture and Technology

This article details the evolution of 58.com’s commercial data warehouse across three phases—1.0, 2.0, and 3.0—covering its scale, four‑layer architecture, migration from legacy Hadoop‑MapReduce pipelines to Flume/Kafka and Flink streaming, code optimizations, monitoring, and productization for real‑time business insights.

Big Data · ETL · Hadoop

0 likes · 9 min read

Big Data Technology Architecture

Nov 25, 2020 · Big Data

Data Lake Storage Architecture Selection and JindoFS on Alibaba Cloud

This article explains the concept and benefits of data lakes, outlines the storage and acceleration challenges they pose, presents an ideal checklist for selecting a data lake solution, and evaluates Alibaba Cloud's JindoFS against that checklist, highlighting its capabilities for big‑data and AI workloads.

Alibaba Cloud · Big Data · Data Lake

0 likes · 9 min read

dbaplus Community

Nov 24, 2020 · Databases

How ClickHouse Enables Millisecond‑Scale User Profiling for Hundreds of Millions

This article explains how Suning built a high‑performance user‑tag platform on ClickHouse, replacing Elasticsearch with bitmap‑based storage and a new architecture that delivers sub‑second profiling queries for over 600 million users, detailing the design, implementation, and future enhancements.

Big Data · ClickHouse · OLAP

0 likes · 14 min read

DataFunTalk

Nov 24, 2020 · Artificial Intelligence

Building Next‑Generation Data Intelligence Infrastructure with Knowledge Graphs: From New Infrastructure to Cognitive AI Platforms

This presentation explains how knowledge graphs serve as the foundation for new‑infrastructure initiatives, detailing the evolution of AI from perception to cognition, the role of big‑data centers, DIKW modeling, intelligent data governance, and the construction of a cognitive AI middle‑platform for industry applications.

AI Infrastructure · Artificial Intelligence · Big Data

0 likes · 18 min read

Big Data Technology Architecture

Nov 24, 2020 · Big Data

Using DeltaLake for Industrial Data Platforms: Distributed Stream Processing, Batch‑Stream Fusion, and Transactional Support

This article shares practical experiences of building an industrial data middle‑platform with DeltaLake, covering heterogeneous distributed stream handling, batch‑stream unified analytics, and transactional/algorithm support to improve data timeliness, reliability, and operational efficiency in manufacturing environments.

Batch-Stream Fusion · Big Data · DeltaLake

0 likes · 11 min read

Alibaba Cloud Developer

Nov 23, 2020 · Big Data

How Alibaba’s CCO Built a Cloud‑Native Real‑Time Data Warehouse with Hologres

Alibaba’s Customer Experience (CCO) team transformed its real‑time data platform by evolving from a Lambda‑style database architecture to a cloud‑native real‑time data warehouse powered by Hologres and Flink, achieving higher throughput, lower latency, reduced costs, and self‑service analytics for massive Double‑11 traffic.

Alibaba · Big Data · Flink

0 likes · 15 min read

Alibaba Cloud Developer

Nov 22, 2020 · Big Data

How Flink’s Stream‑Batch Integration Powered Alibaba’s Record‑Breaking Double‑11

Alibaba’s 2020 Double‑11 achieved unprecedented real‑time processing of 4 billion records per second and 7 TB of data per second using Flink, showcasing the stability, performance and efficiency of its stream‑batch unified architecture across diverse business scenarios.

Alibaba · Batch processing · Big Data

0 likes · 15 min read

Big Data Technology & Architecture

Nov 21, 2020 · Big Data

Big Data Performance Testing: Objectives, Timing, Steps, Tools, and Optimization

This article outlines the purpose, timing, procedures, tools, and optimization techniques for big data performance testing, providing detailed guidance on test planning, execution, metric collection, and analysis to ensure reliable and efficient big data system deployments.

Benchmark · Big Data · Hadoop

0 likes · 7 min read

Alibaba Cloud Developer

Nov 19, 2020 · Databases

How AnalyticDB Powers Double 11: Cloud‑Native Data Warehouse Innovations

AnalyticDB, a cloud‑native MySQL‑compatible data warehouse, delivered extreme performance during Double 11 by handling billions of orders with ultra‑high write TPS, while introducing compute‑storage separation, hot‑cold tiering, resource groups, elastic scaling and intelligent optimization to meet demanding real‑time analytics workloads.

AnalyticDB · Big Data · Resource Groups

0 likes · 17 min read

Java Architect Essentials

Nov 19, 2020 · Artificial Intelligence

Overview of Didi’s Open‑Source Projects Across AI, Big Data, Operations, Mobile and Frontend

This article presents a comprehensive catalog of more than 40 open‑source projects released by Didi, covering AI runtimes, speech and NLP engines, big‑data loaders, middleware, mobile frameworks, frontend UI libraries and various operational tools, each with a brief description and a GitHub link.

AI · Big Data · Didi

0 likes · 18 min read

Meituan Technology Team

Nov 19, 2020 · Big Data

Optimizing Apache Kylin for High‑Performance OLAP in Meituan's Sales System

Meituan’s sales system “Qingtian” boosted OLAP performance by migrating Apache Kylin’s build engine from MapReduce to Spark, consolidating Hive files, refining dictionary creation, applying a By‑layer algorithm, and bulk‑loading cuboid files to HBase, cutting resource consumption and halving build time, ultimately reaching a 100% SLA.

Apache Kylin · Big Data · Meituan

0 likes · 15 min read

Tencent Tech

Nov 19, 2020 · Cloud Computing

How Tencent Built a Massive Cloud Storage System to Power QQ Album and Beyond

This article chronicles Tencent's journey from the early development of the TFS distributed storage platform to large‑scale data migrations, flexible bandwidth strategies, and the creation of the cloud‑native YottaStore, illustrating how a small architecture team solved massive storage challenges for billions of users.

Big Data · Data Migration · YottaStore

0 likes · 15 min read

DeWu Technology

Nov 19, 2020 · Operations

HBase Operations and Use Cases for High‑Concurrency E‑commerce

In this talk, Yun Jin explains how HBase’s petabyte‑scale, horizontally‑scalable architecture—built on Hadoop, HMaster, RegionServers, and Zookeeper—enables e‑commerce platforms to handle extreme promotion‑day traffic by supporting high‑throughput reads/writes, time‑series monitoring, massive order storage, and robust HA, while covering essential table operations, monitoring, and troubleshooting techniques.

Big Data · HBase · Operations

0 likes · 6 min read

JD Retail Technology

Nov 19, 2020 · Big Data

Building JD's Enterprise-wide Big Data Platform: Architecture, Stages, and Challenges

This article summarizes Bao Yongjun’s presentation on JD.com’s end‑to‑end big data platform, covering its strategic value, industry trends, architectural design, development phases from scale‑out to intelligent real‑time processing, and future directions for a cloud‑native, AI‑driven data ecosystem.

Big Data · JD.com · data governance

0 likes · 16 min read

Java High-Performance Architecture

Nov 18, 2020 · Big Data

Why Pulsar Might Outperform Kafka: Key Advantages and Drawbacks

This article examines Apache Pulsar, an open‑source messaging platform created by Yahoo, compares it with Kafka by outlining Kafka’s common pain points, highlights Pulsar’s multi‑tenant architecture, layered storage, built‑in functions, and security features, and discusses the trade‑offs of each solution.

Apache Pulsar · Big Data · Distributed Systems

0 likes · 6 min read

JD Tech Talk

Nov 17, 2020 · Databases

JUST Engine: Novel Spatio‑Temporal Indexes and Data Models for Large‑Scale Urban Data Management

The article introduces the JUST engine, a spatio‑temporal data platform that extends GeoMesa with three new indexes (Z2T, XZ2T, time_range), defines nine common and three specialized data models, provides default indexing strategies, and offers detailed SQL usage guidelines for efficient querying of massive urban datasets.

Big Data · Databases · GeoMesa

0 likes · 25 min read

Big Data Technology & Architecture

Nov 16, 2020 · Big Data

Understanding Data Skew in Big Data: Causes, Symptoms, and Solutions for Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, how to recognize its symptoms such as stuck reducers or OOM executors, and presents practical strategies—including business‑level adjustments, code refactoring, and platform‑specific tuning—to mitigate the problem.

Big Data · Hadoop · Spark

0 likes · 13 min read

Alibaba Cloud Native

Nov 16, 2020 · Cloud Native

What’s New in Fluid 0.4? DataLoad, Small‑File Boost, HDFS Support & Multi‑Dataset Deployment

Fluid 0.4 introduces a DataLoad custom resource for declarative data pre‑warming, enhances support for massive small‑file datasets, adds HDFS‑compatible access for Spark and other big‑data frameworks, and enables mixed‑deployment of multiple datasets on a single node, all backed by significant performance gains.

AI · Alluxio · Big Data

0 likes · 8 min read

DataFunSummit

Nov 15, 2020 · Big Data

Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Using Hadoop, Flume, Kafka, Spark, and Flink

This article details the three‑stage evolution of 58.com’s commercial data warehouse, describing its massive scale, four‑layer architecture, technical challenges, migrations from MapReduce to Hive and Flink, real‑time streaming upgrades, and the resulting improvements in stability, accuracy, and timeliness.

Big Data · Data Architecture · Data Warehouse

0 likes · 10 min read

dbaplus Community

Nov 15, 2020 · Big Data

Mastering Real‑Time Stream Processing with Flink: From Fundamentals to Kuaishou Production

This article walks through the evolution of big‑data systems to modern stream processing, explains core Flink concepts such as state, checkpoints, event‑time and windowing, and details Kuaishou’s real‑time UV calculation and fast‑failover techniques for high‑availability streaming jobs.

Big Data · Flink · Kafka

0 likes · 21 min read

Beike Product & Technology

Nov 13, 2020 · Big Data

Beike One‑Stop Big Data Development Platform: Architecture, Evolution, and Future Outlook

The article summarizes Beike's one‑stop big data development platform, describing its data business background, the evolution from a simple Hadoop‑Kafka‑Hive stack to a metadata‑driven, asset‑oriented platform, and outlines current capabilities in data management, integration, scheduling, quality, openness, and future plans.

Big Data · Data Platform · Data engineering

0 likes · 11 min read

Tencent Cloud Developer

Nov 13, 2020 · Big Data

Apache Spark Core: Architecture, Components, and Execution Flow

Apache Spark Core is a high‑performance, fault‑tolerant engine that abstracts distributed computation through SparkContext, DAG and Task schedulers, supports in‑memory and disk storage, runs on various cluster managers (YARN, Kubernetes, etc.), and unifies batch, streaming, ML and graph processing via its rich ecosystem.

Apache Spark · Big Data · DAG scheduler

0 likes · 17 min read

DataFunSummit

Nov 12, 2020 · Big Data

OLAP Engine Selection and Challenges in Large-Scale Data at Youku

This article explores the challenges big data brings to traditional data technologies and reviews various OLAP solutions—including MPP, batch processing, pre‑computation, and Hadoop‑based engines—while detailing Youku’s specific business scenarios and how different OLAP engines are selected to meet performance, scalability, and real‑time analysis requirements.

Analytics · Big Data · Data Warehouse

0 likes · 14 min read

Python Crawling & Data Mining

Nov 12, 2020 · Big Data

Scrape and Visualize the Hurun Rich List (2015‑2020) with Python & Pyecharts

This article demonstrates how to scrape the Hurun Rich List from 2015 to 2020 using Python, clean the data, and create interactive visualizations of the top 20 wealth holders, wealth trends over six years, and industry shifts with Pyecharts.

Big Data · Python · Wealth Analysis

0 likes · 4 min read

Xianyu Technology

Nov 11, 2020 · Industry Insights

How Alibaba’s Double‑11 Tech Stack Powers Record‑Breaking Live Commerce

Alibaba’s Double 11 showcased a suite of cutting‑edge technologies, including the GRTN real‑time transmission network, edge‑AI voice interaction, massive digital infrastructure, AI‑driven smart sample rooms, and 3D virtual home‑decoration live streams, that together delivered sub‑second latency, a 30% cost reduction, and unprecedented merchant scalability.

3D virtual reality · Big Data · Digital Infrastructure

0 likes · 11 min read

Architect

Nov 11, 2020 · Big Data

Real-time Click Stream Data Warehouse with Flink and ClickHouse: Architecture, Layered Design, and Practical Tips

This article explains how to build a real‑time click‑stream data warehouse using Flink for stream processing and ClickHouse for near‑real‑time OLAP, covering click‑stream characteristics, dimensional modeling, layered warehouse design, async dimension joins, sink implementation, and data rebalancing strategies.

Big Data · Click Stream · ClickHouse

0 likes · 7 min read

DataFunTalk

Nov 11, 2020 · Big Data

Evolution and Practices of Cainiao's Real‑Time Data Warehouse for International Import Business

This article details the high‑complexity logistics scenario of Cainiao's international import business, explains the evolution from offline to real‑time data warehouses (versions 1.0 and 2.0), describes the layered architecture, enumerates technical challenges such as multi‑source joins, state explosion, out‑of‑order processing, and presents concrete solutions using Flink features, logical middle‑layers, union‑all joins, deduplication, timer services, and batch‑stream hybrid processing.

Big Data · Flink · state-management

0 likes · 21 min read

Practical DevOps Architecture

Nov 11, 2020 · Big Data

Step-by-Step Guide to Installing and Configuring Apache Flume on a Cluster

This guide walks through downloading Apache Flume, setting up a master‑slave cluster, and configuring NetCat, Exec, and Avro sources with corresponding sinks and memory channels, including verification commands to ensure the agents run correctly.

Apache Flume · Big Data · Cluster Setup

0 likes · 5 min read

JD Tech Talk

Nov 9, 2020 · Big Data

Trajectory-Based Population Flow Analysis for COVID‑19 Prevention Using HBase and Spark

The article presents a comprehensive big‑data solution that stores massive GPS trajectory records in HBase, processes them with Spark to identify individuals who visited a pandemic source region, and visualizes their spatio‑temporal distribution in target cities to support precise epidemic control measures.

Big Data · COVID-19 · HBase

0 likes · 8 min read

Big Data Technology & Architecture

Nov 8, 2020 · Big Data

Flume Tuning Guide for High‑Throughput Data Ingestion

This article explains how to identify and resolve performance bottlenecks in Apache Flume by configuring Taildir sources, optimizing channel capacities, tuning Kafka sinks, adjusting JVM options, and using simple monitoring scripts, enabling a single Flume‑NG agent to sustain over 50,000 RPS in production.

Big Data · Flume · Kafka

0 likes · 10 min read

Big Data Technology & Architecture

Nov 7, 2020 · Databases

Understanding Sparse Indexes in Databases: Kafka and ClickHouse Examples

This article explains the concept of sparse indexes in database systems, compares them with dense indexes, and demonstrates their use in Kafka log files and ClickHouse MergeTree tables, highlighting implementation details, lookup procedures, and configuration parameters.

Big Data · ClickHouse · Databases

0 likes · 8 min read

iQIYI Technical Product Team

Nov 6, 2020 · Big Data

HBase Overview and Step‑by‑Step Installation Guide

This article introduces HBase’s column‑oriented architecture, explains the roles of Master, RegionServer, and Zookeeper, and provides detailed environment preparation and installation commands for setting up an HBase cluster on Hadoop.

Big Data · Cluster · Database

0 likes · 8 min read

StarRing Big Data Open Lab

Nov 5, 2020 · Cloud Native

How Transwarp Scheduler Tackles Mixed Workloads in Unified Cloud‑Native Infrastructure

This article reviews the challenges of scheduling heterogeneous workloads—micro‑services, big‑data, AI, and HPC—on a unified cloud‑native platform, compares existing schedulers like Mesos and YARN, examines Kubernetes ecosystem extensions such as Volcano and YuniKorn, and details the design and components of the Transwarp Scheduler built on Kubernetes Scheduling Framework v2.

AI · Big Data · Scheduler

0 likes · 16 min read

dbaplus Community

Nov 3, 2020 · Big Data

How Ctrip Boosted Hotel Data Warehouse Performance 400% with ClickHouse

Ctrip’s hotel data team tackled a 3 TB daily data load by building a ClickHouse cluster on VMware, creating custom sync and execution tools, applying query optimizations, and handling merge and memory errors, ultimately achieving over 400% performance gains across multiple reporting themes.

Big Data · ClickHouse · Data Warehouse

0 likes · 7 min read

JD Tech Talk

Nov 3, 2020 · Big Data

Efficient Detection of Suspected Infected Crowds Using Spatio‑Temporal Trajectory Analysis

The article presents a novel spatio‑temporal trajectory risk metric and an efficient query framework, including a space‑first index (SFT) and distributed storage, to identify high‑risk contacts of COVID‑19 patients from massive movement data, with real‑world deployment results in several Chinese cities.

Big Data · COVID-19 · indexing

0 likes · 12 min read

AntTech

Nov 2, 2020 · Frontend Development

Opportunities and Challenges of Enterprise Data Visualization Applications

The talk outlines why enterprise data visualization is essential for extracting value from massive, multi‑dimensional data, describes design and development challenges, presents AntV's comprehensive frontend visualization solutions, and predicts future trends such as intelligent, democratized, and decision‑integrated visual analytics.

AntV · Big Data · Data visualization

0 likes · 15 min read

Big Data Technology & Architecture

Nov 2, 2020 · Big Data

Log Collection and Processing Architecture with Flume and Kafka for Big Data Platforms

This article explains how to design a scalable log collection system for big‑data platforms by combining Flume for data ingestion, Kafka for buffering and high‑throughput transport, and downstream processing components, providing configuration examples and best‑practice recommendations.

Big Data · Flume · Kafka

0 likes · 9 min read

Liangxu Linux

Nov 2, 2020 · Big Data

Master Shell Tricks to Analyze Beijing Points‑Based Residency Data in Seconds

This article demonstrates how to use standard shell utilities such as grep, cut, sort, uniq, awk, and join to quickly extract insights—like top companies, common surnames, popular given names, age distribution, and hometown rankings—from a JSON dataset of Beijing points‑based residency applicants.

Big Data · Data Analysis · JSON

0 likes · 13 min read

Big Data Technology & Architecture

Nov 1, 2020 · Big Data

Hive Performance Tuning: Parallel Execution, Strict Mode, JVM Reuse, and Speculative Execution

This article explains Hive performance tuning techniques, including enabling parallel execution, configuring strict mode to prevent risky queries, reusing JVMs to reduce overhead, and using speculative execution to mitigate slow tasks, with configuration examples and practical considerations.

Big Data · Hive · JVM Reuse

0 likes · 8 min read

Top Architect

Oct 31, 2020 · Big Data

Building a Zhihu User Data Crawler and Large‑Scale Analysis with SpringBoot, SeimiCrawler, RabbitMQ, ElasticSearch, and Kibana

This article describes how to build a Java‑based crawler to collect millions of Zhihu user profiles, handle anti‑crawling measures with rotating user‑agents and a proxy pool, deduplicate data using a Bloom filter, import the results into ElasticSearch, and analyze the dataset with Kibana and ECharts visualizations.

Big Data · Elasticsearch · Java

0 likes · 15 min read

Big Data Technology & Architecture

Oct 31, 2020 · Big Data

Hive Performance Tuning: Understanding Map and Reduce Counts

This article explains how Hive determines the number of map and reduce tasks based on input file size and block configuration, discusses when to increase or decrease map counts, and provides practical commands for adjusting reducer settings to optimize large‑scale data processing.

Big Data · Hive · MapReduce

0 likes · 6 min read

Tencent Cloud Middleware

Oct 30, 2020 · Cloud Computing

How KonaJDK Powers Tencent Cloud Java, Big Data, and Secure Computing

This article explains how Tencent's self‑developed KonaJDK underpins cloud Java services, enhances micro‑service monitoring, adds national cryptography support, optimizes large‑heap tools like jmap, and delivers performance gains for big‑data workloads, while contributing key features back to the OpenJDK community.

Big Data · Cloud Computing · JVM

0 likes · 11 min read

ITPUB

Oct 30, 2020 · Fundamentals

Why Java Remains the Dominant Programming Language Across Industries

The article outlines Java’s history, its widespread adoption by top companies, key features such as simplicity, portability and security, and its extensive use in big‑data frameworks, IoT, Android, finance, web development, scientific tools, and cloud services, arguing why it will stay popular.

Big Data · IoT · Java

0 likes · 11 min read

21CTO

Oct 30, 2020 · Big Data

Which Log Collection System Wins? Scribe, Chukwa, Kafka, Flume & ELK Compared

This article reviews the background, requirements, and architectural designs of major open‑source log collection systems—including Facebook’s Scribe, Apache’s Chukwa, LinkedIn’s Kafka, Cloudera’s Flume—and evaluates mature monitoring tools such as ELK, highlighting their features, use cases, advantages, and drawbacks for large‑scale log processing.

Big Data · ELK · Flume

0 likes · 18 min read

Zhongtong Tech

Oct 30, 2020 · Big Data

How Apache Kylin Supercharged OLAP at ZTO Express: A Deep Dive

This article details ZTO Express's journey of adopting Apache Kylin for OLAP, comparing it with Presto, describing platform architecture, performance gains, integration with scheduling and monitoring systems, and the practical optimizations and future plans that enabled sub‑second query responses on massive daily data volumes.

Apache Kylin · Big Data · HBase

0 likes · 16 min read

Alibaba Cloud Developer

Oct 29, 2020 · Frontend Development

How Big Data and AI Are Redefining Front‑End Development

From the early days of static web pages to today's data‑driven, AI‑enhanced interfaces, this article explores how the big‑data boom and artificial‑intelligence advances since 2010 have transformed front‑end technologies, driving innovations in data visualization, web‑based software, and diverse user interactions.

AI · Big Data · Data visualization

0 likes · 11 min read

DataFunTalk

Oct 25, 2020 · Big Data

Bilibili's Saber Real-Time Computing Platform: Architecture, Challenges, and AI Integration

Zheng Zhisheng from Bilibili presents the Saber real-time computing platform, detailing its pain points, evolution, Apache Flink‑based architecture, SQL‑centric BSQL programming, DAG drag‑and‑drop design, AI use cases, and future development plans to improve scalability, operability, and AI integration.

AI Integration · Apache Flink · BSQL

0 likes · 19 min read

Architect's Tech Stack

Oct 23, 2020 · Big Data

Sorting a 4.6 GB File with 500 Million Integers: Internal, Bitmap, and External Sorting Techniques

The article explains how to sort a massive 4.6 GB file containing 500 million random integers by first attempting in‑memory quicksort and merge sort, then using a bitmap approach, and finally applying an external sort that splits the data into manageable chunks and merges them efficiently.

Big Data · Java · Sorting

0 likes · 8 min read

Big Data Technology & Architecture

Oct 21, 2020 · Big Data

An Introduction to Apache Hudi: Concepts, Design Principles, and Architecture

This article introduces Apache Hudi, explaining its core concepts, design principles, table architecture, write and compaction mechanisms, and the three query modes that enable efficient batch and incremental processing on modern data lakes.

Apache Hudi · Big Data · Data Lake

0 likes · 21 min read

Big Data Technology & Architecture

Oct 19, 2020 · Big Data

Delta Lake: ACID Transactions, Schema Management, and Unified Batch‑Streaming for Data Lakes

Delta Lake adds ACID transaction support, schema enforcement, data versioning, and unified batch‑and‑stream processing to Apache Spark‑based data lakes, addressing reliability, quality, performance, and update challenges of traditional data lake architectures.

ACID Transactions · Apache Spark · Big Data

0 likes · 13 min read

Tencent Cloud Developer

Oct 19, 2020 · Big Data

Improving Spark Write Performance for Massive Files on Object Storage with Tencent Cloud EMR

By parallelizing Spark’s driver‑side commit, trash, and move phases (previously single‑threaded operations that caused costly copy‑on‑rename when writing massive files to object storage), the Tencent Cloud EMR case achieved a more than tenfold (1,100%) speedup, making object storage a viable alternative to HDFS.

Big Data · Distributed computing · EMR

0 likes · 8 min read

ITPUB

Oct 16, 2020 · Big Data

How NetEase Cloud Music Built a Real‑Time Data Warehouse with Flink & Calcite

This article details NetEase Cloud Music's evolution of a real‑time data warehouse built on Flink 1.9 and Calcite, covering platform scale, architectural design, metadata management, SDK simplifications, monitoring improvements, and concrete use cases such as AB‑testing, live reporting, and feature serving.

Big Data · Calcite · Flink

0 likes · 8 min read

Yuewen Technology

Oct 16, 2020 · Artificial Intelligence

How Intelligent Traffic Distribution Boosts New Book Exposure in Reading Apps

This article describes the design and implementation of an intelligent traffic distribution system for a reading platform, detailing its background, overall architecture, sub-modules such as the small‑traffic experiment platform, near‑line computation, retrieval strategies, pacing algorithms, and how it balances user personalization with content ecosystem growth.

AI · Big Data · Real-time Streaming

0 likes · 8 min read

Architects' Tech Alliance

Oct 15, 2020 · Big Data

Why Data Lakes and Data Warehouses Are Merging: The Rise of the Lakehouse Era

This article traces the 20‑year evolution of big‑data technologies, compares data lakes and data warehouses, explains their complementary strengths, and presents Alibaba Cloud’s lakehouse solution that unifies storage and compute to deliver flexible, performant, and cost‑effective analytics for enterprises.

Big Data · Cloud Computing · Data Lake

0 likes · 30 min read

Big Data Technology & Architecture

Oct 15, 2020 · Big Data

Meituan's OLAP Requirements and Apache Kylin Deployment: Architecture, Challenges, and Comparative Analysis

This article describes Meituan's massive OLAP workloads, the specific challenges of data scale, complex schemas, and precise counting, explains how Apache Kylin was integrated using wide tables and bitmap deduplication, compares its performance and features with Presto, Druid and other engines, and outlines future improvements.

Apache Kylin · Big Data · Data Warehouse

0 likes · 19 min read
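
The bitmap deduplication mentioned in the Meituan summary above can be illustrated with a minimal sketch. It assumes users are identified by small non-negative integers and lets a plain Python integer stand in for the compressed bitmap a real OLAP engine would store; the function names `to_bitmap`, `merge`, and `distinct_count` are hypothetical and not taken from the article or from Kylin.

```python
from typing import Iterable


def to_bitmap(user_ids: Iterable[int]) -> int:
    """Set one bit per user ID; repeated IDs collapse onto the same bit."""
    bitmap = 0
    for uid in user_ids:
        if uid < 0:
            raise ValueError("user IDs must be non-negative")
        bitmap |= 1 << uid
    return bitmap


def merge(*bitmaps: int) -> int:
    """Union several pre-aggregated bitmaps with bitwise OR."""
    merged = 0
    for bitmap in bitmaps:
        merged |= bitmap
    return merged


def distinct_count(bitmap: int) -> int:
    """Exact number of distinct users is the number of set bits."""
    return bin(bitmap).count("1")


# Two pre-aggregated cells (for example, two days) that share user 3.
day1 = to_bitmap([1, 2, 3, 3])
day2 = to_bitmap([3, 4])
weekly_users = distinct_count(merge(day1, day2))
```

The point of the design is that per-cell distinct counts cannot simply be added (3 + 2 would double-count user 3), whereas bitmaps can be merged across cells and still yield an exact count.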

DataFunTalk

Oct 15, 2020 · Big Data

Real‑Time Computing for Online Education: Architecture, Data Platform and Automation at VIPKID

This article explains how VIPKID leverages real‑time streaming with Flink to build a unified data platform, automatically tag and process help requests during 1‑v‑1 live classes, and achieve significant reductions in manual monitoring while improving course quality and user experience.

Big Data · Flink · Real-time Streaming

0 likes · 14 min read

Big Data Technology & Architecture

Oct 13, 2020 · Big Data

Understanding Stateful Functions: API, Runtime, and Stream Processing with Apache Flink

This article explains the open‑source Stateful Functions framework, its API and Flink‑based runtime, and how it simplifies building distributed stateful applications by combining serverless concepts with robust state management for event‑driven architectures.

Apache Flink · Big Data · Distributed Systems

0 likes · 8 min read

DataFunTalk

Oct 12, 2020 · Big Data

Building a General Real‑Time Data Warehouse: Methods and Practices at Meituan Waimai

This article introduces Meituan Waimai's approach to constructing a universal real‑time data warehouse, covering streaming technology choices, Lambda/Kappa architectures, layered design, platformization, SLA management, and a practical Lambda‑style use case for real‑time analytics.

Big Data · Doris OLAP · Flink

0 likes · 16 min read

Alibaba Cloud Developer

Oct 11, 2020 · Operations

How Alibaba’s SLS Powers a Unified Observability Platform for Massive Data

Alibaba Cloud’s Log Service (SLS) has evolved into a unified observability middle‑platform that handles tens of petabytes daily, offering integrated storage, processing, and AI‑driven analysis for logs, metrics, and traces, while addressing challenges of data ingestion, performance, and scalability across diverse Ops scenarios.

Big Data · Log Analytics · Observability

0 likes · 16 min read

ITPUB

Oct 10, 2020 · Big Data

How Didi Scaled Presto for Petabyte‑Scale Queries: Architecture & Optimizations

Didi’s three‑year journey with Presto transformed it into the company’s primary ad‑hoc and Hive‑SQL acceleration engine, serving over 6,000 users, processing 2‑3 PB of HDFS data daily, and achieving major gains in stability, performance, cost, and usability through extensive architectural tweaks, resource isolation, connector extensions, and monitoring enhancements.

Big Data · Cluster Management · Druid Connector

0 likes · 18 min read

JD Tech Talk

Oct 10, 2020 · Big Data

Discovering Real-Time Reachable Areas Using Trajectory Connections

This article presents a novel method for real-time reachable area analysis that leverages recent trajectory data, introduces a Skip Graph Index for efficient query processing, predicts optimal trajectory‑splicing parameters with machine learning, and demonstrates its effectiveness through extensive experiments on multiple real‑world datasets.

Big Data · k-value prediction · real-time reachable area

0 likes · 13 min read

Didi Tech

Oct 9, 2020 · Big Data

Presto at Didi: Architecture, Optimizations, and Operational Experience

At Didi, Presto has been the default ad‑hoc and Hive‑SQL engine for over three years, serving 6,000 users, processing 2‑3 PB daily and 30‑35 trillion rows, with mixed and dedicated clusters, migration to PrestoSQL 340, extensive Hive compatibility, label‑based isolation, a native Druid connector, usability and stability enhancements, and JVM‑level performance optimizations, while planning further resource‑saving upgrades.

Big Data · Cluster Management · Distributed SQL

0 likes · 17 min read

Alibaba Terminal Technology

Oct 9, 2020 · Frontend Development

How Big Data and AI Are Redefining Front‑End Development

From the early days of static web pages to today’s data‑driven, AI‑enhanced interfaces, this article explores how the rise of big data platforms like Alibaba Cloud’s Feitian has transformed front‑end development through advanced visualization, software‑Web convergence, and diverse new interactions.

Big Data · Cloud Computing · Data visualization

0 likes · 9 min read

DataFunTalk

Oct 7, 2020 · Big Data

Yanxuan Data Warehouse: Architecture, Standards, and Evaluation Framework

This article outlines the Yanxuan data warehouse’s layered architecture, the offline and real‑time development platforms, the comprehensive standards for metric definition, model design, and SQL development, and proposes a six‑dimensional evaluation system covering data norms, security, quality, stability, continuous improvement, and development efficiency.

Big Data · Data engineering · SQL Standards

0 likes · 12 min read

DataFunTalk

Sep 30, 2020 · Big Data

Real-time Data Warehouse Construction for Didi Ride-hailing's Carpool Service

This article details Didi's end‑to‑end real‑time data warehouse design for the carpool business, covering its objectives, architecture layers from ODS to application, naming conventions, StreamSQL development, operational tooling, challenges faced, and future batch‑stream integration plans.

Big Data · Didi · Flink

0 likes · 20 min read

IT Architects Alliance

Sep 29, 2020 · Big Data

How Qualitis Ensures High‑Availability Data Quality Monitoring on Big Data Platforms

Qualitis is a data‑quality management service built on big data platforms that defines, detects, and reports quality issues in data sets, featuring idempotent backend services, load‑balanced high availability, ZooKeeper‑coordinated process synchronization, thread‑pool throttling, and clearly separated internal and external APIs.

Big Data · Data Quality · Qualitis

0 likes · 6 min read

Architects Research Society

Sep 29, 2020 · Big Data

Understanding DataOps: Principles, Benefits, and Implementation

DataOps, an Agile‑derived methodology that extends DevOps principles to data analytics, emphasizes automation, collaboration, and continuous delivery to accelerate and improve data processing, quality, and business insight, while outlining its benefits, relationship to Agile/DevOps, and practical steps for adoption.

Big Data · Continuous Analytics · DataOps

0 likes · 12 min read

Tencent Advertising Technology

Sep 29, 2020 · Artificial Intelligence

The Power of Data and AI: Highlights from the 2020 Tencent Advertising Algorithm Live Week

The 2020 Tencent Advertising Algorithm Live Week presented expert insights on federated learning, machine learning, big data, and deep‑learning applications in advertising, offering a comprehensive Q&A that explains how massive data fuels AI breakthroughs and reshapes business problem solving.

Big Data · machine learning

0 likes · 11 min read

High Availability Architecture

Sep 29, 2020 · Artificial Intelligence

Architecture Design Overview of Recommendation Systems

This article reviews the core algorithm modules of recommendation systems from an architectural perspective, discussing offline, near‑line, and online layers, the trade‑offs between personalization, timeliness, and resource consumption, system boundaries, external dependencies, and the practical design of each layer.

AI · Big Data · architecture

0 likes · 30 min read

Big Data Technology & Architecture

Sep 29, 2020 · Big Data

Implementing Real-Time TopN Rankings with Apache Flink

This article demonstrates how to develop a real-time TopN ranking feature in Apache Flink, covering stream setup, word count aggregation, global and grouped TopN calculations, and nested TopN strategies to mitigate hotspot issues, complete with Java code examples.

Big Data · Flink · Java

0 likes · 8 min read
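
The nested TopN idea from the Flink summary above can be sketched outside Flink in a few lines. This is an illustrative Python approximation, not the article's Java code: words are spread over a fixed number of shards, each shard computes a local TopN, and only the local winners are merged globally, so no single task has to rank every key. The helper names `top_n` and `nested_top_n` and the shard count are assumptions.

```python
import heapq
import zlib
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def top_n(counts: Mapping[str, int], n: int) -> List[Tuple[str, int]]:
    """Return the n highest (word, count) pairs, ties broken by word."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: (kv[1], kv[0]))


def nested_top_n(words: Iterable[str], n: int, shards: int = 4) -> List[Tuple[str, int]]:
    """Two-stage TopN: local TopN per shard, then a global merge."""
    per_shard = [Counter() for _ in range(shards)]
    for word in words:
        # A stable hash keeps every occurrence of a word on the same shard,
        # so each shard holds complete counts for its own keys.
        per_shard[zlib.crc32(word.encode("utf-8")) % shards][word] += 1

    local_winners = {}
    for shard_counts in per_shard:
        local_winners.update(top_n(shard_counts, n))

    # Keys are disjoint across shards, so the global TopN is always
    # contained in the union of the local TopN lists.
    return top_n(local_winners, n)


sample = "a a a b b c d d d d".split()
ranking = nested_top_n(sample, 2)
```

The second stage ranks at most `n * shards` entries instead of the full key space, which is what relieves the hotspot in the final aggregation.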

DataFunTalk

Sep 25, 2020 · Big Data

Meituan Waimai Data Warehouse: Architecture Evolution, Governance, and Future Roadmap

The article details Meituan Waimai's offline data warehouse evolution from its initial V1.0 design through V2.0 improvements to the V3.0 modeling‑tool driven architecture, covering the four‑layer framework, Spark‑based ETL, data governance processes, resource optimization, security measures, and future development plans.

Big Data · ETL · Meituan

0 likes · 22 min read

Big Data Technology & Architecture

Sep 24, 2020 · Big Data

HiveSQL Classic Optimization Cases: Partitioning, Subset Decomposition, and Percentile Approximation Improvements

This article presents three HiveSQL optimization case studies—restructuring a large‑scale query with partitioned tables, breaking a complex window‑function query into smaller subsets with joins, and refactoring excessive PERCENTILE_APPROX calls—demonstrating how each change reduces execution time from hours to minutes and improves overall performance.

Big Data · Hive · HiveSQL

0 likes · 9 min read

Suning Technology

Sep 24, 2020 · Artificial Intelligence

How AI and Big Data Are Revolutionizing Carrefour’s In‑Store and Home Delivery Services

During the Mid‑Autumn holiday, Carrefour’s Nanjing Bridge store showcased how AI‑driven big‑data analytics, self‑checkout innovations, and smart supply‑chain upgrades are reshaping retail operations, from receipt printing to pre‑packaged produce and rapid home‑delivery services.

AI · Big Data · Digital Transformation

0 likes · 7 min read

Java Architect Essentials

Sep 23, 2020 · Big Data

Evolution of JD.com Order Center Elasticsearch Cluster Architecture

The article details how JD.com's order center migrated its massive order query workload from MySQL to Elasticsearch, iteratively improving cluster isolation, node deployment, replica tuning, master‑slave redundancy, version upgrades, and data synchronization while addressing performance pitfalls such as deep pagination and FieldData usage.

Big Data · Cluster Architecture · Elasticsearch

0 likes · 12 min read

JD Tech Talk

Sep 23, 2020 · Artificial Intelligence

Delivery Time Inference Based on Couriers' Trajectories

Leveraging large-scale courier trajectory data and spatiotemporal analytics, the DTInf framework infers parcel delivery times by detecting stay points, correcting delivery locations, and matching delivery events using a trained MLP model, achieving a mean absolute error of 401 seconds and outperforming baselines by over 30%.

Big Data · Logistics · courier trajectories

0 likes · 10 min read

Tencent Cloud Developer

Sep 22, 2020 · Big Data

Evolution and Architecture of Beike's OLAP Platform: From Hive/MySQL to Multi‑Engine Flexibility

Beike’s OLAP platform has progressed from a basic Hive‑MySQL batch pipeline to a Kylin‑based single‑engine solution, and now to a flexible multi‑engine architecture that uses a query‑engine layer to route metrics across Kylin, Druid, ClickHouse and Doris, dramatically cutting cube‑build times, supporting real‑time ingestion, and paving the way for further engine consolidation and automated performance routing.

Apache Druid · Apache Kylin · Beike

0 likes · 17 min read

Big Data Technology & Architecture

Sep 19, 2020 · Big Data

Understanding Kafka Consumer Group Rebalance and Timeout Mechanisms

This article explains how Kafka consumer groups assign partitions, the four situations that trigger a rebalance, the impact of consumer poll timeouts, and practical ways to tune max.poll.interval.ms and max.poll.records to avoid rebalance‑related errors.

Big Data · Kafka · Timeout

0 likes · 12 min read
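
For the poll-timeout tuning described in the Kafka rebalance summary above, a small calculation makes the trade-off between `max.poll.interval.ms` and `max.poll.records` concrete. The sketch below is only arithmetic, not a Kafka client; the two constants are the Kafka consumer defaults, while the function names and the example processing time are hypothetical.

```python
# Kafka consumer client defaults for the two settings discussed above.
DEFAULT_MAX_POLL_INTERVAL_MS = 300_000  # max.poll.interval.ms (5 minutes)
DEFAULT_MAX_POLL_RECORDS = 500          # max.poll.records


def per_record_budget_ms(max_poll_interval_ms: int, max_poll_records: int) -> float:
    """Average processing time available per record between two poll() calls."""
    if max_poll_records <= 0:
        raise ValueError("max_poll_records must be positive")
    return max_poll_interval_ms / max_poll_records


def risks_rebalance(avg_processing_ms: float,
                    max_poll_interval_ms: int,
                    max_poll_records: int) -> bool:
    """True when a full batch would take longer than max.poll.interval.ms."""
    return avg_processing_ms * max_poll_records > max_poll_interval_ms


budget = per_record_budget_ms(DEFAULT_MAX_POLL_INTERVAL_MS, DEFAULT_MAX_POLL_RECORDS)

# Hypothetical workload: each record takes about one second to process.
too_slow_with_defaults = risks_rebalance(1_000, 300_000, 500)
safe_with_smaller_batch = risks_rebalance(1_000, 300_000, 200)
```

With the defaults, each record has roughly 600 ms on average; a slower handler needs either a longer poll interval or a smaller batch to stay in the group.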

Big Data Technology & Architecture

Sep 18, 2020 · Big Data

Understanding the Elasticsearch Master Election Process

This article explains when Elasticsearch triggers a master election, describes each election stage—including active master and candidate selection, Bully algorithm comparison, and master node responsibilities—while providing code excerpts that illustrate the underlying implementation details.

Big Data · Cluster Management · Distributed Systems

0 likes · 8 min read
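
A Bully-style comparison like the one named in the Elasticsearch election summary above might look roughly as follows. This is a simplified sketch under stated assumptions, not Elasticsearch source code: it assumes candidates are ranked by newest cluster state first and then by smallest node ID, and that an election only succeeds when a quorum of master-eligible nodes is visible. The `Candidate` type and `elect_master` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    node_id: str
    cluster_state_version: int
    master_eligible: bool = True


def elect_master(candidates: Iterable[Candidate],
                 minimum_master_nodes: int) -> Optional[Candidate]:
    """Pick a master among eligible nodes, or None if there is no quorum."""
    eligible = [c for c in candidates if c.master_eligible]
    if len(eligible) < minimum_master_nodes:
        return None  # not enough master-eligible nodes to hold an election
    # Assumed ordering: higher cluster state version wins, then lower node ID.
    return min(eligible, key=lambda c: (-c.cluster_state_version, c.node_id))


nodes = [
    Candidate("node-b", 7),
    Candidate("node-a", 7),
    Candidate("node-c", 5),
    Candidate("node-0", 9, master_eligible=False),  # data-only node, ignored
]
winner = elect_master(nodes, minimum_master_nodes=2)
```

Because every node applies the same deterministic ordering to the same candidate list, nodes that see the same candidates agree on the same winner without further negotiation.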

Big Data Technology & Architecture

Sep 18, 2020 · Big Data

Understanding Kafka Consumer Groups, Partition Assignment, and Offset Management

This article explains how Kafka consumer groups accelerate message consumption by distributing partitions across multiple consumers, details the three key characteristics of consumer groups, and provides in‑depth guidance on partition assignment strategies and offset management with practical Java code examples.

Big Data · Kafka · Offset Management

0 likes · 13 min read

Suning Technology

Sep 18, 2020 · Operations

How Suning’s Tech‑Driven Strategy Revitalized Carrefour China’s Retail Operations

This article details Suning’s comprehensive digital transformation of Carrefour China, outlining three implementation phases, the integration of AI, big data and cloud technologies, and the resulting operational efficiencies, sales growth, and enhanced omnichannel retail experience.

AI · Big Data · Carrefour

0 likes · 9 min read

Youku Technology

Sep 18, 2020 · Big Data

Digitalization of Youku Long‑Video Content Supply Chain: Practices and Architecture

Youku’s digital content‑supply‑chain system transforms long‑video production by introducing a three‑stage framework—structured evaluation of talent and scripts, information‑driven production management, and a unified demand‑aligned content strategy—that curtails delays, mitigates risk, and saves over 100 million RMB while scaling to billions of data records daily.

Artificial Intelligence · Big Data · Content Supply Chain

0 likes · 11 min read

Big Data Technology & Architecture

Sep 17, 2020 · Big Data

Monitoring Kafka Consumer Groups with kafka-consumer-groups and Kafka Manager

This article explains how to monitor Kafka consumer groups using the built‑in kafka‑consumer‑groups tool and the Kafka Manager UI, providing commands, field explanations, and setup steps to ensure real‑time data availability for downstream services such as MongoDB or Elasticsearch.

Big Data · Kafka · Kafka Manager

0 likes · 4 min read
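
The key figure that consumer-group monitoring reports, as in the summary above, is lag: the gap between a partition's log end offset and the group's committed offset. The sketch below models that relationship with hypothetical numbers; the `PartitionOffsets` type and the topic name are illustrative, and the comments map fields to the `CURRENT-OFFSET`, `LOG-END-OFFSET`, and `LAG` columns printed by `kafka-consumer-groups --describe`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PartitionOffsets:
    topic: str
    partition: int
    current_offset: int   # CURRENT-OFFSET: last offset committed by the group
    log_end_offset: int   # LOG-END-OFFSET: latest offset written to the partition

    @property
    def lag(self) -> int:
        """LAG: messages written but not yet consumed by the group."""
        return max(self.log_end_offset - self.current_offset, 0)


def total_lag(rows: Iterable[PartitionOffsets]) -> int:
    """Sum lag over all partitions of a consumer group."""
    return sum(row.lag for row in rows)


# Hypothetical snapshot of a three-partition topic.
snapshot = [
    PartitionOffsets("orders", 0, current_offset=120, log_end_offset=150),
    PartitionOffsets("orders", 1, current_offset=200, log_end_offset=200),
    PartitionOffsets("orders", 2, current_offset=95, log_end_offset=100),
]
group_lag = total_lag(snapshot)
```

A lag that keeps growing across snapshots is the usual signal that downstream stores are falling behind the topic.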

Full-Stack Internet Architecture

Sep 17, 2020 · Big Data

How Big Data Is Used for Price Discrimination and the New Regulations to Stop It

The article explains how big‑data algorithms enable online price discrimination—often called “kill‑familiar” pricing—illustrates real‑world e‑commerce examples, outlines the recently enacted Chinese online tourism regulation prohibiting such practices, and discusses broader data‑privacy and security concerns.

Big Data · Data Privacy · consumer rights

0 likes · 6 min read

Programmer DD

Sep 17, 2020 · Big Data

5 Open‑Source Quant Trading Tools Every Developer Should Explore

Discover five open‑source stock‑trading utilities—funds, ZVT, QUANTAXIS, StockAnalysisSystem, and match‑trade—each offering real‑time data, backtesting, multi‑asset support, and high‑performance matching to help programmers build powerful quantitative finance applications.

Big Data · Python · Quantitative Trading

0 likes · 5 min read

DataFunTalk

Sep 17, 2020 · Big Data

Design and Implementation of a Scalable User Tag Production Platform

The article explains how a flexible, high‑performance user‑tagging system is built on a batch‑stream integrated architecture using big‑data technologies such as Impala, HDFS, and Flink to support both offline and real‑time label generation for precise marketing, product improvement, and operational analytics.

Big Data · Flink · Impala

0 likes · 15 min read

Java Architect Essentials

Sep 16, 2020 · Big Data

Elasticsearch Adoption and Architecture Cases in Major Chinese Companies

This article reviews how major Chinese companies such as JD.com, Ctrip, Quark, 58.com, and Didi have adopted Elasticsearch for large‑scale search, log analysis, and real‑time analytics, detailing their cluster architectures, scaling strategies, and operational practices.

Big Data · Elasticsearch · Real-time Analytics

0 likes · 11 min read

Big Data Technology & Architecture

Sep 16, 2020 · Big Data

Understanding Flink CEP's NFAb Automaton for Complex Event Processing

This article explains how Flink's Complex Event Processing (CEP) library implements pattern matching using a nondeterministic finite automaton with a match buffer (NFAb), covering its theoretical foundation, construction, state transition semantics, event selection strategies, shared versioned match buffers, and computation state details.

Big Data · CEP · Flink

0 likes · 9 min read

Big Data Technology & Architecture

Sep 16, 2020 · Databases

Optimizing a Complex MySQL Slow Query for Article Comments

This article analyzes a 60‑second MySQL query that retrieves article comments with multiple filters, explains why the optimizer chooses a small table as the driver, and presents a step‑by‑step optimization, including avoiding semi‑joins, improving index usage, refining range conditions, and moving GROUP BY into a subquery, that reduces execution time to 1.3 seconds, a roughly 46‑fold speedup.

Big Data · Database · MySQL

0 likes · 13 min read

Architects Research Society

Sep 15, 2020 · Big Data

Key Factors to Consider When Building Your Own Data Warehouse

This article examines essential considerations for selecting and designing a data warehouse—including data volume, scalability, on‑premises versus cloud options, pricing models, and ETL/ELT approaches—to help organizations choose the most suitable solution for their needs.

Big Data · Data Warehouse · Scalability

0 likes · 9 min read

Huawei Cloud Developer Alliance

Sep 15, 2020 · Big Data

Mastering ETL: 8 Essential Algorithms for Modern Data Warehouses

This article explains why ETL is a critical step in building data warehouses, introduces eight core ETL algorithms—including full delete/insert, upsert, append, and various link‑table models—describes their ideal use cases, and provides ready‑to‑run SQL code examples for each.

Algorithms · Big Data · Data Warehouse

0 likes · 12 min read
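
Three of the load patterns named in the ETL summary above (full delete/insert, upsert, and append) can be contrasted in a short sketch. It uses Python's built-in `sqlite3` with an in-memory database purely for illustration; the `target` table, its columns, and the helper names are hypothetical, and the article's own SQL is not reproduced here.

```python
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[int, str]


def make_target() -> sqlite3.Connection:
    """Create an in-memory database with a small keyed target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
    return conn


def full_reload(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Full delete/insert: wipe the target, then load the complete snapshot."""
    with conn:  # one transaction, so readers never see a half-empty table
        conn.execute("DELETE FROM target")
        conn.executemany("INSERT INTO target (id, name) VALUES (?, ?)", rows)


def upsert(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Upsert: replace rows whose key already exists, insert the rest."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO target (id, name) VALUES (?, ?)", rows
        )


def append(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Append: add new rows only; existing rows are never touched."""
    with conn:
        conn.executemany("INSERT INTO target (id, name) VALUES (?, ?)", rows)


def snapshot(conn: sqlite3.Connection) -> list:
    return conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()


conn = make_target()
append(conn, [(1, "a"), (2, "b")])
upsert(conn, [(2, "B"), (3, "c")])
after_upsert = snapshot(conn)
full_reload(conn, [(9, "z")])
after_reload = snapshot(conn)
```

Which pattern fits depends on the source: full reloads suit small dimension tables, upserts suit keyed incremental extracts, and appends suit immutable event or log data.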

Alibaba Cloud Developer

Sep 15, 2020 · Big Data

How JindoFS Accelerates Data Lakes: Deep Dive into Storage‑Compute Optimization

This article explains why data lake acceleration is essential, outlines the three key architectural decisions for big‑data architects, and details Alibaba Cloud's JindoFS solutions—including basic adaptation, cache acceleration, and deep‑customization modes—to boost performance and reliability of lake storage and compute.

Big Data · JindoFS · OSS

0 likes · 18 min read

Big Data Technology & Architecture

Sep 13, 2020 · Big Data

ClickHouse Deployment, Management, and Monitoring Practices in Production

This article explains ClickHouse's strengths as a high‑performance MPP database, details hardware selection, read/write separation, shard expansion steps, batch‑size tuning, and presents a three‑layer monitoring model, while also describing its practical application in Tencent's game analytics platform.

Big Data · ClickHouse · Data Warehouse

0 likes · 19 min read

Selected Java Interview Questions

Sep 12, 2020 · Databases

Understanding Elasticsearch Internals: Architecture, Lucene Indexing, Sharding, and Scaling

This article explains the internal workings of Elasticsearch, covering its cloud‑based cluster architecture, Lucene‑based indexing structures such as segments, shards, inverted indexes, stored fields and doc values, as well as search processing, caching, merging, routing, and scaling strategies.

Big Data · Elasticsearch · Sharding

0 likes · 13 min read

Big Data Technology & Architecture

Sep 11, 2020 · Big Data

Evolution of JD.com Order Center Elasticsearch Cluster Architecture

This article details how JD.com's order center migrated its Elasticsearch cluster from a simple, default‑configured setup to a highly available, multi‑replica, dual‑cluster architecture with version upgrades, data synchronization strategies, and performance optimizations to support billions of documents and hundreds of millions of daily queries.

Big Data · Cluster Architecture · Elasticsearch

0 likes · 12 min read

Ctrip Technology

Sep 10, 2020 · Big Data

Design and Implementation of a Unified Log Framework for Ctrip Payment Center

The article describes the design, architecture, and operational details of a unified logging framework at Ctrip's payment center, covering log production via a Log4j2 extension, Kafka‑Camus collection, Hive/ORC storage, MapReduce parsing optimizations, and governance strategies for TB‑scale daily log volumes.

Big Data · Camus · Hadoop

0 likes · 15 min read

DataFunTalk

Sep 10, 2020 · Databases

Graph‑Based Real‑Time Content Update Architecture at Youku: Challenges, Design, and Practice

This technical presentation explains how Youku tackles the massive, real‑time update problem of video‑content graphs by adopting a graph‑database architecture, sub‑graph partitioning, schema‑driven logical views, and Flink‑based pipelines to achieve second‑level updates for billions of entities and attributes.

Big Data · Flink · Graph Database

0 likes · 15 min read
