Tagged articles

3697 articles

Sep 9, 2020 · Databases

How to Speed Up Massive MySQL User‑Log Tables: Partitioning, Indexing, and Migration Strategies

This article examines performance problems with a 20‑million‑row MySQL user‑log table on Alibaba Cloud RDS, outlines three solution paths—optimizing the existing database, migrating to a MySQL‑compatible high‑performance service, and adopting a big‑data engine—and provides detailed guidance on schema design, indexing, partitioning, and practical SQL tweaks.

Big Data · Database Optimization · MySQL

0 likes · 17 min read

DataFunTalk

Sep 9, 2020 · Big Data

NetEase Big Data User Profiling: Architecture, Tagging System, and Real‑World Applications

This presentation details NetEase's massive multi‑domain data ecosystem, the design of its user‑profile center—including basic, behavior, preference, and predictive tags—ID‑mapping techniques, quality assurance processes, and several real‑time and offline use cases such as marketing, recommendation, growth operations, advertising, and fraud detection.

Big Data · ID-Mapping · Tag Management

0 likes · 13 min read

dbaplus Community

Sep 8, 2020 · Databases

Achieving Billion‑Row Second‑Level Queries with ClickHouse Real‑Time Engine

JD’s Algorithmic Intelligence team built a ClickHouse‑based real‑time analytics engine that ingests Kafka and offline data, uses MergeTree tables with strategic partitioning and sorting, and employs batch writes, materialized views, and monitoring to achieve second‑level queries over billions of rows.

Big Data · ClickHouse · MergeTree

0 likes · 17 min read

Alibaba Cloud Developer

Sep 7, 2020 · Big Data

How Alibaba’s ADC Project Automates Real‑Time SQL Generation with Design Patterns and Priority Queues

This article explains how the Alibaba DChain Data Converger (ADC) automatically creates wide‑table SQL for real‑time cross‑database analytics by using a pipeline architecture, priority‑queue‑driven task scheduling, and specific design patterns to handle metadata, joins, and resource management.

Big Data · Real-time Data · SQL Generation

0 likes · 13 min read

DataFunTalk

Sep 7, 2020 · Big Data

Real‑time Data Warehouse Architecture and Best Practices in Alibaba Search Recommendation

This article presents Alibaba's search‑recommendation real‑time data warehouse, describing its business background, typical use cases, key requirements, the evolution from architecture 1.0 to 2.0 with Flink and Hologres, best‑practice patterns such as row/column storage, stream‑batch integration, high‑concurrency updates, and future directions like real‑time joins and persistent dimension storage.

Big Data · Flink · Hologres

0 likes · 13 min read

Architecture Digest

Sep 3, 2020 · Databases

Practical Elasticsearch Performance and Stability Tuning Guide

This article consolidates practical Elasticsearch tuning techniques—including configuration file adjustments, system‑level optimizations, and usage‑level settings—to improve cluster performance, stability, and resource efficiency for production environments.

Big Data · Cluster Configuration · Elasticsearch

0 likes · 15 min read

Big Data Technology & Architecture

Sep 2, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Features, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark to ingest, manage, and incrementally query large analytical datasets on HDFS‑compatible storage, offering features such as timeline management, copy‑on‑write and merge‑on‑read tables, and support for snapshot, incremental, and read‑optimized queries across engines like Hive, Spark SQL and Presto.

Apache Hudi · Big Data · Data Lake

0 likes · 12 min read

Big Data Technology & Architecture

Sep 1, 2020 · Big Data

Configuring Hadoop to Support LZO Compression

This guide explains how to enable LZO compression in Hadoop by installing the Twitter‑provided hadoop‑lzo library, updating core‑site.xml, synchronizing files across nodes, creating LZO indexes, and running a WordCount MapReduce job with LZO‑compressed output.

Big Data · Hadoop · LZO

0 likes · 6 min read

DataFunTalk

Sep 1, 2020 · Big Data

NetEase Real-Time Computing Platform (Sloth): Architecture, Practices, and Future Outlook

This article introduces NetEase's real-time computing platform Sloth, detailing its architecture, component layers, integrated IDE, operational tooling, unified metadata management, challenges such as Kudu write amplification, and proposes a tiered real‑time data‑warehouse model with a vision for storage‑compute separation and unified batch‑stream APIs.

Big Data · Flink · Kafka

0 likes · 13 min read

Xianyu Technology

Sep 1, 2020 · Artificial Intelligence

Interest-Based Live Stream Recommendation System for Xianyu

Within three weeks, the team built an interest‑based live‑stream recommendation platform for Xianyu that combined operational insights, BI analysis, and offline algorithms to generate user‑anchor interest tags, sync them to an online graph, and dramatically boost top‑room UV and click‑through rates.

Big Data · Graph Database · interest tagging

0 likes · 8 min read

Laravel Tech Community

Aug 31, 2020 · Big Data

Evolution of JD Daojia Order System Elasticsearch Cluster Architecture

This article details the step‑by‑step evolution of the JD Daojia order‑center Elasticsearch cluster—from an initial loosely configured deployment to a real‑time dual‑cluster architecture with replica tuning, master‑slave adjustments, data‑sync strategies, and lessons learned about pagination, fielddata, and doc values—highlighting how each phase improved query throughput, stability, and scalability for billions of documents.

Big Data · Cluster Architecture · Elasticsearch

0 likes · 12 min read

Big Data Technology & Architecture

Aug 31, 2020 · Big Data

Integration Methods of Hive and Spark SQL (Potential Interview Topics)

This article provides a comprehensive guide on integrating Hive with Spark SQL, covering Hive‑on‑Spark and Spark‑on‑Hive setups, spark‑shell and spark‑sql usage, HiveServer2 with Beeline, Scala scripts for reading and writing Hive tables, and partition handling for aggregated results.

Big Data · Data Integration · Hive

0 likes · 7 min read

Big Data Technology & Architecture

Aug 30, 2020 · Big Data

Kylin Cube Construction Principles and Optimization Techniques

This article explains the fundamentals of Kylin Cube construction—including dimensions, measures, Cuboid generation, layer-by-layer and in‑memory building algorithms, storage mechanisms, and various optimization strategies such as derived dimensions, aggregation groups, row‑key design, and concurrency granularity—providing a comprehensive guide for big‑data OLAP practitioners.

Big Data · Cube · Kylin

0 likes · 14 min read

DataFunTalk

Aug 30, 2020 · Big Data

Large-Scale Recommendation System Feature Engineering and Optimization with Spark and FESQL

This article explains how large-scale recommendation systems rely on efficient feature engineering, describes the three-layer architecture (offline, stream, online), and details how Spark SQL and the LLVM‑optimized FESQL engine improve performance and ensure offline‑online feature consistency.

Big Data · FESQL · Feature Engineering

0 likes · 13 min read

Big Data Technology & Architecture

Aug 27, 2020 · Big Data

HBase Architecture, Components, and Operations Overview

This article provides a comprehensive overview of Apache HBase’s architecture, detailing its core components such as RegionServer, HMaster, ZooKeeper, WAL, MemStore, and HFiles, and explains key processes including read/write paths, compaction, region splitting, load balancing, and recovery mechanisms.

Big Data · Database Architecture · Distributed Systems

0 likes · 17 min read

Tencent Cloud Developer

Aug 27, 2020 · Big Data

Elasticsearch Overview: Architecture, Lucene Foundations, Application Scenarios, and Optimizations

Elasticsearch, built on Apache Lucene, provides a distributed, near‑real‑time search platform that scales to billions of documents across thousands of nodes, supporting use cases such as log analytics, time‑series monitoring, and product search, while Tencent’s CES adds advanced availability, performance, and cost‑optimizing features.

Big Data · Elasticsearch · Performance Optimization

0 likes · 17 min read

Big Data Technology & Architecture

Aug 26, 2020 · Big Data

Advanced ClickHouse Path Analysis, Funnel, Retention, and Session Statistics

This article demonstrates how to leverage ClickHouse’s parametric aggregate and higher‑order functions to perform path matching, intelligent path detection, ordered funnel conversion, retention calculation, and session statistics for user behavior analysis in a big‑data environment.

Analytics · Big Data · ClickHouse

0 likes · 11 min read

Big Data Technology & Architecture

Aug 25, 2020 · Big Data

Understanding Kafka's Segment Storage and Index Design

This article explains how Kafka splits each partition's log into segments, stores each segment as a paired index file and log file, and uses sparse indexing to enable efficient offset lookups, illustrating the process with examples and diagrams of segment layout and offset lookup.

Big Data · Kafka · Segment

0 likes · 4 min read

Efficient Ops

Aug 24, 2020 · Operations

How to Scale Elasticsearch for PB‑Level Game Logs: Real‑World Strategies & Lessons

This article walks through a mid‑size gaming company's journey of deploying, tuning, and scaling an Elasticsearch cluster for massive log volumes, covering hot‑cold node architecture, ILM policies, shard management, Logstash‑Kafka optimization, emergency expansions, and the promise of searchable snapshots to achieve petabyte‑scale storage with cost efficiency.

Big Data · Elasticsearch · ILM

0 likes · 28 min read

Didi Tech

Aug 24, 2020 · Big Data

Evolution and Architecture of DiDi Data Channel Service

DiDi’s Data Channel Service evolved from a fragmented component system into a unified, SLA‑driven platform with a UI‑based Sync Center and Flink‑powered StreamSQL engine, dramatically improving task creation speed, resource utilization, and reliability while automating issue diagnosis for company‑wide real‑time and offline data synchronization.

Big Data · ETL · Flink

0 likes · 12 min read

58 Tech

Aug 24, 2020 · Big Data

Design and Practice of an Online Real-Time Feature System for Intelligent Risk Control

This article presents the concepts, architecture, and practical techniques of an online real‑time feature system used in intelligent risk‑control, covering feature definition, time‑window types, calculation functions, distributed processing, low‑latency storage, and operational challenges in high‑concurrency environments.

Big Data · Feature Engineering · Real-time Processing

0 likes · 16 min read

Big Data Technology & Architecture

Aug 23, 2020 · Big Data

Integrating Flink 1.11 with Hive Streaming, Kafka, and Table API

This article demonstrates how to use Flink 1.11's enhanced Hive integration to stream data from a Kafka source, write it into partitioned Hive tables with checkpoint‑driven commits, and read Hive tables as a continuous source using dynamic table options and table hints.

Big Data · Flink · Hive

0 likes · 13 min read

Big Data Technology & Architecture

Aug 23, 2020 · Big Data

Apache Hudi Overview, Core Concepts, and Quick‑Start Guide

This article introduces Apache Hudi, explaining its storage types, query views, timeline feature, typical use cases such as near‑real‑time ingestion and incremental pipelines, and provides a step‑by‑step Scala/Spark quick‑start guide with code examples for compiling, inserting, updating, querying, and syncing data to Hive.

Apache Hudi · Big Data · Data Lake

0 likes · 18 min read

Big Data Technology & Architecture

Aug 22, 2020 · Big Data

Integrating Kerberos with Spark on CDH: Configuration, Deployment, and Troubleshooting Guide

This guide explains how to prepare a CDH‑based Spark environment for Kerberos authentication, covering prerequisite knowledge, classpath adjustments, HBase configuration files, spark‑env settings, user permission grants, running jobs with spark‑submit, and common troubleshooting steps.

Big Data · CDH · HBase

0 likes · 12 min read

Java Architect Essentials

Aug 21, 2020 · Big Data

Design and Integration of Flume, Kafka, Storm, Drools, and Redis for Real‑Time ETL Log Analysis

This article presents a modular architecture for real‑time ETL log analysis that combines Flume for log collection, Kafka as a buffering layer, Storm for stream processing, Drools for rule‑based data transformation, and Redis for fast storage, detailing installation, configuration, and code integration steps.

Big Data · Drools · Flume

0 likes · 23 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Practical Guide to Building an Advertising Project with Spark and Kudu

This article provides a step‑by‑step tutorial on deploying a Spark‑based advertising data pipeline using Kudu, covering Hadoop setup, HDFS data loading, Spark application refactoring, Maven packaging, execution on YARN, and crontab scheduling for daily automated runs.

Big Data · Hadoop · Kudu

0 likes · 11 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Business Project: Data Statistics and Processing Guide

This article demonstrates how to implement an advertising business data statistics pipeline using Spark and Kudu, detailing metric requirements, Scala processing code, complex SQL aggregations, schema design, and data sinking for verification.

Big Data · Data Processing · Kudu

0 likes · 7 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Business Project: Step-by-Step Implementation

This article walks through the complete implementation of an advertising statistics pipeline using Spark and Kudu, covering requirement analysis, Scala code development, SQL queries, schema definition, and data sinking, with full code snippets and execution results.

Big Data · Data Processing · Kudu

0 likes · 7 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Project: Refactoring, Scala Traits, ETL Processor, and Project Entry

This article walks through a Spark and Kudu advertising project, explaining the refactoring approach, Scala trait usage, implementation of ETL and province‑city statistics processors, and shows the complete Spark application entry point with full code examples.

Big Data · Data Processing · ETL

0 likes · 7 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Project: Province‑City Statistics and Data Persistence

This tutorial walks through a Spark‑Kudu advertising project that computes province‑city distribution statistics using SQL, defines the necessary schema, and demonstrates how to write the aggregated results back to a Kudu table for persistent storage, complete with Scala code examples.

Big Data · Data engineering · Kudu

0 likes · 4 min read

Huawei Cloud Developer Alliance

Aug 21, 2020 · Big Data

How Big Data and IoT Are Transforming Vehicle Networks: Opportunities and Challenges

This article explains the concepts of the Internet of Things and big data, explores how massive sensor data fuels smart transportation and vehicle networking, outlines practical applications such as real‑time traffic control and autonomous driving, and analyzes the technical and managerial bottlenecks hindering future growth.

Autonomous Driving · Big Data · IoT

0 likes · 13 min read

Liangxu Linux

Aug 19, 2020 · Operations

How to Quickly Analyze Beijing Residency Data with Shell Commands

This tutorial shows how to use standard Unix shell tools such as grep, cut, sort, uniq, awk, and join to extract insights—top companies, most common surnames, popular given names, age distribution, and hometown statistics—from a JSON dataset of over 6,000 Beijing residency applicants.

Big Data · Data Analysis · JSON

0 likes · 13 min read

Big Data Technology & Architecture

Aug 19, 2020 · Big Data

Big Data ETL Project: Parsing Advertising JSON with Spark, IP Lookup, and Storing into Kudu

This tutorial describes how to place advertising JSON data on HDFS, use Spark for ETL and analysis, enrich logs with IP lookup, and persist the results into Kudu with daily scheduling, including code examples and schema definitions.

Big Data · ETL · IP lookup

0 likes · 17 min read

dbaplus Community

Aug 18, 2020 · Big Data

Designing a Scalable Financial Data Warehouse: Modeling, Layers, and Quality Control

This article outlines a comprehensive approach to building a financial data warehouse, covering background needs, modeling methodologies, a layered architecture (I, C, S, R), data quality monitoring, metadata management, and detailed naming and coding standards to ensure maintainable, high‑quality data pipelines.

Big Data · Data Quality · Data Warehouse

0 likes · 14 min read

Suning Technology

Aug 18, 2020 · Backend Development

Boosting Mega‑Sale Stability: Suning’s Backend Data Components in Action

The article details how Suning’s transaction middle‑platform leverages custom TPS collection, advanced flow‑control, big‑data analytics, and AI‑driven forecasting to ensure system stability, capacity planning, and intelligent inventory distribution during the high‑traffic 818 promotional event.

AI · Backend · Big Data

0 likes · 17 min read

Big Data Technology & Architecture

Aug 18, 2020 · Big Data

End-to-End Real-Time Web Log Processing with Flume, Kafka, Spark Streaming, HBase, and Spring Boot

This tutorial demonstrates how to generate simulated web access logs in Python, schedule the generator with crontab, collect the logs in real time using Flume, forward them to Kafka, process the streams with Spark Streaming, store results in HBase, and visualize the data via a Spring Boot application with ECharts.

Big Data · ECharts · Flume

0 likes · 36 min read

Beike Product & Technology

Aug 17, 2020 · Big Data

Bitmap-Based User Segmentation in a DMP Platform Using ClickHouse

This article describes how a data management platform (DMP) at Beike leverages ClickHouse bitmap structures and Spark pipelines to generate global numeric user IDs, design tag-specific bitmap rules for enum, continuous, and date attributes, handle boundary cases, and produce high‑performance bitmap SQL for real‑time user group estimation and complex segment logic.

Big Data · ClickHouse · DMP

0 likes · 17 min read

Big Data Technology & Architecture

Aug 17, 2020 · Big Data

Complex Event Processing (CEP) with Flink: Concepts, Pattern API, and a Scala Practical Example

This article introduces Complex Event Processing (CEP), explains its core concepts and features, details Flink's Pattern API with individual, combined, and group patterns, and provides a complete Scala example that detects three consecutive login failures within three seconds using Flink CEP.

Big Data · CEP · Flink

0 likes · 10 min read

Big Data Technology & Architecture

Aug 16, 2020 · Big Data

Comprehensive Overview of HDFS: Architecture, Advantages, Limitations, Commands, and Advanced Features

This article provides a detailed introduction to HDFS, covering its application scenarios, core architecture, fault‑tolerance benefits, drawbacks such as high latency and small‑file inefficiency, essential shell and API commands, cluster management procedures, and newer Hadoop 2.0 features like HA, Federation, snapshots, ACLs, and heterogeneous storage.

Big Data · CLI · Data Storage

0 likes · 10 min read

Big Data Technology & Architecture

Aug 15, 2020 · Big Data

Step-by-Step Guide to Building an ELK Stack with Kafka, Zookeeper, Logstash, and Filebeat for Log Collection

This tutorial provides a comprehensive, step-by-step procedure for setting up a log‑collection pipeline using Filebeat, Kafka, ZooKeeper, Logstash, Elasticsearch, and Kibana across multiple servers, covering hardware preparation, system tuning, software installation, configuration files, and verification commands.

Big Data · ELK · Filebeat

0 likes · 11 min read

Big Data Technology & Architecture

Aug 15, 2020 · Big Data

Understanding Data Lakes: Concepts, Architecture, Vendor Solutions, and Practical Use Cases

This comprehensive article explains what a data lake is, outlines its core characteristics and reference architecture, compares major cloud providers' data‑lake offerings, presents typical advertising and gaming use cases, and proposes a practical, agile process for building and operating a data lake.

Big Data · Data Architecture · Data Lake

0 likes · 50 min read

Suning Technology

Aug 14, 2020 · Big Data

Building SuNing’s Supply‑Chain Data Platform with DDD and Big‑Data Design

This article recounts SuNing’s step‑by‑step journey of designing and implementing a supply‑chain data middle platform, outlining its business rationale, DDD‑based domain modeling, layered system architecture, and practical deployment insights that illustrate how a tailored big‑data solution can enhance data services and governance.

Big Data · DDD · Data Platform

0 likes · 11 min read

Huolala Tech

Aug 13, 2020 · Operations

How Huolala’s “Smart Brain” Uses AI and Optimization to Revolutionize Logistics

At the 2020 Global Logistics Technology Conference in Haikou, Huolala CTO Zhang Hao detailed the company’s self‑developed “Smart Brain” system, which leverages AI, big‑data analytics, IoT and custom optimization algorithms to achieve real‑time, intelligent dispatch, dynamic pricing and safer, more efficient logistics operations.

AI · Big Data · IoT

0 likes · 6 min read

Aikesheng Open Source Community

Aug 13, 2020 · Databases

Introduction to ClickHouse: Features, Installation, Performance Testing, and Comparison

This article introduces ClickHouse, an open‑source column‑oriented OLAP database, detailing its key features, appropriate use cases, installation steps, performance benchmark queries, and how it compares with other columnar storage solutions while highlighting its adoption by major internet companies.

Big Data · ClickHouse · Columnar Database

0 likes · 10 min read

Architecture Digest

Aug 13, 2020 · Big Data

Synchronizing Billion-Row MySQL Data to HBase: Three Practical Schemes and Implementation Guide

This comprehensive guide details three practical methods for syncing massive MySQL datasets to HBase—including Sqoop, Kafka‑Thrift, and Flink pipelines—covering environment setup, configuration, code examples, performance comparisons, and optimization tips for large‑scale data ingestion and querying.

Big Data · Flink · HBase

0 likes · 24 min read

Big Data Technology & Architecture

Aug 13, 2020 · Big Data

Configuring Kerberos‑Enabled HDFS Access with Maven in a Hadoop Cluster

This guide walks through setting up a Maven project, adding Hadoop dependencies, configuring Kerberos (krb5.conf and keytab), loading core‑site.xml, and providing Java utility classes to initialize the HDFS client and list files in an HA‑enabled Hadoop cluster.

Big Data · HDFS · Hadoop

0 likes · 5 min read

Big Data Technology Architecture

Aug 13, 2020 · Databases

Deep Dive into Apache Druid V1 Storage Format: Index Structures and Disk Layout

This article provides a detailed analysis of Apache Druid V1's column‑oriented storage format, covering dimension dictionaries, variable‑length encoded values, bitmap inverted indexes, array handling, and the physical metadata layout that enables sub‑second OLAP queries on massive datasets.

Apache Druid · Big Data · Bitmap Index

0 likes · 8 min read

Tencent Cloud Middleware

Aug 12, 2020 · Big Data

How Serverless Functions Can Replace Traditional Kafka Data Pipelines for Lower Cost and Easier Scaling

This article explains how Tencent Cloud CKafka works, describes the challenges of traditional open‑source data‑flow solutions, and demonstrates a Serverless Function approach—complete with architecture diagrams and code examples—to achieve low‑cost, auto‑scaling Kafka‑to‑Elasticsearch pipelines.

Big Data · CKafka · Elasticsearch

0 likes · 12 min read

IT Architects Alliance

Aug 12, 2020 · Big Data

Introduction to Confluent KSQL for Real-Time Stream Processing

This article introduces Confluent KSQL, a SQL‑based real‑time stream processing engine for Kafka, covering its architecture, stream vs table concepts, query lifecycle, Docker‑based setup, DDL commands, example joins, windowed aggregations, connectors, and its advantages and limitations.

Big Data · Docker · KSQL

0 likes · 9 min read

Big Data Technology & Architecture

Aug 12, 2020 · Big Data

Real‑time User Behavior Collection Using Flume, Kafka, and Spark Streaming on Hadoop

This guide explains how to continuously collect web‑service user behavior logs, route them through Flume agents to Kafka, and finally ingest them with Spark Streaming into HDFS, covering environment preparation, configuration files, deployment steps, and verification procedures.

Big Data · Flume · Hadoop

0 likes · 9 min read

Architects' Tech Alliance

Aug 11, 2020 · Big Data

Comprehensive Overview of Data Middle Platform Architecture, Components, and Practices

This article provides an extensive summary of data middle platform concepts, covering data aggregation, collection tools, offline and real‑time development, data governance, service layers, warehouse construction, and operational practices, illustrating how enterprises build and manage a unified data ecosystem.

Big Data · Data Middle Platform · Data Warehouse

0 likes · 27 min read

Big Data Technology & Architecture

Aug 11, 2020 · Big Data

Consuming Kerberos‑Protected Kafka Data with Spark Streaming and Storing into Kudu

This guide demonstrates how to configure a Spark Streaming application running on YARN in cluster mode to securely consume Kerberos‑protected Kafka topics and write the processed data into Kudu tables, including necessary Java code, Kerberos keytab setup, Kafka client configuration, and spark‑submit commands.

Big Data · Java · Kafka

0 likes · 11 min read

Big Data Technology & Architecture

Aug 10, 2020 · Big Data

Real-time Hot Item, PV, and UV Statistics Using Apache Flink, Kafka, and Bloom Filter

This article demonstrates how to implement real-time hot item ranking, page view counting, and unique visitor estimation using Apache Flink with Kafka sources, sliding windows, custom aggregation functions, and a Bloom filter backed by Redis, providing complete Scala code examples.

Big Data · Flink · Kafka

0 likes · 15 min read

Big Data Technology & Architecture

Aug 10, 2020 · Fundamentals

Understanding Bloom Filter: Concept, Principles, Implementation, and Applications

This article explains the concept, principles, implementation details, and practical applications of Bloom Filters, including formulas for optimal bit array size and hash count, Java code examples using Guava, and common use cases such as deduplication, web crawling, and spam filtering.

Big Data · Guava · Java

0 likes · 12 min read

Python Crawling & Data Mining

Aug 8, 2020 · Big Data

How Python Data Mining Uncovers Why '30 Only' Became a Summer Hit

This article uses Python to scrape and analyze Douban ratings, user comments, and Tencent Video bullet comments (danmu) for the TV drama “30 Only”, revealing the show’s explosive popularity, the most discussed characters, and audience sentiment through statistical charts and word‑cloud visualizations.

Big Data · Python · TV Drama Analysis

0 likes · 11 min read

Big Data Technology & Architecture

Aug 8, 2020 · Big Data

Setting Up InfluxDB and Grafana for Flink Metrics Monitoring

This guide walks through installing InfluxDB and Grafana on CentOS, configuring InfluxDB for Flink metrics storage, creating databases and retention policies, integrating the Flink InfluxDB reporter, and building Grafana dashboards to visualize real‑time Flink job metrics.

Big Data · Flink · Grafana

0 likes · 8 min read

Ctrip Technology

Aug 6, 2020 · Big Data

Data Governance Practices and Model Design in Ctrip Vacation Data Warehouse

This article shares the practical experience and thinking behind Ctrip's vacation data governance project, covering team efficiency optimization, demand sorting, data domain definition, warehouse layering, unified dimension modeling, metric standardization, and the overall benefits of a centralized data governance framework.

Big Data · Ctrip · Data Warehouse

0 likes · 17 min read

Youku Technology

Aug 6, 2020 · Big Data

Alibaba Entertainment Data Platform: The Journey Ahead

The presentation outlines how Alibaba's entertainment data platform has evolved to meet the real‑time, low‑cost, and scalable analytics demands of campaigns such as Double 11 and 618, detailing its architecture, real‑time processing, pre‑computed data cubes, practical design choices, and lessons learned from implementation challenges.

Big Data · Real-time Analytics

0 likes · 1 min read

Big Data Technology & Architecture

Aug 6, 2020 · Big Data

Flink Configuration Parameters and Related Tuning for Kafka and Yarn

This article provides a comprehensive guide to configuring Apache Flink—including job manager and task manager settings, high‑availability via Zookeeper, metrics reporting, as well as Kafka producer tuning and Yarn resource adjustments—to help practitioners optimize big‑data streaming jobs.

Big Data · Flink · HA

0 likes · 8 min read

Big Data Technology & Architecture

Aug 5, 2020 · Big Data

An Introduction to Apache Kylin: Architecture, Core Concepts, Installation, and Enterprise Use Cases

This article provides a comprehensive overview of Apache Kylin, covering its background, core OLAP concepts, technical architecture, installation steps, cube-building methods, real‑world enterprise deployments, and resources for further learning, illustrating how it enables sub‑second query performance on massive datasets.

Apache Kylin · Big Data · Cube

0 likes · 20 min read

Fulu Network R&D Team

Aug 4, 2020 · Big Data

Practical Experience with State Management in Flink Real‑Time Stream Processing

This article shares practical experiences and insights on using different types of state in Apache Flink for real‑time stream processing, covering managed versus raw state, code examples in Scala and Java, handling late data, dimension table joins, distinct semantics, and best‑practice recommendations.

Big Data · Flink · Java

0 likes · 15 min read

Dada Group Technology

Aug 4, 2020 · Big Data

Design and Implementation of the Tianhe Data Tracking Management Platform at Dada Group

The article describes how Dada Group created the Tianhe platform to centralize, standardize, and automate massive data‑tracking (埋点) requirements across multiple product lines, detailing its goals, architecture, core functions, current status, and future development directions.

Big Data · Data Quality · Data Tracking

0 likes · 10 min read

21CTO

Aug 1, 2020 · Big Data

Mastering User Profiling: A Comprehensive Big Data Blueprint

This article explains how enterprises can leverage massive raw and business data to build detailed user profiles, covering tag types, data architecture, development modules, project phases, key deliverables, and a real-world e‑commerce case study.

Big Data · Data Warehouse · ETL

0 likes · 22 min read

DataFunTalk

Aug 1, 2020 · Big Data

User Profiling Methodology and Engineering Solutions

This article explains the fundamentals of user profiling in the big data era, covering tag types, data architecture, development modules, a step‑by‑step implementation process, a practical e‑commerce case study, table design strategies, and both quantitative and qualitative profiling methods.

Big Data · ETL · machine learning

0 likes · 22 min read

Tianxing Digital Tech User Experience

Jul 31, 2020 · Big Data

How Pandemic Data Visualization Evolved: From John Snow’s Cholera Map to Modern COVID Dashboards

This article traces the history and development of pandemic data visualization—from 19th‑century cholera maps and early 2000s SARS charts to sophisticated COVID‑19 dashboards—while outlining five essential design principles that make such visualizations clear, engaging, and impactful.

Big Data · COVID-19 · Design Principles

0 likes · 13 min read

Programmer DD

Jul 31, 2020 · Big Data

How to Find Common URLs in 5 Billion‑Entry Files with Only 4 GB RAM

This article explains how to locate the intersecting URLs between two 5‑billion‑record files (≈320 GB total) using a hash‑based divide‑and‑conquer method that fits within a strict 4 GB memory limit.
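
A small-scale sketch of the hash-based divide-and-conquer idea: both inputs are split into bucket files by the same hash function, so identical URLs always land in the same bucket index, and each bucket pair can then be intersected in memory. The function names, the CRC32 hash, and the bucket count are illustrative assumptions, not details from the article. A real run would pick enough buckets that one bucket fits in the memory budget.

```python
import os
import tempfile
import zlib
from typing import Iterable, Iterator


def partition(urls: Iterable[str], directory: str, prefix: str, buckets: int) -> None:
    """Write each URL to the bucket file chosen by a stable hash of the URL."""
    files = [
        open(os.path.join(directory, f"{prefix}_{i}.txt"), "w", encoding="utf-8")
        for i in range(buckets)
    ]
    try:
        for url in urls:
            index = zlib.crc32(url.encode("utf-8")) % buckets
            files[index].write(url + "\n")
    finally:
        for handle in files:
            handle.close()


def common_urls(urls_a: Iterable[str], urls_b: Iterable[str], buckets: int = 8) -> Iterator[str]:
    """Yield each URL present in both inputs, holding one bucket in memory at a time."""
    with tempfile.TemporaryDirectory() as workdir:
        partition(urls_a, workdir, "a", buckets)
        partition(urls_b, workdir, "b", buckets)
        for i in range(buckets):
            # Load only bucket i of the first input into a set.
            with open(os.path.join(workdir, f"a_{i}.txt"), encoding="utf-8") as fa:
                seen = {line.rstrip("\n") for line in fa}
            # Stream the matching bucket of the second input against that set.
            with open(os.path.join(workdir, f"b_{i}.txt"), encoding="utf-8") as fb:
                for line in fb:
                    url = line.rstrip("\n")
                    if url in seen:
                        seen.remove(url)  # report each common URL once
                        yield url


if __name__ == "__main__":
    left = ["http://a.example", "http://b.example", "http://c.example"]
    right = ["http://c.example", "http://d.example", "http://b.example"]
    print(sorted(common_urls(left, right, buckets=4)))
```

Because only one bucket of the first input is held in memory at a time, peak memory is governed by the largest bucket rather than by the full file size.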

Big Data · Memory Optimization · URL intersection

0 likes · 3 min read

Tencent Cloud Developer

Jul 30, 2020 · Big Data

Cost Governance Practices in Youzan's Data Middle Platform

Youzan's data middle platform faced cost growth outpacing business due to low utilization and storage inefficiencies; they applied utilization standards, containerization, COS storage migration, offline task optimization, and fine-grained cost-billing, achieving a 12% compute boost, 17% batch savings, 80% storage cost cut, and over 25% overall cost reduction.

Big Data · Cloud Computing · Containerization

0 likes · 24 min read

Big Data Technology & Architecture

Jul 30, 2020 · Big Data

Understanding Bucket Sampling Queries in Hive

This article explains Hive's bucket sampling syntax, demonstrates how to use the TABLESAMPLE clause with various bucket parameters, provides concrete SQL examples, and clarifies the underlying hash‑based mechanism that determines which rows are returned.
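
The hash-based selection behind `TABLESAMPLE(BUCKET x OUT OF y ON col)` can be approximated in a few lines. This sketch assumes an integer column whose hash is the value itself, and models the sampling predicate as `(hash & 2147483647) % y == x - 1`. The function name is hypothetical, and hashing of other column types is not covered here.

```python
from typing import Iterable, List


def tablesample_bucket(values: Iterable[int], x: int, y: int) -> List[int]:
    """Return the values that fall into bucket x when hashed into y buckets.

    Assumption: for integer columns the hash is the value itself, and a row
    is kept when (hash & Integer.MAX_VALUE) % y == x - 1.
    """
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    return [v for v in values if (v & 0x7FFFFFFF) % y == x - 1]


if __name__ == "__main__":
    ids = range(10)
    # Roughly corresponds to: SELECT * FROM t TABLESAMPLE(BUCKET 1 OUT OF 4 ON id)
    print(tablesample_bucket(ids, 1, 4))
    print(tablesample_bucket(ids, 2, 4))
```

Under this model, raising `y` shrinks the sample, while changing `x` selects a different, non-overlapping slice of the rows.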

Big Data · Bucket Sampling · Hive

0 likes · 4 min read

Big Data Technology & Architecture

Jul 29, 2020 · Big Data

Sqoop Tutorial: Importing and Exporting Data between Relational Databases, HDFS, Hive, and HBase

This article provides a comprehensive guide to using Sqoop for importing data from relational databases into HDFS, Hive, and HBase, as well as exporting data back to databases, covering command syntax, options, and practical examples for big‑data workflows.

Big Data · HBase · HDFS

0 likes · 8 min read

Tencent Cloud Developer

Jul 29, 2020 · Big Data

Case Study: Optimizing Tencent Cloud Elasticsearch for High‑Volume Game Log Analytics

To handle a gaming company's million‑QPS log stream, the team built a hot‑cold Tencent Cloud Elasticsearch cluster with ILM‑driven tiering, scaled CPU/heap, reduced shard count via shrink and replica tweaks, tuned Logstash‑Kafka pipelines, and employed COS snapshots and searchable snapshots, achieving stable performance and lower cost.

Big Data · Elasticsearch · ILM

0 likes · 29 min read

Youzan Coder

Jul 29, 2020 · Big Data

How We Migrated a 200‑Node Hadoop Cluster Across Data Centers: Lessons and Strategies

This article presents a comprehensive case study of migrating a 200‑plus node Hadoop offline platform across data centers, covering background, architecture, solution evaluation, detailed implementation steps, consistency checks, operational safeguards, encountered issues, and future recommendations.

Big Data · DP Platform · Data Consistency

0 likes · 21 min read

Big Data Technology & Architecture

Jul 28, 2020 · Big Data

Enabling CGroup in Hadoop Yarn NodeManager to Limit Container CPU Resources

This article explains how to enable Linux CGroup support in Hadoop Yarn NodeManager to limit container CPU usage, detailing required configuration properties, hierarchy setup, CPU limit parameters, and a critical kernel version caveat.

Big Data · CPU · Hadoop

0 likes · 7 min read

MaGe Linux Operations

Jul 28, 2020 · Big Data

How Leading Chinese Companies Scale Elasticsearch for Billions of Orders

This article surveys how major Chinese tech firms such as JD.com, Ctrip, Didi, and 58.com deploy and evolve Elasticsearch clusters to handle massive order data, log analysis, real‑time monitoring, and security tasks, detailing architecture choices, shard strategies, multi‑cluster designs, and performance optimizations.

Big Data · Elasticsearch · Order Management

0 likes · 11 min read

Xianyu Technology

Jul 28, 2020 · Operations

ShenTan: Automated Fault Localization System for Online Services

ShenTan is an automated fault‑localization platform for online services that quickly (under five seconds) pinpoints server‑side issues with developer‑level accuracy by aggregating real‑time metrics, applying a decision‑tree model enriched by expert knowledge and dynamic thresholds, and presenting results through an integrated alert and visualization system, while planning broader endpoint coverage and multi‑tenant support.

Big Data · Fault Localization · Operations

0 likes · 12 min read

Big Data Technology & Architecture

Jul 27, 2020 · Big Data

How to View Hadoop/YARN Application Logs via History Server and Yarn Commands

This guide explains how to retrieve Hadoop/YARN application logs using the History Server UI, Yarn command‑line tools, and direct HDFS log access, including commands for listing applications, fetching specific logs, and locating the remote log directory.

Big Data · CLI · HDFS

0 likes · 4 min read

dbaplus Community

Jul 26, 2020 · Big Data

How Prometheus Powers Scalable Monitoring for Massive Big Data Clusters

Facing thousands of nodes in expanding big‑data clusters, the author evaluates legacy monitoring stacks, selects Prometheus + Alertmanager + Grafana, and details its architecture, custom exporters, real‑time alerts, self‑healing mechanisms, and visual dashboards that now support ten large clusters and dozens of services.

Alertmanager · Big Data · Grafana

0 likes · 11 min read

DataFunTalk

Jul 23, 2020 · Big Data

Design and Implementation of a Financial Data Warehouse: Architecture, Modeling, Quality Control, and Metadata Management

This article outlines the end‑to‑end design of a financial data warehouse, covering background needs, modeling methodology choices, a layered architecture, data quality monitoring, metadata management, naming and coding standards, and future improvement directions.

Big Data · Data Quality · SQL Standards

0 likes · 11 min read

Big Data Technology & Architecture

Jul 23, 2020 · Big Data

Comprehensive Kafka FAQ: Uses, Architecture, Offsets, and Partition Management

This article provides an extensive overview of Apache Kafka, covering its use cases, key concepts such as ISR, AR, HW, LEO, and LW, message ordering, the roles of partitioners, serializers and interceptors, producer and consumer client architecture, offset handling, multithreaded consumption, and topic partition management.

Big Data · Kafka · Message queue

0 likes · 16 min read

dbaplus Community

Jul 22, 2020 · Databases

How to Optimize Real‑Time Vector Tile Services for Millions of Features with PostgreSQL & PostGIS

This article explains how to efficiently browse and render millions of GIS features in real‑time vector tiles using PostgreSQL and PostGIS, covering background challenges, several thinning algorithms, their implementation steps, limitations, advantages, and a practical example with a 3‑million‑point dataset.

Big Data · Data Dilution · GIS

0 likes · 8 min read

Big Data Technology & Architecture

Jul 22, 2020 · Big Data

Kafka Architecture and Core Concepts: Producers, Brokers, and Consumers

This article explains Kafka's fundamental architecture, including the roles of producers, brokers, and consumers, key concepts such as topics, partitions, replicas, ISR, and controller, as well as detailed mechanisms of producer client structure, interceptors, serializers, partitioners, and consumer group rebalancing strategies.

Big Data · Distributed Systems · Kafka

0 likes · 22 min read

Alibaba Cloud Developer

Jul 22, 2020 · Big Data

Exploring the Apache Big Data Ecosystem: Hadoop, Spark, Flink, and More

This article surveys the rapidly evolving big data landscape by reviewing a wide range of Apache projects—including Hadoop, Spark, Flink, HBase, Kudu, Impala, Kafka, and others—detailing their core components, architectures, strengths, and typical use‑cases for building distributed data platforms.

Apache · Big Data · Data Processing

0 likes · 20 min read

Tencent Cloud Developer

Jul 21, 2020 · Big Data

Scaling Tencent Meeting Video Stream Quality Analysis with Tencent Cloud Elasticsearch

Facing explosive growth and massive video‑stream quality data, Tencent Meeting migrated its custom Lucene‑based analysis engine to Tencent Cloud Elasticsearch, which delivered over 1 million writes per second, automatic sharding, reduced latency from hours to seconds, and sustained 99.99% availability, proving a high‑performance, scalable solution for large‑scale video conferencing.

Big Data · Cloud Computing · Elasticsearch

0 likes · 16 min read

Big Data Technology & Architecture

Jul 20, 2020 · Big Data

Kafka Workflow and File Storage Mechanism: Topics, Partitions, Segments, Index and Log Files

This article explains Kafka’s workflow, detailing how topics, partitions, and segments are organized, the structure of index and log files, message composition, offset-based retrieval, and the overall data directory layout, providing a comprehensive overview of Kafka’s storage architecture.
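
Offset-based retrieval as summarized here can be sketched as two binary searches: one over segment base offsets to pick the segment, and one over that segment's sparse index to find where to start scanning the log file. The offsets, byte positions, and function names below are hypothetical illustrations rather than values from the article. The 20-digit zero-padded file name follows Kafka's segment naming convention.

```python
from bisect import bisect_right
from typing import List, Tuple


def find_segment(base_offsets: List[int], target_offset: int) -> int:
    """Return the base offset of the segment containing target_offset.

    base_offsets must be sorted ascending; each segment covers offsets from
    its base offset up to (but excluding) the next segment's base offset.
    """
    position = bisect_right(base_offsets, target_offset) - 1
    if position < 0:
        raise ValueError("target offset precedes the first segment")
    return base_offsets[position]


def find_scan_start(index_entries: List[Tuple[int, int]], base_offset: int, target_offset: int) -> int:
    """Return the byte position in the .log file from which to scan forward.

    index_entries is a sorted sparse index of (relative_offset, byte_position).
    The largest entry whose relative offset is <= the target is chosen; if no
    such entry exists, scanning starts at the beginning of the file.
    """
    relative = target_offset - base_offset
    relative_offsets = [entry[0] for entry in index_entries]
    position = bisect_right(relative_offsets, relative) - 1
    return index_entries[position][1] if position >= 0 else 0


if __name__ == "__main__":
    segments = [0, 170410, 239430]                        # hypothetical base offsets
    sparse_index = [(1, 0), (3, 497), (6, 1407), (8, 1686)]  # hypothetical entries

    base = find_segment(segments, 170417)
    start = find_scan_start(sparse_index, base, 170417)
    print(f"{base:020d}.log", start)
```

Keeping the index sparse trades a short sequential scan inside the log file for a much smaller index.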

Big Data · Kafka · OFFSET

0 likes · 8 min read

Big Data Technology & Architecture

Jul 19, 2020 · Big Data

An Overview of Hive, HBase Integration, Apache Phoenix, and Lealone in the Big Data Ecosystem

This article explains Hive's role as a Hadoop‑based data warehouse, its integration with HBase, the advantages and drawbacks of that combination, introduces Apache Phoenix as a high‑performance SQL layer on HBase, and describes the open‑source NewSQL database Lealone, providing practical usage scenarios and performance comparisons.

Big Data · Data Warehouse · HBase

0 likes · 9 min read

Big Data Technology & Architecture

Jul 18, 2020 · Big Data

Common Spark SQL, Spark Core, PySpark, and Streaming Issues and Their Solutions

This article compiles frequent Spark SQL, Spark Core, PySpark, and Streaming problems—such as filesystem errors, configuration pitfalls, memory limits, shuffle failures, and version incompatibilities—along with concise explanations of their causes and step‑by‑step remediation methods for big‑data environments.

Big Data · PySpark · Spark

0 likes · 14 min read

Python Crawling & Data Mining

Jul 17, 2020 · Big Data

What Do Gaokao Numbers Reveal? Python-Powered Deep Dive into China’s College Admissions

This article uses Python to scrape and analyze more than 2,900 data points on Chinese universities and majors, revealing trends in Gaokao participation, provincial enrollment, university types, popularity rankings, and public curiosity about majors, all illustrated with charts and code examples.

Big Data · Gaokao · Python

0 likes · 12 min read

Beike Product & Technology

Jul 16, 2020 · Backend Development

Kafka Connect: Introduction and Concepts for Data Pipelines

This article introduces Kafka Connect, a framework for building scalable data pipelines between Kafka and other systems, covering its architecture, key concepts like connectors and tasks, and practical deployment examples.

Backend Development · Big Data · Distributed Systems

0 likes · 20 min read

Ctrip Technology

Jul 16, 2020 · Big Data

Design and Architecture of the User Profiling System at Ctrip Business Travel

This article describes the concept, tag taxonomy, data flow architecture, and Lambda‑based query service design of Ctrip Business Travel's user profiling system, highlighting how batch and real‑time processing with Spark, Flink, Hive, MongoDB and Redis enable precise marketing, risk control and personalized services.

Big Data · Ctrip · data pipeline

0 likes · 12 min read

Big Data Technology & Architecture

Jul 16, 2020 · Big Data

Spark Configuration Parameters and Performance Tuning Guidelines

This article explains the purpose, default values, and practical tuning recommendations for common Spark submit options such as executor counts, memory settings, shuffle behavior, speculation, and various Spark SQL configurations to help users optimize big‑data workloads.

Big Data · Executor · Performance tuning

0 likes · 14 min read

Architect

Jul 15, 2020 · Big Data

Understanding Flink Task Slots, Resource Allocation, and Slot Sharing Mechanisms

This article explains how Flink uses task slots to partition TaskManager resources, the benefits of slot sharing, the interaction between Scheduler, SlotPool, and ResourceManager, and the internal classes such as LogicalSlot, PhysicalSlot, and SlotSharingManager that enable resource isolation and sharing in stream processing jobs.

Big Data · Flink · Task Slot

0 likes · 6 min read

Youzan Coder

Jul 15, 2020 · Big Data

Design and Implementation of Youzan ABTest System for Data‑Driven Growth

Youzan created an internal A/B testing platform—combining Java/Node SDKs, a real‑time data pipeline, and a metadata‑driven workflow—to enable data‑driven product iteration, granular traffic allocation, automated logging, statistical analysis, and scalable growth insights across its merchant services, while planning further automation and integration.

A/B testing · Big Data · Experiment Platform

0 likes · 19 min read

Huolala Tech

Jul 15, 2020 · Big Data

How to Build Smart, Scalable Data Tracking Solutions for Comprehensive Analytics

This article explores the fundamentals, common schemes, pain points, and a smart end‑to‑end solution for data tracking (埋点), offering practical guidelines, architectural diagrams, and a concrete example to help engineers implement comprehensive, controllable, and efficient event collection pipelines.

Analytics · Big Data · Data Tracking

0 likes · 9 min read

58 Tech

Jul 13, 2020 · Big Data

Design and Implementation of a Financial Data Warehouse: Architecture, Modeling, Quality Monitoring, and Metadata Management

This article presents a comprehensive design and implementation guide for a financial data warehouse, covering background needs, modeling methodology choices, a layered architecture, data quality monitoring, metadata management, naming and coding standards, and future development directions.

Big Data · Data Quality · Data Warehouse

0 likes · 11 min read

Big Data Technology & Architecture

Jul 13, 2020 · Big Data

Understanding and Optimizing Flink Checkpoint Mechanism for Large-Scale State

This article explains Flink's checkpoint mechanism, outlines key performance metrics, discusses interval configuration, external state storage choices, resource allocation, and task-local recovery strategies to improve checkpoint speed and reliability in large‑scale state scenarios.

Big Data · Checkpoint · Flink

0 likes · 5 min read

Architects Research Society

Jul 12, 2020 · Databases

GraphTech Ecosystem Overview: Graph Database Landscape and Storage Options (2019)

This article surveys the 2019 GraphTech ecosystem, detailing the rapid growth of graph databases, market drivers, ecosystem layers, and the variety of native and multi‑model storage systems that support graph‑structured data.

Big Data · Database Ecosystem · Storage Systems

0 likes · 7 min read

Big Data Technology & Architecture

Jul 12, 2020 · Big Data

Design and Implementation of Ozone Data Exploration Service (Recon Server)

This article explains the design of a data exploration service for large‑scale distributed storage systems, detailing metadata synchronization, index reconstruction, aggregation tables, node‑level statistics, a user console, and the transition from checkpoint‑based snapshots to delta updates using RocksDB WAL in Hadoop Ozone Recon Server.

Big Data · Delta Updates · Ozone

0 likes · 9 min read

Big Data Technology & Architecture

Jul 10, 2020 · Big Data

Creating a Test Table in Phoenix/HBase and Implementing a Custom Bitmap Aggregation Function in Spark

This tutorial demonstrates how to create a VARBINARY test table in HBase using Phoenix, serialize its data with RoaringBitmap, implement a custom Spark aggregation function to merge bitmap values, and query the table via Spark SQL, showcasing a practical big-data processing workflow.

Big Data · HBase · Phoenix

0 likes · 6 min read

GrowingIO Tech Team

Jul 9, 2020 · Big Data

How BitMap Storage Boosts Event Analysis Performance in Big Data Platforms

This article explains GrowingIO's event analysis data model, the challenges of metric‑dimension calculations on massive datasets, and how a BitMap‑based vertical storage and dimension‑combination numbering dramatically improve query efficiency and scalability.

Big Data · Performance Optimization · bitmap

0 likes · 11 min read

Big Data Technology & Architecture

Jul 9, 2020 · Big Data

How ZooKeeper Supports HBase: Coordination, Fault Tolerance, Log Splitting, META Table Management, and Replication

This article explains how ZooKeeper functions as a distributed coordination service for HBase, detailing its role in master and RegionServer fault tolerance, log splitting, META table location tracking, and replication management, illustrating the underlying ZNode structures and failover mechanisms.

Big Data · Distributed Coordination · HBase

0 likes · 7 min read

Sohu Tech Products

Jul 8, 2020 · Big Data

Optimizing Workflow in Data Warehouse Construction: A Layered Task‑Instance Approach

The article analyzes data‑warehouse workflow scenarios, explains core concepts such as OLAP, multidimensional modeling and layer architecture, reviews existing workflow engines like Azkaban, Oozie and Airflow, and proposes a task‑and‑instance layered optimization that simplifies dependency configuration, improves collaboration, and supports complex scheduling in modern big‑data environments.

Big Data · ETL · Task Scheduling

0 likes · 21 min read
