Tagged articles

3697 articles

Apr 20, 2020 · Big Data

How Spark SQL Chooses Join Strategies: Broadcast, Shuffle Hash, and Sort Merge

The article explains Spark SQL's Catalyst optimizer rules for selecting among Broadcast hash join, Shuffle hash join, and Sort‑merge join, covering build‑side determination, size thresholds, broadcast hints, local hash‑map construction, and fallback strategies for non‑equi joins.

Big Data · Broadcast Join · Shuffle Hash Join

0 likes · 10 min read

Big Data Technology & Architecture

Apr 19, 2020 · Big Data

Understanding the Backpressure Mechanism in Spark Streaming

This article explains Spark Streaming's backpressure mechanism, detailing how batch intervals can cause data accumulation, the role of Receivers versus DirectKafkaInputDStream, configuration to enable backpressure, and the internal workings of RateController, ReceiverRateController, ReceiverSupervisor, BlockGenerator, and rate calculations for Kafka streams.

Big Data · Kafka · RateController

0 likes · 12 min read

Python Programming Learning Circle

Apr 16, 2020 · Big Data

Getting Started with PySpark: Creating SparkContext, Parallelizing Data, and Basic DataFrame Operations

This tutorial demonstrates how to initialize a SparkContext in PySpark, perform simple parallel computations such as temperature conversion and reduction, create a SparkSession to read CSV data, and apply common DataFrame operations like selecting columns, adding new columns, filtering, grouping, and aggregating.

Big Data · PySpark · Spark

0 likes · 5 min read

HomeTech

Apr 16, 2020 · Big Data

Home (ZhiJia) Distributed Task Scheduling System Overview

The article presents a comprehensive overview of the Home (ZhiJia) distributed task scheduling system, detailing its background, advantages, technology stack, architecture, core concepts, module responsibilities, IDE integration, and future improvement plans for big‑data processing workflows.

Big Data · Distributed Scheduling · Master‑Slave

0 likes · 10 min read

dbaplus Community

Apr 15, 2020 · Big Data

How Ctrip Scaled Hadoop Across Data Centers: Architecture and Lessons

This article details Ctrip's Hadoop evolution, the challenges of expanding across multiple data centers, the evaluation of multi‑cluster versus single‑cluster designs, and the concrete architectural changes, migration tools, bandwidth monitoring, and future plans that enabled a stable cross‑datacenter big‑data platform.

Big Data · Cross-DataCenter · HDFS

0 likes · 19 min read

DataFunTalk

Apr 15, 2020 · Big Data

Apache Flink OLAP Engine: Architecture, Optimizations, and Use Cases

This article presents an in‑depth overview of Apache Flink's new OLAP engine, covering OLAP fundamentals, the three OLAP models, Flink's unified streaming‑batch‑OLAP architecture, performance optimizations, benchmark results, and future development directions.

Apache Flink · Big Data · OLAP

0 likes · 11 min read

Big Data Technology & Architecture

Apr 15, 2020 · Big Data

Understanding HDFS SecondaryNameNode and the Checkpoint Process

This article explains the role of HDFS SecondaryNameNode, the structure of fsimage and edits files, how checkpointing works—including configuration parameters and steps—and how the process changes when NameNode high availability is enabled.

Big Data · Checkpoint · Filesystem

0 likes · 6 min read

Big Data Technology Architecture

Apr 15, 2020 · Big Data

Real-Time Data Warehouse Practices: Case Studies from Meituan, NetEase, Zhihu, and OPPO

This article reviews the evolution of data warehouses from traditional offline models to modern real‑time architectures, presenting detailed case studies of Meituan, NetEase, Zhihu, and OPPO, and discusses layer designs, technology choices such as Flink, Kafka, and storage options, and key lessons for building scalable real‑time warehouses.

Big Data · Flink · Kafka

0 likes · 13 min read

Ops Development Stories

Apr 13, 2020 · Big Data

Step-by-Step Guide to Installing and Configuring ELK Stack on CentOS 7

This comprehensive tutorial walks you through installing Java, Elasticsearch, Logstash, Kibana, and related tools on two CentOS 7 servers, configuring cluster settings, verifying health, and visualizing logs with Kibana, complete with command‑line examples and troubleshooting tips.

Big Data · CentOS · ELK

0 likes · 17 min read

Programmer DD

Apr 12, 2020 · Big Data

Master Elasticsearch: From Basics to SpringBoot Integration and Advanced Queries

This comprehensive guide introduces Elasticsearch fundamentals, its features and use cases, then walks through integrating it with SpringBoot, configuring Maven dependencies, performing index and document operations, and demonstrates a variety of query types and aggregations using both RESTful APIs and Java code examples.

Big Data · Elasticsearch · Full-Text Search

0 likes · 46 min read

Amap Tech

Apr 10, 2020 · Backend Development

Platformization of POI Deep Information Integration at Amap: Design and Implementation

Amap transformed its fragmented POI deep‑information pipelines into a unified platform that automates data acquisition, parsing, dimension alignment, specification mapping, and lifecycle management across billions of records, enabling product managers to integrate, debug, and scale diverse content‑provider feeds with real‑time, end‑to‑end control.

Backend · Big Data · Conversion Engine

0 likes · 13 min read

DataFunTalk

Apr 9, 2020 · Big Data

Scaling and Optimizing 58.com’s Hadoop‑Based Offline Computing Platform: Architecture, Challenges, and Solutions

This article details how 58.com built a massive Hadoop‑based offline computing platform with over 4,000 servers and hundreds of petabytes of storage, addressing scaling, stability, GC, YARN scheduling, SparkSQL migration, storage operations, and a large‑scale cross‑datacenter migration.

Big Data · Data Migration · Hadoop

0 likes · 24 min read

Meituan Technology Team

Apr 9, 2020 · Big Data

Dual-Engine MOLAP + ROLAP Architecture with Apache Doris for Meituan Takeaway Data Warehouse

Meituan Takeaway’s data warehouse combines Apache Kylin’s MOLAP cubes for stable dimensions with Apache Doris’s MPP‑driven ROLAP engine to handle changing dimensions, detail queries, and near‑real‑time analytics, achieving millisecond‑level responses, reducing storage and compute costs, and simplifying operations across diverse analytical workloads.

Apache Doris · Big Data · Data Warehouse

0 likes · 18 min read

Big Data Technology & Architecture

Apr 9, 2020 · Big Data

Optimizing Hadoop and Hive Jobs with Filters, Projections, and Predicate Pushdown

The article explains how applying filters, projections, and predicate pushdown in Hadoop and Hive reduces data volume, speeds up MapReduce jobs, and improves performance, while also covering join limitations and providing a Java Mapper example for practical implementation.

Big Data · Hadoop · Hive

0 likes · 4 min read

Big Data Technology & Architecture

Apr 8, 2020 · Big Data

Common Apache Flink Exceptions and How to Resolve Them

This article enumerates typical Apache Flink deployment, job, and checkpoint errors—such as JDK version issues, resource shortages, task manager timeouts, and state migration problems—and provides practical troubleshooting steps and configuration tips to help engineers quickly diagnose and fix these failures.

Big Data · Checkpoint · Exception

0 likes · 8 min read

Big Data Technology & Architecture

Apr 8, 2020 · Big Data

Spark Job Execution Principles and Parameter Tuning for Hive on Spark

This article explains how Spark jobs run on YARN, describes the impact of stages, shuffle and task parallelism, and provides detailed recommendations for tuning Spark executor, memory, core, and parallelism settings to dramatically improve Hive‑on‑Spark TPCx‑BB benchmark performance on large datasets.

Big Data · Hive · Parameter Tuning

0 likes · 12 min read

ITPUB

Apr 6, 2020 · Big Data

How to Build a Data Lake Quickly: Strategies, Tools, and Real‑World Cases

This article explains the origins and market growth of data lakes, compares them with traditional data warehouses, showcases major implementations like Amazon Galaxy and Club Factory, and provides practical guidance on choosing open‑source or commercial cloud solutions to construct a data lake efficiently while minimizing risk.

Big Data · Cloud Computing · Data Architecture

0 likes · 10 min read

Big Data Technology & Architecture

Apr 2, 2020 · Big Data

Hive SQL Table Creation, Data Loading, and Query Examples for Student, Course, Teacher, and Score Datasets

This article demonstrates how to create Hive tables for student, course, teacher, and score data, generate CSV files, load them into Hive, and provides a comprehensive set of Hive SQL queries covering data retrieval, aggregation, ranking, and statistical analysis for educational datasets.

Big Data · Data Warehouse · Hive

0 likes · 21 min read

Big Data Technology & Architecture

Apr 1, 2020 · Big Data

HBase Cluster Deployment Architecture, Configuration Optimization, and Application Layer Usage

This article details the evolution of HBase cluster deployment from mixed‑hardware/software setups to fully independent clusters, explains hardware and software considerations, presents memory and region planning, outlines key configuration parameters, and provides Spark integration examples for batch and real‑time queries and writes.

Big Data · Cluster Deployment · Configuration Optimization

0 likes · 24 min read

Big Data Technology & Architecture

Mar 31, 2020 · Big Data

Comprehensive Spark Optimization Guide: Development, Resource, Skew, Shuffle, and Additional Tips

This article presents a detailed summary of Meituan's Spark optimization techniques, covering development‑level RDD tuning, resource parameter configuration, data‑skew mitigation, shuffle improvements, and the advantages of using DataFrame/Dataset APIs for better performance.

Big Data · Optimization · Performance tuning

0 likes · 12 min read

Xianyu Technology

Mar 31, 2020 · Backend Development

Hermes Push System: Architecture and Design Overview

The Hermes Push System at Xianyu separates push decisions into three coordinated services—Configuration Center for audience and material data, Task Center for timing and orchestration, and Matching Center for real‑time content ranking—leveraging MySQL, ODPS, Flink, SchedulerX, MetaQ and Alibaba’s TPP/IGraph to boost click‑through rates, double user coverage, and achieve record daily active users, while planning to add open‑page notifications and deeper AI personalization.

Alibaba · Backend · Big Data

0 likes · 12 min read

Big Data Technology & Architecture

Mar 30, 2020 · Databases

HBase Optimization: JVM Tuning, Region Split Policies, BlockCache, and Compaction Strategies

This guide explains how to optimize HBase performance by adjusting JVM memory settings, selecting appropriate garbage collectors, configuring MSLAB and in‑memory compaction, choosing region split policies, tuning BlockCache implementations, and applying suitable compaction policies for different workloads.

Big Data · BlockCache · HBase

0 likes · 18 min read

Tencent Cloud Developer

Mar 29, 2020 · Industry Insights

How Federated Learning Is Breaking Data Silos Across Clouds

This article examines the rise of federated learning as a solution to data islands, detailing regulatory pressures, technical foundations, industry implementations by WeBank, Tencent and VMware, and practical product workflows that enable secure, cross‑cloud AI collaboration.

Artificial Intelligence · Big Data · Data Collaboration

0 likes · 9 min read

DataFunTalk

Mar 28, 2020 · Big Data

Applying Flink State Management for Real-Time Recommendation Scenarios

This article explains how Apache Flink's flexible state management can be leveraged to solve data correlation challenges in real‑time recommendation platforms, compares Flink with Spark and Storm, describes the underlying broadcast and managed state mechanisms, and provides a step‑by‑step implementation using Kafka, Druid, and custom broadcast functions.

Big Data · Flink · Streaming

0 likes · 14 min read

Programmer DD

Mar 27, 2020 · Big Data

How Leading Chinese Companies Scale Elasticsearch for Billions of Queries

This article surveys how major Chinese tech firms such as JD.com, Ctrip, Qunar, 58.com and Didi design, scale, and operate massive Elasticsearch clusters for search, real‑time analytics, and security, detailing architecture choices, shard strategies, data pipelines and performance optimizations.

Big Data · Distributed Systems · Elasticsearch

0 likes · 12 min read

Xianyu Technology

Mar 26, 2020 · Big Data

Scalable User Behavior Data Collection and Auto-Generated Datasets for Xianyu

Xianyu created a highly extensible user‑behavior collection framework that standardizes data into a common ODPS schema, uses JavaScript Proxy to intercept navigation and API calls, maps business metrics via JSON, and aggregates reports, cutting dataset‑creation effort from days to minutes while avoiding heavy full‑tracking overhead.

Analytics · Big Data · Data Collection

0 likes · 9 min read

58 Tech

Mar 26, 2020 · Big Data

LPA-Detector: Distributed Label Propagation with Confidence Weights for Large‑Scale Graph Risk Detection

The article introduces LPA-Detector, an open‑source project that redesigns the Label Propagation Algorithm using Spark GraphX to add node confidence weights and relationship influence, achieving significant improvements in execution efficiency and detection accuracy for massive graph data in risk‑control scenarios.

Big Data · Distributed computing · Risk Detection

0 likes · 8 min read

360 Quality & Efficiency

Mar 24, 2020 · Big Data

Understanding Granularity in Data Warehouse Design

This article explains the concept of granularity in data warehouse design, describing data models composed of structures, operations, and constraints, illustrating how granularity affects storage detail, query performance, and resource consumption, and recommending a dual‑granularity approach to balance efficiency and analytical depth.

Analytics · Big Data · Data Warehouse

0 likes · 5 min read

Big Data Technology & Architecture

Mar 23, 2020 · Big Data

Best Practices for Designing HBase RowKey to Avoid Hotspots

The article explains how to design HBase RowKeys by dispersing keys, controlling their length, and ensuring uniqueness, providing concrete techniques such as salting, hashing, reversing values, and a practical example with table creation to improve scan performance and prevent region hotspot issues.

Big Data · HBase · HotSpot

0 likes · 6 min read

dbaplus Community

Mar 19, 2020 · Big Data

Inside Ctrip Flight Ticket Data Warehouse: Evolution, Architecture, and Real‑Time Challenges

This article details the evolution of Ctrip's flight ticket data warehouse, describing its historical tech stack, current architecture—including Hive, Presto, ClickHouse, CrateDB, and Flink—data synchronization methods, layer design, quality monitoring, and a real‑time price‑monitoring use case.

Big Data · Ctrip · Data Quality

0 likes · 19 min read

Qunar Tech Salon

Mar 19, 2020 · Big Data

Apache Kafka Overview: Architecture, Features, and Usage

This article provides a comprehensive introduction to Apache Kafka, covering its high‑throughput distributed architecture, core concepts such as topics, partitions, brokers, producers and consumers, design goals, performance characteristics, deployment steps, configuration, and example code for producers, consumers, and Spring Boot integration.

Big Data · Distributed Systems · Kafka

0 likes · 39 min read

Big Data Technology Architecture

Mar 19, 2020 · Big Data

Hive Optimization Modes: Local, Parallel, Strict, and Uber

This article explains Hive's four optimization modes—Local, Parallel, Strict, and Uber—detailing their purpose, performance impact on small MapReduce jobs, and the specific configuration parameters required to enable each mode effectively.

Big Data

0 likes · 8 min read

Youzan Coder

Mar 18, 2020 · Big Data

The Evolution of Youzan’s Data Warehouse in a Big Data Environment

The article traces Youzan’s data warehouse from its chaotic early days lacking structure, through a 2016 Airflow‑driven construction phase that introduced layered ODS/DW/Data Mart architecture and naming standards, to a mature stage focused on efficiency, security, SparkSQL, dimensional modeling, metadata, and ongoing real‑time and governance challenges.

Airflow · Big Data · Data Warehouse

0 likes · 20 min read

Big Data Technology & Architecture

Mar 17, 2020 · Big Data

Quick Guide to Building a Canal‑Based Real‑Time Data Synchronization Platform on CentOS 7

This article walks through the end‑to‑end setup of a small‑scale data platform using Alibaba's Canal for MySQL binlog capture, covering the installation and configuration of MySQL, Zookeeper, Kafka, and Canal itself, and demonstrates real‑time change capture with sample DML operations.

Big Data · Canal · CentOS

0 likes · 20 min read

58 Tech

Mar 16, 2020 · Fundamentals

Understanding Object Serialization: Principles, Frameworks, and Performance Optimizations

This article explains the concept of object serialization, compares generic formats like JSON/XML with binary approaches, discusses optimization principles, key performance metrics, and reviews major serialization frameworks such as Protobuf, Thrift, Hessian, Kryo, and Avro, while also covering TLV encoding, varint algorithms, and practical pitfalls.

Big Data · Binary · Protobuf

0 likes · 16 min read

DevOps

Mar 16, 2020 · Operations

JD.com DevOps Case Study: Agile Transformation, Continuous Delivery, and Organizational Practices

This case study examines JD.com’s evolution into a technology‑driven enterprise, detailing its corporate culture, the “ABCDE” technology strategy, the implementation of DevOps and agile practices through the CALMS framework, and how unified continuous‑delivery platforms and operational metrics have driven growth, efficiency, and pandemic response.

Big Data · Continuous Delivery · DevOps

0 likes · 16 min read

Top Architect

Mar 13, 2020 · Big Data

Three Billion‑Scale MySQL‑to‑HBase Synchronization Solutions and Practical Implementation

This article presents a comprehensive guide for synchronizing massive MySQL datasets to HBase, covering environment preparation, fast MySQL data loading techniques, and three practical pipelines—Sqoop, Kafka‑Thrift, and Kafka‑Flink—along with performance comparisons and optimization tips for large‑scale data processing.

Big Data · Flink · HBase

0 likes · 24 min read

Meituan Technology Team

Mar 12, 2020 · Big Data

Data Governance Practices in Meituan Delivery: Architecture, Standards, and Security

Meituan Delivery’s data‑governance framework combines a four‑layer warehouse architecture with comprehensive business, technical, security, and resource‑management standards, continuous metadata and security controls, and tools such as Wherehows and QuickSight, delivering standardized, secure, and easily shareable data while guiding future optimization and emerging‑technology adoption.

Big Data · Data Architecture · Data Security

0 likes · 27 min read

Open Source Linux

Mar 12, 2020 · Big Data

Step-by-Step Guide to Build a Hadoop 2.9.2 Cluster on CentOS 7.5

This tutorial walks you through setting up a three‑node Hadoop 2.9.2 cluster on CentOS 7.5, covering environment preparation, password‑less SSH, user creation, JDK installation, Hadoop extraction, configuration file edits, directory setup, ownership changes, service startup, and verification via web UIs.

Big Data · CentOS · Cluster Setup

0 likes · 13 min read

Tencent Tech

Mar 11, 2020 · Big Data

Scaling the Health Code: Tencent Cloud Elasticsearch at Billion-User Scale

Leveraging Tencent Cloud Elasticsearch, the nationwide COVID‑19 health code platform handled over 1.6 billion scans for more than 900 million users, achieving millisecond‑level search, seamless horizontal scaling, multi‑zone high availability, and robust security, while simplifying development through RESTful APIs and rich UI tools.

Big Data · Distributed Systems · Elasticsearch

0 likes · 12 min read

Alibaba Cloud Developer

Mar 9, 2020 · Big Data

How Alibaba Digitally Managed 100,000 Employees’ Return to the Office

Alibaba leveraged a suite of digital solutions—including a big‑data entry‑control system, AI‑driven mask detection, smart‑robot meal scheduling, predictive parking, environment regulation, and contactless services—to orchestrate a safe, orderly return of over 100,000 staff across its global campuses.

AI · Big Data · Digital Transformation

0 likes · 9 min read

Big Data Technology & Architecture

Mar 8, 2020 · Big Data

Hive on Spark Tuning Parameters and Best Practices

This article explains how to tune Hive on Spark by adjusting driver, executor, and Hive configuration parameters—including CPU cores, memory allocations, dynamic allocation, and join thresholds—to achieve optimal performance when running on YARN.

Big Data · Hive · Performance tuning

0 likes · 7 min read

Top Architect

Mar 6, 2020 · Big Data

Design and Integration of a Real-Time Log Analysis System Using Flume, Kafka, Storm, Drools, and Redis

This article details the design, installation, and modular integration of Flume, Kafka, Storm, Drools, and Redis to build a real‑time log analysis pipeline for ETL systems, discussing architecture, configuration, code examples, and practical considerations for scalability and fault tolerance.

Big Data · Drools · Flume

0 likes · 24 min read

iQIYI Technical Product Team

Mar 6, 2020 · Big Data

Real-Time Log Monitoring and Alerting for iQIYI Membership Services

To support over 100 million iQIYI members, the team rebuilt a real‑time log monitoring platform that gathers access, exception, Nginx and front‑end logs via a Venus‑Agent, streams them through Kafka to Spark Streaming and Flink, stores metrics in Druid, and provides minute‑level host and business alerts, achieving 80% faster incident investigation, detecting 90% of member complaints early, and generating more than 4,800 actionable alerts.

Big Data · Flink · Log Analytics

0 likes · 11 min read

Tencent Cloud Middleware

Mar 6, 2020 · Operations

Choosing the Right Disk Strategy for High‑Throughput Kafka Clusters

This article examines how to select and configure disk solutions—single‑disk, multi‑directory, RAID, and LVM—for Apache Kafka deployments, comparing performance, cost, scalability, and reliability to help operators build stable, high‑throughput messaging infrastructures.

Big Data · Cloud Computing · Disk Design

0 likes · 16 min read

Suning Technology

Mar 5, 2020 · Artificial Intelligence

Will Retail + Internet Healthcare Survive Post‑COVID? Key Insights

After the pandemic, Suning’s Retail Technology Research Institute examines how the convergence of retail and internet medical services can address rising healthcare demand, resource shortages, and infection risks, leveraging big data, AI, and e‑commerce logistics to create integrated, non‑contact medical solutions and new business models.

AI · Big Data · Healthcare

0 likes · 13 min read

Ctrip Technology

Mar 5, 2020 · Big Data

Design and Optimization of Ctrip's Hotel Data Intelligence Platform Using ClickHouse

This article describes how Ctrip built a unified hotel data intelligence platform, evaluated various database solutions, selected ClickHouse as the primary engine, and implemented performance, high‑availability, and monitoring strategies to handle billions of records and thousands of concurrent queries.

Big Data · ClickHouse · Ctrip

0 likes · 13 min read

dbaplus Community

Mar 3, 2020 · Big Data

How MaFengWo Scaled Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo's practical experience with Kafka in its big‑data platform, covering three core usage scenarios, a four‑stage evolution roadmap—including version upgrades, resource isolation, security and monitoring—and future plans such as transaction‑based deduplication and consumer throttling.

Big Data · Kafka · Resource Isolation

0 likes · 17 min read

ITPUB

Mar 2, 2020 · Big Data

Mastering ZooKeeper: Core Concepts and Real-World Big Data Applications

This article explains ZooKeeper’s architecture, key concepts such as roles, sessions, ZNodes, versioning, ACLs, and watchers, and demonstrates how it powers essential big‑data components like Hadoop’s ResourceManager and HBase’s master election, naming service, and distributed locking.

Big Data · Distributed Coordination · HBase

0 likes · 23 min read

Beike Product & Technology

Feb 27, 2020 · Big Data

Real‑Time Computing with Apache Flink at Beike Zhaofang: Hermes Platform Overview and Future Plans

This article presents the evolution, architecture, and operational metrics of Beike Zhaofang's Hermes real‑time computing platform built on Apache Flink, detailing its business scale, SQL editors, task growth, monitoring, use cases, and future development directions.

Apache Flink · Big Data · Data Engineering

0 likes · 10 min read

Alibaba Cloud Developer

Feb 27, 2020 · Databases

How Cloud‑Native Distributed Databases Are Shaping the Future of Enterprise Data

This article reviews the evolution, market trends, core components, architectural challenges, and emerging technologies of cloud‑native distributed database systems, highlighting Alibaba Cloud's solutions such as POLARDB, AnalyticDB, and AI‑driven management platforms that enable elastic, high‑availability, and intelligent data services for modern enterprises.

Alibaba Cloud · Big Data · HTAP

0 likes · 26 min read

Suning Technology

Feb 25, 2020 · Operations

How Post-Pandemic Retail Is Reinvented: Trends, Tech, and Opportunities

The Suning Retail Technology Research Institute analyzes post‑COVID retail trends, highlighting shifts in consumer behavior, the rise of product traceability, smart masks, AI‑enabled smart homes, remote work, online healthcare, and community group buying, while outlining the technologies driving these changes.

AI · Big Data · Smart Home

0 likes · 8 min read

Big Data Technology & Architecture

Feb 24, 2020 · Big Data

Apache Ozone: Architecture, Design Principles, and Deployment Guide

This article introduces Apache Ozone, a scalable distributed object storage system for Hadoop, covering its background, core components, design principles, architecture, deployment steps, configuration examples, and basic command‑line operations for managing volumes, buckets, and keys.

Big Data · CLI · Distributed Systems

0 likes · 18 min read

Suning Technology

Feb 22, 2020 · Big Data

How SuNing’s Big Data Engine Powers Health‑Code Pandemic Management

During the COVID‑19 pandemic, SuNing launched a public travel information registration system that leverages massive big‑data processing, high‑concurrency architecture, Kafka streaming, and real‑time analytics to create a city‑wide health‑code network, enabling precise epidemic control, mobility tracking, and robust data privacy safeguards.

Big Data · Data Privacy · Health Code

0 likes · 5 min read

Qunar Tech Salon

Feb 21, 2020 · Artificial Intelligence

Building an End‑to‑End Data‑Model Loop for Alibaba XiaoMi AI Services

The article describes how Alibaba's XiaoMi AI platform constructs a closed‑loop pipeline—from data collection and annotation to model training, evaluation, and real‑time deployment—using multi‑dimensional data processing, visualization, and Spark‑based engines to accelerate iterative improvements and address operational pain points.

AI · Big Data · Spark

0 likes · 9 min read

21CTO

Feb 19, 2020 · Big Data

Building an Open-Source Big Data Analytics Stack: Challenges & Benefits

The article explains why modern companies rely on data‑driven decisions, outlines the two main challenges of tracking data and connecting it to BI, describes the three‑step analytics stack (integration, warehouse, analysis), and highlights the cost, flexibility, and security advantages of open‑source tools.

Big Data, Data Integration, Data Warehouse

0 likes · 5 min read

DataFunTalk

Feb 19, 2020 · Big Data

Design and Integration of Flink Batch Processing with Hive: Architecture, Features, and Performance Evaluation

This article presents the design of Flink's batch processing architecture, its integration with Hive through a unified Catalog API, details the enhancements in Flink 1.10, outlines future work, and reports a performance test showing roughly seven‑fold speedup over Hive on MapReduce.

Batch processing, Big Data, Catalog API

0 likes · 9 min read

Big Data Technology Architecture

Feb 17, 2020 · Big Data

Evolution of Apache Kafka Versions and Their Key Features

This article reviews the historical evolution of Apache Kafka versions, explains the versioning scheme, highlights major features introduced in each release from 0.7.x to 2.x, and provides practical recommendations for selecting an appropriate Kafka version.

Big Data, Producer Consumer, Versioning

0 likes · 9 min read

MaGe Linux Operations

Feb 17, 2020 · Operations

How to Efficiently Split and Merge Large Log Files on Linux

When log files grow massive, traditional tools like vim, cat, grep, and awk become slow and memory‑hungry, but Linux’s split command lets you divide a huge file by line count or size, process the pieces individually, and later recombine them, dramatically improving analysis efficiency.

Big Data, Shell scripting, file-handling

0 likes · 8 min read

DataFunTalk

Feb 17, 2020 · Artificial Intelligence

Building a Closed‑Loop AI System: From Data Collection to Model Deployment in Alibaba’s XiaoMi

This article explains how Alibaba’s XiaoMi team constructs a full‑cycle AI pipeline—covering real‑time and offline data processing, high‑dimensional visualization, model training, iterative feedback, and Spark‑based deployment—to accelerate intelligent product iteration while addressing common engineering pain points.

AI, Big Data, Real-time Processing

0 likes · 10 min read

Big Data Technology & Architecture

Feb 16, 2020 · Big Data

Implementing User Purchase Behavior Tracking with Flink Broadcast State

This article explains how to use Flink's Broadcast State to track user purchase paths in real time, detailing the design, required Kafka streams, Java APIs, state management, dynamic configuration, code implementation, deployment steps, and example results for a big‑data streaming application.

Big Data, Broadcast State, Flink

0 likes · 19 min read

Big Data Technology & Architecture

Feb 16, 2020 · Big Data

Implementing MySQL Binlog Synchronization to HDFS Using Canal

This article gives a step‑by‑step guide to deploying Canal to capture MySQL binlog events, configuring HA with ZooKeeper, designing a client that parses binlog events into JSON and acknowledges messages asynchronously, archiving data to local files for batch upload to HDFS, and monitoring latency for alerts.

Big Data, Canal, HDFS

0 likes · 10 min read

Suning Technology

Feb 15, 2020 · Artificial Intelligence

How AI and Unmanned Tech Are Redefining Retail in the Post‑Pandemic Era

The COVID‑19 pandemic accelerated instant consumption and O2O integration, prompting retailers to adopt AI‑driven unmanned stores, big‑data traceability, smart‑home solutions, and innovative mask and health‑product strategies, reshaping supply chains, operations, and consumer experiences.

AI, Big Data, COVID-19

0 likes · 12 min read

Big Data Technology & Architecture

Feb 13, 2020 · Big Data

Optimizing Hadoop MapReduce Jobs for eBay CAL System to Reduce Execution Time and Resource Usage

This article describes how eBay's Central Application Logging (CAL) system generates massive daily logs, the challenges of Hadoop MapReduce job performance and resource consumption, and the step‑by‑step optimizations—reducing GC time, mitigating data skew, and improving algorithms—that cut execution time by over 60%, lowered cluster resource usage, and raised job success rates to nearly 100%.

Big Data, Data Skew, Hadoop

0 likes · 11 min read

Tencent Cloud Developer

Feb 13, 2020 · Big Data

Data Middle Platform: Vision, Architecture, and Business Value

The Data Middle Platform, described by Shi Kai, is a service‑oriented architecture that transforms raw enterprise data into reusable, real‑time APIs for business applications, bridging the gap between traditional warehouses and front‑end systems, accelerating digital transformation through unified governance, rapid development, and direct business value.

Big Data, Data Architecture, Data Middle Platform

0 likes · 26 min read

Big Data Technology & Architecture

Feb 10, 2020 · Big Data

Real‑time MySQL Binlog Capture with Canal: Principles, Architecture, Deployment and Comparison with Maxwell

This article explains how to use Alibaba's Canal to capture MySQL binlog changes in real time, covering its underlying protocol, component architecture, HA design with ZooKeeper, configuration steps, deployment examples, and a detailed comparison with alternative tools such as Maxwell and mysql_streamer.

Big Data, Canal, Kafka

0 likes · 17 min read

58 Tech

Feb 10, 2020 · Big Data

Construction and Practice of a Site-wide User Behavior Data Warehouse at 58.com

This article systematically describes the challenges, design principles, modeling methods, layered architecture, implementation steps, and standards used in building a comprehensive user behavior data warehouse for 58.com, highlighting practical experiences and future improvement directions.

Big Data, Data Quality, Data Warehouse

0 likes · 11 min read

Big Data Technology & Architecture

Feb 9, 2020 · Big Data

Understanding Hadoop's Circular Buffer in the Shuffle Phase

This article explains how Hadoop's MapReduce shuffle uses a circular buffer to store serialized key/value pairs and their metadata, detailing its structure, initialization, write path, spill logic, and the background thread that sorts and writes data to disk.

Big Data, Hadoop, Java

0 likes · 24 min read

Big Data Technology & Architecture

Feb 8, 2020 · Big Data

A Practical Guide to Reading Apache Spark Source Code and Understanding Its Core Design

This article explains why Spark is a mature big‑data framework, recommends which Spark versions to study, lists essential research papers, describes how to set up the development environment, and outlines the key components of Spark’s core architecture for effective source‑code exploration.

Apache Spark, Big Data, RDD

0 likes · 6 min read

Big Data Technology & Architecture

Feb 6, 2020 · Big Data

Comparison of Hudi, Iceberg, and Delta Lake Table Formats

This article compares three data‑lake table formats, Hudi, Iceberg, and Delta Lake, covering their design goals, their common reliance on meta files, and their distinct strengths for upserts, analytics, and unified streaming‑batch processing in modern big‑data environments.

Big Data, Data Lake, Delta Lake

0 likes · 10 min read

HomeTech

Feb 6, 2020 · Product Management

AutoBI One‑Stop Data Visualization Platform: Architecture, Technical Highlights, and Use Cases

The document outlines AutoBI, a company‑wide one‑stop data visualization platform, detailing its background, overall architecture, key technical components such as real‑time/offline data switching and query processing, integration capabilities, and practical case studies, highlighting efficiency gains and future development plans.

Backend, Big Data, Data visualization

0 likes · 8 min read

Big Data Technology & Architecture

Feb 5, 2020 · Big Data

Resolving Oozie Shell Scheduling Issues for Flink Jobs on CDH 6.3 with Kerberos Authentication

The article describes how to troubleshoot and fix Oozie shell‑action failures when submitting Flink jobs on a CDH 6.3 cluster with Kerberos, detailing environment‑variable conflicts, error messages, and the final solution using a clean environment and custom FLINK_CONF_DIR settings.

Big Data, CDH, Flink

0 likes · 7 min read

360 Quality & Efficiency

Feb 5, 2020 · Artificial Intelligence

Key Takeaways from AICon: AI Fundamentals, Applications, and Future Directions

The article shares notes from the AICon global AI and machine learning conference, outlining AI’s three core elements—computing power, big data, and algorithms—its problem domains, current applications across industries, and future directions such as AI‑IoT‑5G integration.

AI Conference, Artificial Intelligence, Big Data

0 likes · 6 min read

Youzan Coder

Feb 5, 2020 · Backend Development

Configurable Data Reconciliation Platform at Youzan: Design, Architecture, and Implementation

Youzan built a configurable data reconciliation platform that integrates new scenarios, processes massive real‑time and batch data, offers visual monitoring, automated correction, and flexible Groovy‑based logic across four DDD layers, achieving 99.99% stability while simplifying detection and resolution of cross‑system inconsistencies.

Big Data, Data Reconciliation, Distributed Systems

0 likes · 15 min read

Big Data Technology Architecture

Feb 1, 2020 · Big Data

Beike's Hermes Real‑Time Computing Platform: Architecture, Scale, and Future Roadmap

The article presents a comprehensive case study of Beike's Hermes real‑time computing platform, detailing its business evolution, Hermes architecture, SQL V1/V2 editors built on Spark and Flink, large‑scale deployment statistics, monitoring, diverse business use cases, and planned future enhancements.

Apache Flink, Beike, Big Data

0 likes · 11 min read

Big Data Technology & Architecture

Jan 30, 2020 · Big Data

Comprehensive Guide to Spark Performance Optimization (Development, Resource, Data Skew, and Shuffle Tuning)

This article provides an in‑depth, step‑by‑step guide to optimizing Spark jobs, covering development‑time best practices, resource‑parameter tuning, data‑skew detection and mitigation techniques, and shuffle‑stage performance tweaks, complete with Scala code examples and practical recommendations.

Big Data, Data Skew, Performance Optimization

0 likes · 67 min read

Big Data Technology & Architecture

Jan 25, 2020 · Big Data

Spark Scala Example: Find the Most Frequent Visitor ID in a 500‑Million‑Record Dataset

This article demonstrates how to generate 500 million visitor IDs with Spark, use map‑reduce operations to count occurrences, and identify the ID with the highest visit count, while discussing performance considerations such as memory spilling and cluster resources.

Big Data, RDD, Scala

0 likes · 11 min read

Big Data Technology & Architecture

Jan 20, 2020 · Big Data

Understanding Data Middle Platform: Architecture, Components, and Operational Practices

The article explains the concept, architecture, and key components of a data middle platform—including data aggregation, development, asset management, service systems, and operational and security mechanisms—while also promoting related books and a giveaway.

Big Data, Data Architecture, Data Integration

0 likes · 7 min read

Alibaba Cloud Developer

Jan 20, 2020 · Big Data

Alibaba’s Secrets to High‑Throughput Full‑Load and Low‑Latency Search Processing

This article details how Alibaba migrated its massive Taobao‑Tmall search workload to the search offline platform, tackling challenges of massive data volume, one‑to‑many joins, and hotspot sellers through a series of performance optimizations—including local joins, salt‑based data sharding, dynamic aggregation jobs, and asynchronous processing—to achieve high‑throughput full loads and low‑latency incremental updates.

Alibaba, Big Data, Flink

0 likes · 15 min read

Big Data Technology & Architecture

Jan 19, 2020 · Big Data

Tencent's Elasticsearch Practices: Application Scenarios, Challenges, Optimizations, and Future Directions

This article details how Tencent leverages Elasticsearch for log analysis, search services, and time‑series data, outlines the specific challenges faced in high‑availability and cost‑efficiency, and presents the comprehensive optimization techniques and future open‑source contributions that improve performance, scalability, and reliability.

Big Data, Elasticsearch, Search

0 likes · 16 min read

Tencent Cloud Developer

Jan 19, 2020 · Backend Development

Tencent Kona JDK: OpenJDK Foundations, Technical Trends, and Big Data Practices

The talk reviews OpenJDK’s evolution, contrasts it with Oracle JDK, introduces Tencent’s Kona JDK as a free, long‑term, production‑hardened fork optimized for massive micro‑service and big‑data workloads, and discusses emerging Java‑on‑Java, value‑type, Project Panama/Loom, and SIMD Vector API trends shaping JVM performance.

Big Data, Cloud Computing, JVM

0 likes · 15 min read

Big Data Technology & Architecture

Jan 16, 2020 · Big Data

Kafka Interview Guide: Core Concepts, Architecture, and Practical Tips

This article compiles essential Kafka interview material, covering its role as a message queue, usage scenarios, architectural components, storage mechanisms, consumer group rebalancing, high‑availability features, replication details, ordering guarantees, producer/consumer client design, topic management, log retention, performance optimizations, and key monitoring metrics.

Big Data, Distributed Systems, Interview

0 likes · 16 min read

360 Tech Engineering

Jan 16, 2020 · Big Data

Real-Time and Offline Integrated Solution for Channel Analysis Data Processing

This article presents a comprehensive real‑time and offline integrated solution for a channel analysis system, detailing challenges, architecture, implementation using Flink, Spark Streaming, Kafka, Elasticsearch, and Hive, and demonstrating minute‑level latency and high accuracy through performance evaluations.

Big Data, Elasticsearch, Flink

0 likes · 10 min read

Architects Research Society

Jan 16, 2020 · Big Data

Elasticsearch vs Solr: Choosing the Right Open‑Source Search Engine

This article compares Elasticsearch and Solr, examining their history, community, licensing, core technologies, APIs, scalability, vendor support, ecosystem, performance, management tools, and visualization options to help organizations decide which open‑source search engine best fits their big‑data and search requirements.

Big Data, Elasticsearch, Solr

0 likes · 12 min read

Big Data Technology & Architecture

Jan 13, 2020 · Big Data

Understanding ORC File Format in Hive: Structure, Storage, Indexes, Compression, and Configuration

This article explains the ORC (Optimized Record Columnar) file format used in Hive, covering its architecture, stripe and column storage, handling of complex data types, indexing mechanisms, compression streams, memory management, and key configuration parameters.

Big Data, Compression, File Format

0 likes · 14 min read

Big Data Technology & Architecture

Jan 10, 2020 · Big Data

Async I/O for Dimension Table Joins in Apache Flink

This article explains how to handle dimension table joins in Apache Flink streaming by leveraging Async I/O to perform non‑blocking external lookups, provides detailed code examples for both synchronous and asynchronous functions, discusses configuration parameters, and outlines best practices and pitfalls.

Big Data, Dimension Table Join, Flink

0 likes · 16 min read

ITPUB

Jan 10, 2020 · Big Data

How MaFengWo Scales Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo’s practical experience using Kafka across three core scenarios—real‑time storage, analytical data source, and business data subscription—while describing a four‑stage evolution that includes version upgrades, resource isolation, security and monitoring enhancements, and a comprehensive subscription platform, followed by future improvement plans.

Big Data, Data Replay, Kafka

0 likes · 16 min read

Architects' Tech Alliance

Jan 9, 2020 · Big Data

Building a Data Middle Platform: Practices and Architecture at NetEase Yanxuan

The article explains why companies are building data middle platforms, defines what a data middle platform is, and details NetEase Yanxuan’s architecture, including its data warehouse, data services, and BI platform, illustrating how these components enable data‑driven transformation and fine‑grained operations.

BI, Big Data, Data Middle Platform

0 likes · 11 min read

DataFunTalk

Jan 9, 2020 · Databases

Exploring Spatiotemporal Data Management with Cassandra, GeoMesa, and GeoTrellis

This article presents a comprehensive overview of handling spatiotemporal data using Cassandra, covering data types, space‑filling curves, GeoHash encoding, the GeoMesa and GeoTrellis ecosystems, Cassandra storage schemas, and practical Spark integration for large‑scale geospatial analytics.

Big Data, Databases, GeoMesa

0 likes · 8 min read

iQIYI Technical Product Team

Jan 9, 2020 · Big Data

Design and Evolution of iQIYI Real-Time Analysis Platform (RAP)

iQIYI’s Real‑Time Analysis Platform (RAP) combines Apache Druid with Spark/Flink to deliver minute‑level, low‑latency multidimensional analytics via a web wizard, supporting hundreds of streaming tasks and thousands of reports across membership, recommendation, and TV monitoring, while simplifying development and maintenance.

Apache Druid, Big Data, Flink

0 likes · 13 min read

Big Data Technology & Architecture

Jan 8, 2020 · Big Data

Real-Time Data Warehouse Architecture and Challenges Using Flink, Kafka, and HBase

This article examines the design of a real-time data warehouse built on Flink, Kafka, and HBase, compares it with traditional offline warehouses, and discusses key challenges such as data accuracy, latency, and the complexity of maintaining real-time dimension tables.

Big Data, Data Warehouse, Flink

0 likes · 10 min read

Big Data Technology & Architecture

Jan 7, 2020 · Big Data

Real-time Data Processing with Kafka, Spark Streaming, and HBase: Implementation Guide

This article presents a step‑by‑step guide for building a real‑time data pipeline using Kafka as a message buffer, Spark Streaming's direct approach for processing, and HBase for storage, including code examples, Maven configuration, local cluster setup, and troubleshooting tips.

Big Data, HBase, Kafka

0 likes · 12 min read

Python Programming Learning Circle

Jan 7, 2020 · Fundamentals

Which Tech Skills Will Make You Irreplaceable in Today’s Job Market?

In a fiercely competitive internet era, technical professionals must continuously learn across fields such as information security, Python, cloud computing, big data, AI, software testing, IoT, and internet marketing to become the highly sought‑after talent that companies urgently need.

Artificial Intelligence, Big Data, Cloud Computing

0 likes · 7 min read

Tongcheng Travel Technology Center

Jan 7, 2020 · Big Data

Design and Implementation of XFlink: A Flink‑Based Data Migration System on Yarn

The article describes the evolution from the legacy XDATA tool to the new XFlink system, detailing its architecture, core plugins, parser and deployment modules, resource management with Yarn, monitoring via Prometheus and Grafana, and planned enhancements such as Flink SQL configuration and modular plugins.

Big Data, Data Migration, Distributed Systems

0 likes · 10 min read

Big Data Technology & Architecture

Jan 7, 2020 · Big Data

Using HyperLogLog for High-Performance Pre-Aggregation in Big Data with Spark-Alchemy

The article explains how pre‑aggregation combined with the HyperLogLog algorithm and Spark‑Alchemy's native HLL functions can dramatically accelerate distinct‑count calculations in big‑data workloads while maintaining low error rates and cross‑system compatibility.

Approximate Distinct Count, Big Data, HyperLogLog

0 likes · 7 min read

Top Architect

Jan 7, 2020 · Big Data

Technical Architecture Overview of Toutiao: Data Processing, User Modeling, and Recommendation System

This article provides a comprehensive overview of Toutiao's rapid growth and technical architecture, detailing its massive user base, data collection pipelines, user modeling, recommendation engines, storage solutions, message push mechanisms, micro‑service design, and virtualization PaaS platform.

Big Data, Toutiao, architecture

0 likes · 8 min read

dbaplus Community

Jan 6, 2020 · Big Data

How 58.com Built a Scalable Flink‑Based Real‑Time Data Platform (Wstream)

The article details how 58.com designed and evolved its one‑stop real‑time computation platform Wstream, migrating from Storm and Spark Streaming to Apache Flink, and describes the architecture, task isolation, stream‑SQL features, monitoring, and ongoing optimizations that enable processing of over 600 billion records daily.

Big Data, Flink, Real-time Streaming

0 likes · 12 min read

Tencent Cloud Developer

Jan 6, 2020 · Big Data

Overview of TubeMQ: Principles, Architecture, Performance, and Open‑Source Strategy for Big‑Data Message Queues

TubeMQ is a trillion‑level, Java‑based distributed message‑queue middleware designed for massive‑data ingestion, offering 140 k TPS with sub‑5 ms latency, high reliability, low cost, and horizontal scalability. It is being contributed to the Apache Software Foundation to foster community collaboration and future expansion beyond traditional MQ functions.

Big Data, Distributed Systems, Message queue

0 likes · 15 min read

58 Tech

Jan 6, 2020 · Big Data

Design and Architecture of the 58DP Big Data Platform Task Scheduling System

The article presents a comprehensive overview of the 58DP big data platform's task scheduling system, detailing its background, architecture, high‑availability design, slot‑based resource management, scheduling models, task lifecycle, priority rules, dependency handling, failure recovery, and future enhancements.

Big Data, Task Scheduling, distributed system

0 likes · 14 min read

Didi Tech

Jan 5, 2020 · Big Data

Rolling Upgrade of HDFS from 2.7 to 3.2: Experience, Issues and Solutions

The team performed a rolling upgrade of HDFS from 2.7 to 3.2 on large clusters. They resolved EditLog, Fsimage, StringTable, and authentication incompatibilities by omitting EC data, using fallback images, rolling back commits, and first upgrading to the latest 2.x release. The upgrade followed a staged JournalNode‑NameNode‑DataNode procedure and was validated with rehearsals and a custom trash‑management tool, achieving uninterrupted service along with improved stability, performance, and cost efficiency.

Big Data, Cluster Migration, HDFS

0 likes · 11 min read
