Tagged articles
3697 articles
Big Data Technology & Architecture
Apr 19, 2020 · Big Data

Understanding the Backpressure Mechanism in Spark Streaming

This article explains Spark Streaming's backpressure mechanism, detailing how batch intervals can cause data accumulation, the role of Receivers versus DirectKafkaInputDStream, configuration to enable backpressure, and the internal workings of RateController, ReceiverRateController, ReceiverSupervisor, BlockGenerator, and rate calculations for Kafka streams.

Big Data · Kafka · RateController
0 likes · 12 min read
Python Programming Learning Circle
Apr 16, 2020 · Big Data

Getting Started with PySpark: Creating SparkContext, Parallelizing Data, and Basic DataFrame Operations

This tutorial demonstrates how to initialize a SparkContext in PySpark, perform simple parallel computations such as temperature conversion and reduction, create a SparkSession to read CSV data, and apply common DataFrame operations like selecting columns, adding new columns, filtering, grouping, and aggregating.

Big Data · PySpark · Spark
0 likes · 5 min read
HomeTech
Apr 16, 2020 · Big Data

Home (ZhiJia) Distributed Task Scheduling System Overview

The article presents a comprehensive overview of the Home (ZhiJia) distributed task scheduling system, detailing its background, advantages, technology stack, architecture, core concepts, module responsibilities, IDE integration, and future improvement plans for big‑data processing workflows.

Big Data · Distributed Scheduling · Master‑Slave
0 likes · 10 min read
dbaplus Community
Apr 15, 2020 · Big Data

How Ctrip Scaled Hadoop Across Data Centers: Architecture and Lessons

This article details Ctrip's Hadoop evolution, the challenges of expanding across multiple data centers, the evaluation of multi‑cluster versus single‑cluster designs, and the concrete architectural changes, migration tools, bandwidth monitoring, and future plans that enabled a stable cross‑datacenter big‑data platform.

Big Data · Cross-DataCenter · HDFS
0 likes · 19 min read
DataFunTalk
Apr 15, 2020 · Big Data

Apache Flink OLAP Engine: Architecture, Optimizations, and Use Cases

This article presents an in‑depth overview of Apache Flink's new OLAP engine, covering OLAP fundamentals, the three OLAP models, Flink's unified streaming‑batch‑OLAP architecture, performance optimizations, benchmark results, and future development directions.

Apache Flink · Big Data · OLAP
0 likes · 11 min read
Big Data Technology Architecture
Apr 15, 2020 · Big Data

Real-Time Data Warehouse Practices: Case Studies from Meituan, NetEase, Zhihu, and OPPO

This article reviews the evolution of data warehouses from traditional offline models to modern real‑time architectures, presenting detailed case studies of Meituan, NetEase, Zhihu, and OPPO, and discusses layer designs, technology choices such as Flink, Kafka, and storage options, and key lessons for building scalable real‑time warehouses.

Big Data · Flink · Kafka
0 likes · 13 min read
Programmer DD
Apr 12, 2020 · Big Data

Master Elasticsearch: From Basics to SpringBoot Integration and Advanced Queries

This comprehensive guide introduces Elasticsearch fundamentals, its features and use cases, then walks through integrating it with SpringBoot, configuring Maven dependencies, performing index and document operations, and demonstrates a variety of query types and aggregations using both RESTful APIs and Java code examples.

Big Data · Elasticsearch · Full-Text Search
0 likes · 46 min read
Amap Tech
Apr 10, 2020 · Backend Development

Platformization of POI Deep Information Integration at Amap: Design and Implementation

Amap transformed its fragmented POI deep‑information pipelines into a unified platform that automates data acquisition, parsing, dimension alignment, specification mapping, and lifecycle management across billions of records, enabling product managers to integrate, debug, and scale diverse content‑provider feeds with real‑time, end‑to‑end control.

Backend · Big Data · Conversion Engine
0 likes · 13 min read
Meituan Technology Team
Apr 9, 2020 · Big Data

Dual-Engine MOLAP + ROLAP Architecture with Apache Doris for Meituan Takeaway Data Warehouse

Meituan Takeaway’s data warehouse combines Apache Kylin’s MOLAP cubes for stable dimensions with Apache Doris’s MPP‑driven ROLAP engine to handle changing dimensions, detail queries, and near‑real‑time analytics, achieving millisecond‑level responses, reducing storage and compute costs, and simplifying operations across diverse analytical workloads.

Apache Doris · Big Data · Data Warehouse
0 likes · 18 min read
Big Data Technology & Architecture
Apr 8, 2020 · Big Data

Common Apache Flink Exceptions and How to Resolve Them

This article enumerates typical Apache Flink deployment, job, and checkpoint errors—such as JDK version issues, resource shortages, task manager timeouts, and state migration problems—and provides practical troubleshooting steps and configuration tips to help engineers quickly diagnose and fix these failures.

Big Data · Checkpoint · Exception
0 likes · 8 min read
ITPUB
Apr 6, 2020 · Big Data

How to Build a Data Lake Quickly: Strategies, Tools, and Real‑World Cases

This article explains the origins and market growth of data lakes, compares them with traditional data warehouses, showcases major implementations like Amazon Galaxy and Club Factory, and provides practical guidance on choosing open‑source or commercial cloud solutions to construct a data lake efficiently while minimizing risk.

Big Data · Cloud Computing · Data Architecture
0 likes · 10 min read
Big Data Technology & Architecture
Apr 1, 2020 · Big Data

HBase Cluster Deployment Architecture, Configuration Optimization, and Application Layer Usage

This article details the evolution of HBase cluster deployment from mixed‑hardware/software setups to fully independent clusters, explains hardware and software considerations, presents memory and region planning, outlines key configuration parameters, and provides Spark integration examples for batch and real‑time queries and writes.

Big Data · Cluster Deployment · Configuration Optimization
0 likes · 24 min read
Xianyu Technology
Mar 31, 2020 · Backend Development

Hermes Push System: Architecture and Design Overview

The Hermes Push System at Xianyu separates push decisions into three coordinated services—Configuration Center for audience and material data, Task Center for timing and orchestration, and Matching Center for real‑time content ranking—leveraging MySQL, ODPS, Flink, SchedulerX, MetaQ and Alibaba’s TPP/IGraph to boost click‑through rates, double user coverage, and achieve record daily active users, while planning to add open‑page notifications and deeper AI personalization.

Alibaba · Backend · Big Data
0 likes · 12 min read
Big Data Technology & Architecture
Mar 30, 2020 · Databases

HBase Optimization: JVM Tuning, Region Split Policies, BlockCache, and Compaction Strategies

This guide explains how to optimize HBase performance by adjusting JVM memory settings, selecting appropriate garbage collectors, configuring MSLAB and in‑memory compaction, choosing region split policies, tuning BlockCache implementations, and applying suitable compaction policies for different workloads.

Big Data · BlockCache · HBase
0 likes · 18 min read
Tencent Cloud Developer
Mar 29, 2020 · Industry Insights

How Federated Learning Is Breaking Data Silos Across Clouds

This article examines the rise of federated learning as a solution to data islands, detailing regulatory pressures, technical foundations, industry implementations by WeBank, Tencent and VMware, and practical product workflows that enable secure, cross‑cloud AI collaboration.

Artificial Intelligence · Big Data · Data Collaboration
0 likes · 9 min read
DataFunTalk
Mar 28, 2020 · Big Data

Applying Flink State Management for Real-Time Recommendation Scenarios

This article explains how Apache Flink's flexible state management can be leveraged to solve data correlation challenges in real‑time recommendation platforms, compares Flink with Spark and Storm, describes the underlying broadcast and managed state mechanisms, and provides a step‑by‑step implementation using Kafka, Druid, and custom broadcast functions.

Big Data · Flink · Streaming
0 likes · 14 min read
Programmer DD
Mar 27, 2020 · Big Data

How Leading Chinese Companies Scale Elasticsearch for Billions of Queries

This article surveys how major Chinese tech firms such as JD.com, Ctrip, Qunar, 58.com and Didi design, scale, and operate massive Elasticsearch clusters for search, real‑time analytics, and security, detailing architecture choices, shard strategies, data pipelines and performance optimizations.

Big Data · Distributed Systems · Elasticsearch
0 likes · 12 min read
Xianyu Technology
Mar 26, 2020 · Big Data

Scalable User Behavior Data Collection and Auto-Generated Datasets for Xianyu

Xianyu created a highly extensible user‑behavior collection framework that standardizes data into a common ODPS schema, uses JavaScript Proxy to intercept navigation and API calls, maps business metrics via JSON, and aggregates reports, cutting dataset‑creation effort from days to minutes while avoiding heavy full‑tracking overhead.

Analytics · Big Data · Data Collection
0 likes · 9 min read
58 Tech
Mar 26, 2020 · Big Data

LPA-Detector: Distributed Label Propagation with Confidence Weights for Large‑Scale Graph Risk Detection

The article introduces LPA-Detector, an open‑source project that redesigns the Label Propagation Algorithm using Spark GraphX to add node confidence weights and relationship influence, achieving significant improvements in execution efficiency and detection accuracy for massive graph data in risk‑control scenarios.

Big Data · Distributed computing · Risk Detection
0 likes · 8 min read
360 Quality & Efficiency
Mar 24, 2020 · Big Data

Understanding Granularity in Data Warehouse Design

This article explains the concept of granularity in data warehouse design, describing data models composed of structures, operations, and constraints, illustrating how granularity affects storage detail, query performance, and resource consumption, and recommending a dual‑granularity approach to balance efficiency and analytical depth.

Analytics · Big Data · Data Warehouse
0 likes · 5 min read
Qunar Tech Salon
Mar 19, 2020 · Big Data

Apache Kafka Overview: Architecture, Features, and Usage

This article provides a comprehensive introduction to Apache Kafka, covering its high‑throughput distributed architecture, core concepts such as topics, partitions, brokers, producers and consumers, design goals, performance characteristics, deployment steps, configuration, and example code for producers, consumers, and Spring Boot integration.

Big Data · Distributed Systems · Kafka
0 likes · 39 min read
Youzan Coder
Mar 18, 2020 · Big Data

The Evolution of Youzan’s Data Warehouse in a Big Data Environment

The article traces Youzan’s data warehouse from its chaotic early days lacking structure, through a 2016 Airflow‑driven construction phase that introduced layered ODS/DW/Data Mart architecture and naming standards, to a mature stage focused on efficiency, security, SparkSQL, dimensional modeling, metadata, and ongoing real‑time and governance challenges.

Airflow · Big Data · Data Warehouse
0 likes · 20 min read
58 Tech
Mar 16, 2020 · Fundamentals

Understanding Object Serialization: Principles, Frameworks, and Performance Optimizations

This article explains the concept of object serialization, compares generic formats like JSON/XML with binary approaches, discusses optimization principles, key performance metrics, and reviews major serialization frameworks such as Protobuf, Thrift, Hessian, Kryo, and Avro, while also covering TLV encoding, varint algorithms, and practical pitfalls.

Big Data · Binary · Protobuf
0 likes · 16 min read
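
The varint encoding mentioned in the summary above can be illustrated with a minimal sketch. It assumes the unsigned base-128 scheme used by Protobuf (7 payload bits per byte, high bit set when more bytes follow); the function names `encode_varint` and `decode_varint` are hypothetical and are not taken from the article or from any library.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (least significant group first)."""
    if value < 0:
        raise ValueError("this sketch handles only non-negative integers")
    out = bytearray()
    while True:
        group = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(group | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(group)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes) -> Tuple[int, int]:
    """Decode one varint from the start of data; return (value, bytes_consumed)."""
    result = 0
    shift = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, index + 1
        shift += 7
    raise ValueError("truncated varint")


print(encode_varint(300).hex())        # ac02
print(decode_varint(b"\xac\x02"))      # (300, 2)
```

Small values fit in a single byte, which is why varints tend to shrink payloads dominated by small integers.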
DevOps
Mar 16, 2020 · Operations

JD.com DevOps Case Study: Agile Transformation, Continuous Delivery, and Organizational Practices

This case study examines JD.com’s evolution into a technology‑driven enterprise, detailing its corporate culture, the “ABCDE” technology strategy, the implementation of DevOps and agile practices through the CALMS framework, and how unified continuous‑delivery platforms and operational metrics have driven growth, efficiency, and pandemic response.

Big Data · Continuous Delivery · DevOps
0 likes · 16 min read
Top Architect
Mar 13, 2020 · Big Data

Three Billion‑Scale MySQL‑to‑HBase Synchronization Solutions and Practical Implementation

This article presents a comprehensive guide for synchronizing massive MySQL datasets to HBase, covering environment preparation, fast MySQL data loading techniques, and three practical pipelines—Sqoop, Kafka‑Thrift, and Kafka‑Flink—along with performance comparisons and optimization tips for large‑scale data processing.

Big Data · Flink · HBase
0 likes · 24 min read
Meituan Technology Team
Mar 12, 2020 · Big Data

Data Governance Practices in Meituan Delivery: Architecture, Standards, and Security

Meituan Delivery’s data‑governance framework combines a four‑layer warehouse architecture with comprehensive business, technical, security, and resource‑management standards, continuous metadata and security controls, and tools such as Wherehows and QuickSight, delivering standardized, secure, and easily shareable data while guiding future optimization and emerging‑technology adoption.

Big Data · Data Architecture · Data Security
0 likes · 27 min read
Open Source Linux
Mar 12, 2020 · Big Data

Step-by-Step Guide to Build a Hadoop 2.9.2 Cluster on CentOS 7.5

This tutorial walks you through setting up a three‑node Hadoop 2.9.2 cluster on CentOS 7.5, covering environment preparation, password‑less SSH, user creation, JDK installation, Hadoop extraction, configuration file edits, directory setup, ownership changes, service startup, and verification via web UIs.

Big Data · CentOS · Cluster Setup
0 likes · 13 min read
Tencent Tech
Mar 11, 2020 · Big Data

Scaling the Health Code: Tencent Cloud Elasticsearch at Billion-User Scale

Leveraging Tencent Cloud Elasticsearch, the nationwide COVID‑19 health code platform handled over 1.6 billion scans for more than 900 million users, achieving millisecond‑level search, seamless horizontal scaling, multi‑zone high availability, and robust security, while simplifying development through RESTful APIs and rich UI tools.

Big Data · Distributed Systems · Elasticsearch
0 likes · 12 min read
Alibaba Cloud Developer
Mar 9, 2020 · Big Data

How Alibaba Digitally Managed 100,000 Employees’ Return to the Office

Alibaba leveraged a suite of digital solutions—including a big‑data entry‑control system, AI‑driven mask detection, smart‑robot meal scheduling, predictive parking, environment regulation, and contactless services—to orchestrate a safe, orderly return of over 100,000 staff across its global campuses.

AI · Big Data · Digital Transformation
0 likes · 9 min read
iQIYI Technical Product Team
Mar 6, 2020 · Big Data

Real-Time Log Monitoring and Alerting for iQIYI Membership Services

To support over 100 million iQIYI members, the team rebuilt a real‑time log monitoring platform that gathers access, exception, Nginx and front‑end logs via a Venus‑Agent, streams them through Kafka to Spark Streaming and Flink, stores metrics in Druid, and provides minute‑level host and business alerts, achieving 80 % faster incident investigation, detecting 90 % of member complaints early, and generating more than 4,800 actionable alerts.

Big Data · Flink · Log Analytics
0 likes · 11 min read
Suning Technology
Mar 5, 2020 · Artificial Intelligence

Will Retail + Internet Healthcare Survive Post‑COVID? Key Insights

After the pandemic, Suning’s Retail Technology Research Institute examines how the convergence of retail and internet medical services can address rising healthcare demand, resource shortages, and infection risks, leveraging big data, AI, and e‑commerce logistics to create integrated, non‑contact medical solutions and new business models.

AI · Big Data · Healthcare
0 likes · 13 min read
dbaplus Community
Mar 3, 2020 · Big Data

How MaFengWo Scaled Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo's practical experience with Kafka in its big‑data platform, covering three core usage scenarios, a four‑stage evolution roadmap—including version upgrades, resource isolation, security and monitoring—and future plans such as transaction‑based deduplication and consumer throttling.

Big Data · Kafka · Resource Isolation
0 likes · 17 min read
ITPUB
Mar 2, 2020 · Big Data

Mastering ZooKeeper: Core Concepts and Real-World Big Data Applications

This article explains ZooKeeper’s architecture, key concepts such as roles, sessions, ZNodes, versioning, ACLs, and watchers, and demonstrates how it powers essential big‑data components like Hadoop’s ResourceManager and HBase’s master election, naming service, and distributed locking.

Big Data · Distributed Coordination · HBase
0 likes · 23 min read
Alibaba Cloud Developer
Feb 27, 2020 · Databases

How Cloud‑Native Distributed Databases Are Shaping the Future of Enterprise Data

This article reviews the evolution, market trends, core components, architectural challenges, and emerging technologies of cloud‑native distributed database systems, highlighting Alibaba Cloud's solutions such as POLARDB, AnalyticDB, and AI‑driven management platforms that enable elastic, high‑availability, and intelligent data services for modern enterprises.

Alibaba Cloud · Big Data · HTAP
0 likes · 26 min read
Suning Technology
Feb 25, 2020 · Operations

How Post-Pandemic Retail Is Reinvented: Trends, Tech, and Opportunities

The Suning Retail Technology Research Institute analyzes post‑COVID retail trends, highlighting shifts in consumer behavior, the rise of product traceability, smart masks, AI‑enabled smart homes, remote work, online healthcare, and community group buying, while outlining the technologies driving these changes.

AI · Big Data · Smart Home
0 likes · 8 min read
Suning Technology
Feb 22, 2020 · Big Data

How SuNing’s Big Data Engine Powers Health‑Code Pandemic Management

During the COVID‑19 pandemic, SuNing launched a public travel information registration system that leverages massive big‑data processing, high‑concurrency architecture, Kafka streaming, and real‑time analytics to create a city‑wide health‑code network, enabling precise epidemic control, mobility tracking, and robust data privacy safeguards.

Big Data · Data Privacy · Health Code
0 likes · 5 min read
Qunar Tech Salon
Feb 21, 2020 · Artificial Intelligence

Building an End‑to‑End Data‑Model Loop for Alibaba XiaoMi AI Services

The article describes how Alibaba's XiaoMi AI platform constructs a closed‑loop pipeline—from data collection and annotation to model training, evaluation, and real‑time deployment—using multi‑dimensional data processing, visualization, and Spark‑based engines to accelerate iterative improvements and address operational pain points.

AI · Big Data · Spark
0 likes · 9 min read
21CTO
Feb 19, 2020 · Big Data

Building an Open-Source Big Data Analytics Stack: Challenges & Benefits

The article explains why modern companies rely on data‑driven decisions, outlines the two main challenges of tracking data and connecting it to BI, describes the three‑step analytics stack (integration, warehouse, analysis), and highlights the cost, flexibility, and security advantages of open‑source tools.

Big Data · Data Integration · Data Warehouse
0 likes · 5 min read
MaGe Linux Operations
Feb 17, 2020 · Operations

How to Efficiently Split and Merge Large Log Files on Linux

When log files grow massive, traditional tools like vim, cat, grep, and awk become slow and memory‑hungry, but Linux’s split command lets you divide a huge file by line count or size, process the pieces individually, and later recombine them, dramatically improving analysis efficiency.

Big Data · Shell scripting · file-handling
0 likes · 8 min read
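
The split-then-recombine workflow described in the summary above can be sketched as follows. The article itself uses the Linux `split` command; this is a Python analogue under that assumption, not the article's commands. The helper names `split_by_lines` and `merge_chunks` and the `.partNNNN` suffix are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def split_by_lines(src: Path, lines_per_chunk: int, out_dir: Path) -> List[Path]:
    """Stream src into numbered chunk files holding at most lines_per_chunk lines each."""
    if lines_per_chunk <= 0:
        raise ValueError("lines_per_chunk must be positive")
    chunks: List[Path] = []
    out = None
    try:
        with src.open("rb") as source:
            for index, line in enumerate(source):
                if index % lines_per_chunk == 0:
                    # Start a new chunk file every lines_per_chunk lines.
                    if out is not None:
                        out.close()
                    path = out_dir / f"{src.name}.part{len(chunks):04d}"
                    out = path.open("wb")
                    chunks.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunks


def merge_chunks(chunks: List[Path], dest: Path) -> None:
    """Concatenate the chunk files, in the given order, into dest."""
    with dest.open("wb") as merged:
        for chunk in chunks:
            with chunk.open("rb") as part:
                shutil.copyfileobj(part, merged)
```

Both helpers read and write in a streaming fashion, so memory use stays roughly constant regardless of the size of the log file.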
DataFunTalk
Feb 17, 2020 · Artificial Intelligence

Building a Closed‑Loop AI System: From Data Collection to Model Deployment in Alibaba’s XiaoMi

This article explains how Alibaba’s XiaoMi team constructs a full‑cycle AI pipeline—covering real‑time and offline data processing, high‑dimensional visualization, model training, iterative feedback, and Spark‑based deployment—to accelerate intelligent product iteration while addressing common engineering pain points.

AI · Big Data · Real-time Processing
0 likes · 10 min read
Suning Technology
Feb 15, 2020 · Artificial Intelligence

How AI and Unmanned Tech Are Redefining Retail in the Post‑Pandemic Era

The COVID‑19 pandemic accelerated instant consumption and O2O integration, prompting retailers to adopt AI‑driven unmanned stores, big‑data traceability, smart‑home solutions, and innovative mask and health‑product strategies, reshaping supply chains, operations, and consumer experiences.

AI · Big Data · COVID-19
0 likes · 12 min read
Big Data Technology & Architecture
Feb 13, 2020 · Big Data

Optimizing Hadoop MapReduce Jobs for eBay CAL System to Reduce Execution Time and Resource Usage

This article describes how eBay's Central Application Logging (CAL) system generates massive daily logs, the challenges of Hadoop MapReduce job performance and resource consumption, and the step‑by‑step optimizations—reducing GC time, mitigating data skew, and improving algorithms—that cut execution time by over 60%, lowered cluster resource usage, and raised job success rates to nearly 100%.

Big Data · Data Skew · Hadoop
0 likes · 11 min read
Tencent Cloud Developer
Feb 13, 2020 · Big Data

Data Middle Platform: Vision, Architecture, and Business Value

The Data Middle Platform, described by Shi Kai, is a service‑oriented architecture that transforms raw enterprise data into reusable, real‑time APIs for business applications, bridging the gap between traditional warehouses and front‑end systems, accelerating digital transformation through unified governance, rapid development, and direct business value.

Big Data · Data Architecture · Data Middle Platform
0 likes · 26 min read
Big Data Technology & Architecture
Feb 10, 2020 · Big Data

Real‑time MySQL Binlog Capture with Canal: Principles, Architecture, Deployment and Comparison with Maxwell

This article explains how to use Alibaba's Canal to capture MySQL binlog changes in real time, covering its underlying protocol, component architecture, HA design with ZooKeeper, configuration steps, deployment examples, and a detailed comparison with alternative tools such as Maxwell and mysql_streamer.

Big Data · Canal · Kafka
0 likes · 17 min read
58 Tech
Feb 10, 2020 · Big Data

Construction and Practice of a Site-wide User Behavior Data Warehouse at 58.com

This article systematically describes the challenges, design principles, modeling methods, layered architecture, implementation steps, and standards used in building a comprehensive user behavior data warehouse for 58.com, highlighting practical experiences and future improvement directions.

Big Data · Data Quality · Data Warehouse
0 likes · 11 min read
HomeTech
Feb 6, 2020 · Product Management

AutoBI One‑Stop Data Visualization Platform: Architecture, Technical Highlights, and Use Cases

The document outlines AutoBI, a company‑wide one‑stop data visualization platform, detailing its background, overall architecture, key technical components such as real‑time/offline data switching and query processing, integration capabilities, and practical case studies, highlighting efficiency gains and future development plans.

Backend · Big Data · Data visualization
0 likes · 8 min read
Youzan Coder
Feb 5, 2020 · Backend Development

Configurable Data Reconciliation Platform at Youzan: Design, Architecture, and Implementation

Youzan built a configurable data reconciliation platform that integrates new scenarios, processes massive real‑time and batch data, offers visual monitoring, automated correction, and flexible Groovy‑based logic across four DDD layers, achieving 99.99% stability while simplifying detection and resolution of cross‑system inconsistencies.

Big Data · Data Reconciliation · Distributed Systems
0 likes · 15 min read
Big Data Technology & Architecture
Jan 30, 2020 · Big Data

Comprehensive Guide to Spark Performance Optimization (Development, Resource, Data Skew, and Shuffle Tuning)

This article provides an in‑depth, step‑by‑step guide to optimizing Spark jobs, covering development‑time best practices, resource‑parameter tuning, data‑skew detection and mitigation techniques, and shuffle‑stage performance tweaks, complete with Scala code examples and practical recommendations.

Big Data · Data Skew · Performance Optimization
0 likes · 67 min read
Alibaba Cloud Developer
Jan 20, 2020 · Big Data

Alibaba’s Secrets to High‑Throughput Full‑Load and Low‑Latency Search Processing

This article details how Alibaba migrated its massive Taobao‑Tmall search workload to the search offline platform, tackling challenges of massive data volume, one‑to‑many joins, and hotspot sellers through a series of performance optimizations—including local joins, salt‑based data sharding, dynamic aggregation jobs, and asynchronous processing—to achieve high‑throughput full loads and low‑latency incremental updates.

Alibaba · Big Data · Flink
0 likes · 15 min read
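
The salt-based sharding idea named in the summary above is commonly implemented as a two-stage aggregation, sketched here in plain Python. This is an illustration of the general technique only, not Alibaba's implementation; the function name `salted_counts` and its parameters are hypothetical.

```python
import random
from collections import Counter
from typing import Hashable, Iterable, Tuple


def salted_counts(
    keys: Iterable[Hashable], num_salts: int, seed: int = 0
) -> Tuple[Counter, Counter]:
    """Count keys in two stages so that a hot key is spread over num_salts shards."""
    rng = random.Random(seed)

    # Stage 1: attach a random salt so one hot key maps to several (key, salt) shards.
    partial: Counter = Counter()
    for key in keys:
        partial[(key, rng.randrange(num_salts))] += 1

    # Stage 2: strip the salt and combine the partial results per original key.
    final: Counter = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final


partial, final = salted_counts(["hot_seller"] * 1000 + ["a", "b"], num_salts=8)
print(final["hot_seller"])  # 1000
```

In a distributed job, stage 1 would run on many workers keyed by `(key, salt)`, so no single worker receives every record of the hot key; stage 2 then merges at most `num_salts` partial results per key.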
Big Data Technology & Architecture
Jan 19, 2020 · Big Data

Tencent's Elasticsearch Practices: Application Scenarios, Challenges, Optimizations, and Future Directions

This article details how Tencent leverages Elasticsearch for log analysis, search services, and time‑series data, outlines the specific challenges faced in high‑availability and cost‑efficiency, and presents the comprehensive optimization techniques and future open‑source contributions that improve performance, scalability, and reliability.

Big Data · Elasticsearch · Search
0 likes · 16 min read
Tencent Cloud Developer
Jan 19, 2020 · Backend Development

Tencent Kona JDK: OpenJDK Foundations, Technical Trends, and Big Data Practices

The talk reviews OpenJDK’s evolution, contrasts it with Oracle JDK, introduces Tencent’s Kona JDK as a free, long‑term, production‑hardened fork optimized for massive micro‑service and big‑data workloads, and discusses emerging Java‑on‑Java, value‑type, Project Panama/Loom, and SIMD Vector API trends shaping JVM performance.

Big Data · Cloud Computing · JVM
0 likes · 15 min read
Big Data Technology & Architecture
Jan 16, 2020 · Big Data

Kafka Interview Guide: Core Concepts, Architecture, and Practical Tips

This article compiles essential Kafka interview material, covering its role as a message queue, usage scenarios, architectural components, storage mechanisms, consumer group rebalancing, high‑availability features, replication details, ordering guarantees, producer/consumer client design, topic management, log retention, performance optimizations, and key monitoring metrics.

Big Data · Distributed Systems · Interview
0 likes · 16 min read
Architects Research Society
Jan 16, 2020 · Big Data

Elasticsearch vs Solr: Choosing the Right Open‑Source Search Engine

This article compares Elasticsearch and Solr, examining their history, community, licensing, core technologies, APIs, scalability, vendor support, ecosystem, performance, management tools, and visualization options to help organizations decide which open‑source search engine best fits their big‑data and search requirements.

Big Data · Elasticsearch · Solr
0 likes · 12 min read
Big Data Technology & Architecture
Jan 10, 2020 · Big Data

Async I/O for Dimension Table Joins in Apache Flink

This article explains how to handle dimension table joins in Apache Flink streaming by leveraging Async I/O to perform non‑blocking external lookups, provides detailed code examples for both synchronous and asynchronous functions, discusses configuration parameters, and outlines best practices and pitfalls.

Big Data · Dimension Table Join · Flink
0 likes · 16 min read
ITPUB
Jan 10, 2020 · Big Data

How MaFengWo Scales Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo’s practical experience using Kafka across three core scenarios—real‑time storage, analytical data source, and business data subscription—while describing a four‑stage evolution that includes version upgrades, resource isolation, security and monitoring enhancements, and a comprehensive subscription platform, followed by future improvement plans.

Big Data · Data Replay · Kafka
0 likes · 16 min read
DataFunTalk
Jan 9, 2020 · Databases

Exploring Spatiotemporal Data Management with Cassandra, GeoMesa, and GeoTrellis

This article presents a comprehensive overview of handling spatiotemporal data using Cassandra, covering data types, space‑filling curves, GeoHash encoding, the GeoMesa and GeoTrellis ecosystems, Cassandra storage schemas, and practical Spark integration for large‑scale geospatial analytics.

Big Data · Databases · GeoMesa
0 likes · 8 min read
iQIYI Technical Product Team
Jan 9, 2020 · Big Data

Design and Evolution of iQIYI Real-Time Analysis Platform (RAP)

iQIYI’s Real‑Time Analysis Platform (RAP) combines Apache Druid with Spark/Flink to deliver minute‑level, low‑latency multidimensional analytics via a web wizard, supporting hundreds of streaming tasks and thousands of reports across membership, recommendation, and TV monitoring, while simplifying development and maintenance.

Apache Druid · Big Data · Flink
0 likes · 13 min read
Tongcheng Travel Technology Center
Jan 7, 2020 · Big Data

Design and Implementation of XFlink: A Flink‑Based Data Migration System on Yarn

The article describes the evolution from the legacy XDATA tool to the new XFlink system, detailing its architecture, core plugins, parser and deployment modules, resource management with Yarn, monitoring via Prometheus and Grafana, and planned enhancements such as Flink SQL configuration and modular plugins.

Big Data · Data Migration · Distributed Systems
0 likes · 10 min read
dbaplus Community
Jan 6, 2020 · Big Data

How 58.com Built a Scalable Flink‑Based Real‑Time Data Platform (Wstream)

The article details how 58.com designed and evolved its one‑stop real‑time computation platform Wstream, migrating from Storm and Spark Streaming to Apache Flink, and describes the architecture, task isolation, stream‑SQL features, monitoring, and ongoing optimizations that enable processing of over 600 billion records daily.

Big Data · Flink · Real-time Streaming
0 likes · 12 min read
Tencent Cloud Developer
Jan 6, 2020 · Big Data

Overview of TubeMQ: Principles, Architecture, Performance, and Open‑Source Strategy for Big‑Data Message Queues

TubeMQ is a Java‑based distributed message‑queue middleware designed for massive data ingestion at trillion‑message scale. It offers 140k TPS with sub‑5 ms latency, high reliability, low cost, and horizontal scalability, and is being contributed to the Apache Software Foundation to foster community collaboration and future expansion beyond traditional MQ functions.

Big Data · Distributed Systems · Message queue
0 likes · 15 min read
58 Tech
Jan 6, 2020 · Big Data

Design and Architecture of the 58DP Big Data Platform Task Scheduling System

The article presents a comprehensive overview of the 58DP big data platform's task scheduling system, detailing its background, architecture, high‑availability design, slot‑based resource management, scheduling models, task lifecycle, priority rules, dependency handling, failure recovery, and future enhancements.

Big Data · Task Scheduling · distributed system
0 likes · 14 min read
Didi Tech
Jan 5, 2020 · Big Data

Rolling Upgrade of HDFS from 2.7 to 3.2: Experience, Issues and Solutions

The team performed a rolling upgrade of HDFS from 2.7 to 3.2 on large clusters. It resolved EditLog, Fsimage, StringTable, and authentication incompatibilities by omitting EC data, using fallback images, rolling back commits, and first upgrading to the latest 2.x release. The upgrade followed a staged JournalNode‑NameNode‑DataNode procedure and was validated with rehearsals and a custom trash‑management tool. The result was uninterrupted service with improved stability, performance, and cost efficiency.

Big Data · Cluster Migration · HDFS
0 likes · 11 min read