Tagged articles
3697 articles
DataFunTalk
May 16, 2021 · Big Data

Efficient Data Update/Delete and Real‑time Processing in the Arctic Lakehouse System

This article explains the evolution from traditional data warehouses to modern lakehouse architectures, introduces the Arctic system’s dynamic hash tree for fast update/delete, describes file splitting with sequence/offset ordering, and compares copy‑on‑write versus merge‑on‑read techniques for achieving low‑latency analytics.

Arctic · Big Data · Copy-on-Write
0 likes · 12 min read
Big Data Technology & Architecture
May 15, 2021 · Big Data

One‑Stop Big Data Platform Construction: Practices from WeBank, Beike, and iQIYI

This article shares practical notes on building a one‑stop big data platform, outlining essential functions such as data extraction, cleaning, storage, analysis, governance, and security, and presents implementation case studies from WeBank, Beike, and iQIYI to illustrate real‑world architectures and solutions.

Big Data · Data Platform · case study
0 likes · 8 min read
Architects Research Society
May 15, 2021 · Big Data

Data Warehouse vs Data Lake: Definitions, Differences, and Architectural Considerations

Data warehouses store structured data centrally for reporting and analysis, while data lakes retain raw data in various formats, offering flexible, low‑cost, schema‑on‑read processing; the article explains their definitions, key differences, common misconceptions, and why many organizations now combine both to enable self‑service big‑data analytics.

Analytics · Big Data · Data Architecture
0 likes · 21 min read
DataFunTalk
May 14, 2021 · Big Data

Real‑time Billion‑Scale Data Transmission and AI Pipeline Architecture at Bilibili

This article presents a technical deep‑dive into Bilibili’s evolution from offline to real‑time data processing, describing the challenges of timeliness, ETL, AI feature engineering, and the design of a Flink‑on‑YARN incremental pipeline that supports trillion‑scale message throughput and AI‑driven real‑time applications.

AI · Big Data · Flink
0 likes · 27 min read
HelloTech
May 14, 2021 · Big Data

User Behavior Analysis System: Architecture, ClickHouse Cluster Deployment, and Analytical Techniques

The article describes a real‑time user behavior analysis platform built on a ClickHouse cluster, detailing its architecture, Hive‑to‑ClickHouse data ingestion with user‑ID routing, table designs for behavior and group data, and five analytical methods—event, funnel, path, retention, and attribution—leveraging shard‑level parallelism and custom functions for high efficiency.

Analytics · Big Data · ClickHouse
0 likes · 20 min read
ITPUB
May 14, 2021 · Big Data

How AnalyticDB Powers Petabyte-Scale Consumer Analytics in Alibaba’s Data Bank

The article details how Alibaba’s Data Bank leverages AnalyticDB’s cold‑hot tiered storage, high‑throughput real‑time writes, and low‑latency OLAP capabilities to handle petabyte‑scale consumer data, support flexible AIPL analysis, crowd profiling, and rapid audience selection while cutting costs and ensuring elasticity during peak events.

AnalyticDB · Big Data · Cold-Hot Storage
0 likes · 14 min read
Volcano Engine Developer Services
May 13, 2021 · Databases

Inside ByteGraph: How ByteDance Built a Scalable Distributed Graph Database

The article offers a comprehensive technical deep‑dive into ByteDance’s home‑grown distributed graph database and graph‑processing engine, ByteGraph, covering its directed‑property graph model, Gremlin query support, multi‑layer architecture, storage strategies for massive data, and real‑world graph‑computing practices.

Big Data · ByteGraph · Graph Database
0 likes · 28 min read
JD Retail Technology
May 13, 2021 · Big Data

Evolution and Architecture of JD.com Self‑Operated Rebate Platform

The article details the development, challenges, and redesign of JD.com’s self‑operated rebate system, describing its early monolithic architecture, data‑intensive processing pipeline, migration to a modular, high‑availability platform built on Spark, Hive, and Elasticsearch, and the resulting performance and operational improvements.

Big Data · ETL · Spark
0 likes · 16 min read
DataFunTalk
May 12, 2021 · Big Data

Building a Unified Real‑Time and Offline OLAP Platform with DorisDB at Yuanfudao

The article describes how Yuanfudao's data middle platform built a high‑performance OLAP service using the MPP HOLAP engine DorisDB to unify real‑time and batch analytics, meet low‑latency and high‑concurrency requirements, and support diverse education‑industry use cases such as live‑stream monitoring, advertising, and order analytics.

Big Data · DorisDB · Education Technology
0 likes · 13 min read
Tencent Tech
May 12, 2021 · Big Data

How Tencent Powered China’s 7th Census with Big Data and Cloud Tech

The article explains how China’s seventh national census, covering 1.41 billion people, was conducted using fully electronic data collection, self‑service mini‑programs, massive cloud‑native infrastructure, and high‑performance databases to achieve real‑time processing and unprecedented scale.

Big Data · Databases · census
0 likes · 8 min read
DataFunTalk
May 11, 2021 · Big Data

Design and Practice of Baixin Bank's Flink‑Based Real‑Time Computing Platform and Hudi‑Powered Real‑Time Data Lake

This article details Baixin Bank's construction of a Flink‑driven real‑time computing platform integrated with Hudi as a real‑time data lake, covering background, architecture, data collection, transformation, storage layers, technical challenges, future roadmap, and practical lessons for similar big‑data initiatives.

Big Data · Data engineering · Flink
0 likes · 12 min read
Big Data Technology & Architecture
May 11, 2021 · Big Data

Data Quality: Dimensions, Rules, and Constraints

The article explains the importance of data quality in the big data era, defines key quality dimensions such as completeness, uniqueness, validity, consistency, accuracy, timeliness, and credibility, and details how each dimension can be measured and enforced through specific constraints and validation rules.

Accuracy · Big Data · Consistency
0 likes · 9 min read
Architects Research Society
May 9, 2021 · Big Data

Data Lakes vs. Data Warehouses: Key Differences and Choosing the Right Approach

This article explains the fundamental distinctions between data lakes and data warehouses, outlines five critical differences—including data retention, type support, user support, adaptability, and insight speed—and offers guidance on selecting the appropriate solution based on organizational needs and technology options.

Analytics · Big Data · Data Architecture
0 likes · 12 min read
Architecture Digest
May 7, 2021 · Big Data

Comprehensive Overview of Data Middle Platform Architecture and Practices

This article provides a detailed introduction to data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, monitoring, and deployment patterns, illustrating how enterprises build unified data ecosystems across various industries.

Big Data · Data Platform · Data Warehouse
0 likes · 25 min read
Qu Tech
May 6, 2021 · Big Data

How JuiceFS Cut HDFS Load by 26% and Boost Presto Query Speed 13%

This case study details how integrating JuiceFS with Presto reduced HDFS cluster load by about 26%, achieved over 90% cache hit rate for ad‑hoc queries, and lowered average query latency by roughly 13%, while simplifying operations and improving system stability.

Big Data · Cache · HDFS
0 likes · 9 min read
DataFunTalk
May 5, 2021 · Big Data

JD's OLAP Architecture: Design, Challenges, and Solutions

This article explains how JD constructs its OLAP platform from data ingestion to storage, querying, and management, describing the diverse data sources, real‑time and offline processing, scalability, consistency, fault tolerance, and future optimization plans, while addressing key technical challenges and solutions.

Big Data · Distributed Systems · JD.com
0 likes · 15 min read
DataFunTalk
May 4, 2021 · Big Data

Design and Implementation of a Real-Time Data Transmission Platform Based on Apache Flink at AutoHome

This article presents the background, requirements, architectural design, component interaction, and implementation details of AutoHome's real‑time data transmission platform built on Apache Flink, highlighting its high availability, exactly‑once semantics, scalability, DDL handling, and integration with existing streaming services.

Apache Flink · Big Data · Data Streaming
0 likes · 18 min read
Top Architect
May 4, 2021 · Big Data

Overview of CDC Tools: Canal, Maxwell, Databus, and Alibaba DTS

This article introduces four change‑data‑capture solutions—Canal, Maxwell, Databus, and Alibaba Data Transmission Service (DTS)—explaining their principles, processing steps, features, and practical advantages for real‑time data synchronization and migration in big‑data environments.

Alibaba DTS · Big Data · CDC
0 likes · 6 min read
Python Crawling & Data Mining
May 4, 2021 · Big Data

Unlock 100+ Free Data APIs with Just 3 Lines of Python

This article introduces the GoPUP library, which provides over a hundred free data interfaces—including social media indexes, macro‑economic figures, company information, and epidemic statistics—accessible with simple Python code, making data analysis faster and easier.

API · Big Data · Python
0 likes · 7 min read
DataFunTalk
May 2, 2021 · Big Data

Continuous Optimization and Practice of Flink at Kuaishou

This article presents Kuaishou's comprehensive engineering practices for improving Flink's stability, task startup latency, and SQL performance, including high‑availability Kafka connectors, fault‑recovery mechanisms, I/O reductions, asynchronous job upgrades, aggregation optimizations, and future resource‑utilization plans.

Big Data · Flink · Kafka
0 likes · 10 min read
Architects' Tech Alliance
May 2, 2021 · Big Data

Understanding Data Middle Platform: Concepts, Drivers, Architecture, and Industry Trends

The article explains the concept of a data middle platform, its role in integrating and centralizing enterprise data, the drivers behind its adoption, architectural layers, implementation challenges, market landscape, and real‑world case studies, highlighting how big‑data, cloud and AI technologies enable digital transformation.

AI · Big Data · Digital Transformation
0 likes · 15 min read
IT Architects Alliance
May 1, 2021 · Big Data

Comprehensive Guide to ELK Stack (Elasticsearch, Logstash, Kibana) Installation, Configuration, and Architecture

This article provides a detailed overview of the ELK stack—including Elasticsearch, Logstash, Kibana, and Beats—explaining its components, why to use it for centralized log management, various deployment architectures, system tuning, security setup, and step‑by‑step installation and configuration commands for a production‑grade environment.

Big Data · ELK · Elasticsearch
0 likes · 22 min read
Programmer DD
Apr 30, 2021 · Big Data

Kafka 2.8.0 Release: Say Goodbye to ZooKeeper with Raft Metadata Mode

Kafka 2.8.0, released on April 19, 2021, introduces an early‑access Raft metadata mode that lets Kafka run without ZooKeeper, alongside new features, bug fixes, and enhancements such as API controls for stream threads, mutual TLS authentication on SASL_SSL listeners, and IP rate limiting.

Big Data · Kafka · Raft
0 likes · 5 min read
Architect
Apr 29, 2021 · Big Data

ELK Stack (Elasticsearch, Logstash, Kibana) Overview, Architecture, Installation, and Configuration Guide (Version 7.7.0)

This article provides a comprehensive introduction to the ELK stack—including component descriptions, architectural diagrams, reasons for adoption, and step‑by‑step installation and configuration of Filebeat, Logstash, Elasticsearch, and Kibana on Linux, with optional Kafka integration for advanced pipelines.

Big Data · ELK · Elasticsearch
0 likes · 22 min read
DataFunTalk
Apr 28, 2021 · Big Data

Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features

This article explains how NVIDIA's RAPIDS Accelerator leverages GPUs to speed up Apache Spark 3.0 workloads, detailing the underlying architecture, benchmark results on TPC‑DS and recommendation models, required configuration changes, supported operators, shuffle optimizations, and the enhancements introduced in versions 0.2 and 0.3.

Apache Spark · Big Data · GPU Acceleration
0 likes · 19 min read
DataFunTalk
Apr 27, 2021 · Big Data

Implementing CDC‑to‑Hudi for Real‑Time Mutable Data in a Big Data System

This article describes how Linkflow migrated mutable customer data from MySQL to an Apache Hudi data lake using Debezium‑in‑Flink CDC, addressing challenges such as snapshot resumability, partial updates, row‑key merging, schema evolution, indexing, and concurrent writes to achieve minute‑level data freshness and improved offline processing performance.

Apache Hudi · Big Data · CDC
0 likes · 21 min read
DataFunTalk
Apr 23, 2021 · Big Data

Building and Evolving Zhihu’s Flink‑Based Data Integration Platform

This article details Zhihu’s transition from a Sqoop‑driven data integration system to a Flink‑centric platform, covering business scenarios, historical architecture, design goals, technology choices, performance optimizations, and future plans for unified streaming‑batch processing across diverse storage systems.

Batch processing · Big Data · Data Integration
0 likes · 14 min read
IT Architects Alliance
Apr 23, 2021 · Industry Insights

Inside Toutiao’s Massive Scale: How the News App Handles Billions of Requests

This article provides an in‑depth technical overview of Toutiao’s rapid growth, data collection pipelines, user modeling, cold‑start strategies, recommendation engine architecture, storage solutions, push notification system, microservice design, and its three‑layer PaaS platform, illustrating how the news app serves hundreds of millions of users daily.

Big Data · System architecture · Toutiao
0 likes · 8 min read
Laravel Tech Community
Apr 22, 2021 · Big Data

Apache Kafka 2.8.0 Release Highlights and New Features

Apache Kafka 2.8.0 introduces several significant enhancements, including a new group API, mutual TLS authentication for SASL_SSL listeners, JSON request/response logging, broker connection rate limiting, topic identifiers, an early‑access self‑managed quorum that replaces ZooKeeper, and numerous improvements to the Streams and Connect APIs for more reliable real‑time data pipelines.

Apache Kafka · Big Data · Distributed Systems
0 likes · 2 min read
Xianyu Technology
Apr 22, 2021 · Big Data

Real-time Performance Optimization of the Mahé Selection and Delivery System

By classifying data streams, aggregating large‑scale T+1 records in six‑hour windows, encoding attributes with multi‑value mappings, storing compressed rule‑hit backups, and synchronizing recall tables in real time, Mahé’s selection‑and‑delivery pipeline cut end‑to‑end latency from minutes to seconds, achieving robust second‑level responsiveness.

Big Data · Performance Optimization · System architecture
0 likes · 12 min read
Big Data Technology & Architecture
Apr 22, 2021 · Big Data

Debunking Common Misconceptions About Data Lakes

This article debunks eight common misconceptions about data lakes, explains why they are not mutually exclusive with data warehouses, clarifies that they are not limited to Hadoop or raw data only, and provides practical tips for building flexible, secure, and business‑driven data lake solutions.

Analytics · Big Data · Cloud Services
0 likes · 21 min read
ITFLY8 Architecture Home
Apr 21, 2021 · Big Data

Designing an Industrial Internet Big Data Platform: Key Strategies

This article presents a comprehensive construction plan for an Industrial Internet big data platform, detailing its overall architecture, data acquisition, edge processing, cloud storage, analytics, security measures, and deployment best practices to enable scalable and reliable industrial IoT solutions.

Big Data · Industrial Internet · IoT
0 likes · 1 min read
JD Tech
Apr 20, 2021 · Databases

Space-Filling Curves for Efficient Multidimensional Data Storage and Querying

This article introduces space-filling curves such as Z‑ordering, Hilbert, and XZ‑Ordering, explaining their mapping algorithms and how they transform multidimensional spatial data into one‑dimensional indices for efficient storage and querying in key‑value databases, while discussing challenges and practical examples.

Big Data · Databases · Space-filling Curves
0 likes · 12 min read
Meituan Technology Team
Apr 15, 2021 · Big Data

Data Governance Practices at Meituan Hotel & Travel Platform

Meituan’s hotel‑travel platform tackled exploding data‑quality, cost, efficiency, and security issues by establishing a full‑link governance framework—standardized processes, a Data Management Committee, and unified “One Model, One Logic, One Service, One Portal” systems—that cut per‑unit costs by ~40%, boosted engineer productivity over 60%, eliminated major security incidents, and set the stage for autonomous, AI‑driven data governance.

Big Data · Data Quality · Data Security
0 likes · 32 min read
TAL Education Technology
Apr 15, 2021 · Artificial Intelligence

Tsinghua University and TAL Launch Phase II Collaboration on Intelligent Education Research

On April 15, Tsinghua University's Computer Science Department and TAL Education's Joint Research Center inaugurated Phase II of their partnership to advance intelligent education through AI-driven teaching environments, interactive mechanisms, knowledge‑graph construction, and personalized assessment technologies.

Artificial Intelligence · Big Data · Collaboration
0 likes · 7 min read
dbaplus Community
Apr 14, 2021 · Big Data

Master Spark Performance: Key Tuning, Shuffle & Join Optimization

This guide compiles practical Spark tuning techniques, covering essential configuration parameters, programming best‑practices, detailed shuffle mechanics, and join optimization strategies, while also addressing common errors and mitigation steps, enabling developers to improve performance and resource utilization in large‑scale data processing jobs.

Big Data · Error Handling · JOIN optimization
0 likes · 25 min read
Programmer DD
Apr 13, 2021 · Big Data

What Makes HDFS the Backbone of Big Data? Overview, Architecture & Key Features

This article provides a comprehensive overview of HDFS—including its design goals, core components, data read/write workflows, high‑availability mechanisms, federation, storage policies, colocation benefits, and practical usage scenarios—explaining why it is the foundational distributed file system for large‑scale data processing.

Big Data · Data Storage · Federation
0 likes · 17 min read
DevOps
Apr 12, 2021 · Fundamentals

Understanding the Digital Economy: Definition, Evolution, and Why It Matters Now

The article explains what the digital economy is, its relationship with digital transformation, the strategic importance placed on it by China's 14th Five‑Year Plan, and offers guidance for IT professionals on how to respond to this emerging national priority.

Artificial Intelligence · Big Data · Digital Economy
0 likes · 14 min read
DataFunTalk
Apr 9, 2021 · Big Data

iQIYI Data Middle Platform: Architecture, Capabilities, and Future Outlook

This article explains how iQIYI’s data middle platform addresses the rapid growth and challenges of big data by providing a unified, standardized, and service‑oriented architecture that includes data production, processing, governance, metadata, AI‑enhanced services, and a roadmap for future enhancements.

AI · Big Data · architecture
0 likes · 23 min read
Top Architect
Apr 9, 2021 · Big Data

Technical Architecture and Data Processing of Toutiao News Feed System

This article provides a comprehensive overview of Toutiao's rapid growth, massive user base, data collection pipelines, user modeling, recommendation engine, storage solutions, message push strategies, micro‑service architecture, and virtualization PaaS platform, illustrating how big‑data technologies enable personalized news delivery at scale.

Big Data · Toutiao · data pipeline
0 likes · 8 min read
Big Data Technology Architecture
Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big Data · Hadoop · Small Files
0 likes · 12 min read
Sohu Tech Products
Apr 7, 2021 · Big Data

Data Warehouse Architecture and Modeling with Alibaba MaxCompute and DataWorks

This tutorial explains how to select a technical architecture, design a three‑layer data warehouse (ODS, CDM, ADS), model tables and dimensions, choose storage strategies, handle slowly changing dimensions, synchronize data with DataWorks, and implement dimensional modeling and fact tables using Alibaba MaxCompute for big‑data analytics.

Big Data · Data Warehouse · DataWorks
0 likes · 32 min read
Big Data Technology Architecture
Apr 5, 2021 · Big Data

Evolution of Real‑Time Data Warehouses: From 1.0 to 3.0 and the Road to Batch‑Stream Unified Architecture

The article reviews the current state of offline Hive‑based data warehouses, explains the emergence of real‑time data warehouses (1.0) built on Kafka and Flink, discusses their limitations, and outlines the progression toward batch‑stream unified architectures (2.0 and 3.0) leveraging data‑lake technologies such as Iceberg.

Batch-Stream Integration · Big Data · Flink
0 likes · 13 min read
Python Crawling & Data Mining
Apr 4, 2021 · Big Data

Mastering User Behavior Analysis: 6 Essential Techniques for Data‑Driven Growth

This article explains six key user‑behavior analysis methods—event analysis, retention analysis, distribution analysis, conversion‑funnel analysis, path analysis, and session analysis—showing how they help businesses understand user actions, optimize product design, improve conversion rates, and boost revenue through data‑driven insights.

Big Data · Retention Analysis · conversion funnel
0 likes · 11 min read
Architect
Apr 3, 2021 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article explains advanced Spark performance tuning techniques, focusing on diagnosing and resolving data skew and shuffle bottlenecks through stage analysis, key distribution inspection, and a variety of practical solutions such as Hive pre‑processing, key filtering, parallelism increase, two‑stage aggregation, map‑join, and combined strategies, while also covering ShuffleManager internals and related configuration parameters.

Big Data · Data Skew · Performance tuning
0 likes · 47 min read
Architect
Apr 2, 2021 · Big Data

Spark Performance Optimization Guide: Development and Resource Tuning

This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.

Big Data · Optimization · RDD
0 likes · 33 min read
Alibaba Cloud Native
Apr 2, 2021 · Cloud Native

How Fluid Turns Kubernetes into a High‑Performance Data Logistics System

This article explains how the open‑source Fluid project addresses the inefficiencies of data‑intensive AI and big‑data workloads in cloud‑native Kubernetes environments by introducing a data‑centric abstraction, dual orchestration mechanisms, and seamless integration with Alluxio to achieve faster, secure, and scalable data access.

Alluxio · Big Data · Data Management
0 likes · 19 min read
DataFunTalk
Mar 29, 2021 · Big Data

Beike's OLAP Platform: Druid Adoption, Architecture, Performance Comparison, and Operational Optimizations

This article details Beike's large‑scale OLAP platform, explaining why Druid was chosen over Kylin, describing the platform's four‑layer architecture, presenting performance and storage benchmarks, and outlining practical improvements to data ingestion, real‑time distinct counting, and cluster stability for high‑concurrency business scenarios.

Big Data · Druid · OLAP
0 likes · 19 min read
Programmer DD
Mar 29, 2021 · Big Data

Mastering Kafka: High‑Throughput Distributed Messaging Explained

This comprehensive guide introduces Kafka as a high‑throughput, distributed, publish‑subscribe messaging system, detailing its core concepts, architecture, features, replication, log management, reliability guarantees, and typical use cases such as log collection, real‑time analytics, and cross‑cluster mirroring.

Big Data · Distributed Messaging · Kafka
0 likes · 15 min read
DataFunTalk
Mar 27, 2021 · Big Data

Kuaishou's HDFS Architecture, Scale, Challenges, and Practices

This article presents an in‑depth technical overview of Kuaishou's massive HDFS deployment, detailing its architecture, petabyte‑scale data and thousands‑of‑node clusters, the key scalability challenges faced, and the custom solutions—including FixedOrder, RBF balancer, observer read, slow‑node mitigation, and tiered protection—implemented to keep the system performant and reliable.

Big Data · Data engineering · HDFS
0 likes · 12 min read
HelloTech
Mar 26, 2021 · Big Data

Data Quality and Interface Semantic Monitoring for Algorithm Testing Platform

The article describes how algorithm testing teams tackled data‑quality and interface‑semantic monitoring problems by building a unified business monitoring platform that checks table, storage and service consistency, validates response semantics, and, through dashboards, alerts and correction tools, quickly identified dozens of offline and online issues, guiding future reliability enhancements.

AI · Big Data · Data Quality
0 likes · 26 min read
iQIYI Technical Product Team
Mar 26, 2021 · Big Data

Evolution of iQIYI's Real-Time Big Data Ecosystem

iQIYI transformed its data infrastructure from a traditional offline T+1 model to a comprehensive real‑time ecosystem—leveraging Kafka, Flink, a three‑layer Stream Data Service Platform, the Talos drag‑and‑drop pipeline, and a Druid‑based analytics platform—to enable low‑latency monitoring, personalized recommendations, ad targeting, and continuous machine‑learning workflows while planning future stream‑batch integration and lake‑warehouse convergence.

Analytics · Big Data · Data Warehouse
0 likes · 13 min read
Ctrip Technology
Mar 25, 2021 · Big Data

Challenges and Approaches for Real‑Time Data Aggregation Analysis

The article examines the key challenges of real‑time data aggregation—data freshness, timely processing, and result visibility—and surveys common solutions such as timestamp‑based sync, CDC, full and incremental computation, storage formats, and trigger mechanisms.

Big Data · CDC · Incremental Computation
0 likes · 11 min read
Suning Technology
Mar 24, 2021 · Big Data

How C2M Is Powering the Industrial Internet Boom in 2021

The article examines how policy‑driven industrial internet initiatives, combined with data‑rich C2M models and AIoT integration, are reshaping manufacturing in China, highlighting Suning's smart‑fridge case, strategic partnerships, and the broader push toward a digital‑first industrial era.

AIoT · Big Data · C2M
0 likes · 8 min read
DataFunTalk
Mar 24, 2021 · Big Data

Practical Experience of Using DorisDB for Real-Time and Offline Analytics in KuJiaLe's Big Data Platform

This article details how KuJiaLe's big data team replaced their legacy ADB and Presto clusters with a DorisDB MPP database, achieving sub‑second query latency, unified real‑time and offline analytics, simplified ETL pipelines, and significant cost savings while supporting billion‑row tables and high‑QPS workloads.

Big Data · DorisDB · ETL
0 likes · 9 min read
DataFunTalk
Mar 21, 2021 · Big Data

Single‑Point Recovery and Regional Checkpoint in Flink: Design, Implementation, and Optimizations

This article presents ByteDance's recent Flink enhancements, detailing a single‑point recovery mechanism for the network layer and a regional checkpoint strategy that together improve failover latency, reduce output loss, and enable scalable, high‑throughput stream processing for large‑scale real‑time recommendation workloads.

Big Data · Checkpoint · Flink
0 likes · 12 min read
dbaplus Community
Mar 20, 2021 · Big Data

How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop

This article details how a bank transformed its retail system data pipeline from a monolithic DB2 setup to a distributed Oracle‑Hadoop architecture, evaluated five extraction tools, selected Sqoop direct mode, and implemented customizations to achieve over 50% performance gains and reliable incremental data capture.

Big Data · Direct Mode · Hadoop
0 likes · 11 min read
Xianyu Technology
Mar 18, 2021 · Backend Development

Multi-Engine Concurrent Search Architecture for Idlefish

Idlefish’s new multi‑engine concurrent search architecture replaces the tightly‑coupled single‑engine pipeline with deep engine isolation, asynchronous multi‑engine recall, and unified result merging, cutting dump build time from 14 h to 5 h, shrinking memory use dramatically, improving latency by only ~15 ms, and boosting exposure by 50 % and orders by 33 %.

Big Data · Lua · Query Planning
0 likes · 10 min read
Sohu Tech Products
Mar 17, 2021 · Big Data

Understanding Simhash: From Traditional Hash to Random Projection LSH

This article explains the principles and implementation of Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, and practical optimizations for large‑scale duplicate detection.

Big Data · Cosine Similarity · Locality Sensitive Hashing
0 likes · 24 min read
dbaplus Community
Mar 16, 2021 · Big Data

How Kuaishou Scales YARN to Tens of Thousands of Nodes with the Kwai Scheduler

This article explains how Kuaishou’s massive offline compute clusters—tens of thousands of machines processing hundreds of petabytes daily—are managed by a heavily customized YARN stack and the home‑grown Kwai Scheduler, detailing architecture, scheduler evolution, multi‑scenario optimizations, and future scaling plans.

Big Data · Cluster Optimization · Kwai Scheduler
0 likes · 14 min read
JD Cloud Developers
Mar 15, 2021 · Artificial Intelligence

Top Tech Weekly: AI Earthquake Monitor, PyTorch 1.8, Language Rankings & More

This developer community weekly roundup highlights CCTV's new big‑data governance platform, RedMonk's programming language rankings, Chromium‑based browsers adopting a four‑week release cycle, PyTorch 1.8 with AMD support, the world’s first AI‑driven earthquake monitoring system, Red Hat OpenShift 4.7, a deep meta‑learning model for city sales prediction, and a CVPR breakthrough in controllable human image generation.

Artificial Intelligence · Big Data · PyTorch
0 likes · 9 min read
DataFunTalk
Mar 15, 2021 · Big Data

Ten Gotchas When Migrating Spark Jobs to Flink

This article shares ten practical pitfalls encountered while moving hour‑level Spark session processing jobs to Apache Flink, covering parallelism skew, state TTL, checkpoint handling, logging, debugging, state migration, Reduce vs Process, input validation, event‑time handling, and the trade‑offs of storing data inside Flink.

Big Data · Flink · Streaming
0 likes · 19 min read
Suning Technology
Mar 13, 2021 · Artificial Intelligence

How Suning’s AI‑Driven Digital Transformation Is Redefining Retail

At the 2021 National Retail CIO Conference in Shanghai, Suning’s Director Wang Junjie detailed the company’s AI, big‑data and cloud‑based three‑step digital transformation strategy, its suite of five mature digital products, and its call for partners to extend these solutions across industries.

Big Data · Cloud Computing · Digital Transformation
0 likes · 4 min read
vivo Internet Technology
Mar 10, 2021 · Big Data

Path Analysis Model Design and Engineering Implementation for Internet Data Operations

The article details the design and engineering of a high‑performance path analysis model for internet data operations, explaining session handling, Sankey visualizations, adjacency‑table storage, multi‑granular session partitioning, Spark‑to‑ClickHouse pipelines, and optimizations that enable billion‑scale user‑path queries in about one second.

Big Data · ClickHouse · OLAP
0 likes · 21 min read
DataFunTalk
Mar 10, 2021 · Big Data

Hive MetaStore Challenges and Optimizations at Kuaishou

At Kuaishou, the Hive MetaStore service, which stores metadata for Hive, faced scalability and performance challenges due to massive dynamic partitions and high query volume, leading to a series of architectural optimizations—including read‑write separation, API enhancements, traffic control, and federation—to improve stability and efficiency.

Big Data · Hive · Kuaishou
0 likes · 15 min read
JD Cloud Developers
Mar 8, 2021 · Artificial Intelligence

Weekly Developer Highlights: Flutter 2, JD Cloud, Flink 1.12.2, AI Breakthroughs

This week’s developer roundup covers Google’s Flutter 2 launch, JD Cloud’s next‑gen server, Apache Flink 1.12.2 bug‑fix release, sidewalk robots classified as pedestrians, Microsoft Mesh mixed‑reality platform, Facebook’s self‑supervised SEER model, plus recent AI research from EMNLP and COLING conferences.

Artificial Intelligence · Big Data · Flutter
0 likes · 8 min read
Top Architect
Mar 5, 2021 · Big Data

Elasticsearch Indexing and Search Optimization: Principles, Lucene Internals, and Performance Tuning

This article explains the architecture and core concepts of Elasticsearch and Lucene, outlines the requirements for cross‑month and high‑speed queries on massive datasets, and provides detailed index and search performance tuning techniques—including bulk writes, shard routing, doc‑values management, and pagination strategies—to achieve sub‑second response times on billions of records.

Big Data · Elasticsearch · Index Optimization
0 likes · 13 min read
Big Data Technology Architecture
Mar 4, 2021 · Big Data

Improving Interactive Analysis on Massive Datasets with Data Clustering and Data Skipping Using Spark and Iceberg

This article explores how data clustering techniques such as linear order, Z‑order, and Hilbert‑curve ordering can be applied in Apache Spark and Apache Iceberg to achieve efficient data skipping on terabyte‑scale tables, dramatically reducing file scans and enabling sub‑second interactive analytics for multi‑dimensional queries.

Big Data · Data Clustering · Data Skipping
0 likes · 20 min read
Suning Technology
Mar 3, 2021 · Big Data

How Can China Build a Secure, Free Data Sharing Ecosystem?

The article examines China's push for free public data sharing, highlighting policy directives, the need for top‑level design, security standards, and education to create a unified, safe data‑governance framework that fuels the digital economy.

Big Data · Digital Economy · data governance
0 likes · 6 min read
21CTO
Mar 2, 2021 · Big Data

How Suning’s Data Platform Unifies OLAP, Metrics, Visualization & Reporting

Suning’s Data Middle Platform integrates an accelerated OLAP engine, a star‑schema metric system, a visualization tool built on standardized dimensions, and a unified report portal to solve data silos, improve security, and enable enterprises to evolve into technology‑driven organizations.

Analytics · Big Data · Data Platform
0 likes · 3 min read
Laravel Tech Community
Feb 28, 2021 · Big Data

Apache Beam 2.28.0 Release Highlights and New Features

Apache Beam 2.28.0 introduces extensive Parquet support, new hash functions in BeamSQL and ZetaSQL, ApproximateDistinct via HLL, and enhanced I/O connectors, including Numeric field support in SpannerIO, schema support in ParquetIO, Thrift support in KafkaTableProvider, and an option to skip key/value cloning in HadoopFormatIO, along with various other improvements.

Apache Beam · Batch · Big Data
0 likes · 3 min read
DataFunTalk
Feb 28, 2021 · Big Data

Migrating Youzan Offline Spark Platform to Kubernetes: Architecture, Optimizations, and Lessons Learned

This article details how Youzan's offline Spark computing platform was transformed for the cloud‑native era by migrating from YARN to Kubernetes, introducing containerization, storage‑compute separation, dynamic allocation, deployment optimizations, and a collection of practical lessons to reduce cost and improve resource utilization.

Big Data · Performance Optimization · Spark
0 likes · 27 min read
TAL Education Technology
Feb 25, 2021 · Databases

ClickHouse Overview: Architecture, Features, Performance, and Practical Use Cases at TAL Education

This article provides a comprehensive overview of ClickHouse, covering its background, core features, columnar storage, vectorized execution engine, table engines, distributed architecture, performance benchmarks, real‑world deployment at TAL Education, monitoring practices, encountered challenges, and future planning.

Big Data · ClickHouse · Columnar Database
0 likes · 18 min read
DataFunTalk
Feb 23, 2021 · Big Data

Meituan Hotel & Travel Data Governance: Journey, Practices, and Future Directions

This article outlines Meituan's hotel‑travel data governance evolution, describing the key quality, cost, security, standardization and efficiency challenges faced as the business scaled, and detailing the organizational, technical, metric, service and product‑entry solutions implemented to achieve systematic, measurable, and automated data governance.

Big Data · Data Security · data governance
0 likes · 19 min read
DataFunTalk
Feb 22, 2021 · Big Data

Optimizing Flink Real-Time Task Resources: Memory and Message Processing Perspectives

This article explores practical methods for optimizing Flink real‑time task resources on Kubernetes, focusing on memory usage analysis via GC logs and message‑processing capacity assessment, proposing automated detection of over‑provisioned memory and CPU, and outlining a workflow for resource adjustment to reduce costs.

Big Data · Flink · GC Analysis
0 likes · 18 min read
dbaplus Community
Feb 18, 2021 · Big Data

How JD Search Scaled Real‑Time Analytics with Flink and Doris

This article details JD Search's journey from a Storm‑based pipeline to a Flink‑driven architecture backed by Apache Doris, covering business requirements, technical challenges, design trade‑offs, performance optimizations for massive traffic spikes, and future plans for their real‑time OLAP data warehouse.

Big Data · Doris · Flink
0 likes · 12 min read
DataFunTalk
Feb 17, 2021 · Big Data

Apache Iceberg 0.11.0: New Partition Support, SortOrder, Flink Streaming Reader, and Ecosystem Integrations

The article details Apache Iceberg 0.11.0's core enhancements—including partition changes, SortOrder, extensive Flink and Spark integrations, CDC/Upsert support, hash‑based write distribution to reduce small files, and upcoming 0.12.0 roadmap—while providing practical SQL and API examples for data‑lake practitioners.

Apache Iceberg · Big Data · CDC
0 likes · 13 min read