Tagged articles

3697 articles


May 16, 2021 · Big Data

Efficient Data Update/Delete and Real‑time Processing in the Arctic Lakehouse System

This article explains the evolution from traditional data warehouses to modern lakehouse architectures, introduces the Arctic system’s dynamic hash tree for fast update/delete, describes file splitting with sequence/offset ordering, and compares copy‑on‑write versus merge‑on‑read techniques for achieving low‑latency analytics.

Arctic · Big Data · Copy-on-Write

0 likes · 12 min read
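As a rough illustration of the copy-on-write versus merge-on-read trade-off mentioned in the summary above, the sketch below models a table as a plain Python dict. It is not Arctic's implementation: `cow_write`, `MergeOnReadTable`, and the use of `None` as a delete marker are hypothetical names and conventions chosen only for this example.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, int]


def cow_write(base: Row, changes: Dict[str, Optional[int]]) -> Row:
    """Copy-on-write: pay the merge cost at write time by producing a new base."""
    new_base = dict(base)  # the whole "file" is rewritten
    for key, value in changes.items():
        if value is None:  # None marks a delete (assumption of this sketch)
            new_base.pop(key, None)
        else:
            new_base[key] = value
    return new_base


class MergeOnReadTable:
    """Merge-on-read: writes append to a delta log; reads merge base and deltas."""

    def __init__(self, base: Row) -> None:
        self.base: Row = dict(base)
        self.delta_log: List[Tuple[str, Optional[int]]] = []

    def write(self, key: str, value: Optional[int]) -> None:
        # Cheap write: nothing is rewritten, the change is only logged.
        self.delta_log.append((key, value))

    def read(self) -> Row:
        # The merge cost is paid here, on every read, until compaction.
        merged = dict(self.base)
        for key, value in self.delta_log:
            if value is None:
                merged.pop(key, None)
            else:
                merged[key] = value
        return merged

    def compact(self) -> None:
        # Fold the delta log into the base so later reads are cheap again.
        self.base = self.read()
        self.delta_log = []
```

In this toy model, copy-on-write keeps reads trivial but rewrites data on every change, while merge-on-read keeps writes cheap and defers the work to reads or to a periodic `compact()`.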


Big Data Technology & Architecture

May 15, 2021 · Big Data

One‑Stop Big Data Platform Construction: Practices from WeBank, Beike, and iQIYI

This article shares practical notes on building a one‑stop big data platform, outlining essential functions such as data extraction, cleaning, storage, analysis, governance, and security, and presents implementation case studies from WeBank, Beike, and iQIYI to illustrate real‑world architectures and solutions.

Big Data · Data Platform · case study

0 likes · 8 min read


Architects Research Society

May 15, 2021 · Big Data

Data Warehouse vs Data Lake: Definitions, Differences, and Architectural Considerations

Data warehouses store structured data centrally for reporting and analysis, while data lakes retain raw data in various formats, offering flexible, low‑cost, schema‑on‑read processing; the article explains their definitions, key differences, common misconceptions, and why many organizations now combine both to enable self‑service big‑data analytics.

Analytics · Big Data · Data Architecture

0 likes · 21 min read


DataFunTalk

May 14, 2021 · Big Data

Real‑time Billion‑Scale Data Transmission and AI Pipeline Architecture at Bilibili

This article presents a technical deep‑dive into Bilibili’s evolution from offline to real‑time data processing, describing the challenges of timeliness, ETL, AI feature engineering, and the design of a Flink‑on‑YARN incremental pipeline that supports trillion‑scale message throughput and AI‑driven real‑time applications.

AI · Big Data · Flink

0 likes · 27 min read


HelloTech

May 14, 2021 · Big Data

User Behavior Analysis System: Architecture, ClickHouse Cluster Deployment, and Analytical Techniques

The article describes a real‑time user behavior analysis platform built on a ClickHouse cluster, detailing its architecture, Hive‑to‑ClickHouse data ingestion with user‑ID routing, table designs for behavior and group data, and five analytical methods—event, funnel, path, retention, and attribution—leveraging shard‑level parallelism and custom functions for high efficiency.

Analytics · Big Data · ClickHouse

0 likes · 20 min read


iQIYI Technical Product Team

May 14, 2021 · Industry Insights

How iQIYI Merges AI, Big Data, and Cloud to Revolutionize Entertainment Production

In a keynote at the 2021 iQIYI World Conference, the CTO outlined how AI, big data, and cloud computing power three intelligent production suites, interactive user features, and immersive XR live concerts, illustrating the company’s tech‑driven strategy to reshape entertainment creation and consumption.

AI · Big Data · Cloud Computing

0 likes · 9 min read


ITPUB

May 14, 2021 · Big Data

How AnalyticDB Powers Petabyte-Scale Consumer Analytics in Alibaba’s Data Bank

The article details how Alibaba’s Data Bank leverages AnalyticDB’s cold‑hot tiered storage, high‑throughput real‑time writes, and low‑latency OLAP capabilities to handle petabyte‑scale consumer data, support flexible AIPL analysis, crowd profiling, and rapid audience selection while cutting costs and ensuring elasticity during peak events.

AnalyticDB · Big Data · Cold-Hot Storage

0 likes · 14 min read


Volcano Engine Developer Services

May 13, 2021 · Databases

Inside ByteGraph: How ByteDance Built a Scalable Distributed Graph Database

The article offers a comprehensive technical deep‑dive into ByteDance’s home‑grown distributed graph database and graph‑processing engine, ByteGraph, covering its directed‑property graph model, Gremlin query support, multi‑layer architecture, storage strategies for massive data, and real‑world graph‑computing practices.

Big Data · ByteGraph · Graph Database

0 likes · 28 min read


JD Retail Technology

May 13, 2021 · Big Data

Evolution and Architecture of JD.com Self‑Operated Rebate Platform

The article details the development, challenges, and redesign of JD.com’s self‑operated rebate system, describing its early monolithic architecture, data‑intensive processing pipeline, migration to a modular, high‑availability platform built on Spark, Hive, and Elasticsearch, and the resulting performance and operational improvements.

Big Data · ETL · Spark

0 likes · 16 min read


DataFunTalk

May 12, 2021 · Big Data

Building a Unified Real‑Time and Offline OLAP Platform with DorisDB at Yuanfudao

The article describes how Yuanfudao's data middle platform built a high‑performance OLAP service using the MPP HOLAP engine DorisDB to unify real‑time and batch analytics, meet low‑latency and high‑concurrency requirements, and support diverse education‑industry use cases such as live‑stream monitoring, advertising, and order analytics.

Big Data · DorisDB · Education Technology

0 likes · 13 min read


Tencent Advertising Technology

May 12, 2021 · Artificial Intelligence

2021 Tencent Advertising Algorithm Competition Live Streams and Technical Insights

The 2021 Tencent Advertising Algorithm Competition featured live streams on May 10-12, 2021, with experts discussing the competition's technical aspects and practical applications of the Angel distributed machine learning framework.

AI · Big Data · machine learning

0 likes · 4 min read


Tencent Tech

May 12, 2021 · Big Data

How Tencent Powered China’s 7th Census with Big Data and Cloud Tech

The article explains how China’s seventh national census, covering 1.41 billion people, was conducted using fully electronic data collection, self‑service mini‑programs, massive cloud‑native infrastructure, and high‑performance databases to achieve real‑time processing and unprecedented scale.

Big Data · Databases · census

0 likes · 8 min read


Yuanfudao Tech

May 12, 2021 · Databases

Building a Unified Real‑time and Offline OLAP Platform with DorisDB at Yuanfudao

Yuanfudao's data middle platform leverages the MPP database DorisDB to create a unified OLAP system that supports both real‑time and batch analytics, handling millions of queries daily with sub‑second latency while meeting complex business requirements across its education services.

Big Data · Data Warehouse · Database

0 likes · 12 min read

DataFunTalk

May 11, 2021 · Big Data

Design and Practice of Baixin Bank's Flink‑Based Real‑Time Computing Platform and Hudi‑Powered Real‑Time Data Lake

This article details Baixin Bank's construction of a Flink‑driven real‑time computing platform integrated with Hudi as a real‑time data lake, covering background, architecture, data collection, transformation, storage layers, technical challenges, future roadmap, and practical lessons for similar big‑data initiatives.

Big Data · Data engineering · Flink

0 likes · 12 min read


Big Data Technology & Architecture

May 11, 2021 · Big Data

Data Quality: Dimensions, Rules, and Constraints

The article explains the importance of data quality in the big data era, defines key quality dimensions such as completeness, uniqueness, validity, consistency, accuracy, timeliness, and credibility, and details how each dimension can be measured and enforced through specific constraints and validation rules.

Accuracy · Big Data · Consistency

0 likes · 9 min read


Alibaba Cloud Native

May 10, 2021 · Cloud Native

What Is Fluid? A Cloud‑Native Data Orchestration and Acceleration Platform

Fluid is an open‑source cloud‑native data orchestration and acceleration system that runs on Kubernetes, offering storage‑agnostic datasets, distributed caching, intelligent scheduling, and performance optimizations for data‑intensive AI and big‑data workloads.

AI · Big Data · Data Orchestration

0 likes · 6 min read


Architects Research Society

May 9, 2021 · Big Data

Data Lakes vs. Data Warehouses: Key Differences and Choosing the Right Approach

This article explains the fundamental distinctions between data lakes and data warehouses, outlines five critical differences—including data retention, type support, user support, adaptability, and insight speed—and offers guidance on selecting the appropriate solution based on organizational needs and technology options.

Analytics · Big Data · Data Architecture

0 likes · 12 min read


Architecture Digest

May 7, 2021 · Big Data

Comprehensive Overview of Data Middle Platform Architecture and Practices

This article provides a detailed introduction to data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, monitoring, and deployment patterns, illustrating how enterprises build unified data ecosystems across various industries.

Big Data · Data Platform · Data Warehouse

0 likes · 25 min read


Qu Tech

May 6, 2021 · Big Data

How JuiceFS Cut HDFS Load by 26% and Boosted Presto Query Speed by 13%

This case study details how integrating JuiceFS with Presto reduced HDFS cluster load by about 26%, achieved over 90% cache hit rate for ad‑hoc queries, and lowered average query latency by roughly 13%, while simplifying operations and improving system stability.

Big Data · Cache · HDFS

0 likes · 9 min read


21CTO

May 5, 2021 · Big Data

AWS Unveils EMR Studio IDE for Data Scientists, Highlights Linux Kernel Security

AWS introduces a new EMR Studio IDE to accelerate data science workflows, while the Linux community bans University of Minnesota contributions over malicious patches and Google Chrome adopts Intel‑Microsoft hardware‑enforced stack protection to harden browser security.

Big Data · CET · Chrome

0 likes · 6 min read


DataFunTalk

May 5, 2021 · Big Data

JD's OLAP Architecture: Design, Challenges, and Solutions

This article explains how JD constructs its OLAP platform from data ingestion to storage, querying, and management, describing the diverse data sources, real‑time and offline processing, scalability, consistency, fault tolerance, and future optimization plans, while addressing key technical challenges and solutions.

Big Data · Distributed Systems · JD.com

0 likes · 15 min read


DataFunTalk

May 4, 2021 · Big Data

Design and Implementation of a Real-Time Data Transmission Platform Based on Apache Flink at AutoHome

This article presents the background, requirements, architectural design, component interaction, and implementation details of AutoHome's real‑time data transmission platform built on Apache Flink, highlighting its high availability, exactly‑once semantics, scalability, DDL handling, and integration with existing streaming services.

Apache Flink · Big Data · Data Streaming

0 likes · 18 min read


Top Architect

May 4, 2021 · Big Data

Overview of CDC Tools: Canal, Maxwell, Databus, and Alibaba DTS

This article introduces four change‑data‑capture solutions—Canal, Maxwell, Databus, and Alibaba Data Transmission Service (DTS)—explaining their principles, processing steps, features, and practical advantages for real‑time data synchronization and migration in big‑data environments.

Alibaba DTS · Big Data · CDC

0 likes · 6 min read


Python Crawling & Data Mining

May 4, 2021 · Big Data

Unlock 100+ Free Data APIs with Just 3 Lines of Python

This article introduces the GoPUP library, which provides over a hundred free data interfaces—including social media indexes, macro‑economic figures, company information, and epidemic statistics—accessible with simple Python code, making data analysis faster and easier.

API · Big Data · Python

0 likes · 7 min read


DataFunTalk

May 2, 2021 · Big Data

Continuous Optimization and Practice of Flink at Kuaishou

This article presents Kuaishou's comprehensive engineering practices for improving Flink's stability, task startup latency, and SQL performance, including high‑availability Kafka connectors, fault‑recovery mechanisms, I/O reductions, asynchronous job upgrades, aggregation optimizations, and future resource‑utilization plans.

Big Data · Flink · Kafka

0 likes · 10 min read


Architects' Tech Alliance

May 2, 2021 · Big Data

Understanding Data Middle Platform: Concepts, Drivers, Architecture, and Industry Trends

The article explains the concept of a data middle platform, its role in integrating and centralizing enterprise data, the drivers behind its adoption, architectural layers, implementation challenges, market landscape, and real‑world case studies, highlighting how big‑data, cloud and AI technologies enable digital transformation.

AI · Big Data · Digital Transformation

0 likes · 15 min read


IT Architects Alliance

May 1, 2021 · Big Data

Comprehensive Guide to ELK Stack (Elasticsearch, Logstash, Kibana) Installation, Configuration, and Architecture

This article provides a detailed overview of the ELK stack—including Elasticsearch, Logstash, Kibana, and Beats—explaining its components, why to use it for centralized log management, various deployment architectures, system tuning, security setup, and step‑by‑step installation and configuration commands for a production‑grade environment.

Big Data · ELK · Elasticsearch

0 likes · 22 min read


Programmer DD

Apr 30, 2021 · Big Data

Kafka 2.8.0 Release: Say Goodbye to ZooKeeper with Raft Metadata Mode

Kafka 2.8.0, released on April 19, 2021, introduces an early-access Raft metadata mode that lets Kafka run without ZooKeeper, alongside numerous new features, bug fixes, and enhancements such as API controls for stream threads, mutual TLS on SASL_SSL listeners, and IP rate limiting.

Big Data · Kafka · Raft

0 likes · 5 min read


Tencent Cloud Developer

Apr 29, 2021 · Industry Insights

Future of Databases & Big Data: Insights from the First Techo TVP Summit

The inaugural Techo TVP Developer Summit in Shenzhen gathered over 500 developers to explore the latest trends in databases, distributed systems, big data, and cloud‑native technologies, offering expert analyses, real‑world case studies, and career guidance for data professionals.

Big Data · Data engineering · Databases

0 likes · 19 min read


Architect

Apr 29, 2021 · Big Data

ELK Stack (Elasticsearch, Logstash, Kibana) Overview, Architecture, Installation, and Configuration Guide (Version 7.7.0)

This article provides a comprehensive introduction to the ELK stack—including component descriptions, architectural diagrams, reasons for adoption, and step‑by‑step installation and configuration of Filebeat, Logstash, Elasticsearch, and Kibana on Linux, with optional Kafka integration for advanced pipelines.

Big Data · ELK · Elasticsearch

0 likes · 22 min read


DataFunTalk

Apr 28, 2021 · Big Data

Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features

This article explains how NVIDIA's RAPIDS Accelerator leverages GPUs to speed up Apache Spark 3.0 workloads, detailing the underlying architecture, benchmark results on TPC‑DS and recommendation models, required configuration changes, supported operators, shuffle optimizations, and the enhancements introduced in versions 0.2 and 0.3.

Apache Spark · Big Data · GPU Acceleration

0 likes · 19 min read


Practical DevOps Architecture

Apr 28, 2021 · Big Data

Step-by-Step Hadoop Environment Setup and Configuration on Three Linux Servers

This guide walks through preparing three Linux servers, installing JDK 1.8, configuring Hadoop core, HDFS, MapReduce, and YARN XML files, setting Java environment variables, formatting HDFS, and starting all services to access the Hadoop web UI.

Big Data · HDFS · Hadoop

0 likes · 4 min read


DataFunTalk

Apr 27, 2021 · Big Data

Implementing CDC‑to‑Hudi for Real‑Time Mutable Data in a Big Data System

This article describes how Linkflow migrated mutable customer data from MySQL to an Apache Hudi data lake using Debezium‑in‑Flink CDC, addressing challenges such as snapshot resumability, partial updates, row‑key merging, schema evolution, indexing, and concurrent writes to achieve minute‑level data freshness and improved offline processing performance.

Apache Hudi · Big Data · CDC

0 likes · 21 min read


DataFunTalk

Apr 26, 2021 · Big Data

Detailed Design and Practical Application of Apache Iceberg at NetEase Cloud Music

This article explains the motivations behind Apache Iceberg, its design principles such as snapshot and MVCC, compares it with Hive, and describes how NetEase Cloud Music adopted Iceberg to improve metadata handling, query performance, and operational stability for massive daily log data.

Apache Iceberg · Big Data · Data Lake

0 likes · 13 min read


Tencent Advertising Technology

Apr 26, 2021 · Artificial Intelligence

Tencent Ad Algorithm Competition and Its Academic Recognition

The Tencent Ad Algorithm Competition, now in its fourth edition, has gained significant academic recognition by aligning with the ACM MM Grand Challenge, introducing new tracks in video advertising technology to address multimedia challenges in the 5G era.

5G Technology · ACM MM · Big Data

0 likes · 3 min read


DataFunTalk

Apr 23, 2021 · Big Data

Building and Evolving Zhihu’s Flink‑Based Data Integration Platform

This article details Zhihu’s transition from a Sqoop‑driven data integration system to a Flink‑centric platform, covering business scenarios, historical architecture, design goals, technology choices, performance optimizations, and future plans for unified streaming‑batch processing across diverse storage systems.

Batch processing · Big Data · Data Integration

0 likes · 14 min read


IT Architects Alliance

Apr 23, 2021 · Industry Insights

Inside Toutiao’s Massive Scale: How the News App Handles Billions of Requests

This article provides an in‑depth technical overview of Toutiao’s rapid growth, data collection pipelines, user modeling, cold‑start strategies, recommendation engine architecture, storage solutions, push notification system, microservice design, and its three‑layer PaaS platform, illustrating how the news app serves hundreds of millions of users daily.

Big Data · System architecture · Toutiao

0 likes · 8 min read


Laravel Tech Community

Apr 22, 2021 · Big Data

Apache Kafka 2.8.0 Release Highlights and New Features

Apache Kafka 2.8.0 introduces several significant enhancements, including a new group API, mutual TLS authentication for SASL_SSL listeners, JSON request/response logging, broker connection rate limiting, topic identifiers, self‑managed quorum replacing ZooKeeper, and numerous improvements to Streams and Connect APIs for more reliable real‑time data pipelines.

Apache Kafka · Big Data · Distributed Systems

0 likes · 2 min read


Xianyu Technology

Apr 22, 2021 · Big Data

Real-time Performance Optimization of the Mahé Selection and Delivery System

By classifying data streams, aggregating large‑scale T+1 records in six‑hour windows, encoding attributes with multi‑value mappings, storing compressed rule‑hit backups, and synchronizing recall tables in real time, Mahé’s selection‑and‑delivery pipeline cut end‑to‑end latency from minutes to seconds, achieving robust second‑level responsiveness.

Big Data · Performance Optimization · System architecture

0 likes · 12 min read


Big Data Technology & Architecture

Apr 22, 2021 · Big Data

Debunking Common Misconceptions About Data Lakes

This article debunks eight common misconceptions about data lakes, explains why they are not mutually exclusive with data warehouses, clarifies that they are not limited to Hadoop or raw data only, and provides practical tips for building flexible, secure, and business‑driven data lake solutions.

Analytics · Big Data · Cloud Services

0 likes · 21 min read


ITFLY8 Architecture Home

Apr 21, 2021 · Big Data

Designing an Industrial Internet Big Data Platform: Key Strategies

This article presents a comprehensive construction plan for an Industrial Internet big data platform, detailing its overall architecture, data acquisition, edge processing, cloud storage, analytics, security measures, and deployment best practices to enable scalable and reliable industrial IoT solutions.

Big Data · Industrial Internet · IoT

0 likes · 1 min read


Full-Stack Internet Architecture

Apr 20, 2021 · Big Data

Building Near Real-Time Elasticsearch Indexes for PB‑Scale Data

This article explains how to construct near real‑time Elasticsearch indexes for petabyte‑level datasets by comparing MySQL limitations, describing Elasticsearch fundamentals, and detailing a pipeline that uses Hive, wide tables, MySQL binlog, Canal, and Otter to achieve second‑level index updates.

Big Data · Canal · Elasticsearch

0 likes · 18 min read


JD Tech

Apr 20, 2021 · Databases

Space-Filling Curves for Efficient Multidimensional Data Storage and Querying

This article introduces space-filling curves such as Z‑ordering, Hilbert, and XZ‑Ordering, explaining their mapping algorithms and how they transform multidimensional spatial data into one‑dimensional indices for efficient storage and querying in key‑value databases, while discussing challenges and practical examples.

Big Data · Databases · Space-filling Curves

0 likes · 12 min read
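To make the idea of mapping multidimensional data onto a one-dimensional index concrete, here is a minimal sketch of Z-ordering (Morton encoding) for two non-negative integer coordinates. It is a generic illustration, not code from the article: the function names `morton_encode` and `morton_decode` and the 16-bit default are assumptions of this example.

```python
from typing import Tuple


def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits land on even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits land on odd positions
    return code


def morton_decode(code: int, bits: int = 16) -> Tuple[int, int]:
    """Recover (x, y) from a Z-order code produced by morton_encode."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y


if __name__ == "__main__":
    # The four cells of a 2x2 grid receive consecutive codes, tracing a "Z".
    for point in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print(point, morton_encode(*point))
```

Because the resulting code is a plain integer, it can serve as a sortable row key in a key-value store; a range scan over codes then covers a region of space, at the cost of also touching some cells outside the query rectangle.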


DataFunTalk

Apr 17, 2021 · Big Data

Evolution of Beike's OLAP Platform Architecture: From Hive‑MySQL to Multi‑Engine Support

This article reviews the evolution of Beike's OLAP platform—from the early Hive‑to‑MySQL stage, through a Kylin‑based architecture, to a flexible multi‑engine solution—detailing the design choices, metric system, engine selection criteria, encountered challenges, and future development plans.

Analytics · Big Data · Druid

0 likes · 24 min read


Meituan Technology Team

Apr 15, 2021 · Big Data

Data Governance Practices at Meituan Hotel & Travel Platform

Meituan’s hotel‑travel platform tackled exploding data‑quality, cost, efficiency, and security issues by establishing a full‑link governance framework—standardized processes, a Data Management Committee, and unified “One Model, One Logic, One Service, One Portal” systems—that cut per‑unit costs by ~40%, boosted engineer productivity over 60%, eliminated major security incidents, and set the stage for autonomous, AI‑driven data governance.

Big Data · Data Quality · Data Security

0 likes · 32 min read


TAL Education Technology

Apr 15, 2021 · Artificial Intelligence

Tsinghua University and TAL Launch Phase II Collaboration on Intelligent Education Research

On April 15, Tsinghua University's Computer Science Department and TAL Education's Joint Research Center inaugurated Phase II of their partnership to advance intelligent education through AI-driven teaching environments, interactive mechanisms, knowledge‑graph construction, and personalized assessment technologies.

Artificial Intelligence · Big Data · Collaboration

0 likes · 7 min read


dbaplus Community

Apr 14, 2021 · Big Data

Master Spark Performance: Key Tuning, Shuffle & Join Optimization

This guide compiles practical Spark tuning techniques, covering essential configuration parameters, programming best‑practices, detailed shuffle mechanics, and join optimization strategies, while also addressing common errors and mitigation steps, enabling developers to improve performance and resource utilization in large‑scale data processing jobs.

Big Data · Error Handling · JOIN optimization

0 likes · 25 min read


Programmer DD

Apr 14, 2021 · Big Data

Understanding HDFS Architecture: Key Components, Protocols, and Limitations

This article explains HDFS’s master‑slave architecture, detailing the roles of NameNode and DataNode, namespace management, communication protocols, client functions, common configuration parameters, maintenance commands, and the inherent limitations of a single‑NameNode design.

Big Data · DataNode · HDFS

0 likes · 5 min read


Programmer DD

Apr 13, 2021 · Big Data

What Makes HDFS the Backbone of Big Data? Overview, Architecture & Key Features

This article provides a comprehensive overview of HDFS—including its design goals, core components, data read/write workflows, high‑availability mechanisms, federation, storage policies, colocation benefits, and practical usage scenarios—explaining why it is the foundational distributed file system for large‑scale data processing.

Big Data · Data Storage · Federation

0 likes · 17 min read


DevOps

Apr 12, 2021 · Fundamentals

Understanding the Digital Economy: Definition, Evolution, and Why It Matters Now

The article explains what the digital economy is, its relationship with digital transformation, the strategic importance placed on it by China's 14th Five‑Year Plan, and offers guidance for IT professionals on how to respond to this emerging national priority.

Artificial Intelligence · Big Data · Digital Economy

0 likes · 14 min read


DataFunTalk

Apr 9, 2021 · Big Data

iQIYI Data Middle Platform: Architecture, Capabilities, and Future Outlook

This article explains how iQIYI’s data middle platform addresses the rapid growth and challenges of big data by providing a unified, standardized, and service‑oriented architecture that includes data production, processing, governance, metadata, AI‑enhanced services, and a roadmap for future enhancements.

AI · Big Data · architecture

0 likes · 23 min read


Top Architect

Apr 9, 2021 · Big Data

Technical Architecture and Data Processing of Toutiao News Feed System

This article provides a comprehensive overview of Toutiao's rapid growth, massive user base, data collection pipelines, user modeling, recommendation engine, storage solutions, message push strategies, micro‑service architecture, and virtualization PaaS platform, illustrating how big‑data technologies enable personalized news delivery at scale.

Big Data · Toutiao · data pipeline

0 likes · 8 min read


Big Data Technology Architecture

Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big Data · Hadoop · Small Files

0 likes · 12 min read


Sohu Tech Products

Apr 7, 2021 · Big Data

Data Warehouse Architecture and Modeling with Alibaba MaxCompute and DataWorks

This tutorial explains how to select a technical architecture, design a three‑layer data warehouse (ODS, CDM, ADS), model tables and dimensions, choose storage strategies, handle slowly changing dimensions, synchronize data with DataWorks, and implement dimensional modeling and fact tables using Alibaba MaxCompute for big‑data analytics.

Big Data · Data Warehouse · DataWorks

0 likes · 32 min read


Big Data Technology Architecture

Apr 5, 2021 · Big Data

Evolution of Real‑Time Data Warehouses: From 1.0 to 3.0 and the Road to Batch‑Stream Unified Architecture

The article reviews the current state of offline Hive‑based data warehouses, explains the emergence of real‑time data warehouses (1.0) built on Kafka and Flink, discusses their limitations, and outlines the progression toward batch‑stream unified architectures (2.0 and 3.0) leveraging data‑lake technologies such as Iceberg.

Batch-Stream Integration · Big Data · Flink

0 likes · 13 min read


Python Crawling & Data Mining

Apr 4, 2021 · Big Data

Mastering User Behavior Analysis: 6 Essential Techniques for Data‑Driven Growth

This article explains six key user‑behavior analysis methods—event analysis, retention analysis, distribution analysis, conversion‑funnel analysis, path analysis, and session analysis—showing how they help businesses understand user actions, optimize product design, improve conversion rates, and boost revenue through data‑driven insights.

Big Data · Retention Analysis · conversion funnel

0 likes · 11 min read

Architect

Apr 3, 2021 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article explains advanced Spark performance tuning techniques, focusing on diagnosing and resolving data skew and shuffle bottlenecks through stage analysis, key distribution inspection, and a variety of practical solutions such as Hive pre‑processing, key filtering, parallelism increase, two‑stage aggregation, map‑join, and combined strategies, while also covering ShuffleManager internals and related configuration parameters.

Big Data · Data Skew · Performance tuning

0 likes · 47 min read

Architect

Apr 2, 2021 · Big Data

Spark Performance Optimization Guide: Development and Resource Tuning

This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.

Big Data · Optimization · RDD

0 likes · 33 min read

Alibaba Cloud Native

Apr 2, 2021 · Cloud Native

How Fluid Turns Kubernetes into a High‑Performance Data Logistics System

This article explains how the open‑source Fluid project addresses the inefficiencies of data‑intensive AI and big‑data workloads in cloud‑native Kubernetes environments by introducing a data‑centric abstraction, dual orchestration mechanisms, and seamless integration with Alluxio to achieve faster, secure, and scalable data access.

Alluxio · Big Data · Data Management

0 likes · 19 min read

Ctrip Technology

Apr 1, 2021 · Big Data

Design and Implementation of a Binlog‑Based Real‑Time Data Foundation Layer for Ctrip Finance

This article describes how Ctrip Finance built a unified financial data center by collecting MySQL binlog streams with Canal, transporting them via Kafka, persisting to HDFS with Spark‑Streaming, and merging into Hive tables, while addressing performance, idempotency, delete handling, and data‑quality checks.

Big Data · binlog · data pipeline

0 likes · 14 min read

DataFunTalk

Mar 29, 2021 · Big Data

Beike's OLAP Platform: Druid Adoption, Architecture, Performance Comparison, and Operational Optimizations

This article details Beike's large‑scale OLAP platform, explaining why Druid was chosen over Kylin, describing the platform's four‑layer architecture, presenting performance and storage benchmarks, and outlining practical improvements to data ingestion, real‑time distinct counting, and cluster stability for high‑concurrency business scenarios.

Big Data · Druid · OLAP

0 likes · 19 min read

Programmer DD

Mar 29, 2021 · Big Data

Mastering Kafka: High‑Throughput Distributed Messaging Explained

This comprehensive guide introduces Kafka as a high‑throughput, distributed, publish‑subscribe messaging system, detailing its core concepts, architecture, features, replication, log management, reliability guarantees, and typical use cases such as log collection, real‑time analytics, and cross‑cluster mirroring.

Big Data · Distributed Messaging · Kafka

0 likes · 15 min read

DataFunTalk

Mar 27, 2021 · Big Data

Kuaishou's HDFS Architecture, Scale, Challenges, and Practices

This article presents an in‑depth technical overview of Kuaishou's massive HDFS deployment, detailing its architecture, petabyte‑scale data and thousands‑of‑node clusters, the key scalability challenges faced, and the custom solutions—including FixedOrder, RBF balancer, observer read, slow‑node mitigation, and tiered protection—implemented to keep the system performant and reliable.

Big Data · Data engineering · HDFS

0 likes · 12 min read

HelloTech

Mar 26, 2021 · Big Data

Data Quality and Interface Semantic Monitoring for Algorithm Testing Platform

The article describes how algorithm testing teams tackled data‑quality and interface‑semantic monitoring problems by building a unified business monitoring platform that checks table, storage and service consistency, validates response semantics, and, through dashboards, alerts and correction tools, quickly identified dozens of offline and online issues, guiding future reliability enhancements.

AI · Big Data · Data Quality

0 likes · 26 min read

iQIYI Technical Product Team

Mar 26, 2021 · Big Data

Evolution of iQIYI's Real-Time Big Data Ecosystem

iQIYI transformed its data infrastructure from a traditional offline T+1 model to a comprehensive real‑time ecosystem—leveraging Kafka, Flink, a three‑layer Stream Data Service Platform, the Talos drag‑and‑drop pipeline, and a Druid‑based analytics platform—to enable low‑latency monitoring, personalized recommendations, ad targeting, and continuous machine‑learning workflows while planning future stream‑batch integration and lake‑warehouse convergence.

Analytics · Big Data · Data Warehouse

0 likes · 13 min read

Ctrip Technology

Mar 25, 2021 · Big Data

Challenges and Approaches for Real‑Time Data Aggregation Analysis

The article examines the key challenges of real‑time data aggregation—data freshness, timely processing, and result visibility—and surveys common solutions such as timestamp‑based sync, CDC, full and incremental computation, storage formats, and trigger mechanisms.

Big Data · CDC · Incremental Computation

0 likes · 11 min read

Suning Technology

Mar 24, 2021 · Big Data

How C2M Is Powering the Industrial Internet Boom in 2021

The article examines how policy‑driven industrial internet initiatives, combined with data‑rich C2M models and AIoT integration, are reshaping manufacturing in China, highlighting Suning's smart‑fridge case, strategic partnerships, and the broader push toward a digital‑first industrial era.

AIoT · Big Data · C2M

0 likes · 8 min read

ITFLY8 Architecture Home

Mar 24, 2021 · Big Data

Inside Suning’s Data Platform: How OLAP, Metrics and Visualization Power Business

Suning’s data middle platform integrates an accelerated OLAP engine, a star‑schema metrics system, a standardized visualization tool, and a unified report portal to break data silos, enhance security, and transform traditional enterprises into technology‑driven businesses.

Big Data · Metrics · OLAP

0 likes · 3 min read

DataFunTalk

Mar 24, 2021 · Big Data

Practical Experience of Using DorisDB for Real-Time and Offline Analytics in KuJiaLe's Big Data Platform

This article details how KuJiaLe's big data team replaced their legacy ADB and Presto clusters with a DorisDB MPP database, achieving sub‑second query latency, unified real‑time and offline analytics, simplified ETL pipelines, and significant cost savings while supporting billion‑row tables and high‑QPS workloads.

Big Data · DorisDB · ETL

0 likes · 9 min read

AntTech

Mar 23, 2021 · Big Data

From MapReduce to Ray: The Evolution of Big Data Computing Engines and Career Opportunities

This article traces the history of big‑data computing engines—from early MapReduce and Hadoop through Spark, Storm, Flink, and the newer Ray—explaining their technical advances, real‑world applications in AI and finance, and why graduates should consider a career in this rapidly evolving field.

AI · Big Data · Distributed computing

0 likes · 16 min read

DataFunTalk

Mar 21, 2021 · Big Data

Single‑Point Recovery and Regional Checkpoint in Flink: Design, Implementation, and Optimizations

This article presents ByteDance's recent Flink enhancements, detailing a single‑point recovery mechanism for the network layer and a regional checkpoint strategy that together improve failover latency, reduce output loss, and enable scalable, high‑throughput stream processing for large‑scale real‑time recommendation workloads.

Big Data · Checkpoint · Flink

0 likes · 12 min read

Architect's Alchemy Furnace

Mar 20, 2021 · Databases

Boost Elasticsearch Performance with Hot‑Cold Data Node Separation

This article explains how to configure Elasticsearch nodes for hot and cold data, assign special node attributes, adjust index templates, and use API calls to migrate data, demonstrating significant query speed improvements through real‑world performance tests.

Big Data · Elasticsearch · Node Configuration

0 likes · 8 min read

dbaplus Community

Mar 20, 2021 · Big Data

How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop

This article details how a bank transformed its retail system data pipeline from a monolithic DB2 setup to a distributed Oracle‑Hadoop architecture, evaluated five extraction tools, selected Sqoop direct mode, and implemented customizations to achieve over 50% performance gains and reliable incremental data capture.

Big Data · Direct Mode · Hadoop

0 likes · 11 min read

Alibaba Terminal Technology

Mar 19, 2021 · Frontend Development

How Alibaba’s Frontend AI Boosts Developer Efficiency on the Feitian Big Data Platform

This article explores Alibaba Cloud's Feitian big data platform and its front‑end intelligent solutions—covering smart editors, code recommendation, code diagnostics, automated visualization, and algorithm engineering—to illustrate how AI enhances developer productivity and product intelligence.

AI · Alibaba Cloud · Big Data

0 likes · 9 min read

AntTech

Mar 19, 2021 · Artificial Intelligence

Network Effects in Marketing: Graph Neural Network–Based Relationship Prediction and Clustered A/B Testing

This article presents a graph‑neural‑network approach to predict user influence, cluster users with distributed Louvain methods, and conduct network‑aware A/B experiments that accurately evaluate large‑scale marketing campaigns despite strong network effects.

A/B testing · Big Data · Graph Neural Network

0 likes · 9 min read

Suning Technology

Mar 18, 2021 · Operations

How Suning Carrefour Accelerated Digital Transformation: Lessons in Operations and AI

Suning Carrefour’s rapid digital overhaul since joining Suning in 2019 showcases how AI, big data, and omni‑channel strategies can boost store efficiency, reshape business models, integrate supply chains, and drive high‑growth retail performance.

AI · Big Data · Digital Transformation

0 likes · 9 min read

Xianyu Technology

Mar 18, 2021 · Backend Development

Multi-Engine Concurrent Search Architecture for Idlefish

Idlefish’s new multi‑engine concurrent search architecture replaces the tightly‑coupled single‑engine pipeline with deep engine isolation, asynchronous multi‑engine recall, and unified result merging, cutting dump build time from 14 h to 5 h, shrinking memory use dramatically, improving latency by only ~15 ms, and boosting exposure by 50 % and orders by 33 %.

Big Data · Lua · Query Planning

0 likes · 10 min read

Sohu Tech Products

Mar 17, 2021 · Big Data

Understanding Simhash: From Traditional Hash to Random Projection LSH

This article explains the principles and implementation of Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, and practical optimizations for large‑scale duplicate detection.

Big Data · Cosine Similarity · Locality Sensitive Hashing

0 likes · 24 min read

dbaplus Community

Mar 16, 2021 · Big Data

How Kuaishou Scales YARN to Tens of Thousands of Nodes with the Kwai Scheduler

This article explains how Kuaishou’s massive offline compute clusters—tens of thousands of machines processing hundreds of petabytes daily—are managed by a heavily customized YARN stack and the home‑grown Kwai Scheduler, detailing architecture, scheduler evolution, multi‑scenario optimizations, and future scaling plans.

Big Data · Cluster Optimization · Kwai Scheduler

0 likes · 14 min read

JD Cloud Developers

Mar 15, 2021 · Artificial Intelligence

Top Tech Weekly: AI Earthquake Monitor, PyTorch 1.8, Language Rankings & More

This developer community weekly roundup highlights CCTV's new big‑data governance platform, RedMonk's programming language rankings, Chromium‑based browsers adopting a four‑week release cycle, PyTorch 1.8 with AMD support, the world’s first AI‑driven earthquake monitoring system, Red Hat OpenShift 4.7, a deep meta‑learning model for city sales prediction, and a CVPR breakthrough in controllable human image generation.

Artificial Intelligence · Big Data · PyTorch

0 likes · 9 min read

DataFunTalk

Mar 15, 2021 · Big Data

Ten Gotchas When Migrating Spark Jobs to Flink

This article shares ten practical pitfalls encountered while moving hour‑level Spark session processing jobs to Apache Flink, covering parallelism skew, state TTL, checkpoint handling, logging, debugging, state migration, Reduce vs Process, input validation, event‑time handling, and the trade‑offs of storing data inside Flink.

Big Data · Flink · Streaming

0 likes · 19 min read

Code Ape Tech Column

Mar 15, 2021 · Big Data

How to Find Common URLs in 5 Billion-Entry Files with Only 4 GB RAM

Given two files that each contain 5 billion 64‑byte URLs (≈320 GB per file) and only 4 GB of memory, the solution partitions each file's URLs by hash modulo 1,000 into 1,000 smaller files of roughly 320 MB each, then loads matching partition pairs into hash sets to identify the intersecting URLs efficiently.

Big Data · Memory Optimization · hash partition

0 likes · 3 min read
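
The partition-then-intersect idea in the summary above can be sketched roughly as follows. This is a minimal illustration, not the article's own code: it assumes one URL per line, and the function names (`partition`, `common_urls`), the use of `zlib.crc32` as the hash, and the temporary-file layout are all hypothetical choices made for the example.

```python
import os
import tempfile
import zlib


def partition(path, out_dir, prefix, buckets):
    """Split one URL file into `buckets` smaller files keyed by a stable hash."""
    outs = [
        open(os.path.join(out_dir, f"{prefix}_{i}.txt"), "w")
        for i in range(buckets)
    ]
    try:
        with open(path) as src:
            for line in src:
                url = line.strip()
                if url:
                    # crc32 is stable across runs, unlike Python's salted hash(),
                    # so the same URL lands in the same bucket for both inputs.
                    idx = zlib.crc32(url.encode("utf-8")) % buckets
                    outs[idx].write(url + "\n")
    finally:
        for f in outs:
            f.close()


def common_urls(path_a, path_b, buckets=1000):
    """Yield URLs present in both files, loading one bucket of A at a time."""
    with tempfile.TemporaryDirectory() as tmp:
        partition(path_a, tmp, "a", buckets)
        partition(path_b, tmp, "b", buckets)
        for i in range(buckets):
            # Only bucket i of file A is held in memory as a hash set.
            with open(os.path.join(tmp, f"a_{i}.txt")) as fa:
                seen = {line.strip() for line in fa}
            emitted = set()  # avoids reporting a duplicated URL twice
            with open(os.path.join(tmp, f"b_{i}.txt")) as fb:
                for line in fb:
                    url = line.strip()
                    if url in seen and url not in emitted:
                        emitted.add(url)
                        yield url
```

Because identical URLs always hash to the same bucket index, only bucket `i` of one file ever needs to be compared with bucket `i` of the other. Note that this sketch keeps all bucket files open while partitioning, so a large `buckets` value may run into the operating system's open-file limit; a smaller value or batched writes would avoid that.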

Python Crawling & Data Mining

Mar 14, 2021 · Artificial Intelligence

Quantitative Investing: Myths, Realities, and How AI Fits In

This article demystifies quantitative investing by explaining its basic concepts, common strategies, historical growth, inherent limitations, and the role of AI and big data, while urging investors to view quant methods as tools rather than a universal solution.

AI · Big Data · financial modeling

0 likes · 13 min read

Suning Technology

Mar 13, 2021 · Artificial Intelligence

How Suning’s AI‑Driven Digital Transformation Is Redefining Retail

At the 2021 National Retail CIO Conference in Shanghai, Suning’s Director Wang Junjie detailed the company’s AI, big‑data and cloud‑based three‑step digital transformation strategy, its suite of five mature digital products, and its call for partners to extend these solutions across industries.

Big Data · Cloud Computing · Digital Transformation

0 likes · 4 min read

vivo Internet Technology

Mar 10, 2021 · Big Data

Path Analysis Model Design and Engineering Implementation for Internet Data Operations

The article details the design and engineering of a high‑performance path analysis model for internet data operations, explaining session handling, Sankey visualizations, adjacency‑table storage, multi‑granular session partitioning, Spark‑to‑ClickHouse pipelines, and optimizations that enable billion‑scale user‑path queries in about one second.

Big Data · ClickHouse · OLAP

0 likes · 21 min read

DataFunTalk

Mar 10, 2021 · Big Data

Hive MetaStore Challenges and Optimizations at Kuaishou

At Kuaishou, the Hive MetaStore service, which stores metadata for Hive, faced scalability and performance challenges due to massive dynamic partitions and high query volume, leading to a series of architectural optimizations—including read‑write separation, API enhancements, traffic control, and federation—to improve stability and efficiency.

Big Data · Hive · Kuaishou

0 likes · 15 min read

Tencent Cloud Developer

Mar 10, 2021 · Cloud Native

How Cloud‑Native Data Lakes Slash Costs and Boost Performance on Public Cloud

The article analyzes the challenges of moving traditional on‑premise big‑data platforms to the cloud, outlines the cost‑saving opportunities of cloud‑native data lakes, presents three core architectural principles, and reviews Tencent Cloud's data lake product suite and its key use cases.

Big Data · Data Lake · Object Storage

0 likes · 11 min read

JD Cloud Developers

Mar 8, 2021 · Artificial Intelligence

Weekly Developer Highlights: Flutter 2, JD Cloud, Flink 1.12.2, AI Breakthroughs

This week’s developer roundup covers Google’s Flutter 2 launch, JD Cloud’s next‑gen server, Apache Flink 1.12.2 bug‑fix release, sidewalk robots classified as pedestrians, Microsoft Mesh mixed‑reality platform, Facebook’s self‑supervised SEER model, plus recent AI research from EMNLP and COLING conferences.

Artificial Intelligence · Big Data · Flutter

0 likes · 8 min read

Top Architect

Mar 5, 2021 · Big Data

Elasticsearch Indexing and Search Optimization: Principles, Lucene Internals, and Performance Tuning

This article explains the architecture and core concepts of Elasticsearch and Lucene, outlines the requirements for cross‑month and high‑speed queries on massive datasets, and provides detailed index and search performance tuning techniques—including bulk writes, shard routing, doc‑values management, and pagination strategies—to achieve sub‑second response times on billions of records.

Big Data · Elasticsearch · Index Optimization

0 likes · 13 min read

Big Data Technology Architecture

Mar 4, 2021 · Big Data

Improving Interactive Analysis on Massive Datasets with Data Clustering and Data Skipping Using Spark and Iceberg

This article explores how data clustering techniques such as linear order, Z‑order, and Hilbert‑curve ordering can be applied in Apache Spark and Apache Iceberg to achieve efficient data skipping on terabyte‑scale tables, dramatically reducing file scans and enabling sub‑second interactive analytics for multi‑dimensional queries.

Big Data · Data Clustering · Data Skipping

0 likes · 20 min read

Suning Technology

Mar 3, 2021 · Big Data

How Can China Build a Secure, Free Data Sharing Ecosystem?

The article examines China's push for free public data sharing, highlighting policy directives, the need for top‑level design, security standards, and education to create a unified, safe data‑governance framework that fuels the digital economy.

Big Data · Digital Economy · data governance

0 likes · 6 min read

21CTO

Mar 2, 2021 · Big Data

How Suning’s Data Platform Unifies OLAP, Metrics, Visualization & Reporting

Suning’s Data Middle Platform integrates an accelerated OLAP engine, a star‑schema metric system, a visualization tool built on standardized dimensions, and a unified report portal to solve data silos, improve security, and enable enterprises to evolve into technology‑driven organizations.

Analytics · Big Data · Data Platform

0 likes · 3 min read

Laravel Tech Community

Feb 28, 2021 · Big Data

Apache Beam 2.28.0 Release Highlights and New Features

Apache Beam 2.28.0 introduces extensive Parquet support, new hash functions in BeamSQL and ZetaSQL, ApproximateDistinct via HLL, enhanced I/O connectors including SpannerIO for Numeric fields, ParquetIO schema support, KafkaTableProvider thrift, HadoopFormatIO key/value cloning skip, and various other improvements.

Apache Beam · Batch · Big Data

0 likes · 3 min read

DataFunTalk

Feb 28, 2021 · Big Data

Migrating Youzan Offline Spark Platform to Kubernetes: Architecture, Optimizations, and Lessons Learned

This article details how Youzan's offline Spark computing platform was transformed for the cloud‑native era by migrating from YARN to Kubernetes, introducing containerization, storage‑compute separation, dynamic allocation, deployment optimizations, and a collection of practical lessons to reduce cost and improve resource utilization.

Big Data · Performance Optimization · Spark

0 likes · 27 min read

TAL Education Technology

Feb 25, 2021 · Databases

ClickHouse Overview: Architecture, Features, Performance, and Practical Use Cases at TAL Education

This article provides a comprehensive overview of ClickHouse, covering its background, core features, columnar storage, vectorized execution engine, table engines, distributed architecture, performance benchmarks, real‑world deployment at TAL Education, monitoring practices, encountered challenges, and future planning.

Big Data · ClickHouse · Columnar Database

0 likes · 18 min read

Python Programming Learning Circle

Feb 25, 2021 · Big Data

Parallel Computing and Python Multiprocessing: Concepts, Models, and Practical Examples

This article explains the fundamentals of parallel computing in the big‑data era, compares parallelism and concurrency, outlines GPU and distributed‑computing solutions, and provides a detailed guide to Python’s multiprocessing module with code examples, performance tests, and practical tips.

Big Data · Distributed computing · GPU

0 likes · 18 min read

DataFunTalk

Feb 23, 2021 · Big Data

Meituan Hotel & Travel Data Governance: Journey, Practices, and Future Directions

This article outlines Meituan's hotel‑travel data governance evolution, describing the key quality, cost, security, standardization and efficiency challenges faced as the business scaled, and detailing the organizational, technical, metric, service and product‑entry solutions implemented to achieve systematic, measurable, and automated data governance.

Big Data · Data Security · data governance

0 likes · 19 min read

DataFunTalk

Feb 22, 2021 · Big Data

Optimizing Flink Real-Time Task Resources: Memory and Message Processing Perspectives

This article explores practical methods for optimizing Flink real‑time task resources on Kubernetes, focusing on memory usage analysis via GC logs and message‑processing capacity assessment, proposing automated detection of over‑provisioned memory and CPU, and outlining a workflow for resource adjustment to reduce costs.

Big Data · Flink · GC Analysis

0 likes · 18 min read

dbaplus Community

Feb 18, 2021 · Big Data

How JD Search Scaled Real‑Time Analytics with Flink and Doris

This article details JD Search's journey from a Storm‑based pipeline to a Flink‑driven architecture backed by Apache Doris, covering business requirements, technical challenges, design trade‑offs, performance optimizations for massive traffic spikes, and future plans for their real‑time OLAP data warehouse.

Big Data · Doris · Flink

0 likes · 12 min read

DataFunTalk

Feb 17, 2021 · Big Data

Apache Iceberg 0.11.0: New Partition Support, SortOrder, Flink Streaming Reader, and Ecosystem Integrations

The article details Apache Iceberg 0.11.0's core enhancements—including partition changes, SortOrder, extensive Flink and Spark integrations, CDC/Upsert support, hash‑based write distribution to reduce small files, and upcoming 0.12.0 roadmap—while providing practical SQL and API examples for data‑lake practitioners.

Apache Iceberg · Big Data · CDC

0 likes · 13 min read
