Tag

Data Skew


DaTaobao Tech
Jun 21, 2024 · Big Data

Flink Real-Time Data Development: Cases on Data Skew, Watermark Failure, and GroupBy Issues

The article walks through three Flink streaming pitfalls—data‑skew‑induced back‑pressure, lost watermarks after interval joins, and ineffective group‑by causing duplicate rows—and shows how to resolve them with two‑stage distinct aggregation, hash‑based key distribution, processing‑time windows or split jobs, and mini‑batch buffering.

Data Skew · Flink · SQL
14 min read

JD Tech
Jun 14, 2023 · Big Data

Understanding and Solving Data Skew in Offline Big Data Development (Hive & Spark)

This article explains the concept of data skew in offline big‑data jobs, describes its symptoms and root causes, and provides practical optimization techniques for Hive and Spark—including partitioning strategies, map‑join usage, adaptive query settings, and monitoring approaches—to prevent performance degradation and runtime failures.

Data Skew · Hive · SQL
17 min read

JD Retail Technology
Apr 14, 2023 · Big Data

Understanding Data Skew and Its Mitigation in Hive and Spark

This article explains the concept of data skew, its symptoms such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.

Adaptive Execution · Big Data · Data Skew
15 min read

Data Thinking Notes
Dec 21, 2022 · Big Data

Why Your Spark Batch Job Fails: Memory Limits, Data Skew, and Practical Fixes

This article examines a recurring Spark batch task failure caused by OutOfMemory errors and data skew, details the investigation steps—including increasing executor memory, raising parallelism, and analyzing shuffle metrics—and proposes solutions such as data validation, filtering oversized keys, and memory adjustments.

Batch Processing · Big Data · Data Skew
4 min read

JD Tech
Dec 1, 2022 · Databases

Understanding Redis Data Skew and Hotkey Detection with JD Open‑Source hotkey Solution

This article explains the concept of Redis data skew, its causes and impacts, explores data volume and access skew classifications, presents mitigation strategies, and provides a comprehensive source‑code walkthrough of JD's open‑source hotkey framework—including client, worker, and dashboard components—for detecting and handling hot keys in distributed cache clusters.

Data Skew · HotKey · Java
54 min read

Data Thinking Notes
Nov 22, 2022 · Big Data

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.

Big Data · Data Skew · Hive
5 min read

Data Thinking Notes
Oct 24, 2022 · Big Data

How to Diagnose and Fix Spark Data Skew: Practical Optimization Techniques

This article explains the causes of Spark data skew, how to locate skewed tasks using the Web UI, and presents six optimization methods—including increasing shuffle parallelism, filtering abnormal keys, two‑stage aggregation, map‑join, key sampling, and random‑prefix joins—plus a real‑world case study.

Big Data · Data Skew · Join
21 min read

NetEase LeiHuo UX Big Data Technology
Oct 17, 2022 · Big Data

Understanding Data Skew and Its Mitigation Strategies in Distributed Computing

The article explains what data skew is in distributed computing, analyzes its logical and data‑level causes, and presents preventive and remedial techniques such as data partitioning, logical replacement, two‑stage aggregation, increasing parallelism, and data cleaning to improve processing efficiency.

Big Data · Data Skew · Performance Optimization
8 min read

DaTaobao Tech
Sep 6, 2022 · Big Data

SQL Optimization Techniques for ODPS (Open Data Processing Service)

The article presents practical ODPS SQL optimization strategies—including explicit column selection, partition limiting, multi‑insert, proper handling of nulls, join‑type choices, map‑join and skew hints, bucketed tables, and tuned task parameters—illustrated with three real‑world cases that dramatically cut execution time and resource usage.

Big Data · Data Skew · Hive
23 min read

DataFunTalk
Jun 28, 2022 · Big Data

JD Retail Traffic Data Warehouse Architecture and Processing Practices

This article presents a comprehensive technical overview of JD.com’s retail traffic data processing pipeline, detailing the multi‑layer data warehouse architecture, real‑time and offline data flows, a large‑scale back‑fill case using Iceberg and OLAP, data‑skew detection and mitigation techniques, and future directions involving unified Flink‑Spark streaming‑batch solutions.

Big Data · Data Skew · Flink
12 min read

Architect
Jan 7, 2022 · Big Data

Spark Performance Optimization: Principles, Memory Model, Resource Tuning, Data Skew and Shuffle Tuning

This article provides an in‑depth guide to Spark performance optimization, covering the ten development principles, static and unified memory models, resource parameter tuning, data skew detection and mitigation techniques, as well as shuffle‑related configuration adjustments, supplemented with practical code examples and diagrams.

Big Data · Data Skew · Memory Model
40 min read

Big Data Technology Architecture
Aug 24, 2021 · Big Data

Comprehensive Guide to Spark Performance Optimization, Data Skew Mitigation, and Troubleshooting

This article presents a detailed collection of Spark performance‑tuning techniques—including submit‑script parameters, RDD and operator optimizations, parallelism and memory settings, broadcast variables, Kryo serialization, locality wait adjustments—as well as systematic methods for detecting and resolving data skew and common runtime issues such as shuffle failures, serialization errors, and JVM memory problems.

Big Data · Data Skew · JVM Tuning
21 min read

Laravel Tech Community
May 9, 2021 · Backend Development

Understanding Consistent Hashing: From Simple Modulo Hash to Optimizations

This article explains the drawbacks of a basic modulo hash algorithm for key distribution, demonstrates how consistent hashing resolves scaling and node‑failure issues, and discusses virtual‑node techniques to mitigate data skew and improve load balancing in distributed cache systems.

Data Skew · consistent hashing · distributed caching
5 min read

Architect
Apr 3, 2021 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article explains advanced Spark performance tuning techniques, focusing on diagnosing and resolving data skew and shuffle bottlenecks through stage analysis, key distribution inspection, and a variety of practical solutions such as Hive pre‑processing, key filtering, parallelism increase, two‑stage aggregation, map‑join, and combined strategies, while also covering ShuffleManager internals and related configuration parameters.

Big Data · Data Skew · Performance Tuning
47 min read

Big Data Technology Architecture
Apr 1, 2021 · Big Data

Spark Adaptive Execution: Dynamic Shuffle Partition, Broadcast Join, and Skew Handling

The article explains the limitations of static shuffle partitions, execution‑plan estimation, and data skew in Spark SQL, and describes how Spark Adaptive Execution can automatically adjust shuffle partition numbers, switch join strategies, and mitigate skew through configurable parameters and code examples.

Adaptive Execution · Broadcast Join · Data Skew
11 min read

Big Data Technology Architecture
Mar 10, 2021 · Big Data

Comprehensive Spark Performance Optimization: Development Tuning, Resource Configuration, Data Skew Solutions, and Shuffle Tuning

This guide presents a complete Spark performance optimization handbook covering development‑time best practices, resource‑parameter tuning, detailed data‑skew detection and mitigation techniques, advanced shuffle‑engine configurations, and practical code examples to help engineers build faster, more reliable Spark jobs.

Big Data · Data Skew · Performance Optimization
69 min read

Architect
Dec 13, 2020 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical scenarios that cause it, and provides practical strategies and platform‑specific optimizations to detect, mitigate, and prevent skew in big‑data processing pipelines.

Big Data · Data Skew · Hadoop
13 min read

vivo Internet Technology
Nov 11, 2020 · Big Data

Understanding Distributed Hash Tables (DHT) and Their Improvements

The article explains how Distributed Hash Tables replace simple modulo hashing with a ring‑based scheme, demonstrates severe data skew in basic implementations, and shows that adding multiple virtual nodes plus a load‑boundary factor dramatically balances storage and request distribution across cluster nodes.

Big Data · DHT · Data Skew
9 min read

Big Data Technology Architecture
Mar 21, 2020 · Big Data

Comprehensive Spark Performance Optimization: Development Tuning, Resource Configuration, Data Skew Handling, and Shuffle Tuning

This article presents a complete guide to Spark performance optimization, covering development‑time best practices, resource‑parameter tuning, systematic detection and resolution of data skew, and detailed shuffle‑related parameter adjustments, all illustrated with Scala code examples.

Data Skew · Performance Optimization · Shuffle
67 min read