Tagged articles

7 articles

Nov 18, 2025 · Big Data

Master Spark SQL: From DataFrames to Catalyst Optimization and Real-World Use Cases

This comprehensive guide walks you through Spark SQL fundamentals—including DataFrame and Dataset APIs—delves into the Catalyst optimizer and Tungsten engine, presents practical Java examples, and shares concrete tuning techniques and real-world ETL scenarios for handling large‑scale data.

Catalyst · ETL · Optimization

8 min read

IT Services Circle

Mar 21, 2022 · Big Data

Understanding Spark Shuffle: Hash, Sort, and Tungsten Sort Mechanisms

This article explains the evolution and inner workings of Spark's shuffle phase, comparing the original Hash‑based shuffle, the default Sort‑based shuffle, the optimized Tungsten‑Sort shuffle, and related configuration options that affect performance and file handling in large‑scale data processing.

Hash Shuffle · Shuffle · Sort-Shuffle

17 min read

Big Data Technology & Architecture

Dec 19, 2021 · Big Data

Understanding Spark Catalyst and Tungsten Optimizations in Spark SQL

This article explains how Spark SQL's Catalyst optimizer performs logical and physical planning, details the Tungsten engine's data‑structure and whole‑stage code generation improvements, compares them with the Volcano iterator model, and provides code examples and PDF resources for deeper study.

Big Data · Catalyst · SQL optimization

12 min read

Big Data Technology & Architecture

Dec 16, 2021 · Big Data

Understanding Spark SQL Join Strategies, Catalyst Optimizer, and Tungsten for Big Data Processing

This article explains Spark SQL join classifications, the mechanics of Nested Loop Join, Sort‑Merge Join, and Hash Join, and describes how the Catalyst optimizer and Tungsten project improve query execution and memory efficiency in large‑scale data environments.

Big Data · Catalyst · Join

9 min read

Big Data Technology Architecture

Apr 28, 2019 · Big Data

Apache Spark Memory Management: Storage and Execution Memory (Part 2)

This article continues the deep dive into Apache Spark memory management, explaining storage memory handling—including RDD persistence, caching, eviction, and disk spilling—as well as execution memory allocation for multi-tasking and shuffle operations, and detailing Spark’s internal structures such as BlockManager, StorageLevel, and Tungsten page management.

Apache Spark · Memory Management · RDD Persistence

13 min read

Qunar Tech Salon

Aug 29, 2016 · Big Data

Whole‑Stage Code Generation and Vectorization in Apache Spark’s Tungsten Engine

The article explains how Spark 2.0’s second‑generation Tungsten engine replaces the traditional Volcano iterator model with whole‑stage code generation and vectorization, eliminating virtual calls, keeping temporary data in CPU registers, and using loop unrolling and SIMD to achieve order‑of‑magnitude performance gains on large‑scale data workloads.

Apache Spark · Tungsten · Whole-stage code generation

12 min read

High Availability Architecture

Jan 6, 2016 · Big Data

Spark Latest Features, Tungsten Project, and Hulu’s Production Practices

This article reviews Spark's evolution from version 1.2 to 1.6, explains the DataFrame and Tungsten projects, shares Hulu’s real‑world Spark deployments, and discusses performance‑related challenges such as stack overflow, streaming receiver latency, and class‑loader deadlocks.

DataFrames · Dataset API · Hulu

17 min read
