Tag

Apache Spark

Views collected around this technical thread.

Python Programming Learning Circle
May 22, 2025 · Big Data

Introduction to PySpark: Features, Core Components, Sample Code, and Use Cases

This article introduces PySpark as the Python API for Apache Spark, explains Spark's core concepts and advantages, details PySpark's main components and a simple code example, compares it with Pandas, and outlines typical big‑data scenarios and further learning directions.

Apache Spark · Big Data · DataFrames
0 likes · 5 min read
DataFunSummit
Jan 9, 2025 · Big Data

Spark SQL Window Function Optimizations: Concepts, Techniques, and Q&A

This article explains Spark SQL's window function fundamentals, introduces two key optimizations—Offset Window Frame and Infer Window Group Limit—and provides a detailed Q&A covering implementation details, execution plan impacts, and underlying architecture.
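As background for the frame-based optimizations discussed here: a window function computes, for each row, an aggregate over a frame of neighboring rows within its partition. A plain-Python sketch of a `ROWS BETWEEN 1 PRECEDING AND CURRENT ROW` running sum, the kind of bounded frame the Offset Window Frame optimization targets (the helper name and data are illustrative):

```python
def running_sum(values, preceding=1):
    """For each row i, sum the values in the frame [i - preceding, i],
    mimicking SQL's ROWS BETWEEN <n> PRECEDING AND CURRENT ROW."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - preceding)
        out.append(sum(values[lo:i + 1]))
    return out

print(running_sum([10, 20, 30, 40]))  # [10, 30, 50, 70]
```

A naive implementation re-scans the frame for every row; frame-aware optimizations exploit the fact that adjacent frames differ by only a few rows.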

Apache Spark · Big Data · Optimization
0 likes · 13 min read
DataFunSummit
Aug 14, 2024 · Big Data

Solving Typical Issues in Migrating to Spark 3.1: Multiple Catalog, Hive‑SQL to Spark‑SQL Migration, and Performance & Stability Optimizations at Xiaomi

This article shares Xiaomi's experience building a next‑generation one‑stop data development platform on Spark 3.1, covering typical challenges such as Multiple Catalog implementation, Hive‑SQL to Spark‑SQL migration, offline Spark performance and stability optimizations, and future roadmap plans.
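Spark 3 registers additional catalogs through the `spark.sql.catalog.<name>` configuration namespace, which is the extension point a Multiple Catalog setup typically builds on. A sketch assuming an Iceberg catalog backed by a Hive metastore (the catalog name is illustrative, and this is not Xiaomi's actual configuration):

```
# Register a second catalog named "iceberg_prod" (name is illustrative)
spark.sql.catalog.iceberg_prod      = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_prod.type = hive

# Tables are then addressed by three-part names:
#   SELECT * FROM iceberg_prod.db.events
```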

Apache Spark · Big Data · SQL Migration
0 likes · 18 min read
DataFunSummit
Aug 1, 2024 · Big Data

Deep Dive into Apache Spark SQL: Concepts, Core Components, and API

This article provides a comprehensive overview of Apache Spark SQL. It covers fundamental concepts such as TreeNode, AST, and QueryPlan; the distinction between logical and physical plans; the rule‑execution framework; core components like SparkSqlParser and the Analyzer; and the SparkSession, Dataset/DataFrame, and writer APIs, supplemented by a detailed Q&A session.
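To make the rule‑execution framework concrete: Catalyst applies batches of rewrite rules to a tree of TreeNode/QueryPlan objects until the tree stops changing. A toy sketch of that fixed-point loop in Python (the node shape and the constant-folding rule are illustrative, not Spark's actual classes):

```python
# Toy expression tree: ("+", left, right) or a leaf (int literal / column name).
def fold_constants(node):
    """One rewrite rule: evaluate additions whose children are both literals."""
    if isinstance(node, tuple):
        op, l, r = node
        l, r = fold_constants(l), fold_constants(r)
        if isinstance(l, int) and isinstance(r, int):
            return l + r
        return (op, l, r)
    return node

def run_to_fixed_point(node, rules):
    """Mimic Catalyst's rule executor: apply rules until nothing changes."""
    while True:
        new = node
        for rule in rules:
            new = rule(new)
        if new == node:
            return node
        node = new

plan = ("+", ("+", 1, 2), ("+", 3, "col_a"))
print(run_to_fixed_point(plan, [fold_constants]))  # ('+', 3, ('+', 3, 'col_a'))
```

The real Analyzer and Optimizer are just such rule batches, run first over the logical plan and later over the physical plan.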

Apache Spark · Big Data · Data Processing
0 likes · 19 min read
DataFunSummit
Jul 11, 2024 · Big Data

Design Principles of the Spark Core – DataFun Introduction to Apache Spark (Part 1)

This article provides a comprehensive overview of Apache Spark, covering its origins, key characteristics, core concepts such as RDD, DAG, partitioning and dependencies, the internal architecture including SparkConf, SparkContext, SparkEnv, storage and scheduling systems, as well as deployment models and the company behind the product.
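Among the core concepts listed here, partitioning determines whether a dependency between RDDs is narrow or wide: when a child partition needs records from many parent partitions, as after repartitioning by key, a shuffle is required. A minimal sketch of hash partitioning in plain Python (it mirrors the idea, not Spark's HashPartitioner implementation):

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing the key,
    so all records for a given key land in the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

parts = hash_partition([("a", 1), ("b", 2), ("a", 3)], 2)
# Both "a" records are co-located, so a per-key aggregation can run inside
# each partition; producing this layout from a differently partitioned
# parent is exactly what forces a shuffle (a wide dependency).
```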

Apache Spark · Big Data · Data Processing
0 likes · 16 min read
DataFunTalk
Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions when building a new Spark 3.1‑based data platform, covering Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and future roadmap for vectorized execution.

Apache Spark · Big Data · Hive
0 likes · 14 min read
Airbnb Technology Team
Mar 1, 2024 · Big Data

Riverbed: A Scalable Data Framework for Real‑time and Batch Processing at Airbnb

Airbnb’s Riverbed framework unifies streaming CDC events and batch Spark jobs behind a GraphQL‑based declarative API to automatically build and maintain distributed materialized views. Using Kafka‑partitioned ordering and version control, it delivers billions of daily updates with low‑latency reads for features such as payments and search.

Airbnb · Apache Spark · Data Engineering
0 likes · 8 min read
DataFunTalk
Dec 31, 2023 · Big Data

Apache Celeborn (Incubating): Addressing Traditional Shuffle Limitations in Big Data Processing

Apache Celeborn (Incubating) is a remote shuffle service designed to overcome the inefficiencies, high storage demands, network overhead, and limited fault tolerance of traditional Spark shuffle implementations. It does so by introducing push shuffle, partition splitting, columnar shuffle, multi‑layer storage, and an elastic, stable, and scalable architecture.
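Much of the overhead being addressed comes from shuffle fetch counts: with M map tasks and R reduce partitions, traditional sort-based shuffle leaves each reducer fetching a fragment from every mapper, roughly M × R small network reads, whereas a push-based remote shuffle aggregates each partition's data up front so a reducer performs about one sequential read per partition. A back-of-the-envelope illustration (the task counts are illustrative):

```python
map_tasks, reduce_partitions = 5000, 2000

# Traditional shuffle: every reducer fetches a fragment from every mapper.
fetches_traditional = map_tasks * reduce_partitions

# Push-based remote shuffle: fragments are aggregated per partition first,
# so each reducer reads its partition's aggregated data once.
fetches_push = reduce_partitions

print(fetches_traditional, fetches_push)  # 10000000 2000
```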

Apache Spark · Big Data · Celeborn
0 likes · 15 min read
iQIYI Technical Product Team
Sep 15, 2023 · Big Data

Apache Spark at iQIYI: Current Status and Optimization

iQIYI now relies on Apache Spark as its main offline engine, processing over 200,000 daily tasks for ETL, data synchronization, and analytics. Recent optimizations, including dynamic resource allocation, adaptive query execution, compression, rebalance, Z‑order, and resource governance, have cut compute usage by roughly 27% and storage by up to 76% while improving query speed. The team has completed a large‑scale migration from Hive and is paving the way for Spark 3.4 and Iceberg support.

Apache Spark · Big Data · SQL Service
0 likes · 21 min read
DataFunTalk
Aug 5, 2023 · Big Data

Apache Celeborn (Incubating): Design, Performance, Stability, and Elasticity of a Remote Shuffle Service

This article reviews the limitations of traditional Spark shuffle, introduces Apache Celeborn (Incubating) as a remote shuffle service, and details its design for performance, stability, and elasticity, including push shuffle, partition splitting, columnar shuffle, multi‑layer storage, congestion control, and real‑world evaluation.

Apache Spark · Big Data · Celeborn
0 likes · 19 min read
DataFunSummit
Dec 10, 2022 · Big Data

Applying Apache Spark in Guanyuan Self-Service Analytics System: Architecture, Challenges, and Solutions

This presentation details how Guanyuan Data leverages Apache Spark within its self‑service analytics platform, covering product features, flexible deployment, resource isolation, performance challenges, architectural solutions, and future cloud‑native enhancements to support thousands of users and massive query workloads.

Apache Spark · Big Data · Resource Scheduling
0 likes · 14 min read
DataFunSummit
Sep 27, 2022 · Big Data

Apache Spark Adaptive Query Execution and Kyuubi Optimization Practices for Data Warehousing

This article presents a detailed overview of Apache Spark's Adaptive Query Execution evolution, its optimization techniques, and performance gains, followed by an in‑depth discussion of Apache Kyuubi's architecture, security integrations, cloud‑native capabilities, and practical Rebalance + Z‑Order strategies that enhance data‑warehouse task efficiency and query performance.
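The Z‑Order half of the Rebalance + Z‑Order strategy relies on interleaving the bits of several column values into a single sort key, so rows that are close in the multi-dimensional space end up close in the file layout and min/max statistics prune better. A minimal two-column sketch (the function name and 8-bit width are illustrative):

```python
def z_order_key(x, y, bits=8):
    """Interleave the bits of x and y (a Morton code): x's bits go to
    even positions, y's bits to odd positions."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Sorting by the interleaved key keeps (x, y) neighbors near each other.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
print(sorted(points, key=lambda p: z_order_key(*p)))
# [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
```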

Adaptive Query Execution · Apache Spark · Big Data Optimization
0 likes · 19 min read
DataFunTalk
May 19, 2022 · Big Data

SeaTunnel: Distributed Data Integration Platform and Its Application in Traffic Management

This article introduces Apache SeaTunnel, a distributed, high‑performance data integration platform built on Spark and Flink, outlines its technical features, workflow, and plugin ecosystem, and details a concrete traffic‑management use case involving incremental Oracle‑to‑warehouse data synchronization with Spark resources and scheduled shell scripts.
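SeaTunnel jobs are described declaratively, with env/source/transform/sink blocks that map onto the plugin ecosystem mentioned above. A hedged sketch of the shape of such a config for the Oracle-to-warehouse scenario (connector names, options, and the JDBC URL are all illustrative; exact plugin options vary by SeaTunnel version):

```
env {
  # Spark resources assigned to the job
  spark.executor.instances = 2
  spark.executor.memory = "2g"
}

source {
  jdbc {
    url = "jdbc:oracle:thin:@//db-host:1521/ORCL"
    query = "SELECT * FROM traffic_events WHERE update_time > '${last_run}'"
  }
}

sink {
  hive {
    table_name = "ods.traffic_events"
  }
}
```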

Apache Flink · Apache Spark · Big Data
0 likes · 12 min read
DataFunTalk
Apr 7, 2022 · Big Data

Apache Kyuubi: Architecture, Use Cases, Community, and Mobile Cloud Deployment

This article introduces Apache Kyuubi—a multi‑tenant Thrift JDBC/ODBC service built on Spark—detailing its architecture, advantages over Spark Thrift Server, real‑world use cases, open‑source community progress, and practical deployment strategies on mobile cloud, Kubernetes, and with Trino.

Apache Spark · Big Data · Kubernetes
0 likes · 16 min read
Big Data Technology Architecture
Nov 28, 2021 · Big Data

EMR Studio: Architecture and Features for Simplifying Big Data Development

EMR Studio is a one‑stop, open‑source‑compatible big data development platform that integrates Zeppelin, Jupyter, Airflow and a custom Cluster Manager to streamline job creation, scheduling, monitoring, and cluster switching, thereby addressing common usability challenges in Spark, Flink, Hive, and Presto workflows.

Airflow · Apache Spark · Big Data
0 likes · 9 min read
Big Data Technology Architecture
Jun 29, 2021 · Big Data

Implementing and Registering a Custom SparkListener in Apache Spark

This article explains how to create a custom SparkListener in Apache Spark, provides Scala code examples for the listener and a main application, and details two registration approaches—via Spark configuration or SparkContext—along with a comprehensive list of listener event methods.
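The article's examples are in Scala; conceptually, a SparkListener is a set of callbacks (onJobStart, onJobEnd, onTaskEnd, and so on) that the scheduler's event bus invokes as events fire. A language-agnostic sketch of that dispatch pattern in Python (the class and event names here are illustrative, not Spark's API):

```python
class ListenerBus:
    """Toy event bus: delivers each posted event to every registered
    listener that implements a handler named on_<event>."""
    def __init__(self):
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def post(self, event, payload):
        for listener in self.listeners:
            handler = getattr(listener, f"on_{event}", None)
            if handler:
                handler(payload)

class JobLogger:
    """A 'listener' that records job lifecycle events."""
    def __init__(self):
        self.log = []
    def on_job_start(self, payload):
        self.log.append(f"job {payload} started")
    def on_job_end(self, payload):
        self.log.append(f"job {payload} ended")

bus = ListenerBus()
logger = JobLogger()
bus.add_listener(logger)   # analogous to sc.addSparkListener(...)
bus.post("job_start", 1)
bus.post("job_end", 1)
print(logger.log)  # ['job 1 started', 'job 1 ended']
```

Registering via `spark.extraListeners` in the configuration works the same way, except the bus instantiates the listener class for you at startup.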

Apache Spark · Big Data · Event Listener
0 likes · 5 min read
DataFunTalk
Apr 28, 2021 · Big Data

Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features

This article explains how NVIDIA's RAPIDS Accelerator leverages GPUs to speed up Apache Spark 3.0 workloads, detailing the underlying architecture, benchmark results on TPC‑DS and recommendation models, required configuration changes, supported operators, shuffle optimizations, and the enhancements introduced in versions 0.2 and 0.3.
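The required configuration changes amount to loading the plugin and enabling GPU SQL execution. A sketch of the relevant spark-submit settings (the GPU amounts are illustrative; for the 0.x releases discussed here, the RAPIDS and cuDF jars must also be put on the classpath):

```
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.sql.enabled=true
--conf spark.rapids.sql.concurrentGpuTasks=2
--conf spark.executor.resource.gpu.amount=1
--conf spark.task.resource.gpu.amount=0.25
```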

Apache Spark · Big Data · GPU Acceleration
0 likes · 19 min read
Tencent Cloud Developer
Nov 13, 2020 · Big Data

Apache Spark Core: Architecture, Components, and Execution Flow

Apache Spark Core is a high‑performance, fault‑tolerant engine that abstracts distributed computation through SparkContext, DAG and Task schedulers, supports in‑memory and disk storage, runs on various cluster managers (YARN, Kubernetes, etc.), and unifies batch, streaming, ML and graph processing via its rich ecosystem.

Apache Spark · Big Data · DAG Scheduler
0 likes · 17 min read
Big Data Technology Architecture
Aug 12, 2020 · Big Data

Overview of New Features and Improvements in Apache Spark 3.0

Apache Spark 3.0 introduces a suite of performance enhancements, richer APIs, improved monitoring, SQL compatibility, new data sources, and ecosystem extensions, including Adaptive Query Execution, Dynamic Partition Pruning, Join Hints, pandas UDF improvements, and accelerator‑aware scheduling, to boost scalability and ease of use for big‑data workloads.

Adaptive Query Execution · Apache Spark · Big Data
0 likes · 15 min read
Big Data Technology Architecture
Aug 8, 2020 · Big Data

Overview of SQL Performance Improvements in Apache Spark 3.0

Apache Spark 3.0 introduces extensive SQL performance enhancements, including a new explain format, expanded join hints, adaptive query execution, dynamic partition pruning, enhanced nested column pruning, improved aggregation code generation, and support for newer Scala and Java versions, all aimed at optimizing query planning and execution.

Adaptive Query Execution · Apache Spark · Big Data
0 likes · 14 min read