Tag

Apache Spark

Views collected around this technical thread.

Python Programming Learning Circle
May 22, 2025 · Big Data

Introduction to PySpark: Features, Core Components, Sample Code, and Use Cases

This article introduces PySpark as the Python API for Apache Spark, explains Spark's core concepts and advantages, details PySpark's main components and a simple code example, compares it with Pandas, and outlines typical big‑data scenarios and further learning directions.

Apache Spark · Big Data · DataFrames
0 likes · 5 min read
DataFunSummit
Jan 9, 2025 · Big Data

Spark SQL Window Function Optimizations: Concepts, Techniques, and Q&A

This article explains Spark SQL's window function fundamentals, introduces two key optimizations—Offset Window Frame and Infer Window Group Limit—and provides a detailed Q&A covering implementation details, execution plan impacts, and underlying architecture.
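As background for the frame-based optimizations discussed here: a window function computes, for each row, an aggregate over a frame of neighboring rows within its partition. A plain-Python sketch of a `ROWS BETWEEN 1 PRECEDING AND CURRENT ROW` running sum, the kind of bounded frame the Offset Window Frame optimization targets (the helper name and data are illustrative):

```python
def running_sum(values, preceding=1):
    """For each row i, sum the values in the frame [i - preceding, i],
    mimicking SQL's ROWS BETWEEN <n> PRECEDING AND CURRENT ROW."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - preceding)
        out.append(sum(values[lo:i + 1]))
    return out

print(running_sum([10, 20, 30, 40]))  # [10, 30, 50, 70]
```

A naive implementation re-scans the frame for every row; frame-aware optimizations exploit the fact that adjacent frames differ by only a few rows.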

Apache Spark · Big Data · Optimization
0 likes · 13 min read
DataFunSummit
Aug 14, 2024 · Big Data

Solving Typical Issues in Migrating to Spark 3.1: Multiple Catalog, Hive‑SQL to Spark‑SQL Migration, and Performance & Stability Optimizations at Xiaomi

This article shares Xiaomi's experience building a next‑generation one‑stop data development platform on Spark 3.1, covering typical challenges such as Multiple Catalog implementation, Hive‑SQL to Spark‑SQL migration, offline Spark performance and stability optimizations, and future roadmap plans.
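Spark 3 registers additional catalogs through the `spark.sql.catalog.<name>` configuration namespace, which is the extension point a Multiple Catalog setup typically builds on. A sketch assuming an Iceberg catalog backed by a Hive metastore (the catalog name is illustrative, and this is not Xiaomi's actual configuration):

```
# Register a second catalog named "iceberg_prod" (name is illustrative)
spark.sql.catalog.iceberg_prod      = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_prod.type = hive

# Tables are then addressed by three-part names:
#   SELECT * FROM iceberg_prod.db.events
```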

Apache Spark · Big Data · SQL Migration
0 likes · 18 min read
DataFunSummit
Aug 1, 2024 · Big Data

Deep Dive into Apache Spark SQL: Concepts, Core Components, and API

This article provides a comprehensive overview of Apache Spark SQL. It covers fundamental concepts such as TreeNode, AST, and QueryPlan; the distinction between logical and physical plans; the rule‑execution framework; core components like SparkSqlParser and the Analyzer; and the SparkSession, Dataset/DataFrame, and writer APIs, supplemented by a detailed Q&A session.
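To make the rule‑execution framework concrete: Catalyst applies batches of rewrite rules to a tree of TreeNode/QueryPlan objects until the tree stops changing. A toy sketch of that fixed-point loop in Python (the node shape and the constant-folding rule are illustrative, not Spark's actual classes):

```python
# Toy expression tree: ("+", left, right) or a leaf (int literal / column name).
def fold_constants(node):
    """One rewrite rule: evaluate additions whose children are both literals."""
    if isinstance(node, tuple):
        op, l, r = node
        l, r = fold_constants(l), fold_constants(r)
        if isinstance(l, int) and isinstance(r, int):
            return l + r
        return (op, l, r)
    return node

def run_to_fixed_point(node, rules):
    """Mimic Catalyst's rule executor: apply rules until nothing changes."""
    while True:
        new = node
        for rule in rules:
            new = rule(new)
        if new == node:
            return node
        node = new

plan = ("+", ("+", 1, 2), ("+", 3, "col_a"))
print(run_to_fixed_point(plan, [fold_constants]))  # ('+', 3, ('+', 3, 'col_a'))
```

The real Analyzer and Optimizer are just such rule batches, run first over the logical plan and later over the physical plan.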

Apache Spark · Big Data · Data Processing
0 likes · 19 min read
DataFunSummit
Jul 11, 2024 · Big Data

Design Principles of the Spark Core – DataFun Introduction to Apache Spark (Part 1)

This article provides a comprehensive overview of Apache Spark, covering its origins, key characteristics, core concepts such as RDD, DAG, partitioning and dependencies, the internal architecture including SparkConf, SparkContext, SparkEnv, storage and scheduling systems, as well as deployment models and the company behind the product.
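Among the core concepts listed here, partitioning determines whether a dependency between RDDs is narrow or wide: when a child partition needs records from many parent partitions, as after repartitioning by key, a shuffle is required. A minimal sketch of hash partitioning in plain Python (it mirrors the idea, not Spark's HashPartitioner implementation):

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing the key,
    so all records for a given key land in the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

parts = hash_partition([("a", 1), ("b", 2), ("a", 3)], 2)
# Both "a" records are co-located, so a per-key aggregation can run inside
# each partition; producing this layout from a differently partitioned
# parent is exactly what forces a shuffle (a wide dependency).
```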

Apache Spark · Big Data · Data Processing
0 likes · 16 min read
DataFunTalk
Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions when building a new Spark 3.1‑based data platform, covering Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and future roadmap for vectorized execution.

Apache Spark · Big Data · Hive
0 likes · 14 min read
Airbnb Technology Team
Mar 1, 2024 · Big Data

Riverbed: A Scalable Data Framework for Real‑time and Batch Processing at Airbnb

Airbnb’s Riverbed framework unifies streaming CDC events and batch Spark jobs behind a GraphQL‑based declarative API to automatically build and maintain distributed materialized views. Using Kafka‑partitioned ordering and version control, it delivers billions of daily updates with low‑latency reads for features such as payments and search.

Airbnb · Apache Spark · Data Engineering
0 likes · 8 min read
DataFunTalk
Dec 31, 2023 · Big Data

Apache Celeborn (Incubating): Addressing Traditional Shuffle Limitations in Big Data Processing

Apache Celeborn (Incubating) is a remote shuffle service designed to overcome the inefficiencies, high storage demands, network overhead, and limited fault tolerance of traditional Spark shuffle implementations. It does so by introducing push shuffle, partition splitting, columnar shuffle, multi‑layer storage, and an elastic, stable, and scalable architecture.
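Much of the overhead being addressed comes from shuffle fetch counts: with M map tasks and R reduce partitions, traditional sort-based shuffle leaves each reducer fetching a fragment from every mapper, roughly M × R small network reads, whereas a push-based remote shuffle aggregates each partition's data up front so a reducer performs about one sequential read per partition. A back-of-the-envelope illustration (the task counts are illustrative):

```python
map_tasks, reduce_partitions = 5000, 2000

# Traditional shuffle: every reducer fetches a fragment from every mapper.
fetches_traditional = map_tasks * reduce_partitions

# Push-based remote shuffle: fragments are aggregated per partition first,
# so each reducer reads its partition's aggregated data once.
fetches_push = reduce_partitions

print(fetches_traditional, fetches_push)  # 10000000 2000
```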

Apache Spark · Big Data · Celeborn
0 likes · 15 min read
iQIYI Technical Product Team
Sep 15, 2023 · Big Data

Apache Spark at iQIYI: Current Status and Optimization

iQIYI now relies on Apache Spark as its main offline engine, processing over 200,000 daily tasks for ETL, data synchronization, and analytics. Recent optimizations, including dynamic resource allocation, adaptive query execution, compression, rebalance, Z‑order, and resource governance, have cut compute usage by roughly 27% and storage by up to 76% while improving query speed. The team has completed a large‑scale migration from Hive and is paving the way for Spark 3.4 and Iceberg support.

Apache Spark · Big Data · SQL Service
0 likes · 21 min read
DataFunTalk
Aug 5, 2023 · Big Data

Apache Celeborn (Incubating): Design, Performance, Stability, and Elasticity of a Remote Shuffle Service

This article reviews the limitations of traditional Spark shuffle, introduces Apache Celeborn (Incubating) as a remote shuffle service, and details its design for performance, stability, and elasticity, including push shuffle, partition splitting, columnar shuffle, multi‑layer storage, congestion control, and real‑world evaluation.

Apache Spark · Big Data · Celeborn
0 likes · 19 min read
DataFunSummit
Dec 10, 2022 · Big Data

Applying Apache Spark in Guanyuan Self-Service Analytics System: Architecture, Challenges, and Solutions

This presentation details how Guanyuan Data leverages Apache Spark within its self‑service analytics platform, covering product features, flexible deployment, resource isolation, performance challenges, architectural solutions, and future cloud‑native enhancements to support thousands of users and massive query workloads.

Apache Spark · Big Data · Resource Scheduling
0 likes · 14 min read
DataFunSummit
Sep 27, 2022 · Big Data

Apache Spark Adaptive Query Execution and Kyuubi Optimization Practices for Data Warehousing

This article presents a detailed overview of Apache Spark's Adaptive Query Execution evolution, its optimization techniques, and performance gains, followed by an in‑depth discussion of Apache Kyuubi's architecture, security integrations, cloud‑native capabilities, and practical Rebalance + Z‑Order strategies that enhance data‑warehouse task efficiency and query performance.
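The Z‑Order half of the Rebalance + Z‑Order strategy relies on interleaving the bits of several column values into a single sort key, so rows that are close in the multi-dimensional space end up close in the file layout and min/max statistics prune better. A minimal two-column sketch (the function name and 8-bit width are illustrative):

```python
def z_order_key(x, y, bits=8):
    """Interleave the bits of x and y (a Morton code): x's bits go to
    even positions, y's bits to odd positions."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Sorting by the interleaved key keeps (x, y) neighbors near each other.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
print(sorted(points, key=lambda p: z_order_key(*p)))
# [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
```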

Adaptive Query Execution · Apache Spark · Big Data Optimization
0 likes · 19 min read
DataFunTalk
May 19, 2022 · Big Data

SeaTunnel: Distributed Data Integration Platform and Its Application in Traffic Management

This article introduces Apache SeaTunnel, a distributed, high‑performance data integration platform built on Spark and Flink, outlines its technical features, workflow, and plugin ecosystem, and details a concrete traffic‑management use case involving incremental Oracle‑to‑warehouse data synchronization with Spark resources and scheduled shell scripts.
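SeaTunnel jobs are described declaratively, with env/source/transform/sink blocks that map onto the plugin ecosystem mentioned above. A hedged sketch of the shape of such a config for the Oracle-to-warehouse scenario (connector names, options, and the JDBC URL are all illustrative; exact plugin options vary by SeaTunnel version):

```
env {
  # Spark resources assigned to the job
  spark.executor.instances = 2
  spark.executor.memory = "2g"
}

source {
  jdbc {
    url = "jdbc:oracle:thin:@//db-host:1521/ORCL"
    query = "SELECT * FROM traffic_events WHERE update_time > '${last_run}'"
  }
}

sink {
  hive {
    table_name = "ods.traffic_events"
  }
}
```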

Apache Flink · Apache Spark · Big Data
0 likes · 12 min read
DataFunTalk
Apr 7, 2022 · Big Data

Apache Kyuubi: Architecture, Use Cases, Community, and Mobile Cloud Deployment

This article introduces Apache Kyuubi—a multi‑tenant Thrift JDBC/ODBC service built on Spark—detailing its architecture, advantages over Spark Thrift Server, real‑world use cases, open‑source community progress, and practical deployment strategies on mobile cloud, Kubernetes, and with Trino.

Apache Spark · Big Data · Kubernetes
0 likes · 16 min read
Big Data Technology Architecture
Nov 28, 2021 · Big Data

EMR Studio: Architecture and Features for Simplifying Big Data Development

EMR Studio is a one‑stop, open‑source‑compatible big data development platform that integrates Zeppelin, Jupyter, Airflow and a custom Cluster Manager to streamline job creation, scheduling, monitoring, and cluster switching, thereby addressing common usability challenges in Spark, Flink, Hive, and Presto workflows.

Airflow · Apache Spark · Big Data
0 likes · 9 min read
Big Data Technology Architecture
Jun 29, 2021 · Big Data

Implementing and Registering a Custom SparkListener in Apache Spark

This article explains how to create a custom SparkListener in Apache Spark, provides Scala code examples for the listener and a main application, and details two registration approaches—via Spark configuration or SparkContext—along with a comprehensive list of listener event methods.
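The article's examples are in Scala; conceptually, a SparkListener is a set of callbacks (onJobStart, onJobEnd, onTaskEnd, and so on) that the scheduler's event bus invokes as events fire. A language-agnostic sketch of that dispatch pattern in Python (the class and event names here are illustrative, not Spark's API):

```python
class ListenerBus:
    """Toy event bus: delivers each posted event to every registered
    listener that implements a handler named on_<event>."""
    def __init__(self):
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def post(self, event, payload):
        for listener in self.listeners:
            handler = getattr(listener, f"on_{event}", None)
            if handler:
                handler(payload)

class JobLogger:
    """A 'listener' that records job lifecycle events."""
    def __init__(self):
        self.log = []
    def on_job_start(self, payload):
        self.log.append(f"job {payload} started")
    def on_job_end(self, payload):
        self.log.append(f"job {payload} ended")

bus = ListenerBus()
logger = JobLogger()
bus.add_listener(logger)   # analogous to sc.addSparkListener(...)
bus.post("job_start", 1)
bus.post("job_end", 1)
print(logger.log)  # ['job 1 started', 'job 1 ended']
```

Registering via `spark.extraListeners` in the configuration works the same way, except the bus instantiates the listener class for you at startup.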

Apache Spark · Big Data · Event Listener
0 likes · 5 min read
DataFunTalk
Apr 28, 2021 · Big Data

Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features

This article explains how NVIDIA's RAPIDS Accelerator leverages GPUs to speed up Apache Spark 3.0 workloads, detailing the underlying architecture, benchmark results on TPC‑DS and recommendation models, required configuration changes, supported operators, shuffle optimizations, and the enhancements introduced in versions 0.2 and 0.3.
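The required configuration changes amount to loading the plugin and enabling GPU SQL execution. A sketch of the relevant spark-submit settings (the GPU amounts are illustrative; for the 0.x releases discussed here, the RAPIDS and cuDF jars must also be put on the classpath):

```
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.sql.enabled=true
--conf spark.rapids.sql.concurrentGpuTasks=2
--conf spark.executor.resource.gpu.amount=1
--conf spark.task.resource.gpu.amount=0.25
```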

Apache Spark · Big Data · GPU Acceleration
0 likes · 19 min read
Tencent Cloud Developer
Nov 13, 2020 · Big Data

Apache Spark Core: Architecture, Components, and Execution Flow

Apache Spark Core is a high‑performance, fault‑tolerant engine that abstracts distributed computation through SparkContext, DAG and Task schedulers, supports in‑memory and disk storage, runs on various cluster managers (YARN, Kubernetes, etc.), and unifies batch, streaming, ML and graph processing via its rich ecosystem.

Apache Spark · Big Data · DAG Scheduler
0 likes · 17 min read
Big Data Technology Architecture
Aug 12, 2020 · Big Data

Overview of New Features and Improvements in Apache Spark 3.0

Apache Spark 3.0 introduces a suite of performance enhancements, richer APIs, improved monitoring, SQL compatibility, new data sources, and ecosystem extensions, including Adaptive Query Execution, Dynamic Partition Pruning, Join Hints, pandas UDF improvements, and accelerator‑aware scheduling, to boost scalability and ease of use for big‑data workloads.

Adaptive Query Execution · Apache Spark · Big Data
0 likes · 15 min read
Big Data Technology Architecture
Aug 8, 2020 · Big Data

Overview of SQL Performance Improvements in Apache Spark 3.0

Apache Spark 3.0 introduces extensive SQL performance enhancements, including a new explain format, expanded join hints, adaptive query execution, dynamic partition pruning, enhanced nested column pruning, improved aggregation code generation, and support for newer Scala and Java versions, all aimed at optimizing query planning and execution.

Adaptive Query Execution · Apache Spark · Big Data
0 likes · 14 min read