Big Data 17 min read

Performance Optimizations in Impala for Data Lake Queries: Iceberg and Codegen Enhancements

This article presents a comprehensive overview of Impala's high‑performance MPP query engine, its architecture for data‑lake workloads, and detailed performance optimizations including Iceberg table format improvements, manifest caching, and various Codegen techniques such as asynchronous compilation and caching.

DataFunSummit

Aug 7, 2023

Performance Optimizations in Impala for Data Lake Queries: Iceberg and Codegen Enhancements

Impala is a high‑performance, stateless MPP SQL query engine designed for data‑lake scenarios, supporting open storage (HDFS, Ozone) and open file/table formats (Parquet, ORC, Iceberg, Hudi) while providing robust security integration.

The engine consists of two master nodes (Statestore and Catalog Server) and execution nodes (Coordinators and Executors), with data stored externally; Impala does not retain any data or metadata itself.

Performance challenges in data‑lake queries stem from openness; optimizations include predicate push‑down, constant propagation, sub‑query rewrite, runtime filters, vectorized execution, and caching at various layers.

Iceberg optimizations covered: early support for multiple Iceberg catalogs, V2 merge‑on‑read handling (position delete files), anti‑join implementation to apply deletes efficiently, selective anti‑join avoidance for files without deletes, and a planned Iceberg‑specific operator to replace anti‑join (targeted for Impala 4.3).

Additional Iceberg enhancements: count(*) optimization using snapshot statistics for tables without deletes, and safe handling of delete files to avoid incorrect row counts.

Manifest caching reduces repeated planFiles() reads by caching manifest files inside the Iceberg library, yielding planning times comparable to Hive external tables.

Codegen improvements leverage LLVM to compile query fragments into optimized native code, dramatically reducing per‑row CPU cycles; however, synchronous Codegen adds latency for short queries.

Mitigations for Codegen latency include asynchronous Codegen (running compilation in a separate thread) and a Codegen cache that reuses compiled fragments across queries, both showing performance gains on small workloads.

The roadmap outlines continued Iceberg V2 support (delete/update, metadata queries) and deeper Codegen integration such as adaptive query compilation.

A Q&A section addresses vectorization plans, multi‑threaded execution tuning, performance differences between Iceberg and Hive tables, upcoming JSON table support, and materialized view considerations.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data query optimization data lake Iceberg Impala Codegen

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.