How Meituan Scaled Spark with Vectorized Execution Using Gluten + Velox

This article details Meituan's production‑grade adoption of Spark vectorized execution via the open‑source Gluten and Velox stack, explaining SIMD fundamentals, performance motivations, the end‑to‑end integration workflow, staged rollout, encountered challenges, and the resulting resource savings and speedups.

Past Memory Big Data

Jun 20, 2024

1 What is Vectorized Computing

Vectorized computing leverages SIMD (single-instruction-multiple-data) instructions, each of which operates on multiple data elements at once. A simple example adds two integer arrays element-wise in a loop; the traditional SISD approach loads the operands, computes, and stores the result, requiring three steps per element. With wider registers (e.g., a 256-bit AVX register holds eight 32-bit integers) the CPU can load, compute, and store several elements per instruction, substantially increasing per-core throughput.

Beyond SIMD, a vectorized execution framework processes data in columnar blocks rather than row by row. This improves data locality and reduces cache misses, amortizes virtual-function calls over a whole block instead of paying them per row, and leaves simple tight loops that compilers can auto-vectorize.

1.1 Parallel Data Processing: SIMD Instructions

On x86, SIMD support began with Intel MMX (1996) and evolved through SSE, AVX, AVX2, and AVX-512. Linux users can query the instruction sets a machine supports via lscpu or cpuid.
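The same check can be made from inside a program. The following is a minimal sketch, assuming GCC or Clang on an x86 target, where the `__builtin_cpu_supports` builtin is available; the helper names `cpuHasAvx2`, `cpuHasBmi2`, and `printSimdSupport` are hypothetical and not taken from Gluten or Velox.

```cpp
#include <cstdio>

// Assumption: GCC or Clang on x86. On any other compiler or target the
// helpers conservatively report "not supported".
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAS_CPU_BUILTINS 1
#else
#define HAS_CPU_BUILTINS 0
#endif

// True when the running CPU reports AVX2 (256-bit integer SIMD).
bool cpuHasAvx2() {
#if HAS_CPU_BUILTINS
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;
#endif
}

// True when the running CPU reports BMI2 (bit-manipulation instructions).
bool cpuHasBmi2() {
#if HAS_CPU_BUILTINS
  __builtin_cpu_init();
  return __builtin_cpu_supports("bmi2") != 0;
#else
  return false;
#endif
}

// Prints one line per feature, similar in spirit to reading the lscpu flags.
void printSimdSupport() {
  std::printf("avx2: %s\n", cpuHasAvx2() ? "yes" : "no");
  std::printf("bmi2: %s\n", cpuHasBmi2() ? "yes" : "no");
}
```

A runtime check like this is what lets one binary choose between a SIMD code path and a scalar fallback on a mixed fleet.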

1.2 Vectorized Execution Framework: Data Locality and Runtime Overhead

Row-by-row processing suffers from poor cache-hit rates, per-row handling of variable-length fields, and virtual-function call overhead on every row. Changing the execution model to process one columnar block at a time mitigates these issues.
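The locality argument can be illustrated with a toy example. This sketch assumes a hypothetical order table (`OrderRow`, `OrderBatch`, and the field names are invented for illustration) and sums one field in both layouts.

```cpp
#include <cstdint>
#include <vector>

// Row layout: every record carries all fields, so summing one field
// also pulls the unused fields through the cache.
struct OrderRow {
  int64_t id;
  int64_t userId;
  int32_t amount;
  int32_t status;
};

int64_t sumAmountRows(const std::vector<OrderRow>& rows) {
  int64_t total = 0;
  for (const OrderRow& r : rows) total += r.amount;  // stride of 24 bytes
  return total;
}

// Columnar layout: one contiguous array per field.
struct OrderBatch {
  std::vector<int64_t> id;
  std::vector<int64_t> userId;
  std::vector<int32_t> amount;
  std::vector<int32_t> status;
};

int64_t sumAmountColumn(const OrderBatch& batch) {
  int64_t total = 0;
  // Dense 4-byte stride: cache lines hold only useful data, and the
  // loop is simple enough for a compiler to auto-vectorize.
  for (int32_t v : batch.amount) total += v;
  return total;
}
```

Both functions return the same result; the difference is how much memory each one touches to compute it.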

1.3 How to Use Vectorized Computing

Auto‑vectorization via compiler flags (e.g., gcc -ftree-vectorize -O3).

Inspect compiler hints and disassembly for vector instructions such as vmovups, vpaddd.

Use intrinsic libraries (Intel Intrinsics, xsimd) that emit SIMD instructions.

Embed SIMD assembly (architecture‑specific, low portability).

Apply compiler directives (#pragma simd, #pragma omp simd) and hints (__restrict).
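The compiler-driven options above can be sketched on the array-addition loop. This assumes GCC or Clang; `addArraysAuto` is a hypothetical name. Without OpenMP SIMD support enabled (for example via -fopenmp-simd) the pragma is simply ignored and the loop still compiles as plain scalar code that -O3 may vectorize on its own.

```cpp
#include <cstddef>

// __restrict promises the compiler that the three arrays do not overlap,
// which removes the aliasing doubt that often blocks auto-vectorization.
void addArraysAuto(const int* __restrict a, const int* __restrict b,
                   int* __restrict c, std::size_t num) {
  // Request vectorization of the loop below; honored with -fopenmp-simd.
#pragma omp simd
  for (std::size_t i = 0; i < num; ++i) {
    c[i] = a[i] + b[i];
  }
}
```

Whether the loop was actually vectorized can be confirmed from compiler reports (for example gcc -fopt-info-vec or clang -Rpass=loop-vectorize) or by looking for vector instructions in the disassembly.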

Example of a vectorized addition using AVX intrinsics:

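#include <assert.h>
#include <immintrin.h>

// Adds two int32 arrays element-wise, eight lanes per iteration.
// Note: _mm256_add_epi32 is an AVX2 instruction (compile with -mavx2).
void addArraysAVX(const int* a, const int* b, int* c, int num) {
  assert(num % 8 == 0);  // no scalar tail loop: length must be a multiple of 8
  for (int i = 0; i < num; i += 8) {
    // Unaligned loads and stores: the aligned variants fault unless the
    // arrays are guaranteed to be 32-byte aligned.
    __m256i v_a = _mm256_loadu_si256((const __m256i*)&a[i]);
    __m256i v_b = _mm256_loadu_si256((const __m256i*)&b[i]);
    __m256i v_c = _mm256_add_epi32(v_a, v_b);
    _mm256_storeu_si256((__m256i*)&c[i], v_c);
  }
}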

Benchmarks show the AVX version runs in ~58 µs versus ~170 µs for the scalar version, a speedup of roughly 2.9×.

2 Why Vectorize Spark

OLAP engines such as ClickHouse and Doris have long used vectorization to achieve extreme query speeds. Recent industry work includes Databricks Photon (a reported 4× speedup) and Meta's open-source Velox. Gluten bridges Spark SQL to native vectorized back-ends like Velox, allowing JVM-based Spark to reap similar performance gains.

Meituan runs tens of thousands of warehouse nodes; vectorization promises resource savings and faster job completion without hardware upgrades. Public benchmarks (TPC‑H) show Gluten + Velox can be 1.7× faster than Spark 3.0.

3 How Meituan Implemented Spark Vectorization

3.1 Overall Construction Philosophy

Prioritize resource savings over raw speed because offline workloads are memory‑bound and cost‑sensitive.

Develop core components in native languages (C++/Rust) rather than Java to gain fine‑grained CPU control.

Design a pluggable, multi‑engine execution library instead of a Spark‑only solution.

Adopt the open‑source Gluten + Velox stack to avoid reinventing hundreds of Spark functions.

Provide a transparent migration path with a black‑box ETL test harness that validates job results, execution time, and resource usage across engine versions.

3.2 Spark + Gluten + Velox Execution Flow

Gluten registers a set of Spark extension rules (e.g., ColumnarOverrideRules) on the driver side. These rules rewrite Spark's physical plan, converting supported operators (e.g., FileScan) into native columnar equivalents and falling back to the original Spark operators where native support is missing. The transformed plan is emitted as a Substrait plan, sent to executors via RPC, and executed by the native backend through JNI.
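The rewrite-with-fallback idea can be sketched with a toy model. This is not Gluten's code: it assumes a plan that is a simple linear chain of operator names (leaf first), and `rewritePlan` and the `Native(...)`/`Fallback(...)` labels are hypothetical. The transition names mirror Spark's row/columnar transition operators.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy plan rewrite: operators the native backend supports are offloaded,
// the rest fall back to the JVM engine. Wherever execution crosses the
// engine boundary, a row/columnar transition is inserted.
std::vector<std::string> rewritePlan(
    const std::vector<std::string>& plan,
    const std::set<std::string>& nativeSupported) {
  std::vector<std::string> out;
  bool first = true;
  bool prevNative = false;
  for (const std::string& op : plan) {
    const bool isNative = nativeSupported.count(op) > 0;
    if (!first && isNative != prevNative) {
      // Leaving native code needs rows; entering it needs column batches.
      out.push_back(prevNative ? "ColumnarToRow" : "RowToColumnar");
    }
    out.push_back(std::string(isNative ? "Native" : "Fallback") + "(" + op +
                  ")");
    prevNative = isNative;
    first = false;
  }
  return out;
}
```

In this model every fallback operator in the middle of a plan costs two conversions, which is one reason operator coverage matters so much for end-to-end gains.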

3.3 Staged Rollout

Hardware Compatibility Check: Verify CPU support for BMI, AVX, etc., run TPC-DS/TPC-H benchmarks, and back-port Gluten patches to Spark 3.0.

Stability Verification: Enable ORC support, remote shuffle, and HDFS client adaptations, and raise the test-pass rate from <30 % to ~90 %.

Performance Verification: Diagnose and fix issues (parameter alignment, Arrow conversion removal, shuffle batch fetch, native HDFS client, jemalloc, operator tuning) to lift average resource savings from -70 % (i.e., a net regression) to +40 %.

Consistency Verification: Run large-scale tests on non-SLA jobs to ensure data correctness and positive ROI.

Gray-scale Release: Gradually enable the vectorized engine on eligible jobs, monitor metrics, and fall back to vanilla Spark if performance or correctness degrades.
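The fallback decision in the gray-scale stage can be sketched as a simple gate. Everything here is an assumption made for illustration: the `JobMetrics` fields, the `chooseEngine` name, and the 10 % runtime tolerance are hypothetical and do not describe Meituan's actual criteria.

```cpp
// Metrics from running the same job on one engine.
struct JobMetrics {
  bool resultsMatch;      // output identical to the vanilla Spark run
  double memorySeconds;   // memory x time consumed
  double runtimeSeconds;  // wall-clock duration
};

enum class Engine { VanillaSpark, Vectorized };

// Keep a job on the vectorized engine only if it is correct, saves
// resources, and does not slow down beyond the allowed tolerance.
Engine chooseEngine(const JobMetrics& vanilla, const JobMetrics& vectorized,
                    double maxRuntimeRegression = 0.10) {
  if (!vectorized.resultsMatch) return Engine::VanillaSpark;
  if (vectorized.memorySeconds >= vanilla.memorySeconds) {
    return Engine::VanillaSpark;
  }
  if (vectorized.runtimeSeconds >
      vanilla.runtimeSeconds * (1.0 + maxRuntimeRegression)) {
    return Engine::VanillaSpark;
  }
  return Engine::Vectorized;
}
```

Checking correctness first reflects the ordering in the rollout above: consistency is verified before any performance benefit is counted.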

4 Challenges Encountered

4.1 Stability Issues

Aggregation OOM: Velox's default flush threshold (75 % of off-heap memory) combined with incomplete spill support caused out-of-memory failures during large aggregations. Lowering the flush threshold mitigated OOM at the cost of reduced aggregation efficiency.

SIMD Crash: In FlatVector<T>::copyValuesAndNulls(), buffers for 128-bit types (e.g., LongDecimal) were not always 16-byte aligned, so the movaps instruction, which requires 16-byte-aligned memory operands, faulted. The root cause was Arrow's allocator not guaranteeing 16-byte alignment; the issue was fixed by enforcing alignment in Arrow's memory pool.
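The alignment requirement behind the SIMD crash can be illustrated in a few lines. This is a minimal sketch, assuming a C++17 standard library that provides `std::aligned_alloc` (not available on MSVC); `isAligned` and `allocate16` are hypothetical helpers, not the Arrow or Velox fix.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// True when p is a multiple of `alignment` bytes.
bool isAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Returns a 16-byte-aligned buffer of at least `bytes` bytes.
// std::aligned_alloc needs a size that is a multiple of the alignment,
// so the request is rounded up. The caller releases it with std::free.
void* allocate16(std::size_t bytes) {
  const std::size_t rounded = (bytes + 15) / 16 * 16;
  return std::aligned_alloc(16, rounded);
}
```

A pointer that is only 8-byte aligned passes ordinary loads and stores without complaint, which is why this class of bug tends to surface only when a 128-bit type reaches an aligned SIMD instruction.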

4.2 ORC Support and Read/Write Optimizations

Velox originally supported only DWRF and Parquet. Meituan extended Velox to read ORC by reusing DWRF readers and added the following enhancements:

RLEv2 decoding with filter push-down (Velox-5443, Velox-6647), including varint decoding accelerated via BMI2.

Decimal type support and filter push‑down (Velox‑5837, Velox‑6240).

File‑handle reuse (Velox‑6140) to reduce NameNode open requests.

ISA‑L accelerated Zlib decompression, cutting ORC read time in half and doubling throughput.
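For readers unfamiliar with varints, the sketch below shows a plain scalar decoder for base-128 varints with zigzag-encoded signed values. It is a baseline for understanding what gets accelerated, not the BMI2 implementation from the Velox patches; `decodeVarint` and `zigzagDecode` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes one unsigned base-128 varint starting at data[pos].
// Each byte contributes its low 7 bits, least-significant group first;
// a clear high bit marks the last byte. On success, stores the result
// in `value`, advances `pos` past the varint, and returns true.
// Returns false if the input ends early or the varint is too long.
bool decodeVarint(const uint8_t* data, std::size_t size, std::size_t& pos,
                  uint64_t& value) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64 && pos < size; shift += 7) {
    const uint8_t byte = data[pos++];
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      value = result;
      return true;
    }
  }
  return false;
}

// Maps the zigzag encoding back to signed: 0,1,2,3,4 -> 0,-1,1,-2,2.
int64_t zigzagDecode(uint64_t v) {
  return static_cast<int64_t>(v >> 1) ^ -static_cast<int64_t>(v & 1);
}
```

The byte-at-a-time loop with a data-dependent branch is what makes scalar varint decoding slow, and it is the part that bit-manipulation instructions can replace with wider, branch-free extraction.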

4.3 Native HDFS Client Optimizations

Random-read amplification: The original client prefetched entire blocks, inflating I/O for small random reads. The fix limits reads to the exact requested range.

Slow DataNode mitigation: By monitoring per-DataNode latency and routing reads away from tail-latency nodes (P99.9 threshold), average read/write latency dropped by two-thirds and throughput doubled.
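The slow-DataNode idea can be sketched as a percentile threshold plus replica selection. The article does not describe the actual mechanism, so this is one possible reading: the `Replica` struct, `percentile`, and `chooseReplica` are hypothetical, and the threshold is assumed to come from a high percentile of recently observed latencies.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Nearest-rank percentile; `fraction` is in (0, 1], e.g. 0.999 for P99.9.
double percentile(std::vector<double> samples, double fraction) {
  if (samples.empty()) return 0.0;
  std::sort(samples.begin(), samples.end());
  std::size_t rank =
      static_cast<std::size_t>(std::ceil(fraction * samples.size()));
  rank = std::min(std::max<std::size_t>(rank, 1), samples.size());
  return samples[rank - 1];
}

struct Replica {
  std::string dataNode;    // hypothetical DataNode identifier
  double recentLatencyMs;  // recent latency observed for this node
};

// Returns the first replica at or below the threshold. If every replica
// is slow, returns the fastest one so the read can still proceed.
// Returns nullptr only when there are no replicas at all.
const Replica* chooseReplica(const std::vector<Replica>& replicas,
                             double thresholdMs) {
  const Replica* fastest = nullptr;
  for (const Replica& r : replicas) {
    if (r.recentLatencyMs <= thresholdMs) return &r;
    if (fastest == nullptr || r.recentLatencyMs < fastest->recentLatencyMs) {
      fastest = &r;
    }
  }
  return fastest;
}
```

Falling back to the fastest replica instead of failing keeps the mitigation safe when the whole cluster is temporarily slow.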

4.4 Shuffle Redesign

Gluten’s shuffle abstraction originally required many writer implementations for each combination of data format and partitioning mode. Meituan introduced a compositional design that decouples format from partitioning, reducing code duplication and allowing direct RowVector handling to avoid extra conversions.
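The compositional idea can be sketched as two small interfaces combined by one writer. This is an illustration of the design principle only: `Partitioner`, `Serializer`, `ShuffleWriter`, and the concrete classes are hypothetical names, and `Row` is a stand-in for real columnar data rather than Velox's RowVector.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Row = std::vector<int64_t>;  // hypothetical stand-in for a record

// "Where does a row go?" is independent of ...
class Partitioner {
 public:
  virtual ~Partitioner() = default;
  virtual std::size_t partitionOf(const Row& row) const = 0;
};

// ... "how is a row encoded?"
class Serializer {
 public:
  virtual ~Serializer() = default;
  virtual std::string serialize(const Row& row) const = 0;
};

// Partitions by the first column modulo the partition count.
class ModuloPartitioner : public Partitioner {
 public:
  explicit ModuloPartitioner(std::size_t n) : numPartitions_(n) {}
  std::size_t partitionOf(const Row& row) const override {
    return static_cast<uint64_t>(row.at(0)) % numPartitions_;
  }

 private:
  std::size_t numPartitions_;
};

// Encodes a row as comma-separated text.
class TextSerializer : public Serializer {
 public:
  std::string serialize(const Row& row) const override {
    std::string out;
    for (std::size_t i = 0; i < row.size(); ++i) {
      if (i > 0) out += ',';
      out += std::to_string(row[i]);
    }
    return out;
  }
};

// One writer composes any partitioner with any serializer, so M modes
// and N formats need M + N classes instead of M x N writers.
class ShuffleWriter {
 public:
  ShuffleWriter(std::unique_ptr<Partitioner> partitioner,
                std::unique_ptr<Serializer> serializer,
                std::size_t numPartitions)
      : partitioner_(std::move(partitioner)),
        serializer_(std::move(serializer)),
        buffers_(numPartitions) {}

  void write(const Row& row) {
    buffers_.at(partitioner_->partitionOf(row))
        .push_back(serializer_->serialize(row));
  }

  const std::vector<std::vector<std::string>>& buffers() const {
    return buffers_;
  }

 private:
  std::unique_ptr<Partitioner> partitioner_;
  std::unique_ptr<Serializer> serializer_;
  std::vector<std::vector<std::string>> buffers_;
};
```

Adding a new partitioning mode or data format then means adding one class, with no change to the writer itself.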

4.5 HBO (History-Based Optimization) Adaptation

Gluten allocates most of its memory off-heap (75 % of the total), whereas vanilla Spark works on-heap (58 % utilization). Extending HBO to cover off-heap memory raised the memory-saving ratio from 30 % to 40 % and eliminated the OOMs caused by insufficient on-heap allocation.

4.6 Consistency Issues

Older ORC footers lacking column names caused missing data; Velox’s TableScan now injects column names from Hive metastore.

Distinct‑count aggregation produced incorrect results due to premature flush of intermediate hash tables; the short‑term fix disables early flush, while a longer‑term solution modifies plan generation to use final aggregation directly.

Floating‑point to string conversion yielded extra precision (e.g., 5.0799999). Adjusting DoubleToStringConverter to SHORTEST_SINGLE restored the expected 5.08 output.
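The effect is easy to reproduce outside Velox. The sketch below uses only the C standard library rather than the double-conversion library, so it shows the same kind of artifact, not Velox's code path; `widenedToString` and `shortestFloatToString` are hypothetical helpers.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Formats a float after widening it to double, with 8 significant
// digits. The digits beyond float precision become visible.
std::string widenedToString(float f) {
  char buf[32];
  std::snprintf(buf, sizeof buf, "%.8g", static_cast<double>(f));
  return buf;
}

// Finds the shortest decimal string that parses back to exactly the
// same float, which is the idea behind a "shortest single" mode.
std::string shortestFloatToString(float f) {
  char buf[32];
  for (int precision = 1; precision <= 9; ++precision) {
    std::snprintf(buf, sizeof buf, "%.*g", precision,
                  static_cast<double>(f));
    if (std::strtof(buf, nullptr) == f) break;  // round-trips: done
  }
  return buf;
}
```

For the value 5.08f, the first function yields "5.0799999" and the second yields "5.08", because the nearest float to 5.08 is slightly below it and only float-aware formatting hides that.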

5 Production Impact

More than 20 000 ETL jobs have been migrated, achieving over 40 % average memory savings and a 13 % reduction in execution time. A 30 TB job that previously required 7 hours (dominated by read and decompression) now finishes in under 2 hours thanks to ISA-L accelerated decompression, fewer column-to-row conversions, and the native HDFS client.

6 Future Plans

Upgrade the stack to Spark 3.5, which is expected to cut memory-seconds (memory × time) by another 40 % and reduce runtime by 17 %.

Maintain a stable Spark baseline, upgrading roughly every three years.

Continuously track and adopt new Gluten/Velox releases, targeting bi‑annual updates.

Expand vectorized operator and UDF coverage, including automatic conversion of Java UDFs to C++ via large‑language models.

Port remaining text‑file and custom formats to ORC or provide native C++ readers.

