
How ByteDance Cut Spark History Storage by 90% with a Cloud‑Native UIService

ByteDance rebuilt Spark's History Server into a cloud‑native UIService that stores only essential UI metadata, reducing storage usage by over 90%, cutting UI latency by up to 94%, and enabling seamless horizontal scaling for large‑scale analytics workloads.

Volcano Engine Developer Services

Apr 11, 2022

At ByteDance we built a new cloud-native Spark History service called UIService. Compared with the open-source Spark History Server (SHS), UIService cuts total storage consumption by more than 90% and high-percentile access latency by up to 94%. It is now the default history service for the Volcano Engine Lakehouse Analytics Service (LAS).

Background

The original Spark History Server relies on the Spark event system, persisting massive JSON event logs to HDFS. Each event log can reach tens of gigabytes, and a 7‑day window consumes about 3.2 PB of storage internally.

Pain Points

Huge storage overhead due to verbose JSON event logs.

High replay latency because the server parses the entire log to reconstruct UI state.

Poor scalability; FsHistoryProvider must scan all logs on startup, making the service stateful and hard to scale.

Not cloud‑native, leading to high operational cost and difficulty in multi‑tenant isolation.

Solution – UIService

We redesigned the History Server to persist only the UI‑relevant metadata (UIMeta) instead of full event logs.

UIMetaStore

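All UI-related objects (e.g., those held by AppStatusStore and SQLAppStatusStore) are grouped into a UIMetaStore, which is serialized with Spark's native KVStoreSerializer (in open-source Spark, a Jackson-based serializer that writes GZIP-compressed JSON). The classes involved are: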

# AppStatusStore
org.apache.spark.status.JobDataWrapper
org.apache.spark.status.ExecutorStageSummaryWrapper
org.apache.spark.status.ApplicationInfoWrapper
org.apache.spark.status.PoolData
org.apache.spark.status.ExecutorSummaryWrapper
org.apache.spark.status.StageDataWrapper
org.apache.spark.status.AppSummary
org.apache.spark.status.RDDOperationGraphWrapper
org.apache.spark.status.TaskDataWrapper
org.apache.spark.status.ApplicationEnvironmentInfoWrapper
# SQLAppStatusStore
org.apache.spark.sql.execution.ui.SQLExecutionUIData
org.apache.spark.sql.execution.ui.SparkPlanGraphWrapper

The on-disk format starts with a 4-byte magic number "UI_S", followed by a sequence of (class_name_length, class_name, data_length, serialized_data) records:

4-Byte Magic Number: "UI_S"
----------- Body ---------------
4_byte_length_of_class_name | class_name_str1 | 4_byte_length | serialized_of_class1_instance1
4_byte_length_of_class_name | class_name_str1 | 4_byte_length | serialized_of_class1_instance2
4_byte_length_of_class_name | class_name_str2 | 4_byte_length | serialized_of_class2_instance1
4_byte_length_of_class_name | class_name_str2 | 4_byte_length | serialized_of_class2_instance2
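The layout above is a simple length-prefixed record stream, and a minimal Python sketch can make it concrete. This is an illustration rather than ByteDance's implementation: the source does not state the byte order or signedness of the length fields, so 4-byte big-endian signed integers are assumed here, the payload is treated as opaque bytes, and the function names are hypothetical.

```python
import io
import struct

MAGIC = b"UI_S"  # 4-byte magic number from the layout above


def write_uimeta(stream, records):
    """Write (class_name, payload_bytes) pairs in the record layout above."""
    stream.write(MAGIC)
    for class_name, payload in records:
        name = class_name.encode("utf-8")
        # Assumption: lengths are 4-byte big-endian signed ints.
        stream.write(struct.pack(">i", len(name)))
        stream.write(name)
        stream.write(struct.pack(">i", len(payload)))
        stream.write(payload)


def _read_exact(stream, n):
    """Read exactly n bytes or raise ValueError."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated UIMeta stream")
    return data


def read_uimeta(stream):
    """Yield (class_name, payload_bytes) pairs; raise ValueError on bad input."""
    if stream.read(4) != MAGIC:
        raise ValueError("not a UIMeta file: bad magic number")
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) != 4:
            raise ValueError("truncated UIMeta stream")
        (name_len,) = struct.unpack(">i", header)
        class_name = _read_exact(stream, name_len).decode("utf-8")
        (data_len,) = struct.unpack(">i", _read_exact(stream, 4))
        yield class_name, _read_exact(stream, data_len)


# Round trip through an in-memory buffer.
buf = io.BytesIO()
write_uimeta(buf, [("org.apache.spark.status.AppSummary", b"\x01\x02")])
buf.seek(0)
decoded = list(read_uimeta(buf))
```

Because every record carries its class name and length, a reader can skip record types it does not understand and can detect truncation without parsing the payload.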

UIMetaLoggingListener

Analogous to EventLoggingListener, this listener writes UIMeta snapshots only when stage-end or job-end events occur, which batches writes and avoids persisting redundant intermediate state.
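The flush-on-boundary behavior can be sketched in a few lines of Python. The real listener would be a JVM class hooked into Spark's listener bus; the class name, event names, and `sink` callback below are hypothetical and only illustrate the batching idea.

```python
FLUSH_EVENTS = {"stage_end", "job_end"}  # hypothetical event names


class SnapshotOnBoundaryListener:
    """Buffers UI state updates and emits a snapshot only on stage/job end."""

    def __init__(self, sink):
        self._sink = sink    # callable that receives each snapshot (a dict)
        self._pending = {}   # latest state per object key since the last flush

    def on_event(self, event_type, key=None, state=None):
        if key is not None:
            # Later updates to the same key overwrite earlier ones, so
            # intermediate states are never written.
            self._pending[key] = state
        if event_type in FLUSH_EVENTS and self._pending:
            self._sink(dict(self._pending))
            self._pending.clear()


snapshots = []
listener = SnapshotOnBoundaryListener(snapshots.append)
listener.on_event("task_end", key="task-1", state="SUCCESS")
listener.on_event("task_end", key="task-2", state="SUCCESS")
listener.on_event("stage_end", key="stage-0", state="COMPLETE")
# snapshots now holds a single batch covering all three updates
```

Three events produce one write here, whereas an event log would record each event individually.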

UIMetaProvider

UIMetaProvider replaces the original FsHistoryProvider. Instead of scanning directories and replaying logs, it reads the UIMeta file for a given application ID directly. This eliminates the pre-loading overhead and leaves the service with no per-instance state, which is what enables horizontal scaling.
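A minimal sketch of the direct-lookup idea follows, under assumptions the source does not confirm: one file per application, a hypothetical `.uimeta` suffix, and a local directory standing in for HDFS or object storage.

```python
from pathlib import Path

MAGIC = b"UI_S"  # magic number that prefixes every UIMeta file


def uimeta_path(base_dir, app_id):
    # Hypothetical layout: one file per application, named after its ID.
    return Path(base_dir) / f"{app_id}.uimeta"


def load_uimeta_bytes(base_dir, app_id):
    """Return the raw UIMeta body for app_id, or None if missing or invalid."""
    path = uimeta_path(base_dir, app_id)
    try:
        data = path.read_bytes()
    except FileNotFoundError:
        return None
    if data[:4] != MAGIC:
        return None  # not a UIMeta file; the caller decides what to do next
    return data[4:]
```

The lookup is a pure function of the application ID, with no directory listing and no in-memory index. Any replica can therefore serve any request, and new replicas need no warm-up.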

Optimizations

Deduplication: a map tracks already‑serialized objects so that only new or changed metadata is written, preventing duplicate writes.

Task‑level filtering: only completed task information is persisted, avoiding redundant snapshots of running tasks.

Fallback to event logs: if a UIMeta file is missing or corrupted, the system falls back to the original event log, ensuring reliability during migration.
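The three optimizations above can be illustrated together in a short Python sketch. All names are hypothetical, the task status values are placeholders, and hashing the serialized bytes is just one way to detect an unchanged object; the source only says that a map tracks what has already been serialized.

```python
import hashlib


class DedupWriter:
    """Tracks a digest per object key and skips objects whose bytes are unchanged."""

    def __init__(self):
        self._seen = {}     # key -> digest of the last bytes written
        self.written = []   # stands in for the UIMeta output stream

    def write_if_changed(self, key, payload):
        digest = hashlib.sha256(payload).digest()
        if self._seen.get(key) == digest:
            return False    # identical to what was already persisted
        self._seen[key] = digest
        self.written.append((key, payload))
        return True


def completed_tasks(tasks):
    """Keep only tasks in a terminal state (hypothetical status values)."""
    terminal = {"SUCCESS", "FAILED", "KILLED"}
    return [t for t in tasks if t["status"] in terminal]


def load_ui(app_id, load_uimeta, replay_event_log):
    """Prefer UIMeta; fall back to event-log replay if it is missing or corrupt."""
    try:
        meta = load_uimeta(app_id)
    except (FileNotFoundError, ValueError):
        meta = None
    return meta if meta is not None else replay_event_log(app_id)
```

In `load_ui`, both loaders are passed in as callables, so the fallback path can be removed once event logs are no longer written.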

Benefits

Storage usage dropped by 85% on average and by 92.4% in total. A 7-day event-log footprint of 3.2 PB shrank to 350 TB with UIMeta.

UI latency also improved dramatically. Average response time fell by 35%, and the 90th, 95th, and 99th percentile latencies fell by 84.6%, 90.8%, and 93.7%, respectively.
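As a quick sanity check, the reduction implied by the two footprint figures can be computed directly. The snippet assumes 1 PB = 1000 TB and uses only the rounded numbers quoted above.

```python
event_log_tb = 3.2 * 1000  # 3.2 PB, assuming 1 PB = 1000 TB
uimeta_tb = 350            # UIMeta footprint for the same 7-day window

reduction = 1 - uimeta_tb / event_log_tb
print(f"{reduction:.1%}")  # prints 89.1%
```

The rounded footprints imply a reduction of roughly 89%, a little below the reported 92.4% total. The source does not explain the gap. It may come from rounding or from the two figures being measured over different data sets.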

By removing the path‑scanning step, the time from job completion to UI availability decreased from ~10 minutes to seconds, and the service now scales horizontally to handle growing workloads.

UIService is also used in LAS to provide tenant isolation, cloud‑native deployment, and on‑demand scaling.
