How ByteDance Cut Spark History Storage by 90% with a Cloud‑Native UIService
ByteDance rebuilt Spark's History Server into a cloud‑native UIService that stores only essential UI metadata, reducing storage usage by over 90%, cutting UI latency by up to 94%, and enabling seamless horizontal scaling for large‑scale analytics workloads.
At ByteDance we built a new cloud‑native Spark History service called UIService. Compared with the open‑source Spark History Server (SHS), UIService cuts total storage consumption and tail (P95/P99) access latency by more than 90%, and it is now the default service for the Volcano Engine Lakehouse Analytics Service (LAS).
Background
The original Spark History Server relies on the Spark event system, persisting massive JSON event logs to HDFS. Each event log can reach tens of gigabytes, and a 7‑day window consumes about 3.2 PB of storage internally.
Pain Points
Huge storage overhead due to verbose JSON event logs.
High replay latency because the server parses the entire log to reconstruct UI state.
Poor scalability; FsHistoryProvider must scan all logs on startup, making the service stateful and hard to scale.
Not cloud‑native, leading to high operational cost and difficulty in multi‑tenant isolation.
Solution – UIService
We redesigned the History Server to persist only the UI‑relevant metadata (UIMeta) instead of full event logs.
UIMetaStore
All UI‑related objects (e.g., those held by AppStatusStore and SQLAppStatusStore) are grouped into a UIMetaStore, which is serialized with Spark’s native KVStoreSerializer (in open‑source Spark, a Jackson‑based serializer that GZIP‑compresses values). The classes covered are:
# AppStatusStore
org.apache.spark.status.JobDataWrapper
org.apache.spark.status.ExecutorStageSummaryWrapper
org.apache.spark.status.ApplicationInfoWrapper
org.apache.spark.status.PoolData
org.apache.spark.status.ExecutorSummaryWrapper
org.apache.spark.status.StageDataWrapper
org.apache.spark.status.AppSummary
org.apache.spark.status.RDDOperationGraphWrapper
org.apache.spark.status.TaskDataWrapper
org.apache.spark.status.ApplicationEnvironmentInfoWrapper
# SQLAppStatusStore
org.apache.spark.sql.execution.ui.SQLExecutionUIData
org.apache.spark.sql.execution.ui.SparkPlanGraphWrapper
The on‑disk format starts with a 4‑byte magic number "UI_S" followed by a sequence of (class_name_length, class_name, data_length, serialized_data) records.
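As a rough illustration of this record layout, the following Python sketch encodes a few records into a byte string. Only the magic number and the field order come from the article. The big‑endian 4‑byte lengths, the UTF‑8 class names, the function name write_uimeta, and the JSON‑looking payloads are assumptions made for the example, not the actual UIService implementation.

```python
import io
import struct

MAGIC = b"UI_S"  # 4-byte magic number named in the article


def write_uimeta(records):
    """Encode (class_name, payload_bytes) pairs into the described layout.

    Assumption: each length is a 4-byte big-endian signed int, as Java's
    DataOutputStream would write it. The article does not state byte order.
    """
    buf = io.BytesIO()
    buf.write(MAGIC)
    for class_name, payload in records:
        name_bytes = class_name.encode("utf-8")
        buf.write(struct.pack(">i", len(name_bytes)))  # class_name_length
        buf.write(name_bytes)                          # class_name
        buf.write(struct.pack(">i", len(payload)))     # data_length
        buf.write(payload)                             # serialized_data
    return buf.getvalue()


# Placeholder payloads; the real payloads would be serializer output.
JOB_CLASS = "org.apache.spark.status.JobDataWrapper"
blob = write_uimeta([
    (JOB_CLASS, b'{"jobId": 1}'),
    (JOB_CLASS, b'{"jobId": 2}'),
])
```

Because every record carries its own class name and length, a reader can walk the file sequentially and skip record types it does not recognize.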
4-Byte Magic Number: "UI_S"
----------- Body ---------------
4_byte_length_of_class_name | class_name_str1 | 4_byte_length | serialized_of_class1_instance1
4_byte_length_of_class_name | class_name_str1 | 4_byte_length | serialized_of_class1_instance2
4_byte_length_of_class_name | class_name_str2 | 4_byte_length | serialized_of_class2_instance1
4_byte_length_of_class_name | class_name_str2 | 4_byte_length | serialized_of_class2_instance2
UIMetaLoggingListener
Analogous to EventLoggingListener, this listener writes UIMeta snapshots only when a stage or job ends, which batches writes and avoids redundant data.
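The batching behavior can be pictured with a small Python sketch: updates accumulate in memory and are handed to a sink only when a stage or job ends. The class name UIMetaLoggingSketch, the method names, and the sink callable are hypothetical; the real listener runs inside Spark and is not shown in the article.

```python
class UIMetaLoggingSketch:
    """Buffers UI metadata updates and flushes only on stage-end or job-end."""

    def __init__(self, sink):
        self._pending = []  # updates accumulated since the last flush
        self._sink = sink   # callable that receives one batch (a list) per flush

    def on_task_end(self, task_id, payload):
        # Task events are buffered but never trigger a write on their own.
        self._pending.append(("TaskDataWrapper", payload))

    def on_stage_completed(self, stage_id, payload):
        self._pending.append(("StageDataWrapper", payload))
        self._flush()

    def on_job_end(self, job_id, payload):
        self._pending.append(("JobDataWrapper", payload))
        self._flush()

    def _flush(self):
        if not self._pending:
            return
        self._sink(list(self._pending))  # one batched write
        self._pending.clear()


batches = []
listener = UIMetaLoggingSketch(batches.append)
listener.on_task_end(1, b"t1")
listener.on_task_end(2, b"t2")
flushed_before_stage_end = len(batches)  # task events alone do not flush
listener.on_stage_completed(0, b"s0")    # first batch: two tasks plus the stage
listener.on_job_end(0, b"j0")            # second batch: the job record
```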
UIMetaProvider
Replaces the original FsHistoryProvider. Instead of scanning directories and replaying logs, it directly reads the corresponding UIMeta file for a given application ID, eliminating pre‑loading overhead and enabling horizontal scaling.
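To make the contrast with log replay concrete, here is a hedged Python sketch of a direct lookup: one file is opened by application ID and its records are parsed in a single pass. The path scheme (root/app_id.uimeta), the function names, and the big‑endian length encoding are assumptions for illustration. Only the "UI_S" magic number and the record order come from the article.

```python
import struct
import tempfile
from pathlib import Path

MAGIC = b"UI_S"


def parse_uimeta(blob):
    """Parse the record layout into a list of (class_name, payload) pairs."""
    if blob[:4] != MAGIC:
        raise ValueError("not a UIMeta file: bad magic number")
    records, pos = [], 4
    while pos < len(blob):
        (name_len,) = struct.unpack_from(">i", blob, pos)
        pos += 4
        class_name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (data_len,) = struct.unpack_from(">i", blob, pos)
        pos += 4
        if pos + data_len > len(blob):
            raise ValueError("truncated UIMeta record")
        records.append((class_name, blob[pos:pos + data_len]))
        pos += data_len
    return records


def load_app_ui(root, app_id):
    """Hypothetical lookup: one file per application, addressed by its ID."""
    return parse_uimeta((Path(root) / f"{app_id}.uimeta").read_bytes())


def _record(name, payload):
    # Test helper that builds one record in the assumed encoding.
    n = name.encode("utf-8")
    return struct.pack(">i", len(n)) + n + struct.pack(">i", len(payload)) + payload


with tempfile.TemporaryDirectory() as root:
    blob = (MAGIC
            + _record("org.apache.spark.status.AppSummary", b"a")
            + _record("org.apache.spark.status.PoolData", b"bc"))
    (Path(root) / "app-0001.uimeta").write_bytes(blob)
    loaded = load_app_ui(root, "app-0001")
```

Nothing in this lookup depends on state built at startup, which is why any replica of such a service could answer a request for any application.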
Optimizations
Deduplication: a map tracks already‑serialized objects so that only new or changed metadata is written, preventing duplicate writes.
Task‑level filtering: only completed task information is persisted, avoiding redundant snapshots of running tasks.
Fallback to event logs: if a UIMeta file is missing or corrupted, the system falls back to the original event log, ensuring reliability during migration.
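The three optimizations above can be sketched together in a few lines of Python. Everything here is illustrative: the digest‑based change detection, the set of terminal task states, and the names DedupWriter, completed_tasks, and load_ui are assumptions, since the article describes the ideas but not the code.

```python
import hashlib


class DedupWriter:
    """Writes an object only if it is new or changed since its last write."""

    def __init__(self):
        self._seen = {}    # (class_name, key) -> digest of last written payload
        self.written = []  # stands in for the UIMeta file

    def offer(self, class_name, key, payload):
        digest = hashlib.sha256(payload).digest()
        if self._seen.get((class_name, key)) == digest:
            return False   # identical to what was already written: skip
        self._seen[(class_name, key)] = digest
        self.written.append((class_name, key, payload))
        return True


def completed_tasks(tasks):
    """Task-level filtering: keep only tasks in a terminal state (assumed set)."""
    terminal = {"SUCCESS", "FAILED", "KILLED"}
    return [t for t in tasks if t["status"] in terminal]


def load_ui(app_id, read_uimeta, replay_event_log):
    """Prefer the UIMeta file; fall back to event-log replay if it is unusable."""
    try:
        return read_uimeta(app_id)
    except (FileNotFoundError, ValueError):
        return replay_event_log(app_id)


writer = DedupWriter()
first = writer.offer("StageDataWrapper", 3, b"v1")    # new object
repeat = writer.offer("StageDataWrapper", 3, b"v1")   # unchanged, skipped
changed = writer.offer("StageDataWrapper", 3, b"v2")  # changed, written again

kept = completed_tasks([
    {"id": 1, "status": "RUNNING"},
    {"id": 2, "status": "SUCCESS"},
])


def _missing_uimeta(app_id):
    raise FileNotFoundError(app_id)


source = load_ui("app-0001", _missing_uimeta, lambda a: "event-log:" + a)
```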
Benefits
Storage usage fell by 85% on average and by 92.4% in total. A 7‑day event‑log footprint of 3.2 PB shrank to 350 TB with UIMeta.
UI latency also improved markedly: average response time fell by 35%, and the 90th/95th/99th percentile latencies fell by 84.6%/90.8%/93.7%, respectively.
By removing the path‑scanning step, the time from job completion to UI availability decreased from ~10 minutes to seconds, and the service now scales horizontally to handle growing workloads.
UIService is also used in LAS to provide tenant isolation, cloud‑native deployment, and on‑demand scaling.
Volcano Engine Developer Services
