Practical Experience with Apache Kyuubi and Celeborn on the DXY Big Data Platform
This article presents a comprehensive technical overview of how DXY's big data platform leverages Apache Kyuubi and Celeborn to unify Spark entry points, configure flexible task isolation, implement fine‑grained AuthZ, optimize small files and Z‑Order sorting, and accelerate large result set transmission with Arrow, while also discussing operational challenges and upcoming features.
The article describes how the DXY (Dingxiangyuan) big data platform uses Apache Kyuubi and Apache Celeborn in production, covering the overall architecture and the motivations for introducing both projects.
Apache Kyuubi provides a unified entry point for Spark programs. It supports Hive Beeline, a RESTful API, and multi-tenant isolation, and ships plugins for Z-Order optimization, small-file merging, lineage, and audit. Together these improve YARN resource utilization.
Four flexible ways to configure tasks for gray (canary) rollout are presented: global settings in kyuubi-defaults.conf or spark-defaults.conf, JDBC URL suffix parameters (e.g., spark.sql.shuffle.partitions=2;spark.executor.memory=5g), the SET syntax, and the SessionConfAdvisor plugin, which supplies session configuration at runtime.
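The JDBC URL method can be sketched as follows. This is a minimal illustration, not the platform's actual tooling: the helper name kyuubi_jdbc_url, the host, and the port are assumptions, and the ";#key=value;key=value" suffix layout follows the Hive JDBC URL convention that Kyuubi accepts, which should be checked against the Kyuubi version in use.

```python
def kyuubi_jdbc_url(host: str, port: int, database: str, spark_conf: dict) -> str:
    """Build a JDBC URL whose suffix carries per-session Spark settings.

    Hypothetical helper: settings are appended after '#' as
    semicolon-separated key=value pairs.
    """
    base = f"jdbc:hive2://{host}:{port}/{database}"
    if not spark_conf:
        return base
    suffix = ";".join(f"{key}={value}" for key, value in spark_conf.items())
    return f"{base};#{suffix}"


# Host and port are placeholders for illustration only.
url = kyuubi_jdbc_url(
    "localhost",
    10009,
    "default",
    {"spark.sql.shuffle.partitions": 2, "spark.executor.memory": "5g"},
)
print(url)
```

Because the settings travel with the connection string, a single client can be switched to new parameters without touching the global configuration files, which is what makes this method convenient for gradual rollout.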
The Kyuubi AuthZ plugin offers three fine-grained permission controls (table/column level, row level, and data masking) by injecting rules into the analyzer and optimizer stages of Spark Catalyst. By default, policy information is sourced from Apache Ranger.
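The three controls can be illustrated with a toy model. The sketch below is plain Python and does not use the Kyuubi or Ranger APIs: TablePolicy, apply_policy, and the sample data are all hypothetical, and the real plugin rewrites the Spark query plan rather than filtering rows after the fact.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class TablePolicy:
    """Hypothetical policy for one user on one table."""
    allowed_columns: List[str]                      # column-level privilege
    row_filter: Callable[[Row], bool]               # row-level filter
    masks: Dict[str, Callable[[object], object]]    # data masking per column


def apply_policy(rows: List[Row], requested: List[str], policy: TablePolicy) -> List[Row]:
    # 1. Column-level check: reject the query if any column is not granted.
    denied = [c for c in requested if c not in policy.allowed_columns]
    if denied:
        raise PermissionError(f"no SELECT privilege on columns: {denied}")
    result = []
    for row in rows:
        # 2. Row-level filter: silently drop rows the user may not see.
        if not policy.row_filter(row):
            continue
        # 3. Data masking: transform sensitive values before returning them.
        out = {}
        for c in requested:
            mask = policy.masks.get(c)
            out[c] = mask(row[c]) if mask else row[c]
        result.append(out)
    return result


users = [
    {"id": 1, "city": "Hangzhou", "phone": "13800001234", "salary": 100},
    {"id": 2, "city": "Beijing", "phone": "13900005678", "salary": 200},
]
policy = TablePolicy(
    allowed_columns=["id", "city", "phone"],
    row_filter=lambda row: row["city"] == "Hangzhou",
    masks={"phone": lambda v: "*" * (len(v) - 4) + v[-4:]},
)
visible = apply_policy(users, ["id", "phone"], policy)
print(visible)
```

Requesting the salary column under this policy raises PermissionError, while the row filter and the mask are applied transparently to queries that pass the column check.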
Known AuthZ issues include the risk of bypassing permissions through SELECT-ON-FILES queries, enforcement that covers only the ExecuteStatement operation, and the lack of control over Scala and Python scripts. Mitigations such as disabling certain query types or restricting script usage are discussed.
Small-file optimization inserts a forced shuffle before the write and relies on Spark AQE to merge tiny partitions and alleviate data skew, which reduces the number of output files and improves storage efficiency.
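The merging step can be sketched as a greedy pass over shuffle partition sizes. This is a simplified model under stated assumptions, not Spark's implementation: the function name, the sizes, and the 128 MB target are illustrative, and real AQE also splits skewed partitions, which is omitted here.

```python
from typing import List


def coalesce_partitions(sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Group adjacent partitions so each group stays at or under target_mb.

    A partition that alone exceeds the target is kept as its own group.
    Returns the partition indexes in each group; one group becomes one file.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes_mb):
        # Close the current group if adding this partition would overflow it.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


sizes = [5, 8, 3, 120, 10, 2, 40, 60, 1]   # hypothetical partition sizes in MB
groups = coalesce_partitions(sizes, target_mb=128)
print(groups)        # [[0, 1, 2], [3], [4, 5, 6, 7, 8]]
print(len(groups))   # 3 output files instead of 9
```

Without the extra shuffle, each of the nine partitions would be written as its own file; with it, the writer sees three partitions of roughly comparable size.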
Z-Order optimization is compared with linear sorting, showing fewer file scans for point queries. Three implementation schemes (binary conversion, global row-number sorting, and partition-id sampling) are evaluated. Experiments demonstrate significant storage reduction (up to 90%) and fewer files scanned per query, while noting that global Z-Order can suffer from skew and may need a noise column.
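The comparison with linear sorting can be reproduced on a toy dataset. The sketch below uses the bit-interleaving (binary conversion) idea on an 8 x 8 grid of (x, y) rows and counts how many files a min/max-based reader would have to open. The grid size, the four-rows-per-file layout, and the function names are assumptions made for illustration; this is not the Kyuubi implementation.

```python
from typing import List, Tuple


def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y: x on even positions, y on odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def files_scanned(rows: List[Tuple[int, int]], rows_per_file: int,
                  column: int, value: int) -> int:
    """Count files whose min/max statistics on `column` may contain `value`."""
    count = 0
    for start in range(0, len(rows), rows_per_file):
        values = [row[column] for row in rows[start:start + rows_per_file]]
        if min(values) <= value <= max(values):
            count += 1
    return count


grid = [(x, y) for x in range(8) for y in range(8)]
linear = sorted(grid)                                    # ORDER BY x, y
zorder = sorted(grid, key=lambda p: z_value(p[0], p[1]))

for name, layout in (("linear", linear), ("z-order", zorder)):
    on_x = files_scanned(layout, 4, column=0, value=5)
    on_y = files_scanned(layout, 4, column=1, value=5)
    print(name, "x=5:", on_x, "y=5:", on_y)
# linear  x=5: 2 y=5: 8
# z-order x=5: 4 y=5: 4
```

In this toy layout linear sorting is best for the leading column but forces a filter on the second column to open half of the 16 files, whereas the Z-Order layout keeps both filters at 4 files. That balance across columns is the property the point-query comparison relies on.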
Connection‑level configuration isolation (Final Stage Config Isolation) is introduced to separate runtime parameters for long‑running and short‑running tasks, reducing overhead on the Kyuubi server.
Arrow serialization for large result sets is enabled with SET kyuubi.operation.result.format=arrow. The article contrasts the Thrift and Arrow data flows, highlighting Arrow's zero-copy mechanism and executor-side serialization, which yield up to 2× performance gains for large result sets. It also notes current limitations, such as the lack of compression.
Apache Celeborn (Incubating) is examined as an alternative to the external shuffle service. It offers asynchronous push, fetch, commit, and write-back, which bring 7–20% performance improvements. Features under development include stage recompute, local shuffle reads, Hadoop MapReduce support, in-memory storage, authentication, and compatibility with Scala 2.13 and JDK 17.
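Switching a Spark job from the external shuffle service to Celeborn is mostly a matter of configuration. The sketch below renders a small set of settings as spark-submit arguments. The property names and the shuffle manager class are written as recalled from the Celeborn documentation and should be verified against the Celeborn release in use; the master host name is a placeholder, and the helper to_submit_args is hypothetical.

```python
from typing import Dict, List

# Assumed settings; confirm the exact keys for your Celeborn version.
celeborn_conf: Dict[str, str] = {
    # Route shuffle reads and writes through the Celeborn client.
    "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
    # Address of the Celeborn master (placeholder host).
    "spark.celeborn.master.endpoints": "celeborn-master:9097",
    # The node-local external shuffle service is no longer used.
    "spark.shuffle.service.enabled": "false",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(celeborn_conf)
print(" ".join(submit_args))
```

Because these are ordinary Spark properties, they can also be supplied through any of the session configuration methods described earlier, which allows Celeborn to be enabled for a subset of jobs first.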
Disabling Netty cache is shown to reduce the Kyuubi server's memory footprint from ~1 GB to 200–300 MB.
The article concludes with a summary of the presented techniques and their impact on the DXY big data platform.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.