
Kylin OLAP Platform Architecture, Optimizations, and 58.com Case Study

This article introduces Kylin, an HBase‑based multidimensional analysis platform, explains its architecture and key performance optimizations—multi‑tenant support, dimension‑dictionary handling, and cube size estimation—and presents a real‑world deployment and case study at 58.com.

58 Tech

Kylin is a multidimensional analysis platform that uses HBase as its storage and query engine, providing standard SQL query capabilities and achieving sub‑second query response times on massive datasets.

Kylin Architecture:

The OLAP engine consists of a metadata engine, storage engine, query engine, cube‑building engine, and a REST server for client requests.

Metadata engine: manages projects, Hive tables, data models, cubes, etc.

Storage engine: stores built cube data as HFiles in HBase.

Query engine: uses Calcite for SQL parsing and HBase coprocessor for parallel queries.

Cube‑building engine: leverages MapReduce for pre‑aggregation across dimension combinations.

REST server: handles client query requests.

JDBC/ODBC integration enables seamless connection with BI tools.
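Besides JDBC/ODBC, Kylin also exposes SQL over its REST API, which is what BI connectors ultimately talk to. The sketch below builds a request body for Kylin's query endpoint (`POST /kylin/api/query`); the sample project and table names (`learn_kylin`, `kylin_sales`) are from Kylin's bundled sample cube, and the host/port are deployment‑specific assumptions.

```python
import json

def kylin_query_payload(sql, project, limit=500, offset=0):
    """Build the JSON body for Kylin's SQL-over-REST endpoint
    (POST /kylin/api/query). Field names follow Kylin's public REST API."""
    return {"sql": sql, "project": project, "limit": limit, "offset": offset}

payload = kylin_query_payload(
    "SELECT part_dt, COUNT(*) FROM kylin_sales GROUP BY part_dt",
    project="learn_kylin",
)
body = json.dumps(payload)
# A client would POST `body` with basic auth to
# http://<kylin-host>:7070/kylin/api/query (host and port are
# deployment-specific; 7070 is Kylin's default).
```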

Kylin’s massive storage and coprocessor‑based aggregation make it suitable for recommendation evaluation, search evaluation, traffic conversion, and user behavior analysis.

1. Multi‑Tenant Support Optimization

The stock Kylin release provides only simple role‑based isolation and does not separate tenants during Hive script execution, MR job submission, HBase storage, or query execution. Two approaches were adopted:

Using Hadoop UGI: create a Hadoop UGI (UserGroupInformation) for the logged‑in user and execute operations within UGI.doAs, covering Hive metadata loading, MR job submission, HBase storage, and queries.

Using Hadoop proxy user: set the HADOOP_PROXY_USER environment variable before running Hive scripts, applying to all Hive script executions.
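The proxy‑user approach boils down to injecting `HADOOP_PROXY_USER` into the child‑process environment before each Hive script runs. A minimal sketch, assuming the cluster has the usual `hadoop.proxyuser.*` impersonation settings; the helper names are hypothetical, not Kylin APIs:

```python
import os
import subprocess

def proxy_env(user, base=None):
    """Environment for running a Hive script as `user` via Hadoop's
    proxy-user mechanism; the Hadoop client libraries read this variable."""
    env = dict(base if base is not None else os.environ)
    env["HADOOP_PROXY_USER"] = user
    return env

def run_hive_as(user, hive_cmd):
    # Hypothetical wrapper: every Hive script execution picks up the
    # impersonated tenant identity from the environment.
    return subprocess.run(hive_cmd, env=proxy_env(user))

# Example (not executed here):
# run_hive_as("tenant_a", ["hive", "-f", "etl.sql"])
```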

2. Dimension Dictionary Upload Optimization

Cube build tasks suffered long delays because each MR job uploaded every segment's dimension dictionaries as distributed‑cache files; as the number of segments grew, the uploads caused job start time‑outs.

Two optimizations were applied:

Skip dictionary upload for steps that do not need it (e.g., steps ①, ③‑⑤, ⑥).

Upload only the dictionaries required for a specific segment (e.g., step ② only needs the current segment’s dictionaries).
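The two optimizations above amount to filtering the distributed‑cache file list per MR step. A minimal sketch—the function and field names are hypothetical; real Kylin tracks dictionary paths in segment metadata:

```python
def dicts_to_upload(step_needs_dicts, current_segment, all_segments):
    """Pick the dictionary files to ship to the distributed cache for
    one MR step. Originally every step shipped all segments' dictionaries;
    the optimized logic skips steps that need none and, for steps that do,
    ships only the current segment's dictionaries."""
    if not step_needs_dicts:
        return []  # optimization 1: skip the upload entirely
    # optimization 2: only the current segment's dictionary paths
    return [s["dict_path"] for s in all_segments if s["name"] == current_segment]

segments = [
    {"name": "seg_2018_01", "dict_path": "/kylin/dict/seg_2018_01"},
    {"name": "seg_2018_02", "dict_path": "/kylin/dict/seg_2018_02"},
]
```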

Result: overall cube build time reduced by ~20% and job start failures largely eliminated.

3. Cube Data Size Estimation Optimization (resolved in v2.5)

The original estimation algorithm produced large errors, leading to either excessive or insufficient HBase region numbers, which degraded query performance.

Issues identified include fixed‑size assumptions for BitMap (8 KB) and HyperLogLog (64 KB) metrics, which do not reflect actual cardinalities.

New approach: use the latest segment’s statistics (input row count and HFile size) to compute per‑row size, then estimate the next segment’s size by multiplying per‑row size with the expected input rows.
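The estimation described above is simple arithmetic; the sketch below illustrates it with made‑up numbers, and the 5 GB region cut size is an assumption (the actual cut size is configurable in Kylin):

```python
def estimate_segment_bytes(prev_rows, prev_hfile_bytes, next_rows):
    """Estimate the next segment's HFile size from the latest segment's
    observed statistics: per-row size * expected input rows."""
    bytes_per_row = prev_hfile_bytes / prev_rows
    return bytes_per_row * next_rows

def region_count(estimated_bytes, region_cut_bytes=5 * 1024**3):
    """Derive an HBase region count from the size estimate
    (assumes a 5 GB cut size per region)."""
    return max(1, round(estimated_bytes / region_cut_bytes))

# Latest segment: 1.2e9 input rows produced 60 GiB of HFiles
# (~53.7 bytes/row); next segment expects 2e9 rows.
est = estimate_segment_bytes(1_200_000_000, 60 * 1024**3, 2_000_000_000)
```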

Result: HBase region count error dropped from >50% to ≤1%, stabilizing query performance and reducing region‑management pressure.

4. Other Optimizations

Set the HBase scan blockcache parameter to false, so that large cube scans do not evict hot data from the region servers' block cache.

Disable cardinality statistics when loading Hive table metadata.

Improve global dictionary construction by reducing frequent shard swaps (fixed in v2.5).

Case Study: Recommendation Effect Evaluation at 58.com

Data flow: exposure logs and click logs are collected via Flume, streamed to Kafka for real‑time processing, and batch‑loaded to Hadoop. ETL creates two wide tables (exposure and click), which serve as source tables for two Kylin cubes. Cubes are built periodically and used for recommendation effectiveness analysis.

After dimension optimization, the cube reduced from 2^15 (32,768) possible cuboids to 384, dramatically improving storage and query efficiency.
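The reduction follows from Kylin's dimension optimizations: with n free dimensions the full lattice has 2^n cuboids, while mandatory dimensions, hierarchies, and joint groups each shrink the product. The article does not give 58.com's exact 15‑dimension configuration, so the grouping in the example below is purely illustrative:

```python
def cuboid_count(normal, hierarchies=(), joints=(), mandatory=0):
    """Cuboid count under Kylin-style dimension optimizations.

    - each "normal" dimension contributes a factor of 2 (present/absent)
    - a hierarchy of depth h contributes (h + 1) instead of 2**h
    - a joint group contributes 2 (all-in or all-out) instead of 2**size
    - mandatory dims appear in every cuboid: factor 1 (param kept for clarity)
    """
    count = 2 ** normal
    for depth in hierarchies:
        count *= depth + 1
    for _group in joints:
        count *= 2
    return count

full = 2 ** 15  # unoptimized lattice over 15 dimensions: 32,768 cuboids
# One illustrative split of 15 dims: 5 normal, one depth-2 hierarchy,
# two joint groups of 3, and 2 mandatory -> 2**5 * 3 * 2 * 2 = 384.
optimized = cuboid_count(normal=5, hierarchies=(2,), joints=(3, 3), mandatory=2)
```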

Current statistics: exposure cube ~3 TB (≈110 billion rows), click cube ~2 TB (≈70 billion rows); total cubes >350, processing >460 billion raw records, pre‑computed results >1 TB, with 98% of queries returning within 0.5 s.

Future plans (2019): upgrade to Spark‑based cube building engine, add user‑retention analysis, richer SQL features, and more stable global dictionary algorithms.

Recruitment Notice

58 Data Platform Department is hiring big‑data development engineers for storage, computing, OLAP, messaging, and resource‑management tracks. Requirements include proficiency in Java/Scala/C++, familiarity with HDFS/HBase/Spark/Flink/Druid/Kafka/YARN, and large‑scale distributed system experience.

Location: Beijing, Chaoyang District, Jiuxianqiao North Road, Building 101. Contact: [email protected]

Tags: Big Data, Data Warehouse, HBase, OLAP, Kylin, Cube Optimization
Written by 58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.
