Understanding OLAP Types, Open‑Source Products, and Performance Optimization Techniques
This article explains the classification of OLAP data warehouses by data volume and modeling approach, compares MOLAP, ROLAP, HOLAP and HTAP, reviews popular open‑source ROLAP systems, and details advanced performance‑boosting techniques such as MPP architectures, cost‑based optimization, vectorized execution, dynamic code generation, and runtime filtering.
The article begins by introducing OLAP as the analytical counterpart to OLTP, outlining basic data‑warehouse concepts such as multidimensional models, data cubes, and typical operations, and then proceeds to discuss the classification of OLAP systems.
1. Classification by data volume – OLAP warehouses are grouped into three ranges: small‑scale (suitable for relational databases like MySQL), medium‑scale (millions to hundreds of billions of rows, e.g., Cloudera Impala, Facebook Presto, Pivotal Greenplum), and large‑scale (offline warehouses like Hive or Spark). The author notes that their team at NetEase focuses on medium‑scale, real‑time analytical warehouses.
2. Classification by modeling type – According to Wikipedia, OLAP can be divided into MOLAP, ROLAP, and HOLAP. Each type is described in detail:
MOLAP stores pre‑computed results in multidimensional arrays (cubes). It offers fast, index‑free queries but requires costly pre‑computation and is less flexible for schema changes. Apache Kylin is given as a typical open‑source MOLAP engine.
ROLAP operates directly on relational fact and dimension tables without pre‑computation, providing better scalability and flexibility. However, query latency can be high for large data volumes, especially for complex analytical queries. The article cites surveys showing ROLAP tools are used far more often than MOLAP tools.
HOLAP combines the strengths of MOLAP and ROLAP by using pre‑computation for frequent, stable queries while falling back to relational execution for ad‑hoc or less frequent queries. No open‑source HOLAP systems are mentioned.
HTAP is presented as an extension of ROLAP that adds transactional capabilities, though the article does not explore it further.
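The MOLAP/ROLAP trade-off described above can be illustrated with a minimal sketch. The fact rows, dimension names, and function names below are hypothetical and are not taken from the article or from any real engine such as Kylin; the sketch only contrasts paying the aggregation cost once at build time with paying it on every query.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, sales amount).
FACTS = [
    ("east", "tea", 10),
    ("east", "coffee", 20),
    ("west", "tea", 5),
    ("west", "coffee", 15),
]
DIMENSIONS = ("region", "product")


def build_cube(facts):
    """MOLAP-style: pre-aggregate every combination of dimensions once."""
    cube = defaultdict(int)
    for region, product, amount in facts:
        row = {"region": region, "product": product}
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = tuple((d, row[d]) for d in dims)
                cube[key] += amount
    return dict(cube)


def molap_sum(cube, **filters):
    """Answer a query with a single lookup into the pre-computed cube."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return cube[key]


def rolap_sum(facts, **filters):
    """ROLAP-style: scan the fact rows and aggregate at query time."""
    total = 0
    for region, product, amount in facts:
        row = {"region": region, "product": product}
        if all(row[d] == v for d, v in filters.items()):
            total += amount
    return total


CUBE = build_cube(FACTS)
```

Both `molap_sum(CUBE, region="east")` and `rolap_sum(FACTS, region="east")` return the same total, but the cube already holds one cell per dimension combination, which is why pre-computation becomes costly as dimensions are added and why a schema change forces a rebuild.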
3. Open‑source ROLAP products – Two architectural families are discussed:
Wide‑table models (e.g., Druid, ClickHouse, Elasticsearch, Solr) that provide high‑performance queries over columnar storage or inverted indexes but have limited SQL support.
Multi‑table (star or snowflake) models (e.g., Greenplum, Presto, Impala) that rely on MPP architectures and support rich SQL.
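As a rough illustration of the two families, the sketch below answers the same question (sales per city) against a star schema and against a pre-joined wide table. The table contents and function names are hypothetical; real systems store these tables on disk in columnar form, which this in-memory sketch does not model.

```python
from collections import defaultdict

# Star schema: a fact table that references a dimension table by key.
DIM_STORE = {1: {"city": "Hangzhou"}, 2: {"city": "Beijing"}}
FACT_SALES = [(1, 100), (1, 50), (2, 70)]  # (store_id, amount)


def sales_by_city_star(fact, dim):
    """Multi-table model: join fact to dimension at query time, then aggregate."""
    totals = defaultdict(int)
    for store_id, amount in fact:
        totals[dim[store_id]["city"]] += amount
    return dict(totals)


def flatten(fact, dim):
    """Denormalize once at load time into a single wide table."""
    return [
        {"store_id": store_id, "city": dim[store_id]["city"], "amount": amount}
        for store_id, amount in fact
    ]


def sales_by_city_wide(wide):
    """Wide-table model: no join is needed, only a scan and an aggregation."""
    totals = defaultdict(int)
    for row in wide:
        totals[row["city"]] += row["amount"]
    return dict(totals)


WIDE_SALES = flatten(FACT_SALES, DIM_STORE)
```

The wide table avoids the join at query time but duplicates dimension attributes on every row and must be rebuilt when a dimension changes, whereas the star schema keeps dimensions separate and relies on the engine to join efficiently.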
4. Performance‑boosting techniques for ROLAP – The article enumerates several key technologies:
MPP architecture : Massively parallel processing reduces query latency compared to MapReduce (MR) by keeping intermediate results in memory and pipelining operators.
MR limitations : Independent MapReduce jobs prevent cross‑stage optimization and cause heavy I/O.
Cost‑Based Optimization (CBO) : Uses detailed statistics (min/max, histograms, cardinality) to choose join orders, join types (broadcast vs. partitioned), and execution trees (left‑deep vs. bushy).
Rule‑Based Optimization (RBO) : Applies heuristics such as predicate push‑down, projection push‑down, constant folding, and hash join selection.
Join execution strategies : Broadcast join for small‑table joins, partitioned join for large‑table joins, with decisions based on table size estimates or dynamic sampling.
Vectorized execution engines : Process columns in batches, leverage SIMD, reduce virtual‑function calls, and improve CPU cache utilization.
Dynamic code generation : Generates query‑specific code at runtime (LLVM‑compiled native code in Impala, JVM bytecode in Spark SQL) to eliminate type checks and virtual dispatch.
Storage and access optimizations : Columnar storage, compression (zlib, snappy, lz4), encoding (RLE, dictionary), row‑group metadata, local indexes, and rich statistics.
Runtime filters : Bloom‑filter based pruning (Impala) or dynamic partition pruning (Spark SQL) that filter data early during scans.
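The runtime-filter idea in the last item can be sketched as a hash join whose build side also produces a Bloom filter that the probe-side scan applies before rows reach the join. This is a minimal single-process sketch under stated assumptions: the `BloomFilter` class, its sizing, and `join_with_runtime_filter` are hypothetical and do not reproduce Impala's or Spark SQL's actual implementation.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def join_with_runtime_filter(dim_rows, fact_rows):
    """Join (key, name) rows with (key, amount) rows on key."""
    # Build phase: hash table plus Bloom filter over the small side's join keys.
    hash_table = {}
    bloom = BloomFilter()
    for key, name in dim_rows:
        hash_table[key] = name
        bloom.add(key)

    # Scan phase: the filter discards most non-matching fact rows early.
    scanned = [row for row in fact_rows if bloom.might_contain(row[0])]

    # Probe phase: the exact hash lookup removes any Bloom false positives.
    joined = [
        (key, hash_table[key], amount)
        for key, amount in scanned
        if key in hash_table
    ]
    return joined, len(scanned)
```

In a distributed engine the filter would be built on the nodes holding the small table and shipped to the scan nodes of the large table, so the saving comes from rows that are never read, decoded, or sent over the network. The second return value here simply reports how many fact rows survived the scan.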
The article also mentions cluster resource management (YARN) and the need for long‑lived ApplicationMasters and container reuse to reduce query startup latency.
5. Summary – The author reflects that the series consolidates recent literature on OLAP systems, emphasizing the importance of choosing the right architecture for the workload and applying the discussed optimizations to achieve low‑latency analytical queries.
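The following hdfs-site.xml properties enable HDFS short‑circuit local reads, which let a client on the same host as a DataNode read block files directly instead of going through the DataNode:
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>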
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.