Comparative Analysis of Big Data Storage and Query Solutions
This article reviews major big‑data storage and query architectures, including HBase, Dremel/Parquet, pre‑aggregation systems, Lucene, and the custom Tindex solution, and evaluates their strengths, weaknesses, and suitability for real‑time, high‑volume analytical workloads.
In recent years the value of big data has attracted increasing attention, prompting many enterprises to store and analyze massive incremental data streams generated by user behavior, IoT devices, and logs. The author outlines five key goals for an optimal storage and query system: lossless data, real‑time availability, rapid response to business queries, flexible exploratory analysis, and sub‑second analytics on trillion‑row datasets.
HBase family – Solutions such as OpenTSDB and Kylin built on HBase are suited for relatively fixed reporting workloads with few dimensions. HBase stores data as rowkey‑based KeyValue pairs, uses region‑based sharding, and provides fast single‑row lookups but suffers from poor scan performance and limited aggregation capabilities due to the lack of column indexes.
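The access pattern described above can be illustrated with a toy rowkey-ordered store. This is a minimal sketch in plain Python, not the HBase client API: the class SortedKeyValueStore, its methods, and the rowkey layout are all hypothetical names chosen for illustration. It shows why a lookup by rowkey (or rowkey prefix) is cheap, while a filter on a column value has to examine every row when no column index exists.

```python
import bisect


class SortedKeyValueStore:
    """Toy model of a rowkey-ordered KeyValue store (hypothetical, not HBase)."""

    def __init__(self):
        self._keys = []  # rowkeys kept in sorted order
        self._rows = {}  # rowkey -> {column: value}

    def put(self, rowkey, column, value):
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
            self._rows[rowkey] = {}
        self._rows[rowkey][column] = value

    def get(self, rowkey):
        """Single-row lookup: goes straight to the row by its key."""
        return self._rows.get(rowkey)

    def scan_prefix(self, prefix):
        """Range scan: binary-search to the first key, then read sequentially."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._rows[key]

    def filter_by_column(self, column, value):
        """No column index: every row must be examined."""
        examined = 0
        matches = []
        for key in self._keys:
            examined += 1
            if self._rows[key].get(column) == value:
                matches.append(key)
        return matches, examined


store = SortedKeyValueStore()
for i in range(1000):
    city = "beijing" if i % 10 == 0 else "other"
    store.put(f"user{i:04d}#20240101", "city", city)

row = store.get("user0010#20240101")            # direct lookup by rowkey
prefix_rows = list(store.scan_prefix("user000"))  # touches only 10 rows
matches, examined = store.filter_by_column("city", "beijing")
print(len(matches), "matches after examining", examined, "rows")
```

In this sketch the column filter inspects all 1000 rows to find 100 matches, which is the scan cost the paragraph attributes to the lack of column indexes.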
Dremel family – Represented by Apache Parquet, Dremel‑style columnar storage offers superior scan performance and avoids index‑building overhead. Data is organized into Row Groups, Column Chunks, and Pages, enabling projection and predicate push‑down optimizations. However, Parquet is primarily batch‑oriented, incurs higher costs for data reconstruction and aggregation, and provides limited real‑time write capabilities.
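Projection and predicate push-down can be sketched with a simplified in-memory layout. The code below is an illustration only and does not use the Parquet file format or any Parquet library: ColumnChunk, build_row_groups, and scan are hypothetical names, pages are omitted, and the only statistic kept per chunk is its minimum and maximum. The idea shown is that a reader can skip a whole row group when the chunk statistics rule out a match, and touches only the columns the query asks for.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Values of one column inside one row group (pages are omitted here)."""
    values: list

    @property
    def min(self):
        return min(self.values)

    @property
    def max(self):
        return max(self.values)


def build_row_groups(rows, columns, group_size):
    """Split rows into row groups, storing each column contiguously."""
    groups = []
    for start in range(0, len(rows), group_size):
        batch = rows[start:start + group_size]
        groups.append({c: ColumnChunk([r[c] for r in batch]) for c in columns})
    return groups


def scan(groups, project, filter_column, lower_bound):
    """Return projected tuples for rows where filter_column >= lower_bound."""
    result = []
    skipped = 0
    for group in groups:
        chunk = group[filter_column]
        # Predicate push-down: the chunk statistics prove no row can match.
        if chunk.max < lower_bound:
            skipped += 1
            continue
        for i, value in enumerate(chunk.values):
            if value >= lower_bound:
                # Projection: only the requested columns are read.
                result.append(tuple(group[c].values[i] for c in project))
    return result, skipped


rows = [{"ts": i, "user": f"u{i % 3}", "bytes": i * 10} for i in range(12)]
groups = build_row_groups(rows, ["ts", "user", "bytes"], group_size=4)
result, skipped = scan(groups, project=["bytes"], filter_column="ts", lower_bound=9)
print(result, "row groups skipped:", skipped)
```

Here two of the three row groups are skipped without reading any of their values, and the "user" column is never touched.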
Pre‑aggregation family – Systems such as Kylin, Druid, and Pinot pre‑aggregate metrics at ingest time. This delivers fast OLAP queries, but the raw detail rows are discarded, so data is lost and metric definitions are fixed once the data has been ingested.
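The trade-off of ingest-time pre-aggregation can be shown with a small roll-up. This is a minimal sketch under stated assumptions, not the ingestion code of Kylin, Druid, or Pinot: the function rollup, the event fields, and the two metrics (count and bytes_sum) are hypothetical. Events are bucketed by time granularity and dimension values, and only the aggregated metrics survive.

```python
from collections import defaultdict


def rollup(events, dimensions, granularity_s=3600):
    """Aggregate events per (time bucket, dimension values) at ingest time."""
    cube = defaultdict(lambda: {"count": 0, "bytes_sum": 0})
    for event in events:
        bucket = event["ts"] - event["ts"] % granularity_s
        key = (bucket,) + tuple(event[d] for d in dimensions)
        cube[key]["count"] += 1
        cube[key]["bytes_sum"] += event["bytes"]
    return dict(cube)


events = [
    {"ts": 10, "country": "CN", "user": "a", "bytes": 100},
    {"ts": 20, "country": "CN", "user": "b", "bytes": 200},
    {"ts": 3700, "country": "CN", "user": "a", "bytes": 50},
    {"ts": 30, "country": "US", "user": "c", "bytes": 70},
]

# Only "country" is kept as a dimension; the "user" field is dropped.
cube = rollup(events, dimensions=["country"])
print(cube[(0, "CN")])  # {'count': 2, 'bytes_sum': 300}
```

Four events collapse into three pre-computed cells, so queries over count and bytes_sum are fast. A question such as "how many distinct users per country" can no longer be answered from the cube, because the user field and the individual rows were discarded at ingest.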
Lucene family – Lucene provides powerful full‑text search based on inverted indexes and forms the basis of Elasticsearch and Solr. While excellent for search, its design is less suited to time‑series logs and large‑scale aggregation, and Lucene itself is a single‑node library that does not scale horizontally without additional sharding logic.
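The inverted-index idea behind this family can be sketched in a few lines. The code below is a simplified illustration and not Lucene's API or on-disk format: build_inverted_index and search_all are hypothetical helpers, tokenization is a plain whitespace split, and postings are uncompressed Python lists. Each term maps to the list of documents that contain it, and a multi-term query intersects those lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return index


def search_all(index, terms):
    """Return ids of documents that contain every query term (AND query)."""
    postings = [set(index.get(term.lower(), ())) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = [
    "error disk full",
    "user login ok",
    "disk error on node",
    "login error",
]
index = build_inverted_index(docs)
print(search_all(index, ["error", "disk"]))  # [0, 2]
```

Finding matching documents is cheap because only the postings of the query terms are read. Aggregating a metric over the matches is a different problem, since the index alone does not hold the per-document field values in a scan-friendly layout.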
Tindex solution – The author’s proprietary system combines a Lucene‑derived index layer with an extended Druid query engine. It offers high compression, columnar inverted and forward indexes, real‑time data ingestion, flexible metric definition, and seamless integration with HDFS, Kafka, Spark, and Hive. Features such as off‑heap memory reuse, segment‑level caching, and dynamic segment loading improve performance and reduce GC pressure.
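How a columnar inverted index and a forward index can work together is sketched below. This is not Tindex's implementation, which the article does not show: ColumnIndex and aggregate are hypothetical names, and compression, segments, and off-heap storage are left out. The assumption illustrated is that the inverted index resolves filters to row ids, the forward index supplies the values of the metric column for those rows, and the metric function is chosen at query time because the raw rows are retained.

```python
from collections import defaultdict


class ColumnIndex:
    """One column with both index directions (hypothetical structure)."""

    def __init__(self, values):
        self.forward = list(values)        # row id -> value
        self.inverted = defaultdict(list)  # value -> row ids
        for row_id, value in enumerate(self.forward):
            self.inverted[value].append(row_id)


def aggregate(columns, filters, metric_column, fn):
    """Filter via inverted indexes, then aggregate via the forward index."""
    row_ids = None
    for column, value in filters.items():
        ids = set(columns[column].inverted.get(value, ()))
        row_ids = ids if row_ids is None else row_ids & ids
    forward = columns[metric_column].forward
    if row_ids is None:  # no filters: aggregate over every row
        row_ids = range(len(forward))
    return fn(forward[i] for i in sorted(row_ids))


columns = {
    "country": ColumnIndex(["CN", "US", "CN", "CN"]),
    "device": ColumnIndex(["ios", "ios", "android", "ios"]),
    "bytes": ColumnIndex([100, 70, 200, 50]),
}

where = {"country": "CN", "device": "ios"}
total = aggregate(columns, where, "bytes", sum)   # metric chosen at query time
largest = aggregate(columns, where, "bytes", max)
print(total, largest)  # 150 100
```

Because no aggregation happens at ingest in this sketch, the same stored rows serve both the sum and the max query, which is the kind of flexible metric definition the paragraph describes.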
Overall, the article provides practical insights and comparative evaluations to help practitioners select or design a big‑data storage and query architecture that meets the demanding requirements of modern data‑driven applications.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.