Why Druid? Architecture, Indexing, Use Cases, and Lessons Learned
This article introduces Druid as an open‑source, distributed column‑store OLAP engine, explains its architecture and indexing mechanisms, discusses real‑time and batch data ingestion for order analytics at Qunar, compares it with other engines, and shares practical tips and pitfalls.
Qunar's large‑accommodation data platform originally relied on Hive, Postgres, and MySQL, but growing data volume and dimensionality required a faster OLAP engine, leading to the adoption of Druid.
Druid is an open‑source, distributed, column‑store system designed for real‑time analytics. It offers millisecond‑level query latency, flexible filtering, and low‑latency data ingestion. At the time of writing, the latest version was 0.9.1.1, released under the Apache License 2.0.
The Druid cluster follows a share‑nothing architecture composed of five node types—Realtime, Indexer, Broker, Historical, and Coordinator—and depends on three external services: Zookeeper, a relational metadata store, and deep storage (HDFS, local FS, or S3).
Each node type has a specific role: Coordinators manage segment allocation; Realtime nodes ingest and aggregate streaming data; Indexer nodes handle batch indexing; Brokers route client queries to the appropriate Historical or Realtime nodes and merge their results; Historical nodes store indexed segments and perform partial aggregations.
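The division of labor between Brokers and data nodes can be sketched as a scatter-gather: each data node aggregates only its own segments, and the Broker merges the partial results. This is a minimal in-memory illustration, not Druid code; the function names and the sample order rows are hypothetical.

```python
from collections import Counter

def node_partial_sum(segment_rows):
    """What a Historical or Realtime node does: aggregate its own rows only."""
    partial = Counter()
    for city, amount in segment_rows:
        partial[city] += amount
    return partial

def broker_merge(partials):
    """What a Broker does: combine partial aggregates into the final answer."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values per key
    return dict(merged)

# Hypothetical data: one segment on a Historical node, one on a Realtime node.
historical_segment = [("Beijing", 100), ("Shanghai", 250)]
realtime_segment = [("Beijing", 80)]

result = broker_merge([
    node_partial_sum(historical_segment),
    node_partial_sum(realtime_segment),
])
print(result)
```

Because sums merge associatively, the Broker never needs the underlying rows, only the per-node partials.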
Druid indexes data by splitting each row into a timestamp, dimensions, and metrics, then applying dictionary encoding, columnar storage, and inverted bitmap indexes to enable fast filtering and high compression.
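The following sketch shows how dictionary encoding and an inverted bitmap index work together on a single dimension column. It is a simplified model that uses Python integers as bitmaps; Druid's real segments use compressed bitmap formats, and the sample rows and helper names here are assumptions for illustration.

```python
# Hypothetical order rows: (timestamp, city dimension, amount metric).
rows = [
    ("2016-08-01T00:00:00Z", "Beijing", 100),
    ("2016-08-01T00:00:00Z", "Shanghai", 250),
    ("2016-08-01T01:00:00Z", "Beijing", 80),
    ("2016-08-01T01:00:00Z", "Sanya", 40),
]

def build_dimension_column(values):
    """Dictionary-encode a column and build one bitmap per distinct value."""
    # Dictionary: each distinct string maps to a small integer id.
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    # Column data: the ids, stored contiguously instead of the strings.
    encoded = [dictionary[v] for v in values]
    # Inverted index: bit i of a value's bitmap is set if row i has that value.
    bitmaps = {v: 0 for v in dictionary}
    for row_id, value in enumerate(values):
        bitmaps[value] |= 1 << row_id
    return dictionary, encoded, bitmaps

def matching_rows(bitmap):
    """Return the row ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

cities = [r[1] for r in rows]
amounts = [r[2] for r in rows]
dictionary, encoded, bitmaps = build_dimension_column(cities)

# Filter "city = Beijing OR city = Sanya" becomes a bitwise OR of two bitmaps.
selected = bitmaps["Beijing"] | bitmaps["Sanya"]
total = sum(amounts[i] for i in matching_rows(selected))
print(total)
```

Boolean filters reduce to bitwise operations on bitmaps, so only the matching rows of the metric column are read.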
In practice, Qunar uses Druid for real‑time multidimensional order analysis, combining streaming ingestion via Kafka with periodic offline re‑indexing to keep three months of data up to date.
Challenges encountered include the lack of raw detail data, a proprietary JSON‑based query DSL in place of SQL, and the need for external tools for visualization. Caravel (an Airbnb‑developed visualization platform, later renamed Superset) addresses these by providing a drag‑and‑drop UI and integrating Presto for detailed queries.
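To give a sense of the query DSL, here is a sketch that builds a native Druid timeseries query as a Python dictionary and serializes it to JSON. The query shape (queryType, dataSource, granularity, intervals, filter, aggregations) follows Druid's native query format; the data source "orders" and the column names are hypothetical and not taken from Qunar's setup.

```python
import json

def build_timeseries_query(data_source, interval, city):
    """Build a native Druid timeseries query filtered on one city."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,        # hypothetical data source name
        "granularity": "day",
        "intervals": [interval],          # ISO-8601 interval, start/end
        "filter": {"type": "selector", "dimension": "city", "value": city},
        "aggregations": [
            # Column names below are assumptions for illustration.
            {"type": "longSum", "name": "order_count", "fieldName": "count"},
            {"type": "doubleSum", "name": "amount", "fieldName": "amount_sum"},
        ],
    }

query = build_timeseries_query("orders", "2016-08-01/2016-09-01", "Beijing")
body = json.dumps(query, indent=2)
print(body)
```

A body like this is typically sent as an HTTP POST to a Broker's /druid/v2/ endpoint. Writing such JSON by hand for every question is what makes a UI layer attractive.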
The article lists Druid's pros—high availability, horizontal scalability, efficient compression and indexing, real‑time and batch ingestion, and flexible schemas—and cons—no raw detail storage, inability to update imported data without re‑indexing, and the requirement to pre‑define dimensions and metrics.
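The absence of raw detail follows from pre-aggregation at ingestion time (rollup): rows that share a truncated timestamp and the same dimension values are collapsed into one stored row. A minimal sketch, assuming hourly granularity and hypothetical order events:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events: (event time, city dimension, order amount).
events = [
    (datetime(2016, 8, 1, 10, 5), "Beijing", 100),
    (datetime(2016, 8, 1, 10, 40), "Beijing", 80),
    (datetime(2016, 8, 1, 10, 59), "Shanghai", 250),
    (datetime(2016, 8, 1, 11, 2), "Beijing", 60),
]

def rollup_hourly(events):
    """Collapse events sharing (hour, city) into a single aggregated row."""
    table = defaultdict(lambda: {"count": 0, "amount_sum": 0})
    for ts, city, amount in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        row = table[(hour, city)]
        row["count"] += 1
        row["amount_sum"] += amount
    return dict(table)

rolled = rollup_hourly(events)
for key, metrics in sorted(rolled.items()):
    print(key, metrics)
```

Four events become three stored rows, and the two individual Beijing orders in the 10:00 hour can no longer be told apart. This is also why dimensions and metrics must be chosen before ingestion.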
Comparisons are drawn with Elasticsearch (text‑search focus), Spark/Hive/Impala/Presto (full SQL support), and Kylin (OLAP cube vs. bitmap indexing), highlighting Druid's strengths in real‑time aggregation.
Finally, practical pitfalls are shared, such as timezone handling, protobuf limitations, CSV formatting, inappropriate dimension choices, and optimal segment sizing.
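The summary does not spell out the timezone pitfall, but one common form is that daily buckets computed in UTC do not line up with local business days. The sketch below assumes a fixed UTC+8 offset (China observes no daylight saving time) and a hypothetical order timestamp.

```python
from datetime import datetime, timedelta, timezone

# Assumption: business days are defined in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))

def day_bucket(ts, tz):
    """Return the calendar day (ISO string) a timestamp falls on in tz."""
    return ts.astimezone(tz).date().isoformat()

# Hypothetical order placed at 17:30 UTC, which is 01:30 the next day in UTC+8.
order_time = datetime(2016, 8, 1, 17, 30, tzinfo=timezone.utc)

utc_day = day_bucket(order_time, timezone.utc)
local_day = day_bucket(order_time, CST)
print(utc_day, local_day)
```

The same order lands on different days depending on the bucket timezone, so daily totals grouped in UTC will disagree with reports built on local time unless the query or ingestion granularity accounts for the offset.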
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.