Big Data 15 min read

Practical Experience and Optimization of Apache Druid for Real‑Time OLAP at iQIYI

This article describes how iQIYI evaluated various OLAP engines, selected Apache Druid for real‑time analytics, detailed its architecture, identified performance bottlene‑cks in Coordinator, Overlord and indexing, applied configuration and resource‑allocation optimizations, and built a user‑friendly RAP platform to democratize real‑time data analysis.

DataFunTalk
DataFunTalk
DataFunTalk
Practical Experience and Optimization of Apache Druid for Real‑Time OLAP at iQIYI

In recent years, big‑data technologies have become pervasive, but the growing volume and timeliness requirements of business data create challenges for real‑time analysis. iQIYI’s big‑data service team evaluated mainstream OLAP engines and chose Apache Druid as the core solution for its low‑latency, multi‑dimensional query capabilities.

Why Druid? Compared with offline tools (Hive, Impala, Kylin) that suffer from poor freshness, and with other real‑time options (Elasticsearch, OpenTSDB, Spark/Flink, Kudu) that have scalability or latency issues, Druid offers minute‑level query visibility, second‑level query latency, flexible dimensions, instant index configuration changes, and exactly‑once semantics in the newer KIS mode.

Druid Architecture consists of MiddleManager (real‑time indexing), Historical (segment storage and query), Broker (query routing), Overlord (task management) and Coordinator (load balancing). The platform now runs on hundreds of nodes, ingesting billions of events daily with sub‑second query latency.

Performance bottlenecks and fixes :

Coordinator: load‑balancing was O(N²) because each segment’s cost was computed against every historical node. Switching to druid.coordinator.balancer.strategy = CachingCostBalancerStrategy reduced scheduling time by 10,000×.

Overlord: API latency caused task failures due to a small DB connection pool. Increasing the pool with druid.metadata.storage.connector.dbcp.maxTotal = 64 improved API response, though CPU pressure required tuning druidBeam.overlordPollPeriod and eventually moving to KIS mode.

Indexing cost: default index tasks used 3 cores per task, wasting resources for small workloads. Introducing “Tiny” nodes (1 core / 3 GB) for 80 % of low‑volume tasks doubled slot capacity without adding hardware.

RAP (Realtime Analysis Platform) was built to hide Druid’s complexity. It provides wizard‑driven data ingestion, JSON‑free query definition, automatic dashboard generation, sub‑minute end‑to‑end latency, second‑level query response, and seamless dimension changes.

The platform is already deployed in iQIYI’s membership, recommendation and BI systems, supporting thousands of real‑time reports and alerts. Future work includes integrating the Pilot intelligent SQL engine, enhancing operational tooling, adding offline Parquet indexing, and supporting JOINs.

For more technical details, see the referenced iQIYI big‑data real‑time analysis articles.

Performance OptimizationBig Datadata platformReal-time OLAPApache DruidStreaming Analytics
DataFunTalk
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.