Practical Experience and Optimization of Apache Druid for Real‑Time OLAP at iQIYI

This article describes how iQIYI evaluated various OLAP engines and selected Apache Druid for real‑time analytics. It covers Druid's architecture, the performance bottlenecks found in the Coordinator, the Overlord and indexing tasks, the configuration and resource‑allocation optimizations that addressed them, and the user‑friendly RAP platform built to democratize real‑time data analysis.

DataFunTalk

Jun 14, 2020

In recent years, big‑data technologies have become pervasive, but the growing volume and timeliness requirements of business data create challenges for real‑time analysis. iQIYI’s big‑data service team evaluated mainstream OLAP engines and chose Apache Druid as the core solution for its low‑latency, multi‑dimensional query capabilities.

Why Druid? Offline tools (Hive, Impala, Kylin) suffer from poor data freshness, and other real‑time options (Elasticsearch, OpenTSDB, Spark/Flink, Kudu) have scalability or latency issues. Druid, by comparison, makes data queryable within minutes of ingestion, answers queries in seconds, supports flexible dimensions, applies index configuration changes immediately, and provides exactly‑once semantics in the newer KIS (Kafka Indexing Service) mode.

Druid's architecture consists of five node types: MiddleManager (real‑time indexing), Historical (segment storage and query), Broker (query routing), Overlord (task management) and Coordinator (segment load balancing). The platform now runs on hundreds of nodes, ingesting billions of events daily with sub‑second query latency.
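
As a rough illustration of the query path, a client sends its query to a Broker, which routes it to the nodes holding the relevant segments. The sketch below builds a request for Druid's SQL HTTP endpoint (/druid/v2/sql). The Broker address, the datasource name play_events and the query itself are hypothetical and are not taken from iQIYI's deployment.

```python
import json
import urllib.request


def build_sql_request(broker_url: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Broker's SQL endpoint."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical datasource and Broker address, for illustration only.
SQL = (
    "SELECT FLOOR(__time TO MINUTE) AS minute, COUNT(*) AS events "
    "FROM play_events "
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR "
    "GROUP BY 1 ORDER BY 1"
)
request = build_sql_request("http://broker.example.com:8082", SQL)

# To run the query against a live cluster:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```

The request is only constructed here, so the snippet runs without a cluster; uncommenting the last lines would send it.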

Performance bottlenecks and fixes:

Coordinator: load balancing was O(N²) because each segment’s cost was computed against every historical node. Switching to the caching variant of the cost strategy (CachingCostBalancerStrategy) with

druid.coordinator.balancer.strategy = cachingCost

reduced scheduling time by a factor of 10,000.
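
Why caching helps can be illustrated with a toy model. The sketch below is not Druid's implementation: it assumes a made‑up cost that decays exponentially with the distance between segment start times, and all names are hypothetical. It shows how sums precomputed once per server let a placement cost be looked up with a binary search instead of a pass over every segment on that server.

```python
import bisect
import math
from itertools import accumulate


def naive_cost(candidate: float, placed: list[float]) -> float:
    """Toy placement cost: touches every segment already on the server."""
    return sum(math.exp(-abs(candidate - s)) for s in placed)


class CachedServerCost:
    """Precomputes prefix sums once so each cost lookup is O(log n)."""

    def __init__(self, placed: list[float]) -> None:
        self.starts = sorted(placed)
        # prefix_pos[i] = sum of exp(s) over the first i start times
        self.prefix_pos = [0.0] + list(accumulate(math.exp(s) for s in self.starts))
        # prefix_neg[i] = sum of exp(-s) over the first i start times
        self.prefix_neg = [0.0] + list(accumulate(math.exp(-s) for s in self.starts))

    def cost(self, candidate: float) -> float:
        i = bisect.bisect_right(self.starts, candidate)
        # Segments at or before the candidate: exp(-(c - s)) = exp(-c) * exp(s)
        left = math.exp(-candidate) * self.prefix_pos[i]
        # Segments after the candidate: exp(-(s - c)) = exp(c) * exp(-s)
        right = math.exp(candidate) * (self.prefix_neg[-1] - self.prefix_neg[i])
        return left + right


# Small, made-up start times (kept small so exp() stays well within float range).
placed_segments = [0.0, 1.0, 2.5, 4.0]
cached = CachedServerCost(placed_segments)
slow = naive_cost(2.0, placed_segments)
fast = cached.cost(2.0)
```

Both functions return the same value for the same inputs; the cached version simply moves the per‑segment work into a one‑time setup step, which is the general idea behind caching in a cost‑based balancer.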

Overlord: a small metadata‑database connection pool made API calls slow, which in turn caused task failures. Increasing the pool with druid.metadata.storage.connector.dbcp.maxTotal = 64 improved API response times. CPU pressure remained, however, which required tuning druidBeam.overlordPollPeriod and eventually moving to KIS mode.

Indexing cost: by default each indexing task used 3 cores, wasting resources on small workloads. Introducing “Tiny” nodes (1 core / 3 GB) for the 80 % of tasks that are low‑volume doubled slot capacity without adding hardware.
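
The "doubled" figure is consistent with simple arithmetic. The sketch below assumes that CPU cores are the limiting resource, that 80 % of tasks move to 1‑core Tiny slots while the remaining 20 % keep 3‑core slots, and that the cluster has a hypothetical 300 cores for indexing. Memory is ignored.

```python
def task_slots(total_cores: float, tiny_share: float,
               tiny_cores: int = 1, default_cores: int = 3) -> float:
    """Tasks that fit in total_cores when tiny_share of them use Tiny slots."""
    avg_cores_per_task = tiny_share * tiny_cores + (1 - tiny_share) * default_cores
    return total_cores / avg_cores_per_task


TOTAL_CORES = 300  # hypothetical cluster size, for illustration only

before = task_slots(TOTAL_CORES, tiny_share=0.0)  # every task takes 3 cores
after = task_slots(TOTAL_CORES, tiny_share=0.8)   # 80 % of tasks take 1 core
gain = after / before
```

With these assumptions the average task needs 0.8 × 1 + 0.2 × 3 = 1.4 cores instead of 3, so the same hardware holds roughly 3 / 1.4 ≈ 2.1 times as many tasks.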

RAP (Realtime Analysis Platform) was built to hide Druid’s complexity. It provides wizard‑driven data ingestion, JSON‑free query definition, automatic dashboard generation, sub‑minute end‑to‑end latency, second‑level query response, and seamless dimension changes.

The platform is already deployed in iQIYI’s membership, recommendation and BI systems, supporting thousands of real‑time reports and alerts. Future work includes integrating the Pilot intelligent SQL engine, enhancing operational tooling, adding offline Parquet indexing, and supporting JOINs.

For more technical details, see the referenced iQIYI big‑data real‑time analysis articles.
