Performance Governance and Optimization of Ctrip's Nova Data Reporting Platform
This article details the performance challenges of Ctrip's Nova data reporting platform and describes a series of governance measures—including multi‑dimensional data caching, materialized view acceleration, query strategy optimization, and SQL quality improvements—that collectively reduced average query latency by over 50% and stabilized the system.
Background
The Nova data reporting platform supports internal analytics, data mining, and visualization at Ctrip, handling tens of thousands of Hive table queries daily over petabyte-scale data. As the number of users grew, query latency rose, the system became less stable, and resource contention increased.
Platform Overview
The query flow consists of Nova (the UI), a Router for SQL routing, and StarRocks and Hive as execution engines.
Figure 1: Nova query chain
Multi‑Dimensional Data Cache
To reuse repeated query results, several caches were introduced:
Data Cache pre-warming: StarRocks caches file blocks locally; hot tables are pre-warmed based on usage patterns.
Figure 3: Pre-warming mechanism
HDFS metadata cache: The Remote File cache lifetime is extended to 6 hours, with metadata freshness checked on the FE side, which reduces repeated metadata requests.
Router Redis cache: The Router caches query results in Redis; consistency is ensured through lineage-based freshness checks.
Download cache: Exported report data is cached so that subsequent downloads do not re-execute the query.
Figure 5: Download cache
These caches reduced data transfer and I/O pressure across the whole query chain.
Figure 6: Cache overview
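The lineage-based freshness check behind the Router result cache can be sketched as follows. This is a minimal illustration, not the platform's implementation: an in-memory dictionary stands in for Redis, and `LineageCache`, `CacheEntry`, and the `last_modified` lookup are hypothetical names. The assumption is that a lineage or metadata service can report when each upstream table was last updated.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CacheEntry:
    result: list          # cached query result rows
    source_tables: tuple  # upstream tables taken from lineage (hypothetical)
    cached_at: float      # timestamp at which the result was stored


class LineageCache:
    """In-memory stand-in for a Redis-backed result cache (illustrative only)."""

    def __init__(self, last_modified):
        # last_modified: callable mapping a table name to its last update
        # timestamp; assumed to be backed by a lineage/metadata service.
        self._last_modified = last_modified
        self._store = {}

    @staticmethod
    def key(sql):
        # Hash the trimmed SQL text; a real router might normalise it further.
        return hashlib.sha256(sql.strip().encode("utf-8")).hexdigest()

    def put(self, sql, result, source_tables, now):
        self._store[self.key(sql)] = CacheEntry(result, tuple(source_tables), now)

    def get(self, sql):
        k = self.key(sql)
        entry = self._store.get(k)
        if entry is None:
            return None
        # The entry is fresh only if no upstream table changed after caching.
        if any(self._last_modified(t) > entry.cached_at for t in entry.source_tables):
            del self._store[k]  # evict the stale result
            return None
        return entry.result
```

A cached result is served only while every table it was computed from is older than the cache entry; once any upstream table is refreshed, the entry is evicted and the query runs again.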
Materialized View (MV) Acceleration
Materialized views (MVs) pre-compute results to cut both I/O and compute costs. Three key aspects were addressed:
View definition: Datasets are classified as static, semi-static, or dynamic. MVs are built on static or transformed static datasets.
View utilization: StarRocks automatically rewrites queries to use MVs via two rules: SPJG (select-project-join-group-by) logical-plan rewrite (based on the paper “Optimizing Queries Using Materialized Views…”) and AST-based text-match rewrite.
View maintenance: MV freshness is monitored through table lineage; expired MVs are refreshed automatically.
MV value is assessed with a cost-benefit model (CPU cost saved) and an evaluation workflow, illustrated in Figure 8.
Figure 8: MV value assessment
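One way to read the "CPU cost saved" model is as CPU saved by rewritten queries minus CPU spent on refreshes. The sketch below works under that assumption; the article does not give the actual formula, and `MVStats` and all of its fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MVStats:
    name: str
    rewritten_queries: int  # queries rewritten to use the MV in the window
    cpu_s_base: float       # avg CPU seconds per query on the base tables
    cpu_s_mv: float         # avg CPU seconds per query when served by the MV
    refreshes: int          # MV refreshes in the same window
    cpu_s_refresh: float    # avg CPU seconds per refresh


def net_cpu_saved(s):
    """Net CPU seconds saved: query savings minus refresh cost (assumed model)."""
    saved = s.rewritten_queries * (s.cpu_s_base - s.cpu_s_mv)
    cost = s.refreshes * s.cpu_s_refresh
    return saved - cost


def rank_by_value(stats):
    """Order MVs from most to least valuable under the assumed model."""
    return sorted(stats, key=net_cpu_saved, reverse=True)
```

Under this reading, an MV with a negative net value costs more to maintain than it saves and would be a candidate for removal or a less frequent refresh.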
Query Strategy Optimization & SQL Quality Governance
Scheduling alignment: “Off-peak” and data-driven scheduling were introduced to smooth load spikes.
Traffic splitting: Real-time reports and offline jobs run on separate StarRocks clusters, and high-cost queries go to a dedicated cluster.
Max-d query remediation: Queries are rewritten in two phases, first fetching max(date) and then applying it as a predicate, which reduces complexity from O(N²) to O(N).
SQL quality improvements: Unnecessary distinct * was removed, table splitting was advocated, and partition limits were enforced.
Slow-query blocking: The platform checks scanned rows, runtime, and memory, and triggers the StarRocks circuit breaker when file or partition counts exceed thresholds.
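The two-phase max(date) rewrite described above can be illustrated with a small, self-contained sketch. SQLite stands in for the real engines here, and the `report` table, its columns, and the helper names are hypothetical; the point is only the shape of the rewrite, in which the main query receives the latest date as a constant instead of a subquery.

```python
import sqlite3


def latest_partition_rows(conn, table, date_col):
    """Two-phase lookup of the rows in the most recent date partition.

    `table` and `date_col` are interpolated into the SQL text, so they must
    come from trusted metadata, never from user input.
    """
    # Phase 1: a cheap aggregate fetches the latest partition value.
    (max_d,) = conn.execute(f"SELECT max({date_col}) FROM {table}").fetchone()
    # Phase 2: the value is bound as a constant predicate, so the main
    # query no longer contains a subquery.
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE {date_col} = ?", (max_d,)
    ).fetchall()
    return max_d, rows


def build_demo_db():
    """Create a tiny in-memory table standing in for a date-partitioned table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE report (id INTEGER, d TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO report VALUES (?, ?, ?)",
        [(1, "2024-01-01", 10.0), (2, "2024-01-02", 20.0), (3, "2024-01-02", 30.0)],
    )
    return conn
```

Calling `latest_partition_rows(build_demo_db(), "report", "d")` returns the same rows as the single-statement form `WHERE d = (SELECT max(d) FROM report)`. On a partitioned engine, a constant predicate is generally easier for the planner to turn into partition pruning, though the gain depends on the engine and table layout.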
Results
Average query latency dropped from ~8 s to ~4 s during peak hours.
90th-percentile latency fell from ~18 s to ~8 s; timed-out queries fell from ~7,000/day to ~1,400/day.
The standard deviation of query times fell from ~25 s to ~14 s, indicating smoother performance.
Figure 10: Response time statistics
The number of long-running queries showed a clear downward trend.
Figure 11: Long-query trend
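For reference, the relative reductions implied by the figures above can be computed directly. The inputs are the approximate ("~") values reported in this section, so the percentages are approximate as well; `pct_reduction` is a helper introduced only for this illustration.

```python
def pct_reduction(before, after):
    """Percentage reduction from `before` to `after`, rounded to one decimal."""
    return round(100.0 * (before - after) / before, 1)


# Approximate before/after values as reported in the Results section.
reported = {
    "average latency (s)": (8, 4),
    "90th-percentile latency (s)": (18, 8),
    "timed-out queries per day": (7000, 1400),
    "standard deviation (s)": (25, 14),
}

reductions = {name: pct_reduction(b, a) for name, (b, a) in reported.items()}
```

This gives roughly 50% for average latency, 55.6% for 90th-percentile latency, 80% for timed-out queries, and 44% for the standard deviation.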
Conclusion
By combining multi-dimensional caching, materialized view acceleration, query-strategy refinements, traffic isolation, and rigorous SQL quality control, the Nova platform achieved a performance improvement of over 50% and enhanced stability. Future work will further automate and integrate these optimizations.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.