In-Depth Analysis of StarRocks Optimizer Architecture and Techniques
This article provides a comprehensive technical overview of StarRocks' query optimizer, covering its cascades/ORCA-inspired architecture, logical and physical plan transformations, cost modeling, statistics derivation, memo structure, task scheduling, and practical examples of join optimization in a distributed OLAP engine.
StarRocks, a modern OLAP engine, implements an efficient and stable planner/optimizer inspired by the Cascades and ORCA frameworks. The optimizer consists of several stages: Analyzer validates catalog information, Rewriter applies logical‑to‑logical transformations using rule‑based tree rewrites, and the Cost‑Based Optimizer (CBO) generates physical plans based on cost estimates.
The overall optimization flow includes Exploration (generating equivalent logical plans via algebraic rules such as join commutativity and associativity), Statistics Derivation (collecting row counts, column cardinalities, histograms, etc., via a bottom‑up traversal), Implementation (converting logical operators to physical operators like HashJoin, SortMergeJoin, or IndexScan), and Optimization (evaluating costs and pruning sub‑trees).
Rewriter performs top‑down pattern matching on binary trees, splitting predicates and pushing them down to improve join ordering.
CBO maintains two key data structures: LowestCostExpressions (mapping required physical properties to the best expression) and LowestCostTable (tracking required properties for child groups).
Property enforcement handles required sort and distribution properties by inserting Enforcer operators (e.g., Shuffle, Broadcast) and accounting for their costs.
StarRocks adopts a stratified search strategy where logical‑to‑logical and logical‑to‑physical transformations are interleaved, allowing early cost‑based pruning and reducing the search space.
The optimizer is driven by a task scheduler with distinct task types: DeriveStatsTask (statistics derivation), OptimizeExpressionTask (rule application), ApplyRuleTask (transformations), and EnforceAndCostTask (cost calculation and property enforcement). These tasks operate on GroupExpression and OptExpression structures stored in a Memo.
Cost modeling for operators such as HashJoin considers CPU cost (output row size), memory cost (size of the build side), and network cost (typically zero for hash joins). Statistics objects contain fields like outputRowCount , columnStatistics (min, max, average row size, distinct values, etc.), which feed into the cost formulas.
Implementation details include classes for statistics collection (CreateAnalyzeJobStmt, AnalyzeStmt, StatisticAutoCollector, StatisticsCalculator) and the overall optimizer pipeline illustrated with diagrams of memo evolution, join optimization examples, and task flow.
Finally, the article references foundational papers (Cascades, ORCA, CMU 15‑721) and provides links to StarRocks source code for deeper exploration.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.