Materialized Views in MaxCompute: Design, Implementation, and Best Practices
This article explains how MaxCompute leverages materialized views as a query accelerator, covering their history, advantages and drawbacks, creation and maintenance details, automatic query rewriting, intelligent recommendation, auto‑materialization, and future enhancements for large‑scale data warehousing.
Materialized views (MVs) pre‑compute and store query results as physical tables, enabling automatic reuse of these results to improve query performance, especially in cloud and big‑data environments. The article introduces MaxCompute, an enterprise‑grade data warehouse, and explains why MVs are essential for reducing redundant computation, meeting latency requirements, and lowering data‑governance costs.
Unlike virtual views, which generate results on demand, materialized views store a physical copy of the result set. Their evolution is traced from early databases without MVs, through the RDBMS view era, to the modern cloud‑native big‑data MV stage, where scalability, real‑time updates, automation, and cost‑efficiency are key.
The design of MVs in MaxCompute includes four main points: (1) MaxCompute and materialized views, (2) functional design and concrete implementation, (3) intelligent recommendation and auto‑materialization, and (4) summary and outlook.
Creating MVs supports ordinary, partitioned, clustered, and penetrative views, using DDL clauses such as PARTITIONED BY (col_name) and CLUSTERED BY (col_name) SORTED BY (col_name [ASC|DESC]) INTO number_of_buckets BUCKETS . Two critical steps are building an index from the base table to the MV and storing the base table’s data version in MV metadata; partitioned MVs must keep partition columns consistent with the base table for incremental updates.
Maintenance relies on table properties to control automatic refresh, interval, and partition‑level incremental refresh, for example: "enable_auto_substitute"="true" "enable_auto_refresh"="true" "refresh_interval_minutes"="120" "only_refresh_max_pt"="true"
Automatic query rewriting uses a four‑step process: locate relevant MVs via the index, perform hit pre‑judgment to filter irrelevant MVs, register candidate MVs with the optimizer, and finally validate and execute the rewrite based on Calcite rules, handling SELECT order, missing columns, and computable expressions.
Example original query: SELECT ds, count(uid) as pv FROM users WHERE ds>='20230501' AND ds<='20230507' GROUP BY ds; Corresponding MV definition: SELECT ds, count(uid) as pv FROM users WHERE ds>='20230401' GROUP BY ds; Rewritten query using the MV: SELECT ds, pv FROM mv WHERE ds>='20230501' AND ds<='20230507';
Automatic penetration distinguishes materialized from non‑materialized partitions, converting unmaterialized partitions to virtual views and unioning results, thus saving storage while extending query coverage.
Intelligent recommendation analyzes heavy, long‑running, frequently used SQLs, extracts common expressions, and suggests MVs based on full match, table overlap count, and MV size. Auto‑materialization introduces delayed materialization, quota limits, and feedback‑driven iteration, allowing users to generate and view MVs via the console.
The summary highlights MaxCompute’s comprehensive MV capabilities—partitioned, clustered, query rewrite, delayed materialization, penetration, automatic creation, and refresh—compared with Hive, Spark, Snowflake, RedShift, and BigQuery. In production, over 5,000 MVs have optimized more than 20,000 queries, reducing average resource consumption by over 20% and achieving up to 80% savings for some workloads.
Future work includes finer‑grained incremental updates, event‑driven refresh, extending external table support (OSS, HDFS), and expanding rewrite operators to cover LEFT/RIGHT OUTER JOIN, GROUP BY, HAVING, ORDER BY, LIMIT, and aggregate functions.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.