Implementing Data Cubes in Hive Using WITH CUBE, GROUPING SETS, and WITH ROLLUP
This article demonstrates how to build multi‑dimensional data cubes on JD's big‑data platform using Hive, comparing UNION ALL with the more concise WITH CUBE, GROUPING SETS, and WITH ROLLUP functions, and discusses practical pitfalls and optimization tips.
Based on JD Group's big‑data platform, the article explains how to implement data cubes with Hive, covering general multi‑dimensional batch analysis techniques, performance tuning, and production deployment.
It first describes the data‑cube concept, noting that a naïve UNION ALL approach can lead to redundant code, while Hive's high‑order functions WITH CUBE , GROUPING SETS , and WITH ROLLUP achieve the same result with simpler, maintainable queries.
Sample data is prepared by creating a table and inserting example rows:
CREATE TABLE tmp.tmp_hivecube_test (
`province` string COMMENT '省份',
`city` string COMMENT '城市',
`population` int COMMENT '人口数量'
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS ORC tblproperties (
'orc.compress' = 'SNAPPY'
);
INSERT INTO tmp.tmp_hivecube_test SELECT '北京市','大兴区',260 UNION ALL SELECT '北京市','通州区',300;Using UNION ALL, the data cube is built by manually enumerating all dimension combinations:
SELECT province, city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY province, city UNION ALL
SELECT province, NULL AS city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY province UNION ALL
SELECT NULL AS province, city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY city UNION ALL
SELECT NULL AS province, NULL AS city, SUM(population) FROM tmp.tmp_hivecube_test;The same result can be obtained with a single WITH CUBE statement:
SELECT province, city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY province, city WITH CUBE;When the required dimensions are known, GROUPING SETS allows flexible configuration of specific aggregates:
SELECT province, city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY province, city GROUPING SETS ((province), (province, city));For hierarchical aggregations, WITH ROLLUP generates cumulative totals from the most detailed level up to the grand total:
SELECT province, city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY province, city WITH ROLLUP;The article also highlights a common pitfall: Hive's built‑in grouping_id defaults to 0, causing filters like grouping_id IN (1,3) to return no rows. Disabling predicate push‑down with SET hive.optimize.ppd = false; resolves the issue.
In summary, the author recommends using WITH CUBE with grouping_id for uncertain dimension combinations, GROUPING SETS when dimensions are fixed to save storage, and WITH ROLLUP for clear hierarchical aggregations.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.