
Implementing Data Cost Governance: Quantifying Storage and Compute Expenses with Hive, Spark, and HDFS FsImage

This article explains how to perform task‑level data cost governance: collect storage and compute metrics from Hive tables, Spark jobs, and HDFS FsImage files, then estimate monthly expenses from replication factors and resource‑usage rates. Practical SQL and shell examples are included.

政采云技术

Data cost governance is a critical component of data platform construction, helping enterprises reduce expenses and achieve sustainable efficiency improvements.

The approach is task‑oriented: for each business task we collect the associated storage size, CPU time, and estimated monthly cost, linking tasks to output tables in the idata platform.

Storage Statistics

Most data‑warehouse jobs run on Spark‑on‑Hive, with a few on StarRocks. Storage statistics require gathering table and partition metadata such as last access/modify time, business ID, and estimated monthly storage fees.

Method 1: ANALYZE TABLE – Run the ANALYZE TABLE command (e.g., ANALYZE TABLE ods.ods_order COMPUTE STATISTICS) to populate totalSize, numRows, and rawDataSize in the table parameters. This method scans large amounts of data and can degrade query performance, so it is only suitable for low‑load periods and small numbers of tables.

Method 2: Metastore SQL Queries – Directly query Hive metastore tables (TBLS, PARTITIONS, SDS, PARTITION_PARAMS, TABLE_PARAMS, etc.) to obtain row counts, sizes, and file numbers. In practice some metadata is unreliable (e.g., zero row counts for tables that actually contain data) and recent access timestamps are unavailable.
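As a sketch of Method 2, the snippet below runs the metadata join against an in-memory SQLite mock of the metastore schema (real deployments back the metastore with MySQL or PostgreSQL; the DBS/TBLS/TABLE_PARAMS names and columns follow the standard Hive metastore DDL, and the sample values are illustrative):

```python
import sqlite3

# Minimal mock of the Hive metastore schema; table and column names
# follow the standard metastore DDL, data is illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE DBS (DB_ID INTEGER, NAME TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER, DB_ID INTEGER, TBL_NAME TEXT);
CREATE TABLE TABLE_PARAMS (TBL_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT);
INSERT INTO DBS VALUES (1, 'ods');
INSERT INTO TBLS VALUES (10, 1, 'ods_order');
INSERT INTO TABLE_PARAMS VALUES (10, 'totalSize', '1073741824');
INSERT INTO TABLE_PARAMS VALUES (10, 'numRows', '0');  -- unreliable: table is not empty
""")

# Join databases, tables, and parameters to pull size and row-count statistics.
cur.execute("""
SELECT d.NAME, t.TBL_NAME,
       MAX(CASE WHEN p.PARAM_KEY = 'totalSize' THEN p.PARAM_VALUE END) AS total_size,
       MAX(CASE WHEN p.PARAM_KEY = 'numRows'   THEN p.PARAM_VALUE END) AS num_rows
FROM TBLS t
JOIN DBS d ON d.DB_ID = t.DB_ID
JOIN TABLE_PARAMS p ON p.TBL_ID = t.TBL_ID
GROUP BY d.NAME, t.TBL_NAME
""")
rows = cur.fetchall()
print(rows)
```

The numRows value of 0 in the sample mirrors the reliability problem noted above: statistics stored in TABLE_PARAMS are only as fresh as the last stats collection.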

Method 3: FsImage Parsing – Parse the HDFS NameNode FsImage, which stores complete filesystem metadata without affecting service performance. Convert the binary FsImage to XML using hdfs oiv -p XML , then process the XML with Spark‑xml.

Implementation Steps

1. Convert FsImage to XML: hdfs oiv -p XML -i /opt/hdfs_parse/fsimage_0* -o /opt/hdfs_parse/fsimage.xml

2. Upload the XML to HDFS and schedule periodic updates via a cron script (example shown below).

#!/bin/bash
# Fetch the latest FsImage
hdfs dfsadmin -fetchImage /opt/hdfs_parse
# Determine the newest fsimage file (zero-padded transaction IDs sort lexicographically)
command_result=$(ls /opt/hdfs_parse/ | grep fsimage_0 | sort | tail -n 1)
echo "command result is $command_result"
# Convert to XML
hdfs oiv -p XML -i /opt/hdfs_parse/$command_result -o /opt/hdfs_parse/fsimage.xml
# Upload to HDFS
hdfs dfs -put -f /opt/hdfs_parse/fsimage.xml /files/fsimage
rm -f /opt/hdfs_parse/$command_result

3. Use Spark to parse the XML and populate analytical tables. Example DDLs:

create table `ods`.`ods_fsimage_hdfs_file` (
  `atime` BIGINT comment 'last access time',
  `blocks` STRING comment 'raw block info',
  `blocknum` INT comment 'number of blocks',
  `filesize` BIGINT comment 'file size',
  `dsquota` BIGINT comment 'directory space quota',
  `id` BIGINT comment 'file unique id',
  `pid` BIGINT comment 'parent id',
  `mtime` BIGINT comment 'last modification time',
  `name` STRING comment 'file name',
  `nsquota` BIGINT comment 'namespace quota',
  `permission` STRING comment 'permissions',
  `preferredblocksize` BIGINT comment 'recommended block size',
  `replication` BIGINT comment 'replication factor',
  `storagepolicyid` BIGINT comment 'storage policy',
  `type` STRING comment 'FILE/DIRECTORY',
  `xattrs` STRING comment 'extended attributes',
  `path` STRING comment 'full file path',
  `pathlevel` INT comment 'path depth',
  `hivedbname` STRING comment 'Hive database',
  `hivetablename` STRING comment 'Hive table',
  `hivepartitionname` STRING comment 'Hive partition',
  `execute_time` TIMESTAMP comment 'job execution time'
) comment 'FsImage analysis table' stored as ORC;

Additional tables store row counts for ORC files, Hive metadata, and aggregated warehouse and data‑service layer information (see source for full DDL).
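At scale the XML is processed with Spark‑xml as described above; as a minimal standard‑library sketch of the same extraction, the snippet below pulls the per‑file fields from a tiny excerpt shaped like hdfs oiv -p XML output (element names as produced by oiv; the sample values are illustrative):

```python
import xml.etree.ElementTree as ET

# Tiny excerpt shaped like `hdfs oiv -p XML` output (fields abridged, values illustrative).
sample = """
<fsimage>
  <INodeSection>
    <inode><id>16392</id><type>FILE</type><name>part-00000</name>
      <replication>3</replication><mtime>1690000000000</mtime><atime>1690100000000</atime>
      <blocks><block><id>1073741825</id><numBytes>134217728</numBytes></block></blocks>
    </inode>
    <inode><id>16385</id><type>DIRECTORY</type><name>warehouse</name></inode>
  </INodeSection>
</fsimage>
"""

records = []
for inode in ET.fromstring(sample).iter("inode"):
    if inode.findtext("type") != "FILE":
        continue  # directories carry no block data
    blocks = inode.findall("./blocks/block")
    records.append({
        "id": int(inode.findtext("id")),
        "name": inode.findtext("name"),
        "replication": int(inode.findtext("replication")),
        "blocknum": len(blocks),
        "filesize": sum(int(b.findtext("numBytes")) for b in blocks),
    })
print(records)
```

Summing numBytes across a file's blocks yields the filesize column of the table above, and replication is carried along so storage costs can account for physical copies.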

Storage Cost Estimation

The monthly storage cost for a task is calculated as:

Task_Monthly_Storage_Cost = Table_Size_GB * Replication_Factor * 0.45 (CNY per GB·month)
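The formula translates directly into code; the function below is a minimal sketch using the 0.45 CNY per GB·month rate and HDFS's default 3‑way replication as stated above:

```python
def monthly_storage_cost_cny(table_size_gb, replication_factor=3,
                             rate_cny_per_gb_month=0.45):
    """Estimated monthly storage fee: logical table size x HDFS replicas x unit price."""
    return table_size_gb * replication_factor * rate_cny_per_gb_month

# A 100 GB table stored with the default 3-way replication:
print(monthly_storage_cost_cny(100))  # 135.0 CNY per month
```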

Compute Statistics

Two approaches exist: model‑based estimates per job type, or direct retrieval from the YARN ResourceManager REST API. The article adopts the latter, extracting memorySeconds, vcoreSeconds, and elapsedTime for each application.

Task‑level identification relies on a naming convention (e.g., "SparkSQL-p-5371") that encodes task type and ID.
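The two steps can be sketched together: parse the application list returned by the ResourceManager and match the naming convention to recover task IDs. The payload below is a hardcoded sample shaped like the response of GET /ws/v1/cluster/apps (fields abridged, values and the second app name illustrative); in practice it would be fetched over HTTP from the ResourceManager:

```python
import json
import re

# Sample shaped like the ResourceManager REST response from
# GET http://<rm-host>:8088/ws/v1/cluster/apps (abridged; values illustrative).
payload = json.loads("""
{"apps": {"app": [
  {"id": "application_1700000000000_0001", "name": "SparkSQL-p-5371",
   "memorySeconds": 884736000, "vcoreSeconds": 432000, "elapsedTime": 360000},
  {"id": "application_1700000000000_0002", "name": "adhoc-query",
   "memorySeconds": 1024, "vcoreSeconds": 1, "elapsedTime": 10}
]}}
""")

# Naming convention like "SparkSQL-p-5371": job type, environment, task ID.
task_pattern = re.compile(r"^(?P<job_type>\w+)-(?P<env>\w+)-(?P<task_id>\d+)$")

usage = []
for app in payload["apps"]["app"]:
    m = task_pattern.match(app["name"])
    if not m:
        continue  # skip applications that don't follow the task naming convention
    usage.append((m.group("task_id"), app["memorySeconds"], app["vcoreSeconds"]))
print(usage)
```

Applications whose names do not follow the convention (the second entry here) are filtered out, so resource usage is attributed only to registered platform tasks.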

Compute Cost Estimation

Monthly compute cost is the sum of memory and CPU costs:

Memory_Cost = memorySeconds / 1024 / 60 / 60 / 24 / 30 * 30 (CNY per GB·month)

CPU_Cost = vcoreSeconds / 60 / 60 / 24 / 30 * 155 (CNY per core·month)

Here memorySeconds (MB·seconds) and vcoreSeconds (core·seconds) are accumulated over the month; dividing by 1024 converts MB to GB, and dividing by the seconds in a 30‑day month converts the totals to GB·months and core·months before applying the unit prices.
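As a minimal sketch, interpreting memorySeconds as MB·seconds and vcoreSeconds as core·seconds accumulated over the month, and converting to GB·months and core·months before applying the stated unit prices of 30 CNY per GB·month and 155 CNY per core·month:

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 30-day month

def monthly_compute_cost_cny(memory_seconds, vcore_seconds,
                             mem_rate_cny=30.0, cpu_rate_cny=155.0):
    """memory_seconds is YARN memorySeconds (MB*s) and vcore_seconds is
    vcoreSeconds (core*s), both accumulated over one month."""
    memory_cost = memory_seconds / 1024 / SECONDS_PER_MONTH * mem_rate_cny
    cpu_cost = vcore_seconds / SECONDS_PER_MONTH * cpu_rate_cny
    return memory_cost + cpu_cost

# Sanity check: a job holding 4 GB and 2 vcores for the whole month should cost
# 4 * 30 + 2 * 155 = 430 CNY.
print(monthly_compute_cost_cny(4 * 1024 * SECONDS_PER_MONTH, 2 * SECONDS_PER_MONTH))
```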

Conclusion

The data‑warehouse team uses these metrics to govern tasks, quantify governance impact, and drive cost‑effective operations.

The article ends with a call to like the post and follow the "政采云技术" public account, plus a recruitment invitation for the Hangzhou‑based Zero team (frontend, backend, testing, UED); contact email [email protected] .

Big Data · Hive · YARN · HDFS · Spark · Data Cost Governance
Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
