Data Cost Quantification, Billing, and Optimization in a Data Platform
The data‑platform team introduced a self‑sustaining cost‑reduction framework that quantifies CPU, memory, and disk expenses using price‑per‑resource formulas, applies time‑weighted billing, generates multi‑level reports, and drives optimization through six actionable “swords” and incentive‑based operations, achieving roughly 17% offline‑cluster savings within six months.
1. Introduction
The article describes a data middle‑platform team’s effort to control rapidly growing compute and storage costs. After six months of business growth, resource consumption doubled, prompting a systematic approach to cost awareness and reduction.
2. Overall Approach
The team proposes a long‑term, self‑sustaining cost‑reduction mechanism with six key requirements: cost quantification, waste perception, ease of reduction, traceable processes, incentive mechanisms, and operational governance.
3. Cost Quantification
Cost is modeled as resource_price * resource_consumption. The main resources are CPU, memory, and disk. The unit price of each resource is calculated using the following formulas:
cpu_price = total_cost * cpu_ratio / (total_cpu * load_factor)
memory_price = total_cost * memory_ratio / (total_memory * load_factor)
disk_price = total_cost * disk_ratio / (total_disk * load_factor)
Variables:
total_cost: total hardware investment for the data platform.
total_cpu, total_memory, total_disk: aggregate hardware capacity.
cpu_ratio, memory_ratio, disk_ratio: cost‑share ratios derived from market scarcity.
load_factor: effective utilization factor (e.g., 0.8).
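The unit‑price formulas above can be sketched in a few lines of Python. All the concrete numbers below (hardware totals, cost‑share ratios) are hypothetical illustrations, not figures from the article:

```python
def unit_prices(total_cost, total_cpu, total_memory, total_disk,
                cpu_ratio, memory_ratio, disk_ratio, load_factor=0.8):
    """Per-unit prices for CPU, memory, and disk, per the Section 3 formulas."""
    return {
        "cpu": total_cost * cpu_ratio / (total_cpu * load_factor),
        "memory": total_cost * memory_ratio / (total_memory * load_factor),
        "disk": total_cost * disk_ratio / (total_disk * load_factor),
    }

# Hypothetical example: a 1,000,000-unit investment split 50/30/20
# across CPU, memory, and disk by market scarcity.
prices = unit_prices(
    total_cost=1_000_000,
    total_cpu=10_000,       # cores
    total_memory=40_000,    # GB
    total_disk=2_000_000,   # GB
    cpu_ratio=0.5, memory_ratio=0.3, disk_ratio=0.2,
)
```

Dividing by load_factor inflates the unit price so that the cost of idle capacity is carried by the resources actually consumed.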
3.1 Resource Consumption
Three categories of consumption are considered:
Disk storage (including replication, e.g., HDFS 3‑copy).
CPU and memory usage, measured in cpu_seconds and memory_seconds.
Time of execution, enabling time‑weighted billing.
Formulas for consumption:
disk = data_size * replicator
cpu = cpu_seconds * (1 + loss_factor)
memory = memory_seconds * (1 + loss_factor)
loss_factor accounts for resource allocation overhead (0 for YARN‑based collection, >0 for Spark Thrift Server).
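The three consumption formulas can be sketched as follows; the input figures are hypothetical:

```python
def consumption(data_size, replicator, cpu_seconds, memory_seconds,
                loss_factor=0.0):
    """Resource consumption per the Section 3.1 formulas.

    loss_factor is 0 for YARN-based collection; >0 for Spark Thrift Server.
    """
    return {
        "disk": data_size * replicator,              # e.g., HDFS 3-copy
        "cpu": cpu_seconds * (1 + loss_factor),
        "memory": memory_seconds * (1 + loss_factor),
    }

# Hypothetical job: 100 GB of output with 3x replication,
# one core-hour of CPU and four GB-hours of memory.
usage = consumption(data_size=100, replicator=3,
                    cpu_seconds=3_600, memory_seconds=14_400)
```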
3.2 Time‑Weighted Billing
The cluster load is divided into three time slots carrying roughly 60%, 30%, and 10% of total load, with billing weights (w1, w2, w3) satisfying 0.6*w1 + 0.3*w2 + 0.1*w3 = 1 (so total billing stays unchanged) and w1 > w2 > w3. Example weights: w1=1.2 (golden), w2=0.8 (silver), w3=0.4 (bronze). With cs1, cs2, cs3 denoting the CPU‑seconds a job consumes in each slot (cs1 + cs2 + cs3 = cpu_seconds), the weighted CPU factor is:
cpu_weight = (cs1*w1 + cs2*w2 + cs3*w3) / cpu_seconds
Analogous weighting is applied to memory.
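A minimal sketch of the weighting, using the example weights from the text; the per‑slot CPU‑second figures are hypothetical:

```python
def cpu_weight(cs1, cs2, cs3, w1=1.2, w2=0.8, w3=0.4):
    """Time-weighted CPU factor; cs1..cs3 are CPU-seconds per time slot."""
    cpu_seconds = cs1 + cs2 + cs3
    return (cs1 * w1 + cs2 * w2 + cs3 * w3) / cpu_seconds

# A job running mostly in the golden slot is billed above 1.0,
# nudging owners toward off-peak scheduling.
w = cpu_weight(cs1=8_000, cs2=1_500, cs3=500)
```

A job whose usage matches the cluster's own 60/30/10 load split gets a weight of exactly 1.0, which is what makes the scheme revenue‑neutral overall.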
3.3 Data Cost Calculation
Final cost components are:
cpu_cost = cpu_price * (cpu * cpu_weight)
memory_cost = memory_price * (memory * memory_weight)
disk_cost = disk_price * disk
Additional considerations include cost allocation among multiple output tables, ownership attribution, and handling of ad‑hoc queries.
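Putting the pieces together, the final cost formulas of Section 3.3 can be sketched as below. All prices, weights, and consumption figures are hypothetical and carried over from the earlier illustrations:

```python
def data_cost(cpu_price, memory_price, disk_price,
              cpu, memory, disk, cpu_weight, memory_weight):
    """Final cost components per the Section 3.3 formulas."""
    return {
        "cpu_cost": cpu_price * cpu * cpu_weight,
        "memory_cost": memory_price * memory * memory_weight,
        "disk_cost": disk_price * disk,  # storage is not time-weighted
    }

bill = data_cost(cpu_price=62.5, memory_price=9.375, disk_price=0.125,
                 cpu=3_600, memory=14_400, disk=300,
                 cpu_weight=1.1, memory_weight=1.0)
total = sum(bill.values())
```

Only CPU and memory carry the time weight; disk is billed flat because storage is occupied around the clock regardless of when the job ran.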
4. Cost Billing
Billing reports are generated at global, department, and individual levels, covering total cost overview, trend analysis, high‑cost/top‑time tables, savings summary, and value metrics (usage count, business impact).
5. Cost Optimization (Six “Swords”)
Decommission unused data (offline).
Delay start of non‑critical jobs to off‑peak periods.
Reduce job frequency (e.g., hourly to daily).
Replace legacy or duplicate pipelines with more efficient alternatives.
Task‑level tuning (e.g., Hive skew, SQL rewrite).
Merge small files (use Hive’s file‑merge strategy for Spark jobs).
Additional tactics include leveraging Hive cubes, providing registration for manual cost‑saving actions, and building dashboards for monitoring.
6. Cost‑Reduction Operations
The operation loop follows four principles: promotion, “nudge”, feedback, and incentives. Activities include displaying cost metrics on dashboards, regular reminders in meetings, sending personalized cost statements, targeting high‑cost owners, organizing weekly optimization sprints, collecting user feedback, and rewarding top savers with internal tokens.
7. Summary & Outlook
After six months, 40 participants performed 660 cost‑saving actions, reducing offline cluster spend by ~17% (over 20% of savings were self‑initiated). Future work will focus on finer‑grained operation, extending cost governance beyond offline clusters, attributing cost to business lines, and building a data‑value assessment framework.
Youzan Coder
Official Youzan tech channel, delivering technical insights and updates from the Youzan tech team.