Automated Offline Data Cost Optimization in Youzan's Data Platform
Youzan built an automated offline data cost‑optimization platform that gathers accurate metadata, mines unused or failing tables and tasks, and safely decommissions them through a backend‑frontend workflow with owner validation, notifications, and rollback safeguards. Planned work extends lineage coverage and brings real‑time assets into scope.
1. Introduction Youzan has been exploring data cost governance and has published related articles. This document focuses on cost optimization for offline data and tasks.
2. Background While many teams have contributed to early cost‑saving efforts, the ROI of cost governance declines in later stages because only a few large‑scale data assets remain, and the remaining small tables and tasks require excessive manual effort.
3. Current Situation Youzan has quantified the cost of each data asset and task and provides preliminary offline (decommissioning) suggestions, e.g., no downstream tasks, unused for N days, or long‑term failures. However, actual offline actions still rely on owners, and two main pain points remain: the recommendation data has low accuracy, and there is no support for batch operations.
4. Solution Design The solution consists of a backend and a frontend for automated offlining. The backend reads pending assets from RDS, interacts with Hive and the Data Platform (DP) to delete or pause resources, stores configurations in Apollo, and sends notifications via Feishu; the frontend provides a UI for user interaction. The overall workflow is illustrated in the accompanying diagram.
5. System Implementation
5.1 Data Preparation Collect complete metadata, ownership, lineage, and audit logs from Hive, HBase, DP, BI, etc., ensuring high accuracy. Key steps include extracting metadata, parsing lineage, gathering audit logs, and aggregating DP task dependencies and BI dashboard usage.
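As a rough illustration of the aggregation step, the sketch below joins lineage edges and audit-log events into a per-table usage summary. The function name, the input shapes (pairs of table names, pairs of table name and access date), and the output fields are all assumptions made for this example; the article does not describe Youzan's actual schema.

```python
from collections import defaultdict
from datetime import date


def summarize_usage(tables, lineage_edges, audit_events):
    """Build a per-table summary of downstream tables and last access date.

    tables:        iterable of table names (hypothetical input shape)
    lineage_edges: iterable of (upstream_table, downstream_table) pairs
    audit_events:  iterable of (table, access_date) pairs from audit logs
    """
    downstream = defaultdict(set)
    for upstream, child in lineage_edges:
        downstream[upstream].add(child)

    # Keep only the most recent access date per table.
    last_access = {}
    for table, day in audit_events:
        if table not in last_access or day > last_access[table]:
            last_access[table] = day

    return {
        t: {
            "downstream": sorted(downstream.get(t, ())),
            "last_access": last_access.get(t),  # None if never accessed
        }
        for t in tables
    }
```

A table whose summary shows an empty `downstream` list and a `last_access` of `None` (or a date older than N days) would match the suggestion criteria described earlier.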
5.2 Offline Mining Identify candidates for decommissioning based on criteria such as long‑term task failures, tables without downstream usage, and tables with downstream tasks but no access. Candidates move through four pools in three stages (candidate pool → filter pool → pre‑offline pool → offline pool), with filters for dependencies, recent temporary usage, multi‑table outputs, and recent creation.
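The filtering step between pools can be sketched as a chain of rejection rules. This is a minimal sketch under stated assumptions: the `TableCandidate` fields, the `filter_candidates` function, and the 30-day default for "recent creation" are hypothetical and chosen only to mirror the four filters named above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class TableCandidate:
    # All fields are hypothetical; they mirror the filters named in the text.
    name: str
    created: date
    has_dependency: bool           # something still depends on this table
    recent_temporary_usage: bool   # ad hoc queries seen recently
    multi_table_output: bool       # produced by a task that writes several tables


def filter_candidates(candidates, today, min_age_days=30):
    """Split candidates into (kept, rejected), where rejected maps name -> reason."""
    kept, rejected = [], {}
    for c in candidates:
        if c.has_dependency:
            rejected[c.name] = "dependency"
        elif c.recent_temporary_usage:
            rejected[c.name] = "recent temporary usage"
        elif c.multi_table_output:
            rejected[c.name] = "multi-table output"
        elif today - c.created < timedelta(days=min_age_days):
            rejected[c.name] = "recently created"
        else:
            kept.append(c)
    return kept, rejected
```

Recording a reason for every rejected candidate keeps the pipeline auditable, which helps when owners ask why a table was or was not proposed.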
5.3 Automated Offline The process covers pre‑notification, decision making (whitelist checks and owner validation), execution of offline actions (dropping tables, pausing tasks), and post‑execution notifications. Controls limit the daily offline volume and provide rollback windows.
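The decision step and the daily volume control might look like the sketch below. It assumes that owner validation yields a set of approved assets and that the daily limit is a simple count; the function name and all parameters are hypothetical, and the real system may decide differently.

```python
def plan_daily_offline(pre_offline, whitelist, owner_approved, daily_limit):
    """Choose which assets to take offline today.

    pre_offline:    ordered list of asset names in the pre-offline pool
    whitelist:      set of assets that must never be taken offline
    owner_approved: set of assets that passed owner validation (assumed semantics)
    daily_limit:    maximum number of assets to take offline per day
    Returns (selected, deferred); whitelisted assets appear in neither list.
    """
    selected, deferred = [], []
    for asset in pre_offline:
        if asset in whitelist:
            continue  # protected asset, dropped from consideration
        if asset in owner_approved and len(selected) < daily_limit:
            selected.append(asset)
        else:
            deferred.append(asset)  # not yet validated, or over today's cap
    return selected, deferred
```

Deferred assets simply stay in the pool for the next run, so the cap spreads risk across days instead of discarding work.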
5.4 Protection Measures To avoid accidental data loss, backups are kept for a buffer period after an asset goes offline. The platform also monitors lineage coverage, SQL parsing success rate, and other health metrics; an abnormal value (e.g., audit log success < 99%) halts the pipeline.
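A halt check of this kind can be sketched as a comparison of observed metrics against minimum thresholds. Only the 99% audit log figure comes from the text; the metric names, the other threshold values, and the choice to treat a missing metric as a breach are assumptions for illustration.

```python
# Minimum acceptable values. Only audit_log_success (0.99) is from the article;
# the other two thresholds are hypothetical placeholders.
THRESHOLDS = {
    "audit_log_success": 0.99,
    "lineage_coverage": 0.95,
    "sql_parse_success": 0.95,
}


def find_breaches(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics that are missing or below their threshold."""
    breaches = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            breaches.append(name)
    return breaches


def should_halt(metrics, thresholds=THRESHOLDS):
    """The pipeline halts if any monitored metric is in breach."""
    return bool(find_breaches(metrics, thresholds))
```

Treating a missing metric as a breach is the conservative choice here: if the monitoring data itself is unavailable, the sketch stops the pipeline before anything is dropped.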
6. Future Planning Expand lineage collection, extend automated offline to real‑time assets (Kafka, ES, HBase), and roll out the solution to other environments such as financial cloud.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.