
White‑Box Cost Governance in Big Data: Engine, Data Warehouse, and Tool Optimizations

This article presents a year-long white-box cost-governance practice for big-data platforms, covering the data-governance framework, engine auto-tuning (HBO), compression-algorithm replacement, operator analysis, data-warehouse white-boxing, duplicate-computation reduction, link-depth minimization, automation of routine governance, benefit analysis, and future plans.

DataFunSummit
01 Data Governance System

Like most companies, Kuaishou’s data governance is divided into four parts: cost, quality, efficiency, and security.

1. Efficiency

Efficiency includes data‑development efficiency and data‑consumption efficiency. Development efficiency focuses on model‑development speed, while consumption efficiency focuses on model usability and query response time.

2. Security

Security is split into production‑stage security and consumption‑stage security.

3. Quality

Quality is divided into prevention, proactive detection, fault impact, and fault post‑mortem.

Prevention: Ensure compliance with standards during design, development, testing, and acceptance.

Proactive detection: Detect issues internally before users report them, requiring comprehensive monitoring and effective alerts.

Fault impact: Monitor fault counts at each level to keep them within acceptable ranges.

Fault post‑mortem: Conduct deep analysis to find root causes and ensure remedial actions are taken promptly.

4. Cost

Data cost consists of storage cost, compute cost, and traffic cost.

Storage cost: Focus on compression ratio, compression performance, and replica count to improve storage efficiency.

Compute cost: CPU utilization reflects resource-scheduling ability; processing volume per compute unit (CU) reflects engine compute power.

Traffic cost: (Details omitted in source.)

The cost section emphasizes three white‑box practices: engine white‑boxing, data‑warehouse white‑boxing, and tool white‑boxing.

02 Engine White‑Boxing

Engine white‑boxing is a project that includes many optimization points such as HBO auto‑tuning, compression algorithm replacement, and engine‑operator analysis.

1. HBO Auto‑Tuning

Before HBO, tuning was manual: difficult to do well, easily invalidated as data changed, and costly to maintain. HBO automatically analyzes job history and optimizes execution parameters, keeping jobs near their optimum.

HBO improves performance and reduces cost through three mechanisms:

Reasonable resource quota: Identify CPU and memory needs and auto‑scale.

Optimized task sharding: Adjust shard parameters based on job duration.

Task‑level parameter optimization: Tune small‑file merging, compression algorithm, broadcast, etc.

The tuning workflow consists of four steps:

Build profile – collect dozens of decision metrics.

Coarse‑tune – apply rules for an initial conservative adjustment.

Parameter release – push tuned parameters to jobs for the next cycle.

Fine‑tune – refine based on feedback.
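The coarse-tune step can be pictured as a small rule engine over the collected profile. The sketch below is illustrative only: the metric and parameter names (`executor.memory.mb`, `split.size.factor`, etc.) are hypothetical stand-ins, not Kuaishou's actual HBO parameters.

```python
from dataclasses import dataclass

@dataclass
class JobProfile:
    """Decision metrics collected from a job's recent runs (simplified)."""
    peak_memory_mb: int       # observed peak executor memory
    requested_memory_mb: int  # currently configured memory
    avg_cpu_util: float       # 0.0-1.0 average CPU utilization
    avg_task_seconds: float   # average shard (task) duration

def coarse_tune(profile: JobProfile) -> dict:
    """Rule-based first pass: conservative adjustments only.

    Returns parameter overrides to release for the next run cycle.
    """
    overrides = {}

    # Resource quota: shrink memory toward observed peak, keeping 30% headroom.
    target_mb = int(profile.peak_memory_mb * 1.3)
    if target_mb < profile.requested_memory_mb:
        overrides["executor.memory.mb"] = target_mb

    # Task sharding: very short tasks mean scheduling overhead dominates,
    # so merge shards; very long tasks suggest shards are too large.
    if profile.avg_task_seconds < 10:
        overrides["split.size.factor"] = 2.0
    elif profile.avg_task_seconds > 600:
        overrides["split.size.factor"] = 0.5

    # CPU quota: persistently low utilization means cores can be reduced.
    if profile.avg_cpu_util < 0.3:
        overrides["executor.cores.scale"] = 0.5

    return overrides

# Example: an over-provisioned job with tiny tasks gets all three adjustments.
print(coarse_tune(JobProfile(4096, 16384, 0.2, 5)))
```

The fine-tune step would then adjust these overrides based on feedback from the next cycle's profile rather than rules alone.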

2. Compression Algorithm Replacement

Kuaishou’s lake‑warehouse storage uses Parquet + GZIP. With PB‑scale new data, EB‑scale existing data, and a read‑to‑write ratio above 20:1, decompression performance is critical.

Industry peers have moved from GZIP to ZSTD, which offers a compression ratio comparable to zlib with decompression performance close to Snappy. Tests showed ZSTD improves compression ratio by 3%‑12% (optimal at level ≤ 12) with no stability or compatibility issues.
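Evaluating a codec replacement like this comes down to measuring ratio and (de)compression time on representative payloads. ZSTD bindings are not in the Python standard library, so the harness below uses zlib at two levels purely to illustrate the measurement methodology; a real evaluation would swap in a ZSTD binding and actual Parquet page payloads.

```python
import time
import zlib

def evaluate_codec(name, compress, decompress, data):
    """Measure compression ratio and (de)compression time for one codec."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    unpacked = decompress(packed)
    t2 = time.perf_counter()
    assert unpacked == data, "codec must round-trip losslessly"
    return {
        "codec": name,
        "ratio": len(data) / len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }

# Synthetic sample; in practice, use real Parquet page payloads.
sample = b"user_id,event,ts\n" * 50_000

results = [
    evaluate_codec("zlib-1", lambda d: zlib.compress(d, 1), zlib.decompress, sample),
    evaluate_codec("zlib-9", lambda d: zlib.compress(d, 9), zlib.decompress, sample),
]
for r in results:
    print(r)
```

With a read-to-write ratio above 20:1, the `decompress_s` column should be weighted far more heavily than `compress_s` when choosing a codec and level.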

3. Engine Operator Analysis

Operator analysis inspects the Spark engine in depth from multiple perspectives (execution process, physical operators, UDFs), using four data sources: QueryPlan, StackTrace, EventLog, and GcLog.

Key findings from execution‑process analysis:

Data scan consumes >30% of time.

Data exchange accounts for ~20%.

Data aggregation ~15%.

UDF calls ~14%.

Further analysis reveals heavy JSON processing, high‑cost user UDFs, and opportunities for operator merging.
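Of the four data sources, the EventLog is the most mechanical to mine: Spark writes it as newline-delimited JSON, so a single pass can attribute executor run time to stages. The sketch below uses synthetic lines in the `SparkListenerTaskEnd` shape; real event logs carry many more fields, and field names should be checked against the Spark version in use.

```python
import json
from collections import defaultdict

# Synthetic event-log lines in the SparkListenerTaskEnd shape
# (real logs are newline-delimited JSON written by the driver).
event_log = """\
{"Event": "SparkListenerTaskEnd", "Stage ID": 0, "Task Metrics": {"Executor Run Time": 1200}}
{"Event": "SparkListenerTaskEnd", "Stage ID": 0, "Task Metrics": {"Executor Run Time": 800}}
{"Event": "SparkListenerTaskEnd", "Stage ID": 1, "Task Metrics": {"Executor Run Time": 3000}}
{"Event": "SparkListenerJobEnd", "Job ID": 0}
"""

def run_time_by_stage(lines):
    """Sum executor run time (ms) per stage from task-end events."""
    totals = defaultdict(int)
    for line in lines.splitlines():
        event = json.loads(line)
        if event.get("Event") == "SparkListenerTaskEnd":
            totals[event["Stage ID"]] += event["Task Metrics"]["Executor Run Time"]
    return dict(totals)

print(run_time_by_stage(event_log))  # {0: 2000, 1: 3000}
```

Joining these per-stage totals back to the QueryPlan is what lets the breakdown above (scan >30%, exchange ~20%, and so on) be attributed to specific physical operators.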

03 Data‑Warehouse White‑Boxing

1. Warehouse Architecture Metrics

Metrics include completeness (model coverage), reuse (reference coefficient, duplicate computation, link depth), and standardization (compliance with modeling standards).

2. Reducing Duplicate Computation

Identify similar operators by extracting SQL, generating execution plans, computing operator signatures via AST traversal, detecting signature collisions, estimating cost, and merging high-cost duplicates.

Internal data shows 43% duplicate aggregation operators, 23% duplicate join operators, and 4.5% duplicate INSERT operators.
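The signing step can be sketched as a recursive hash over each plan subtree, so that two jobs computing the same thing collide regardless of argument order. This is a minimal illustration with a made-up plan-node shape; a production implementation would also normalize aliases, literals, and commutative expressions before hashing.

```python
import hashlib

def signature(node):
    """Hash an operator subtree into a stable signature.

    `node` is (operator_name, normalized_args, children). Subtrees with the
    same structure and normalized arguments collide by design, which is how
    duplicate computation is detected across jobs.
    """
    name, args, children = node
    child_sigs = "".join(sorted(signature(c) for c in children))
    payload = f"{name}|{','.join(sorted(args))}|{child_sigs}"
    return hashlib.sha256(payload.encode()).hexdigest()

# Two jobs scanning and aggregating the same table the same way...
scan = ("Scan", ["db.events"], [])
agg_a = ("Aggregate", ["count(1)", "group:uid"], [scan])
agg_b = ("Aggregate", ["group:uid", "count(1)"], [("Scan", ["db.events"], [])])
# ...produce colliding signatures, flagging a merge candidate.
assert signature(agg_a) == signature(agg_b)

# Scanning a different table breaks the match.
agg_c = ("Aggregate", ["count(1)", "group:uid"], [("Scan", ["db.clicks"], [])])
assert signature(agg_a) != signature(agg_c)
```

Collisions are then ranked by estimated cost, so only high-cost duplicates (like the 43% of aggregation operators above) are worth materializing into a shared upstream table.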

3. Lowering Link Depth

Production pipelines can reach 39 layers with many cross‑layer dependencies, causing latency, high cost, and quality issues. Short‑term solution: machine‑assisted governance using operator‑level lineage and remediation suggestions. Long‑term solution: decouple logical and physical layers, letting machines auto‑generate physical models from logical designs.
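Link depth itself is just the longest path through the table-dependency DAG, which operator-level lineage makes computable. A minimal sketch, with hypothetical table names and a deliberately small graph:

```python
from functools import lru_cache

def link_depth(edges):
    """Longest path length (in layers) through a table-dependency DAG.

    `edges` maps each table to its upstream dependencies. A source table
    has depth 1; each downstream hop adds a layer.
    """
    @lru_cache(maxsize=None)
    def depth(table):
        upstreams = edges.get(table, [])
        return 1 + max((depth(u) for u in upstreams), default=0)

    return max(depth(t) for t in edges)

# ods -> dwd -> dws -> ads, plus a cross-layer shortcut ods -> ads.
deps = {
    "dwd.orders": ["ods.orders"],
    "dws.orders_daily": ["dwd.orders"],
    "ads.report": ["dws.orders_daily", "ods.orders"],  # cross-layer dependency
}
print(link_depth(deps))  # prints 4
```

Running this over real lineage surfaces both the 39-layer chains and the cross-layer dependencies (like `ads.report` reading `ods.orders` directly) that remediation suggestions would target.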

4. Routine Governance Automation

Automation follows a five‑step method: define standards, identify problems, quality inspection, governance preview, and fast rollback. All actions must be reversible.
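The reversibility requirement implies every automated action records enough prior state to undo itself. A minimal sketch of that pattern, with a hypothetical in-memory `catalog` standing in for real warehouse metadata APIs:

```python
import copy

class GovernanceAction:
    """A governance step that can preview, apply, and roll back.

    Hypothetical sketch: `catalog` is an in-memory dict; the real system
    would call warehouse metadata APIs instead.
    """
    def __init__(self, catalog):
        self.catalog = catalog
        self._undo_log = []

    def preview(self, table):
        """Dry run: report what would change without touching state."""
        return f"would set {table} lifecycle to 'archived'"

    def apply(self, table):
        """Apply the change, saving prior state for rollback."""
        self._undo_log.append((table, copy.deepcopy(self.catalog[table])))
        self.catalog[table]["lifecycle"] = "archived"

    def rollback(self):
        """Restore every applied change, newest first."""
        while self._undo_log:
            table, prior = self._undo_log.pop()
            self.catalog[table] = prior

catalog = {"ods.logs": {"lifecycle": "active"}}
action = GovernanceAction(catalog)
print(action.preview("ods.logs"))
action.apply("ods.logs")
action.rollback()
assert catalog["ods.logs"]["lifecycle"] == "active"
```

The preview step maps to the "governance preview" stage of the five-step method, and the undo log is what makes the final "fast rollback" step cheap.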

04 Benefit Analysis

Results: storage compression ratio ↑5%, compute resource efficiency ↑16%, job runtime ↓14%, with additional reductions in failure rate, GC time, and OOM incidents.

05 Future Plans

Further storage optimization (dynamic compression, encoding).

Advance data‑warehouse architecture (model design, production).

Deepen engine white‑boxing.

Explore next‑generation technologies for breakthrough efficiency and cost.

Thank you for your attention.

Tags: Big Data, Cost Optimization, Data Warehouse, Data Governance, Engine Tuning, White-Boxing
Written by DataFunSummit, the official account of the DataFun community, dedicated to sharing big-data and AI industry summit news and speaker talks, with regular downloadable resource packs.
