Claude Code Harness: Turning Data‑Warehouse AI Coding from Ad‑hoc Queries to Rule‑Driven Automation
The article analyzes the shortcomings of current AI‑assisted data‑warehouse development—context forgetting, unstable rule enforcement, and token‑heavy operations—and presents a five‑layer Harness architecture (persistent CLAUDE.md, Auto Memory, deterministic hooks, subagents, and SKILL refactoring) that systematically resolves these issues, boosts reliability, and embeds AI into the development pipeline.
Background and Pain Points
In the offline data‑warehouse teams, Claude Code is the primary AI coding tool, delivering noticeable efficiency gains. However, three structural pain points emerge during real use:
Context forgetting: When the conversation reaches the token limit (≈95% of the window), Claude’s auto‑compact compresses history, discarding temporary constraints such as "amount field unit is thousand" and causing SQL errors with a 1,000‑fold data discrepancy.
Unstable rule enforcement: Manual or prompt‑based memory of coding standards (OneData naming, INSERT must contain PARTITION, etc.) drops to 60‑70% compliance under tight schedules.
Context explosion: Large‑scale tasks (blood‑line queries, 23‑item self‑tests, data comparisons) quickly fill the context, triggering compact and leading to further “memory loss”.
The core contradiction is that more complex requirements need more AI assistance, yet they also increase the risk of context overflow.
What is Harness?
Harness is the host runtime framework of Claude Code, providing a layered responsibility model that moves deterministic operations out of the LLM’s inference loop. It consists of:
Persistent CLAUDE.md: Stores iteration‑level constraints and specifications on disk; re‑injected after each session start or compact.
Auto Memory: Writes cross‑session discoveries (e.g., field‑unit reminders) to ~/.claude/projects/<project>/memory/MEMORY.md for automatic re‑loading.
Hooks: Configured in .claude/settings.json to run deterministic checks (SQL linting, dangerous DDL blocking) without relying on Claude’s memory.
Subagents: Independent agents (sql‑validator, dw‑explorer, data‑quality‑checker) that execute high‑token operations in isolated contexts.
SKILL refactor: Reduces main‑context consumption by moving the full SKILL file execution into a subagent and exposing only the final artefacts.
Five‑Layer Defense Architecture
Layer 1 – Write into CLAUDE.md (Immediate Use)
Project root contains .claude/CLAUDE.md. Each iteration writes current status, version, and iteration‑specific constraints (e.g., prohibited table modifications, partition field format). This file is the most reliable persistence point because it is re‑read from disk after every compact.
Layer 2 – Auto Memory (Running)
Claude automatically appends key observations to MEMORY.md. Example prompts that trigger memory writes: "the amount field unit is thousand", "field_a may be null in this scenario", "V1.0 introduces a change to field_b". These entries survive across sessions and are re‑injected after compact.
Layer 3 – Hooks for Automatic Validation (Core Defense)
Hooks are defined in .claude/settings.json and the .claude/hooks/ directory. Two critical hooks are:
{
"hooks": {
"PostToolUse": [{
"matcher": "Write|Edit",
"hooks": [{
"type": "command",
"command": "$CLAUDE_PROJECT_DIR/.claude/hooks/validate_sql.sh",
"timeout": 60,
"statusMessage": "Checking SQL standards..."
}]
}],
"PreToolUse": [{
"matcher": "Bash",
"hooks": [{
"type": "command",
"command": "$CLAUDE_PROJECT_DIR/.claude/hooks/block_dangerous_ddl.sh"
}]
}]
}
}Sample validate_sql.sh (SQL linting):
#!/bin/bash
INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('tool_input',{}).get('file_path',''))" 2>/dev/null)
[[ "$FILE_PATH" != *.sql ]] && exit 0
[[ -z "$FILE_PATH" ]] && exit 0
SQL=$(cat "$FILE_PATH" 2>/dev/null)
[[ -z "$SQL" ]] && exit 0
ERRORS=()
# Rule 1: Disallow SELECT *
echo "$SQL" | grep -iqE 'SELECT\s+\*' && ERRORS+=("CRITICAL: SELECT * found, explicit columns required")
# Rule 2: INSERT must contain PARTITION
if echo "$SQL" | grep -iqE 'INSERT\s+(INTO|OVERWRITE)'; then
echo "$SQL" | grep -iqE 'PARTITION\s*\(' || ERRORS+=("CRITICAL: INSERT missing PARTITION clause")
fi
# Rule 3: Avoid DOUBLE for amount fields
echo "$SQL" | grep -iqE '\bDOUBLE\b' && ERRORS+=("WARNING: Use DECIMAL(20,4) instead of DOUBLE for amount")
# Rule 4: UPDATE/DELETE need WHERE
if echo "$SQL" | grep -iqE '\b(UPDATE|DELETE)\b'; then
echo "$SQL" | grep -iqE '\bWHERE\b' || ERRORS+=("CRITICAL: UPDATE/DELETE without WHERE")
fi
if [ ${#ERRORS[@]} -gt 0 ]; then
echo "=== SQL VALIDATION FAILED: $FILE_PATH ===" >&2
for err in "${ERRORS[@]}"; do echo " $err" >&2; done
exit 2
fi
echo "SQL validation passed: $(basename $FILE_PATH)" >&2
exit 0Sample block_dangerous_ddl.sh (prevent DROP/TRUNCATE on production tables):
#!/bin/bash
INPUT=$(cat)
CMD=$(echo "$INPUT" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('tool_input',{}).get('command',''))" 2>/dev/null)
if echo "$CMD" | grep -iqE '\b(DROP\s+TABLE|TRUNCATE\s+TABLE)\b'; then
if ! echo "$CMD" | grep -qiE '(_dev|_test|_stg)\b'; then
echo "BLOCKED: Production DROP/TRUNCATE detected" >&2
exit 2
fi
fi
exit 0Layer 4 – Subagents for Context Isolation
Three agents are defined under .claude/agents/:
---
name: sql-validator
description: ODPS/MaxCompute SQL syntax and rule validator. Runs in an isolated context.
tools: Read, Bash, Grep, Glob
model: haiku
permissionMode: dontAsk
---
You are a data‑warehouse SQL expert. Validate only, do not modify files.
Validation items (priority):
1. DISALLOW SELECT *
2. INSERT must contain PARTITION
3. Use snake_case for fields
4. Use DECIMAL for amount, not DOUBLE
5. JOIN must have ON condition
6. Detect Cartesian product risk
Output format:
- status: PASS/FAIL
- problem list (CRITICAL/WARNING/INFO)
- line‑specific suggestions
(Max 50 lines) ---
name: dw-explorer
description: Data‑warehouse structure explorer. Reads DDL, field info, and one‑level lineage without modifying files.
tools: Read, Glob, Grep, Bash
model: haiku
permissionMode: dontAsk
---
When invoked, read the specified table’s DDL, partition strategy, and upstream/downstream lineage (one level). Return a concise summary containing table basics, core field definitions, and lineage list.Layer 5 – SKILL File Refactor (Reduce Main‑Context Load)
SKILL files (≈10 KB) previously loaded entirely into the main context, accelerating compact. The refactor moves the full SKILL execution into a subagent and replaces the file with path‑scoped rule files, e.g.:
---
# .claude/rules/etl-rules.md
paths:
- "**/*insert*.sql"
- "**/*_di.sql"
- "**/*_df.sql"
---
# ETL development rules (auto‑loaded per path)
- Must have partition_dt partition
- INSERT OVERWRITE must check partition existence
- Disallow cross‑database JOINThe main conversation now only receives the final artefact paths (e.g., ddl_table_a.sql) and a PASS/FAIL summary.
Practical Deployment Steps
Project‑level context persistence: Create .claude/CLAUDE.md with current iteration status, version, and iteration‑specific constraints. Update “正在开发” and “本次迭代约束” sections at each sprint; clear constraints after release.
Configure hooks: Add .claude/settings.json and the hooks/ scripts shown above. PostToolUse runs validate_sql.sh after every .sql write; PreToolUse runs block_dangerous_ddl.sh before any DDL command.
Create subagents: Add sql-validator.md, dw-explorer.md, and data-quality-checker.md (similar structure). These handle self‑test result aggregation, lineage queries, and performance checks in isolated contexts.
SKILL invocation redesign: Replace direct SKILL calls with subagent commands, e.g. @"sql-validator (agent)" validate path/to/insert.sql, ensuring the main context only receives concise outcomes.
Comparison with Traditional Data‑Warehouse AI Development
Traditional workflow relies on Claude’s memory for both semantic understanding and rule enforcement, leading to frequent compact‑triggered forgetting and inconsistent compliance. Harness separates concerns: deterministic hooks guarantee rule adherence, subagents protect the main context from token‑heavy operations, and persistent files ensure cross‑session stability. The result is higher accuracy, lower rework, and a measurable reduction (≈50‑70%) in compact frequency.
Outcome and Benefits
Semantic‑plus‑rule accuracy improves dramatically; field‑unit errors drop from frequent to near‑zero.
Requirement‑understanding rework decreases from ~50% to ~10% due to explicit CLAUDE.md constraints and Stop‑hook verification.
SQL‑standard compliance rises from 70‑80% (memory‑based) to >95% (hook‑enforced).
Context overflow is mitigated, cutting compact triggers by half to two‑thirds.
Conclusion
Harness does not aim to make Claude smarter; it embeds Claude’s semantic strengths into a reliable pipeline while delegating deterministic tasks to hooks, subagents, and persistent files. This four‑layer (plus SKILL) division transforms data‑warehouse AI coding from an ad‑hoc conversational aid into a rule‑driven, production‑grade automation framework.
Key Diagrams
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DeWu Technology
A platform for sharing and discussing tech knowledge, guiding you toward the cloud of technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
