How Claude Code Prompt Caching Cuts AI Costs by Up to 90% and Boosts Efficiency

Prompt Caching in Anthropic's Claude Code replaces repeated processing of identical prompt prefixes with a prefix‑hash cache, cutting input‑token costs by up to 90%, reducing first‑token latency by up to 79%, and improving throughput, while leaving model output the same as it would be without the cache.

Architect's Guide

May 28, 2026

1. Introduction: Why Prompt Caching Matters

In large language model (LLM) workflows, each API call recomputes the entire prompt even when 90% of the content is unchanged, leading to massive compute waste and latency. Claude Code sessions typically include system prompts, tool definitions, project context (CLAUDE.md), conversation history, and the current user input.

Prompt Caching solves this by allowing developers to cache the stable prefix of a prompt and only process the changing part on subsequent calls.

Cost reduction: cache hits are billed at only 10% of the base input‑token price.

TTFT reduction: up to 79% lower time to first token (TTFT).

Throughput increase: tokens read from the cache generally do not count toward input‑token rate limits.

2. Core Concepts

2.1 Definition

Prompt Caching is an Anthropic API feature that reuses previously computed intermediate states when the prompt prefix matches exactly. The system hashes the prefix up to a cache‑control breakpoint and stores the forward‑pass results.

2.2 Caching Modes

Automatic Caching : add a top‑level cache_control field; the API automatically places a breakpoint at the last cacheable block and moves it forward as the conversation grows.

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": "You are a professional code‑review assistant...",
  "messages": [{"role": "user", "content": "..."}],
  "cache_control": {"type": "ephemeral"}
}

Explicit Cache Breakpoints : manually insert cache_control on specific content blocks to gain fine‑grained control.

{
  "system": [
    {"type": "text", "text": "Base instructions...", "cache_control": {"type": "ephemeral"}}
  ],
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Doc A...", "cache_control": {"type": "ephemeral"}}]},
    {"role": "assistant", "content": "..."}
  ]
}
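As a rough illustration of explicit breakpoints, the Python sketch below builds a payload of the same shape as a plain dictionary. The helper names `with_breakpoint` and `build_request` and the sample strings are assumptions made for this example; nothing is sent over the network.

```python
import json

def with_breakpoint(text, ttl=None):
    """Return a text content block that carries a cache_control breakpoint."""
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl  # e.g. "1h" for the extended TTL
    return {"type": "text", "text": text, "cache_control": cache_control}

def build_request(system_text, doc_text, question):
    """Assemble a payload with two explicit breakpoints (hypothetical helper)."""
    return {
        "system": [with_breakpoint(system_text)],
        "messages": [
            {"role": "user", "content": [
                with_breakpoint(doc_text),           # stable document, cached
                {"type": "text", "text": question},  # volatile part, not marked
            ]},
        ],
    }

payload = build_request("Base instructions...", "Doc A...", "Summarize the doc.")
print(json.dumps(payload, indent=2))
```

Keeping the volatile question in its own unmarked block means a new question does not change anything before the second breakpoint.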

2.3 Key Constraints

Exact match required : the cached prefix must be 100% identical (including images).

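Output unchanged : caching affects only how the prompt is processed; responses are the same as they would be without the cache.

Zero Data Retention (ZDR) : prompt caching is compatible with ZDR arrangements; cache entries persist only for their TTL and are not retained afterwards.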

Workspace isolation (since Feb 2026): caches are isolated per workspace, not organization‑wide.

3. Engineering Architecture Deep Dive

3.1 Prefix Hash Matching – Core Mechanism

The cache key is the hash of the prompt prefix up to a breakpoint. Three principles govern its behavior:

Write only at breakpoints : when a cache_control marker is placed, the system hashes everything from the start of the prompt to that point and stores the intermediate state.

Read by backward lookup : on a new request the system computes the hash at the current breakpoint; if no entry matches, it walks back up to 20 previous blocks looking for a cached prefix.

Look‑back window limit : the backward search is capped at 20 content blocks per breakpoint.

Example cache key computation:

Prompt structure: [Tool Defs] → [System Msg] → [Msg 1] → [Msg 2] → [Msg 3]
Breakpoint:       at the end of [Msg 1]
Cache Key = Hash([Tool Defs] + [System Msg] + [Msg 1])
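The key computation above can be sketched in Python. The actual hash function and serialization used by the API are not stated here, so SHA‑256 over a canonical JSON dump is purely an assumption, as is the function name `prefix_key`.

```python
import hashlib
import json

def prefix_key(blocks, breakpoint_index):
    """Hash every block from the start of the prompt through the breakpoint."""
    prefix = blocks[: breakpoint_index + 1]
    canonical = json.dumps(prefix, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()  # assumed hash

blocks = ["tool defs", "system msg", "msg 1", "msg 2", "msg 3"]
key = prefix_key(blocks, 2)

# Content after the breakpoint does not affect the key.
same = prefix_key(["tool defs", "system msg", "msg 1", "other", "tail"], 2)
# Any change before the breakpoint produces a different key.
changed = prefix_key(["tool defs v2", "system msg", "msg 1"], 2)

print(key == same, key == changed)  # True False
```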

3.2 Cache Layering and Invalidation Cascade

Caches are layered as Tools → System → Messages. Each later layer's cache key covers all earlier layers, so changing an earlier layer invalidates every layer after it.

Tool definition change → invalidates Tool, System, and Message caches.

System prompt change → invalidates System and Message caches.

Message change → invalidates Message cache only.
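The cascade rules above amount to "the changed layer and everything after it". A minimal sketch, with the layer names and the function `invalidated_layers` chosen for illustration:

```python
LAYERS = ["tools", "system", "messages"]  # prefix order, earliest first

def invalidated_layers(changed_layer):
    """Return the changed layer plus every layer that follows it."""
    start = LAYERS.index(changed_layer)
    return LAYERS[start:]

for layer in LAYERS:
    print(layer, "->", invalidated_layers(layer))
```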

3.3 Look‑back Window in Practice

When the number of new content blocks between rounds exceeds 20, earlier cache entries fall outside the window and miss. Adding extra breakpoints (up to four per request) mitigates this.
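To make the window concrete, here is a toy model of the backward lookup. It assumes the cache can be represented as a set of prefix lengths (in blocks) and that the 20 checked positions include the breakpoint itself; both are simplifications for this sketch.

```python
LOOKBACK_WINDOW = 20  # figure stated in the text

def find_cached_prefix(cached_lengths, breakpoint_len, window=LOOKBACK_WINDOW):
    """Walk backward from the breakpoint; return the longest cached prefix
    length inside the window, or None on a miss."""
    lowest = max(breakpoint_len - window + 1, 1)
    for length in range(breakpoint_len, lowest - 1, -1):
        if length in cached_lengths:
            return length
    return None

cached = {10}  # a 10-block prefix was written by an earlier request

print(find_cached_prefix(cached, 25))  # 10
print(find_cached_prefix(cached, 35))  # None

# An extra breakpoint at block 20 brings the old entry back into range.
hits = [find_cached_prefix(cached, bp) for bp in (20, 35)]
print(hits)  # [10, None]
```

In this model the second breakpoint does not rescue the lookup from block 35, but it gives the request another position from which the earlier entry is still reachable.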

3.4 Storage and Isolation Model

Workspace isolation : caches are never shared across organizations and, since Feb 2026, are scoped to a single workspace within an organization.

Exact hash matching : no fuzzy matching.

TTL : default 5 minutes (cache writes billed at 1.25× the base input price), optional 1 hour (cache writes billed at 2×).

Automatic refresh : each hit within the TTL resets the entry's lifetime at no extra cost.
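The refresh‑on‑hit behavior can be modelled with a small in‑memory cache. This is a toy sketch: the class `TtlCache` and its explicit `now` arguments are assumptions that keep the example deterministic, and it says nothing about how the service stores entries.

```python
class TtlCache:
    """Toy prefix cache whose entries expire unless they are hit in time."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.expires_at = {}  # cache key -> expiry timestamp (seconds)

    def write(self, key, now):
        self.expires_at[key] = now + self.ttl

    def read(self, key, now):
        """Return True on a hit and push the expiry forward by one TTL."""
        expiry = self.expires_at.get(key)
        if expiry is None or now >= expiry:
            self.expires_at.pop(key, None)  # drop the expired entry
            return False
        self.expires_at[key] = now + self.ttl
        return True

cache = TtlCache(ttl_seconds=300)
cache.write("prefix", now=0)
print(cache.read("prefix", now=200))  # True, expiry moves to 500
print(cache.read("prefix", now=450))  # True, alive only because of the refresh
print(cache.read("prefix", now=800))  # False, over 300 s since the last hit
```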

3.5 Cacheable vs. Non‑cacheable Content

Cacheable: tool definitions, system messages, text messages, images, tool calls/results.

Non‑cacheable: Thinking blocks (cannot be explicitly marked), empty text blocks, sub‑content (e.g., citations) unless the top‑level block is cached.

4. Cache Breakpoints and TTL Mechanics

4.1 Breakpoint Limits

Each request can have up to four breakpoints. Automatic mode consumes one; the remaining three can be placed explicitly.

4.2 TTL Dual‑Layer Mechanism

Two TTL options:

5‑minute (default) : write cost 1.25× base input price; suited for high‑frequency interactions.

1‑hour (extended) : write cost 2× base input price; suited for low‑frequency or long‑gap interactions.

Configuration example:

{
  "cache_control": {"type": "ephemeral", "ttl": "1h"}
}

4.3 Mixed TTL Billing Model

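When 1‑hour and 5‑minute breakpoints appear in the same request, the longer TTL must come first in the prompt, and billing is split at three positions. A is the token count at the highest cache hit (0 if nothing hits), B is the token count at the last 1‑hour breakpoint after A (equal to A if there is none), and C is the token count at the last breakpoint. Tokens up to A are charged as cache reads at 0.1× the input price, the (B − A) tokens as 1‑hour writes at 2×, and the (C − B) tokens as 5‑minute writes at 1.25×.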
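A small sketch of the split billing, assuming A is the token count at the highest cache hit, B the token count at the last 1‑hour breakpoint after A, and C the token count at the last breakpoint. The function name, the example token counts, and the base price of $3 per million input tokens are illustrative assumptions.

```python
def mixed_ttl_cost(a, b, c, total_input, base_price_per_mtok):
    """Price one request's input tokens from positions A, B and C.

    a: tokens at the highest cache hit (0 if nothing hit)
    b: tokens at the last 1-hour breakpoint after A (equal to a if none)
    c: tokens at the last breakpoint of any TTL
    """
    per_token = base_price_per_mtok / 1_000_000
    read = a * 0.1 * per_token                # cache read
    write_1h = (b - a) * 2.0 * per_token      # 1-hour cache write
    write_5m = (c - b) * 1.25 * per_token     # 5-minute cache write
    uncached = (total_input - c) * per_token  # tokens after the last breakpoint
    return read + write_1h + write_5m + uncached

cost = mixed_ttl_cost(a=10_000, b=14_000, c=15_000,
                      total_input=15_500, base_price_per_mtok=3.0)
print(f"${cost:.5f}")
```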

4.4 Minimum Cacheable Length

Models require a minimum token count for caching (e.g., Claude Opus 4.6 needs ≥4096 tokens, Claude Sonnet 4.6 needs ≥2048 tokens). Prompts below this threshold silently bypass caching.
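Because short prompts bypass caching without an error, a pre‑flight check plus a look at the usage fields can catch the problem. The sketch below uses only the two thresholds quoted above; the function names are hypothetical, and the token count is assumed to come from whatever tokenizer or counting endpoint you already use.

```python
# Minimums as quoted in the text; other models may differ.
MIN_CACHEABLE_TOKENS = {
    "Claude Opus 4.6": 4096,
    "Claude Sonnet 4.6": 2048,
}

def is_cacheable(model, prefix_tokens):
    """True if the prefix is long enough to be cached for this model."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model]

def caching_took_effect(usage):
    """True if the response reports any cache write or cache read."""
    written = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    return (written + read) > 0

print(is_cacheable("Claude Sonnet 4.6", 1500))  # False
print(is_cacheable("Claude Opus 4.6", 4096))    # True
print(caching_took_effect({"input_tokens": 1500}))  # False
```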

5. Cost Model and Benchmark Analysis

5.1 Example Cost Calculation

Scenario: a developer runs 50 Claude Sonnet 4.6 rounds per day. Stable context = 15 000 tokens, user input = 500 tokens, model output = 2 000 tokens.

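Without caching, the total cost is ≈ $3.83 per day. With a 95% cache hit rate it is ≈ $1.85, a 51.7% saving; the saving stays well below 90% because caching discounts only input tokens, not the 2 000 output tokens generated each round. In extreme cases (100 K‑token documents) savings can reach 90%.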
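The arithmetic can be reproduced under stated assumptions. The scenario does not list prices, so the sketch assumes $3 per million input tokens and $15 per million output tokens, a 5‑minute‑TTL write on the first round, and cache reads on the remaining 49 rounds. It also ignores growing conversation history and any cache expiry between rounds.

```python
INPUT_PER_MTOK = 3.00    # assumed base input price, USD
OUTPUT_PER_MTOK = 15.00  # assumed output price, USD

def daily_cost(rounds, stable, user, output, cached):
    """Daily cost for a fixed stable prefix plus fresh input and output."""
    per_in = INPUT_PER_MTOK / 1_000_000
    per_out = OUTPUT_PER_MTOK / 1_000_000
    out_cost = rounds * output * per_out
    fresh_cost = rounds * user * per_in
    if not cached:
        return rounds * stable * per_in + fresh_cost + out_cost
    write = stable * 1.25 * per_in                # first round writes the prefix
    reads = (rounds - 1) * stable * 0.1 * per_in  # later rounds read it
    return write + reads + fresh_cost + out_cost

baseline = daily_cost(50, 15_000, 500, 2_000, cached=False)
with_cache = daily_cost(50, 15_000, 500, 2_000, cached=True)
saving = 1 - with_cache / baseline
print(f"baseline={baseline:.3f} cached={with_cache:.3f} saving={saving:.1%}")
```

Under these assumptions the totals come out at about $3.83 and $1.85, close to the figures quoted above, with output tokens accounting for $1.50 of both.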

5.2 Token Usage Tracking

The usage object in each API response exposes four fields:

input_tokens : uncached input tokens.

cache_creation_input_tokens : tokens written to the cache.

cache_read_input_tokens : tokens read from the cache.

output_tokens : model output tokens.

Cache hit rate = cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens).
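The hit‑rate formula translates directly into code. The sketch treats the usage object as a plain dictionary with the field names listed above; the sample values are made up for illustration.

```python
def cache_hit_rate(usage):
    """Share of input tokens served from the cache for one response."""
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + created + uncached
    return read / total if total else 0.0

usage = {
    "input_tokens": 500,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 15_000,
    "output_tokens": 2_000,
}
print(f"{cache_hit_rate(usage):.1%}")  # 96.8%
```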

5.3 Official Benchmarks

Anthropic measured three typical workloads:

Long‑document QA (100 K tokens) : TTFT 11.5 s → 2.4 s (‑79%); cost 100% → 10% (‑90%).

Few‑shot prompting (10 K tokens) : TTFT 1.6 s → 1.1 s (‑31%); cost 100% → 14% (‑86%).

Multi‑turn dialogue (10 rounds) : TTFT ~10 s → ~2.5 s (‑75%); cost 100% → 47% (‑53%).

6. Integration with Claude Code

Claude Code automatically enables Prompt Caching for the stable parts of a session:

CLAUDE.md : project‑wide instructions and conventions are cached from the first request.

System prompt & tool definitions : remain unchanged across a session, yielding the highest hit rates.

Conversation history : earlier turns are read from cache; only the newest user‑assistant exchange is processed.

Sub‑agents : each runs in an isolated cache instance.

Typical cache flow:

Request 1: [Tools] + [System] + [CLAUDE.md + User Q1] → write all to cache
Request 2: [Tools] + [System] + [CLAUDE.md + User Q1] ← cache hit
          + [Assistant A1 + User Q2] → new content, written to cache
Request 3: … ← cache hit for the prefix through User Q2, new content added
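The flow can be simulated with a toy model in which each request writes its whole prompt to the cache and the next request hits on that prefix. The block labels and the function `simulate` are illustrative assumptions; real sessions also contain tool calls and results.

```python
def simulate(turns):
    """For each request, return (blocks read from cache, new blocks)."""
    stable = ["Tools", "System", "CLAUDE.md"]
    history = []
    cached = 0  # number of leading blocks already written to the cache
    report = []
    for i in range(1, turns + 1):
        if i > 1:
            history.append(f"Assistant A{i - 1}")
        history.append(f"User Q{i}")
        prompt = stable + history
        report.append((prompt[:cached], prompt[cached:]))
        cached = len(prompt)  # the breakpoint moves to the last block
    return report

for n, (hit, new) in enumerate(simulate(3), start=1):
    print(f"Request {n}: hit={hit} new={new}")
```

In this model only the latest assistant reply and the new user message are uncached on each request after the first.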

7. Best‑Practice Guide

7.1 Cache Strategy Selection

Multi‑turn dialogue → use automatic caching (simplest, manages breakpoints).

Multiple independent cache zones (e.g., separate tool sets) → use explicit breakpoints.

High‑frequency interaction → 5‑minute TTL.

Low‑frequency interaction → 1‑hour TTL.

7.2 Prompt Structure Optimization

Golden rule: place the most stable content first and the most volatile last.

┌───────────────────────┐
│ 1. Tools (stable)      │
├───────────────────────┤
│ 2. System prompt       │
├───────────────────────┤
│ 3. Reference docs      │
├───────────────────────┤
│ 4. Conversation history│
├───────────────────────┤
│ 5. Current user input  │
└───────────────────────┘
Breakpoints at the end of layers 2, 3, and 4.
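One way to enforce this ordering in code is to assemble the prompt from named sections rather than concatenating strings ad hoc. In the sketch below, the section names, the flat block representation, and the function `layout` are all assumptions made for illustration; it is not an API payload.

```python
LAYER_ORDER = ["tools", "system", "reference_docs", "history", "user_input"]
BREAKPOINT_AFTER = {"system", "reference_docs", "history"}  # layers 2, 3, 4
MAX_BREAKPOINTS = 4

def layout(sections):
    """Flatten sections stable-first and flag the last block of each
    layer that should end with a breakpoint."""
    blocks = []
    for layer in LAYER_ORDER:
        items = sections.get(layer, [])
        for i, item in enumerate(items):
            is_last = i == len(items) - 1
            blocks.append({
                "layer": layer,
                "text": item,
                "breakpoint": is_last and layer in BREAKPOINT_AFTER,
            })
    if sum(b["breakpoint"] for b in blocks) > MAX_BREAKPOINTS:
        raise ValueError("too many breakpoints for one request")
    return blocks

blocks = layout({
    "user_input": ["Why does test_login fail?"],
    "system": ["You are a code reviewer."],
    "tools": ["read_file", "run_tests"],
    "history": ["User Q1", "Assistant A1"],
    "reference_docs": ["CONTRIBUTING.md contents"],
})
for b in blocks:
    print(b["layer"], "| breakpoint" if b["breakpoint"] else "")
```

The input dictionary is deliberately out of order to show that the output order depends only on `LAYER_ORDER`.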

7.3 Common Pitfalls & Mitigations

Dynamic data before a breakpoint (e.g., timestamps) prevents hits – move such data after the breakpoint.

Non‑deterministic JSON key order breaks hash matching – use a stable serializer or sort keys.

Prompt shorter than model’s minimum cache length silently disables caching – verify cache_creation_input_tokens and cache_read_input_tokens are non‑zero.

Parallel requests sent before the first one has started responding cannot share the newly written cache – wait for the first response to begin before fanning out calls that share the same prefix.

Exceeding the 20‑block look‑back window – add extra breakpoints (up to four) to keep cache entries within range.
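The key‑order pitfall is easy to reproduce locally. The sketch below serializes two equal tool definitions with and without sorted keys and compares digests; SHA‑256 stands in for whatever hash the service uses, and the tool definition is a made‑up example.

```python
import hashlib
import json

def stable_dumps(obj):
    """Serialize with sorted keys and fixed separators so equal objects
    always produce identical bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Same content, different key insertion order.
tool_a = {"name": "read_file",
          "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}}}
tool_b = {"input_schema": {"properties": {"path": {"type": "string"}}, "type": "object"},
          "name": "read_file"}

print(digest(json.dumps(tool_a)) == digest(json.dumps(tool_b)))      # False
print(digest(stable_dumps(tool_a)) == digest(stable_dumps(tool_b)))  # True
```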

7.4 Monitoring Cache Health

Cache hit rate = cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens)
Cache efficiency = cache_read_input_tokens / total_input_tokens, where total_input_tokens is the sum of the three input fields above.

Guidelines:

>80% hit rate – excellent.

50‑80% – good, room for improvement.

<50% – investigate prompt layout and breakpoint placement.

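For monitoring, the per‑response usage fields can be aggregated over a session and mapped onto the guideline bands. The sketch treats anything below 50% as the band to investigate; the function names and sample usage records are assumptions.

```python
def aggregate_hit_rate(usages):
    """Hit rate across many responses, weighted by token counts."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    other = sum(u.get("cache_creation_input_tokens", 0) + u.get("input_tokens", 0)
                for u in usages)
    total = read + other
    return read / total if total else 0.0

def classify(hit_rate):
    """Map a hit rate (0.0 to 1.0) onto the guideline bands."""
    if hit_rate > 0.80:
        return "excellent"
    if hit_rate >= 0.50:
        return "good, room for improvement"
    return "investigate prompt layout and breakpoint placement"

session = [
    {"input_tokens": 500, "cache_creation_input_tokens": 15_000, "cache_read_input_tokens": 0},
    {"input_tokens": 500, "cache_creation_input_tokens": 2_500, "cache_read_input_tokens": 15_000},
    {"input_tokens": 500, "cache_creation_input_tokens": 2_500, "cache_read_input_tokens": 17_500},
]
rate = aggregate_hit_rate(session)
print(f"{rate:.1%} -> {classify(rate)}")
```

Aggregating by tokens rather than averaging per‑request rates keeps one small request from skewing the figure.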
8. Future Outlook

Semantic‑level caching to allow near‑matches.

Longer TTL options (beyond 1 hour) for day‑level reuse.

Increasing the look‑back window beyond 20 blocks.

Cross‑session cache sharing under strict isolation.

Automatic breakpoint optimization by the API.

Prompt Caching is reshaping AI service economics, turning "pay‑per‑call" into "pay‑for‑incremental‑change" and making long‑context and agentic workloads financially viable for enterprises.
