How Claude Code Prompt Caching Cuts AI Costs by Up to 90% and Boosts Efficiency
Prompt Caching in Anthropic's Claude Code replaces repeated processing of identical prompt prefixes with a lookup in a prefix-hash cache. It cuts input-token cost by up to 90% and time to first token (TTFT) by up to 79%, and it raises throughput, while leaving model output exactly as it would be without the cache.
1. Introduction: Why Prompt Caching Matters
In large language model (LLM) workflows, each API call recomputes the entire prompt even when 90% of the content is unchanged, leading to massive compute waste and latency. Claude Code sessions typically include system prompts, tool definitions, project context (CLAUDE.md), conversation history, and the current user input.
Prompt Caching solves this by allowing developers to cache the stable prefix of a prompt and only process the changing part on subsequent calls.
Cost reduction: only 10% of input‑token price is charged on cache hits.
TTFT reduction: up to 79% faster first‑token latency.
Throughput increase: tokens read from cache generally do not count toward input-token rate limits, so the same quota supports more requests.
2. Core Concepts
2.1 Definition
Prompt Caching is an Anthropic API feature that reuses previously computed intermediate states when the prompt prefix matches exactly. The system hashes the prefix up to a cache‑control breakpoint and stores the forward‑pass results.
2.2 Caching Modes
Automatic Caching : add a top‑level cache_control field; the API automatically places a breakpoint at the last cacheable block and moves it forward as the conversation grows.
{
"model": "claude-sonnet-4-6-20250514",
"system": "You are a professional code‑review assistant...",
"messages": [{"role": "user", "content": "..."}],
"cache_control": {"type": "ephemeral"}
}
Explicit Cache Breakpoints : manually insert cache_control on specific content blocks to gain fine-grained control.
{
"system": [
{"type": "text", "text": "Base instructions...", "cache_control": {"type": "ephemeral"}}
],
"messages": [
{"role": "user", "content": [{"type": "text", "text": "Doc A...", "cache_control": {"type": "ephemeral"}}]},
{"role": "assistant", "content": "..."}
]
}
2.3 Key Constraints
Exact match required : the cached prefix must be 100% identical (including images).
Output unchanged : cached responses are identical to non‑cached runs.
Zero Data Retention (ZDR) : cache entries are held only for their TTL (see 3.4) and are not retained after they expire.
Workspace isolation (since Feb 2026): caches are isolated per workspace, not organization‑wide.
3. Engineering Architecture Deep Dive
3.1 Prefix Hash Matching – Core Mechanism
The cache key is the hash of the prompt prefix up to a breakpoint. Three principles govern its behavior:
Write only at breakpoints : when a cache_control marker is placed, the system hashes everything from the start of the prompt to that point and stores the intermediate state.
Read by backward lookup : on a new request the system computes the hash at the current breakpoint; if no entry matches, it walks back up to 20 previous blocks looking for a cached prefix.
Look‑back window limit : the backward search is capped at 20 content blocks per breakpoint.
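The three principles above can be sketched with a toy in-memory model. This is an illustration only, not Anthropic's implementation: the class name `ToyPrefixCache`, the use of SHA-256 over JSON-serialized blocks, and the choice to count the breakpoint itself as one of the 20 checked positions are all assumptions made for the example.

```python
import hashlib
import json

LOOKBACK_BLOCKS = 20  # window size stated in the text


def prefix_hash(blocks, end):
    """Hash blocks[0:end] in order; changing any earlier block changes the key."""
    h = hashlib.sha256()
    for block in blocks[:end]:
        h.update(json.dumps(block, sort_keys=True).encode("utf-8"))
        h.update(b"\x00")  # separator so block boundaries affect the hash
    return h.hexdigest()


class ToyPrefixCache:
    """Hypothetical model of write-at-breakpoint and backward lookup."""

    def __init__(self):
        self.entries = set()

    def write(self, blocks, breakpoint_index):
        # Store the prefix that ends at (and includes) the breakpoint block.
        self.entries.add(prefix_hash(blocks, breakpoint_index + 1))

    def longest_hit(self, blocks, breakpoint_index):
        """Return how many leading blocks are served from cache (0 on a miss)."""
        end = breakpoint_index + 1
        oldest = max(end - LOOKBACK_BLOCKS, 0)
        # Check the breakpoint first, then walk backwards inside the window.
        for candidate_end in range(end, oldest, -1):
            if prefix_hash(blocks, candidate_end) in self.entries:
                return candidate_end
        return 0
```

Writing a five-block prompt with a breakpoint on the third block and then looking up a longer prompt that shares those three blocks returns 3. Editing the first block makes the same lookup return 0, because every prefix hash now differs.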
Example cache key computation:
Prompt structure: [Tool Defs] → [System Msg] → [Msg 1] → [Msg 2] → [Msg 3]
Breakpoint:                                          ^ (after [Msg 1])
Cache Key = Hash([Tool Defs] + [System Msg] + [Msg 1])
3.2 Cache Layering and Invalidation Cascade
Caches are layered in prefix order: Tools → System → Messages. Each later layer's cache covers all earlier layers as well, so changing an earlier layer invalidates every layer after it.
Tool definition change → invalidates Tool, System, and Message caches.
System prompt change → invalidates System and Message caches.
Message change → invalidates Message cache only.
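The cascade follows directly from cumulative hashing, as this minimal sketch shows. It hashes each layer as a whole, which is a simplification: the real cache works at block granularity, so appending a message still hits the earlier message prefix. The function names are hypothetical.

```python
import hashlib
import json

LAYERS = ("tools", "system", "messages")  # prefix order described in the text


def layer_keys(request):
    """Return one cumulative hash per layer; each key also covers earlier layers."""
    h = hashlib.sha256()
    keys = {}
    for layer in LAYERS:
        h.update(json.dumps(request.get(layer), sort_keys=True).encode("utf-8"))
        keys[layer] = h.hexdigest()  # digest of everything fed in so far
    return keys


def invalidated_layers(old_request, new_request):
    """List the layers whose cumulative key changed between two requests."""
    old, new = layer_keys(old_request), layer_keys(new_request)
    return [layer for layer in LAYERS if old[layer] != new[layer]]
```

Changing only `system` reports `["system", "messages"]`, while changing `tools` reports all three layers, matching the list above.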
3.3 Look‑back Window in Practice
When the number of new content blocks between rounds exceeds 20, earlier cache entries fall outside the window and miss. Adding extra breakpoints (up to four per request) mitigates this.
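A small helper can decide when an extra breakpoint is needed. The sketch below assumes positions are measured in content blocks and that a breakpoint can reach a cached prefix at most 19 blocks behind it (20 checked positions including the breakpoint itself); the exact boundary and both function names are assumptions.

```python
LOOKBACK_BLOCKS = 20  # window size stated in the text


def reachable(cached_end, breakpoint_end, window=LOOKBACK_BLOCKS):
    """True if a prefix cached at `cached_end` blocks is found from a breakpoint
    placed at `breakpoint_end` blocks."""
    return 0 <= breakpoint_end - cached_end < window


def breakpoints_for_turn(prev_cached_end, total_blocks, window=LOOKBACK_BLOCKS):
    """Return the block counts at which to place breakpoints for the next request.

    The last block always gets a breakpoint. If the previously cached prefix
    would fall outside the window, the old position is marked again so it
    still produces a direct hit.
    """
    ends = [total_blocks]
    if prev_cached_end > 0 and not reachable(prev_cached_end, total_blocks, window):
        ends.insert(0, prev_cached_end)
    return ends
```

For a prefix cached at block 3, a request that has grown to 10 blocks needs only the final breakpoint, while one that has grown to 30 blocks gets breakpoints at 3 and 30.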
3.4 Storage and Isolation Model
Organization/Workspace isolation : caches are never shared across organizations and, as noted in 2.3, are further isolated per workspace.
Exact hash matching : no fuzzy matching.
TTL : 5 minutes by default (writes cost 1.25× the base input price) or, optionally, 1 hour (writes cost 2×).
Automatic refresh : each hit within the TTL resets the entry's lifetime at no extra cost.
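The refresh-on-hit behavior amounts to a sliding expiry. A minimal sketch, with times in seconds and a hypothetical `TtlEntry` class:

```python
FIVE_MINUTES = 5 * 60
ONE_HOUR = 60 * 60


class TtlEntry:
    """Toy cache entry whose lifetime restarts on every hit."""

    def __init__(self, ttl_seconds, created_at):
        self.ttl_seconds = ttl_seconds
        self.expires_at = created_at + ttl_seconds

    def read(self, now):
        """Return True on a hit and slide the expiry forward; False once expired."""
        if now >= self.expires_at:
            return False
        self.expires_at = now + self.ttl_seconds
        return True
```

With a 5-minute TTL, reads at 4 and 8 minutes both hit because the first read pushed the expiry to 9 minutes. A single read after 5 minutes of silence misses.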
3.5 Cacheable vs. Non‑cacheable Content
Cacheable: tool definitions, system messages, text messages, images, tool calls/results.
Non‑cacheable: Thinking blocks (cannot be explicitly marked), empty text blocks, sub‑content (e.g., citations) unless the top‑level block is cached.
4. Cache Breakpoints and TTL Mechanics
4.1 Breakpoint Limits
Each request can have up to four breakpoints. Automatic mode consumes one; the remaining three can be placed explicitly.
4.2 TTL Dual‑Layer Mechanism
Two TTL options:
5‑minute (default) : write cost 1.25× base input price; suited for high‑frequency interactions.
1‑hour (extended) : write cost 2× base input price; suited for low‑frequency or long‑gap interactions.
Configuration example:
{
"cache_control": {"type": "ephemeral", "ttl": "1h"}
}
4.3 Mixed TTL Billing Model
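When a request mixes both TTLs, billing is computed from three token positions: A, the token count at the highest cache hit (0 if there is none); B, the token count at the highest 1-hour breakpoint after A (equal to A if there is none); and C, the token count at the last breakpoint. The A tokens are charged as cache reads at 0.1× the input price, the B - A tokens at the 1-hour write rate (2×), and the C - B tokens at the 5-minute write rate (1.25×).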
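As a worked sketch of this billing rule, the function below follows my reading of the three positions: A is the token count at the highest cache hit, B the token count at the highest 1-hour breakpoint after A (or A itself), and C the token count at the last breakpoint. That reading and the function name are assumptions, and the base price is a caller-supplied parameter rather than a quoted rate.

```python
READ_MULTIPLIER = 0.10      # cache read, from the text
WRITE_1H_MULTIPLIER = 2.00  # 1-hour write, from the text
WRITE_5M_MULTIPLIER = 1.25  # 5-minute write, from the text


def mixed_ttl_billing(a, b, c, base_price_per_mtok):
    """Split the cached portion of a request into read, 1h-write and 5m-write cost.

    a, b, c are token counts measured from the start of the prompt.
    """
    if not 0 <= a <= b <= c:
        raise ValueError("positions must satisfy 0 <= A <= B <= C")
    per_token = base_price_per_mtok / 1_000_000
    return {
        "cache_read": a * per_token * READ_MULTIPLIER,
        "write_1h": (b - a) * per_token * WRITE_1H_MULTIPLIER,
        "write_5m": (c - b) * per_token * WRITE_5M_MULTIPLIER,
    }
```

For example, with A = 10,000, B = 14,000 and C = 15,000, the first 10,000 tokens are billed as reads, the next 4,000 as a 1-hour write, and the last 1,000 as a 5-minute write.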
4.4 Minimum Cacheable Length
Models require a minimum token count for caching (e.g., Claude Opus 4.6 needs ≥4096 tokens, Claude Sonnet 4.6 needs ≥2048 tokens). Prompts below this threshold silently bypass caching.
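Because the bypass is silent, a pre-flight check plus a post-response check can help. The sketch below hard-codes the two thresholds quoted above; the dictionary keys are hypothetical labels, not API model identifiers, and the token count must come from a real tokenizer or token-counting call.

```python
# Thresholds quoted in the text; the keys are illustrative labels only.
MIN_CACHEABLE_TOKENS = {
    "opus-4.6": 4096,
    "sonnet-4.6": 2048,
}


def meets_cache_minimum(model_label, prefix_tokens):
    """True if the prefix is long enough to be cached for this model."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model_label]


def caching_bypassed(usage):
    """True if a response's usage shows neither a cache write nor a cache read."""
    return (
        usage.get("cache_creation_input_tokens", 0) == 0
        and usage.get("cache_read_input_tokens", 0) == 0
    )
```

A 3,000-token prefix passes the check for the Sonnet label but not for the Opus label.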
5. Cost Model and Benchmark Analysis
5.1 Example Cost Calculation
Scenario: a developer runs 50 Claude Sonnet 4.6 rounds per day. Stable context = 15 000 tokens, user input = 500 tokens, model output = 2 000 tokens.
Without cache, total cost ≈ $3.83. With 95% cache hit rate, total cost ≈ $1.85, a 51.7% saving. In extreme cases (100 K‑token documents) savings can reach 90%.
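The scenario can be checked with a few lines of arithmetic. The per-token prices below ($3 per million input tokens, $15 per million output tokens) are assumptions, since the article does not state the rates it used, and the model ignores conversation-history growth. Under those assumptions the uncached total comes to about $3.83, and about $1.85 when one round writes the prefix with the 5-minute TTL and the other 49 rounds read it.

```python
INPUT_PER_MTOK = 3.00    # assumed base input price, USD per million tokens
OUTPUT_PER_MTOK = 15.00  # assumed output price, USD per million tokens
WRITE_5M = 1.25          # 5-minute cache write multiplier, from the text
READ = 0.10              # cache read multiplier, from the text


def daily_cost(rounds, stable, user, output, cache_writes=None):
    """Total cost in USD for `rounds` requests.

    cache_writes=None means caching is off. Otherwise it is the number of
    rounds that write the stable prefix; the remaining rounds read it.
    """
    in_price = INPUT_PER_MTOK / 1_000_000
    out_price = OUTPUT_PER_MTOK / 1_000_000
    per_round = user * in_price + output * out_price  # never cached
    if cache_writes is None:
        return rounds * (stable * in_price + per_round)
    reads = rounds - cache_writes
    prefix = stable * in_price * (cache_writes * WRITE_5M + reads * READ)
    return prefix + rounds * per_round


no_cache = daily_cost(50, 15_000, 500, 2_000)
with_cache = daily_cost(50, 15_000, 500, 2_000, cache_writes=1)
saving = 1 - with_cache / no_cache
```

Output tokens dominate the cached total here, which is why the saving stays near half rather than approaching 90%: only the input side is discounted.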
5.2 Token Usage Tracking
The usage object in each API response exposes four fields:
input_tokens – uncached input.
cache_creation_input_tokens – tokens written to cache.
cache_read_input_tokens – tokens read from cache.
output_tokens – model output.
Cache hit rate = cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens).
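A minimal sketch of that calculation, assuming the usage fields have already been read into a plain dictionary (how you obtain it depends on your client library):

```python
def cache_hit_rate(usage):
    """Share of input tokens served from cache, between 0.0 and 1.0."""
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + created + uncached
    return read / total if total else 0.0


usage = {
    "input_tokens": 500,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 15_000,
    "output_tokens": 2_000,
}
rate = cache_hit_rate(usage)  # 15000 / 15500
```

Output tokens are excluded on purpose: caching affects only the input side.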
5.3 Official Benchmarks
Anthropic measured three typical workloads:
Long‑document QA (100 K tokens) : TTFT 11.5 s → 2.4 s (‑79%); cost 100% → 10% (‑90%).
Few‑shot prompting (10 K tokens) : TTFT 1.6 s → 1.1 s (‑31%); cost 100% → 14% (‑86%).
Multi‑turn dialogue (10 rounds) : TTFT ~10 s → ~2.5 s (‑75%); cost 100% → 47% (‑53%).
6. Integration with Claude Code
Claude Code automatically enables Prompt Caching for the stable parts of a session:
CLAUDE.md : project‑wide instructions and conventions are cached from the first request.
System prompt & tool definitions : remain unchanged across a session, yielding the highest hit rates.
Conversation history : earlier turns are read from cache; only the newest user‑assistant exchange is processed.
Sub‑agents : each runs in an isolated cache instance.
Typical cache flow:
Request 1: [Tools] + [System] + [CLAUDE.md + User Q1] → write all to cache
Request 2: [Tools] + [System] + [CLAUDE.md] ← cache hit
+ [User Q1 + Assistant A1 + User Q2] → new content
Request 3: … ← cache hit for prefix, new content added
7. Best‑Practice Guide
7.1 Cache Strategy Selection
Multi‑turn dialogue → use automatic caching (simplest, manages breakpoints).
Multiple independent cache zones (e.g., separate tool sets) → use explicit breakpoints.
High‑frequency interaction → 5‑minute TTL.
Low‑frequency interaction → 1‑hour TTL.
7.2 Prompt Structure Optimization
Golden rule: place the most stable content first and the most volatile last.
┌───────────────────────┐
│ 1. Tools (stable) │
├───────────────────────┤
│ 2. System prompt │
├───────────────────────┤
│ 3. Reference docs │
├───────────────────────┤
│ 4. Conversation history│
├───────────────────────┤
│ 5. Current user input │
└───────────────────────┘
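One way to assemble a request in this order is sketched below, with breakpoints closing the system prompt, the reference docs, and the conversation history. Placing the reference docs as a second system block is a design choice made for the example, and the helper names are hypothetical. The payload is a plain dictionary and nothing is sent.

```python
EPHEMERAL = {"type": "ephemeral"}
MAX_BREAKPOINTS = 4  # per-request limit stated in the text


def text_block(text, cached=False):
    """Build a text content block, optionally ending a cached prefix."""
    block = {"type": "text", "text": text}
    if cached:
        block["cache_control"] = dict(EPHEMERAL)
    return block


def as_blocks(content):
    """Normalize string content to a list of blocks; copy existing blocks."""
    if isinstance(content, str):
        return [text_block(content)]
    return [dict(b) for b in content]


def build_request(model, tools, system_prompt, reference_docs, history,
                  user_input, max_tokens=1024):
    """Order content from most stable to most volatile and mark three breakpoints."""
    messages = [{"role": m["role"], "content": as_blocks(m["content"])}
                for m in history]
    if messages and messages[-1]["content"]:
        # Breakpoint at the end of the conversation history (layer 4).
        messages[-1]["content"][-1]["cache_control"] = dict(EPHEMERAL)
    # The current user input goes last and stays uncached (layer 5).
    messages.append({"role": "user", "content": [text_block(user_input)]})
    return {
        "model": model,
        "max_tokens": max_tokens,
        "tools": tools,                                # layer 1
        "system": [
            text_block(system_prompt, cached=True),    # layer 2
            text_block(reference_docs, cached=True),   # layer 3
        ],
        "messages": messages,
    }


def count_breakpoints(request):
    """Count cache_control markers across tools, system and message blocks."""
    blocks = list(request.get("tools", [])) + list(request["system"])
    for message in request["messages"]:
        blocks.extend(message["content"])
    return sum(1 for block in blocks if "cache_control" in block)
```

Checking `count_breakpoints(request) <= MAX_BREAKPOINTS` before sending catches accidental extra markers early.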
Breakpoints at the end of layers 2, 3, and 4.
7.3 Common Pitfalls & Mitigations
Dynamic data before a breakpoint (e.g., timestamps) prevents hits – move such data after the breakpoint.
Non‑deterministic JSON key order breaks hash matching – use a stable serializer or sort keys.
Prompt shorter than model’s minimum cache length silently disables caching – verify cache_creation_input_tokens and cache_read_input_tokens are non‑zero.
Parallel requests before the first finishes cannot share the newly written cache – serialize parallel calls that share the same prefix.
Exceeding the 20‑block look‑back window – add extra breakpoints (up to four) to keep cache entries within range.
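For the key-order pitfall, a stable serializer can be as simple as the sketch below. It uses only the standard library; the function name is hypothetical, and sorting keys is one option among several.

```python
import json


def canonical_json(value):
    """Serialize with sorted keys and fixed separators so equal data yields equal text."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


a = {"b": 1, "a": {"d": 2, "c": 3}}
b = {"a": {"c": 3, "d": 2}, "b": 1}
same = canonical_json(a) == canonical_json(b)  # True: key order no longer matters
```

Any tool schema or structured context embedded in the cached prefix should pass through the same serializer on every request.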
7.4 Monitoring Cache Health
Cache hit rate = cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens)
Cache efficiency = cache_read_input_tokens / total_input_tokens
Guidelines:
>80% hit rate – excellent.
50‑80% – good, room for improvement.
<50% – investigate prompt layout and breakpoint placement.
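For session-level monitoring, the per-response counters can be summed before applying the guideline bands. The sketch assumes usage records are plain dictionaries and treats anything under 50% as the band to investigate, that is, the range left below the 50-80% band. Putting exactly 80% in the "good" band is also an assumption.

```python
FIELDS = ("cache_read_input_tokens", "cache_creation_input_tokens", "input_tokens")


def session_hit_rate(usages):
    """Hit rate over a whole session, computed from summed token counters."""
    totals = {f: sum(u.get(f, 0) for u in usages) for f in FIELDS}
    denominator = sum(totals.values())
    return totals["cache_read_input_tokens"] / denominator if denominator else 0.0


def health_tier(rate):
    """Map a hit rate (0.0 to 1.0) to the guideline bands."""
    if rate > 0.80:
        return "excellent"
    if rate >= 0.50:
        return "good"
    return "investigate"
```

Summing first weights each response by its size, so one large cache miss is not hidden by many small hits.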
8. Future Outlook
Semantic‑level caching to allow near‑matches.
Longer TTL options (beyond 1 hour) for day‑level reuse.
Increasing the look‑back window beyond 20 blocks.
Cross‑session cache sharing under strict isolation.
Automatic breakpoint optimization by the API.
Prompt Caching is reshaping AI service economics, turning "pay‑per‑call" into "pay‑for‑incremental‑change" and making long‑context and agentic workloads financially viable for enterprises.
Architect's Guide
Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.
