
Bedrock prompt caching — when it pays back

07 May, 2026

Anthropic prompt caching on Bedrock charges you 1.25x the input rate to write to cache, then 0.1x the input rate to read from cache. So a cached prefix is 90% off if you can amortize the write across enough reads.

The math on whether it pays back is short.

The break-even

For a given prefix of N tokens cached for one hour:

Write cost  = N tokens × 1.25 × input_rate
Read cost   = N tokens × 0.10 × input_rate per read
Fresh cost  = N tokens × 1.00 × input_rate per read

You break even when the savings from cached reads exceed the write cost. Treating the write as pure overhead, that happens at:

break_even_reads = 1.25 / (1.00 - 0.10) ≈ 1.4 reads

Two reads of the same prefix within the cache TTL beat sending it fresh both times, even after paying for the write. Three or more is a clear win. In practice the write rides on a request you were sending anyway, so the effective penalty is only the 0.25x premium and break-even arrives even sooner.
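The arithmetic fits in a few lines. A minimal sketch using the 1.25x/0.10x multipliers above, with a hypothetical per-token input rate:

```python
def cached_cost(n_tokens, reads, rate, write_mult=1.25, read_mult=0.10):
    """One cache write plus `reads` cached reads of an n-token prefix."""
    return n_tokens * rate * (write_mult + read_mult * reads)

def fresh_cost(n_tokens, reads, rate):
    """Sending the same n-token prefix fresh on every one of `reads` requests."""
    return n_tokens * rate * reads

# Example: a 5,000-token prefix at a hypothetical $3 per million input tokens.
rate = 3 / 1_000_000
for reads in (1, 2, 5):
    # One read doesn't cover the write; two or more do.
    print(reads, cached_cost(5000, reads, rate) < fresh_cost(5000, reads, rate))
```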

What's worth caching

Any static prefix long enough to qualify (see the minimum below) gets read back at 90% off. The candidates are:

  • System prompts that don't change per request (always)
  • Tool definitions for agentic workflows (always)
  • Retrieved documents the user is chatting against (yes, if multi-turn)
  • Few-shot examples in the system message (always)

The thing that breaks caching: anything dynamic in the prefix. If your prompt template starts with User joined at {timestamp}, the timestamp invalidates the cache for that whole prefix. Move dynamic content to the end.
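The fix is mechanical: keep the prefix byte-identical and push per-request values into the user message. A sketch with placeholder values (joined_at and the literal strings are hypothetical; LONG_SYSTEM_PROMPT and user_input are the same stand-ins as in the code sample below):

```python
LONG_SYSTEM_PROMPT = "You are a support agent. ..."  # static across requests
joined_at = "2026-05-07T12:00:00Z"                   # hypothetical per-request value
user_input = "Why was I charged twice?"              # per-request value

# Bad: a per-request timestamp at the top of the prefix means the cached
# bytes never match, so every request pays the 1.25x write and nothing reads.
bad_system = f"User joined at {joined_at}. {LONG_SYSTEM_PROMPT}"

# Good: the cacheable prefix is byte-identical across requests; dynamic
# content moves after it, into the user message.
good_system = LONG_SYSTEM_PROMPT
user_message = f"(User joined at {joined_at})\n{user_input}"
```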

The minimum to enable

You need at least 1024 tokens in the prefix for it to be cacheable. Below that, caching is silently a no-op.
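Since there is no error when you fall short, a sanity check before wiring up cache_control is cheap. A rough sketch using the common ~4-characters-per-token heuristic (an approximation, not a real tokenizer):

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose."""
    return len(text) // 4

MIN_CACHEABLE_TOKENS = 1024  # the minimum stated above

def worth_caching(prefix: str) -> bool:
    """True if the prefix is plausibly long enough for caching to engage."""
    return rough_token_count(prefix) >= MIN_CACHEABLE_TOKENS

# A 2,000-character system prompt (~500 tokens) would be silently uncached:
# worth_caching("x" * 2000) -> False
```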

Code sample

import json

import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.invoke_model(
    modelId="anthropic.claude-sonnet-4-20260214-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # Marks everything up to this block as a cacheable prefix
                "cache_control": {"type": "ephemeral"}
            }
        ],
        "messages": [
            {"role": "user", "content": user_input}
        ],
        "max_tokens": 1024
    })
)

result = json.loads(response["body"].read())

Cache hits show up in the response usage block as cache_read_input_tokens. Track this metric over time. If it's under 50% on traffic where the prefix is static, you have a cache-invalidation bug somewhere upstream (usually a timestamp or session ID accidentally embedded in the system prompt).
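A minimal sketch of that metric, assuming you log the usage block from each response (the cache_* field names follow the response format described above; the aggregation itself is illustrative):

```python
def cache_hit_rate(usage_records):
    """Fraction of prefix input tokens served from cache across requests.

    Each record is a usage dict from a response, e.g.
    {"input_tokens": ..., "cache_read_input_tokens": ...,
     "cache_creation_input_tokens": ...}.
    """
    read = sum(u.get("cache_read_input_tokens", 0) for u in usage_records)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usage_records)
    fresh = sum(u.get("input_tokens", 0) for u in usage_records)
    total = read + written + fresh
    return read / total if total else 0.0
```

Run this over a day of traffic with a static prefix; a result well under 0.5 is the cache-invalidation smell described above.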

Quick estimate

The companion bedrock-cost-calculator CLI takes a --cache-hit-rate argument and shows the savings at 60% hit rate vs. zero, against your specific token shape. Useful for deciding whether to spend a sprint on cache plumbing.

Related

Lever 1 in Seven AWS levers that cut $1.5M from a production SaaS bill covers the EC2 commitment-pricing equivalent. Prompt caching is the Bedrock parallel.