Bedrock prompt caching — when it pays back
Anthropic prompt caching on Bedrock charges you 1.25x the input rate to write to cache, then 0.1x the input rate to read from cache. So a cached prefix is 90% off if you can amortize the write across enough reads.
The math on whether it pays back is short.
The break-even
For a given prefix of N tokens, cached and reused within the TTL (about five minutes on Bedrock, refreshed on each cache hit):
Write cost = N tokens × 1.25 × input_rate (once, on the request that writes the cache)
Read cost = N tokens × 0.10 × input_rate per cached request
Fresh cost = N tokens × 1.00 × input_rate per request, with no caching
You break even when the savings from cached reads exceed the write cost. Treating the write as pure overhead, that happens at:
break_even_reads = 1.25 / (1.00 - 0.10) = ~1.4 reads
Two reads of the same prefix within the cache TTL beat sending it fresh both times, even after paying for the write: 1.25 + 0.10 + 0.10 = 1.45 units of input cost versus 2.00 fresh. In practice it pays back even faster, because the write rides along with a request you were sending anyway; the real penalty is the 0.25x premium on that first call, and a single cache hit covers it. Three or more reads is a clear win under either accounting.
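A minimal sketch of the payback arithmetic in Python (no AWS calls; the multipliers mirror the rates above, and the function assumes the write rides along with the first request):

def cache_payback(prefix_tokens: int, requests: int, input_rate_per_mtok: float,
                  write_mult: float = 1.25, read_mult: float = 0.10) -> dict:
    # Cached path: one request that writes the cache, then (requests - 1)
    # requests that read it. Fresh path: every request pays full input rate.
    per_token = input_rate_per_mtok / 1_000_000
    fresh = prefix_tokens * per_token * requests
    cached = prefix_tokens * per_token * (write_mult + read_mult * (requests - 1))
    return {"fresh": fresh, "cached": cached, "savings": fresh - cached}

# A 10,000-token prefix at $3/MTok input, reused 20 times within the TTL:
# fresh ≈ $0.60, cached ≈ $0.09, savings ≈ $0.51.
print(cache_payback(10_000, 20, 3.00))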
What's worth caching
Static prefixes above the minimum cacheable length (see below) get read back at 90% off. The candidates are:
- System prompts that don't change per request (always)
- Tool definitions for agentic workflows (always)
- Retrieved documents the user is chatting against (yes, if multi-turn)
- Few-shot examples in the system message (always)
The thing that breaks caching: anything dynamic in the prefix. The cache key is an exact match on the tokens up to the cache_control breakpoint, so if your prompt template starts with User joined at {timestamp}, every request produces a different prefix and none of them hit. Move dynamic content to the end, after the cached block, as sketched below.
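A sketch of the fix. The variable names here (joined_at, user_input) are hypothetical; the point is the ordering:

joined_at = "2026-02-14T09:30:00Z"  # hypothetical per-request value

# Bad: a dynamic value at the top means every request has a different
# prefix, so nothing ever hits the cache.
system_bad = f"User joined at {joined_at}\n\n" + LONG_SYSTEM_PROMPT

# Good: the static prompt stays first and carries the cache_control
# breakpoint; per-request context moves into the user message.
system_good = LONG_SYSTEM_PROMPT
first_user_message = f"Context: user joined at {joined_at}\n\n{user_input}"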
The minimum to enable
You need at least 1,024 tokens in the cached prefix for most Claude models (the Haiku models require 2,048). Below that, the cache_control block is accepted but caching is silently a no-op: no write, no reads, no error.
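If you want a guardrail in code, here is a rough pre-check before attaching the breakpoint (a crude characters-per-token estimate, purely illustrative; the real minimum is enforced server-side):

MIN_CACHEABLE_TOKENS = 1024  # 2048 for Haiku models

def probably_cacheable(text: str, min_tokens: int = MIN_CACHEABLE_TOKENS) -> bool:
    # Rough heuristic: ~4 characters per token for English prose. Use a
    # real tokenizer if you need precision; this only flags obvious no-ops.
    return len(text) / 4 >= min_tokens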
Code sample
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.invoke_model(
    modelId="anthropic.claude-sonnet-4-20260214-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,  # static prefix, >= 1,024 tokens
                # Marks the end of the cacheable prefix. Everything up to
                # here is written on the first request, read on later ones.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            # Dynamic content lives after the breakpoint, where it can't
            # invalidate the cached prefix.
            {"role": "user", "content": user_input}
        ],
        "max_tokens": 1024,
    }),
)
Cache hits show up in the response usage block as cache_read_input_tokens. Track this metric over time. If it's under 50% on traffic where the prefix is static, you have a cache-invalidation bug somewhere upstream (usually a timestamp or session ID accidentally embedded in the system prompt).
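Pulling those numbers out of the InvokeModel response looks roughly like this (the body follows the Anthropic messages format, which carries the usage fields):

body = json.loads(response["body"].read())
usage = body["usage"]

written = usage.get("cache_creation_input_tokens", 0)  # tokens written this call
cached = usage.get("cache_read_input_tokens", 0)       # tokens served from cache
fresh = usage.get("input_tokens", 0)                   # uncached input tokens

# Share of input served from cache; should approach ~100% on static-prefix
# traffic once the cache is warm.
hit_rate = cached / max(written + cached + fresh, 1)
print(f"cache hit rate: {hit_rate:.0%}")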
Quick estimate
The companion bedrock-cost-calculator CLI takes a --cache-hit-rate argument and shows the savings at your expected hit rate (say, 60%) versus no caching, against your specific token shape. Useful for deciding whether to spend a sprint on cache plumbing.
Related
Lever 1 in "Seven AWS levers that cut $1.5M from a production SaaS bill" covers the EC2 commitment-pricing equivalent. Prompt caching is the Bedrock parallel.