Prompt caching looks free in the docs and complicated in production. Here's what we found running it across a few thousand Amy invocations a day.
The naive picture
The vendor docs sell caching as: long static prefix → big read; short dynamic suffix → small write; pay for both, but reads are cheap. In a single-turn benchmark that's true and unambiguous. In a multi-turn agent that re-plans, calls tools, and amends its own scratchpad, the picture is messier.
What we measured
Across a representative slice of Amy traffic over two weeks:
- Tool-heavy agents (more than ~6 tool calls per turn) consistently hit cache. Median cost per turn dropped 41%.
- Single-shot completions (one prompt → one answer) hit cache only when invoked in tight bursts. Steady-state hit rate was below 10%.
- Long-running agents that mutate their system prompt mid-run got worse with caching enabled. Each mutation invalidated everything; the write cost was paid repeatedly.
The third bucket was the one we didn't predict.
Why prompt mutation kills caching
A cache entry is keyed on the entire prefix bytes-for-bytes. Re-ordering tools, swapping a date string, even rewriting a single instruction at the top of the system prompt invalidates the entire cached prefix — and the next call pays the write price for a brand-new entry that may itself only be read once before the next mutation.
The agents most likely to mutate their prompts are the ones planning long-horizon work. They're also the most expensive ones. So the regression hurts where you can least afford it.
What we do now
A simple rule that landed after the data:
Cache the prefix you're sure won't change. Put everything else after the cache boundary. If you can't cleanly separate the two, don't cache.
In Amy's runtime that means:
- The tool catalog and the system prompt are above the boundary.
- The conversation history (which the agent edits) is below the boundary.
- We re-evaluate per agent-class once a month and turn caching off for any class whose hit rate drops under 25%.
What we're testing next
Vendor-side partial-prefix caching — being able to cache the first N tokens even when token N+1 differs. If/when that ships broadly, the prompt-mutation regression goes away and the rule above gets simpler. Until then, the boundary discipline is what works.