Notes on prompt caching at agent scale

Prompt caching looks free in the docs and complicated in production. Here's what we found running it across a few thousand Amy invocations a day.

The naive picture

The vendor docs sell caching as: long static prefix → big read; short dynamic suffix → small write; pay for both, but reads are cheap. In a single-turn benchmark that's true and unambiguous. In a multi-turn agent that re-plans, calls tools, and amends its own scratchpad, the picture is messier.

What we measured

Across a representative slice of Amy traffic over two weeks:

Tool-heavy agents (more than ~6 tool calls per turn) consistently hit cache. Median cost per turn dropped 41%.
Single-shot completions (one prompt → one answer) hit cache only when invoked in tight bursts. Steady-state hit rate was below 10%.
Long-running agents that mutate their system prompt mid-run got worse with caching enabled. Each mutation invalidated everything; the write cost was paid repeatedly.

The third bucket was the one we didn't predict.

Why prompt mutation kills caching

A cache entry is keyed on the entire prefix bytes-for-bytes. Re-ordering tools, swapping a date string, even rewriting a single instruction at the top of the system prompt invalidates the entire cached prefix — and the next call pays the write price for a brand-new entry that may itself only be read once before the next mutation.

The agents most likely to mutate their prompts are the ones planning long-horizon work. They're also the most expensive ones. So the regression hurts where you can least afford it.

What we do now

A simple rule that landed after the data:

Cache the prefix you're sure won't change. Put everything else after the cache boundary. If you can't cleanly separate the two, don't cache.

In Amy's runtime that means:

The tool catalog and the system prompt are above the boundary.
The conversation history (which the agent edits) is below the boundary.
We re-evaluate per agent-class once a month and turn caching off for any class whose hit rate drops under 25%.

What we're testing next

Vendor-side partial-prefix caching — being able to cache the first N tokens even when token N+1 differs. If/when that ships broadly, the prompt-mutation regression goes away and the rule above gets simpler. Until then, the boundary discipline is what works.