Notes on prompt caching at agent scale

What we measured running Amy with aggressive prompt caching: where it pays off, where it backfires, and the rule of thumb we use now.

Mira Patel · 2 min read

Prompt caching looks free in the docs and complicated in production. Here's what we found running it across a few thousand Amy invocations a day.

The naive picture

The vendor docs sell caching as: long static prefix → big read; short dynamic suffix → small write; pay for both, but reads are cheap. In a single-turn benchmark that's true and unambiguous. In a multi-turn agent that re-plans, calls tools, and amends its own scratchpad, the picture is messier.

What we measured

Across a representative slice of Amy traffic over two weeks:

  • Tool-heavy agents (more than ~6 tool calls per turn) consistently hit cache. Median cost per turn dropped 41%.
  • Single-shot completions (one prompt → one answer) hit cache only when invoked in tight bursts. Steady-state hit rate was below 10%.
  • Long-running agents that mutate their system prompt mid-run got worse with caching enabled. Each mutation invalidated everything; the write cost was paid repeatedly.

The third bucket was the one we didn't predict.
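A back-of-envelope model helps show why a low hit rate turns caching into a net loss. The sketch below is illustrative rather than a description of our billing: it assumes every call either reads the whole prefix from cache or writes it fresh, and the two price multipliers are placeholder assumptions, not figures from our measurements. Check them against your vendor's price sheet.

```python
def relative_prefix_cost(hit_rate: float, write_mult: float, read_mult: float) -> float:
    """Expected prefix cost per call, relative to 1.0 for an uncached prefix."""
    return hit_rate * read_mult + (1.0 - hit_rate) * write_mult


def break_even_hit_rate(write_mult: float, read_mult: float) -> float:
    """Hit rate at which caching costs exactly the same as not caching."""
    return (write_mult - 1.0) / (write_mult - read_mult)


# Assumed multipliers, for illustration only.
WRITE_MULT = 1.25  # assumed: a cache write costs 25% more than a plain input token
READ_MULT = 0.10   # assumed: a cache read costs 10% of a plain input token

for rate in (0.05, 0.25, 0.90):
    cost = relative_prefix_cost(rate, WRITE_MULT, READ_MULT)
    print(f"hit rate {rate:.0%}: prefix costs {cost:.2f}x the uncached price")

print(f"break-even hit rate: {break_even_hit_rate(WRITE_MULT, READ_MULT):.1%}")
```

Under these assumed multipliers, any agent class whose hit rate sits below the break-even point pays more with caching on than with it off, which is the shape of the regression we saw in the third bucket.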

Why prompt mutation kills caching

A cache entry is keyed on the entire prefix, byte for byte. Reordering tools, swapping a date string, or even rewriting a single instruction at the top of the system prompt invalidates the entire cached prefix. The next call then pays the write price for a brand-new entry that may itself be read only once before the next mutation.

The agents most likely to mutate their prompts are the ones planning long-horizon work. They're also the most expensive ones. So the regression hurts where you can least afford it.
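The byte-for-byte keying can be illustrated with a toy model. This is a sketch, not how any particular vendor implements its cache: the `prefix_key` function, the prompt text, the tool names, and the dates are all hypothetical, and hashing simply stands in for "exact match on the prefix bytes".

```python
import hashlib


def prefix_key(system_prompt: str, tool_names: list[str]) -> str:
    """Hypothetical cache key: a hash over the exact bytes of the prefix."""
    prefix = system_prompt + "\n" + "\n".join(tool_names)
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


tools = ["search", "read_file", "run_tests"]  # made-up tool names
base = prefix_key("You are Amy. Today is 2024-05-01.", tools)

# Swapping the date string changes the bytes, so the key changes.
next_day = prefix_key("You are Amy. Today is 2024-05-02.", tools)

# Same tools in a different order: same content, different bytes, different key.
reordered = prefix_key("You are Amy. Today is 2024-05-01.", list(reversed(tools)))

print(base == next_day)   # False
print(base == reordered)  # False
```

In this model neither change alters what the agent can do, yet both produce a new key, so both would force a fresh cache write.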

What we do now

A simple rule that landed after the data:

Cache the prefix you're sure won't change. Put everything else after the cache boundary. If you can't cleanly separate the two, don't cache.

In Amy's runtime that means:

  • The tool catalog and the system prompt sit above the boundary, inside the cached prefix.
  • The conversation history (which the agent edits) sits below the boundary, outside the cache.
  • We re-evaluate per agent-class once a month and turn caching off for any class whose hit rate drops under 25%.
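The list above could be sketched roughly as follows. This is a minimal illustration rather than Amy's actual runtime: `PromptParts`, `build_prompt`, and `caching_enabled` are hypothetical names, sorting the tool catalog is our assumption for keeping the prefix bytes stable, and treating a class with no traffic as "caching off" is also an assumption.

```python
from dataclasses import dataclass

HIT_RATE_FLOOR = 0.25  # the 25% threshold from the list above


@dataclass
class PromptParts:
    cached_prefix: str    # system prompt + tool catalog, never edited mid-run
    volatile_suffix: str  # conversation history, which the agent edits


def build_prompt(system_prompt: str, tool_catalog: list[str], history: list[str]) -> PromptParts:
    """Place stable content above the boundary and mutable content below it."""
    # Assumption: sort the catalog so tool ordering cannot drift between calls.
    prefix = system_prompt + "\n" + "\n".join(sorted(tool_catalog))
    return PromptParts(cached_prefix=prefix, volatile_suffix="\n".join(history))


def caching_enabled(hits: int, calls: int) -> bool:
    """Periodic review: keep caching on only if the class clears the floor."""
    if calls == 0:
        return False  # assumption: no traffic means nothing worth caching
    return hits / calls >= HIT_RATE_FLOOR


a = build_prompt("sys", ["b_tool", "a_tool"], ["user: hi"])
b = build_prompt("sys", ["a_tool", "b_tool"], ["user: hi", "agent: edited note"])
print(a.cached_prefix == b.cached_prefix)  # True
```

The point of the split is that edits to the history only ever touch `volatile_suffix`, so the prefix stays identical across calls even as the conversation changes.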

What we're testing next

Next up is vendor-side partial-prefix caching: being able to cache the first N tokens even when token N+1 differs. If and when that ships broadly, the prompt-mutation regression goes away and the rule above gets simpler. Until then, the boundary discipline is what works.
