Notes on tool-use evals at small budgets

You don't need a thousand-row eval set to catch the regression that matters. You need the right twenty rows.

Amy TeamJanuary 13, 20261 min read

The "build a giant eval set" advice is great when you have a research team. We don't, so we did the small version: twenty hand-curated cases, each one a regression we'd already shipped at least once.

The twenty cases catch about 80% of model-swap regressions before they reach users. The remaining 20% mostly surface as user reports within a day, which is fast enough to roll back. The lesson, again: small and curated beats large and procedural when the cost of being wrong is bounded.

More in Research

View all →

Research

Notes on prompt caching at agent scale

What we measured running Amy with aggressive prompt caching: where it pays off, where it backfires, and the rule of thumb we use now.

Mira PatelApril 8, 20262 min read

Research

When to fine-tune vs. prompt-engineer

Most teams reach for fine-tuning too early and prompt engineering too late. Here's the heuristic we use.

Amy TeamJanuary 6, 20261 min read