Notes on tool-use evals at small budgets

You don't need a thousand-row eval set to catch the regression that matters. You need the right twenty rows.

Amy Team1 min read

The "build a giant eval set" advice is great when you have a research team. We don't, so we did the small version: twenty hand-curated cases, each one a regression we'd already shipped at least once.

The twenty cases catch about 80% of model-swap regressions before they reach users. The remaining 20% mostly surface as user reports within a day, which is fast enough to roll back. The lesson, again: small and curated beats large and procedural when the cost of being wrong is bounded.

More in Research

View all →