The "build a giant eval set" advice is great when you have a research team. We don't, so we did the small version: twenty hand-curated cases, each one a regression we'd already shipped at least once.
The twenty cases catch about 80% of model-swap regressions before they reach users. The remaining 20% mostly surface as user reports within a day, which is fast enough to roll back. The lesson, again: small and curated beats large and procedural when the cost of being wrong is bounded.