Evals before you scale AI - the boring checklist
Before doubling traffic on an LLM feature: golden questions, regression harnesses, and refusing to ship vibes-only improvements.
Shipping AI features without evals is shipping blind. “Feels smarter” isn’t a release note your COSS users can rely on.
Minimum viable rig
- Golden set: 30-100 questions your users actually ask, each labeled with the expected behavior (even if the label is subjective).
- Regression runs on every prompt/model change: same harness, diff outputs, compare failure deltas - not vibes.
- Explicit failure buckets: refusal vs hallucination vs formatting vs tool misuse - each gets different fixes.
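The three bullets above can be sketched as one small harness. This is a minimal sketch, not a prescribed implementation: `model_fn` stands in for your actual LLM call, and the bucket heuristics are illustrative placeholders you would replace with real checks or judge models.

```python
import json

# Failure buckets from the checklist -- each gets a different fix.
BUCKETS = ("pass", "refusal", "hallucination", "formatting", "tool_misuse")

def classify(case: dict, output: str) -> str:
    """Toy bucket heuristics; swap in your own checks or judge model."""
    if any(p in output.lower() for p in ("i can't", "i cannot", "i'm unable")):
        return "refusal"
    if case.get("must_be_json"):
        try:
            json.loads(output)
        except ValueError:
            return "formatting"
    if any(fact not in output for fact in case.get("must_contain", [])):
        return "hallucination"
    return "pass"

def run_eval(golden_set: list, model_fn) -> dict:
    """Run every golden question and tally failure buckets."""
    counts = {b: 0 for b in BUCKETS}
    for case in golden_set:
        counts[classify(case, model_fn(case["question"]))] += 1
    return counts

def failure_delta(baseline: dict, candidate: dict) -> dict:
    """Per-bucket change vs the last accepted run -- the diff you review."""
    return {b: candidate[b] - baseline[b] for b in BUCKETS}
```

On every prompt or model change, rerun `run_eval` over the same golden set and review `failure_delta` against the stored baseline rather than eyeballing outputs.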
What “good enough” looks like
No ML purity contests: the bar is fewer severe regressions per deploy, a stable latency envelope, and humans reviewing only the subset of outputs that changed.
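Those two gates - severe-bucket regressions and the latency envelope - can be expressed as a simple deploy check. The severe bucket names and the p95 budget below are assumptions for illustration, not fixed thresholds.

```python
def gate(deltas: dict, latencies_ms: list, p95_budget_ms: float,
         severe=("hallucination", "tool_misuse")) -> bool:
    """Block the deploy if any severe bucket grew or p95 latency blew its budget."""
    if any(deltas.get(b, 0) > 0 for b in severe):
        return False
    # Nearest-rank p95 over the eval run's latencies.
    p95 = sorted(latencies_ms)[int(0.95 * (len(latencies_ms) - 1))]
    return p95 <= p95_budget_ms
```

Wire this into CI so a red gate stops the rollout; everything below the severity line goes to the human review queue instead.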
Once that exists, traffic scaling becomes an ops discussion - not roulette.