
Evals before you scale AI - the boring checklist

Before doubling traffic on an LLM feature: golden questions, regression harnesses, and refusing to ship vibes-only improvements.

Shipping AI features without evals is shipping blind. “Feels smarter” isn’t a release note your COSS users can rely on.

Minimum viable rig

  1. Golden set: 30-100 questions your users actually ask, each labeled with the expected behavior (even if that label is subjective).
  2. Regression runs on every prompt/model change: same harness, diff the outputs, compare failure deltas - not vibes (a minimal sketch follows this list).
  3. Explicit failure buckets: refusal vs hallucination vs formatting vs tool misuse - each bucket gets a different fix.
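
Here is a minimal sketch of what that rig could look like in Python. It assumes the golden set is stored as one JSON object per line with id, question, and expected fields; the names (load_golden, judge, run_harness, diff_runs, the bucket labels) are illustrative placeholders, and call_model is whatever wraps the prompt/model under test.

```python
import json
from collections import Counter
from typing import Callable

# Illustrative bucket names - rename to match your own failure taxonomy.
BUCKETS = ("pass", "refusal", "hallucination", "formatting", "tool_misuse")


def load_golden(path: str) -> list[dict]:
    """Read the golden set: one JSON object per line with id, question, expected."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def judge(question: str, expected: str, answer: str) -> str:
    """Placeholder grader - swap in rubric checks or an LLM judge.
    Must return one of BUCKETS ("pass" or a failure bucket)."""
    raise NotImplementedError


def run_harness(golden: list[dict], call_model: Callable[[str], str]) -> dict[str, str]:
    """Map each golden case id to the bucket its answer landed in."""
    return {
        case["id"]: judge(case["question"], case["expected"], call_model(case["question"]))
        for case in golden
    }


def diff_runs(baseline: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """Report failure deltas between two runs and return newly broken case ids."""
    grew = Counter(candidate.values()) - Counter(baseline.values())  # buckets that got bigger
    print("bucket growth:", dict(grew))
    regressions = [
        cid for cid, bucket in candidate.items()
        if bucket != "pass" and baseline.get(cid) == "pass"
    ]
    print(f"{len(regressions)} new regressions: {regressions}")
    return regressions
```

A prompt or model change then becomes `diff_runs(run_harness(golden, old_model), run_harness(golden, new_model))`: the same harness both times, with the argument about shipping reduced to which buckets grew and which cases newly broke.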

What “good enough” looks like

No ML purity contests: fewer severe regressions per deploy, a stable latency envelope, and humans reviewing only the churn subset - the cases whose outcome changed since the last run.
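
As a sketch of how that bar could be encoded, reusing the hypothetical run dicts from the harness above - churn, gate, and the regression/latency thresholds are placeholders to tune per feature, not prescribed values.

```python
def churn(baseline: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """Cases whose bucket changed between runs - the only outputs humans re-read."""
    return [cid for cid in candidate if candidate[cid] != baseline.get(cid)]


def gate(baseline: dict[str, str], candidate: dict[str, str],
         baseline_p95_ms: float, candidate_p95_ms: float,
         max_new_regressions: int = 0, latency_slack: float = 1.10) -> bool:
    """Block the deploy if regressions grow or latency drifts past the envelope."""
    new_regressions = sum(
        1 for cid, bucket in candidate.items()
        if bucket != "pass" and baseline.get(cid) == "pass"
    )
    latency_ok = candidate_p95_ms <= baseline_p95_ms * latency_slack
    return new_regressions <= max_new_regressions and latency_ok
```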

Once that exists, traffic scaling becomes an ops discussion - not roulette.