Journal

Shipping AI APIs without 3 a.m. pages

What we wire in from the first deploy: limits, logging, and fallbacks so LLM features fail safely when providers or prompts misbehave.

An LLM endpoint is still an API: it needs timeouts, concurrency limits, and observability like anything else in production. The model is non-deterministic; the system around it shouldn’t be.
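As a minimal sketch of that idea, here is one way to put both a hard timeout and a concurrency cap around a blocking vendor call using only the standard library. Everything here is illustrative: `call_model` is a stand-in for a real SDK call, and the names, pool size, and fallback message are assumptions, not a prescribed implementation.

```python
import concurrent.futures
import time

# A bounded thread pool doubles as a concurrency limit on vendor calls:
# at most max_workers requests are in flight at once.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_model(prompt: str, delay_s: float = 0.0) -> str:
    # Stand-in for a blocking vendor SDK call; delay_s simulates a slow provider.
    time.sleep(delay_s)
    return f"echo: {prompt}"

def call_with_timeout(prompt: str, timeout_s: float = 5.0, delay_s: float = 0.0) -> str:
    future = _POOL.submit(call_model, prompt, delay_s)
    try:
        # Hard wall-clock deadline on the vendor call.
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Degraded path: a clear answer back to the caller, never a hang.
        return "Please try again in a moment."
```

The deterministic part is the envelope: whatever the model does, the caller gets an answer within `timeout_s`.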

Non-negotiables

  1. Hard timeouts on vendor calls, with a clear degraded response path (cached answer, “try again,” or human handoff - never a hang).
  2. Token and request budgets per tenant or per user, so one bad retry loop can’t burn through a month’s budget.
  3. Structured logging: request id, model id, latency, and outcome (success, timeout, rate limit, guardrail block) - no raw prompts in logs unless policy says so.
  4. Idempotency for anything that triggers side effects (tickets, payments, emails).
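Items 2-4 can be sketched together as a single request handler: check the budget, short-circuit duplicates by idempotency key, and emit a structured log line with ids and an outcome but no prompt text. This is an illustrative in-memory sketch - the dict-backed budget store, the `SEEN_KEYS` set, and all names are assumptions; production would back both with a shared store.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm")

# Hypothetical in-memory state; a real service would use a shared store.
TENANT_BUDGETS = {"acme": 100_000}   # tokens remaining this period
SEEN_KEYS: set[str] = set()          # idempotency keys already processed

def handle_request(tenant: str, tokens_needed: int, idempotency_key: str) -> str:
    request_id = str(uuid.uuid4())
    start = time.monotonic()

    # Idempotency: a retried request with the same key must not
    # re-trigger side effects (tickets, payments, emails).
    if idempotency_key in SEEN_KEYS:
        outcome = "duplicate"
    elif TENANT_BUDGETS.get(tenant, 0) < tokens_needed:
        # Budget check: one bad loop stops here instead of running all month.
        outcome = "budget_exceeded"
    else:
        TENANT_BUDGETS[tenant] -= tokens_needed
        SEEN_KEYS.add(idempotency_key)
        outcome = "success"

    # Structured log line: request id, latency, outcome - no raw prompt text.
    logger.info(json.dumps({
        "request_id": request_id,
        "tenant": tenant,
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        "outcome": outcome,
    }))
    return outcome
```

Because every path - success, duplicate, budget block - emits the same structured record, the failure modes show up in dashboards before they show up in support tickets.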

What we add as you scale

The goal isn’t “no failures” - it’s “no surprises”: failures show up in your metrics first, and users see a controlled, degraded experience instead of a hang or a stack trace.