Shipping AI APIs without 3 a.m. pages
What we wire in from the first deploy: limits, logging, and fallbacks so LLM features fail safely when providers or prompts misbehave.
An LLM endpoint is still an API: it needs timeouts, concurrency limits, and observability like anything else in production. The model is non-deterministic; the system around it shouldn’t be.
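As a concrete starting point, here is a minimal sketch of that wrapping, assuming an async Python service; `call_provider` is a hypothetical stand-in for the vendor SDK call, and the concurrency and timeout values are illustrative:

```python
import asyncio

MAX_CONCURRENT = 8        # cap on in-flight vendor calls per process (assumed value)
VENDOR_TIMEOUT_S = 10.0   # hard timeout on the vendor call (assumed value)

_semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def call_provider(prompt: str) -> str:
    """Placeholder for the real vendor SDK call."""
    await asyncio.sleep(0.1)
    return f"echo: {prompt}"

async def generate(prompt: str) -> dict:
    async with _semaphore:  # concurrency limit: excess requests wait here
        try:
            text = await asyncio.wait_for(call_provider(prompt), VENDOR_TIMEOUT_S)
            return {"outcome": "success", "text": text}
        except asyncio.TimeoutError:
            # Degraded path: a clear "try again" instead of a hang.
            return {"outcome": "timeout", "text": "Please try again in a moment."}
```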
Non-negotiables
- Hard timeouts on vendor calls, with a clear degraded response path (cached answer, “try again,” or human handoff - never a hang).
- Token and request budgets per tenant or per user, so one bad loop can’t burn through the month’s budget (budget sketch after this list).
- Structured logging: request id, model id, latency, and outcome (success, timeout, rate limit, guardrail block) - no raw prompts in logs unless policy says so; a log sketch follows this list.
- Idempotency for anything that triggers side effects (tickets, payments, emails) - sketch below.
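A minimal sketch of the per-tenant budget check, assuming an in-memory counter; a real deployment would persist usage in Redis or a database and reset it monthly:

```python
from collections import defaultdict

MONTHLY_TOKEN_BUDGET = 2_000_000   # assumed limit per tenant

_usage: dict[str, int] = defaultdict(int)

def charge_tokens(tenant_id: str, tokens: int) -> bool:
    """Return True if the tenant is still within budget, False to reject the call."""
    if _usage[tenant_id] + tokens > MONTHLY_TOKEN_BUDGET:
        return False               # budget exhausted: block before the vendor call
    _usage[tenant_id] += tokens
    return True
```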
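The structured log line can stay boring. A sketch using the standard library; the field names are illustrative, and note there is no prompt text:

```python
import json
import logging
import time

logger = logging.getLogger("llm_api")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_llm_call(request_id: str, model_id: str, started: float, outcome: str) -> None:
    """Emit one JSON log line per LLM call; `started` is a time.monotonic() timestamp."""
    logger.info(json.dumps({
        "request_id": request_id,
        "model_id": model_id,
        "latency_ms": round((time.monotonic() - started) * 1000),
        "outcome": outcome,   # success | timeout | rate_limit | guardrail_block
        # deliberately no prompt or completion text
    }))
```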
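And a sketch of idempotency for side effects, assuming the client sends an idempotency key with each request; the ticket creation here is a stand-in for the real side effect, and the store would be a database in practice:

```python
_results: dict[str, dict] = {}

def create_ticket_once(idempotency_key: str, payload: dict) -> dict:
    """Create a ticket at most once per key; replays return the stored result."""
    if idempotency_key in _results:
        return _results[idempotency_key]   # duplicate request: no second ticket
    result = {"ticket_id": f"T-{len(_results) + 1}", "payload": payload}  # stand-in side effect
    _results[idempotency_key] = result
    return result
```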
What we add as you scale
- Queue + worker for long jobs instead of holding an HTTP connection open (sketch below).
- Eval runs on new prompts or model versions before they hit 100% of traffic.
- Feature flags to roll out changes to a slice of users (rollout sketch below).
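A sketch of the queue + worker split, kept in one asyncio process for brevity; in production the queue would be something like Redis or SQS and the worker a separate service:

```python
import asyncio
import uuid

jobs: dict[str, dict] = {}
queue: asyncio.Queue[str] = asyncio.Queue()

async def submit(prompt: str) -> str:
    """HTTP handler body: enqueue the job and return an id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "prompt": prompt, "result": None}
    await queue.put(job_id)
    return job_id                      # client polls for the result by job id

async def worker() -> None:
    """Runs separately, pulling jobs and doing the slow LLM work."""
    while True:
        job_id = await queue.get()
        jobs[job_id]["status"] = "running"
        await asyncio.sleep(0.1)       # stand-in for the long LLM call
        jobs[job_id].update(status="done", result="...")
        queue.task_done()
```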
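And a sketch of a percentage rollout for a new model version, using a deterministic hash of the user id so each user sticks to one variant; the model names and percentage are assumptions:

```python
import hashlib

ROLLOUT_PERCENT = 5   # assumed slice of users on the new version

def model_for(user_id: str) -> str:
    """Bucket users 0-99 deterministically; the first ROLLOUT_PERCENT get the new model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "model-v2" if bucket < ROLLOUT_PERCENT else "model-v1"
```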
The goal isn’t “no failures” - it’s no surprises: failures show up in metrics first, and users see a controlled experience.