
AI Engineering

LLM Observability: Building Eval Pipelines That Actually Catch Problems

Logging prompts and responses is not observability. Here is how to build eval pipelines that surface hallucinations, semantic drift, and cost spikes before your users do.

15 Jan 2026 · 2 min read
LLM · Observability · Production

The Logging Trap

When you first deploy an LLM-powered feature, logging the prompt and response feels sufficient. Six months later, users are quietly churning and you have no idea why. Logging is not observability. Observability is the ability to ask arbitrary questions about system behaviour from the outside — for LLMs, that means evaluating correctness, not just latency.

The Four Signals That Matter

  • Groundedness: Is the response factually anchored to retrieved context? An NLI classifier or LLM-as-judge evaluator running asynchronously on sampled traffic catches hallucinations before they compound.
  • Retrieval relevance: Did the vector search surface the right chunks? Track recall@k against a golden evaluation set you update every sprint.
  • Semantic drift: Are responses shifting in tone, length, or style over weeks? Embedding-based distance from a baseline corpus surfaces prompt-injection attempts and silent model version updates.
  • Cost per query: Token budgets spiral without guardrails. Track input and output tokens per session, segment by feature, and alert on anomalies.
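Of the four, cost per query is the easiest to wire up today. A minimal sketch of per-query anomaly alerting using a rolling z-score — the class name, window size, warm-up threshold, and per-token prices are all illustrative assumptions, not measured values:

```python
from collections import deque
from statistics import mean, stdev


class CostAnomalyDetector:
    """Flag queries whose cost is a statistical outlier vs recent traffic."""

    def __init__(self, window: int = 200, z_threshold: float = 3.0):
        self.costs = deque(maxlen=window)  # rolling window of recent costs
        self.z_threshold = z_threshold

    def record(self, input_tokens: int, output_tokens: int,
               in_price: float = 0.25e-6, out_price: float = 1.25e-6) -> bool:
        """Record one query's token usage; return True if its cost is anomalous.

        Prices are assumed $/token figures; substitute your model's rates.
        """
        cost = input_tokens * in_price + output_tokens * out_price
        anomalous = False
        if len(self.costs) >= 30:  # wait for a warm-up sample before alerting
            mu, sigma = mean(self.costs), stdev(self.costs)
            anomalous = sigma > 0 and (cost - mu) / sigma > self.z_threshold
        self.costs.append(cost)
        return anomalous
```

Segmenting by feature is then just one detector instance per feature key.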

Building the Async Eval Pipeline

The most reliable eval architecture runs in three stages. First, online sampling — capture 5% of live traces asynchronously via a background queue, never blocking the critical path. Second, async evaluation — route each trace through a battery of scorers: groundedness (RAGAS), relevance (cosine similarity), toxicity (Perspective API), and at least one custom task-specific rubric. Third, a feedback loop — low-scoring traces land in a human review queue that doubles as fine-tuning data.
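The three stages above can be sketched with a plain in-process queue. The 0.5 review threshold, the scorer signatures, and `review_sink` are illustrative assumptions — in production the queue would be Redis, SQS, or similar, and the worker a separate process:

```python
import queue
import random

SAMPLE_RATE = 0.05  # evaluate 5% of live traffic
eval_queue: "queue.Queue" = queue.Queue(maxsize=10_000)


def capture_trace(trace: dict) -> None:
    """Stage 1: called on the serving path. O(1) and never blocks."""
    if random.random() < SAMPLE_RATE:
        try:
            eval_queue.put_nowait(trace)
        except queue.Full:
            pass  # drop the sample rather than stall a user request


def eval_worker(scorers: dict, review_sink) -> None:
    """Stages 2-3: runs in a background thread, off the critical path."""
    while True:
        trace = eval_queue.get()
        if trace is None:  # sentinel: shut the worker down cleanly
            break
        # Stage 2: run the full battery of scorers on the trace.
        trace["scores"] = {name: fn(trace) for name, fn in scorers.items()}
        # Stage 3: low-scoring traces feed human review / fine-tuning data.
        if min(trace["scores"].values()) < 0.5:
            review_sink(trace)
        eval_queue.task_done()
```

The only code on the hot path is `capture_trace`; everything expensive happens behind the queue.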

Evaluation should be fully decoupled from serving. Eval workers can use cheaper models (Claude Haiku, GPT-4o-mini) without impacting user-facing latency. Using this approach, we ran 50,000 eval traces a day for under $8.
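That figure checks out on the back of an envelope. The tokens-per-eval and blended price below are assumptions for illustration, not measured values from our pipeline:

```python
TRACES_PER_DAY = 50_000
TOKENS_PER_EVAL = 1_200   # assumed: judge prompt + retrieved context + verdict
PRICE_PER_MTOK = 0.125    # assumed blended $/1M tokens for a small judge model

daily_cost = TRACES_PER_DAY * TOKENS_PER_EVAL * PRICE_PER_MTOK / 1_000_000
print(f"${daily_cost:.2f}/day")  # $7.50/day under these assumptions
```

The lesson generalises: at small-judge-model prices, even aggressive sampling rates cost less per day than one hour of engineer time.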

Tooling That Holds Up in Production

LangSmith is the most mature option if you are already on LangChain. For framework-agnostic setups, OpenTelemetry with a custom OTLP exporter into ClickHouse gives full query flexibility at low cost and no vendor lock-in. Avoid proprietary observability SaaS early — they extract maximum value exactly when you are most locked in.

For LLM-as-judge scoring, RAGAS and the Prometheus evaluator models are good foundations, but expect to write custom scorers for your domain. Generic evals miss task-specific failure modes every time.
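A custom rubric scorer usually reduces to a prompt template plus a strict parser. A sketch under assumptions — the rubric text and the `SCORE: <n>` format are hypothetical, and the judge-model call itself is omitted; plug the built prompt into whatever client you use:

```python
import re

RUBRIC = """You are grading a support-bot answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Score 1-5 for task-specific correctness (does the answer resolve the
user's actual problem, not just restate the context?).
Reply with exactly one line: SCORE: <n>"""


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Fill the rubric template for one trace."""
    return RUBRIC.format(question=question, context=context, answer=answer)


def parse_judge_score(raw: str) -> float:
    """Parse 'SCORE: n' defensively; judges drift from the format."""
    match = re.search(r"SCORE:\s*([1-5])", raw)
    if not match:
        return 0.0  # unparseable verdicts count as failures, not skips
    return int(match.group(1)) / 5.0  # normalise to 0-1 like other scorers
```

Treating a malformed verdict as a failing score, rather than dropping it, keeps the judge itself inside your observability loop: a spike in zeros tells you the judge prompt broke.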

The Mindset Shift

Teams that catch LLM regressions before users do treat evaluation as a first-class engineering concern — not an afterthought bolted on after a bad week of support tickets. Budget engineering time for your eval harness the same way you budget for tests. In production AI systems, they are the same thing.


Deepak Kushwaha