
AI Engineering

Building RAG Pipelines at Scale: Lessons from Production

What nobody tells you about retrieval-augmented generation when you move from prototype to production: chunking strategies, re-ranking, eval loops, and the surprising cost of naive embeddings.

28 Mar 2025 · 1 min read
LLM · Production · RAG

The Prototype Lie

Every RAG prototype works. You chunk a PDF, embed it, push it into a vector store, run a semantic search, and get impressive results in an afternoon. Then you move to production — 10 million documents, 500 concurrent users, a P99 latency SLA — and everything breaks differently.
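That afternoon prototype can be sketched in a few lines. Everything below is a toy stand-in: `embed` is a bag-of-characters vector in place of a real embedding model, and the "vector store" is a plain list with brute-force cosine search.

```python
import math

def chunk_fixed(text: str, size: int = 200) -> list[str]:
    # Naive fixed-size chunking: cut every `size` characters,
    # ignoring sentence and table boundaries entirely.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> list[float]:
    # Toy "embedding": normalized letter-frequency vector, standing in
    # for a call to a real embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def search(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Brute-force cosine similarity over every chunk -- fine for a demo,
    # hopeless at 10 million documents.
    qv = embed(query)
    score = lambda c: sum(a * b for a, b in zip(qv, embed(c)))
    return sorted(chunks, key=score, reverse=True)[:k]
```

Every piece of this sketch is the thing that breaks at scale: the chunker, the single embedding call, and the exhaustive scan.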

Chunking Strategy Is Not a Detail

The single biggest lever on retrieval quality is how you chunk. Naive fixed-size chunking works in demos. In production, it splits sentences mid-thought, breaks tabular data across chunks, and buries the most relevant snippet in a chunk of noise.

The strategies that actually move recall metrics: semantic chunking (split on topic shift, not token count), recursive chunking with per-document size tuning, and parent-child chunking (embed small child chunks, retrieve parent context). Measure recall@5 on a golden evaluation set before shipping any chunking change.
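Parent-child chunking is the least obvious of the three, so here is a minimal sketch. The splitting is character-based purely for brevity (a real implementation would split on sentences or topic shifts); the key idea is the mapping: embed and match against small child chunks, then return the larger parent passage as context.

```python
from dataclasses import dataclass

@dataclass
class ChildChunk:
    text: str       # small span that gets embedded and searched
    parent_id: int  # index of the larger parent passage returned at query time

def parent_child_chunks(doc: str, parent_size: int = 400,
                        child_size: int = 100) -> tuple[list[str], list[ChildChunk]]:
    # Split the document into large parent passages...
    parents = [doc[i:i + parent_size] for i in range(0, len(doc), parent_size)]
    # ...then split each parent into small, precisely-matchable children
    # that remember which parent they came from.
    children = [
        ChildChunk(parent[j:j + child_size], pid)
        for pid, parent in enumerate(parents)
        for j in range(0, len(parent), child_size)
    ]
    return parents, children

# At query time: run vector search over children, but feed the model
# parents[best_child.parent_id] so it sees the surrounding context.
```

The retrieval precision of small chunks, with the answer context of large ones.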

Re-ranking Is Not Optional at Scale

Vector similarity is a fast approximation — at 10 million documents, the top 5 results are often not the most relevant 5. A two-stage retrieval pipeline changes this: broad retrieval (top 50 via ANN) followed by a cross-encoder re-ranker (top 5 by semantic relevance). Cross-encoders are slower (50–150ms per query) but dramatically more accurate. Cohere Rerank and BGE-reranker-large are both solid choices; the latter is self-hostable.
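The two-stage shape is simple enough to capture generically. In this sketch, `ann_search` and `cross_score` are placeholder callables: in production they would be your ANN index query and a cross-encoder such as BGE-reranker (e.g. via sentence-transformers), but here the demo uses word overlap so the pipeline structure stands on its own.

```python
from typing import Callable

def two_stage_retrieve(query: str, corpus: list[str],
                       ann_search: Callable, cross_score: Callable,
                       k_broad: int = 50, k_final: int = 5) -> list[str]:
    # Stage 1: cheap, approximate retrieval over the full corpus.
    candidates = ann_search(query, corpus, k_broad)
    # Stage 2: expensive, accurate scoring over just the candidates.
    reranked = sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)
    return reranked[:k_final]

# Toy stand-ins for the two stages:
def overlap(q: str, d: str) -> int:
    return len(set(q.split()) & set(d.split()))

def toy_ann(q: str, corpus: list[str], k: int) -> list[str]:
    return sorted(corpus, key=lambda d: overlap(q, d), reverse=True)[:k]
```

The structural point: the cross-encoder's 50–150 ms cost is paid over 50 candidates, never over 10 million documents.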

The Hidden Cost of Naive Embeddings

OpenAI's text-embedding-3-large costs $0.00013 per 1K tokens — negligible for a prototype, devastating at scale. For 10 million 512-token chunks, initial embedding costs $665. We cut embedding costs by 80% by moving to a self-hosted bge-base-en-v1.5 model and batching re-indexing jobs during off-peak hours.
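The arithmetic behind that $665 figure is worth making explicit, because it compounds with every re-index:

```python
def embedding_cost(n_chunks: int, tokens_per_chunk: int,
                   price_per_1k_tokens: float) -> float:
    # Total cost = total tokens / 1000 * price per 1K tokens.
    return n_chunks * tokens_per_chunk / 1000 * price_per_1k_tokens

# 10M chunks of 512 tokens at $0.00013 per 1K tokens:
initial = embedding_cost(10_000_000, 512, 0.00013)   # $665.60
# Re-embedding 10% of the corpus weekly, over a year:
churn = 52 * embedding_cost(1_000_000, 512, 0.00013)
```

The one-off indexing cost is rarely the problem; re-embedding on every chunking change or model upgrade is.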

Eval Loops Are the Product

Build a golden evaluation set from day one: 200–500 question-answer pairs that represent real user queries. Measure recall@k, answer faithfulness (RAGAS), and answer relevance on every deployment. Without it, you are guessing whether changes helped or hurt. With it, you can iterate weekly and quantify improvement.
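Recall@k over a golden set is a few lines of code, which is exactly why there is no excuse to skip it. Here `retrieve` is whatever your pipeline's retrieval function is; the golden set is a list of (query, id-of-the-relevant-document) pairs.

```python
from typing import Callable

def recall_at_k(golden: list[tuple[str, str]],
                retrieve: Callable[[str, int], list[str]],
                k: int = 5) -> float:
    # Fraction of golden queries whose known-relevant document
    # appears anywhere in the top-k retrieved results.
    hits = sum(1 for query, relevant_id in golden
               if relevant_id in retrieve(query, k))
    return hits / len(golden)

# Toy index standing in for a real retrieval pipeline:
index = {"q1": ["d1", "d2"], "q2": ["d9", "d4"]}
toy_retrieve = lambda q, k: index.get(q, [])[:k]
```

Run it in CI on every deployment; a chunking change that drops recall@5 by three points should never reach production silently.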

Async Is a Requirement

A synchronous RAG request takes 300–800ms under ideal conditions; under load, its sequential stages queue and the latencies stack. Fire the vector search and metadata queries in parallel, await the combined results, then feed the re-ranker. This alone cuts median latency by 35–40% in most production pipelines.


Deepak Kushwaha