Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) grounds language model outputs in retrieved evidence, reducing hallucination and enabling knowledge updates without retraining. This article covers the retrieval pipeline, embedding models, chunking strategies, and evaluation methodology.
The Problem RAG Solves
LLMs have two fundamental limitations that RAG addresses:
- Knowledge cutoff. Parametric knowledge is frozen at pretraining time. The model cannot answer questions about events, documents, or data that postdate its training.
- Hallucination. When the model lacks knowledge, it generates plausible-sounding but fabricated content. There is no mechanism to distinguish “I know this” from “I’m generating something that fits the pattern.”
RAG addresses both by retrieving relevant documents at inference time and conditioning the generation on retrieved context. The model generates from evidence rather than from memory.
Architecture Overview
A RAG system has two components:
Retriever. Maps a query $q$ to a set of $k$ relevant passages $\{d_1, \ldots, d_k\}$ from a corpus $\mathcal{D}$.
Generator. An LLM that produces an answer $y$ conditioned on both the query and retrieved passages: $p(y \mid q, d_1, \ldots, d_k)$.
The retriever is typically a dense embedding model, and the generator is any instruction-tuned LLM. The two components can be trained jointly or independently; in practice, independent training dominates because it allows swapping either component.
Dense Retrieval
Embedding Models
Dense retrieval encodes queries and documents into a shared vector space where semantic similarity corresponds to vector proximity.
Bi-encoder architecture. Separate encoders for queries and documents (though often the same model) produce embeddings whose similarity is a dot product or cosine: $s(q, d) = E_q(q) \cdot E_d(d)$.
Documents are encoded offline and indexed; at query time, only the query needs encoding. This decoupling enables sub-second retrieval over millions of documents.
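This offline-index / online-query split can be sketched in a few lines of NumPy. The toy `docs` matrix and 3-dimensional embeddings below are illustrative stand-ins for a real embedding model's output:

```python
import numpy as np

def search(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3) -> list:
    """Return indices of the k documents most similar to the query (cosine)."""
    # Normalize both sides so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                        # one similarity score per document
    return np.argsort(-scores)[:k].tolist()

# Toy corpus: 4 "documents" in a 3-dimensional embedding space, encoded offline.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
print(search(np.array([1.0, 0.05, 0.0]), docs, k=2))  # → [0, 1]
```

Only the query is embedded at request time; the document matrix is precomputed and, in production, held in an ANN index rather than searched exhaustively.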
Training objectives. Contrastive learning with in-batch negatives:

$$\mathcal{L} = -\log \frac{\exp(s(q, d^+)/\tau)}{\exp(s(q, d^+)/\tau) + \sum_{d^- \in N} \exp(s(q, d^-)/\tau)}$$

where $d^+$ is the relevant passage, $N$ is the set of negatives (other passages in the batch), and $\tau$ is a temperature parameter. Hard negatives (passages that are topically related but not relevant) improve training substantially over random negatives.
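The in-batch contrastive objective can be sketched with NumPy, assuming the query and passage embeddings for a batch are already computed (row $i$ of each matrix is a positive pair; every other row serves as a negative):

```python
import numpy as np

def in_batch_contrastive_loss(q_emb: np.ndarray, p_emb: np.ndarray,
                              tau: float = 0.05) -> float:
    """InfoNCE loss: row i of q_emb pairs with row i of p_emb;
    all other passages in the batch act as negatives."""
    # Normalize, then build the temperature-scaled similarity matrix.
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    p = p_emb / np.linalg.norm(p_emb, axis=1, keepdims=True)
    sim = (q @ p.T) / tau
    # Log-softmax over each row; the diagonal holds the positive pair.
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When queries match their positives exactly the loss approaches zero; mismatched pairs drive it up, pulling positives together and pushing in-batch negatives apart.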
Key models. OpenAI text-embedding-3, Cohere embed-v3, BGE, E5, GTE. Embedding dimensions are typically 768–1536. Modern embedding models are trained on diverse retrieval tasks (question-answer pairs, paraphrase detection, semantic similarity) to produce general-purpose representations.
Vector Databases
Dense vectors require specialized storage and search infrastructure:
| System | Index Type | Key Feature |
|---|---|---|
| FAISS (Meta) | IVF, HNSW, PQ | In-memory, GPU-accelerated, industry standard |
| Pinecone | Managed HNSW | Serverless, metadata filtering |
| Weaviate | HNSW | Hybrid search (dense + sparse), GraphQL API |
| Qdrant | HNSW | Filtering, payload indexing |
| pgvector | IVFFlat, HNSW | PostgreSQL extension, familiar ops tooling |
Approximate Nearest Neighbor (ANN) algorithms:
- IVF (Inverted File Index): Cluster vectors into Voronoi cells. At query time, search only the nearest cells. Trades recall for speed.
- HNSW (Hierarchical Navigable Small World): Build a multi-layer graph of proximity. Traverse from coarse to fine layers. State-of-the-art recall-speed tradeoff for most use cases.
- Product Quantization (PQ): Compress vectors by splitting into subvectors and quantizing each independently. Reduces memory by 8–32x with moderate recall loss.
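As a rough illustration of the IVF idea, here is a minimal NumPy sketch that assigns vectors to Voronoi cells and probes only the nearest cell(s) at query time. Real libraries such as FAISS learn the centroids with k-means and add many optimizations; the hand-picked centroids here are purely illustrative:

```python
import numpy as np

def build_ivf(vectors: np.ndarray, centroids: np.ndarray) -> dict:
    """Assign each vector to its nearest centroid (its Voronoi cell)."""
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    cells = {c: [] for c in range(len(centroids))}
    for i, c in enumerate(dists.argmin(axis=1)):
        cells[int(c)].append(i)
    return cells

def ivf_search(query, vectors, centroids, cells, n_probe=1, k=1):
    """Search only the n_probe cells nearest to the query (trades recall for speed)."""
    cell_dists = np.linalg.norm(centroids - query, axis=1)
    probe = cell_dists.argsort()[:n_probe]              # cells to visit
    cand = [i for c in probe for i in cells[int(c)]]    # candidate vector ids
    dists = np.linalg.norm(vectors[cand] - query, axis=1)
    return [cand[i] for i in np.argsort(dists)[:k]]
```

With `n_probe=1` the search touches only one cell's vectors; raising `n_probe` recovers recall at the cost of scanning more candidates.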
Chunking Strategies
Documents must be split into passages before embedding. Chunking strategy significantly impacts retrieval quality.
Fixed-Size Chunking
Split documents into chunks of $n$ tokens with $o$ tokens of overlap:
- Chunk size: 256–512 tokens is typical. Larger chunks provide more context per retrieval but dilute the relevance signal. Smaller chunks are more precisely retrievable but may lack sufficient context.
- Overlap: 10–20% prevents information loss at chunk boundaries.
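A minimal fixed-size chunker over a pre-tokenized document might look like this (the token list stands in for a real tokenizer's output):

```python
def chunk_fixed(tokens: list, size: int = 512, overlap: int = 64) -> list:
    """Split a token sequence into fixed-size chunks with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    stride = size - overlap           # how far each chunk's start advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # last chunk reached the end
            break
    return chunks
```

Each chunk repeats the final `overlap` tokens of its predecessor, so a sentence straddling a boundary appears intact in at least one chunk.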
Semantic Chunking
Split at natural boundaries (paragraphs, sections, sentences) rather than fixed token counts. Methods:
- Recursive character splitting: Try paragraph breaks first, then sentences, then fixed-size as a fallback. LangChain's `RecursiveCharacterTextSplitter` implements this hierarchy.
- Embedding-based splitting: Compute sentence embeddings and split where cosine similarity between adjacent sentences drops below a threshold, detecting topic shifts.
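The embedding-based splitting idea can be sketched as follows, assuming sentence embeddings are already computed; the 0.5 threshold is an arbitrary illustrative value that would be tuned in practice:

```python
import numpy as np

def semantic_split(sent_embs: np.ndarray, threshold: float = 0.5) -> list:
    """Group consecutive sentence indices into chunks; start a new chunk
    when cosine similarity between adjacent sentences drops below threshold."""
    e = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    chunks, current = [], [0]
    for i in range(1, len(e)):
        if float(e[i - 1] @ e[i]) < threshold:   # topic shift detected
            chunks.append(current)
            current = []
        current.append(i)
    chunks.append(current)
    return chunks
```

The function returns index groups rather than text so it composes with any sentence segmenter and embedding model.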
Hierarchical Chunking
Maintain a hierarchy: document summaries at the top level, section summaries at the middle level, raw paragraphs at the bottom. Route queries to the appropriate level based on specificity. Summary-level retrieval catches broad questions; paragraph-level catches specific factual queries.
Parent-Child Chunking
Index small chunks (sentences or short paragraphs) for precise retrieval, but return the parent chunk (the full section or surrounding paragraphs) to the generator. This gives the retriever precision and the generator context.
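A minimal sketch of the parent-child bookkeeping, using a hypothetical `sections` mapping from section ids to their sentences: the children are what gets embedded and searched, the parent is what gets handed to the generator.

```python
def build_parent_child(sections: dict):
    """Flatten sections into child sentences plus a child→parent lookup.
    sections maps a section id to its list of sentences."""
    child_texts, child_to_parent = [], []
    for section_id, sentences in sections.items():
        for s in sentences:
            child_texts.append(s)               # embedded and indexed
            child_to_parent.append(section_id)  # remembered for expansion
    return child_texts, child_to_parent

def expand_to_parent(hit_index: int, child_to_parent, sections) -> str:
    """Given a retrieved child index, return the full parent section text."""
    return " ".join(sections[child_to_parent[hit_index]])
```

A hit on any one sentence expands to its whole section, giving the retriever sentence-level precision and the generator section-level context.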
Hybrid Search
Dense retrieval struggles with exact-match queries (specific names, codes, identifiers) where lexical overlap is the primary signal. Sparse retrieval (BM25) handles these well but misses semantic matches.
BM25 scores documents by term frequency with saturation and document length normalization:

$$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

where $f(t, d)$ is the frequency of term $t$ in document $d$, $k_1$ controls saturation, and $b$ controls length normalization.
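The formula translates directly into code. This is a minimal single-document scorer over a tokenized corpus (no inverted index, so it is illustrative rather than efficient):

```python
import math

def bm25_score(query: list, doc: list, corpus: list,
               k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 score of one tokenized document for a tokenized query."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n     # average document length
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        tf = doc.count(term)                              # term frequency
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score
```

Terms absent from the document contribute zero, and repeated terms saturate rather than growing linearly, which is the behavior `k1` controls.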
Hybrid search combines dense and sparse scores:

$$s_{\text{hybrid}}(q, d) = \alpha \cdot s_{\text{dense}}(q, d) + (1 - \alpha) \cdot s_{\text{sparse}}(q, d)$$

The mixing weight $\alpha$ is tuned per domain; values slightly above 0.5 are typical, slightly favoring dense retrieval. Reciprocal Rank Fusion (RRF) is an alternative that combines ranked lists without score normalization:

$$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}$$

where $R$ is the set of retrievers and $\text{rank}_r(d)$ is the rank of $d$ in retriever $r$'s list, with $k = 60$ typical. RRF is robust to score scale differences between retrievers.
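RRF is simple enough to implement directly. This sketch fuses any number of ranked id lists using the conventional $k = 60$:

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists of doc ids: each doc's score is the sum of
    1 / (k + rank) over every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks enter the score, a dense retriever's cosine similarities and BM25's unbounded scores fuse cleanly without any normalization step.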
Reranking
Initial retrieval returns $k$ candidates (typically 20–100). A reranker rescores these with a more expensive model.

Cross-encoder reranking. A cross-encoder processes the (query, document) pair jointly:

$$s(q, d) = \text{CrossEncoder}([q; d])$$

Unlike bi-encoders (which encode $q$ and $d$ independently), cross-encoders model fine-grained token-level interactions. This is roughly 100x slower but significantly more accurate, and it is practical because the reranker only processes the $k$ candidates, not the full corpus.
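The reranking step itself is a sort over rescored candidates. In this sketch a token-overlap stub stands in for the cross-encoder forward pass; a real system would call a model here (e.g. a sentence-transformers `CrossEncoder`), so the scoring function is purely illustrative:

```python
def rerank(query: str, candidates: list, top_n: int = 3) -> list:
    """Rescore retrieval candidates and keep the best top_n."""
    def score(q: str, d: str) -> float:
        # Placeholder relevance signal: token overlap. A real reranker
        # would run a cross-encoder forward pass on the (q, d) pair here.
        q_tokens, d_tokens = set(q.lower().split()), set(d.lower().split())
        return len(q_tokens & d_tokens) / max(len(q_tokens), 1)
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:top_n]
```

Only the candidate set pays the per-pair scoring cost, which is what makes an expensive scorer affordable at this stage.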
LLM-based reranking. Prompt an LLM to assess relevance of each candidate. More flexible (can handle complex relevance criteria specified in natural language) but slower and more expensive.
Cohere Rerank, FlashRank, bge-reranker are purpose-built cross-encoder rerankers that balance quality and latency.
Prompt Engineering for RAG
The generator prompt must effectively integrate retrieved context. Key patterns:
Stuffing. Concatenate all retrieved passages into the prompt:

```
Given the following context, answer the question.

Context:
{passage_1}
{passage_2}
...
{passage_k}

Question: {query}
Answer:
```

Simple and effective, but limited by context window size.
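The stuffing template can be assembled programmatically. This helper (a hypothetical name) also numbers the passages, which a citation instruction can then refer to:

```python
def build_stuffing_prompt(query: str, passages: list) -> str:
    """Assemble the stuffing prompt: numbered passages, then the question."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Given the following context, answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

In practice the passage list would first be truncated to fit the model's context window, most relevant passages first.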
Map-reduce. For large document sets: generate an answer from each passage independently (map), then synthesize across answers (reduce). Handles more documents than context window allows.
Citation. Instruct the model to cite which passage(s) support each claim. Enables verification and builds user trust. Requires careful prompt engineering to ensure citations are accurate, not hallucinated.
Advanced RAG Patterns
Query Transformation
The user’s raw query may not be optimal for retrieval. Transformations include:
- HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, then use its embedding to retrieve real documents. The hypothetical answer is often closer in embedding space to relevant documents than the original question.
- Query decomposition: Break a complex question into sub-questions, retrieve for each, and synthesize.
- Step-back prompting: Abstract the query to a higher-level question (“What is the GDP of France?” → “What are France’s key economic indicators?”), retrieve broader context, then answer the specific question.
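Of these transformations, HyDE reduces to a small amount of glue code once the model calls are abstracted. All three callables below are caller-supplied assumptions, not a specific library API:

```python
def hyde_retrieve(query: str, llm, embed, search, k: int = 5):
    """HyDE sketch: embed a hypothetical answer instead of the raw query.
    Assumed interfaces: llm(prompt) -> str, embed(text) -> vector,
    search(vector, k) -> list of passages."""
    hypothetical = llm(f"Write a short passage answering: {query}")
    return search(embed(hypothetical), k)
```

The generated passage may be factually wrong; that is acceptable, because only its embedding is used, and answer-shaped text tends to land nearer to relevant documents than question-shaped text.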
Self-RAG (Asai et al., 2023)
The model learns when to retrieve (not every query needs retrieval) and whether retrieved passages are relevant. Special tokens control the retrieval decision:
- Begin generating the response; the model can emit a retrieval decision token at any point
- If the model predicts [RETRIEVE], trigger retrieval
- Evaluate retrieved passages for relevance [RELEVANT/IRRELEVANT]
- Generate response conditioned on relevant passages
- Self-assess for factual grounding [SUPPORTED/NOT SUPPORTED]
Agentic RAG
Multi-step retrieval where the model iteratively queries, evaluates results, refines queries, and retrieves again. The model acts as an agent that plans its information-gathering strategy rather than performing a single retrieval step. This handles questions that require synthesizing information from multiple sources or that require follow-up queries based on initial findings.
Evaluation
RAG evaluation requires measuring both retrieval quality and generation quality.
Retrieval Metrics
| Metric | Formula | Measures |
|---|---|---|
| Recall@k | $\frac{\lvert \text{relevant} \cap \text{retrieved}_k \rvert}{\lvert \text{relevant} \rvert}$ | Coverage of relevant documents in the top $k$ |
| MRR | $\frac{1}{\lvert Q \rvert} \sum_{q \in Q} \frac{1}{\text{rank}_q}$ | Rank of the first relevant result |
| NDCG@k | $\frac{\text{DCG@}k}{\text{IDCG@}k}$ | Graded relevance with position discount |
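Recall@k and MRR are straightforward to compute; a minimal sketch matching the table's definitions:

```python
def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    return len(relevant & set(retrieved[:k])) / len(relevant)

def mean_reciprocal_rank(relevant_per_query: list, retrieved_per_query: list) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for relevant, retrieved in zip(relevant_per_query, retrieved_per_query):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(relevant_per_query)
```

Both require labeled (query, relevant-documents) pairs, which is why retrieval evaluation usually runs on a curated sample rather than live traffic.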
Generation Metrics
Faithfulness. Does the answer contain only information supported by the retrieved context? Measured by decomposing the answer into claims and verifying each against the passages. LLM-as-judge with structured prompting achieves reasonable agreement with human annotators.
Relevance. Does the answer address the question? Standard NLG metrics (ROUGE, BERTScore) provide weak signal; human evaluation or LLM-as-judge is more reliable.
RAGAS framework (Es et al., 2023) provides automated evaluation across faithfulness, answer relevance, and context relevance using LLM-based scoring.
Failure Modes
- Retrieval failure: Relevant documents not retrieved. Root causes: poor chunking, embedding model domain mismatch, query-document vocabulary gap.
- Context poisoning: Irrelevant retrieved passages cause the model to generate off-topic or incorrect answers. Mitigated by reranking and relevance filtering.
- Lost in the middle: Models attend more to the beginning and end of the context window, sometimes ignoring relevant information in the middle (Liu et al., 2023). Mitigated by placing the most relevant passages first.
- Over-retrieval: Too many passages dilute the signal and confuse the generator. The optimal number depends on the task and model; typically 3–5 passages for focused questions, up to 20 for synthesis tasks.
Production Considerations
Indexing pipeline. Document ingestion → chunking → embedding → vector store. Must handle incremental updates (new documents, modified documents, deletions) without full reindexing.
Latency budget. Embedding the query: ~10ms. ANN search: ~5ms. Reranking the candidates: ~50–200ms. LLM generation: 500ms–5s. The retrieval stages together stay well under 1s; end-to-end latency is dominated by generation.
Cost. Embedding costs are one-time per document (indexing) plus per-query. Storage scales linearly with corpus size. At 1536 dimensions and float32, 1M documents require ~6GB for vectors alone.
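The storage estimate is simple arithmetic; a sanity check (raw vector bytes only, ignoring index overhead and metadata):

```python
def vector_storage_bytes(n_docs: int, dim: int = 1536,
                         bytes_per_float: int = 4) -> int:
    """Raw storage for n_docs float32 vectors of the given dimension."""
    return n_docs * dim * bytes_per_float

# 1M documents at 1536 dimensions, float32:
gb = vector_storage_bytes(1_000_000) / 1e9
print(f"{gb:.2f} GB")  # → 6.14 GB
```

Product quantization or float16 storage cuts this figure substantially, at some cost in recall.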
Monitoring. Track retrieval recall (via human annotation on a sample), answer quality (user feedback, LLM-as-judge on a sample), and latency percentiles. Detect distribution shift by monitoring query embedding drift and retrieval score distributions.
Summary
| Component | Purpose | Key Decision |
|---|---|---|
| Embedding model | Encode semantics into vectors | Model choice, dimension, fine-tuning |
| Chunking | Segment documents for retrieval | Size, overlap, semantic vs fixed |
| Vector store | Index and search embeddings | ANN algorithm, metadata filtering |
| Hybrid search | Combine dense + sparse retrieval | Fusion method, mixing weight |
| Reranker | Rescore top-k candidates | Cross-encoder vs LLM-based |
| Generator prompt | Integrate context for answer | Stuffing, map-reduce, citation |
| Evaluation | Measure retrieval + generation quality | Faithfulness, relevance, RAGAS |
RAG transforms LLMs from closed-book to open-book systems. The retrieval pipeline determines what information the model sees; the generation pipeline determines how it uses that information. Both must work well for the system to produce accurate, grounded answers.