Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) grounds language model outputs in retrieved evidence, reducing hallucination and enabling knowledge updates without retraining. This article covers the retrieval pipeline, embedding models, chunking strategies, and evaluation methodology.
The Problem RAG Solves
LLMs have two fundamental limitations that RAG addresses:
- Knowledge cutoff. Parametric knowledge is frozen at pretraining time. The model cannot answer questions about events, documents, or data that postdate its training.
- Hallucination. When the model lacks knowledge, it generates plausible-sounding but fabricated content. There is no mechanism to distinguish “I know this” from “I’m generating something that fits the pattern.”
RAG addresses both by retrieving relevant documents at inference time and conditioning the generation on retrieved context. The model generates from evidence rather than from memory.
Architecture Overview
A RAG system has two components:
Retriever. Maps a query $q$ to a set of $k$ relevant passages $\{d_1, \ldots, d_k\}$ from a corpus $\mathcal{D}$.
Generator. An LLM that produces an answer $y$ conditioned on both the query and retrieved passages: $p(y \mid q, d_1, \ldots, d_k)$.
The retriever is typically a dense embedding model, and the generator is any instruction-tuned LLM. The two components can be trained jointly or independently; in practice, independent training dominates because it allows swapping either component.
Dense Retrieval
Embedding Models
Dense retrieval encodes queries and documents into a shared vector space where semantic similarity corresponds to vector proximity.
Bi-encoder architecture. Separate encoders for queries and documents (though often the same model) produce embeddings whose similarity is a dot product or cosine: $s(q, d) = E_q(q) \cdot E_d(d)$.
Documents are encoded offline and indexed; at query time, only the query needs encoding. This decoupling enables sub-second retrieval over millions of documents.
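This offline-index / online-query split can be sketched in a few lines of NumPy. The toy `docs` matrix and 3-dimensional embeddings below are illustrative stand-ins for a real embedding model's output:

```python
import numpy as np

def search(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3) -> list:
    """Return indices of the k documents most similar to the query (cosine)."""
    # Normalize both sides so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                        # one similarity score per document
    return np.argsort(-scores)[:k].tolist()

# Toy corpus: 4 "documents" in a 3-dimensional embedding space, encoded offline.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
print(search(np.array([1.0, 0.05, 0.0]), docs, k=2))  # → [0, 1]
```

Only the query is embedded at request time; the document matrix is precomputed and, in production, held in an ANN index rather than searched exhaustively.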
Training objectives. Contrastive learning with in-batch negatives:

$$\mathcal{L} = -\log \frac{\exp(s(q, d^+)/\tau)}{\exp(s(q, d^+)/\tau) + \sum_{d^- \in N} \exp(s(q, d^-)/\tau)}$$

where $d^+$ is the relevant passage, $N$ is the set of negatives (other passages in the batch), and $\tau$ is a temperature parameter. Hard negatives (passages that are topically related but not relevant) improve training substantially over random negatives.
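The in-batch contrastive objective can be sketched with NumPy, assuming the query and passage embeddings for a batch are already computed (row $i$ of each matrix is a positive pair; every other row serves as a negative):

```python
import numpy as np

def in_batch_contrastive_loss(q_emb: np.ndarray, p_emb: np.ndarray,
                              tau: float = 0.05) -> float:
    """InfoNCE loss: row i of q_emb pairs with row i of p_emb;
    all other passages in the batch act as negatives."""
    # Normalize, then build the temperature-scaled similarity matrix.
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    p = p_emb / np.linalg.norm(p_emb, axis=1, keepdims=True)
    sim = (q @ p.T) / tau
    # Log-softmax over each row; the diagonal holds the positive pair.
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When queries match their positives exactly the loss approaches zero; mismatched pairs drive it up, pulling positives together and pushing in-batch negatives apart.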
Key models. OpenAI text-embedding-3, Cohere embed-v3, BGE, E5, GTE. Embedding dimensions are typically 768–1536. Modern embedding models are trained on diverse retrieval tasks (question-answer pairs, paraphrase detection, semantic similarity) to produce general-purpose representations.
Vector Databases
Dense vectors require specialized storage and search infrastructure:
| System | Index Type | Key Feature |
|---|---|---|
| FAISS (Meta) | IVF, HNSW, PQ | In-memory, GPU-accelerated, industry standard |
| Pinecone | Managed HNSW | Serverless, metadata filtering |
| Weaviate | HNSW | Hybrid search (dense + sparse), GraphQL API |
| Qdrant | HNSW | Filtering, payload indexing |
| pgvector | IVFFlat, HNSW | PostgreSQL extension, familiar ops tooling |
Approximate Nearest Neighbor (ANN) algorithms:
- IVF (Inverted File Index): Cluster vectors into Voronoi cells. At query time, search only the nearest cells. Trades recall for speed.
- HNSW (Hierarchical Navigable Small World): Build a multi-layer graph of proximity. Traverse from coarse to fine layers. State-of-the-art recall-speed tradeoff for most use cases.
- Product Quantization (PQ): Compress vectors by splitting into subvectors and quantizing each independently. Reduces memory by 8–32x with moderate recall loss.
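As a rough illustration of the IVF idea, here is a minimal NumPy sketch that assigns vectors to Voronoi cells and probes only the nearest cell(s) at query time. Real libraries such as FAISS learn the centroids with k-means and add many optimizations; the hand-picked centroids here are purely illustrative:

```python
import numpy as np

def build_ivf(vectors: np.ndarray, centroids: np.ndarray) -> dict:
    """Assign each vector to its nearest centroid (its Voronoi cell)."""
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    cells = {c: [] for c in range(len(centroids))}
    for i, c in enumerate(dists.argmin(axis=1)):
        cells[int(c)].append(i)
    return cells

def ivf_search(query, vectors, centroids, cells, n_probe=1, k=1):
    """Search only the n_probe cells nearest to the query (trades recall for speed)."""
    cell_dists = np.linalg.norm(centroids - query, axis=1)
    probe = cell_dists.argsort()[:n_probe]              # cells to visit
    cand = [i for c in probe for i in cells[int(c)]]    # candidate vector ids
    dists = np.linalg.norm(vectors[cand] - query, axis=1)
    return [cand[i] for i in np.argsort(dists)[:k]]
```

With `n_probe=1` the search touches only one cell's vectors; raising `n_probe` recovers recall at the cost of scanning more candidates.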
Chunking Strategies
Documents must be split into passages before embedding. Chunking strategy significantly impacts retrieval quality.
Fixed-Size Chunking
Split documents into chunks of $n$ tokens with $o$ tokens of overlap:
- Chunk size: 256–512 tokens is typical. Larger chunks provide more context per retrieval but dilute the relevance signal. Smaller chunks are more precisely retrievable but may lack sufficient context.
- Overlap: 10–20% prevents information loss at chunk boundaries.
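A minimal fixed-size chunker over a pre-tokenized document might look like this (the token list stands in for a real tokenizer's output):

```python
def chunk_fixed(tokens: list, size: int = 512, overlap: int = 64) -> list:
    """Split a token sequence into fixed-size chunks with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    stride = size - overlap           # how far each chunk's start advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # last chunk reached the end
            break
    return chunks
```

Each chunk repeats the final `overlap` tokens of its predecessor, so a sentence straddling a boundary appears intact in at least one chunk.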
Semantic Chunking
Split at natural boundaries (paragraphs, sections, sentences) rather than fixed token counts. Methods:
- Recursive character splitting: Try paragraph breaks first, then sentences, then fixed-size as a fallback. LangChain's `RecursiveCharacterTextSplitter` implements this hierarchy.
- Embedding-based splitting: Compute sentence embeddings and split where cosine similarity between adjacent sentences drops below a threshold, detecting topic shifts.
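The embedding-based splitting idea can be sketched as follows, assuming sentence embeddings are already computed; the 0.5 threshold is an arbitrary illustrative value that would be tuned in practice:

```python
import numpy as np

def semantic_split(sent_embs: np.ndarray, threshold: float = 0.5) -> list:
    """Group consecutive sentence indices into chunks; start a new chunk
    when cosine similarity between adjacent sentences drops below threshold."""
    e = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    chunks, current = [], [0]
    for i in range(1, len(e)):
        if float(e[i - 1] @ e[i]) < threshold:   # topic shift detected
            chunks.append(current)
            current = []
        current.append(i)
    chunks.append(current)
    return chunks
```

The function returns index groups rather than text so it composes with any sentence segmenter and embedding model.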
Hierarchical Chunking
Maintain a hierarchy: document summaries at the top level, section summaries at the middle level, raw paragraphs at the bottom. Route queries to the appropriate level based on specificity. Summary-level retrieval catches broad questions; paragraph-level catches specific factual queries.
Parent-Child Chunking
Index small chunks (sentences or short paragraphs) for precise retrieval, but return the parent chunk (the full section or surrounding paragraphs) to the generator. This gives the retriever precision and the generator context.
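A minimal sketch of the parent-child bookkeeping, using a hypothetical `sections` mapping from section ids to their sentences: the children are what gets embedded and searched, the parent is what gets handed to the generator.

```python
def build_parent_child(sections: dict):
    """Flatten sections into child sentences plus a child→parent lookup.
    sections maps a section id to its list of sentences."""
    child_texts, child_to_parent = [], []
    for section_id, sentences in sections.items():
        for s in sentences:
            child_texts.append(s)               # embedded and indexed
            child_to_parent.append(section_id)  # remembered for expansion
    return child_texts, child_to_parent

def expand_to_parent(hit_index: int, child_to_parent, sections) -> str:
    """Given a retrieved child index, return the full parent section text."""
    return " ".join(sections[child_to_parent[hit_index]])
```

A hit on any one sentence expands to its whole section, giving the retriever sentence-level precision and the generator section-level context.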
Hybrid Search
Dense retrieval struggles with exact-match queries (specific names, codes, identifiers) where lexical overlap is the primary signal. Sparse retrieval (BM25) handles these well but misses semantic matches.
BM25 scores documents by term frequency with saturation and document length normalization:

$$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

where $f(t, d)$ is the frequency of term $t$ in document $d$, $k_1$ controls saturation, and $b$ controls length normalization.
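The formula translates directly into code. This is a minimal single-document scorer over a tokenized corpus (no inverted index, so it is illustrative rather than efficient):

```python
import math

def bm25_score(query: list, doc: list, corpus: list,
               k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 score of one tokenized document for a tokenized query."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n     # average document length
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        tf = doc.count(term)                              # term frequency
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score
```

Terms absent from the document contribute zero, and repeated terms saturate rather than growing linearly, which is the behavior `k1` controls.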
Hybrid search combines dense and sparse scores:

$$s_{\text{hybrid}}(q, d) = \alpha \cdot s_{\text{dense}}(q, d) + (1 - \alpha) \cdot s_{\text{sparse}}(q, d)$$

The mixing weight $\alpha$ is tuned per domain; values slightly above 0.5 are typical, slightly favoring dense retrieval. Reciprocal Rank Fusion (RRF) is an alternative that combines ranked lists without score normalization:

$$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}$$

where $R$ is the set of retrievers and $\text{rank}_r(d)$ is the rank of $d$ in retriever $r$'s list, with $k = 60$ typical. RRF is robust to score scale differences between retrievers.
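RRF is simple enough to implement directly. This sketch fuses any number of ranked id lists using the conventional $k = 60$:

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists of doc ids: each doc's score is the sum of
    1 / (k + rank) over every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks enter the score, a dense retriever's cosine similarities and BM25's unbounded scores fuse cleanly without any normalization step.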
Reranking
Initial retrieval returns $k$ candidates (typically 20–100). A reranker rescores these with a more expensive model.

Cross-encoder reranking. A cross-encoder processes the (query, document) pair jointly:

$$s(q, d) = \text{CrossEncoder}([q; d])$$

Unlike bi-encoders (which encode $q$ and $d$ independently), cross-encoders model fine-grained token-level interactions. This is roughly 100x slower but significantly more accurate, and it is practical because the reranker only processes the $k$ candidates, not the full corpus.
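The reranking step itself is a sort over rescored candidates. In this sketch a token-overlap stub stands in for the cross-encoder forward pass; a real system would call a model here (e.g. a sentence-transformers `CrossEncoder`), so the scoring function is purely illustrative:

```python
def rerank(query: str, candidates: list, top_n: int = 3) -> list:
    """Rescore retrieval candidates and keep the best top_n."""
    def score(q: str, d: str) -> float:
        # Placeholder relevance signal: token overlap. A real reranker
        # would run a cross-encoder forward pass on the (q, d) pair here.
        q_tokens, d_tokens = set(q.lower().split()), set(d.lower().split())
        return len(q_tokens & d_tokens) / max(len(q_tokens), 1)
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:top_n]
```

Only the candidate set pays the per-pair scoring cost, which is what makes an expensive scorer affordable at this stage.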
LLM-based reranking. Prompt an LLM to assess relevance of each candidate. More flexible (can handle complex relevance criteria specified in natural language) but slower and more expensive.
Cohere Rerank, FlashRank, bge-reranker are purpose-built cross-encoder rerankers that balance quality and latency.
Prompt Engineering for RAG
The generator prompt must effectively integrate retrieved context. Key patterns:
Stuffing. Concatenate all retrieved passages into the prompt:

```
Given the following context, answer the question.

Context:
{passage_1}
{passage_2}
...
{passage_k}

Question: {query}
Answer:
```

Simple and effective, but limited by context window size.
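The stuffing template can be assembled programmatically. This helper (a hypothetical name) also numbers the passages, which a citation instruction can then refer to:

```python
def build_stuffing_prompt(query: str, passages: list) -> str:
    """Assemble the stuffing prompt: numbered passages, then the question."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Given the following context, answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

In practice the passage list would first be truncated to fit the model's context window, most relevant passages first.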
Map-reduce. For large document sets: generate an answer from each passage independently (map), then synthesize across answers (reduce). Handles more documents than context window allows.
Citation. Instruct the model to cite which passage(s) support each claim. Enables verification and builds user trust. Requires careful prompt engineering to ensure citations are accurate, not hallucinated.
Advanced RAG Patterns
Query Transformation
The user’s raw query may not be optimal for retrieval. Transformations include:
- HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, then use its embedding to retrieve real documents. The hypothetical answer is often closer in embedding space to relevant documents than the original question.
- Query decomposition: Break a complex question into sub-questions, retrieve for each, and synthesize.
- Step-back prompting: Abstract the query to a higher-level question (“What is the GDP of France?” → “What are France’s key economic indicators?”), retrieve broader context, then answer the specific question.
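Of these transformations, HyDE reduces to a small amount of glue code once the model calls are abstracted. All three callables below are caller-supplied assumptions, not a specific library API:

```python
def hyde_retrieve(query: str, llm, embed, search, k: int = 5):
    """HyDE sketch: embed a hypothetical answer instead of the raw query.
    Assumed interfaces: llm(prompt) -> str, embed(text) -> vector,
    search(vector, k) -> list of passages."""
    hypothetical = llm(f"Write a short passage answering: {query}")
    return search(embed(hypothetical), k)
```

The generated passage may be factually wrong; that is acceptable, because only its embedding is used, and answer-shaped text tends to land nearer to relevant documents than question-shaped text.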
Self-RAG (Asai et al., 2023)
The model learns when to retrieve (not every query needs retrieval) and whether retrieved passages are relevant. Special tokens control the retrieval decision:
- Begin generating the response; the model can emit a retrieval decision token at any point
- If the model predicts [RETRIEVE], trigger retrieval
- Evaluate retrieved passages for relevance [RELEVANT/IRRELEVANT]
- Generate response conditioned on relevant passages
- Self-assess for factual grounding [SUPPORTED/NOT SUPPORTED]
Agentic RAG
Multi-step retrieval where the model iteratively queries, evaluates results, refines queries, and retrieves again. The model acts as an agent that plans its information-gathering strategy rather than performing a single retrieval step. This handles questions that require synthesizing information from multiple sources or that require follow-up queries based on initial findings.
Evaluation
RAG evaluation requires measuring both retrieval quality and generation quality.
Retrieval Metrics
| Metric | Formula | Measures |
|---|---|---|
| Recall@k | $\frac{\lvert \text{relevant} \cap \text{retrieved}_k \rvert}{\lvert \text{relevant} \rvert}$ | Coverage of relevant documents in the top $k$ |
| MRR | $\frac{1}{\lvert Q \rvert} \sum_{q \in Q} \frac{1}{\text{rank}_q}$ | Rank of the first relevant result |
| NDCG@k | $\frac{\text{DCG@}k}{\text{IDCG@}k}$ | Graded relevance with position discount |
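Recall@k and MRR are straightforward to compute; a minimal sketch matching the table's definitions:

```python
def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    return len(relevant & set(retrieved[:k])) / len(relevant)

def mean_reciprocal_rank(relevant_per_query: list, retrieved_per_query: list) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for relevant, retrieved in zip(relevant_per_query, retrieved_per_query):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(relevant_per_query)
```

Both require labeled (query, relevant-documents) pairs, which is why retrieval evaluation usually runs on a curated sample rather than live traffic.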
Generation Metrics
Faithfulness. Does the answer contain only information supported by the retrieved context? Measured by decomposing the answer into claims and verifying each against the passages. LLM-as-judge with structured prompting achieves reasonable agreement with human annotators.
Relevance. Does the answer address the question? Standard NLG metrics (ROUGE, BERTScore) provide weak signal; human evaluation or LLM-as-judge is more reliable.
RAGAS framework (Es et al., 2023) provides automated evaluation across faithfulness, answer relevance, and context relevance using LLM-based scoring.
Failure Modes
- Retrieval failure: Relevant documents not retrieved. Root causes: poor chunking, embedding model domain mismatch, query-document vocabulary gap.
- Context poisoning: Irrelevant retrieved passages cause the model to generate off-topic or incorrect answers. Mitigated by reranking and relevance filtering.
- Lost in the middle: Models attend more to the beginning and end of the context window, sometimes ignoring relevant information in the middle (Liu et al., 2023). Mitigated by placing the most relevant passages first.
- Over-retrieval: Too many passages dilute the signal and confuse the generator. The optimal number depends on the task and model; typically 3–5 passages for focused questions, up to 20 for synthesis tasks.
Production Considerations
Indexing pipeline. Document ingestion → chunking → embedding → vector store. Must handle incremental updates (new documents, modified documents, deletions) without full reindexing.
Latency budget. Embedding the query: ~10ms. ANN search: ~5ms. Reranking the candidates: ~50–200ms. LLM generation: 500ms–5s. The retrieval stages together stay well under 1s; end-to-end latency is dominated by generation.
Cost. Embedding costs are one-time per document (indexing) plus per-query. Storage scales linearly with corpus size. At 1536 dimensions and float32, 1M documents require ~6GB for vectors alone.
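The storage estimate is simple arithmetic; a sanity check (raw vector bytes only, ignoring index overhead and metadata):

```python
def vector_storage_bytes(n_docs: int, dim: int = 1536,
                         bytes_per_float: int = 4) -> int:
    """Raw storage for n_docs float32 vectors of the given dimension."""
    return n_docs * dim * bytes_per_float

# 1M documents at 1536 dimensions, float32:
gb = vector_storage_bytes(1_000_000) / 1e9
print(f"{gb:.2f} GB")  # → 6.14 GB
```

Product quantization or float16 storage cuts this figure substantially, at some cost in recall.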
Monitoring. Track retrieval recall (via human annotation on a sample), answer quality (user feedback, LLM-as-judge on a sample), and latency percentiles. Detect distribution shift by monitoring query embedding drift and retrieval score distributions.
Summary
| Component | Purpose | Key Decision |
|---|---|---|
| Embedding model | Encode semantics into vectors | Model choice, dimension, fine-tuning |
| Chunking | Segment documents for retrieval | Size, overlap, semantic vs fixed |
| Vector store | Index and search embeddings | ANN algorithm, metadata filtering |
| Hybrid search | Combine dense + sparse retrieval | Fusion method, mixing weight |
| Reranker | Rescore top-k candidates | Cross-encoder vs LLM-based |
| Generator prompt | Integrate context for answer | Stuffing, map-reduce, citation |
| Evaluation | Measure retrieval + generation quality | Faithfulness, relevance, RAGAS |
RAG transforms LLMs from closed-book to open-book systems. The retrieval pipeline determines what information the model sees; the generation pipeline determines how it uses that information. Both must work well for the system to produce accurate, grounded answers.