Retrieval-Augmented Generation (RAG)
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Retrieval-Augmented Generation (RAG) is an AI architecture that combines the generative capabilities of large language models with a retrieval mechanism that fetches relevant information from an external knowledge base at inference time. Rather than relying solely on knowledge encoded in model weights (which can be outdated or hallucinated), RAG grounds LLM responses in retrieved documents, dramatically improving factual accuracy and enabling AI systems to reason over private, domain-specific, or frequently-updated information.
Remembering
- RAG — Retrieval-Augmented Generation; an architecture combining retrieval of relevant documents with generation by an LLM.
- Retriever — The component responsible for finding relevant documents from a knowledge base given a user query.
- Generator — The LLM that reads retrieved documents and the user query to produce a grounded response.
- Knowledge base — The document collection from which the retriever fetches context; can be a vector store, search index, or database.
- Embedding — A dense vector representation of text that captures semantic meaning, enabling similarity-based retrieval.
- Vector store — A database optimized for storing and searching high-dimensional embedding vectors (e.g., Pinecone, Weaviate, Chroma, pgvector).
- Semantic search — Finding documents based on meaning similarity rather than keyword matching, using embedding vectors.
- Chunk — A segment of a larger document created during preprocessing. Documents are split into chunks before embedding.
- Context window — The maximum amount of text an LLM can process at once; RAG must fit retrieved chunks within this limit.
- Grounding — Providing the LLM with factual context (retrieved documents) to base its generation on, reducing hallucination.
- Hallucination — LLM-generated content that is factually incorrect or unsupported by evidence.
- Reranker — A model that reorders retrieved documents by relevance after initial retrieval, improving the quality of context passed to the LLM.
- HyDE (Hypothetical Document Embeddings) — A technique where the LLM generates a hypothetical answer, which is then embedded and used as the retrieval query.
- Naive RAG — The basic retrieve-then-generate pipeline without optimizations.
- Advanced RAG — RAG with pre-retrieval (query transformation) and post-retrieval (reranking, filtering) enhancements.
Understanding
The fundamental problem RAG solves is the knowledge limitation of static LLMs. A model trained on data up to a cutoff date cannot know what happened after; a general model cannot know your company's internal documents; and any model may hallucinate on specific factual queries.
RAG works in three phases:
1. Indexing (offline): Documents are split into chunks, each chunk is converted to an embedding vector, and vectors are stored in a vector database. This is done once (or periodically as documents update).
2. Retrieval (at query time): The user's query is converted to an embedding. The vector store finds the k most semantically similar document chunks using approximate nearest neighbor (ANN) search.
3. Generation: The retrieved chunks are inserted into the LLM's prompt as context. The model reads both the context and the query to generate a grounded answer.
The intuition: instead of the LLM trying to recall facts from its training data (unreliable), it reads the relevant facts directly from a "cheat sheet" (the retrieved documents). The model's job becomes comprehension and synthesis, not memorization.
Why not just use a large context window? You could stuff thousands of documents into a 1M token context. But this is expensive, slow, and LLMs struggle with "lost in the middle" — they attend poorly to information in the middle of very long contexts. Selective retrieval of the 5–20 most relevant chunks is far more efficient and effective.
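To make phase 2 concrete, here is a minimal brute-force version of similarity search in plain NumPy; the random vectors stand in for real embeddings, and a production system would use an ANN index rather than scoring every chunk.
<syntaxhighlight lang="python">
import numpy as np

def cosine_top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k chunks most similar to the query (brute force, not ANN)."""
    # Normalize so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(scores)[::-1][:k]

# Toy data: 100 chunks and one query, each as a 384-dimensional vector
rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(100, 384))
query_vec = rng.normal(size=384)
print(cosine_top_k(query_vec, chunk_vecs, k=5))
</syntaxhighlight>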
Applying
Basic RAG pipeline with LangChain and OpenAI:
<syntaxhighlight lang="python">
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# === INDEXING (done once) ===

# 1. Load documents
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Characters per chunk
    chunk_overlap=200   # Overlap to avoid cutting context
)
chunks = splitter.split_documents(documents)

# 3. Embed and store in vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")

# === RETRIEVAL + GENERATION (at query time) ===
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4o", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",   # "stuff" = concatenate retrieved docs into prompt
    retriever=retriever,
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the parental leave policy?"})
print(result["result"])
print([doc.metadata for doc in result["source_documents"]])
</syntaxhighlight>
Chunking strategy guide
- Fixed-size chunking → Simple, fast. Risk: splits sentences mid-meaning. (A minimal sketch follows this list.)
- Recursive character splitting → Splits on paragraphs, then sentences, then words. Best general-purpose default.
- Semantic chunking → Splits based on embedding similarity between adjacent sentences. Produces semantically coherent chunks.
- Document structure splitting → Uses headers, sections, and metadata. Best for structured documents (PDFs, Markdown).
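As a concrete reference point for the first strategy, here is a minimal character-based chunker with overlap; the sizes and the sample text are illustrative, and real pipelines would normally use a library splitter such as the one in the example above.
<syntaxhighlight lang="python">
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks, with overlap between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` chars shared
    return chunks

# Each chunk shares its last 200 characters with the start of the next chunk
chunks = chunk_text("lorem ipsum " * 500, chunk_size=1000, overlap=200)
print(len(chunks), len(chunks[0]))
</syntaxhighlight>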
Analyzing
| Approach | Description | Best For |
|---|---|---|
| Naive RAG | Embed query → retrieve top-k → generate | Simple Q&A on clean documents |
| HyDE | Generate hypothetical answer → embed it → retrieve | Queries with technical jargon |
| Query expansion | Rewrite query multiple ways, retrieve for each, deduplicate | Ambiguous queries |
| Reranking | Use cross-encoder to reorder retrieved chunks | Precision-critical applications |
| Self-RAG | LLM decides whether to retrieve and critiques its own output | Multi-step reasoning |
| Corrective RAG | Evaluates retrieval quality; falls back to web search if low confidence | High-stakes factual queries |
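To make the HyDE row concrete, a minimal sketch that reuses the `vectorstore` built in the Applying example; the prompt wording and model choice are illustrative assumptions, not a fixed recipe.
<syntaxhighlight lang="python">
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

query = "What is the parental leave policy?"

# 1. Generate a hypothetical answer (it may be wrong -- only its wording matters)
hypothetical = llm.invoke(
    f"Write a short passage that answers this question as if quoting a policy document:\n{query}"
).content

# 2. Embed the hypothetical answer instead of the raw query
hyde_vector = embeddings.embed_query(hypothetical)

# 3. Retrieve the chunks closest to the hypothetical answer
docs = vectorstore.similarity_search_by_vector(hyde_vector, k=5)
</syntaxhighlight>
The idea is that a fabricated answer is often closer, in embedding space, to the real answer passage than the terse original query is.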
Common failure modes:
- Retrieval failure — The relevant document wasn't retrieved (low recall). Causes: poor chunking, wrong embedding model, similarity metric mismatch.
- Context poisoning — Irrelevant chunks are included, confusing the LLM. Use reranking and confidence thresholds (see the reranking sketch after this list).
- Chunk boundary artifacts — Important context is split across chunks. Increase overlap or use semantic chunking.
- Prompt injection — Malicious content in retrieved documents instructs the LLM to behave differently. Sanitize inputs and use structured prompts.
- Stale index — The knowledge base isn't updated when documents change. Implement incremental indexing and deletion.
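A minimal reranking sketch using a cross-encoder from the sentence-transformers library; the candidate passages and the score threshold are illustrative assumptions.
<syntaxhighlight lang="python">
from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and passage together, so it scores relevance
# more precisely than the bi-encoder used for the initial vector search
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the parental leave policy?"
candidates = [  # e.g. the top-k chunks returned by the vector store
    "Parental leave is 16 weeks, fully paid, for all full-time employees.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "Leave requests must be submitted through the HR portal.",
]

scores = reranker.predict([(query, passage) for passage in candidates])

# Sort by score and keep only confidently relevant chunks (threshold is illustrative)
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
context = [passage for score, passage in ranked if score > 0]
print(context)
</syntaxhighlight>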
Evaluating
Expert RAG evaluation requires separate assessment of the retrieval and generation components:
Retrieval metrics
- Recall@k: Of the relevant chunks that exist, what fraction was retrieved in the top-k?
- MRR (Mean Reciprocal Rank): How highly is the first relevant chunk ranked?
- Precision@k: Of the k retrieved chunks, what fraction was actually relevant?
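The three retrieval metrics above are straightforward to compute once you have labeled (query, relevant-chunk) pairs; a minimal sketch, where the chunk IDs are illustrative:
<syntaxhighlight lang="python">
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are actually relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none was retrieved)."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1 / rank
    return 0.0

retrieved = ["c7", "c2", "c9", "c4", "c1"]   # chunk IDs in ranked order
relevant = {"c2", "c4"}
print(recall_at_k(retrieved, relevant, 5), precision_at_k(retrieved, relevant, 5), mrr(retrieved, relevant))
</syntaxhighlight>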
Generation metrics
- Faithfulness: Does the generated answer only assert things supported by the retrieved context? (Use an LLM as judge)
- Answer relevance: Does the response actually address the user's question?
- Context relevance: Are the retrieved chunks relevant to the query?
RAGAs framework: The open-source RAGAs library automates evaluation of faithfulness, answer relevance, and context precision/recall using LLM-as-judge.
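A minimal sketch of a RAGAs run using the 0.1-style API; the column names and metric imports vary across versions, an OpenAI API key is assumed for the LLM judge, and the example row is purely illustrative.
<syntaxhighlight lang="python">
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question": ["What is the parental leave policy?"],
    "answer": ["Employees receive 16 weeks of paid parental leave."],
    "contexts": [["Parental leave is 16 weeks, fully paid, for all full-time employees."]],
    "ground_truth": ["16 weeks of paid parental leave."],
})

scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)
</syntaxhighlight>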
Expert practitioners run end-to-end evals on a curated test set of query-answer pairs, tracking all metrics over time and alerting on regressions as the knowledge base evolves.
Creating
Designing a production RAG system:
1. Document pipeline architecture
<syntaxhighlight lang="text">
Document Sources (PDFs, web, DB, APIs)
↓
[Ingestion service: parse, clean, deduplicate]
↓
[Chunking: semantic or recursive, 500-1000 tokens]
↓
[Embedding model: text-embedding-3-small or BGE-M3]
↓
[Vector store: Pinecone / Weaviate / pgvector]
↓
[Metadata index: BM25 for hybrid search] </syntaxhighlight>
2. Query pipeline architecture
<syntaxhighlight lang="text">
User Query
↓
[Query rewriting / expansion]
↓
[Hybrid retrieval: semantic + BM25 keyword]
↓
[Reranking: cross-encoder or Cohere Rerank API]
↓
[Context assembly: top 5–10 chunks + metadata]
↓
[Prompt construction: system + context + query]
↓
[LLM generation with citation extraction]
↓
[Post-processing: source validation, confidence scoring]
↓
Response + Citations </syntaxhighlight>
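A minimal sketch of the prompt-construction step from the diagram above; the template wording and the numbered-source citation convention are assumptions, not a standard.
<syntaxhighlight lang="python">
def build_prompt(query: str, chunks: list[dict]) -> list[dict]:
    """Assemble a chat prompt with numbered sources so the model can cite them."""
    context = "\n\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    system = (
        "Answer using only the numbered sources below. "
        "Cite sources as [1], [2], ... and say you don't know if the sources are insufficient.\n\n"
        + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]

messages = build_prompt(
    "What is the parental leave policy?",
    [{"source": "handbook.pdf p.12", "text": "Parental leave is 16 weeks, fully paid."}],
)
print(messages[0]["content"])
</syntaxhighlight>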
3. Key design decisions
- Embedding model: balance quality vs. cost (BGE-M3 is strong open-source; text-embedding-3 for OpenAI ecosystem)
- Chunk size: 512–1024 tokens; larger for narrative text, smaller for dense technical docs
- Hybrid search (semantic + BM25) outperforms either alone on most benchmarks; a rank-fusion sketch follows this list
- Always return source citations to build user trust and enable verification
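A common way to combine the semantic and BM25 result lists is reciprocal rank fusion (RRF); a minimal sketch, where the chunk IDs are illustrative and k=60 is a commonly used default.
<syntaxhighlight lang="python">
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one, scoring each ID by sum of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_ranking = ["c4", "c1", "c7", "c2"]   # from the vector store
bm25_ranking = ["c2", "c4", "c9", "c5"]       # from a keyword index such as BM25
print(reciprocal_rank_fusion([semantic_ranking, bm25_ranking])[:5])
</syntaxhighlight>
RRF rewards chunks that rank well in either list without needing to calibrate the two scoring scales against each other.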