Rag - Revision history

Wordpad: BloomWiki: Rag

2026-04-25T01:56:48Z

BloomWiki: Rag

← Older revision		Revision as of 01:56, 25 April 2026
Line 1:		Line 1:
			<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
	{{BloomIntro}}		{{BloomIntro}}
	Retrieval-Augmented Generation (RAG) is an AI architecture that combines the generative capabilities of large language models with a retrieval mechanism that fetches relevant information from an external knowledge base at inference time. Rather than relying solely on knowledge encoded in model weights (which can be outdated or hallucinated), RAG grounds LLM responses in retrieved documents, dramatically improving factual accuracy and enabling AI systems to reason over private, domain-specific, or frequently-updated information.		Retrieval-Augmented Generation (RAG) is an AI architecture that combines the generative capabilities of large language models with a retrieval mechanism that fetches relevant information from an external knowledge base at inference time. Rather than relying solely on knowledge encoded in model weights (which can be outdated or hallucinated), RAG grounds LLM responses in retrieved documents, dramatically improving factual accuracy and enabling AI systems to reason over private, domain-specific, or frequently-updated information.
			</div>

	== Remembering ==		__TOC__

			<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Remembering</span> ==
	* '''RAG''' — Retrieval-Augmented Generation; an architecture combining retrieval of relevant documents with generation by an LLM.		* '''RAG''' — Retrieval-Augmented Generation; an architecture combining retrieval of relevant documents with generation by an LLM.
	* '''Retriever''' — The component responsible for finding relevant documents from a knowledge base given a user query.		* '''Retriever''' — The component responsible for finding relevant documents from a knowledge base given a user query.
Line 18:		Line 23:
	* '''Naive RAG''' — The basic retrieve-then-generate pipeline without optimizations.		* '''Naive RAG''' — The basic retrieve-then-generate pipeline without optimizations.
	* '''Advanced RAG''' — RAG with pre-retrieval (query transformation) and post-retrieval (reranking, filtering) enhancements.		* '''Advanced RAG''' — RAG with pre-retrieval (query transformation) and post-retrieval (reranking, filtering) enhancements.
			</div>

	== Understanding ==		<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Understanding</span> ==
	The fundamental problem RAG solves is the '''knowledge limitation of static LLMs'''. A model trained on data up to a cutoff date cannot know what happened after; a general model cannot know your company's internal documents; and any model may hallucinate on specific factual queries.		The fundamental problem RAG solves is the '''knowledge limitation of static LLMs'''. A model trained on data up to a cutoff date cannot know what happened after; a general model cannot know your company's internal documents; and any model may hallucinate on specific factual queries.

Line 33:		Line 40:

	'''Why not just use a large context window?''' You could stuff thousands of documents into a 1M token context. But this is expensive, slow, and LLMs struggle with "lost in the middle" — they attend poorly to information in the middle of very long contexts. Selective retrieval of the 5–20 most relevant chunks is far more efficient and effective.		'''Why not just use a large context window?''' You could stuff thousands of documents into a 1M token context. But this is expensive, slow, and LLMs struggle with "lost in the middle" — they attend poorly to information in the middle of very long contexts. Selective retrieval of the 5–20 most relevant chunks is far more efficient and effective.
			</div>

	== Applying ==		<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Applying</span> ==
	'''Basic RAG pipeline with LangChain and OpenAI:'''		'''Basic RAG pipeline with LangChain and OpenAI:'''

Line 83:		Line 92:
	: '''Semantic chunking''' → Splits based on embedding similarity between adjacent sentences. Produces semantically coherent chunks.		: '''Semantic chunking''' → Splits based on embedding similarity between adjacent sentences. Produces semantically coherent chunks.
	: '''Document structure splitting''' → Uses headers, sections, and metadata. Best for structured documents (PDFs, Markdown).		: '''Document structure splitting''' → Uses headers, sections, and metadata. Best for structured documents (PDFs, Markdown).
			</div>

	== Analyzing ==		<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Analyzing</span> ==
	{\| class="wikitable"		{\| class="wikitable"
	\|+ RAG Architecture Variants		\|+ RAG Architecture Variants
Line 108:		Line 119:
	* '''Prompt injection''' — Malicious content in retrieved documents instructs the LLM to behave differently. Sanitize inputs and use structured prompts.		* '''Prompt injection''' — Malicious content in retrieved documents instructs the LLM to behave differently. Sanitize inputs and use structured prompts.
	* '''Stale index''' — The knowledge base isn't updated when documents change. Implement incremental indexing and deletion.		* '''Stale index''' — The knowledge base isn't updated when documents change. Implement incremental indexing and deletion.
			</div>

	== Evaluating ==		<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Evaluating</span> ==
	Expert RAG evaluation requires separate assessment of the retrieval and generation components:		Expert RAG evaluation requires separate assessment of the retrieval and generation components:

Line 125:		Line 138:

	Expert practitioners run '''end-to-end evals''' on a curated test set of query-answer pairs, tracking all metrics over time and alerting on regressions as the knowledge base evolves.		Expert practitioners run '''end-to-end evals''' on a curated test set of query-answer pairs, tracking all metrics over time and alerting on regressions as the knowledge base evolves.
			</div>

	== Creating ==		<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
			== <span style="color: #FFFFFF;">Creating</span> ==
	Designing a production RAG system:		Designing a production RAG system:

Line 174:		Line 189:
	[[Category:Large Language Models]]		[[Category:Large Language Models]]
	[[Category:Natural Language Processing]]		[[Category:Natural Language Processing]]
			</div>

Wordpad: BloomWiki: Rag

2026-04-23T14:19:30Z

BloomWiki: Rag

New page

{{BloomIntro}}
Retrieval-Augmented Generation (RAG) is an AI architecture that combines the generative capabilities of large language models with a retrieval mechanism that fetches relevant information from an external knowledge base at inference time. Rather than relying solely on knowledge encoded in model weights (which can be outdated or hallucinated), RAG grounds LLM responses in retrieved documents, dramatically improving factual accuracy and enabling AI systems to reason over private, domain-specific, or frequently-updated information.

== Remembering ==
* '''RAG''' — Retrieval-Augmented Generation; an architecture combining retrieval of relevant documents with generation by an LLM.
* '''Retriever''' — The component responsible for finding relevant documents from a knowledge base given a user query.
* '''Generator''' — The LLM that reads retrieved documents and the user query to produce a grounded response.
* '''Knowledge base''' — The document collection from which the retriever fetches context; can be a vector store, search index, or database.
* '''Embedding''' — A dense vector representation of text that captures semantic meaning, enabling similarity-based retrieval.
* '''Vector store''' — A database optimized for storing and searching high-dimensional embedding vectors (e.g., Pinecone, Weaviate, Chroma, pgvector).
* '''Semantic search''' — Finding documents based on meaning similarity rather than keyword matching, using embedding vectors.
* '''Chunk''' — A segment of a larger document created during preprocessing. Documents are split into chunks before embedding.
* '''Context window''' — The maximum amount of text an LLM can process at once; RAG must fit retrieved chunks within this limit.
* '''Grounding''' — Providing the LLM with factual context (retrieved documents) to base its generation on, reducing hallucination.
* '''Hallucination''' — LLM-generated content that is factually incorrect or unsupported by evidence.
* '''Reranker''' — A model that reorders retrieved documents by relevance after initial retrieval, improving the quality of context passed to the LLM.
* '''HyDE (Hypothetical Document Embeddings)''' — A technique where the LLM generates a hypothetical answer, which is then embedded and used as the retrieval query.
* '''Naive RAG''' — The basic retrieve-then-generate pipeline without optimizations.
* '''Advanced RAG''' — RAG with pre-retrieval (query transformation) and post-retrieval (reranking, filtering) enhancements.

== Understanding ==
The fundamental problem RAG solves is the '''knowledge limitation of static LLMs'''. A model trained on data up to a cutoff date cannot know what happened after; a general model cannot know your company's internal documents; and any model may hallucinate on specific factual queries.

RAG works in three phases:

'''1. Indexing (offline)''': Documents are split into chunks, each chunk is converted to an embedding vector, and vectors are stored in a vector database. This is done once (or periodically as documents update).

'''2. Retrieval (at query time)''': The user's query is converted to an embedding. The vector store finds the k most semantically similar document chunks using approximate nearest neighbor (ANN) search.

'''3. Generation''': The retrieved chunks are inserted into the LLM's prompt as context. The model reads both the context and the query to generate a grounded answer.

The intuition: instead of the LLM trying to recall facts from its training data (unreliable), it reads the relevant facts directly from a "cheat sheet" (the retrieved documents). The model's job becomes comprehension and synthesis, not memorization.

'''Why not just use a large context window?''' You could stuff thousands of documents into a 1M token context. But this is expensive, slow, and LLMs struggle with "lost in the middle" — they attend poorly to information in the middle of very long contexts. Selective retrieval of the 5–20 most relevant chunks is far more efficient and effective.

== Applying ==
'''Basic RAG pipeline with LangChain and OpenAI:'''

<syntaxhighlight lang="python">
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# === INDEXING (done once) ===

# 1. Load documents
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Characters per chunk
chunk_overlap=200 # Overlap to avoid cutting context
)
chunks = splitter.split_documents(documents)

# 3. Embed and store in vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")

# === RETRIEVAL + GENERATION (at query time) ===

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-4o", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # "stuff" = concatenate retrieved docs into prompt
retriever=retriever,
return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the parental leave policy?"})
print(result["result"])
print([doc.metadata for doc in result["source_documents"]])
</syntaxhighlight>

; Chunking strategy guide
: '''Fixed-size chunking''' → Simple, fast. Risk: splits sentences mid-meaning.
: '''Recursive character splitting''' → Splits on paragraphs, then sentences, then words. Best general-purpose default.
: '''Semantic chunking''' → Splits based on embedding similarity between adjacent sentences. Produces semantically coherent chunks.
: '''Document structure splitting''' → Uses headers, sections, and metadata. Best for structured documents (PDFs, Markdown).

== Analyzing ==
{| class="wikitable"
|+ RAG Architecture Variants
! Approach !! Description !! Best For
|-
| Naive RAG || Embed query → retrieve top-k → generate || Simple Q&A on clean documents
|-
| HyDE || Generate hypothetical answer → embed it → retrieve || Queries with technical jargon
|-
| Query expansion || Rewrite query multiple ways, retrieve for each, deduplicate || Ambiguous queries
|-
| Reranking || Use cross-encoder to reorder retrieved chunks || Precision-critical applications
|-
| Self-RAG || LLM decides whether to retrieve and critiques its own output || Multi-step reasoning
|-
| Corrective RAG || Evaluates retrieval quality; falls back to web search if low confidence || High-stakes factual queries
|}

'''Common failure modes:'''
* '''Retrieval failure''' — The relevant document wasn't retrieved (low recall). Causes: poor chunking, wrong embedding model, similarity metric mismatch.
* '''Context poisoning''' — Irrelevant chunks are included, confusing the LLM. Use reranking and confidence thresholds.
* '''Chunk boundary artifacts''' — Important context is split across chunks. Increase overlap or use semantic chunking.
* '''Prompt injection''' — Malicious content in retrieved documents instructs the LLM to behave differently. Sanitize inputs and use structured prompts.
* '''Stale index''' — The knowledge base isn't updated when documents change. Implement incremental indexing and deletion.

== Evaluating ==
Expert RAG evaluation requires separate assessment of the retrieval and generation components:

'''Retrieval metrics'''
* '''Recall@k''': Of the relevant chunks that exist, what fraction was retrieved in the top-k?
* '''MRR (Mean Reciprocal Rank)''': How highly is the first relevant chunk ranked?
* '''Precision@k''': Of the k retrieved chunks, what fraction was actually relevant?

'''Generation metrics'''
* '''Faithfulness''': Does the generated answer only assert things supported by the retrieved context? (Use an LLM as judge)
* '''Answer relevance''': Does the response actually address the user's question?
* '''Context relevance''': Are the retrieved chunks relevant to the query?

'''RAGAs framework''': The open-source RAGAs library automates evaluation of faithfulness, answer relevance, and context precision/recall using LLM-as-judge.

Expert practitioners run '''end-to-end evals''' on a curated test set of query-answer pairs, tracking all metrics over time and alerting on regressions as the knowledge base evolves.

== Creating ==
Designing a production RAG system:

'''1. Document pipeline architecture'''
<syntaxhighlight lang="text">
Document Sources (PDFs, web, DB, APIs)
↓
[Ingestion service: parse, clean, deduplicate]
↓
[Chunking: semantic or recursive, 500-1000 tokens]
↓
[Embedding model: text-embedding-3-small or BGE-M3]
↓
[Vector store: Pinecone / Weaviate / pgvector]
↓
[Metadata index: BM25 for hybrid search]
</syntaxhighlight>

'''2. Query pipeline architecture'''
<syntaxhighlight lang="text">
User Query
↓
[Query rewriting / expansion]
↓
[Hybrid retrieval: semantic + BM25 keyword]
↓
[Reranking: cross-encoder or Cohere Rerank API]
↓
[Context assembly: top 5–10 chunks + metadata]
↓
[Prompt construction: system + context + query]
↓
[LLM generation with citation extraction]
↓
[Post-processing: source validation, confidence scoring]
↓
Response + Citations
</syntaxhighlight>

'''3. Key design decisions'''
* Embedding model: balance quality vs. cost (BGE-M3 is strong open-source; text-embedding-3 for OpenAI ecosystem)
* Chunk size: 512–1024 tokens; larger for narrative text, smaller for dense technical docs
* Hybrid search (semantic + BM25) outperforms either alone on most benchmarks
* Always return source citations to build user trust and enable verification

[[Category:Artificial Intelligence]]
[[Category:Large Language Models]]
[[Category:Natural Language Processing]]