
== <span style="color: #FFFFFF;">Understanding</span> ==
The fundamental problem RAG solves is the '''knowledge limitation of static LLMs'''. A model trained on data up to a cutoff date cannot know what happened after; a general model cannot know your company's internal documents; and any model may hallucinate on specific factual queries.

RAG works in three phases:

'''1. Indexing (offline)''': Documents are split into chunks, each chunk is converted to an embedding vector, and vectors are stored in a vector database. This is done once (or periodically as documents update).
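The indexing phase can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: <code>chunk_text</code>, <code>toy_embed</code> and <code>build_index</code> are hypothetical names, the chunk size and overlap are arbitrary, and the hashed bag-of-words vector is only a stand-in for a real embedding model. The in-memory list stands in for a vector database.

```python
import hashlib
import math


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dim=64):
    """Assumption: stand-in for a real embedding model (hashed bag of words, L2-normalised)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents, chunk_size=50, overlap=10):
    """Return a list of records; a real system would write these to a vector database."""
    index = []
    for doc_id, text in documents.items():
        for chunk in chunk_text(text, chunk_size, overlap):
            index.append({"doc_id": doc_id, "text": chunk, "vector": toy_embed(chunk)})
    return index
```

The overlap between consecutive chunks is a common design choice so that a sentence falling on a chunk boundary still appears whole in at least one chunk.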

'''2. Retrieval (at query time)''': The user's query is converted to an embedding. The vector store finds the k most semantically similar document chunks using approximate nearest neighbor (ANN) search.
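As a rough sketch of the retrieval step, the code below ranks stored chunks by cosine similarity to the query vector and returns the top k. Note the assumption: this is an exact brute-force scan over a small in-memory list, which is what an ANN index approximates at scale. The function names and the record layout (<code>text</code>, <code>vector</code>) are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=3):
    """Exact nearest-neighbour search; returns (score, record) pairs, best first."""
    scored = ((cosine_similarity(query_vector, rec["vector"]), rec) for rec in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Tiny hand-written index; real vectors would come from an embedding model.
example_index = [
    {"text": "a", "vector": [1.0, 0.0]},
    {"text": "b", "vector": [0.0, 1.0]},
    {"text": "c", "vector": [0.9, 0.1]},
]
best = top_k([1.0, 0.0], example_index, k=2)
```

Here <code>best</code> holds the records for chunks "a" and "c", in that order, because their vectors point most nearly in the same direction as the query.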

'''3. Generation''': The retrieved chunks are inserted into the LLM's prompt as context. The model reads both the context and the query to generate a grounded answer.
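The generation step largely comes down to prompt assembly. A minimal sketch follows, assuming a plain-text prompt; the wording of the instruction and the <code>build_prompt</code> helper are illustrative assumptions, and the resulting string would be passed to whichever LLM API is in use.

```python
def build_prompt(query, chunks):
    """Insert retrieved chunks into the prompt as numbered context passages."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    "When was the policy updated?",
    ["The policy was updated in March.", "Policies are reviewed yearly."],
)
```

Numbering the passages makes it possible to ask the model to cite which passage supports each claim, although that is optional.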

The intuition: instead of the LLM trying to recall facts from its training data (unreliable), it reads the relevant facts directly from a "cheat sheet" (the retrieved documents). The model's job becomes comprehension and synthesis, not memorization.

'''Why not just use a large context window?''' You could place thousands of documents into a 1M-token context, but every query then pays the cost and latency of processing the entire context, and LLMs suffer from the "lost in the middle" effect: they attend poorly to information in the middle of very long contexts. Retrieving only the 5–20 most relevant chunks keeps the prompt small and puts the relevant material directly in front of the model.
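The size difference can be made concrete with some back-of-the-envelope arithmetic. The figures below are assumptions chosen for illustration (500 tokens per chunk, a 50-token query), not measurements.

```python
def prompt_tokens(num_chunks, tokens_per_chunk, query_tokens=50):
    """Approximate prompt size when only retrieved chunks are included."""
    return num_chunks * tokens_per_chunk + query_tokens


FULL_CONTEXT_TOKENS = 1_000_000          # the 1M-token window filled completely
retrieved_tokens = prompt_tokens(20, 500)  # 20 chunks of an assumed 500 tokens each
ratio = FULL_CONTEXT_TOKENS / retrieved_tokens
```

Under these assumptions the retrieved prompt is about 10,000 tokens, roughly a hundredth of the full window, and input cost generally scales with the number of tokens processed.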
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">