Long Context and Memory in LLMs
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Long context and memory in large language models concern how AI systems process, retain, and effectively use information across very long documents, conversations, and tasks. While early LLMs handled only 512–2048 tokens, modern models support context windows of 128k to over 1M tokens. Yet simply having a large context window doesn't mean the model effectively uses all the information in it. Memory in AI encompasses short-term context-window memory, retrieval-augmented memory via vector databases, and the emerging frontier of persistent agent memory across sessions.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Context window''' – The maximum number of tokens an LLM can process in a single forward pass.
* '''Long context''' – Context windows exceeding 32k tokens, enabling processing of long documents, books, and extended conversations.
* '''KV cache''' – The key–value cache storing attention keys and values for all processed tokens; it grows linearly with context length.
* '''Lost in the Middle''' – The empirical finding that LLMs retrieve information from the middle of a long context less reliably than from the beginning or end.
* '''Needle in a Haystack (NIAH)''' – A benchmark that hides a specific fact in a long document and asks the model to retrieve it; tests effective context utilization.
* '''RULER''' – A more comprehensive long-context benchmark covering multi-hop retrieval, aggregation, and ordering.
* '''RoPE (Rotary Position Embedding)''' – A position-encoding method that can be extended to sequences longer than the training length via "context extension" techniques.
* '''YaRN''' – A technique for extending RoPE-based models to longer contexts without full retraining.
* '''Ring Attention''' – A distributed attention mechanism enabling near-infinite context by distributing the KV cache across devices.
* '''Sliding window attention''' – Restricts attention to a local window; efficient, but loses long-range information.
* '''Retrieval-augmented memory''' – Augmenting the model's context with relevant chunks retrieved from external memory stores.
* '''Episodic memory''' – Storing and retrieving specific past events or conversations, enabling persistent agent memory.
* '''Working memory''' – The information currently held in the context window; limited by context length.
* '''Compressive memory''' – Summarizing and compressing older context to extend effective memory beyond the raw context window.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
As LLMs scale to longer context windows, two challenges emerge: '''efficiency''' and '''effective utilization'''.

'''The quadratic problem (efficiency)''': Standard self-attention is O(n²) in memory and compute for sequence length n. At 1M tokens this means roughly a trillion attention scores per layer, which is infeasible to compute naively. Solutions include Flash Attention (IO-efficient exact attention), sliding window attention (local context only), Ring Attention (distributing the KV cache across devices), and sparse attention patterns.
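To make these costs concrete, here is a back-of-envelope sketch. The layer count, KV-head count, and head dimension are illustrative assumptions (roughly a 70B-class model with grouped-query attention), not the configuration of any particular model:

<syntaxhighlight lang="python">
# Back-of-envelope cost of long contexts under an assumed, illustrative config.
N_LAYERS = 80        # transformer layers
N_KV_HEADS = 8       # key/value heads (grouped-query attention)
HEAD_DIM = 128       # dimension per head
BYTES_PER_VALUE = 2  # fp16/bf16 storage

def kv_cache_bytes(context_len: int) -> int:
    """KV cache grows linearly with context: 2 tensors (K and V) per layer per token."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * context_len

def attention_scores(context_len: int) -> int:
    """Exact self-attention computes one score per token pair: O(n^2) per head, per layer."""
    return context_len ** 2

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens: KV cache ~{kv_cache_bytes(n) / 1e9:,.0f} GB, "
          f"~{attention_scores(n):.0e} attention scores per head per layer")
</syntaxhighlight>

Under these assumptions the KV cache alone reaches hundreds of gigabytes at 1M tokens, which is why the techniques above trade exactness or locality for feasibility.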
'''"Lost in the Middle" (effective utilization)''': Even when models can technically process long contexts, they don't use every part of the context equally well. Information at the beginning and end of the context is reliably recalled; information in the middle is often lost. This means a 128k context window doesn't deliver 128k tokens' worth of reliable memory, a critical limitation for applications relying on long-document comprehension.

'''Memory taxonomy for AI agents''':
* '''In-context (working memory)''': Everything in the current context window. Fast, with perfect recall, but limited by context length and cost.
* '''Retrieval memory (episodic)''': An external vector database storing embeddings of past interactions, retrieved by semantic search when needed.
* '''Parametric memory''': Knowledge encoded in the model weights during pre-training. Fixed; it can't be updated without retraining.
* '''Cache memory''': Prefix KV-cache reuse for repeated context (e.g., the system prompt); reduces latency and cost.

'''The solution stack''': For most applications, the best "memory" architecture is RAG over a well-organized vector database, supplemented by a conversation summary kept in the context. True long-context models (Gemini 1.5 Pro at 1M tokens, Claude 3.5 Sonnet at 200k) are valuable for tasks requiring holistic document understanding, not just retrieval.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Long-context document QA with map-reduce over chunks:'''
<syntaxhighlight lang="python">
from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")

def chunk_document(text: str, chunk_size: int = 8000, overlap: int = 500) -> list[str]:
    """Split a document into overlapping token chunks."""
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(enc.decode(chunk_tokens))
    return chunks

def answer_with_long_context(document: str, question: str) -> str:
    chunks = chunk_document(document)

    # Map phase: extract information relevant to the question from each chunk.
    relevant_excerpts = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"From the following text, extract any information relevant to: '{question}'\n"
                    f"Text: {chunk}\n"
                    "If nothing is relevant, respond 'NONE'."
                ),
            }],
        )
        excerpt = response.choices[0].message.content
        if excerpt and excerpt.strip().upper() != "NONE":
            relevant_excerpts.append(excerpt)

    # Reduce phase: synthesize the extracted excerpts into a final answer.
    combined = "\n\n".join(relevant_excerpts)
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Based on these excerpts, answer: {question}\n\nExcerpts:\n{combined}",
        }],
    )
    return final.choices[0].message.content
</syntaxhighlight>

; Long context strategy by use case
: '''Full book comprehension''' – Gemini 1.5 Pro (1M tokens), Claude 3.5 Sonnet (200k)
: '''Long-document QA''' – RAG with chunking plus a cross-encoder reranker
: '''Multi-session agent memory''' – Conversation summary plus a vector DB (episodic memory)
: '''Code repository understanding''' – Tree-sitter parsing with selective context, CodeGraph
: '''Long conversations''' – Progressive summarization of older turns into a rolling summary (see the sketch below)
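The rolling-summary strategy for long conversations can be sketched in a few lines. This is a minimal illustration rather than a production implementation; the token budget, model choice, and prompt wording are assumptions:

<syntaxhighlight lang="python">
from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")

MAX_RECENT_TOKENS = 4000  # illustrative budget for keeping recent turns verbatim

def compress_history(summary: str, turns: list[dict]) -> tuple[str, list[dict]]:
    """Keep recent turns verbatim within a token budget; fold older turns into a rolling summary."""
    recent, used = [], 0
    for turn in reversed(turns):
        n = len(enc.encode(turn["content"]))
        if used + n > MAX_RECENT_TOKENS:
            break
        recent.insert(0, turn)
        used += n
    older = turns[: len(turns) - len(recent)]
    if older:
        transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in older)
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": "Update this conversation summary with the new turns. "
                           "Preserve user preferences, decisions, and open tasks.\n\n"
                           f"Current summary:\n{summary}\n\nNew turns:\n{transcript}",
            }],
        )
        summary = response.choices[0].message.content
    return summary, recent
</syntaxhighlight>

On each request, the updated summary is sent once (for example, inside the system prompt) and only the recent turns are included verbatim, keeping prompt size roughly constant as the conversation grows.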
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Memory Strategy Comparison
! Strategy !! Max Coverage !! Recall Quality !! Cost !! Latency
|-
| Full context (128k) || 128k tokens || Good, but "lost in the middle" || High || High
|-
| RAG (retrieval) || Unlimited || Good for specific facts || Low || Medium
|-
| Rolling summary || Unlimited (lossy) || Good for continuity || Low || Low
|-
| Map-reduce over chunks || Unlimited || Good for aggregation || Medium || High
|-
| KV cache prefix reuse || Fixed (system prompt) || Perfect || Low || Very low
|}

'''Failure modes''':
* Lost-in-the-middle information loss for crucial facts buried in long documents.
* Context window overflow causing silent truncation (older messages silently dropped).
* KV cache memory exhaustion with very long contexts (hundreds of GB at 1M tokens).
* Retrieval misses in RAG when the query doesn't semantically match the relevant chunks.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Long-context evaluation:
# '''NIAH (Needle in a Haystack)''': place a specific fact at various positions (0%, 25%, 50%, 75%, 100%) in a long document and measure retrieval accuracy per position. Reveals "lost in the middle" failure modes.
# '''RULER''': multi-hop retrieval and aggregation tasks.
# '''LongBench''': diverse long-context tasks (QA, summarization, code).
# '''Practical evaluation''': measure end-task performance on actual long documents from the target domain.
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing an agent memory system:
# '''Short-term''': keep the last N = 20 turns in the context.
# '''Episodic''': embed and store every conversation turn in a vector DB (ChromaDB/Weaviate); retrieve the top-5 relevant past turns per new query (see the sketch after this section).
# '''Summary''': after every 20 turns, generate a running summary (user preferences, ongoing tasks, established facts) and include it in every context.
# '''Long-document''': for per-session documents (contracts, manuals), use semantic chunking plus RAG rather than stuffing the full document into context.
# '''Freshness''': tag memories with timestamps; decay the relevance of old memories in retrieval scoring.

[[Category:Artificial Intelligence]]
[[Category:Large Language Models]]
[[Category:Memory]]
</div>
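As a concrete starting point, the episodic layer (steps 2 and 5 of the design above) could look like the following sketch. It relies on ChromaDB's default embedding function; the collection name, over-fetch factor, and recency-decay formula are illustrative assumptions rather than a fixed recipe:

<syntaxhighlight lang="python">
import time
import chromadb

# Sketch of the episodic memory layer only: store turns with timestamps,
# retrieve by semantic similarity, and down-weight older memories.
chroma_client = chromadb.Client()
memory = chroma_client.get_or_create_collection("episodic_memory")

def remember(turn_id: str, role: str, text: str) -> None:
    """Store one conversation turn, stamped with the current time for later decay."""
    memory.add(documents=[text], ids=[turn_id],
               metadatas=[{"role": role, "ts": time.time()}])

def recall(query: str, k: int = 5, half_life_days: float = 30.0) -> list[str]:
    """Retrieve semantically similar past turns, preferring fresher memories."""
    res = memory.query(query_texts=[query], n_results=k * 3)  # over-fetch, then re-rank
    scored = []
    for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
        age_days = (time.time() - meta["ts"]) / 86400
        freshness = 0.5 ** (age_days / half_life_days)  # exponential recency decay
        similarity = 1.0 / (1.0 + dist)                 # map distance to a similarity-like score
        scored.append((similarity * freshness, doc))
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:k]]
</syntaxhighlight>

The recalled turns would then be placed into the prompt alongside the running summary from step 3, with the most recent turns kept verbatim as in step 1.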