Long Context and Memory in LLMs
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Long context and memory in large language models concern how AI systems process, retain, and effectively use information across very long documents, conversations, and tasks. While early LLMs handled only 512–2048 tokens, modern models support context windows of 128k to over 1M tokens. Yet simply having a large context window doesn't mean the model effectively uses all the information in it. Memory in AI encompasses short-term context-window memory, retrieval-augmented memory via vector databases, and the emerging frontier of persistent agent memory across sessions.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Context window''' – The maximum number of tokens an LLM can process in a single forward pass.
* '''Long context''' – Context windows exceeding 32k tokens, enabling processing of long documents, books, and extended conversations.
* '''KV cache''' – The key–value cache storing attention keys and values for all processed tokens; it grows linearly with context length.
* '''Lost in the Middle''' – The empirical finding that LLMs retrieve information from the middle of a long context less reliably than from the beginning or end.
* '''Needle in a Haystack (NIAH)''' – A benchmark that hides a specific fact in a long document and asks the model to retrieve it; tests effective context utilization.
* '''RULER''' – A more comprehensive long-context benchmark covering multi-hop retrieval, aggregation, and ordering.
* '''RoPE (Rotary Position Embedding)''' – A position-encoding method that can be extended to sequences longer than the training length via "context extension" techniques.
* '''YaRN''' – A technique for extending RoPE-based models to longer contexts without full retraining.
* '''Ring Attention''' – A distributed attention mechanism enabling near-infinite context by distributing the KV cache across devices.
* '''Sliding window attention''' – Restricts attention to a local window; efficient, but loses long-range information.
* '''Retrieval-augmented memory''' – Augmenting the model's context with relevant chunks retrieved from external memory stores.
* '''Episodic memory''' – Storing and retrieving specific past events or conversations, enabling persistent agent memory.
* '''Working memory''' – The information currently held in the context window; limited by context length.
* '''Compressive memory''' – Summarizing and compressing older context to extend effective memory beyond the raw context window.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
As LLMs scale to longer context windows, two challenges emerge: '''efficiency''' and '''effective utilization'''.

'''The quadratic problem (efficiency)''': Standard self-attention is O(n²) in memory and compute for sequence length n. At 1M tokens this means roughly a trillion attention scores per layer, which is infeasible to compute naively. Solutions include Flash Attention (IO-efficient exact attention), sliding window attention (local context only), Ring Attention (distributing the KV cache across devices), and sparse attention patterns.
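To make these costs concrete, here is a back-of-envelope sketch. The layer count, KV-head count, and head dimension are illustrative assumptions (roughly a 70B-class model with grouped-query attention), not the configuration of any particular model:

<syntaxhighlight lang="python">
# Back-of-envelope cost of long contexts under an assumed, illustrative config.
N_LAYERS = 80        # transformer layers
N_KV_HEADS = 8       # key/value heads (grouped-query attention)
HEAD_DIM = 128       # dimension per head
BYTES_PER_VALUE = 2  # fp16/bf16 storage

def kv_cache_bytes(context_len: int) -> int:
    """KV cache grows linearly with context: 2 tensors (K and V) per layer per token."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * context_len

def attention_scores(context_len: int) -> int:
    """Exact self-attention computes one score per token pair: O(n^2) per head, per layer."""
    return context_len ** 2

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens: KV cache ~{kv_cache_bytes(n) / 1e9:,.0f} GB, "
          f"~{attention_scores(n):.0e} attention scores per head per layer")
</syntaxhighlight>

Under these assumptions the KV cache alone reaches hundreds of gigabytes at 1M tokens, which is why the techniques above trade exactness or locality for feasibility.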
'''"Lost in the Middle" (effective utilization)''': Even when models can technically process long contexts, they don't use every part of the context equally well. Information at the beginning and end of the context is reliably recalled; information in the middle is often lost. This means a 128k context window doesn't deliver 128k tokens' worth of reliable memory, a critical limitation for applications relying on long-document comprehension.

'''Memory taxonomy for AI agents''':
* '''In-context (working memory)''': Everything in the current context window. Fast, with perfect recall, but limited by context length and cost.
* '''Retrieval memory (episodic)''': An external vector database storing embeddings of past interactions, retrieved by semantic search when needed.
* '''Parametric memory''': Knowledge encoded in the model weights during pre-training. Fixed; it can't be updated without retraining.
* '''Cache memory''': Prefix KV-cache reuse for repeated context (e.g., the system prompt); reduces latency and cost.

'''The solution stack''': For most applications, the best "memory" architecture is RAG over a well-organized vector database, supplemented by a conversation summary kept in the context. True long-context models (Gemini 1.5 Pro at 1M tokens, Claude 3.5 Sonnet at 200k) are valuable for tasks requiring holistic document understanding, not just retrieval.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Long-context document QA with map-reduce over chunks:'''
<syntaxhighlight lang="python">
from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")

def chunk_document(text: str, chunk_size: int = 8000, overlap: int = 500) -> list[str]:
    """Split a document into overlapping token chunks."""
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(enc.decode(chunk_tokens))
    return chunks

def answer_with_long_context(document: str, question: str) -> str:
    chunks = chunk_document(document)

    # Map phase: extract information relevant to the question from each chunk.
    relevant_excerpts = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"From the following text, extract any information relevant to: '{question}'\n"
                    f"Text: {chunk}\n"
                    "If nothing is relevant, respond 'NONE'."
                ),
            }],
        )
        excerpt = response.choices[0].message.content
        if excerpt and excerpt.strip().upper() != "NONE":
            relevant_excerpts.append(excerpt)

    # Reduce phase: synthesize the extracted excerpts into a final answer.
    combined = "\n\n".join(relevant_excerpts)
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Based on these excerpts, answer: {question}\n\nExcerpts:\n{combined}",
        }],
    )
    return final.choices[0].message.content
</syntaxhighlight>

; Long context strategy by use case
: '''Full book comprehension''' – Gemini 1.5 Pro (1M tokens), Claude 3.5 Sonnet (200k)
: '''Long-document QA''' – RAG with chunking plus a cross-encoder reranker
: '''Multi-session agent memory''' – Conversation summary plus a vector DB (episodic memory)
: '''Code repository understanding''' – Tree-sitter parsing with selective context, CodeGraph
: '''Long conversations''' – Progressive summarization of older turns into a rolling summary (see the sketch below)
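The rolling-summary strategy for long conversations can be sketched in a few lines. This is a minimal illustration rather than a production implementation; the token budget, model choice, and prompt wording are assumptions:

<syntaxhighlight lang="python">
from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")

MAX_RECENT_TOKENS = 4000  # illustrative budget for keeping recent turns verbatim

def compress_history(summary: str, turns: list[dict]) -> tuple[str, list[dict]]:
    """Keep recent turns verbatim within a token budget; fold older turns into a rolling summary."""
    recent, used = [], 0
    for turn in reversed(turns):
        n = len(enc.encode(turn["content"]))
        if used + n > MAX_RECENT_TOKENS:
            break
        recent.insert(0, turn)
        used += n
    older = turns[: len(turns) - len(recent)]
    if older:
        transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in older)
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": "Update this conversation summary with the new turns. "
                           "Preserve user preferences, decisions, and open tasks.\n\n"
                           f"Current summary:\n{summary}\n\nNew turns:\n{transcript}",
            }],
        )
        summary = response.choices[0].message.content
    return summary, recent
</syntaxhighlight>

On each request, the updated summary is sent once (for example, inside the system prompt) and only the recent turns are included verbatim, keeping prompt size roughly constant as the conversation grows.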
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Memory Strategy Comparison
! Strategy !! Max Coverage !! Recall Quality !! Cost !! Latency
|-
| Full context (128k) || 128k tokens || Good, but "lost in the middle" || High || High
|-
| RAG (retrieval) || Unlimited || Good for specific facts || Low || Medium
|-
| Rolling summary || Unlimited (lossy) || Good for continuity || Low || Low
|-
| Map-reduce over chunks || Unlimited || Good for aggregation || Medium || High
|-
| KV cache prefix reuse || Fixed (system prompt) || Perfect || Low || Very low
|}

'''Failure modes''':
* Lost-in-the-middle information loss for crucial facts buried in long documents.
* Context window overflow causing silent truncation (older messages silently dropped).
* KV cache memory exhaustion with very long contexts (hundreds of GB at 1M tokens).
* Retrieval misses in RAG when the query doesn't semantically match the relevant chunks.
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Long-context evaluation:
# '''NIAH (Needle in a Haystack)''': place a specific fact at various positions (0%, 25%, 50%, 75%, 100%) in a long document and measure retrieval accuracy per position. Reveals "lost in the middle" failure modes.
# '''RULER''': multi-hop retrieval and aggregation tasks.
# '''LongBench''': diverse long-context tasks (QA, summarization, code).
# '''Practical evaluation''': measure end-task performance on actual long documents from the target domain.
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing an agent memory system:
# '''Short-term''': keep the last N = 20 turns in the context.
# '''Episodic''': embed and store every conversation turn in a vector DB (ChromaDB/Weaviate); retrieve the top-5 relevant past turns per new query (see the sketch after this section).
# '''Summary''': after every 20 turns, generate a running summary (user preferences, ongoing tasks, established facts) and include it in every context.
# '''Long-document''': for per-session documents (contracts, manuals), use semantic chunking plus RAG rather than stuffing the full document into context.
# '''Freshness''': tag memories with timestamps; decay the relevance of old memories in retrieval scoring.

[[Category:Artificial Intelligence]]
[[Category:Large Language Models]]
[[Category:Memory]]
</div>
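As a concrete starting point, the episodic layer (steps 2 and 5 of the design above) could look like the following sketch. It relies on ChromaDB's default embedding function; the collection name, over-fetch factor, and recency-decay formula are illustrative assumptions rather than a fixed recipe:

<syntaxhighlight lang="python">
import time
import chromadb

# Sketch of the episodic memory layer only: store turns with timestamps,
# retrieve by semantic similarity, and down-weight older memories.
chroma_client = chromadb.Client()
memory = chroma_client.get_or_create_collection("episodic_memory")

def remember(turn_id: str, role: str, text: str) -> None:
    """Store one conversation turn, stamped with the current time for later decay."""
    memory.add(documents=[text], ids=[turn_id],
               metadatas=[{"role": role, "ts": time.time()}])

def recall(query: str, k: int = 5, half_life_days: float = 30.0) -> list[str]:
    """Retrieve semantically similar past turns, preferring fresher memories."""
    res = memory.query(query_texts=[query], n_results=k * 3)  # over-fetch, then re-rank
    scored = []
    for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
        age_days = (time.time() - meta["ts"]) / 86400
        freshness = 0.5 ** (age_days / half_life_days)  # exponential recency decay
        similarity = 1.0 / (1.0 + dist)                 # map distance to a similarity-like score
        scored.append((similarity * freshness, doc))
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:k]]
</syntaxhighlight>

The recalled turns would then be placed into the prompt alongside the running summary from step 3, with the most recent turns kept verbatim as in step 1.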