Long Context Memory


How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Long context and memory in large language models concern how AI systems process, retain, and effectively use information across very long documents, conversations, and tasks. While early LLMs handled only 512–2048 tokens, modern models support contexts of 128k to over 1M tokens. Yet simply having a large context window doesn't mean the model effectively uses all the information in it. Memory in AI encompasses short-term context-window memory, retrieval-augmented memory via vector databases, and the emerging frontier of persistent agent memory across sessions.

Remembering

  • Context window — The maximum number of tokens an LLM can process in a single forward pass.
  • Long context — Context windows exceeding 32k tokens, enabling processing of long documents, books, and extended conversations.
  • KV cache — Key-Value cache storing attention keys and values for all processed tokens; grows linearly with context length.
  • Lost in the Middle — An empirical finding that LLMs perform worse at retrieving information from the middle of long contexts vs. the beginning/end.
  • Needle in a Haystack (NIAH) — A benchmark hiding a specific fact in a long document and asking the model to retrieve it; tests effective context utilization.
  • RULER — A more comprehensive long-context benchmark covering multi-hop retrieval, aggregation, and ordering.
  • RoPE (Rotary Position Embedding) — A position encoding method that can be adapted to sequences longer than the training length via "context extension" techniques.
  • YaRN — A technique for extending RoPE-based models to longer contexts without full retraining.
  • Ring Attention — A distributed attention mechanism enabling near-infinite context by distributing KV cache across devices.
  • Sliding window attention — Restricts attention to a local window; efficient but loses long-range information (see the mask sketch after this list).
  • Retrieval-augmented memory — Augmenting model context with retrieved relevant chunks from external memory stores.
  • Episodic memory — Storing and retrieving specific past events or conversations, enabling persistent agent memory.
  • Working memory — The information currently held in the context window; limited by context length.
  • Compressive memory — Summarizing and compressing older context to extend effective memory beyond the raw context window.
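
To make the sliding-window pattern concrete, here is a minimal NumPy sketch of a sliding-window causal attention mask; the window size is a free parameter, not a value tied to any particular model:

<syntaxhighlight lang="python">
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """True where query position i may attend to key position j:
    causal (j <= i) and within the last `window` tokens (j > i - window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Each row shows which earlier tokens one query position can see.
print(sliding_window_causal_mask(6, 3).astype(int))
</syntaxhighlight>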

Understanding

As LLMs scale to longer context windows, two challenges emerge: efficiency and effective utilization.

The quadratic problem (efficiency): Standard self-attention is O(n²) in memory and compute for sequence length n. At 1M tokens, that is roughly 10^12 (one trillion) attention score computations per layer, which is intractable if done naively. Solutions include Flash Attention (IO-efficient exact attention), sliding window attention (local context only), Ring Attention (distributing the KV cache across devices), and sparse attention patterns.
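
To make the scaling concrete, here is a back-of-the-envelope sketch. The model shape below (80 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16) is an assumed example, not a specific released model:

<syntaxhighlight lang="python">
def attention_scores_per_layer(seq_len: int) -> int:
    """Naive self-attention computes one score per (query, key) pair."""
    return seq_len * seq_len

def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory for storing K and V for every token, layer, and KV head."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

seq_len = 1_000_000  # 1M-token context
print(f"{attention_scores_per_layer(seq_len):.2e} attention scores per layer")  # 1.00e+12
print(f"{kv_cache_bytes(seq_len, 80, 8, 128) / 1e9:.0f} GB of KV cache")        # ~328 GB
</syntaxhighlight>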

"Lost in the Middle" (effective utilization): Even when models can technically process long contexts, they fail to equally utilize all parts. Information at the beginning and end of the context is reliably recalled; information in the middle is often lost. This means a 128k context window doesn't deliver 128k worth of reliable memory — a critical limitation for applications relying on long-document comprehension.

Memory taxonomy for AI agents:

  • In-context (working memory): Everything in the current context window. Fast, perfect recall, limited by context length and cost.
  • Retrieval memory (episodic): An external vector database storing embeddings of past interactions; retrieved by semantic search when needed.
  • Parametric memory: Knowledge encoded in model weights during pre-training. Fixed; can't be updated without retraining.
  • Cache memory: Prefix KV cache reuse for repeated context (e.g., the system prompt); reduces latency and cost.

The solution stack: For most applications, the best "memory" architecture is RAG over a well-organized vector database, supplemented by a conversation summary in the context. True long-context models (Gemini 1.5 Pro 1M, Claude 3.5 Sonnet 200k) are valuable for tasks requiring holistic document understanding, not just retrieval.
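
A minimal sketch of that stack, assembling working memory from a rolling conversation summary plus retrieved chunks; the summary and the pre-ranked chunks are assumed inputs here, and no specific vector-database API is shown:

<syntaxhighlight lang="python">
def build_prompt(question: str, summary: str, retrieved_chunks: list[str],
                 max_context_tokens: int = 8000) -> str:
    """Assemble working memory: rolling summary first, then retrieved evidence."""
    parts = [f"Conversation summary so far:\n{summary}", "Relevant excerpts:"]
    budget = max_context_tokens
    for chunk in retrieved_chunks:      # assumed pre-ranked by relevance
        cost = len(chunk) // 4          # rough token estimate (~4 chars/token)
        if cost > budget:
            break
        parts.append(chunk)
        budget -= cost
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
</syntaxhighlight>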

Applying

Long-context document QA with map-reduce over chunks:

<syntaxhighlight lang="python">
from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")

def chunk_document(text: str, chunk_size: int = 8000, overlap: int = 500) -> list[str]:
    """Split a document into overlapping token chunks."""
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(enc.decode(chunk_tokens))
    return chunks

def answer_with_long_context(document: str, question: str) -> str:
    chunks = chunk_document(document)
    # Map phase: extract relevant info from each chunk
    relevant_excerpts = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": (
                f"From the following text, extract any information relevant to: '{question}'\n"
                f"Text: {chunk}\nIf nothing relevant, respond 'NONE'."
            )}]
        )
        excerpt = response.choices[0].message.content
        if excerpt and excerpt.strip().upper() != "NONE":
            relevant_excerpts.append(excerpt)
    # Reduce phase: synthesize the extracted excerpts into a final answer
    combined = "\n\n".join(relevant_excerpts)
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Based on these excerpts, answer: {question}\n\nExcerpts:\n{combined}"}]
    )
    return final.choices[0].message.content
</syntaxhighlight>

Long context strategy by use case
Full book comprehension → Gemini 1.5 Pro (1M tokens), Claude 3.5 Sonnet (200k)
Long-document QA → RAG with chunking + cross-encoder reranker
Multi-session agent memory → Conversation summary + vector DB (episodic memory)
Code repository understanding → Tree-sitter parsing + selective context, CodeGraph
Long conversations → Progressive summarization of older turns into a rolling summary (see the sketch below)
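
A minimal sketch of that last strategy, progressive summarization into a rolling summary; the OpenAI client and the wording of the condensation prompt are assumptions, not a fixed recipe:

<syntaxhighlight lang="python">
from openai import OpenAI

client = OpenAI()

def update_rolling_summary(summary: str, old_turns: list[dict]) -> str:
    """Fold older conversation turns into a compact running summary."""
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in old_turns)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (
            "Update this conversation summary with the new turns. Keep user "
            "preferences, ongoing tasks, and established facts; drop small talk.\n\n"
            f"Current summary:\n{summary}\n\nNew turns:\n{transcript}"
        )}]
    )
    return response.choices[0].message.content

def compact_history(summary: str, history: list[dict], keep_last: int = 20):
    """When history grows, summarize everything except the most recent turns."""
    if len(history) <= keep_last:
        return summary, history
    summary = update_rolling_summary(summary, history[:-keep_last])
    return summary, history[-keep_last:]
</syntaxhighlight>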

Analyzing

Memory Strategy Comparison

{| class="wikitable"
! Strategy !! Max coverage !! Recall quality !! Cost !! Latency
|-
| Full context (128k) || 128k tokens || Good, but "lost in the middle" || High || High
|-
| RAG (retrieval) || Unlimited || Good for specific facts || Low || Medium
|-
| Rolling summary || Unlimited (lossy) || Good for continuity || Low || Low
|-
| Map-reduce over chunks || Unlimited || Good for aggregation || Medium || High
|-
| KV cache prefix reuse || Fixed (system prompt) || Perfect || Low || Very low
|}

Failure modes:

  • Lost in the middle: crucial facts buried deep in long documents are not reliably recalled.
  • Context window overflow: older messages are silently truncated and dropped.
  • KV cache memory exhaustion: very long contexts require hundreds of GB of cache at 1M tokens.
  • Retrieval misses in RAG: the query doesn't semantically match the relevant chunks.
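
A small guard against the silent-truncation failure mode, counting tokens with tiktoken before sending a request; the 128k limit and the 4k output reserve are assumed parameters you would set per model:

<syntaxhighlight lang="python">
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def fit_messages(messages: list[dict], max_context: int = 128_000,
                 reserve_for_output: int = 4_000) -> list[dict]:
    """Drop the oldest turns explicitly instead of letting truncation happen silently."""
    def count(msgs):
        return sum(len(enc.encode(m["content"])) for m in msgs)

    kept = list(messages)
    while count(kept) > max_context - reserve_for_output and len(kept) > 2:
        # Keep messages[0] (assumed to be the system prompt); drop the oldest turn after it.
        del kept[1]
    return kept
</syntaxhighlight>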

Evaluating

Long-context evaluation:

  1. NIAH (Needle in a Haystack): place a specific fact at various positions (0%, 25%, 50%, 75%, 100%) in a long document; measure retrieval accuracy per position. Reveals "lost in the middle" failure modes (a minimal harness is sketched after this list).
  2. RULER: multi-hop retrieval, aggregation, and ordering tasks.
  3. LongBench: diverse long-context tasks (QA, summarization, code).
  4. Practical evaluation: measure end-task performance on actual long documents from the target domain.
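
A minimal needle-in-a-haystack harness, assuming the OpenAI client used earlier; the needle fact and the filler document are illustrative placeholders:

<syntaxhighlight lang="python">
from openai import OpenAI

client = OpenAI()

NEEDLE = "The secret project codename is BLUEBIRD."   # illustrative fact
QUESTION = "What is the secret project codename?"

def niah_accuracy(filler: str, depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, bool]:
    """Insert the needle at several relative depths and test retrieval at each."""
    results = {}
    for depth in depths:
        cut = int(len(filler) * depth)
        haystack = filler[:cut] + "\n" + NEEDLE + "\n" + filler[cut:]
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"{haystack}\n\nQuestion: {QUESTION}"}]
        )
        answer = response.choices[0].message.content
        results[depth] = "BLUEBIRD" in answer
    return results
</syntaxhighlight>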

Creating

Designing an agent memory system (a combined sketch follows the steps below):

  1. Short-term: keep last N=20 turns in context.
  2. Episodic: embed and store every conversation turn in a vector DB (ChromaDB/Weaviate); retrieve top-5 relevant past turns per new query.
  3. Summary: after every 20 turns, generate a running summary (user preferences, ongoing tasks, established facts); include in every context.
  4. Long-document: for per-session documents (contracts, manuals), use semantic chunking + RAG rather than stuffing the full document into context.
  5. Freshness: tag memories with timestamps; decay relevance of old memories in retrieval scoring.
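
A minimal sketch wiring steps 2 and 5 together with ChromaDB; the collection name, half-life, and combined scoring formula are illustrative assumptions rather than a fixed design:

<syntaxhighlight lang="python">
import time
import chromadb

chroma = chromadb.Client()                      # in-memory; use PersistentClient in practice
episodic = chroma.get_or_create_collection("episodic_memory")

def remember(turn_id: str, text: str) -> None:
    """Store a conversation turn as an episodic memory with a timestamp."""
    episodic.add(ids=[turn_id], documents=[text],
                 metadatas=[{"ts": time.time()}])

def recall(query: str, k: int = 5, half_life_days: float = 30.0) -> list[str]:
    """Retrieve semantically similar memories, re-scored by freshness (step 5)."""
    res = episodic.query(query_texts=[query], n_results=k * 3)
    now = time.time()
    scored = []
    for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0],
                               res["distances"][0]):
        age_days = (now - meta["ts"]) / 86_400
        decay = 0.5 ** (age_days / half_life_days)   # exponential freshness decay
        scored.append((decay / (1.0 + dist), doc))   # combine similarity and freshness
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]
</syntaxhighlight>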