Long Context Memory
== Understanding ==

As LLMs scale to longer context windows, two challenges emerge: '''efficiency''' and '''effective utilization'''.

'''The quadratic problem (efficiency)''': Standard self-attention is O(n²) in memory and compute for sequence length n. At 1M tokens, this would require roughly 1 trillion attention score computations per layer, which is infeasible done naively. Solutions include FlashAttention (IO-efficient exact attention), sliding-window attention (local context only), Ring Attention (distributing the sequence across devices), and sparse attention patterns.

'''"Lost in the middle" (effective utilization)''': Even when models can technically process long contexts, they do not utilize all parts equally. Information at the beginning and end of the context is reliably recalled; information in the middle is often missed. A 128k context window therefore does not deliver 128k tokens of reliable memory, a critical limitation for applications that rely on long-document comprehension.

'''Memory taxonomy for AI agents''':
* '''In-context (working memory)''': everything in the current context window. Fast, with perfect recall, but limited by context length and cost.
* '''Retrieval memory (episodic)''': an external vector database storing embeddings of past interactions, retrieved by semantic search when needed.
* '''Parametric memory''': knowledge encoded in model weights during pre-training; fixed, and cannot be updated without retraining.
* '''Cache memory''': prefix KV-cache reuse for repeated context (e.g., the system prompt), reducing latency and cost.

'''The solution stack''': For most applications, the best "memory" architecture is RAG over a well-organized vector database, supplemented by a running conversation summary kept in the context. True long-context models (Gemini 1.5 Pro at 1M tokens, Claude 3.5 Sonnet at 200k) are valuable for tasks requiring holistic document understanding, not just retrieval.
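To make the efficiency trade-off concrete, here is a minimal NumPy sketch of sliding-window attention: each query position attends only to the previous <code>window</code> keys, so the useful work is O(n·w) instead of O(n²). This is an illustration of the masking idea only; it is not how FlashAttention or production kernels are implemented (they avoid materializing the full score matrix entirely).

```python
import numpy as np

def sliding_window_attention(q, k, v, window=4):
    """Single-head causal attention where query i sees only keys j with
    i - window < j <= i. Illustrative: the full (n, n) score matrix is
    built here for clarity, then masked; real kernels never build it."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (n, n) raw scores
    pos = np.arange(n)
    causal = pos[None, :] <= pos[:, None]             # j <= i
    in_window = pos[None, :] > pos[:, None] - window  # j > i - window
    scores = np.where(causal & in_window, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage with random projections.
rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = sliding_window_attention(q, k, v, window=4)
```

Because position 0 can only attend to itself, its output is exactly <code>v[0]</code>, which is a handy sanity check for any windowed-attention implementation.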
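The "lost in the middle" effect suggests a mitigation when assembling retrieved documents into a prompt: place the most relevant items at the edges of the context and push the least relevant into the middle. A sketch of that reordering (the function name and alternating strategy here are illustrative choices, not a standard API):

```python
def edge_first_order(docs):
    """docs: list ordered most-relevant first. Returns a new ordering
    that alternates items between the front and the back, so the most
    relevant content sits at the beginning and end of the assembled
    context and the least relevant lands in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

ordered = edge_first_order(["d1", "d2", "d3", "d4", "d5"])
# d1 (most relevant) first, d2 last, d5 (least relevant) in the middle
```

This works around the recall dip rather than fixing it; the model still sees the same tokens, just arranged to match where long-context recall is strongest.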
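The retrieval-memory layer of the taxonomy can be sketched as a tiny in-process vector store: embed each past interaction, then answer queries by cosine similarity. This is a stand-in for a real vector database, and the <code>embed</code> method below is a hash-based toy projection, not a learned embedding model.

```python
import numpy as np

class EpisodicMemory:
    """Toy retrieval memory: stores (embedding, text) pairs and returns
    the top-k entries by cosine similarity to a query."""

    def __init__(self, dim=64):
        self.dim = dim
        self.vectors = []
        self.texts = []

    def embed(self, text):
        # Toy placeholder embedding: sum of deterministic random
        # projections, one per token. Shared tokens -> shared components,
        # so lexical overlap drives similarity. Not a real embedder.
        vec = np.zeros(self.dim)
        for tok in text.lower().split():
            tok_rng = np.random.default_rng(abs(hash(tok)) % (2**32))
            vec += tok_rng.standard_normal(self.dim)
        return vec / (np.linalg.norm(vec) + 1e-9)

    def add(self, text):
        self.vectors.append(self.embed(text))
        self.texts.append(text)

    def retrieve(self, query, k=2):
        qvec = self.embed(query)
        sims = np.array([v @ qvec for v in self.vectors])
        top = np.argsort(-sims)[:k]
        return [self.texts[i] for i in top]
```

In the full solution stack, <code>retrieve</code> results would be injected into the context window alongside a running conversation summary, turning unbounded history into a bounded prompt.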