Long Context Memory
== Understanding ==

As LLMs scale to longer context windows, two challenges emerge: '''efficiency''' and '''effective utilization'''.

'''The quadratic problem (efficiency)''': Standard self-attention is O(n²) in memory and compute for sequence length n. At 1M tokens, this would require roughly 1 trillion attention score computations per layer, which is infeasible done naively. Solutions include FlashAttention (IO-efficient exact attention), sliding-window attention (local context only), Ring Attention (distributing the sequence across devices), and sparse attention patterns.

'''"Lost in the middle" (effective utilization)''': Even when models can technically process long contexts, they do not utilize all parts equally. Information at the beginning and end of the context is reliably recalled; information in the middle is often missed. A 128k context window therefore does not deliver 128k tokens of reliable memory, a critical limitation for applications that rely on long-document comprehension.

'''Memory taxonomy for AI agents''':
* '''In-context (working memory)''': everything in the current context window. Fast, with perfect recall, but limited by context length and cost.
* '''Retrieval memory (episodic)''': an external vector database storing embeddings of past interactions, retrieved by semantic search when needed.
* '''Parametric memory''': knowledge encoded in model weights during pre-training; fixed, and cannot be updated without retraining.
* '''Cache memory''': prefix KV-cache reuse for repeated context (e.g., the system prompt), reducing latency and cost.

'''The solution stack''': For most applications, the best "memory" architecture is RAG over a well-organized vector database, supplemented by a running conversation summary kept in the context. True long-context models (Gemini 1.5 Pro at 1M tokens, Claude 3.5 Sonnet at 200k) are valuable for tasks requiring holistic document understanding, not just retrieval.
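To make the efficiency trade-off concrete, here is a minimal NumPy sketch of sliding-window attention: each query position attends only to the previous <code>window</code> keys, so the useful work is O(n·w) instead of O(n²). This is an illustration of the masking idea only; it is not how FlashAttention or production kernels are implemented (they avoid materializing the full score matrix entirely).

```python
import numpy as np

def sliding_window_attention(q, k, v, window=4):
    """Single-head causal attention where query i sees only keys j with
    i - window < j <= i. Illustrative: the full (n, n) score matrix is
    built here for clarity, then masked; real kernels never build it."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (n, n) raw scores
    pos = np.arange(n)
    causal = pos[None, :] <= pos[:, None]             # j <= i
    in_window = pos[None, :] > pos[:, None] - window  # j > i - window
    scores = np.where(causal & in_window, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage with random projections.
rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = sliding_window_attention(q, k, v, window=4)
```

Because position 0 can only attend to itself, its output is exactly <code>v[0]</code>, which is a handy sanity check for any windowed-attention implementation.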
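The "lost in the middle" effect suggests a mitigation when assembling retrieved documents into a prompt: place the most relevant items at the edges of the context and push the least relevant into the middle. A sketch of that reordering (the function name and alternating strategy here are illustrative choices, not a standard API):

```python
def edge_first_order(docs):
    """docs: list ordered most-relevant first. Returns a new ordering
    that alternates items between the front and the back, so the most
    relevant content sits at the beginning and end of the assembled
    context and the least relevant lands in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

ordered = edge_first_order(["d1", "d2", "d3", "d4", "d5"])
# d1 (most relevant) first, d2 last, d5 (least relevant) in the middle
```

This works around the recall dip rather than fixing it; the model still sees the same tokens, just arranged to match where long-context recall is strongest.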
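The retrieval-memory layer of the taxonomy can be sketched as a tiny in-process vector store: embed each past interaction, then answer queries by cosine similarity. This is a stand-in for a real vector database, and the <code>embed</code> method below is a hash-based toy projection, not a learned embedding model.

```python
import numpy as np

class EpisodicMemory:
    """Toy retrieval memory: stores (embedding, text) pairs and returns
    the top-k entries by cosine similarity to a query."""

    def __init__(self, dim=64):
        self.dim = dim
        self.vectors = []
        self.texts = []

    def embed(self, text):
        # Toy placeholder embedding: sum of deterministic random
        # projections, one per token. Shared tokens -> shared components,
        # so lexical overlap drives similarity. Not a real embedder.
        vec = np.zeros(self.dim)
        for tok in text.lower().split():
            tok_rng = np.random.default_rng(abs(hash(tok)) % (2**32))
            vec += tok_rng.standard_normal(self.dim)
        return vec / (np.linalg.norm(vec) + 1e-9)

    def add(self, text):
        self.vectors.append(self.embed(text))
        self.texts.append(text)

    def retrieve(self, query, k=2):
        qvec = self.embed(query)
        sims = np.array([v @ qvec for v in self.vectors])
        top = np.argsort(-sims)[:k]
        return [self.texts[i] for i in top]
```

In the full solution stack, <code>retrieve</code> results would be injected into the context window alongside a running conversation summary, turning unbounded history into a bounded prompt.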