Information Retrieval
Latest revision as of 01:52, 25 April 2026
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
Information retrieval (IR) is the science of finding relevant information from large collections in response to user queries. It is the technology behind search engines, document management systems, semantic search, legal discovery, and increasingly the retrieval component of RAG (Retrieval-Augmented Generation) systems. Modern IR has evolved from Boolean keyword matching through statistical TF-IDF methods to dense neural retrieval using transformer-based embeddings, enabling semantic understanding that goes far beyond keyword overlap.
Remembering
- Query — A user's information need expressed as text (keywords, natural language question, or structured expression).
- Document — A unit of retrievable information: web page, paragraph, PDF, database record.
- Relevance — The degree to which a retrieved document satisfies the user's information need.
- Inverted index — A data structure mapping terms to the documents containing them; the backbone of all keyword-based IR.
- TF-IDF (Term Frequency-Inverse Document Frequency) — A weighting scheme that ranks terms by their frequency in a document (TF) discounted by their commonness across all documents (IDF).
- BM25 (Best Match 25) — A probabilistic retrieval function that extends TF-IDF with document length normalization; the dominant lexical retrieval algorithm.
- Dense retrieval — Retrieving documents by computing similarity between dense vector embeddings of query and documents (vs. sparse keyword matching).
- Sparse retrieval — Retrieval based on term overlap (BM25, TF-IDF); fast and exact but lacks semantic understanding.
- Embedding — A dense vector representation of text enabling semantic similarity search.
- Bi-encoder — A retrieval model with separate encoders for query and document; enables fast approximate nearest neighbor search.
- Cross-encoder — A model that jointly encodes query + document for highly accurate relevance scoring; too slow for retrieval, used for re-ranking.
- FAISS — Facebook's library for efficient approximate nearest neighbor search in high-dimensional embedding spaces.
- Hybrid retrieval — Combining dense and sparse retrieval (e.g., BM25 + dense vectors) for better coverage.
- Re-ranking — Applying a more accurate (but slower) model to re-order an initial set of retrieved candidates.
- BEIR (Benchmarking Information Retrieval) — A heterogeneous benchmark suite for evaluating zero-shot dense retrieval across diverse domains.
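The inverted index and TF-IDF weighting defined above can be sketched in a few lines of plain Python. The toy corpus and helper names here are illustrative, not from any particular library:

```python
import math
from collections import defaultdict

docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "dogs and cats make good pets",
}

# Build an inverted index: term -> set of doc ids containing that term
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def tf_idf(term: str, doc_id: int) -> float:
    """TF (frequency in this doc) discounted by IDF (rarity across all docs)."""
    words = docs[doc_id].split()
    tf = words.count(term) / len(words)
    idf = math.log(len(docs) / len(index[term]))
    return tf * idf

print(index["cat"])      # the doc ids whose text contains the exact token "cat"
print(tf_idf("cat", 1))  # higher for rare terms, zero for terms in every doc
```

Note that `index["cat"]` does not include document 3: its token is "cats", a first glimpse of the vocabulary-mismatch problem discussed below.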
Understanding
The fundamental challenge of information retrieval: the user's information need and the documents that satisfy it may use entirely different words. "How do I fix a flat tire?" and "Procedures for replacing a deflated automotive tire" have zero word overlap but are semantically equivalent. Bridging this vocabulary mismatch is the central challenge.
Lexical retrieval (BM25): Count and weight term occurrences. Fast (inverted index lookup), handles exact matches perfectly, but fails on synonyms, paraphrases, and semantic equivalence. BM25 is the baseline that all modern systems must beat.
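A minimal sketch of the BM25 scoring function itself, using the common Okapi formulation with typical default parameters (k1 = 1.5, b = 0.75) and naive whitespace tokenization; a production system would use an inverted index rather than scanning the corpus per term:

```python
import math

def bm25_score(query: str, doc: str, corpus: list, k1: float = 1.5, b: float = 0.75) -> float:
    """Okapi BM25: TF saturated by k1, document length normalized by b."""
    N = len(corpus)
    avgdl = sum(len(d.split()) for d in corpus) / N
    words = doc.split()
    score = 0.0
    for term in query.split():
        tf = words.count(term)
        df = sum(1 for d in corpus if term in d.split())
        if tf == 0 or df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(words) / avgdl))
    return score

corpus = ["fix a flat tire", "replace a deflated automotive tire", "bake a cake"]
print(bm25_score("flat tire", corpus[0], corpus))  # both query terms match: highest score
print(bm25_score("flat tire", corpus[1], corpus))  # only "tire" matches: lower score
print(bm25_score("flat tire", corpus[2], corpus))  # no term overlap: score is 0
```

The third document scores exactly zero despite any semantic relation it might have had, which is precisely the lexical-matching failure mode dense retrieval addresses.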
Dense retrieval (DPR, E5, BGE): Encode query and documents into dense vectors using a transformer model trained for retrieval. Retrieve by finding vectors close to the query vector (cosine similarity or dot product). Understands semantics but slower to index, requires approximate nearest neighbor search for scale, and may miss exact keyword matches.
Hybrid retrieval: BM25 handles exact keyword matches; dense retrieval handles semantic similarity. Combining both (e.g., Reciprocal Rank Fusion of BM25 and dense results) outperforms either alone.
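Reciprocal Rank Fusion needs only the two ranked lists, not their raw scores, which is why it is a popular way to combine BM25 and dense results. A sketch, with k = 60 (the constant commonly used) and hypothetical document ids:

```python
def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Each ranking contributes 1 / (k + rank) to a document's fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d2"]   # hypothetical lexical ranking
dense_ranking = ["d1", "d4", "d3"]  # hypothetical dense ranking
print(rrf_fuse([bm25_ranking, dense_ranking]))  # → ['d1', 'd3', 'd4', 'd2']
```

d1 wins because it ranks highly in both lists; documents retrieved by only one system (d2, d4) still appear, just lower down.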
The re-ranking pipeline:
- Retrieval (top-1000): BM25 or dense retrieval. Fast, high recall.
- Re-ranking (top-10): Cross-encoder scores all 1000 candidates. Slow but highly accurate. This two-stage pipeline balances speed and accuracy.
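The two-stage shape of this pipeline can be sketched with placeholder scorers; the word-overlap and Jaccard functions below merely stand in for a real BM25/bi-encoder first stage and a cross-encoder second stage:

```python
def cheap_score(query: str, doc: str) -> float:
    """Stand-in first-stage scorer (word overlap); really BM25 or a bi-encoder."""
    return len(set(query.split()) & set(doc.split()))

def expensive_score(query: str, doc: str) -> float:
    """Stand-in re-ranker (Jaccard); really a cross-encoder over (query, doc)."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

def retrieve_then_rerank(query: str, corpus: list, n_candidates: int = 1000, top_k: int = 10) -> list:
    # Stage 1: fast scorer over the whole corpus, keep many candidates (high recall)
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: slow, accurate scorer over the candidates only (high precision)
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)[:top_k]

corpus = ["neural networks learn", "cats purr", "deep neural networks are networks"]
print(retrieve_then_rerank("neural networks", corpus, n_candidates=2, top_k=1))
```

The expensive scorer runs on only `n_candidates` documents instead of the whole corpus, which is the entire point of the two-stage design.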
Neural IR in RAG: In retrieval-augmented generation, the IR component retrieves relevant document chunks that are passed to an LLM for response generation. The quality of retrieval directly determines the quality of the generated answer — bad retrieval = bad generation, even with a perfect LLM.
Applying
Hybrid BM25 + dense retrieval pipeline:
<syntaxhighlight lang="python">
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import numpy as np

docs = [
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are computational models inspired by the brain.",
    "Gradient descent optimizes model parameters by following the loss gradient.",
    "Transformers use self-attention mechanisms to process sequences.",
]
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)

def hybrid_retrieve(query: str, top_k: int = 3, alpha: float = 0.5) -> list:
    """alpha: weight for dense scores (1 - alpha for BM25)."""
    # BM25 scores (sparse), min-max normalized so they are comparable to dense scores
    bm25_scores = np.array(bm25.get_scores(query.lower().split()))
    bm25_norm = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-10)
    # Dense scores (cosine similarity, since embeddings are normalized)
    q_emb = embedder.encode(query, normalize_embeddings=True)
    dense_scores = util.dot_score(q_emb, doc_embeddings).numpy().flatten()
    dense_norm = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-10)
    # Weighted combination of the two normalized score vectors
    combined = alpha * dense_norm + (1 - alpha) * bm25_norm
    top_idx = combined.argsort()[-top_k:][::-1]
    return [(docs[i], combined[i]) for i in top_idx]

results = hybrid_retrieve("How do neural networks learn?")
for doc, score in results:
    print(f"Score: {score:.3f} | {doc}")
</syntaxhighlight>