Information Retrieval and Neural Search
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;"> {{BloomIntro}} Information retrieval (IR) is the science of finding relevant information from large collections in response to user queries. It is the technology behind search engines, document management systems, semantic search, legal discovery, and increasingly the retrieval component of RAG (Retrieval-Augmented Generation) systems. Modern IR has evolved from Boolean keyword matching through statistical TF-IDF methods to dense neural retrieval using transformer-based embeddings, enabling semantic understanding that goes far beyond keyword overlap. </div> __TOC__ <div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;"> == <span style="color: #FFFFFF;">Remembering</span> == * '''Query''' β A user's information need expressed as text (keywords, natural language question, or structured expression). * '''Document''' β A unit of retrievable information: web page, paragraph, PDF, database record. * '''Relevance''' β The degree to which a retrieved document satisfies the user's information need. * '''Inverted index''' β A data structure mapping terms to the documents containing them; the backbone of all keyword-based IR. * '''TF-IDF (Term Frequency-Inverse Document Frequency)''' β A weighting scheme that ranks terms by their frequency in a document (TF) discounted by their commonness across all documents (IDF). * '''BM25 (Best Match 25)''' β A probabilistic retrieval function that extends TF-IDF with document length normalization; the dominant lexical retrieval algorithm. * '''Dense retrieval''' β Retrieving documents by computing similarity between dense vector embeddings of query and documents (vs. sparse keyword matching). * '''Sparse retrieval''' β Retrieval based on term overlap (BM25, TF-IDF); fast and exact but lacks semantic understanding. * '''Embedding''' β A dense vector representation of text enabling semantic similarity search. * '''Bi-encoder''' β A retrieval model with separate encoders for query and document; enables fast approximate nearest neighbor search. * '''Cross-encoder''' β A model that jointly encodes query + document for highly accurate relevance scoring; too slow for retrieval, used for re-ranking. * '''FAISS''' β Facebook's library for efficient approximate nearest neighbor search in high-dimensional embedding spaces. * '''Hybrid retrieval''' β Combining dense and sparse retrieval (e.g., BM25 + dense vectors) for better coverage. * '''Re-ranking''' β Applying a more accurate (but slower) model to re-order an initial set of retrieved candidates. * '''BEIR (Benchmarking Information Retrieval)''' β A heterogeneous benchmark suite for evaluating zero-shot dense retrieval across diverse domains. </div> <div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;"> == <span style="color: #FFFFFF;">Understanding</span> == The fundamental challenge of information retrieval: the user's information need and the documents that satisfy it may use entirely different words. "How do I fix a flat tire?" and "Procedures for replacing a deflated automotive tire" have zero word overlap but are semantically equivalent. Bridging this vocabulary mismatch is the central challenge. **Lexical retrieval** (BM25): Count and weight term occurrences. Fast (inverted index lookup), handles exact matches perfectly, but fails on synonyms, paraphrases, and semantic equivalence. 
BM25 is the baseline that all modern systems must beat.

'''Dense retrieval''' (DPR, E5, BGE): Encode the query and the documents into dense vectors using a transformer model trained for retrieval, then retrieve by finding the document vectors closest to the query vector (cosine similarity or dot product). It understands semantics, but it is slower to index, requires approximate nearest neighbor search at scale, and may miss exact keyword matches.

'''Hybrid retrieval''': BM25 handles exact keyword matches; dense retrieval handles semantic similarity. Combining both (e.g., Reciprocal Rank Fusion of BM25 and dense results) outperforms either alone.

'''The re-ranking pipeline''': (1) '''Retrieval''' (top-1000): BM25 or dense retrieval; fast, with high recall. (2) '''Re-ranking''' (top-10): a cross-encoder scores all 1000 candidates; slow but highly accurate. This two-stage pipeline balances speed and accuracy.

'''Neural IR in RAG''': In retrieval-augmented generation, the IR component retrieves relevant document chunks that are passed to an LLM for response generation. The quality of retrieval directly determines the quality of the generated answer – bad retrieval means bad generation, even with a perfect LLM.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Hybrid BM25 + dense retrieval pipeline:'''
<syntaxhighlight lang="python">
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import numpy as np

docs = [
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are computational models inspired by the brain.",
    "Gradient descent optimizes model parameters by following the loss gradient.",
    "Transformers use self-attention mechanisms to process sequences.",
]

# Sparse index: BM25 over lowercased, whitespace-tokenized documents
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

# Dense index: normalized embeddings from a retrieval-tuned encoder
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)

def hybrid_retrieve(query: str, top_k: int = 3, alpha: float = 0.5) -> list:
    """alpha: weight for dense scores (1 - alpha for BM25)."""
    # BM25 scores (sparse), min-max normalized so the two scales are comparable
    bm25_scores = np.array(bm25.get_scores(query.lower().split()))
    bm25_norm = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-10)
    # Dense scores: dot product of normalized embeddings = cosine similarity
    q_emb = embedder.encode(query, normalize_embeddings=True)
    dense_scores = util.dot_score(q_emb, doc_embeddings).numpy().flatten()
    dense_norm = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-10)
    # Weighted combination of the two normalized score vectors
    combined = alpha * dense_norm + (1 - alpha) * bm25_norm
    top_idx = combined.argsort()[-top_k:][::-1]
    return [(docs[i], combined[i]) for i in top_idx]

results = hybrid_retrieve("How do neural networks learn?")
for doc, score in results:
    print(f"Score: {score:.3f} | {doc}")
</syntaxhighlight>
</div>
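<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
'''Cross-encoder re-ranking (second stage):''' a minimal sketch of the re-ranking step described under Understanding, meant to run after the hybrid pipeline above and reusing its <code>hybrid_retrieve</code> and corpus. The <code>cross-encoder/ms-marco-MiniLM-L-6-v2</code> checkpoint is one widely used public re-ranker, named here for illustration; any cross-encoder would do:
<syntaxhighlight lang="python">
from sentence_transformers import CrossEncoder

# Stage 2: jointly encode (query, candidate) pairs for accurate relevance scores.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list, top_k: int = 3) -> list:
    """Re-order first-stage candidates by cross-encoder relevance score."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)  # one relevance score per (query, doc) pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Stage 1 casts a wide net (high recall); stage 2 sharpens the order (high precision).
query = "How do neural networks learn?"
candidates = [doc for doc, _ in hybrid_retrieve(query, top_k=4)]
for doc, score in rerank(query, candidates):
    print(f"{score:7.3f} | {doc}")
</syntaxhighlight>
At realistic scale the first stage returns hundreds of candidates rather than four; the cross-encoder's cost grows linearly with that candidate count, which is why it is reserved for re-ranking rather than first-stage retrieval.
</div>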