Information Retrieval and Neural Search
<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;"> {{BloomIntro}} Information retrieval (IR) is the science of finding relevant information from large collections in response to user queries. It is the technology behind search engines, document management systems, semantic search, legal discovery, and increasingly the retrieval component of RAG (Retrieval-Augmented Generation) systems. Modern IR has evolved from Boolean keyword matching through statistical TF-IDF methods to dense neural retrieval using transformer-based embeddings, enabling semantic understanding that goes far beyond keyword overlap. </div> __TOC__ <div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;"> == <span style="color: #FFFFFF;">Remembering</span> == * '''Query''' β A user's information need expressed as text (keywords, natural language question, or structured expression). * '''Document''' β A unit of retrievable information: web page, paragraph, PDF, database record. * '''Relevance''' β The degree to which a retrieved document satisfies the user's information need. * '''Inverted index''' β A data structure mapping terms to the documents containing them; the backbone of all keyword-based IR. * '''TF-IDF (Term Frequency-Inverse Document Frequency)''' β A weighting scheme that ranks terms by their frequency in a document (TF) discounted by their commonness across all documents (IDF). * '''BM25 (Best Match 25)''' β A probabilistic retrieval function that extends TF-IDF with document length normalization; the dominant lexical retrieval algorithm. * '''Dense retrieval''' β Retrieving documents by computing similarity between dense vector embeddings of query and documents (vs. sparse keyword matching). * '''Sparse retrieval''' β Retrieval based on term overlap (BM25, TF-IDF); fast and exact but lacks semantic understanding. * '''Embedding''' β A dense vector representation of text enabling semantic similarity search. * '''Bi-encoder''' β A retrieval model with separate encoders for query and document; enables fast approximate nearest neighbor search. * '''Cross-encoder''' β A model that jointly encodes query + document for highly accurate relevance scoring; too slow for retrieval, used for re-ranking. * '''FAISS''' β Facebook's library for efficient approximate nearest neighbor search in high-dimensional embedding spaces. * '''Hybrid retrieval''' β Combining dense and sparse retrieval (e.g., BM25 + dense vectors) for better coverage. * '''Re-ranking''' β Applying a more accurate (but slower) model to re-order an initial set of retrieved candidates. * '''BEIR (Benchmarking Information Retrieval)''' β A heterogeneous benchmark suite for evaluating zero-shot dense retrieval across diverse domains. </div> <div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;"> == <span style="color: #FFFFFF;">Understanding</span> == The fundamental challenge of information retrieval: the user's information need and the documents that satisfy it may use entirely different words. "How do I fix a flat tire?" and "Procedures for replacing a deflated automotive tire" have zero word overlap but are semantically equivalent. Bridging this vocabulary mismatch is the central challenge. **Lexical retrieval** (BM25): Count and weight term occurrences. Fast (inverted index lookup), handles exact matches perfectly, but fails on synonyms, paraphrases, and semantic equivalence. 
BM25 is the baseline that all modern systems must beat.

'''Dense retrieval''' (DPR, E5, BGE): Encode the query and the documents into dense vectors using a transformer model trained for retrieval, then retrieve by finding the document vectors closest to the query vector (cosine similarity or dot product). It understands semantics, but it is slower to index, requires approximate nearest neighbor search at scale, and may miss exact keyword matches.

'''Hybrid retrieval''': BM25 handles exact keyword matches; dense retrieval handles semantic similarity. Combining both (e.g., Reciprocal Rank Fusion of BM25 and dense results) outperforms either alone.

'''The re-ranking pipeline''': (1) '''Retrieval''' (top-1000): BM25 or dense retrieval; fast, with high recall. (2) '''Re-ranking''' (top-10): a cross-encoder scores all 1000 candidates; slow but highly accurate. This two-stage pipeline balances speed and accuracy.

'''Neural IR in RAG''': In retrieval-augmented generation, the IR component retrieves relevant document chunks that are passed to an LLM for response generation. The quality of retrieval directly determines the quality of the generated answer – bad retrieval means bad generation, even with a perfect LLM.
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Hybrid BM25 + dense retrieval pipeline:'''
<syntaxhighlight lang="python">
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import numpy as np

docs = [
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are computational models inspired by the brain.",
    "Gradient descent optimizes model parameters by following the loss gradient.",
    "Transformers use self-attention mechanisms to process sequences.",
]

# Sparse index: BM25 over lowercased, whitespace-tokenized documents
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

# Dense index: normalized embeddings from a retrieval-tuned encoder
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)

def hybrid_retrieve(query: str, top_k: int = 3, alpha: float = 0.5) -> list:
    """alpha: weight for dense scores (1 - alpha for BM25)."""
    # BM25 scores (sparse), min-max normalized so the two scales are comparable
    bm25_scores = np.array(bm25.get_scores(query.lower().split()))
    bm25_norm = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-10)
    # Dense scores: dot product of normalized embeddings = cosine similarity
    q_emb = embedder.encode(query, normalize_embeddings=True)
    dense_scores = util.dot_score(q_emb, doc_embeddings).numpy().flatten()
    dense_norm = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-10)
    # Weighted combination of the two normalized score vectors
    combined = alpha * dense_norm + (1 - alpha) * bm25_norm
    top_idx = combined.argsort()[-top_k:][::-1]
    return [(docs[i], combined[i]) for i in top_idx]

results = hybrid_retrieve("How do neural networks learn?")
for doc, score in results:
    print(f"Score: {score:.3f} | {doc}")
</syntaxhighlight>
</div>
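<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
'''Cross-encoder re-ranking (second stage):''' a minimal sketch of the re-ranking step described under Understanding, meant to run after the hybrid pipeline above and reusing its <code>hybrid_retrieve</code> and corpus. The <code>cross-encoder/ms-marco-MiniLM-L-6-v2</code> checkpoint is one widely used public re-ranker, named here for illustration; any cross-encoder would do:
<syntaxhighlight lang="python">
from sentence_transformers import CrossEncoder

# Stage 2: jointly encode (query, candidate) pairs for accurate relevance scores.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list, top_k: int = 3) -> list:
    """Re-order first-stage candidates by cross-encoder relevance score."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)  # one relevance score per (query, doc) pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Stage 1 casts a wide net (high recall); stage 2 sharpens the order (high precision).
query = "How do neural networks learn?"
candidates = [doc for doc, _ in hybrid_retrieve(query, top_k=4)]
for doc, score in rerank(query, candidates):
    print(f"{score:7.3f} | {doc}")
</syntaxhighlight>
At realistic scale the first stage returns hundreds of candidates rather than four; the cross-encoder's cost grows linearly with that candidate count, which is why it is reserved for re-ranking rather than first-stage retrieval.
</div>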