Embeddings and Vector Databases

How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.

Embeddings and vector databases are the foundational infrastructure of modern semantic AI applications. An embedding is a dense numerical vector that represents the meaning of text, images, audio, or other data in a high-dimensional space, where semantically similar items are geometrically close together. Vector databases store these embeddings and enable fast similarity search at scale. Together, they power semantic search, RAG systems, recommendation engines, duplicate detection, and anomaly detection.

Remembering

  • Embedding — A dense vector of real numbers representing the meaning of a piece of data. Similar items have vectors that are close together in the embedding space.
  • Embedding model — A neural network trained to produce embeddings. Examples: text-embedding-3-small (OpenAI), BGE-M3 (BAAI), all-MiniLM-L6-v2 (Sentence Transformers).
  • Dimensionality — The number of values in an embedding vector. Common sizes: 384, 768, 1536, 3072. Higher dimensions can capture more nuance but require more storage and compute.
  • Semantic similarity — The degree to which two items mean the same thing, encoded as the geometric distance between their embeddings.
  • Cosine similarity — The most common similarity metric for embeddings; measures the angle between two vectors. Values range from -1 (opposite) to 1 (identical).
  • Dot product — An alternative similarity metric; equivalent to cosine similarity when vectors are normalized.
  • L2 distance (Euclidean) — The straight-line distance between two vectors; used in some retrieval scenarios.
  • Vector database — A database optimized for storing embedding vectors and performing fast approximate nearest neighbor (ANN) search. Examples: Pinecone, Weaviate, Chroma, Qdrant, Milvus, pgvector.
  • ANN (Approximate Nearest Neighbor) — An algorithm that finds vectors approximately close to a query vector very quickly (sacrificing exact precision for speed).
  • HNSW (Hierarchical Navigable Small World) — The most widely used ANN index structure, offering excellent speed-recall trade-offs.
  • Metadata filtering — Restricting vector search results to items matching certain criteria (e.g., only articles from 2024, only products in category "electronics").
  • Bi-encoder — A model that encodes queries and documents independently into embedding space, enabling fast retrieval (e.g., Sentence-BERT).
  • Cross-encoder — A model that takes a query-document pair as input and outputs a relevance score; more accurate than a bi-encoder but much slower (used for reranking).
  • Chunking — Splitting large documents into smaller pieces before embedding, since embedding models have token limits.

Understanding

The magic of embeddings is that they transform the hard problem of semantic similarity into simple geometric distance. After training on massive amounts of text data (or image-text pairs), embedding models learn to place words, sentences, and documents that mean similar things close together in a high-dimensional space.

The classic demonstration is that, in a good word embedding space:

  • "king" - "man" + "woman" ≈ "queen"
  • "Paris" - "France" + "Italy" ≈ "Rome"

This isn't hardcoded — it emerges from the statistical patterns of how words co-occur in language.
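
The arithmetic can be checked directly with pretrained word vectors. A minimal sketch via gensim's downloader (glove-wiki-gigaword-50 is one of gensim's hosted vector sets; the vectors are fetched on first run):

<syntaxhighlight lang="python">
import gensim.downloader as api

# Pretrained GloVe word vectors, downloaded on first use
glove = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Top result on this model: ('queen', ...)
</syntaxhighlight>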

Why not use keyword search? Keywords match exact strings. Semantic search understands meaning. A query for "cardiac event" will find documents about "heart attack" via embeddings; keyword search would miss this unless the exact phrase appears.

How vector databases work: Storing millions of embedding vectors and doing exact search (computing cosine similarity against every stored vector) would be too slow. ANN algorithms solve this by building smart index structures. HNSW (Hierarchical Navigable Small World) builds a layered graph where each layer is a sparser approximation of the dense lower layer — like a highway system where you first navigate between cities (coarse layer) then between neighborhoods (fine layer). This achieves sub-millisecond query times on millions of vectors.
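
The parameters governing this speed-recall trade-off are easy to see in hnswlib, a standalone library implementing HNSW. A minimal sketch (parameter values are illustrative, not tuned):

<syntaxhighlight lang="python">
import numpy as np
import hnswlib

dim, num_elements = 128, 100_000
data = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# M = graph connectivity; ef_construction trades build time for index quality
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef = search-time candidate-list size; higher means better recall, slower queries
index.set_ef(50)
labels, distances = index.knn_query(data[:1], k=5)
print(labels, distances)  # the query vector itself comes back at distance ~0
</syntaxhighlight>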

Hybrid search combines vector (semantic) search with BM25 keyword search, typically using Reciprocal Rank Fusion (RRF) to merge the two ranked lists. In practice this often outperforms either approach alone, because different query types benefit from different retrieval mechanisms: exact identifiers and rare terms favor keyword matching, while paraphrases favor embeddings.
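
A minimal RRF sketch (k = 60 is the smoothing constant commonly used in the literature; the document ids are illustrative):

<syntaxhighlight lang="python">
def reciprocal_rank_fusion(ranked_lists, k=60):
    # Each retriever contributes 1 / (k + rank) per document; summing rewards
    # documents that rank well in several lists.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # from vector search
sparse = ["d1", "d9", "d3"]  # from BM25
print(reciprocal_rank_fusion([dense, sparse]))  # ['d1', 'd3', 'd9', 'd7']
</syntaxhighlight>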

Applying

Generating and storing embeddings with Sentence Transformers + Chroma:

<syntaxhighlight lang="python">
from sentence_transformers import SentenceTransformer
import chromadb

# Load embedding model
model = SentenceTransformer("BAAI/bge-m3")

# Sample documents
docs = [
    "Neural networks are the foundation of deep learning.",
    "The heart pumps blood through the circulatory system.",
    "Python is a popular programming language for data science.",
    "Transformers use self-attention mechanisms for NLP tasks.",
    "The mitochondria are the powerhouse of the cell.",
]

# Generate embeddings (normalized, so cosine similarity == dot product)
embeddings = model.encode(docs, normalize_embeddings=True)
print(f"Embedding shape: {embeddings.shape}")  # (5, 1024)

# Store in vector database
client = chromadb.Client()
collection = client.create_collection("knowledge_base")

collection.add(
    documents=docs,
    embeddings=embeddings.tolist(),
    ids=[f"doc_{i}" for i in range(len(docs))],
)

# Semantic search
query = "How do attention mechanisms work?"
query_embedding = model.encode([query], normalize_embeddings=True).tolist()

results = collection.query(
    query_embeddings=query_embedding,
    n_results=2,
)
print(results["documents"])
# [["Transformers use self-attention mechanisms for NLP tasks.",
#   "Neural networks are the foundation of deep learning."]]
</syntaxhighlight>
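
Metadata filtering (see Remembering) composes with the same query call. A hedged sketch, assuming the documents had been added with a metadatas list such as [{"year": 2024}, ...] (the field name is illustrative):

<syntaxhighlight lang="python">
# Restrict the semantic search to documents whose metadata matches the filter.
# Returns nothing unless documents were added with a matching "year" key.
results = collection.query(
    query_embeddings=query_embedding,
    n_results=2,
    where={"year": 2024},
)
</syntaxhighlight>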

Vector database selection guide:

  • Local/development → Chroma (in-memory, file-backed), FAISS (library)
  • Self-hosted production → Qdrant (Rust, great performance), Weaviate (rich features), Milvus (scale)
  • Managed cloud → Pinecone (simplest API), Weaviate Cloud, Zilliz Cloud
  • Existing PostgreSQL stack → pgvector extension (good for <10M vectors)
  • Multimodal (text + image) → Weaviate, Qdrant (both support multiple vector types)

Analyzing

Embedding Model Comparison

{| class="wikitable"
! Model !! Dimensions !! Speed !! Quality !! Cost
|-
| text-embedding-3-small (OpenAI) || 1536 || Fast (API) || Very high || Paid per token
|-
| text-embedding-3-large (OpenAI) || 3072 || Fast (API) || Highest || More expensive
|-
| BAAI/BGE-M3 || 1024 || Moderate (local) || Very high || Free (self-hosted)
|-
| all-MiniLM-L6-v2 || 384 || Very fast (local) || Good || Free (self-hosted)
|-
| E5-mistral-7b || 4096 || Slow (large model) || Excellent || Free (GPU needed)
|}

Failure modes and pitfalls:

  • Index/query model mismatch — The model used to embed queries must be identical to the one used to index documents. Mixing models produces nonsensical similarity scores.
  • Chunking artifacts — Important context split across chunks leads to poor retrieval. If an answer spans two chunks, neither may score high enough to be retrieved (a common mitigation, overlapping chunks, is sketched after this list).
  • Embedding stale data — If documents are updated but re-embedding is not triggered, the index serves outdated information. Implement change detection and incremental re-indexing.
  • Dimensionality curse — In very high dimensions, all vectors tend to become equidistant from each other, degrading nearest-neighbor search quality. Use models with well-calibrated dimensionalities.
  • Semantic gap — Embeddings capture distributional semantics but may miss precise numerical facts, dates, or codes. Combine with structured filters or keyword search.
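
A common mitigation for chunking artifacts is to overlap adjacent chunks so boundary-spanning content appears whole in at least one chunk. A minimal sketch (words stand in for tokens; a real pipeline would count tokens with the embedding model's tokenizer):

<syntaxhighlight lang="python">
def chunk_with_overlap(text, chunk_size=200, overlap=50):
    # Slide a window of chunk_size words, advancing chunk_size - overlap words
    # at a time, so every boundary region is covered by two consecutive chunks.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
</syntaxhighlight>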

Evaluating

Expert practitioners evaluate embedding systems at multiple levels:

Embedding quality metrics:

  • MTEB (Massive Text Embedding Benchmark): The standard benchmark suite for text embeddings, covering retrieval, classification, clustering, semantic similarity, and more.
  • BEIR benchmark: Zero-shot retrieval across diverse domains — the true test of embedding generalization.

System-level retrieval metrics:

  • Recall@k and Precision@k on held-out query-document pairs
  • Mean Reciprocal Rank (MRR) for ranking quality (Recall@k and MRR are sketched in code after this list)
  • Query latency at p50/p95/p99 percentiles
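
A minimal computation sketch for Recall@k and MRR (retrieved ids come from whatever pipeline is under evaluation, relevant ids from the held-out labels):

<syntaxhighlight lang="python">
def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant documents that appear in the top k results.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(runs):
    # runs: one (retrieved_ids, relevant_ids) pair per query.
    total = 0.0
    for retrieved, relevant in runs:
        rank = next((i for i, d in enumerate(retrieved, start=1) if d in relevant), 0)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs)

print(recall_at_k(["d3", "d1", "d7"], ["d1", "d9"], k=3))  # 0.5
print(mean_reciprocal_rank([(["d3", "d1"], ["d1"])]))      # 0.5
</syntaxhighlight>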

Operational metrics:

  • Index build time and storage size (cost)
  • Query throughput (QPS) at target latency SLAs
  • Index freshness lag (time between document update and searchability)

Expert practitioners also evaluate embeddings on their specific domain — a general embedding model trained on web text may underperform a fine-tuned domain-specific model on medical, legal, or code retrieval tasks.

Creating

Designing a scalable embedding and vector search infrastructure:

1. Embedding pipeline

<syntaxhighlight lang="text">
Data source (docs, products, articles)
        ↓
[Change detection: hash-based or timestamp comparison]
        ↓
[Chunking: semantic or recursive, 512-1024 tokens]
        ↓
[Embedding generation: batched, GPU-accelerated]
        ↓
[Vector store upsert: id, vector, metadata, document text]
        ↓
[BM25 index update for hybrid search]
</syntaxhighlight>

2. Query pipeline

<syntaxhighlight lang="text">
User query
        ↓
[Query preprocessing: lowercase, strip special chars]
        ↓
[Parallel retrieval:
  ├── Dense (ANN): top-50 by cosine similarity
  └── Sparse (BM25): top-50 by keyword relevance]
        ↓
[Reciprocal Rank Fusion: merge and deduplicate]
        ↓
[Cross-encoder reranking: top-10 → top-5]
        ↓
Top-k results with metadata and scores
</syntaxhighlight>
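
The reranking stage uses a cross-encoder as defined in Remembering. A minimal sketch with sentence-transformers (the ms-marco MiniLM checkpoint is a commonly used public reranker; the candidate list is illustrative):

<syntaxhighlight lang="python">
from sentence_transformers import CrossEncoder

# Scores each (query, document) pair jointly: more accurate than bi-encoder
# similarity, but too slow to run over the whole corpus, so it is applied
# only to the small fused candidate set.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do attention mechanisms work?"
candidates = [
    "Transformers use self-attention mechanisms for NLP tasks.",
    "The heart pumps blood through the circulatory system.",
    "Neural networks are the foundation of deep learning.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:2])  # keep only the top results for the final answer context
</syntaxhighlight>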

3. Production considerations

  • Pre-filter by metadata before ANN search to reduce search space
  • Cache frequent query embeddings (TTL-based; a minimal sketch follows this list)
  • Use asynchronous indexing to avoid blocking on document ingestion
  • Set up monitoring: index size growth, query latency, empty result rates
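
For the query-embedding cache above, a minimal TTL sketch (embed_fn stands in for whichever embedding call the system uses; the class name is illustrative):

<syntaxhighlight lang="python">
import time

class EmbeddingCache:
    # Memoizes query embeddings; entries expire after ttl_seconds so that
    # a swapped-out embedding model cannot serve stale vectors forever.
    def __init__(self, embed_fn, ttl_seconds=3600):
        self.embed_fn = embed_fn
        self.ttl = ttl_seconds
        self._store = {}  # query -> (timestamp, embedding)

    def get(self, query):
        entry = self._store.get(query)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        embedding = self.embed_fn(query)
        self._store[query] = (time.time(), embedding)
        return embedding
</syntaxhighlight>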