Vector Databases

Semantic search at scale: Store high-dimensional embeddings, find similar documents in milliseconds.

Key Properties

| Property | Requirement | Impact |
|---|---|---|
| Dimensionality | 256–4096 (typically 1536 for OpenAI) | Higher = more expressive, more memory |
| Query latency | <50–100ms per query | Affects user experience |
| Throughput | 1,000–100K queries/sec | Depends on indexing strategy |
| Recall | >90% | Approximate vs. exact nearest neighbors |
| Memory efficiency | Quantization reduces by 4–10x | Trade-off: compression vs. accuracy |

Core Operations

Insert: Store new embeddings with metadata

vector_db.upsert(vectors=[
    {"id": "doc1", "embedding": [0.1, 0.2, ..., 0.5], "metadata": {"source": "wiki"}},
])

Search: Find k-nearest neighbors by similarity

results = vector_db.query(query_embedding=[0.15, 0.25, ..., 0.48], k=5)
# Returns: top-5 similar vectors with distances

Delete/Update: Remove or modify vectors by ID
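
The three operations above can be sketched with a toy in-process store. `ToyVectorDB` and its method names are illustrative stand-ins for a real client library, using brute-force cosine similarity:

```python
import math

class ToyVectorDB:
    """Minimal in-memory vector store illustrating upsert/query/delete.
    Illustrative sketch only -- not any specific product's API."""

    def __init__(self):
        self._store = {}  # id -> (embedding, metadata)

    def upsert(self, vectors):
        for v in vectors:
            self._store[v["id"]] = (v["embedding"], v.get("metadata", {}))

    def delete(self, ids):
        for i in ids:
            self._store.pop(i, None)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def query(self, query_embedding, k=5):
        # Brute-force scan: score every stored vector, keep the top k
        scored = [
            (self._cosine(query_embedding, emb), id_, meta)
            for id_, (emb, meta) in self._store.items()
        ]
        scored.sort(reverse=True)  # highest cosine similarity first
        return [{"id": id_, "score": s, "metadata": m} for s, id_, m in scored[:k]]

db = ToyVectorDB()
db.upsert([
    {"id": "doc1", "embedding": [0.1, 0.2, 0.5], "metadata": {"source": "wiki"}},
    {"id": "doc2", "embedding": [0.9, 0.1, 0.0], "metadata": {"source": "blog"}},
])
print(db.query([0.1, 0.2, 0.4], k=1)[0]["id"])  # doc1 -- the closer match
```

Real databases replace the linear scan in `query` with the index structures described next.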


Indexing Strategies

| Index Type | Time (Build) | Time (Query) | Memory | Best For |
|---|---|---|---|---|
| Flat (brute-force) | O(1) (no index) | O(n×d) | Low | <1M vectors |
| HNSW (Hierarchical Navigable Small World) | O(n log n) | O(log n) | Medium | General-purpose, fast |
| IVF (Inverted File) | O(n log n) | O(n/nlist) | Low | Large scale, commodity hardware |
| Quantized (product/binary) | O(n) | O(n/nlist) | Very low | 100M+ vectors, memory-constrained |

HNSW Details:

  • Hierarchical layers organize vectors
  • Navigable small world: proximity + randomness
  • Recall: >98% at scale
  • Pinecone, Weaviate, Qdrant use HNSW
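
The "navigable small world" idea can be sketched as a best-first search over a neighbor graph. This is a single-layer simplification: real HNSW runs this search per layer, descending from a coarse top layer, and the toy 2-D points and hand-built graph here are illustrative assumptions:

```python
import heapq
import math

def greedy_search(graph, points, query, entry, ef=3):
    """Best-first search over a neighbor graph (single-layer NSW sketch).
    `ef` is the search budget: how many best results to keep exploring from."""
    d0 = math.dist(points[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap of the best `ef` nodes so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0] and len(results) >= ef:
            break  # closest candidate is worse than the worst kept result
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                nd = math.dist(points[nbr], query)
                if len(results) < ef or nd < -results[0][0]:
                    heapq.heappush(candidates, (nd, nbr))
                    heapq.heappush(results, (-nd, nbr))
                    if len(results) > ef:
                        heapq.heappop(results)  # drop the worst
    return sorted((-d, n) for d, n in results)

# Toy 2-D dataset on a line, with a chain-shaped neighbor graph.
points = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
best = greedy_search(graph, points, query=(3.2, 0), entry=0, ef=2)
print([node for _, node in best])  # [3, 4] -- the two nodes nearest x=3.2
```

The hierarchy in full HNSW exists so the search starts near the right region instead of walking the whole chain as it does here.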

| System | Recall | Speed | Scale | Cost | Best For |
|---|---|---|---|---|---|
| Pinecone | >95% | <50ms | Billions | $10–1000/mo | Managed, serverless |
| Weaviate | >90% | <100ms | 1B+ | Open-source (free) | Self-hosted flexibility |
| Milvus | >95% | <50ms | 1B+ | Open-source (free) | Distributed, cloud-native |
| Qdrant | >98% | <50ms | Billions | Free/paid | Production-grade, performant |
| Chroma | >90% | <100ms | 100M | Open-source (free) | Lightweight, embeddings-focused |

Production Deployment

Typical Setup:

Application → Vector DB (8 replicas, sharded across 4 nodes)
           ↓
        Pinecone / Weaviate / Milvus (1B vectors, 1536 dims each)
           ↓
       Latency: P50=30ms, P99=100ms
       Throughput: 10K queries/sec
       Memory: 6TB (1B × 1536 × 4 bytes)

Optimization:

  • Batch queries: Send 100 queries in one request instead of 100 separate requests
  • Async I/O: Non-blocking queries so web servers stay responsive
  • Caching: Cache the results of frequently repeated queries
  • Dimensionality reduction: Truncate embeddings from 1536 → 512 dims with minimal accuracy loss (distinct from quantization, which lowers per-dimension precision)
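
The dimension-truncation optimization can be sketched as follows, assuming a Matryoshka-style embedding model where the leading components carry most of the signal (the sample vector is made up; for other models, prefer PCA):

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length.
    Appropriate for Matryoshka-trained embeddings; a sketch, not a
    tuned pipeline."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy vector whose tail components contribute little (by construction).
vec = [0.5, 0.5, 0.5, 0.5, 0.01, 0.01, 0.01, 0.01]
short = truncate_embedding(vec, 4)
print(len(short), round(sum(x * x for x in short), 6))  # 4 1.0
```

Re-normalizing matters: cosine similarity assumes unit-ish norms, and truncation alone shrinks them.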

When to Use / Avoid Vector Databases

Use Vector Databases when:

  • ✅ Need semantic similarity search (not keyword matching)
  • ✅ Scaling to 100M+ vectors (HNSW outperforms flat search)
  • ✅ Querying in <100ms required (ANN essential, exact search too slow)
  • ✅ Building RAG systems (embeddings + retrieval core functionality)
  • ✅ Enabling generative AI features (search + LLM context integration)

Avoid Vector Databases when:

  • ❌ <1M vectors and latency not critical (flat/brute-force cheaper)
  • ❌ Need exact nearest neighbors for critical tasks (approximation error unacceptable)
  • ❌ Vectors change extremely frequently (reindexing overhead)
  • ❌ All queries identical (traditional caching better)

How Real Systems Use This

Pinecone (Notion AI Integration): Notion uses Pinecone to power their AI search across user workspaces. When Notion receives a user query to search notes, they: (1) embed the query using OpenAI text-embedding-3-small (384 dimensions, 10ms), (2) query Pinecone with k=10 candidates, (3) Pinecone returns top-10 matches in <30ms using HNSW indexing, (4) Notion returns matching notes to user. Pinecone stores ~500M vectors for Notion (across all users). Per-user namespaces isolate data. Metrics: P99 query latency = 45ms, precision@10 > 95%. Cost: $0.08/100K vectors/month managed service. Why Notion chose Pinecone: Managed serverless (no infrastructure), multi-tenancy built-in, 99.95% SLA, metadata filtering for per-workspace isolation, built-in replication for HA.

Weaviate (Stack Overflow Q&A Search): Stack Overflow integrated Weaviate to improve question recommendation. When a user asks a new question, Weaviate: (1) vectorizes the question using their fine-tuned encoder model, (2) searches 22M Stack Overflow questions in their self-hosted Weaviate cluster, (3) returns top-20 similar questions in <100ms, (4) displays as “Similar questions” sidebar. Weaviate cluster spans 5 nodes with 22M vectors × 768 dimensions. Metrics: Recall@100 = 98% (catches ~all relevant duplicates), P99 latency = 150ms. Cost: Self-hosted, negligible compute (uses existing servers). Why Stack Overflow chose Weaviate: Open-source (no vendor lock-in), GraphQL API enables flexible queries, hybrid search (semantic + keyword BM25), module system for custom vectorizers, cost-effective at scale.

Qdrant (Stripe RAG System): Stripe uses Qdrant for their internal documentation RAG. 1000+ API documentation pages are vectorized (OpenAI text-embedding-3-large, 1536 dims) and stored in Qdrant. When a developer asks “How do I handle webhook retries?”: (1) query is embedded (15ms), (2) Qdrant returns top-5 similar docs in <30ms using HNSW, (3) optional reranking with cross-encoder (40ms), (4) docs inserted into LLM context. Self-hosted Qdrant cluster: 2 nodes, 1M vectors, ~5GB memory. Metrics: Answer accuracy = 92%, retrieval precision@5 = 94%. Why Qdrant: Extremely fast HNSW (payload-aware indexing), high precision, scales to billions of vectors, built-in filtering (metadata), REST + gRPC APIs.

Milvus (E-commerce Product Search): An e-commerce platform uses open-source Milvus to search 50M product embeddings. When a customer searches “blue running shoes”, the system: (1) embeds query using product encoder, (2) queries Milvus cluster (8 data nodes + 2 query nodes), (3) returns top-50 products ranked by similarity in <50ms, (4) results re-ranked by price/rating/inventory. Milvus cluster: 50M vectors × 768 dimensions = ~150GB memory, distributed across 8 machines. Metrics: Recall@50 = 96%, P99 latency = 80ms, throughput = 10K queries/sec. Cost: Self-hosted, ~$5K/month infrastructure (commodity servers). Why chose Milvus: Distributed-first design (scales horizontally), cloud-native (Kubernetes ready), IVF compression reduces memory 4x vs flat, HNSW for extreme precision trade-off.

Chroma (Local LLM Development): Developers using local/open-source LLMs (LLaMA, Mistral) often use Chroma for lightweight embedding storage. Chroma runs in-process (Python library) or as Docker container. For 100K embeddings: (1) embeds documents with open-source model (MiniLM, 384 dims), (2) stores in Chroma SQLite backend (~50MB), (3) queries return top-k in <50ms. Typical usage: local chatbot on laptop querying personal documents. Metrics: Recall@10 = 90% (sufficient for prototyping), P99 latency = 100ms (variable, local hardware dependent). Cost: Free, open-source. Why developers choose Chroma: Zero infrastructure, instant setup, SQL filtering, client-focused APIs, Python-native.


Advanced Indexing Patterns

IVF-PQ (Inverted File + Product Quantization):

  • Divide vectors into clusters (IVF)
  • Within each cluster, quantize subvectors into 8-bit codes (PQ)
  • A query scans only the nprobe closest clusters (often a 90%+ reduction in search space)
  • Trade-off: 2-5% accuracy loss for 10-50x speedup
  • Popular in: FAISS, Elasticsearch, OpenSearch
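
A minimal sketch of the IVF half of this pattern (PQ compression is omitted for brevity, and the randomly sampled "centroids" stand in for trained k-means, so treat the numbers as illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy IVF index: cluster vectors, then search only the closest clusters.
n, d, n_clusters = 1000, 32, 10
data = rng.standard_normal((n, d)).astype(np.float32)
centroids = data[rng.choice(n, n_clusters, replace=False)]

# Build inverted lists: cluster id -> indices of member vectors
assignments = np.argmin(
    np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2), axis=1
)
inverted = {c: np.where(assignments == c)[0] for c in range(n_clusters)}

def ivf_search(query, k=5, nprobe=2):
    # 1) rank clusters by centroid distance, 2) scan only `nprobe` of them
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([inverted[c] for c in order])
    dists = np.linalg.norm(data[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

# A query near vector 42 should find it, because 42's cluster gets probed.
query = data[42] + 0.01 * rng.standard_normal(d).astype(np.float32)
print(42 in ivf_search(query, k=5, nprobe=3))  # True with this seed
```

Raising `nprobe` trades latency for recall, which is exactly the knob the bullets above describe.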

HNSW with Pruning:

  • Store only top neighbors per layer
  • Skip intermediate connections
  • Reduces index memory 20-30%
  • Minimal latency impact (<5ms)

Approximate Nearest Neighbor (ANN) Scaling:

  • 1M vectors × 1536 dims = 6GB flat memory
  • With HNSW: add 20% for graph structure = 7.2GB
  • With IVF-PQ: compress to 2-3GB
  • Choice depends on latency vs memory trade-off
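
A back-of-envelope helper reproducing the arithmetic above (decimal GB; `graph_overhead` approximates the ~20% HNSW graph cost):

```python
def index_memory_gb(n_vectors, dims, bytes_per_dim=4, graph_overhead=0.0):
    """Rough memory estimate: raw vectors plus optional graph overhead."""
    raw = n_vectors * dims * bytes_per_dim
    return raw * (1 + graph_overhead) / 1e9

# Numbers from the scaling bullets above:
print(round(index_memory_gb(1_000_000, 1536), 1))                      # flat: 6.1 GB
print(round(index_memory_gb(1_000_000, 1536, graph_overhead=0.2), 1))  # HNSW: ~7.4 GB
print(round(index_memory_gb(1_000_000, 1536, bytes_per_dim=1), 1))     # int8: 1.5 GB
```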

Production Deployment Patterns

Single-region High Availability:

Load Balancer
  ├── Qdrant replica 1 (vector index + data)
  ├── Qdrant replica 2 (vector index + data)
  └── Qdrant replica 3 (vector index + data)

Configuration:
- All 3 replicas hot (no failover latency)
- Consensus protocol for consistency
- Write replication factor: 3
- Read from any replica (load balanced)

Multi-region Disaster Recovery:

Primary Region (Pinecone US)
  └── 500M vectors, read/write

Secondary Region (Pinecone EU)
  └── Replica of 500M vectors, read-only

Async replication from primary → secondary (5-30 sec lag)

Caching Layer (for hot queries):

Redis In-Memory Cache
  ├── Popular queries: "return top 10 for query X"
  └── Hit rate: 60-70% for well-distributed access

Query Pattern:
1. Check Redis cache (1ms hit)
2. If miss, query vector DB (30-100ms)
3. Update cache (TTL: 5 min)
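
The query pattern above, sketched with an in-process dict instead of Redis (`CachedSearch` and `fake_backend` are hypothetical names; a real deployment would key the cache on the query embedding or its hash):

```python
import time

class CachedSearch:
    """Cache-aside wrapper around a vector search backend.
    `backend` is any callable query -> results, standing in for the DB client."""

    def __init__(self, backend, ttl_seconds=300):
        self.backend = backend
        self.ttl = ttl_seconds
        self._cache = {}  # query key -> (expiry timestamp, results)

    def query(self, key):
        now = time.monotonic()
        hit = self._cache.get(key)
        if hit and hit[0] > now:
            return hit[1]                    # cache hit: ~1ms path
        results = self.backend(key)          # miss: 30-100ms vector DB path
        self._cache[key] = (now + self.ttl, results)
        return results

calls = []
def fake_backend(q):
    calls.append(q)
    return [f"doc-for-{q}"]

search = CachedSearch(fake_backend, ttl_seconds=300)
search.query("reset password")
search.query("reset password")  # second call served from cache
print(len(calls))  # 1 -- the backend was hit only once
```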

Embedding Model Comparison

| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3072 | Slow (100ms) | Excellent (MTEB 64.6) | $0.13/1M tokens |
| text-embedding-3-small (OpenAI) | 1536 | Medium (50ms) | Very good (MTEB 62.2) | $0.02/1M tokens |
| MiniLM-L6 (open-source) | 384 | Fast (5ms local) | Good (MTEB 56) | Free |
| bge-large-en (open-source) | 1024 | Fast (10ms) | Excellent (MTEB 63.6) | Free |
| voyage-large-2 (Voyage AI) | 1024 | Medium (60ms) | Excellent (MTEB 63.9) | $0.10/1M tokens |

Quantization Impact on Recall

| Quantization | Recall@10 | Memory Reduction | Speed Improvement |
|---|---|---|---|
| Float32 (baseline) | 100% | 1x | 1x |
| Float16 | 99.8% | 2x | 1.2x |
| Int8 | 98.5% | 4x | 1.8x |
| Int4 | 95% | 8x | 2.5x |
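
A sketch of how Int8 scalar quantization achieves the 4x row in the table. This version uses a single symmetric scale for the whole batch, whereas production systems typically calibrate per dimension or per subvector:

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize_int8(vecs):
    """Symmetric scalar quantization to int8 with one shared scale.
    Simplified sketch of the idea behind the table above."""
    scale = np.abs(vecs).max() / 127.0
    q = np.round(vecs / scale).astype(np.int8)
    return q, scale

vecs = rng.standard_normal((100, 64)).astype(np.float32)
q, scale = quantize_int8(vecs)
recon = q.astype(np.float32) * scale

# Memory drops 4x (float32 -> int8) and reconstruction error stays small.
rel_err = np.linalg.norm(vecs - recon) / np.linalg.norm(vecs)
print(q.nbytes * 4 == vecs.nbytes, rel_err < 0.02)  # True True
```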

Approximate Nearest Neighbor (ANN) Trade-offs

Exact (Brute-Force) Search:

  • Algorithm: Compare query to all vectors, return k smallest distances
  • Complexity: O(n) queries, O(1) build
  • Pros: 100% recall, no approximation
  • Cons: Slow for large n (n>1M)
  • When to use: <1M vectors, accuracy critical

HNSW (Hierarchical Navigable Small World):

  • Algorithm: Multi-layer graph, navigate via small-world paths
  • Complexity: O(log n) queries, O(n log n) build
  • Recall: 95-99% (depends on ef parameter)
  • When to use: General-purpose, 1M-1B vectors
  • Trade-off: Memory overhead (20-30% vs flat), but enables consistently low-millisecond queries

IVF-PQ (Inverted File + Product Quantization):

  • Algorithm: Cluster vectors (IVF) + compress within cluster (PQ)
  • Complexity: O(n / n_clusters) queries
  • Recall: 85-95%
  • When to use: Extreme scale (100M+ vectors), memory constrained
  • Trade-off: Lower recall than HNSW, but massive memory savings

Vector Database Comparison (Detailed)

| Factor | Pinecone | Weaviate | Qdrant | Milvus | Chroma |
|---|---|---|---|---|---|
| Type | Managed | Self-hosted | Self-hosted | Self-hosted | Embedded (in-process) |
| Scaling | Serverless | Kubernetes | Kubernetes | Kubernetes | Single machine |
| Consistency | Eventually consistent | Strong | Strong | Strong | Strong |
| Filtering | Yes (metadata) | Yes | Yes | Yes | Yes |
| Price (100K vectors) | ~$1-5/mo | Free | Free | Free | Free |
| Ops burden | None | Medium (K8s) | Medium | High (distributed) | None |
| Best For | SaaS apps, serverless | ML apps, on-prem | Production, latency-critical | Large-scale distributed deployments | Local dev, prototyping |

Common Vector DB Pitfalls

Problem 1: Curse of Dimensionality (High-Dimensional Vectors)

  • Vector dimensionality: 1536 (OpenAI), 768 (BERT)
  • Issue: Distance becomes less meaningful in high dimensions
  • Symptom: Top-k queries return similar distances (hard to distinguish)
  • Solution: Dimensionality reduction (PCA), use cosine distance (vs L2), quantization
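
The "similar distances" symptom is easy to demonstrate on random data: the relative spread between nearest and farthest points shrinks as dimensionality grows (synthetic Gaussian data; real embeddings concentrate less severely because they live on lower-dimensional manifolds):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dims, n=2000):
    """(max - min) distance from a random query to n random points,
    relative to the mean distance. Shrinks as `dims` grows."""
    points = rng.standard_normal((n, dims))
    query = rng.standard_normal(dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.mean()

for dims in (2, 64, 1536):
    print(dims, round(distance_spread(dims), 2))  # spread shrinks with dims
```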

Problem 2: Embedding Model Mismatch

  • Indexing with model A (text-embedding-3-small), querying with model B (BERT)
  • Result: Query embedding not comparable to indexed embeddings
  • Solution: Always use same embedding model for indexing + querying

Problem 3: Memory Explosion at Scale

  • 10B vectors × 1536 dims × 4 bytes = 61.4TB uncompressed
  • Common mistake: Load all vectors into RAM
  • Solution: Use quantization (reduce to Int8: 15.3TB), or use disk-backed index

Problem 4: No Filtering = Irrelevant Results

  • Issue: Dense retrieval finds semantically similar vectors, but may be wrong type
  • Example: Searching “how to reset password” returns docs about password policies (semantically similar, but not what user wants)
  • Solution: Add metadata filters (doc_type == “how-to”), use hybrid search (dense + sparse)
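
A sketch of the pre-filter variant of metadata filtering: restrict candidates first, then rank by similarity (toy 2-D embeddings and made-up docs; real engines push the filter into the index traversal instead of scanning):

```python
import math

docs = [
    {"id": "a", "emb": [1.0, 0.0], "meta": {"doc_type": "how-to"}},
    {"id": "b", "emb": [0.9, 0.1], "meta": {"doc_type": "policy"}},
    {"id": "c", "emb": [0.0, 1.0], "meta": {"doc_type": "how-to"}},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def filtered_query(query, doc_type, k=1):
    # Pre-filter on metadata, then rank only the survivors by similarity
    candidates = [d for d in docs if d["meta"]["doc_type"] == doc_type]
    candidates.sort(key=lambda d: cosine(query, d["emb"]), reverse=True)
    return [d["id"] for d in candidates[:k]]

# Unfiltered, "b" (a policy doc) would win on pure similarity;
# the doc_type filter keeps only how-to docs.
print(filtered_query([0.9, 0.1], doc_type="how-to"))  # ['a']
```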

Problem 5: N+1 Queries During Indexing

  • Indexing 1B vectors naively: 1B individual inserts = network overhead
  • Symptom: Indexing takes weeks
  • Solution: Batch inserts (1000 vectors per request), use bulk ingest APIs
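
The batching fix is mostly plumbing; a chunking helper like this turns 2,500 single inserts into 3 bulk requests (the commented `vector_db.upsert` call is a placeholder for whatever client is in use):

```python
def batched(items, batch_size=1000):
    """Yield fixed-size chunks so inserts go out as bulk requests."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

vectors = [{"id": str(i), "embedding": [0.0] * 8} for i in range(2500)]

requests_sent = 0
for batch in batched(vectors, batch_size=1000):
    # vector_db.upsert(vectors=batch)  # one network round-trip per 1000 vectors
    requests_sent += 1
print(requests_sent)  # 3 requests instead of 2500
```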

Performance Tuning

Query Latency Optimization:

1. Reduce ef (search budget in HNSW)
   - Default: ef=100 (search top-100 candidates)
   - Reduce to ef=20 for faster query (10ms vs 50ms)
   - Recall drops from 99% to 92% (trade-off)

2. Use cheaper distance calculations
   - With vectors pre-normalized at index time, dot product gives the same ranking as cosine at lower cost
   - Difference: negligible for ranking

3. Implement query caching
   - Cache results of frequent queries (5-min TTL)
   - Hit rate: 60-70% for well-distributed workloads
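
Point 2 rests on an identity worth seeing once: for vectors normalized to unit length, the dot product equals cosine similarity, so rankings are unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(1536), rng.standard_normal(1536)

# Full cosine: dot product divided by both norms
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize once at index time; then the cheaper dot product suffices
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(cos, an @ bn))  # True
```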

Indexing Speed Optimization:

1. Batch upserts (bulk insert 1000+ vectors)
   - Single insert: 100 vectors/sec
   - Batch insert: 10K-100K vectors/sec

2. Disable reindexing during bulk load
   - Immediately re-index after bulk load
   - Faster than incremental reindexing

3. Pre-sort vectors by cluster
   - Helps HNSW build more efficient graph
   - 20-30% faster indexing

References

📄 HNSW Paper (Malkov & Yashunin, 2018)
📄 Product Quantization (Jégou et al., 2011)
📄 Billion-scale Similarity Search with GPUs (Johnson et al., 2017) — FAISS introduction
📄 MTEB: Massive Text Embedding Benchmark (Muennighoff et al., 2023)
🔗 Pinecone Documentation
🔗 Weaviate Documentation
🔗 Qdrant Documentation
🔗 Milvus Documentation
🎥 Vector Search Explained (ByteByteGo)

This post is licensed under CC BY 4.0 by the author.