Vector Databases

Semantic search at scale: Store high-dimensional embeddings, find similar documents in milliseconds.

Key Properties

| Property | Requirement | Impact |
|---|---|---|
| Dimensionality | 256–4096 (typically 1536 for OpenAI) | Higher = more expressive, more memory |
| Query latency | <50–100ms per query | Affects user experience |
| Throughput | 1,000–100K queries/sec | Depends on indexing strategy |
| Recall | >90% | Approximate vs. exact nearest neighbors |
| Memory efficiency | Quantization reduces by 4–10x | Trade-off: compression vs. accuracy |

Core Operations

Insert: Store new embeddings with metadata

vector_db.upsert(vectors=[
    {"id": "doc1", "embedding": [0.1, 0.2, ..., 0.5], "metadata": {"source": "wiki"}},
])

Search: Find k-nearest neighbors by similarity

results = vector_db.query(query_embedding=[0.15, 0.25, ..., 0.48], k=5)
# Returns: top-5 similar vectors with distances

Delete/Update: Remove or modify vectors by ID
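
The three operations above can be sketched with a toy in-process store. `ToyVectorDB` and its method names are illustrative stand-ins for a real client library, using brute-force cosine similarity:

```python
import math

class ToyVectorDB:
    """Minimal in-memory vector store illustrating upsert/query/delete.
    Illustrative sketch only -- not any specific product's API."""

    def __init__(self):
        self._store = {}  # id -> (embedding, metadata)

    def upsert(self, vectors):
        for v in vectors:
            self._store[v["id"]] = (v["embedding"], v.get("metadata", {}))

    def delete(self, ids):
        for i in ids:
            self._store.pop(i, None)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def query(self, query_embedding, k=5):
        # Brute-force scan: score every stored vector, keep the top k
        scored = [
            (self._cosine(query_embedding, emb), id_, meta)
            for id_, (emb, meta) in self._store.items()
        ]
        scored.sort(reverse=True)  # highest cosine similarity first
        return [{"id": id_, "score": s, "metadata": m} for s, id_, m in scored[:k]]

db = ToyVectorDB()
db.upsert([
    {"id": "doc1", "embedding": [0.1, 0.2, 0.5], "metadata": {"source": "wiki"}},
    {"id": "doc2", "embedding": [0.9, 0.1, 0.0], "metadata": {"source": "blog"}},
])
print(db.query([0.1, 0.2, 0.4], k=1)[0]["id"])  # doc1 -- the closer match
```

Real databases replace the linear scan in `query` with the index structures described next.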


Indexing Strategies

| Index Type | Time (Build) | Time (Query) | Memory | Best For |
|---|---|---|---|---|
| Flat (brute-force) | O(1) (no index) | O(n×d) | Low | <1M vectors |
| HNSW (Hierarchical Navigable Small World) | O(n log n) | O(log n) | Medium | General-purpose, fast |
| IVF (Inverted File) | O(n log n) | O(n/nlist) | Low | Large scale, commodity hardware |
| Quantized (product/binary) | O(n) | O(n/nlist) | Very low | 100M+ vectors, memory-constrained |

HNSW Details:

  • Hierarchical layers organize vectors
  • Navigable small world: proximity + randomness
  • Recall: >98% at scale
  • Pinecone, Weaviate, Qdrant use HNSW
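
The "navigable small world" idea can be sketched as a best-first search over a neighbor graph. This is a single-layer simplification: real HNSW runs this search per layer, descending from a coarse top layer, and the toy 2-D points and hand-built graph here are illustrative assumptions:

```python
import heapq
import math

def greedy_search(graph, points, query, entry, ef=3):
    """Best-first search over a neighbor graph (single-layer NSW sketch).
    `ef` is the search budget: how many best results to keep exploring from."""
    d0 = math.dist(points[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap of the best `ef` nodes so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0] and len(results) >= ef:
            break  # closest candidate is worse than the worst kept result
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                nd = math.dist(points[nbr], query)
                if len(results) < ef or nd < -results[0][0]:
                    heapq.heappush(candidates, (nd, nbr))
                    heapq.heappush(results, (-nd, nbr))
                    if len(results) > ef:
                        heapq.heappop(results)  # drop the worst
    return sorted((-d, n) for d, n in results)

# Toy 2-D dataset on a line, with a chain-shaped neighbor graph.
points = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
best = greedy_search(graph, points, query=(3.2, 0), entry=0, ef=2)
print([node for _, node in best])  # [3, 4] -- the two nodes nearest x=3.2
```

The hierarchy in full HNSW exists so the search starts near the right region instead of walking the whole chain as it does here.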

| System | Recall | Speed | Scale | Cost | Best For |
|---|---|---|---|---|---|
| Pinecone | >95% | <50ms | Billions | $10–1000/mo | Managed, serverless |
| Weaviate | >90% | <100ms | 1B+ | Open-source (free) | Self-hosted flexibility |
| Milvus | >95% | <50ms | 1B+ | Open-source (free) | Distributed, cloud-native |
| Qdrant | >98% | <50ms | Billions | Free/paid | Production-grade, performant |
| Chroma | >90% | <100ms | 100M | Open-source (free) | Lightweight, embeddings-focused |

Production Deployment

Typical Setup:

Application → Vector DB (8 replicas, sharded across 4 nodes)
           ↓
        Pinecone / Weaviate / Milvus (1B vectors, 1536 dims each)
           ↓
       Latency: P50=30ms, P99=100ms
       Throughput: 10K queries/sec
       Memory: 6TB (1B × 1536 × 4 bytes)

Optimization:

  • Batch queries: Send 100 queries in one request instead of 100 separate requests
  • Async I/O: Non-blocking queries so web servers stay responsive
  • Caching: Cache the results of frequently repeated queries
  • Dimensionality reduction: Truncate embeddings from 1536 → 512 dims with minimal accuracy loss (distinct from quantization, which lowers per-dimension precision)
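
The dimension-truncation optimization can be sketched as follows, assuming a Matryoshka-style embedding model where the leading components carry most of the signal (the sample vector is made up; for other models, prefer PCA):

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length.
    Appropriate for Matryoshka-trained embeddings; a sketch, not a
    tuned pipeline."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy vector whose tail components contribute little (by construction).
vec = [0.5, 0.5, 0.5, 0.5, 0.01, 0.01, 0.01, 0.01]
short = truncate_embedding(vec, 4)
print(len(short), round(sum(x * x for x in short), 6))  # 4 1.0
```

Re-normalizing matters: cosine similarity assumes unit-ish norms, and truncation alone shrinks them.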

When to Use / Avoid Vector Databases

Use Vector Databases when:

  • ✅ Need semantic similarity search (not keyword matching)
  • ✅ Scaling to 100M+ vectors (HNSW outperforms flat search)
  • ✅ Querying in <100ms required (ANN essential, exact search too slow)
  • ✅ Building RAG systems (embeddings + retrieval core functionality)
  • ✅ Enabling generative AI features (search + LLM context integration)

Avoid Vector Databases when:

  • ❌ <1M vectors and latency not critical (flat/brute-force cheaper)
  • ❌ Need exact nearest neighbors for critical tasks (approximation error unacceptable)
  • ❌ Vectors change extremely frequently (reindexing overhead)
  • ❌ All queries identical (traditional caching better)

How Real Systems Use This

Pinecone (Notion AI Integration): Notion uses Pinecone to power their AI search across user workspaces. When Notion receives a user query to search notes, they: (1) embed the query using OpenAI text-embedding-3-small (384 dimensions, 10ms), (2) query Pinecone with k=10 candidates, (3) Pinecone returns top-10 matches in <30ms using HNSW indexing, (4) Notion returns matching notes to user. Pinecone stores ~500M vectors for Notion (across all users). Per-user namespaces isolate data. Metrics: P99 query latency = 45ms, precision@10 > 95%. Cost: $0.08/100K vectors/month managed service. Why Notion chose Pinecone: Managed serverless (no infrastructure), multi-tenancy built-in, 99.95% SLA, metadata filtering for per-workspace isolation, built-in replication for HA.

Weaviate (Stack Overflow Q&A Search): Stack Overflow integrated Weaviate to improve question recommendation. When a user asks a new question, Weaviate: (1) vectorizes the question using their fine-tuned encoder model, (2) searches 22M Stack Overflow questions in their self-hosted Weaviate cluster, (3) returns top-20 similar questions in <100ms, (4) displays as “Similar questions” sidebar. Weaviate cluster spans 5 nodes with 22M vectors × 768 dimensions. Metrics: Recall@100 = 98% (catches ~all relevant duplicates), P99 latency = 150ms. Cost: Self-hosted, negligible compute (uses existing servers). Why Stack Overflow chose Weaviate: Open-source (no vendor lock-in), GraphQL API enables flexible queries, hybrid search (semantic + keyword BM25), module system for custom vectorizers, cost-effective at scale.

Qdrant (Stripe RAG System): Stripe uses Qdrant for their internal documentation RAG. 1000+ API documentation pages are vectorized (OpenAI text-embedding-3-large, 1536 dims) and stored in Qdrant. When a developer asks “How do I handle webhook retries?”: (1) query is embedded (15ms), (2) Qdrant returns top-5 similar docs in <30ms using HNSW, (3) optional reranking with cross-encoder (40ms), (4) docs inserted into LLM context. Self-hosted Qdrant cluster: 2 nodes, 1M vectors, ~5GB memory. Metrics: Answer accuracy = 92%, retrieval precision@5 = 94%. Why Qdrant: Extremely fast HNSW (payload-aware indexing), high precision, scales to billions of vectors, built-in filtering (metadata), REST + gRPC APIs.

Milvus (E-commerce Product Search): An e-commerce platform uses open-source Milvus to search 50M product embeddings. When a customer searches “blue running shoes”, the system: (1) embeds query using product encoder, (2) queries Milvus cluster (8 data nodes + 2 query nodes), (3) returns top-50 products ranked by similarity in <50ms, (4) results re-ranked by price/rating/inventory. Milvus cluster: 50M vectors × 768 dimensions = ~150GB memory, distributed across 8 machines. Metrics: Recall@50 = 96%, P99 latency = 80ms, throughput = 10K queries/sec. Cost: Self-hosted, ~$5K/month infrastructure (commodity servers). Why chose Milvus: Distributed-first design (scales horizontally), cloud-native (Kubernetes ready), IVF compression reduces memory 4x vs flat, HNSW for extreme precision trade-off.

Chroma (Local LLM Development): Developers using local/open-source LLMs (LLaMA, Mistral) often use Chroma for lightweight embedding storage. Chroma runs in-process (Python library) or as Docker container. For 100K embeddings: (1) embeds documents with open-source model (MiniLM, 384 dims), (2) stores in Chroma SQLite backend (~50MB), (3) queries return top-k in <50ms. Typical usage: local chatbot on laptop querying personal documents. Metrics: Recall@10 = 90% (sufficient for prototyping), P99 latency = 100ms (variable, local hardware dependent). Cost: Free, open-source. Why developers choose Chroma: Zero infrastructure, instant setup, SQL filtering, client-focused APIs, Python-native.


Advanced Indexing Patterns

IVF-PQ (Inverted File + Product Quantization):

  • Divide vectors into clusters (IVF)
  • Within each cluster, quantize subvectors into 8-bit codes (PQ)
  • A query scans only the nprobe closest clusters (often a 90%+ reduction in search space)
  • Trade-off: 2-5% accuracy loss for 10-50x speedup
  • Popular in: FAISS, Elasticsearch, OpenSearch
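
A minimal sketch of the IVF half of this pattern (PQ compression is omitted for brevity, and the randomly sampled "centroids" stand in for trained k-means, so treat the numbers as illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy IVF index: cluster vectors, then search only the closest clusters.
n, d, n_clusters = 1000, 32, 10
data = rng.standard_normal((n, d)).astype(np.float32)
centroids = data[rng.choice(n, n_clusters, replace=False)]

# Build inverted lists: cluster id -> indices of member vectors
assignments = np.argmin(
    np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2), axis=1
)
inverted = {c: np.where(assignments == c)[0] for c in range(n_clusters)}

def ivf_search(query, k=5, nprobe=2):
    # 1) rank clusters by centroid distance, 2) scan only `nprobe` of them
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([inverted[c] for c in order])
    dists = np.linalg.norm(data[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

# A query near vector 42 should find it, because 42's cluster gets probed.
query = data[42] + 0.01 * rng.standard_normal(d).astype(np.float32)
print(42 in ivf_search(query, k=5, nprobe=3))  # True with this seed
```

Raising `nprobe` trades latency for recall, which is exactly the knob the bullets above describe.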

HNSW with Pruning:

  • Store only top neighbors per layer
  • Skip intermediate connections
  • Reduces index memory 20-30%
  • Minimal latency impact (<5ms)

Approximate Nearest Neighbor (ANN) Scaling:

  • 1M vectors × 1536 dims = 6GB flat memory
  • With HNSW: add 20% for graph structure = 7.2GB
  • With IVF-PQ: compress to 2-3GB
  • Choice depends on latency vs memory trade-off
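
A back-of-envelope helper reproducing the arithmetic above (decimal GB; `graph_overhead` approximates the ~20% HNSW graph cost):

```python
def index_memory_gb(n_vectors, dims, bytes_per_dim=4, graph_overhead=0.0):
    """Rough memory estimate: raw vectors plus optional graph overhead."""
    raw = n_vectors * dims * bytes_per_dim
    return raw * (1 + graph_overhead) / 1e9

# Numbers from the scaling bullets above:
print(round(index_memory_gb(1_000_000, 1536), 1))                      # flat: 6.1 GB
print(round(index_memory_gb(1_000_000, 1536, graph_overhead=0.2), 1))  # HNSW: ~7.4 GB
print(round(index_memory_gb(1_000_000, 1536, bytes_per_dim=1), 1))     # int8: 1.5 GB
```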

Production Deployment Patterns

Single-region High Availability:

Load Balancer
  ├── Qdrant replica 1 (vector index + data)
  ├── Qdrant replica 2 (vector index + data)
  └── Qdrant replica 3 (vector index + data)

Configuration:
- All 3 replicas hot (no failover latency)
- Consensus protocol for consistency
- Write replication factor: 3
- Read from any replica (load balanced)

Multi-region Disaster Recovery:

Primary Region (Pinecone US)
  └── 500M vectors, read/write

Secondary Region (Pinecone EU)
  └── Replica of 500M vectors, read-only

Async replication from primary → secondary (5-30 sec lag)

Caching Layer (for hot queries):

Redis In-Memory Cache
  ├── Popular queries: "return top 10 for query X"
  └── Hit rate: 60-70% for well-distributed access

Query Pattern:
1. Check Redis cache (1ms hit)
2. If miss, query vector DB (30-100ms)
3. Update cache (TTL: 5 min)
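
The query pattern above, sketched with an in-process dict instead of Redis (`CachedSearch` and `fake_backend` are hypothetical names; a real deployment would key the cache on the query embedding or its hash):

```python
import time

class CachedSearch:
    """Cache-aside wrapper around a vector search backend.
    `backend` is any callable query -> results, standing in for the DB client."""

    def __init__(self, backend, ttl_seconds=300):
        self.backend = backend
        self.ttl = ttl_seconds
        self._cache = {}  # query key -> (expiry timestamp, results)

    def query(self, key):
        now = time.monotonic()
        hit = self._cache.get(key)
        if hit and hit[0] > now:
            return hit[1]                    # cache hit: ~1ms path
        results = self.backend(key)          # miss: 30-100ms vector DB path
        self._cache[key] = (now + self.ttl, results)
        return results

calls = []
def fake_backend(q):
    calls.append(q)
    return [f"doc-for-{q}"]

search = CachedSearch(fake_backend, ttl_seconds=300)
search.query("reset password")
search.query("reset password")  # second call served from cache
print(len(calls))  # 1 -- the backend was hit only once
```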

Embedding Model Comparison

| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3072 | Slow (100ms) | Excellent (MTEB 64.6) | $0.13/1M tokens |
| text-embedding-3-small (OpenAI) | 1536 | Medium (50ms) | Very good (MTEB 62.2) | $0.02/1M tokens |
| MiniLM-L6 (open-source) | 384 | Fast (5ms local) | Good (MTEB 56) | Free |
| bge-large-en (open-source) | 1024 | Fast (10ms) | Excellent (MTEB 63.6) | Free |
| voyage-large-2 (Voyage AI) | 1024 | Medium (60ms) | Excellent (MTEB 63.9) | $0.10/1M tokens |

Quantization Impact on Recall

| Quantization | Recall@10 | Memory Reduction | Speed Improvement |
|---|---|---|---|
| Float32 (baseline) | 100% | 1x | 1x |
| Float16 | 99.8% | 2x | 1.2x |
| Int8 | 98.5% | 4x | 1.8x |
| Int4 | 95% | 8x | 2.5x |
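
A sketch of how Int8 scalar quantization achieves the 4x row in the table. This version uses a single symmetric scale for the whole batch, whereas production systems typically calibrate per dimension or per subvector:

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize_int8(vecs):
    """Symmetric scalar quantization to int8 with one shared scale.
    Simplified sketch of the idea behind the table above."""
    scale = np.abs(vecs).max() / 127.0
    q = np.round(vecs / scale).astype(np.int8)
    return q, scale

vecs = rng.standard_normal((100, 64)).astype(np.float32)
q, scale = quantize_int8(vecs)
recon = q.astype(np.float32) * scale

# Memory drops 4x (float32 -> int8) and reconstruction error stays small.
rel_err = np.linalg.norm(vecs - recon) / np.linalg.norm(vecs)
print(q.nbytes * 4 == vecs.nbytes, rel_err < 0.02)  # True True
```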

Approximate Nearest Neighbor (ANN) Trade-offs

Exact (Brute-Force) Search:

  • Algorithm: Compare query to all vectors, return k smallest distances
  • Complexity: O(n) queries, O(1) build
  • Pros: 100% recall, no approximation
  • Cons: Slow for large n (n>1M)
  • When to use: <1M vectors, accuracy critical

HNSW (Hierarchical Navigable Small World):

  • Algorithm: Multi-layer graph, navigate via small-world paths
  • Complexity: O(log n) queries, O(n log n) build
  • Recall: 95-99% (depends on ef parameter)
  • When to use: General-purpose, 1M-1B vectors
  • Trade-off: Memory overhead (20-30% vs flat), but enables consistently low-millisecond queries

IVF-PQ (Inverted File + Product Quantization):

  • Algorithm: Cluster vectors (IVF) + compress within cluster (PQ)
  • Complexity: O(n / n_clusters) queries
  • Recall: 85-95%
  • When to use: Extreme scale (100M+ vectors), memory constrained
  • Trade-off: Lower recall than HNSW, but massive memory savings

Vector Database Comparison (Detailed)

| Factor | Pinecone | Weaviate | Qdrant | Milvus | Chroma |
|---|---|---|---|---|---|
| Type | Managed | Self-hosted | Self-hosted | Self-hosted | Embedded (in-process) |
| Scaling | Serverless | Kubernetes | Kubernetes | Kubernetes | Single machine |
| Consistency | Eventually consistent | Strong | Strong | Strong | Strong |
| Filtering | Yes (metadata) | Yes | Yes | Yes | Yes |
| Price (100K vectors) | ~$1-5/mo | Free | Free | Free | Free |
| Ops burden | None | Medium (K8s) | Medium | High (distributed) | None |
| Best For | SaaS apps, serverless | ML apps, on-prem | Production, latency-critical | Large-scale distributed deployments | Local dev, prototyping |

Common Vector DB Pitfalls

Problem 1: Curse of Dimensionality (High-Dimensional Vectors)

  • Vector dimensionality: 1536 (OpenAI), 768 (BERT)
  • Issue: Distance becomes less meaningful in high dimensions
  • Symptom: Top-k queries return similar distances (hard to distinguish)
  • Solution: Dimensionality reduction (PCA), use cosine distance (vs L2), quantization
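
The "similar distances" symptom is easy to demonstrate on random data: the relative spread between nearest and farthest points shrinks as dimensionality grows (synthetic Gaussian data; real embeddings concentrate less severely because they live on lower-dimensional manifolds):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dims, n=2000):
    """(max - min) distance from a random query to n random points,
    relative to the mean distance. Shrinks as `dims` grows."""
    points = rng.standard_normal((n, dims))
    query = rng.standard_normal(dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.mean()

for dims in (2, 64, 1536):
    print(dims, round(distance_spread(dims), 2))  # spread shrinks with dims
```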

Problem 2: Embedding Model Mismatch

  • Indexing with model A (text-embedding-3-small), querying with model B (BERT)
  • Result: Query embedding not comparable to indexed embeddings
  • Solution: Always use same embedding model for indexing + querying

Problem 3: Memory Explosion at Scale

  • 10B vectors × 1536 dims × 4 bytes = 61.4TB uncompressed
  • Common mistake: Load all vectors into RAM
  • Solution: Use quantization (reduce to Int8: 15.3TB), or use disk-backed index

Problem 4: No Filtering = Irrelevant Results

  • Issue: Dense retrieval finds semantically similar vectors, but may be wrong type
  • Example: Searching “how to reset password” returns docs about password policies (semantically similar, but not what user wants)
  • Solution: Add metadata filters (doc_type == “how-to”), use hybrid search (dense + sparse)
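
A sketch of the pre-filter variant of metadata filtering: restrict candidates first, then rank by similarity (toy 2-D embeddings and made-up docs; real engines push the filter into the index traversal instead of scanning):

```python
import math

docs = [
    {"id": "a", "emb": [1.0, 0.0], "meta": {"doc_type": "how-to"}},
    {"id": "b", "emb": [0.9, 0.1], "meta": {"doc_type": "policy"}},
    {"id": "c", "emb": [0.0, 1.0], "meta": {"doc_type": "how-to"}},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def filtered_query(query, doc_type, k=1):
    # Pre-filter on metadata, then rank only the survivors by similarity
    candidates = [d for d in docs if d["meta"]["doc_type"] == doc_type]
    candidates.sort(key=lambda d: cosine(query, d["emb"]), reverse=True)
    return [d["id"] for d in candidates[:k]]

# Unfiltered, "b" (a policy doc) would win on pure similarity;
# the doc_type filter keeps only how-to docs.
print(filtered_query([0.9, 0.1], doc_type="how-to"))  # ['a']
```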

Problem 5: N+1 Queries During Indexing

  • Indexing 1B vectors naively: 1B individual inserts = network overhead
  • Symptom: Indexing takes weeks
  • Solution: Batch inserts (1000 vectors per request), use bulk ingest APIs
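
The batching fix is mostly plumbing; a chunking helper like this turns 2,500 single inserts into 3 bulk requests (the commented `vector_db.upsert` call is a placeholder for whatever client is in use):

```python
def batched(items, batch_size=1000):
    """Yield fixed-size chunks so inserts go out as bulk requests."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

vectors = [{"id": str(i), "embedding": [0.0] * 8} for i in range(2500)]

requests_sent = 0
for batch in batched(vectors, batch_size=1000):
    # vector_db.upsert(vectors=batch)  # one network round-trip per 1000 vectors
    requests_sent += 1
print(requests_sent)  # 3 requests instead of 2500
```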

Performance Tuning

Query Latency Optimization:

1. Reduce ef (search budget in HNSW)
   - Default: ef=100 (search top-100 candidates)
   - Reduce to ef=20 for faster query (10ms vs 50ms)
   - Recall drops from 99% to 92% (trade-off)

2. Use cheaper distance calculations
   - With vectors pre-normalized at index time, dot product gives the same ranking as cosine at lower cost
   - Difference: negligible for ranking

3. Implement query caching
   - Cache results of frequent queries (5-min TTL)
   - Hit rate: 60-70% for well-distributed workloads
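
Point 2 rests on an identity worth seeing once: for vectors normalized to unit length, the dot product equals cosine similarity, so rankings are unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(1536), rng.standard_normal(1536)

# Full cosine: dot product divided by both norms
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize once at index time; then the cheaper dot product suffices
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(cos, an @ bn))  # True
```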

Indexing Speed Optimization:

1. Batch upserts (bulk insert 1000+ vectors)
   - Single insert: 100 vectors/sec
   - Batch insert: 10K-100K vectors/sec

2. Disable reindexing during bulk load
   - Immediately re-index after bulk load
   - Faster than incremental reindexing

3. Pre-sort vectors by cluster
   - Helps HNSW build more efficient graph
   - 20-30% faster indexing

References

📄 HNSW Paper (Malkov & Yashunin, 2018)
📄 Product Quantization (Jégou et al., 2011)
📄 Billion-scale Similarity Search with GPUs (Johnson et al., 2017) — FAISS introduction
📄 MTEB: Massive Text Embedding Benchmark (Muennighoff et al., 2023)
🔗 Pinecone Documentation
🔗 Weaviate Documentation
🔗 Qdrant Documentation
🔗 Milvus Documentation
🎥 Vector Search Explained (ByteByteGo)

This post is licensed under CC BY 4.0 by the author.