Vector Databases
Semantic search at scale: Store high-dimensional embeddings, find similar documents in milliseconds.
Key Properties
| Property | Requirement | Impact |
|---|---|---|
| Dimensionality | 256–4096 (typically 1536 for OpenAI) | Higher = more expressive, more memory |
| Query Latency | <50–100ms per query | Affects user experience |
| Throughput | 1000–100K queries/sec | Depends on indexing strategy |
| Recall | >90% | Approximate vs exact nearest neighbors |
| Memory Efficiency | Quantization reduces by 4–10x | Trade-off: compression vs accuracy |
Core Operations
Insert: Store new embeddings with metadata
```python
vector_db.upsert(vectors=[
    {"id": "doc1", "embedding": [0.1, 0.2, ..., 0.5], "metadata": {"source": "wiki"}},
])
```
Search: Find k-nearest neighbors by similarity
```python
results = vector_db.query(query_embedding=[0.15, 0.25, ..., 0.48], k=5)
# Returns: top-5 similar vectors with distances
```
Delete/Update: Remove or modify vectors by ID
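The three core operations can be sketched with a minimal in-memory store. This is a brute-force NumPy implementation for illustration only; the `ToyVectorStore` class and its method names are hypothetical, chosen to mirror the generic `upsert`/`query` calls above rather than any specific product's API:

```python
import numpy as np

class ToyVectorStore:
    """Minimal in-memory vector store: upsert, query (cosine), delete."""

    def __init__(self):
        self.vectors = {}   # id -> normalized np.ndarray
        self.metadata = {}  # id -> dict

    def upsert(self, vectors):
        for item in vectors:
            v = np.asarray(item["embedding"], dtype=np.float32)
            self.vectors[item["id"]] = v / np.linalg.norm(v)  # pre-normalize
            self.metadata[item["id"]] = item.get("metadata", {})

    def query(self, query_embedding, k=5):
        q = np.asarray(query_embedding, dtype=np.float32)
        q = q / np.linalg.norm(q)
        ids = list(self.vectors)
        matrix = np.stack([self.vectors[i] for i in ids])  # (n, d)
        scores = matrix @ q  # cosine similarity, since everything is normalized
        top = np.argsort(-scores)[:k]
        return [(ids[i], float(scores[i])) for i in top]

    def delete(self, ids):
        for i in ids:
            self.vectors.pop(i, None)
            self.metadata.pop(i, None)

db = ToyVectorStore()
db.upsert([
    {"id": "doc1", "embedding": [0.1, 0.2, 0.5], "metadata": {"source": "wiki"}},
    {"id": "doc2", "embedding": [0.9, 0.1, 0.0], "metadata": {"source": "blog"}},
])
print(db.query([0.1, 0.2, 0.5], k=1))  # doc1 scores highest (identical direction)
```

Real systems replace the linear scan in `query` with an ANN index (see below), but the interface stays essentially this shape.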
Indexing Strategies
| Index Type | Time (Build) | Time (Query) | Memory | Best For |
|---|---|---|---|---|
| Flat (brute-force) | O(1) | O(n×d) | Low | <1M vectors |
| HNSW (Hierarchical Navigable Small World) | O(n log n) | O(log n) | Medium | General-purpose, fast |
| IVF (Inverted File) | O(n log n) | O(n/nlist) | Low | Large scale, commodity hardware |
| Quantized (product/binary) | O(n) | O(n/nlist) | Very Low | 100M+ vectors, memory constrained |
HNSW Details:
- Hierarchical layers organize vectors: sparse upper layers route coarsely, the bottom layer holds all vectors
- Navigable small world: local proximity links plus long-range shortcuts
- Recall: >98% at scale
- Pinecone, Weaviate, and Qdrant all use HNSW as a primary index
Popular Systems
| System | Recall | Speed | Scale | Cost | Best For |
|---|---|---|---|---|---|
| Pinecone | >95% | <50ms | Billions | $10–1000/mo | Managed, serverless |
| Weaviate | >90% | <100ms | 1B+ | Open-source (free) | Self-hosted flexibility |
| Milvus | >95% | <50ms | 1B+ | Open-source (free) | Distributed, cloud-native |
| Qdrant | >98% | <50ms | Billions | Free/paid | Production-grade, performant |
| Chroma | >90% | <100ms | 100M | Open-source (free) | Lightweight, embeddings-focused |
Production Deployment
Typical Setup:
```
Application → Vector DB (8 replicas, sharded across 4 nodes)
        ↓
Pinecone / Weaviate / Milvus (1B vectors, 1536 dims each)
        ↓
Latency: P50=30ms, P99=100ms
Throughput: 10K queries/sec
Memory: 6TB (1B × 1536 × 4 bytes)
```
Optimization:
- Batch queries: send 100 queries in one request instead of 100 separate calls
- Async: non-blocking I/O so web servers don't stall on queries
- Caching: cache results of frequently repeated queries
- Compression: quantize to lower precision (float32 → int8), or reduce dimensions (e.g. 1536 → 512) with minimal accuracy loss
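Query batching is worth illustrating: with normalized vectors, a batch of queries becomes a single matrix multiply instead of a Python loop of per-query scans. A minimal NumPy sketch on synthetic data (the array sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
index = rng.normal(size=(100_000, 128)).astype(np.float32)  # stored vectors
index /= np.linalg.norm(index, axis=1, keepdims=True)
queries = rng.normal(size=(100, 128)).astype(np.float32)    # 100 queries at once
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

# Batched: one (100, 100_000) similarity matrix, then top-5 per row.
scores = queries @ index.T
top5 = np.argpartition(-scores, 5, axis=1)[:, :5]  # top-5 ids per query, unordered

print(top5.shape)  # (100, 5)
```

The same principle applies to hosted vector DBs: one request carrying 100 queries amortizes network round-trips the way this matrix multiply amortizes per-query overhead.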
When to Use / Avoid Vector Databases
Use Vector Databases when:
- ✅ Need semantic similarity search (not keyword matching)
- ✅ Scaling to 100M+ vectors (HNSW outperforms flat search)
- ✅ Querying in <100ms required (ANN essential, exact search too slow)
- ✅ Building RAG systems (embeddings + retrieval core functionality)
- ✅ Enabling generative AI features (search + LLM context integration)
Avoid Vector Databases when:
- ❌ <1M vectors and latency not critical (flat/brute-force cheaper)
- ❌ Need exact nearest neighbors for critical tasks (approximation error unacceptable)
- ❌ Vectors change extremely frequently (reindexing overhead)
- ❌ All queries identical (traditional caching better)
How Real Systems Use This
Pinecone (Notion AI Integration): Notion uses Pinecone to power their AI search across user workspaces. When Notion receives a user query to search notes, they: (1) embed the query using OpenAI text-embedding-3-small (384 dimensions, 10ms), (2) query Pinecone with k=10 candidates, (3) Pinecone returns top-10 matches in <30ms using HNSW indexing, (4) Notion returns matching notes to user. Pinecone stores ~500M vectors for Notion (across all users). Per-user namespaces isolate data. Metrics: P99 query latency = 45ms, precision@10 > 95%. Cost: $0.08/100K vectors/month managed service. Why Notion chose Pinecone: Managed serverless (no infrastructure), multi-tenancy built-in, 99.95% SLA, metadata filtering for per-workspace isolation, built-in replication for HA.
Weaviate (Stack Overflow Q&A Search): Stack Overflow integrated Weaviate to improve question recommendation. When a user asks a new question, the system: (1) vectorizes the question using their fine-tuned encoder model, (2) searches 22M Stack Overflow questions in their self-hosted Weaviate cluster, (3) returns top-20 similar questions in <100ms, (4) displays as “Similar questions” sidebar. Weaviate cluster spans 5 nodes with 22M vectors × 768 dimensions. Metrics: Recall@100 = 98% (catches nearly all relevant duplicates), P99 latency = 150ms. Cost: Self-hosted, negligible compute (uses existing servers). Why Stack Overflow chose Weaviate: Open-source (no vendor lock-in), GraphQL API enables flexible queries, hybrid search (semantic + keyword BM25), module system for custom vectorizers, cost-effective at scale.
Qdrant (Stripe RAG System): Stripe uses Qdrant for their internal documentation RAG. 1000+ API documentation pages are vectorized (OpenAI text-embedding-3-large, 1536 dims) and stored in Qdrant. When a developer asks “How do I handle webhook retries?”: (1) query is embedded (15ms), (2) Qdrant returns top-5 similar docs in <30ms using HNSW, (3) optional reranking with cross-encoder (40ms), (4) docs inserted into LLM context. Self-hosted Qdrant cluster: 2 nodes, 1M vectors, ~5GB memory. Metrics: Answer accuracy = 92%, retrieval precision@5 = 94%. Why Qdrant: Extremely fast HNSW (payload-aware indexing), high precision, scales to billions of vectors, built-in filtering (metadata), REST + gRPC APIs.
Milvus (E-commerce Product Search): An e-commerce platform uses open-source Milvus to search 50M product embeddings. When a customer searches “blue running shoes”, the system: (1) embeds query using product encoder, (2) queries Milvus cluster (8 data nodes + 2 query nodes), (3) returns top-50 products ranked by similarity in <50ms, (4) results re-ranked by price/rating/inventory. Milvus cluster: 50M vectors × 768 dimensions = ~150GB memory, distributed across 8 machines. Metrics: Recall@50 = 96%, P99 latency = 80ms, throughput = 10K queries/sec. Cost: Self-hosted, ~$5K/month infrastructure (commodity servers). Why chose Milvus: Distributed-first design (scales horizontally), cloud-native (Kubernetes ready), IVF compression reduces memory 4x vs flat, HNSW for extreme precision trade-off.
Chroma (Local LLM Development): Developers using local/open-source LLMs (LLaMA, Mistral) often use Chroma for lightweight embedding storage. Chroma runs in-process (Python library) or as Docker container. For 100K embeddings: (1) embeds documents with open-source model (MiniLM, 384 dims), (2) stores in Chroma SQLite backend (~50MB), (3) queries return top-k in <50ms. Typical usage: local chatbot on laptop querying personal documents. Metrics: Recall@10 = 90% (sufficient for prototyping), P99 latency = 100ms (variable, local hardware dependent). Cost: Free, open-source. Why developers choose Chroma: Zero infrastructure, instant setup, SQL filtering, client-focused APIs, Python-native.
Advanced Indexing Patterns
IVF-PQ (Inverted File + Product Quantization):
- Divide vectors into clusters (IVF)
- Within each cluster, quantize vectors to 8-bit (PQ)
- Query searches only 1-2 clusters (90% reduction in search space)
- Trade-off: 2-5% accuracy loss for 10-50x speedup
- Popular in: FAISS, Elasticsearch, OpenSearch
HNSW with Pruning:
- Store only top neighbors per layer
- Skip intermediate connections
- Reduces index memory 20-30%
- Minimal latency impact (<5ms)
Approximate Nearest Neighbor (ANN) Scaling:
- 1M vectors × 1536 dims = 6GB flat memory
- With HNSW: add 20% for graph structure = 7.2GB
- With IVF-PQ: compress to 2-3GB
- Choice depends on latency vs memory trade-off
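The memory arithmetic above generalizes to a small helper. A sketch (the 20% HNSW graph overhead and per-dimension byte counts are the rough figures quoted above, not exact values for any particular engine):

```python
def index_memory_gb(n_vectors, dims, bytes_per_dim=4.0, graph_overhead=0.0):
    """Estimate index memory in GB (1 GB = 1e9 bytes)."""
    flat = n_vectors * dims * bytes_per_dim
    return flat * (1 + graph_overhead) / 1e9

flat = index_memory_gb(1_000_000, 1536)                       # ~6.1 GB float32
hnsw = index_memory_gb(1_000_000, 1536, graph_overhead=0.20)  # ~7.4 GB with graph
int8 = index_memory_gb(1_000_000, 1536, bytes_per_dim=1.0)    # ~1.5 GB quantized

print(round(flat, 2), round(hnsw, 2), round(int8, 2))
```

The exact figures land slightly above the rounded "6GB / 7.2GB" numbers in the list because 1M × 1536 × 4 bytes is 6.144 GB, not 6.0.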
Production Deployment Patterns
Single-region High Availability:
```
Load Balancer
├── Qdrant replica 1 (vector index + data)
├── Qdrant replica 2 (vector index + data)
└── Qdrant replica 3 (vector index + data)
```
Configuration:
- All 3 replicas hot (no failover latency)
- Consensus protocol for consistency
- Write replication factor: 3
- Read from any replica (load balanced)
Multi-region Disaster Recovery:
```
Primary Region (Pinecone US)
└── 500M vectors, read/write

Secondary Region (Pinecone EU)
└── Replica of 500M vectors, read-only

Async replication from primary → secondary (5-30 sec lag)
```
Caching Layer (for hot queries):
```
Redis In-Memory Cache
├── Popular queries: "return top 10 for query X"
└── Hit rate: 60-70% for well-distributed access

Query Pattern:
1. Check Redis cache (1ms hit)
2. If miss, query vector DB (30-100ms)
3. Update cache (TTL: 5 min)
```
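The cache-aside query pattern can be sketched in a few lines. Here a plain dict with expiry timestamps stands in for Redis, and `fake_db_query` stands in for the vector DB call; both are illustrative assumptions, not a real client:

```python
import time

CACHE_TTL = 300  # seconds (5-min TTL, as above)
_cache = {}      # query_key -> (expires_at, results); dict stands in for Redis

def cached_query(query_key, run_query):
    """Cache-aside: return cached results if fresh, else hit the vector DB."""
    now = time.time()
    entry = _cache.get(query_key)
    if entry and entry[0] > now:           # cache hit (~1ms with Redis)
        return entry[1]
    results = run_query()                  # cache miss: 30-100ms vector DB query
    _cache[query_key] = (now + CACHE_TTL, results)
    return results

calls = []
def fake_db_query():
    calls.append(1)                        # count how often the DB is actually hit
    return ["doc1", "doc7", "doc3"]

cached_query("blue running shoes", fake_db_query)
cached_query("blue running shoes", fake_db_query)
print(len(calls))  # 1 -> the second call was served from cache
```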
Embedding Model Comparison
| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3072 | Slow (100ms) | Excellent (MTEB 64.6) | $0.13/1M tokens |
| text-embedding-3-small (OpenAI) | 1536 | Medium (50ms) | Very Good (MTEB 62.2) | $0.02/1M tokens |
| MiniLM-L6 (open-source) | 384 | Fast (5ms local) | Good (MTEB 56) | Free |
| bge-large-en (open-source) | 1024 | Fast (10ms) | Excellent (MTEB 63.6) | Free |
| voyage-large-2 (Voyage AI) | 1024 | Medium (60ms) | Excellent (MTEB 63.9) | $0.10/1M tokens |
Quantization Impact on Recall
| Quantization | Recall@10 | Memory Reduction | Speed Improvement |
|---|---|---|---|
| Float32 (baseline) | 100% | 1x | 1x |
| Float16 | 99.8% | 2x | 1.2x |
| Int8 | 98.5% | 4x | 1.8x |
| Int4 | 95% | 8x | 2.5x |
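The int8 row of the table can be made concrete with symmetric scalar quantization (one scale per vector). This sketch shows the 4x memory reduction and the bounded reconstruction error; the recall figures in the table are empirical results on real embeddings and are not reproduced by this synthetic example:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 128)).astype(np.float32)

# Symmetric int8 scalar quantization: one scale per vector, values in [-127, 127].
scales = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
q = np.round(vectors / scales).astype(np.int8)

# Dequantize to check the error bound: at most half a quantization step.
reconstructed = q.astype(np.float32) * scales
err = np.abs(reconstructed - vectors).max()

print(vectors.nbytes // q.nbytes)  # 4  (float32 -> int8)
```

Production engines typically quantize once at index-build time and keep only the int8 codes (plus scales) in RAM, dequantizing or rescoring a small candidate set when higher precision is needed.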
Approximate Nearest Neighbor (ANN) Trade-offs
Exact (Brute-Force) Search:
- Algorithm: Compare query to all vectors, return k smallest distances
- Complexity: O(n) queries, O(1) build
- Pros: 100% recall, no approximation
- Cons: Slow for large n (n>1M)
- When to use: <1M vectors, accuracy critical
HNSW (Hierarchical Navigable Small World):
- Algorithm: Multi-layer graph, navigate via small-world paths
- Complexity: O(log n) queries, O(n log n) build
- Recall: 95-99% (depends on ef parameter)
- When to use: General-purpose, 1M-1B vectors
- Trade-off: Memory overhead (20-30% vs flat), but enables sub-millisecond queries
IVF-PQ (Inverted File + Product Quantization):
- Algorithm: Cluster vectors (IVF) + compress within cluster (PQ)
- Complexity: O(n / n_clusters) queries
- Recall: 85-95%
- When to use: Extreme scale (100M+ vectors), memory constrained
- Trade-off: Lower recall than HNSW, but massive memory savings
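The IVF half of IVF-PQ fits in a short sketch: cluster the vectors with a few k-means steps at build time, then probe only the `nprobe` nearest clusters at query time instead of scanning all n vectors. This is a toy NumPy version on random data (cluster counts and sizes are arbitrary assumptions; the PQ compression step is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(5_000, 32)).astype(np.float32)

# --- Build: partition vectors into nlist inverted lists via a few k-means steps ---
nlist = 16
centroids = data[rng.choice(len(data), nlist, replace=False)].copy()
for _ in range(5):  # Lloyd iterations
    assign = np.argmin(((data[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(nlist):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
assign = np.argmin(((data[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
lists = {c: np.where(assign == c)[0] for c in range(nlist)}

# --- Query: probe only the nprobe nearest clusters, not all n vectors ---
def ivf_search(query, k=5, nprobe=2):
    order = np.argsort(((centroids - query) ** 2).sum(-1))   # rank clusters
    cand = np.concatenate([lists[c] for c in order[:nprobe]])
    dists = ((data[cand] - query) ** 2).sum(-1)              # exact within probes
    return cand[np.argsort(dists)[:k]]

ids = ivf_search(data[0], k=5, nprobe=2)
```

With `nprobe=2` of 16 clusters, each query scans roughly 1/8 of the data, which is where the O(n/nlist) query cost and the recall trade-off both come from: true neighbors in unprobed clusters are simply missed.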
Vector Database Comparison (Detailed)
| Factor | Pinecone | Weaviate | Qdrant | Milvus | Chroma |
|---|---|---|---|---|---|
| Type | Managed | Self-hosted | Self-hosted | Self-hosted | Embedded (in-process) |
| Scaling | Serverless | Kubernetes | Kubernetes | Kubernetes | Single machine |
| Consistency | Eventually consistent | Strong | Strong | Strong | Strong |
| Filtering | Yes (metadata) | Yes | Yes | Yes | Yes |
| Price (100K vectors) | ~$1-5/mo | Free | Free | Free | Free |
| Ops Burden | None | Medium (K8s) | Medium | High (distributed) | None |
| Best For | SaaS app, serverless | ML apps, on-prem | Production, latency-critical | Large scale, cloud-native | Local dev, prototyping |
Common Vector DB Pitfalls
Problem 1: Curse of Dimensionality (High-Dimensional Vectors)
- Vector dimensionality: 1536 (OpenAI), 768 (BERT)
- Issue: Distance becomes less meaningful in high dimensions
- Symptom: Top-k queries return similar distances (hard to distinguish)
- Solution: Dimensionality reduction (PCA), use cosine distance (vs L2), quantization
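The PCA mitigation can be sketched via SVD: center the embeddings and project onto the top-d principal components before indexing. The array sizes here are arbitrary stand-ins for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 256)).astype(np.float32)  # stand-in embeddings

def pca_reduce(X, d):
    """PCA via SVD: project centered data onto the top-d principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:d]                   # (d, dims), rows are orthonormal
    return Xc @ components.T, components, mean

reduced, components, mean = pca_reduce(X, 64)
print(reduced.shape)  # (2000, 64)
```

Note that queries must go through the same projection (`(q - mean) @ components.T`) before searching the reduced index, for the same reason given under Problem 2: index-time and query-time transformations must match.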
Problem 2: Embedding Model Mismatch
- Indexing with model A (text-embedding-3-small), querying with model B (BERT)
- Result: Query embedding not comparable to indexed embeddings
- Solution: Always use same embedding model for indexing + querying
Problem 3: Memory Explosion at Scale
- 10B vectors × 1536 dims × 4 bytes = 61.4TB uncompressed
- Common mistake: Load all vectors into RAM
- Solution: Use quantization (reduce to Int8: 15.4TB), or use disk-backed index
Problem 4: No Filtering = Irrelevant Results
- Issue: Dense retrieval finds semantically similar vectors, but may be wrong type
- Example: Searching “how to reset password” returns docs about password policies (semantically similar, but not what user wants)
- Solution: Add metadata filters (doc_type == “how-to”), use hybrid search (dense + sparse)
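A minimal pre-filtering sketch (the `doc_type` field and document contents are hypothetical): restrict the candidate pool by metadata first, then rank only the survivors by cosine similarity.

```python
import numpy as np

docs = [
    {"id": "d1", "doc_type": "how-to", "vec": [0.9, 0.1]},
    {"id": "d2", "doc_type": "policy", "vec": [0.95, 0.05]},  # similar vector, wrong type
    {"id": "d3", "doc_type": "how-to", "vec": [0.1, 0.9]},
]

def filtered_search(query_vec, doc_type, k=2):
    # Pre-filter by metadata, then rank only the surviving vectors.
    pool = [d for d in docs if d["doc_type"] == doc_type]
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    def score(d):
        v = np.asarray(d["vec"], dtype=np.float32)
        return float(v @ q / np.linalg.norm(v))
    return [d["id"] for d in sorted(pool, key=score, reverse=True)[:k]]

print(filtered_search([1.0, 0.0], "how-to"))  # ['d1', 'd3'] -> d2 excluded despite high similarity
```

Production engines (Qdrant, Pinecone, Weaviate) push this filtering into the index itself rather than post-hoc Python, so the ANN traversal only visits vectors matching the filter.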
Problem 5: N+1 Queries During Indexing
- Indexing 1B vectors naively: 1B individual inserts = network overhead
- Symptom: Indexing takes weeks
- Solution: Batch inserts (1000 vectors per request), use bulk ingest APIs
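The batching fix is a one-function change at the client: chunk the vectors and send one bulk request per chunk. A sketch where `requests.append(...)` stands in for the actual bulk upsert call:

```python
def batched(items, batch_size=1000):
    """Yield fixed-size slices so inserts go out as bulk requests."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

requests = []
vectors = [{"id": f"doc{i}"} for i in range(2_500)]
for batch in batched(vectors, batch_size=1000):
    requests.append(len(batch))   # stand-in for vector_db.upsert(vectors=batch)

print(requests)  # [1000, 1000, 500] -> 3 requests instead of 2,500
```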
Performance Tuning
Query Latency Optimization:
1. Reduce ef (search budget in HNSW)
   - Default: ef=100 (search top-100 candidates)
   - Reduce to ef=20 for faster query (10ms vs 50ms)
   - Recall drops from 99% to 92% (trade-off)
2. Use approximate distance calculations
   - Instead of exact cosine, use dot product (faster)
   - Difference: negligible for ranking
3. Implement query caching
   - Cache results of frequent queries (5-min TTL)
   - Hit rate: 60-70% for well-distributed workloads
Indexing Speed Optimization:
1. Batch upserts (bulk insert 1000+ vectors)
   - Single insert: 100 vectors/sec
   - Batch insert: 10K-100K vectors/sec
2. Disable reindexing during bulk load
   - Re-index once after the bulk load completes
   - Faster than incremental reindexing
3. Pre-sort vectors by cluster
   - Helps HNSW build a more efficient graph
   - 20-30% faster indexing
References
- 📄 HNSW Paper (Malkov & Yashunin, 2018)
- 📄 Product Quantization (Jégou et al., 2011)
- 📄 Billion-scale Similarity Search with GPUs (Johnson et al., 2017) — FAISS introduction
- 📄 MTEB: Massive Text Embedding Benchmark (Muennighoff et al., 2023)
- 🔗 Pinecone Documentation
- 🔗 Weaviate Documentation
- 🔗 Qdrant Documentation
- 🔗 Milvus Documentation
- 🎥 Vector Search Explained (ByteByteGo)