
The Core Decision Framework

The “right” database depends on three dimensions:

  1. Data shape — structured (schema-enforced) vs flexible (schema-less)
  2. Access pattern — how data is read/written (random vs sequential, point vs range, graph traversal)
  3. Scale — throughput (requests/sec), latency, and data volume

Data Shape

| Shape | Definition | Examples |
| --- | --- | --- |
| Structured | Fixed schema, enforced at write | Financial transactions, user accounts, orders |
| Semi-structured | JSON documents, flexible fields | User profiles, logs, events, configs |
| Unstructured | Binary files, no schema | Images, videos, PDFs, text documents |
| Graph | Nodes + edges, relationships matter | Social networks, fraud rings, recommendation graphs |
| Time-series | Timestamp-ordered, high volume | Metrics, events, IoT sensor readings |

Access Pattern Analysis

Point Lookups: “Get user with ID=123”

  • SQL: SELECT * FROM users WHERE id = 123
  • NoSQL: db.get("user:123")
  • Optimal: Hash index (O(1)) or B+ tree (O(log n))

Range Queries: “Get all users aged 25–35”

  • SQL: SELECT * FROM users WHERE age BETWEEN 25 AND 35
  • NoSQL: Often not native; typically needs a sort key (e.g., a DynamoDB range key) or application-side filtering
  • Optimal: B+ tree (O(log n) + range size)

Aggregations: “Sum of all purchases by country”

  • SQL: SELECT country, SUM(amount) FROM purchases GROUP BY country
  • NoSQL: Usually requires full table scan + application aggregation
  • Optimal: Columnar DB (OLAP) for billion-row tables
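
These three relational patterns can be tried end-to-end with SQLite standing in for any relational database (table names and data here are illustrative):

```python
import sqlite3

# In-memory database standing in for any relational store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)")
conn.execute("CREATE INDEX idx_age ON users (age)")  # B+ tree index on age
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, 22), (2, 30), (3, 34), (4, 41)])

conn.execute("CREATE TABLE purchases (country TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)",
                 [("US", 10.0), ("US", 5.0), ("DE", 7.5)])

# Point lookup: served by the primary-key index
row = conn.execute("SELECT * FROM users WHERE id = 3").fetchone()

# Range query: served by the B+ tree index on age
in_range = conn.execute(
    "SELECT id FROM users WHERE age BETWEEN 25 AND 35 ORDER BY id").fetchall()

# Aggregation: GROUP BY buckets rows and sums per bucket
totals = dict(conn.execute(
    "SELECT country, SUM(amount) FROM purchases GROUP BY country"))
```

The same schema scales the same way conceptually: the index choice, not the database brand, determines which of these queries is cheap.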

Traversals: “Find all friends of friends of friends”

  • SQL: 3+ JOINs, exponential complexity
  • Graph DB: Linear in number of edges
  • Optimal: Graph DB (O(edges)) vs SQL (O(n³))
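
A graph store answers this with a breadth-first walk over adjacency lists, visiting each edge at most once — which is where the "linear in edges" claim comes from. A minimal sketch in plain Python (the adjacency dict stands in for a real graph database):

```python
from collections import deque

# Adjacency list: each node maps to its direct friends
graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}

def friends_within(graph, start, depth):
    """Everyone reachable from `start` in at most `depth` hops, excluding start."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = set()
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # don't expand past the requested depth
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                found.add(neighbor)
                frontier.append((neighbor, d + 1))
    return found
```

In SQL, each extra hop is another self-JOIN on the friendships table; here it is just one more layer of the same queue.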

Text Search: “Find all articles containing ‘machine learning’”

  • SQL: SELECT * FROM articles WHERE content LIKE '%machine learning%' (slow full table scan)
  • Search Engine: Inverted index, instant
  • Optimal: Elasticsearch (O(k) where k = matching docs)
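
The inverted index behind that speed is simple to sketch: map each term to the set of documents containing it, then intersect the postings for a multi-word query. (Whitespace tokenization is a deliberate simplification; real engines add stemming, relevance scoring, and positional data.)

```python
from collections import defaultdict

docs = {
    1: "machine learning improves search ranking",
    2: "databases index data for fast lookups",
    3: "deep learning is a branch of machine learning",
}

# Build the inverted index: term -> set of doc IDs containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(index, query):
    """AND-query: docs containing every term; cost scales with the
    matching postings, not with the total corpus size."""
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()
```

`search(index, "machine learning")` touches two small postings sets instead of scanning every document's text, which is exactly what `LIKE '%…%'` cannot avoid.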

Similarity Search: “Find similar documents to query embedding”

  • SQL: Compute distance to all N documents, sort (O(N log N))
  • Vector DB: HNSW indexing, k-nearest neighbors (O(log N))
  • Optimal: Vector DB for embeddings
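
The brute-force baseline a vector DB replaces looks like this: score every vector, sort, take the top k. An HNSW index returns approximately the same answer without touching all N vectors (toy 3-dimensional embeddings for illustration; real ones have hundreds of dimensions):

```python
import math

vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def knn(vectors, query, k):
    """Exact k-NN: score all N vectors, sort descending. This is the
    O(N) scan that an HNSW graph index approximates in ~O(log N)."""
    scored = sorted(vectors, key=lambda name: cosine(vectors[name], query),
                    reverse=True)
    return scored[:k]
```

At a few thousand vectors the scan is fine; at hundreds of millions it is not, which is the entire reason the vector DB category exists.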

Database Categories & Trade-offs

1. Relational (SQL)

Best for: Structured data + complex queries + transactions

| Property | Strength |
| --- | --- |
| ACID transactions | ✅ Full ACID (Atomicity, Consistency, Isolation, Durability) |
| Schema | Enforced at write (prevents garbage data) |
| Joins | Native, optimized by the query planner |
| Scaling | Vertical (add RAM/CPU) plus read replicas (read-only copies); sharding requires consistent hashing |
| Consistency | Strong (read-your-own-writes) |
| Query flexibility | Arbitrary queries across tables |

Examples:

  • PostgreSQL — Feature-rich, JSON support, full-text search extension
  • MySQL — Web workloads, stable, wide tooling (WordPress, Laravel)
  • MariaDB — MySQL fork, open-source, drop-in replacement
  • Aurora (AWS) — Managed PostgreSQL/MySQL with auto-scaling read replicas

When to use:

  • Complex business logic (invoicing, reservations, financial transfers)
  • Multiple tables with cross-table queries
  • Data integrity is critical (foreign keys, constraints)
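
The transactional guarantee is the whole point: a transfer either fully commits or fully rolls back. A runnable sketch with SQLite, using a CHECK constraint to stand in for a business rule (no negative balances):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
        return True
    except sqlite3.IntegrityError:  # CHECK violation: insufficient funds
        return False
```

If the debit would go negative, the CHECK constraint fires, the transaction rolls back, and neither account changes — the property that is painful to reproduce by hand in an eventually consistent store.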

When NOT to use:

  • Need horizontal scaling (sharding with consistent hashing is complex)
  • Write-heavy workload (> 50% writes) — use LSM-based DB
  • Unstructured data (JSON documents are fine; but binary blobs are expensive)

2. Document (NoSQL)

Best for: Flexible schema + hierarchical data + horizontal scaling

| Property | Benefit |
| --- | --- |
| Schema-less | No schema enforcement (add fields freely) |
| JSON storage | Native nested objects (no multi-table JOINs needed) |
| Horizontal scaling | Sharding built-in (data partitioned by ID) |
| Eventual consistency | Faster writes (trade immediate consistency) |

Examples:

  • MongoDB — Document-oriented, aggregation pipeline, wide adoption
  • Firestore (Google Cloud) — Managed Firebase DB, real-time sync
  • CouchDB — Multi-master replication, offline-first
  • RethinkDB — Real-time feeds, changefeeds

When to use:

  • Flexible schema (fields vary per document)
  • Hierarchical data (user + profile + preferences in one doc)
  • Horizontal scaling needed (millions of users)
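
The "hierarchical data in one doc" point can be shown with plain dicts standing in for a document collection: one nested document replaces a users + profiles + preferences three-table join, and documents with different fields coexist in the same collection.

```python
import json

# One self-contained document instead of a three-table join
user_doc = {
    "_id": "user:123",
    "name": "Ada",
    "profile": {"bio": "Engineer", "location": "London"},
    "preferences": {"theme": "dark", "notifications": True},
}

# A sibling document with entirely different fields — no migration required
other_doc = {"_id": "user:456", "name": "Lin", "beta_features": ["new_feed"]}

collection = {doc["_id"]: doc for doc in (user_doc, other_doc)}

# A point lookup returns the whole hierarchy in one read, no joins
fetched = collection["user:123"]
theme = fetched["preferences"]["theme"]

# Round-trips cleanly as JSON, which is how document stores persist it
assert json.loads(json.dumps(fetched)) == fetched
```

The trade-off appears the moment you need to update `user:123` and `user:456` together atomically — which is why the "When NOT to use" list below starts with multi-document transactions.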

When NOT to use:

  • Complex multi-document transactions needed (use relational DB)
  • Heavy aggregations across entire collection (use OLAP/data warehouse)

3. Key-Value (In-Memory / Cache)

Best for: Simple lookups + extreme speed + caching

| Property | Benefit |
| --- | --- |
| O(1) lookup | Microsecond latency |
| In-memory | RAM speed, not disk |
| Horizontal scaling | Consistent hashing, sharding |

Examples:

  • Redis — In-memory data structures, caching, sessions, leaderboards
  • Memcached — Simple key-value cache
  • DynamoDB (AWS) — Managed key-value, auto-scaling

When to use:

  • Caching hot data (user sessions, frequently accessed records)
  • Counters and real-time aggregations
  • Rate limiting, leaderboards, distributed locks

When NOT to use:

  • Primary data store for critical data (use with replication)
  • Complex queries needed (use relational DB)
  • Data larger than available RAM

4. Wide-Column (NoSQL)

Best for: Write-heavy time-series + massive scale

| Property | Benefit |
| --- | --- |
| Column families | Group related data (e.g., user metrics, user metadata) |
| Sorted by row key | Range queries on row key (e.g., timestamp range) |
| Horizontal scaling | Built-in sharding, consistent hashing |

Examples:

  • Cassandra — Facebook-origin, 1M+ writes/sec per cluster, eventual consistency
  • HBase — Hadoop ecosystem, serious operations overhead
  • Bigtable (Google) — Proprietary, powers Google Analytics, Search

When to use:

  • Time-series data (metrics, logs, events)
  • Write-heavy workload (> 100K writes/sec)
  • Can tolerate eventual consistency
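
The partition-key / clustering-key layout can be sketched with a dict of sorted lists: each partition holds rows ordered by timestamp, so a time-range query touches one partition and binary-searches it. (Names like `sensor_id` are illustrative; in Cassandra this would be `PRIMARY KEY (sensor_id, timestamp)`.)

```python
import bisect
from collections import defaultdict

# Partition key -> rows kept sorted by clustering key (timestamp),
# mirroring how a wide-column store lays out a partition on disk
table = defaultdict(list)  # sensor_id -> [(timestamp, value), ...]

def insert(table, sensor_id, timestamp, value):
    # Keep each partition sorted on insert, like an LSM memtable
    bisect.insort(table[sensor_id], (timestamp, value))

def range_query(table, sensor_id, t_start, t_end):
    """All readings for one sensor in [t_start, t_end): touches a single
    partition and binary-searches the sorted clustering key."""
    rows = table[sensor_id]
    lo = bisect.bisect_left(rows, (t_start,))
    hi = bisect.bisect_left(rows, (t_end,))
    return rows[lo:hi]
```

Note what is cheap and what is not: any query that starts from the partition key is fast; a query across all sensors for one timestamp would have to visit every partition.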

When NOT to use:

  • Need strong consistency (use relational DB)
  • Complex queries (use relational or OLAP)

5. Search Engine

Best for: Full-text search + relevance ranking

| Property | Benefit |
| --- | --- |
| Inverted index | Instant full-text search |
| Relevance scoring | TF-IDF, BM25 scoring |
| Faceted search | Filter by category, tag, etc. |

Examples:

  • Elasticsearch — Distributed Lucene, REST API, rich aggregations
  • OpenSearch — AWS fork of Elasticsearch
  • Solr — Mature Lucene wrapper
  • Meilisearch — Developer-friendly, typo-tolerant search

When to use:

  • Full-text search (users type keywords)
  • Relevance ranking needed
  • Fast filtering + sorting (e.g., e-commerce search)

When NOT to use:

  • Structured queries only (use relational DB)
  • Exact-match lookups only (use key-value cache)

6. Graph

Best for: Relationship traversal + complex connections

| Property | Benefit |
| --- | --- |
| Adjacency lists | Sub-millisecond traversal |
| Path queries | “Find all friends of friends” |
| Pattern matching | Cypher language for graph queries |

Examples:

  • Neo4j — Leading graph DB, Cypher language, ACID transactions
  • Amazon Neptune — Managed graph, Gremlin/SPARQL APIs
  • TigerGraph — Large-scale graph analytics

When to use:

  • Social networks (friend connections, recommendations)
  • Fraud detection (find connected accounts)
  • Knowledge graphs (entity relationships)

When NOT to use:

  • No relationship queries (use relational DB)
  • Graph too large for memory (distributed graphs have latency overhead)

7. Time-Series

Best for: Metrics + IoT + append-only sequential data

| Property | Benefit |
| --- | --- |
| Timestamp ordering | Data is naturally sorted |
| Compression | Delta encoding, dictionary → 40–50× compression |
| Auto-TTL | Automatic expiration of old data |
| High ingest | 1M+ metrics/sec per node |
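
The delta encoding mentioned above is easy to demonstrate: regularly spaced timestamps collapse into a stream of small, repetitive deltas that downstream compressors handle far better than raw epoch values. (A simplified sketch; production engines layer further tricks such as delta-of-delta and XOR encoding on top.)

```python
def delta_encode(timestamps):
    """Store the first value plus successive differences; regular
    intervals collapse into tiny, highly compressible deltas."""
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Invert delta_encode by running a cumulative sum."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out
```

Four 10-digit epoch timestamps sampled every 10 seconds become one anchor value plus three copies of `10` — which is why time-series engines report such aggressive compression ratios on metrics data.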

Examples:

  • InfluxDB — Purpose-built time-series, Flux query language
  • Prometheus — Pull-based metrics scraping, local TSDB (15-day default retention)
  • TimescaleDB — PostgreSQL extension, hypertable partitioning
  • ClickHouse — Columnar, 100M+ events/sec

When to use:

  • Metrics (CPU, memory, network)
  • Events (user actions, system logs)
  • High-volume sequential data

When NOT to use:

  • Complex multi-dimensional queries (use OLAP)
  • Primary relational data (use SQL)

8. Vector

Best for: Semantic search + embeddings + similarity

| Property | Benefit |
| --- | --- |
| HNSW indexing | Fast nearest-neighbor search |
| Embedding storage | Store LLM embeddings natively |
| Semantic search | “Find documents similar to query” |

Examples:

  • Pinecone — Fully managed vector DB
  • pgvector (PostgreSQL extension) — Vector support in Postgres
  • Weaviate — Open-source vector DB with GraphQL

When to use:

  • LLM-powered search (embeddings from OpenAI, etc.)
  • Semantic similarity (not keyword matching)
  • Approximate nearest neighbors (k-NN)

When NOT to use:

  • Structured queries only (use relational DB)
  • Data is low-dimensional (< 10 dimensions — a B+ tree or k-d tree index suffices)

9. Data Warehouse (OLAP)

Best for: Analytics + aggregations + historical data

| Property | Benefit |
| --- | --- |
| Columnar storage | 50–100× compression, column-only scans |
| Massive parallelism | Query billion-row tables in seconds |
| Complex aggregations | GROUP BY, WINDOW functions |
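
The columnar advantage in one picture: store each column as its own array, and an aggregation reads only the columns it names. (A toy sketch; real warehouses add per-column compression and vectorized execution on top of this layout.)

```python
from collections import defaultdict

# Row layout: SUM(amount) drags every column through memory
rows = [
    {"country": "US", "amount": 10.0, "product": "book"},
    {"country": "US", "amount": 5.0, "product": "pen"},
    {"country": "DE", "amount": 7.5, "product": "mug"},
]

# Columnar layout: each column is a contiguous, independently stored array
columns = {
    "country": ["US", "US", "DE"],
    "amount": [10.0, 5.0, 7.5],
    "product": ["book", "pen", "mug"],
}

# SUM(amount): scan one array, skip every other column entirely
total = sum(columns["amount"])

# GROUP BY country: still touches only the two columns involved
totals = defaultdict(float)
for country, amount in zip(columns["country"], columns["amount"]):
    totals[country] += amount
```

With a hundred columns and a billion rows, scanning one array instead of one hundred is where the "seconds, not hours" query times come from.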

Examples:

  • BigQuery (Google) — Fully managed, petabyte scale
  • Redshift (AWS) — PostgreSQL-based, columnar, massive scale
  • Snowflake — Cloud-native, sharing, automatic scaling
  • ClickHouse — Fast columnar, open-source

When to use:

  • Analytics on historical data (last 1–5 years)
  • Aggregations across millions/billions of rows
  • Ad-hoc reporting and exploration

When NOT to use:

  • Real-time transactional updates (use OLTP)
  • Complex multi-table transactions (use relational DB)

10. NewSQL

Best for: SQL + horizontal scaling + strong consistency

| Property | Benefit |
| --- | --- |
| SQL interface | Familiar SQL syntax |
| Horizontal scaling | Partition data across nodes |
| Strong consistency | ACID across shards (distributed transactions) |
| High availability | Automatic failover, replicas |

Examples:

  • Google Spanner — Globally distributed, ACID, TrueTime API for globally consistent timestamps
  • CockroachDB — PostgreSQL-compatible, auto-sharding
  • TiDB — MySQL-compatible, distributed ACID

When to use:

  • Need both SQL + horizontal scaling
  • Global distribution (data centers across continents)
  • Strong consistency is non-negotiable

When NOT to use:

  • Single-datacenter (overhead not worth it)
  • Throughput < 10K req/sec (PostgreSQL replicas cheaper)

Interview Decision Template

When asked “what database would you choose?”, use this structure:

Step 1: Identify Data Shape

Question: “Is the data structured (fixed schema) or flexible (JSON)?”

Answer: “The primary data is [structured/semi-structured/time-series/graph] because [reason].”

Example: “The primary data is structured because transactions must enforce referential integrity (orders must reference valid users, products, etc.), so I’d use a relational database.”

Step 2: Identify Access Patterns

Question: “What are the 3 most common queries?”

Answer: List top 3 access patterns + their complexity

Example:

  • Pattern 1: “Get order details by order ID” — point lookup, O(1) hash index
  • Pattern 2: “List orders placed between date X and Y, filtered by customer” — range + equality, O(log n) B+ tree
  • Pattern 3: “Sum order totals by region” — aggregation, O(n) full table scan

Step 3: Identify Scale

Question: “What’s your throughput and data volume?”

Answer: State throughput (req/sec), latency target, and data size

Example: “1M users, 100K purchases/day (1 req/sec average, 50 req/sec peak), <100ms latency requirement. Data: 100GB after 5 years (manageable).”

Step 4: Identify Consistency Requirements

Question: “Can you tolerate eventual consistency or do you need strong consistency?”

Answer: State consistency model + reasoning

Example: “We need strong consistency because users must see their updated profile immediately after editing. We cannot tolerate a 5-minute eventual consistency delay.”

Step 5: Name 2 Options & Trade-offs

Answer: Choose 2 databases, compare pros/cons

Example:

Option 1: PostgreSQL (relational)

  • ✅ ACID transactions (prevent duplicate orders)
  • ✅ JOIN queries (fetch order + user + product data in one query)
  • ✅ Constraints (foreign keys ensure data integrity)
  • ❌ Scaling limited to read replicas (sharding is manual, complex)
  • ❌ For 10M orders, single primary becomes bottleneck

Option 2: MongoDB (document)

  • ✅ Horizontal scaling (shard by customer_id, distribute 10M orders across nodes)
  • ✅ Flexible schema (add fields per order without migration)
  • ❌ No multi-document ACID (if updating order + inventory in one transaction, must handle rollback manually)
  • ❌ JOINs are application-level (fetch order, then fetch user separately)

Final recommendation: “Given the consistency requirement, I’d choose PostgreSQL with read replicas for reads (checkout, order history) and writes going to primary. If order volume exceeds 10M and becomes a bottleneck, I’d shard by customer_id and run multiple PostgreSQL primaries, each owning a customer range.”


Worked Example: E-Commerce Platform

Requirements

  • 1M users
  • 10M orders (5 years history)
  • 50 req/sec average, 500 req/sec peak
  • Read-heavy (80% reads: view orders, 20% writes: checkout)
  • Cross-user analytics: “top 100 products by revenue”

Analysis

Data shape: Structured (orders, products, users have fixed schemas)

Access patterns:

  1. “Get my orders” — range + equality, O(log n)
  2. “Get product details” — point lookup, O(1)
  3. “Top 100 products by revenue” — aggregation, O(10M) full scan

Scale: 10M orders, 500 req/sec peak

Consistency: Strong (orders must be accurate)

Decision

OLTP (transactional): PostgreSQL

  • Primary handles writes from checkout (~10/sec average, ~100/sec at the 500 req/sec peak)
  • Read replicas handle 400 reads/sec (view orders, product details)
  • Queries 1 & 2 use B+ tree indexes on (customer_id, date) and (product_id)

Analytics (query 3): Separate ClickHouse or Redshift cluster

  • Mirror orders table from PostgreSQL → ClickHouse nightly
  • “Top 100 products by revenue” query runs in 100ms on billion-row table
  • Reporting dashboard queries the warehouse, not production DB

Caching layer: Redis

  • Cache hot products (top 100) — 1-hour TTL
  • Cache user sessions
  • Handle rate limiting per user

Architecture summary:

```
Users
  ↓
 [Application]
  ├→ PostgreSQL (Orders, Products, Users)
  │  ├→ Primary (writes)
  │  └→ Read Replicas (reads)
  │
  ├→ Redis (cache, sessions)
  │
  └→ ClickHouse (analytics, offline)
```

Quick Reference: Common Scenarios

| Scenario | Best DB | Why |
| --- | --- | --- |
| User authentication | Redis (cache) + Postgres (authoritative) | Sessions in Redis (fast), users in Postgres (reliable) |
| Shopping cart | Redis (session) + Postgres (orders) | Cart is short-lived (session), orders are permanent |
| Real-time metrics | InfluxDB or TimescaleDB | Time-series optimized, auto-TTL, high compression |
| Recommendations | Graph (Neo4j) or Vector DB | Traversals (graph) or semantic (vector) |
| Product catalog | Elasticsearch + Postgres | Search (ES), updates (Postgres), denormalize catalog to ES |
| Analytics dashboard | BigQuery or Redshift | OLAP, columnar, petabyte scale |
| Logs & events | ClickHouse or ELK stack | High-volume sequential, aggregations |
| Chat messages | MongoDB or Firestore | Flexible schema, easy horizontal scaling |
| Financial transactions | Postgres (only) | ACID non-negotiable, no eventual consistency |

How Real Systems Use This — Polyglot Persistence at Scale

Airbnb — MySQL + Cassandra + Elasticsearch + Redis

Airbnb’s database architecture is a textbook example of polyglot persistence, designed to handle 1M+ listings, billions of searches, and complex reservations.

The Problem: Airbnb must support fast reservation lookups (strong consistency), full-text search over listings (relevance ranking), time-series metrics (availability calendars), and high-throughput data pipeline ingestion.

The Solution:

  • MySQL (Amazon RDS): Core transactional database for reservations, bookings, user accounts, and listing metadata. Strong ACID guarantees ensure a confirmed booking can’t be double-booked. Primary-replica setup handles read replicas for scaling; writes go to primary.
  • Cassandra: Distributed storage for user search history, availability calendars (time-series data per listing), and inbox messages. Write-heavy workload (millions of calendar updates/day), so Cassandra’s LSM write efficiency is ideal. Partition by listing_id; availability data is naturally sorted by date.
  • Elasticsearch: Full-text search over listing descriptions, review text, and location data. Inverted index powers faceted search (“private rooms, pool, patio, near downtown”). Relevance scoring based on review count, booking frequency, and host rating.
  • Redis: Session storage, rate limiting, real-time availability cache (hot listings), and leaderboards (trending listings by booking count).

Why This Mix: MySQL guarantees booking accuracy (critical), Cassandra absorbs massive write volume (calendars, messages), Elasticsearch powers user experience (search), Redis handles throughput (caching, sessions). No single database could serve all four requirements equally well.

Concrete numbers: Airbnb ingests ~100K bookings/day (consistent load on MySQL), ~1M calendar updates/day (perfect for Cassandra), ~1B search queries/month (ES cluster with 100+ nodes), and 10M+ concurrent sessions (Redis with cluster replication).

Lessons:

  1. Specialized tools beat generalists. Elasticsearch search is orders of magnitude faster than LIKE '%pattern%' in MySQL.
  2. Consistency tiers matter. Bookings need strong consistency; calendar availability can be eventually consistent.
  3. Hot path optimization. Redis caches hot listings; most users search the same popular destinations repeatedly.

Uber — MySQL → Cassandra → Schemaless KV Evolution

Uber’s database journey shows how architecture evolves as scale changes. They moved from MySQL (single database) to Cassandra (distributed writes) to a custom schemaless KV store (maximum flexibility and control).

Phase 1: MySQL (Early Stage) When Uber was smaller (2011–2013), a single MySQL instance held all data: rides, users, locations, payments. Queries were straightforward: “Get all active rides in a city,” “Fetch user by ID.” Replication handled reads. Sharding by city_id allowed basic horizontal scaling.

Problem: At 10K rides/day and growing, the MySQL primary became a bottleneck: write volume to the rides table exceeded what a single primary could handle, and resharding was manual and painful.

Phase 2: Cassandra (Growth Phase) As Uber scaled (2013–2016), they migrated ride data to Cassandra. Cassandra’s distributed architecture and LSM writes were perfect for time-series data (ride history is append-mostly). Partition by rider_id + date; querying “All rides by user in Q3” scans one partition per node instead of running a full table scan.

Cassandra’s replication factor 3 gave them fault tolerance: losing one node didn’t impact availability. Write latency stayed low even at 100K rides/day (peak RPS > 1K writes/sec).

Problem: Cassandra’s fixed schema and strict column family structure forced schema migrations to be coordinated. Adding a new field to ride data required compaction and coordination across all nodes. Operational complexity grew with cluster size.

Phase 3: Schemaless KV (Maturity Phase) By 2016+, Uber had built custom infrastructure: Schemaless, an append-only key-value datastore (publicly documented by Uber Engineering, layered over sharded MySQL) that stores schema-free JSON blobs with Cassandra-style distribution. Data: { ride_id: "abc123", user_id: "user456", origin: {...}, destination: {...}, driver_id: "driver789", status: "completed", metadata: { ... } }.

Advantages:

  • No schema versioning. Add fields without coordination; old clients ignore new fields.
  • Lower operational overhead. Owned and tuned for Uber’s exact access patterns (point lookups by ride_id dominate).
  • Better compression. Schema-free JSON compresses better than fixed-column Cassandra format.
  • Faster, simpler writes. Cells are append-only and immutable once written, which keeps the write path cheap and makes replication straightforward.

Trade-off: Uber lost Cassandra’s battle-tested operational tooling. They now manage the custom KV store’s replication, failover, and disaster recovery themselves. But at their scale, the gains (lower latency, fewer operational surprises, full control) outweighed the engineering cost.

Concrete numbers: Uber stores trillions of ride records (compressing to petabytes with their KV store). Write throughput is >100K ops/sec across the cluster. Read latency is <10ms p95 (critical for user app showing “Your driver is 5 min away”).

Lessons:

  1. Evolution is natural. Start with proven databases (MySQL, Cassandra); optimize at scale.
  2. Schema flexibility matters. Schemaless systems let product teams iterate faster (new ride metadata, new driver attributes) without database migrations.
  3. Operational burden increases with customization. Building your own KV store is a last resort, justified only by extreme scale.

This post is licensed under CC BY 4.0 by the author.