
Time-Series & Specialized Databases

When standard relational or document databases don't fit -- specialized engines optimized for specific data shapes (time-ordered metrics, graph traversals, vector embeddings, columnar analytics). Each trades general-purpose flexibility for domain-specific performance gains (10-100x throughput or latency improvements).


Time-Series Databases

The Time-Series Problem

Time-series data differs fundamentally from transactional data:

  • Access pattern: Always sorted by timestamp (time-ordered writes)
  • Retention: Automatic expiration (metrics from 90 days ago often discarded)
  • Aggregation: Range queries (`SELECT avg(cpu) FROM metrics WHERE timestamp >= now - 1h`)
  • Cardinality: Millions of metric series (e.g., metrics from 10K servers, 100 metrics each = 1M series)

Why not PostgreSQL?

  • PostgreSQL enforces ACID + complex query logic, adding overhead for append-only workloads
  • No native TTL (time-to-live) for automatic data expiration
  • Random I/O during compaction interferes with real-time write latency
  • Compression is generic (10-15x); time-series compression can achieve 40-50x using delta encoding + dictionary

Time-Series Storage Model: InfluxDB TSM

Time-Structured Merge-tree (TSM) – InfluxDB’s storage engine. Similar to LSM but optimized for time-series:

  1. Write path: Incoming metrics land in an in-memory cache (backed by a write-ahead log); once the cache reaches a size threshold (~10MB), it is flushed to an immutable TSM file on disk.
  2. Compression: Delta encoding (store differences between consecutive timestamps, not absolute values). Example: `[1000, 1001, 1003, 1002]` stores as `[1000, +1, +2, -1]`, then variable-byte encodes to ~2 bytes per point vs 8.
  3. Compaction: Merges old TSM files, removing deleted points and compressing further. Can achieve 40-50x compression on raw metrics.
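
The delta-encoding step above can be sketched in Python. This is a toy illustration of the idea, not InfluxDB's actual implementation, which follows the deltas with variable-byte and run-length packing:

```python
def delta_encode(timestamps):
    """Store the first value, then differences between consecutive values."""
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Reverse: a running sum reconstructs the original series."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

points = [1000, 1001, 1003, 1002]
encoded = delta_encode(points)
print(encoded)  # [1000, 1, 2, -1]
assert delta_decode(encoded) == points
```

The small deltas are what make the subsequent variable-byte step effective: values near zero fit in one or two bytes instead of eight.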

Concrete numbers (InfluxDB Cloud):

  • Compression ratio: 40-50x on raw metric data
  • Write throughput: 1M metrics/sec per node
  • Query latency: 50-200ms for 24-hour range query across 1M series
  • Retention: Auto-expire old data (configurable, typically 7-90 days)

TimescaleDB – PostgreSQL Extension

What it is: Built on top of PostgreSQL, using "hypertables": transparent partitioning by time. Writes still go through PostgreSQL but are routed to time-based partitions ("chunks") automatically.

Hypertable concept:

```sql
-- Create a regular table, then convert it into a hypertable
-- partitioned ("chunked") by time, one chunk per day
CREATE TABLE metrics (
    time TIMESTAMPTZ NOT NULL,
    host TEXT,
    cpu  FLOAT
);
SELECT create_hypertable('metrics', 'time',
                         chunk_time_interval => INTERVAL '1 day');

-- TimescaleDB then transparently handles:
--   - Routing inserts into the correct chunk
--   - Automatic TTL (e.g., add_retention_policy('metrics', INTERVAL '90 days'))
--   - Compression within each chunk
```

Trade-offs:

  • ✅ SQL queries (PostgreSQL syntax)
  • ✅ Reuses PostgreSQL infrastructure (backups, replication)
  • ❌ Slower than purpose-built systems (InfluxDB, Prometheus)
  • ❌ Must manage partitions manually if not using auto-drop

Concrete numbers (TimescaleDB benchmarks):

  • Compression ratio: 10-15x (worse than InfluxDB due to PostgreSQL overhead)
  • Write throughput: 500K metrics/sec per node
  • Query latency: 200-500ms for same 24-hour query

Prometheus – Pull-Based Metrics

Architecture: Prometheus scrapes metrics from applications (pull model), rather than having applications push to a central metric store (push model).

Storage: Custom TSDB (time-series database) in Prometheus:

  • In-memory: Last 2 hours of metrics in RAM
  • Long-term: Exports to external storage (S3, Thanos, Cortex, VictoriaMetrics)
  • Blocks: 2-hour blocks written to disk; each block contains ~1M time-series x 7200 samples (2 hours at 1-second granularity)

Concrete numbers (Prometheus community data):

  • In-memory footprint: ~1KB per active time-series (metric x labels)
  • Scrape interval: Typically 15 seconds (4 samples/min per series)
  • Block size: ~1 MB per 2-hour block (heavily compressed)
  • Query latency: 10-100ms for PromQL queries (simpler than SQL)
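
The numbers above imply a simple capacity estimate. A rough sketch, assuming ~1KB of RAM per active series and the 15-second scrape interval mentioned above:

```python
def prometheus_memory_gb(active_series, bytes_per_series=1024):
    """Rough RAM estimate: ~1 KB per active series (metric x label set)."""
    return active_series * bytes_per_series / 1024**3

def samples_per_block(scrape_interval_s=15, block_hours=2):
    """Samples one series contributes to a single 2-hour block."""
    return block_hours * 3600 // scrape_interval_s

print(prometheus_memory_gb(1_000_000))  # just under 1 GB for 1M active series
print(samples_per_block())              # 480 samples per series per block
```

These are back-of-the-envelope figures; actual usage depends on label churn and chunk encoding.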

Trade-off: Operationally simpler than InfluxDB, but only the last ~2 hours are hot in memory, and durable long-term storage (Thanos, Cortex) must be set up separately.

ClickHouse – Columnar Analytics

What it is: Columnar OLAP database optimized for analytics queries. Unlike row-oriented databases (PostgreSQL, MySQL) that store row by row, ClickHouse stores column by column.

Why columnar matters:

  • Query: “Find average CPU over 1 day” – only reads CPU column (not host, timestamp columns)
  • Row-oriented: Must load all columns for all rows, then filter
  • Columnar: Loads only CPU column -> 50-100x faster for this query

Real-world:

  • Cloudflare: Processes 100M+ HTTP request logs per second. ClickHouse ingests raw logs, aggregates into time-series, stores compressed in S3. Queries like “top 100 countries by request count” run in 100ms on 30-day history.
  • Compression: Dictionary encoding (repeated values compressed), delta encoding (for timestamps), LZ4 compression -> 5-10x total

Concrete numbers (Cloudflare analytics):

  • Ingest rate: 100M+ events/sec
  • Retention: 30 days hot (SSD), older -> S3
  • Query latency: 50-500ms for aggregations over billion-row tables
  • Compression: 5-10x

Graph Databases

Why Graphs > SQL Joins

SQL problem:

```sql
-- Find friends-of-friends-of-friends (3 levels deep)
SELECT DISTINCT u4.id
FROM users u1
JOIN friendships f1 ON u1.id = f1.user_id
JOIN users u2 ON f1.friend_id = u2.id
JOIN friendships f2 ON u2.id = f2.user_id
JOIN users u3 ON f2.friend_id = u3.id
JOIN friendships f3 ON u3.id = f3.user_id
JOIN users u4 ON f3.friend_id = u4.id
WHERE u1.id = 'alice'
  AND u4.id <> 'alice';
```

Result: three levels of chained JOINs, each multiplying index lookups and nested loops, O(n^3) in the worst case. The real query takes seconds.

Neo4j Cypher:

```cypher
MATCH (alice:User {name: "alice"})-[:FRIENDS_WITH*1..3]-(foaf)
WHERE foaf <> alice
RETURN DISTINCT foaf
```

Result: a direct graph traversal, O(k) where k is the number of edges actually touched (not the table size). The real query returns in milliseconds.

Graph Storage Model

Graphs store nodes and edges directly:

  • Node: ID, properties (name, email, etc.), outgoing edges
  • Edge: Source node ID, target node ID, edge type (FRIENDS_WITH, WORKS_AT), properties

Adjacency list representation (in-memory):

```python
node_alice = {
    "id": "user:123",
    "label": "User",
    "properties": {"name": "alice", "email": "alice@example.com"},
    "edges": {
        "FRIENDS_WITH": ["user:456", "user:789"],  # Fast traversal
        "WORKS_AT": ["company:999"],
    },
}
```
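
Traversal over such an adjacency list is just breadth-first search. A minimal sketch with a hypothetical five-node friendship graph:

```python
from collections import deque

# Toy adjacency list: node id -> neighbor ids (FRIENDS_WITH edges only)
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave":  ["bob", "carol", "erin"],
    "erin":  ["dave"],
}

def friends_within(graph, start, max_depth):
    """BFS up to max_depth hops; cost is O(edges traversed), no joins."""
    seen, result = {start}, set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                result.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return result

print(sorted(friends_within(graph, "alice", 3)))  # ['bob', 'carol', 'dave', 'erin']
```

This is the in-memory equivalent of the Cypher `*1..3` pattern: each hop follows stored edge pointers directly instead of probing an index per JOIN.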

Real Systems Using Graphs

  • LinkedIn Economic Graph: 1B+ profiles + relationships. Neo4j powers recommendations (“People you may know”).
  • Facebook Social Graph: 3B+ users + friendships. Custom graph engine (not Neo4j) for scale.
  • Fraud Detection: Banks use graphs to detect fraud rings. Query: “Find all connected accounts that share phone number or email” – linear in graph size, instant detection.
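
The fraud-ring query above amounts to clustering accounts that transitively share an attribute. A sketch with hypothetical account records, using a small union-find for the transitive grouping:

```python
from collections import defaultdict

# Hypothetical account records
accounts = [
    {"id": "A1", "phone": "555-0001", "email": "x@a.com"},
    {"id": "A2", "phone": "555-0002", "email": "x@a.com"},  # shares email with A1
    {"id": "A3", "phone": "555-0002", "email": "y@b.com"},  # shares phone with A2
    {"id": "A4", "phone": "555-0009", "email": "z@c.com"},
]

def connected_groups(accounts, keys=("phone", "email")):
    """Cluster accounts that transitively share any attribute value."""
    by_value = defaultdict(list)
    for acct in accounts:
        for key in keys:
            by_value[(key, acct[key])].append(acct["id"])

    # Union-find over account ids
    parent = {a["id"]: a["id"] for a in accounts}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for ids in by_value.values():
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])

    clusters = defaultdict(set)
    for a in accounts:
        clusters[find(a["id"])].add(a["id"])
    return sorted(sorted(c) for c in clusters.values() if len(c) > 1)

print(connected_groups(accounts))  # [['A1', 'A2', 'A3']]
```

The work is linear in accounts plus shared-attribute edges, which is why graph-shaped fraud checks stay fast as data grows.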

When to Use Graph

| Use Case | SQL Complexity | Graph Efficiency |
| --- | --- | --- |
| "Is Alice connected to Bob?" (distance <= 5) | 5 JOINs | Single traversal |
| "Who are Alice's mutual friends?" | 2 JOINs | Intersection of adjacency lists |
| "Recommend products Alice's friends bought" | 4 JOINs | Breadth-first search |

Vector Databases – Emerging Category

The Vector Problem

With LLMs and embeddings (e.g., OpenAI embeddings):

  • Text -> fixed-size vector (1536 dimensions)
  • “Find similar documents to query” = find vectors closest in space

Naive approach: Store vectors in PostgreSQL, compute distance to all N vectors -> O(N). Vector DB: Use spatial indexing (HNSW, IVF) -> O(log N) or O(sqrt N).

HNSW Indexing (Hierarchical Navigable Small World)

Idea: Build multi-level graph where nodes (vectors) are connected to nearby neighbors. Query starts at top level, jumps down to nearest neighbor, repeat. Mimics graph navigation.

Concrete numbers:

  • 1M vectors (1536 dimensions each)
  • HNSW index: ~8GB in memory (~6KB per raw float32 vector plus ~2KB of graph pointers)
  • Query latency: 1-10ms for top-10 nearest neighbors

Real systems:

  • Pinecone: Fully managed vector DB, HNSW + GPU-accelerated search
  • pgvector (PostgreSQL extension): IVFFlat and HNSW indexing inside PostgreSQL
  • Weaviate: Open-source vector DB with HNSW

A brute-force baseline, i.e., what a vector DB replaces with an approximate index:

```python
class SimpleVectorDB:
    def __init__(self, dimension=1536):
        self.vectors = {}  # doc_id -> vector
        self.dimension = dimension

    def insert(self, doc_id, vector):
        """Add vector for document."""
        self.vectors[doc_id] = vector

    def search(self, query_vector, k=10):
        """Find k nearest neighbors using brute force (O(N))."""
        distances = []
        for doc_id, vector in self.vectors.items():
            # Cosine similarity (dot product for normalized vectors)
            sim = sum(a * b for a, b in zip(query_vector, vector))
            distances.append((doc_id, sim))

        # Sort by similarity, descending
        distances.sort(key=lambda x: x[1], reverse=True)
        return distances[:k]
```

Production HNSW details:

  • Hierarchical layers (0, 1, 2, …) where each higher layer is sparser
  • Layer 0: contains every vector (densest, most accurate)
  • Layer 1: roughly 1 in 2 nodes
  • Layer 2: roughly 1 in 4 nodes (sparsest, fastest to traverse)
  • Query: start at the top layer, greedy nearest-neighbor descent -> layer 1 -> layer 0
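
The "1 in 2, 1 in 4" layer structure falls out of HNSW's standard layer-assignment rule. A sketch with the parameter mL chosen so each successive layer keeps about half the nodes:

```python
import math, random

def assign_layer(m_l=1 / math.log(2), rng=random.random):
    """HNSW layer assignment: layer = floor(-ln(U) * mL), U uniform in (0, 1).
    With mL = 1/ln(2), each node reaches the next layer up with probability 1/2."""
    return int(-math.log(rng()) * m_l)

random.seed(0)
layers = [assign_layer() for _ in range(100_000)]
share_layer0 = sum(l == 0 for l in layers) / len(layers)
print(round(share_layer0, 2))  # close to 0.5: about half the nodes live only in layer 0
```

A node assigned layer L is inserted into layers 0 through L, so the top layers stay sparse and the greedy descent starts with only a handful of candidates.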

Search Engines (Full-Text)

Elasticsearch – Distributed Lucene

What it is: Distributed wrapper around Lucene (Java full-text indexing library). Scales search across multiple nodes.

Inverted index structure:

```
Term        Postings list (compressed, variable-byte encoding)
"machine"   [1, 5, 7, 42, 100, ...]   (document IDs)
"learning"  [1, 2, 4, 5, 8, ...]
```

Query execution: Search “machine learning” -> find intersection of postings lists for “machine” AND “learning” -> return doc IDs 1, 5.
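
The intersection step can be sketched as a linear merge over the two sorted postings lists, the same idea Lucene uses (real engines add skip pointers and compression):

```python
def intersect_postings(a, b):
    """Merge-intersect two sorted doc-id lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

machine = [1, 5, 7, 42, 100]
learning = [1, 2, 4, 5, 8]
print(intersect_postings(machine, learning))  # [1, 5]
```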

Real systems:

  • Search queries: Web search, documentation search
  • Log analysis: ELK stack (Elasticsearch, Logstash, Kibana) – parse logs, index, query
  • Metrics: Elasticsearch also used for analytics dashboards

Concrete numbers (typical Elasticsearch cluster):

  • Index size: 1 billion documents ~ 500 GB (depends on document size + compression)
  • Query latency: 10-100ms for full-text search
  • Indexing rate: 100K docs/sec per node

OLAP / Data Warehouse

Row vs Column Storage

Row-oriented (PostgreSQL, MySQL):

```
[user:1, alice, 100]
[user:2, bob,   200]
[user:3, carol, 150]
```

Query “SELECT SUM(balance) FROM users” must load all 3 columns for all rows.

Column-oriented (ClickHouse, BigQuery, Redshift):

```
user:    [1, 2, 3]
name:    [alice, bob, carol]
balance: [100, 200, 150]
```

Same query only loads balance column -> 3x faster (1/3 data read).
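
The same contrast in Python, with the toy three-row table from above. Both layouts return the same sum; the difference is how much data each one has to touch:

```python
# Row-oriented: each record carries every column
rows = [
    {"user": 1, "name": "alice", "balance": 100},
    {"user": 2, "name": "bob",   "balance": 200},
    {"user": 3, "name": "carol", "balance": 150},
]

# Column-oriented: one contiguous array per column
columns = {
    "user":    [1, 2, 3],
    "name":    ["alice", "bob", "carol"],
    "balance": [100, 200, 150],
}

# SELECT SUM(balance): the row store must read every field of every row...
row_sum = sum(r["balance"] for r in rows)
# ...while the column store scans a single contiguous array
col_sum = sum(columns["balance"])

print(row_sum, col_sum)  # 450 450
```

At billions of rows, the contiguous column array also compresses far better and streams sequentially from disk, which is where the real speedup comes from.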

Concrete Numbers (BigQuery, Google’s OLAP)

BigQuery analytics:

  • Data scale: Petabytes (1000+ TB per project)
  • Query latency: 1-10 seconds for billion-row aggregations
  • Cost: Charged per GB scanned (not per query)
  • Compression: 10-50x (highly compressible analytical data)

Use case: “How many purchases happened last week, grouped by country?” -> scans purchase table (5TB), groups by country column, returns 200-row result in 2 seconds.


Summary – Specialized Database Decision

| Data Type | Engine Category | Best Systems | Characteristic |
| --- | --- | --- | --- |
| Time-series metrics | Time-series DB | InfluxDB, Prometheus, TimescaleDB | Timestamp-ordered, auto-TTL, compression |
| Graph relationships | Graph DB | Neo4j, Amazon Neptune, TigerGraph | Traversal queries, millions of edges |
| Full-text search | Search engine | Elasticsearch, Solr, Meilisearch | Inverted index, term relevance, Boolean queries |
| Vector similarity | Vector DB | Pinecone, pgvector, Weaviate | HNSW indexing, embeddings, semantic search |
| Analytics / OLAP | Data warehouse | BigQuery, Redshift, ClickHouse | Columnar storage, billion-row aggregations |
| Relational + scale | NewSQL | Spanner, CockroachDB | SQL + horizontal partitioning, strong consistency |

When to Use / Avoid

✅ Use Time-Series DB When

  • Append-only workload – data rarely updated or deleted
  • High-volume metrics – 1M+ series across millions of data points
  • Automatic TTL needed – old data should expire
  • Time-range queries are primary – “stats over last hour/day/week”
  • Compression is critical – disk I/O is a bottleneck

❌ Avoid When

  • Complex queries across non-time dimensions – use ClickHouse
  • Need complex transactions – use PostgreSQL

✅ Use Graph DB When

  • Relationships are first-class – traversal speed matters
  • Queries are “find connected nodes” – fraud detection, recommendations
  • Depth of traversal is constant (not scaling with data)

❌ Avoid When

  • No relationship queries – use relational DB
  • Graph is too large to fit in memory – distributed graphs have latency

✅ Use Vector DB When

  • Similarity search on embeddings – LLM semantic search
  • High-dimensional data (1000+ dimensions)
  • Need approximate nearest neighbors (exact is too slow)

❌ Avoid When

  • Exact search only – use relational DB with B+ tree index
  • Data is low-dimensional (< 10 dimensions)

✅ Use OLAP When

  • Analytical queries on historical data – aggregations, reporting
  • Data is rarely updated – append-only or batch updates
  • Queries touch small % of rows/columns – columnar compression helps

❌ Avoid When

  • Real-time transactional updates – use OLTP (PostgreSQL, MySQL)
  • Small dataset (< 1GB) – overhead not worth it


This post is licensed under CC BY 4.0 by the author.