Inverted Index
Map term → list of documents containing it: the core data structure powering every full-text search engine.
Posting List Detail
Each entry in the posting list stores more than doc ID:
```
"scale" → [(Doc1, pos:[3], tf:1, score:0.8),
           (Doc2, pos:[2], tf:1, score:0.6)]
```
- tf = term frequency in the doc
- pos = token positions, needed for phrase queries ("distributed systems")
- score = relevance (TF-IDF or BM25)
Key Properties
| Property | Value |
|---|---|
| Build time | O(N × L) — N docs, L avg length |
| Query time | O(log V + k) — V vocab size, k results |
| Space | O(N × L) — proportional to total terms |
| Updates | Expensive (re-index) — batched in segments |
| Phrase queries | Needs positional index |
When to Use ✅ / Avoid ❌
✅ Full-text search across large document collections
✅ Faceted search, tag filtering
✅ Log search and aggregation
❌ Exact key-value lookup (use a hash index)
❌ Numeric range queries (use a B-Tree index)
❌ Frequently updated documents (re-indexing cost)
Real Systems
| System | How Used |
|---|---|
| Elasticsearch / OpenSearch | Core index via Lucene |
| Apache Solr | Lucene-based inverted index |
| PostgreSQL | tsvector / tsquery full-text |
| SQLite FTS5 | Built-in full-text search |
| Splunk | Log indexing and search |
Implementation
```python
import re
from collections import defaultdict
from typing import Dict, List
from dataclasses import dataclass
from math import log


@dataclass
class Posting:
    """A single entry in a posting list — one document mentioning a term."""
    doc_id: int
    term_frequency: int   # How many times this term appears in this doc
    positions: List[int]  # Positions of each occurrence (for phrase queries)


class InvertedIndex:
    """
    Core full-text search data structure: term → list of documents.
    Supports AND/OR queries and TF-IDF relevance ranking.
    """

    def __init__(self):
        """Initialize: index maps term → sorted list of Posting objects."""
        self.index: Dict[str, List[Posting]] = defaultdict(list)
        self.documents: Dict[int, str] = {}  # Store raw doc text for reference
        self.stopwords = {'the', 'a', 'an', 'and', 'or', 'is', 'in', 'at', 'to', 'for'}

    def _tokenize(self, text: str) -> List[str]:
        """
        Tokenize and normalize text: lowercase, strip punctuation, filter stopwords.
        Why: enables case-insensitive search and reduces noise from common words.
        """
        tokens = re.findall(r'\b\w+\b', text.lower())
        return [t for t in tokens if t not in self.stopwords]

    def add_document(self, doc_id: int, text: str) -> None:
        """
        Index a new document: tokenize, track positions, build posting entries.
        Time: O(L) where L = document length.
        Why: each token is processed once; position tracking enables phrase queries.
        """
        self.documents[doc_id] = text

        # Tokenize WITH positions for phrase queries
        term_positions: Dict[str, List[int]] = defaultdict(list)
        position = 0
        for match in re.finditer(r'\b\w+\b', text.lower()):
            token = match.group()
            if token not in self.stopwords:
                term_positions[token].append(position)
            position += 1

        # Build posting entries
        for term, positions in term_positions.items():
            # Check if this term already has a posting for this doc
            existing = next((p for p in self.index[term] if p.doc_id == doc_id), None)
            if existing:
                # Only happens if the same doc is indexed twice
                existing.term_frequency += len(positions)
                existing.positions.extend(positions)
            else:
                self.index[term].append(Posting(
                    doc_id=doc_id,
                    term_frequency=len(positions),
                    positions=positions,
                ))

        # Keep posting lists sorted by doc_id (enables merge-style intersection)
        for term in term_positions:
            self.index[term].sort(key=lambda p: p.doc_id)

    def search_and(self, terms: List[str]) -> List[int]:
        """
        Intersection query: return docs containing ALL terms.
        Algorithm: intersect doc-ID sets, starting from the smallest posting list.
        Why: posting lists are sorted by doc_id, so production engines do a
        merge-style intersection with parallel pointers in O(n + m); the
        set-based version below keeps the example short.
        """
        if not terms:
            return []
        # Normalize terms
        terms = [t.lower() for t in terms if t.lower() not in self.stopwords]
        if not terms:
            return []

        posting_lists = []
        for term in terms:
            if term not in self.index:
                return []  # A missing term means an empty intersection
            posting_lists.append(self.index[term])

        # Smallest list first: the intersection can only shrink
        posting_lists.sort(key=len)
        result_docs = {p.doc_id for p in posting_lists[0]}
        for posting_list in posting_lists[1:]:
            result_docs &= {p.doc_id for p in posting_list}
        return sorted(result_docs)

    def search_or(self, terms: List[str]) -> List[int]:
        """
        Union query: return docs containing ANY term.
        Time: O(sum of posting list lengths).
        """
        if not terms:
            return []
        terms = [t.lower() for t in terms if t.lower() not in self.stopwords]
        result_docs = set()
        for term in terms:
            for posting in self.index.get(term, []):
                result_docs.add(posting.doc_id)
        return sorted(result_docs)

    def tf_idf(self, term: str, doc_id: int, total_docs: int) -> float:
        """
        Compute TF-IDF score for relevance ranking.
        TF-IDF = Term Frequency × Inverse Document Frequency.
        Why: boosts rare, specific terms; down-weights common ones.
        """
        posting = next((p for p in self.index.get(term, []) if p.doc_id == doc_id), None)
        if posting is None:
            return 0.0
        # Term Frequency: count of term in document
        tf = posting.term_frequency
        # Inverse Document Frequency: rarity across all docs,
        # with +1 smoothing in the denominator to avoid division by zero
        docs_with_term = len(self.index[term])
        idf = log(total_docs / (1 + docs_with_term))
        return tf * idf

    def __repr__(self) -> str:
        """Pretty-print the index structure for debugging."""
        lines = ["Inverted Index:"]
        for term in sorted(self.index):
            doc_ids = [str(p.doc_id) for p in self.index[term]]
            lines.append(f"  '{term}' → docs {doc_ids}")
        return "\n".join(lines)


# Example usage
if __name__ == "__main__":
    idx = InvertedIndex()

    # Index 3 documents
    idx.add_document(1, "the cat sat on the mat")
    idx.add_document(2, "the dog sat on the mat")
    idx.add_document(3, "cat chased dog")
    print(idx)
    print()

    # AND query: docs with both "cat" and "dog"
    print(f"AND query ('cat' AND 'dog'): {idx.search_and(['cat', 'dog'])}")  # [3]

    # OR query: docs with "cat" or "sat"
    print(f"OR query ('cat' OR 'sat'): {idx.search_or(['cat', 'sat'])}")  # [1, 2, 3]

    # TF-IDF scoring
    print(f"TF-IDF('sat', doc=1): {idx.tf_idf('sat', 1, total_docs=3):.3f}")
```
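The positions stored in each posting also enable phrase queries: two terms form a phrase in a document when some occurrence of the second term sits exactly one position after an occurrence of the first. A minimal standalone sketch (the `phrase_match` helper is illustrative, not part of the class above):

```python
from typing import List

def phrase_match(positions_a: List[int], positions_b: List[int]) -> bool:
    """True if some occurrence of term B appears exactly one position after term A."""
    next_positions = set(positions_b)
    return any(p + 1 in next_positions for p in positions_a)

# "distributed" at positions [4, 9], "systems" at [5, 20]:
# 4 + 1 = 5 is a position of "systems", so the phrase occurs.
print(phrase_match([4, 9], [5, 20]))  # True
print(phrase_match([4, 9], [7, 20]))  # False
```

A full phrase search would first intersect the posting lists (as in `search_and`), then run this positional check per candidate document, chaining it across each adjacent word pair for phrases longer than two terms.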
Algorithm Flow: Index Build + Query
Elasticsearch (Apache Lucene Core)
Elasticsearch batches all writes into in-memory buffers. When a buffer fills (~30MB by default), Lucene flushes it to an immutable segment — a self-contained inverted index on disk. Each segment stores its own vocabulary, posting lists, and metadata. To keep query latency low, Lucene periodically merges smaller segments into larger ones (similar to LSM tree compaction). Posting lists use Frame-of-Reference (FOR) delta encoding and roaring bitmaps for compression: a list of 1M doc IDs takes ~2MB instead of 8MB raw. Each field type has its own analyzer chain (character filters → tokenizers → token filters), enabling language-specific indexing.
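Lucene's actual codec is considerably more involved (block-packed encoding, skip lists), but the core idea behind delta encoding is simple enough to sketch: because posting lists are sorted, each doc ID can be stored as the gap from its predecessor, and gaps are small numbers that pack into far fewer bits than raw 64-bit IDs.

```python
from typing import List

def delta_encode(doc_ids: List[int]) -> List[int]:
    """Store each sorted doc ID as the gap from its predecessor (non-empty input)."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps: List[int]) -> List[int]:
    """Reverse: a running sum restores the original doc IDs."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

ids = [100, 103, 104, 190, 200]
gaps = delta_encode(ids)
print(gaps)  # [100, 3, 1, 86, 10]
assert delta_decode(gaps) == ids
```

Frame-of-Reference encoding then packs blocks of such gaps using only as many bits as the largest gap in the block requires, which is where the ~4x savings quoted above comes from.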
PostgreSQL Full-Text Search (GIN Index)
PostgreSQL’s tsvector type pre-computes document tokens at index time. A GIN (Generalized Inverted iNdex) index maps each lexeme to a sorted posting list of heap tuple IDs (row addresses). A query such as to_tsquery('cat & dog') compiles to a posting list intersection at query time. No separate search server is needed — useful for small-to-medium datasets (<10M docs) where Elasticsearch overhead isn’t justified. Updated rows must be re-indexed in the GIN, so it scales to moderate corpus sizes but not to billions of documents.
Apache Solr
Solr shares Lucene’s core but adds orchestration for distributed search. SolrCloud partitions the index across multiple nodes; each shard is a complete Lucene inverted index. Solr’s distinguishing strength is faceted search: while fetching results, Solr also counts hits per category (color, price range, brand). This is achieved via DocValues — a column-oriented store alongside the inverted index. Commonly used in e-commerce product catalogs where faceted filtering (“show red shoes under $100”) is critical for user experience.
Splunk — Machine Data / Log Search
Splunk indexes raw machine logs in time-based buckets (e.g., one bucket per day). Within each bucket, an inverted index maps keywords to line offsets in the raw log files. A critical optimization: Bloom filters per bucket allow Splunk to skip entire time ranges that don’t contain a searched term. This enables queries like “find this error string in 90 days of logs” to complete in seconds (scanning terabytes would take hours without filters). Splunk is optimized for ad-hoc search, not pre-aggregated analytics.
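The bucket-skipping idea can be illustrated with a toy Bloom filter — a sketch only, not Splunk's implementation (the class name, sizes, and hashing scheme below are all illustrative). A Bloom filter may report false positives but never false negatives, so a "no" answer is a safe reason to skip a whole time bucket:

```python
from hashlib import sha256

class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # A Python int doubles as an arbitrary-size bit array

    def _positions(self, term: str):
        # Derive k independent bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            h = sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, term: str) -> None:
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

# One filter per time bucket: if might_contain() returns False,
# the entire bucket can be skipped without touching disk.
bucket_filter = BloomFilter()
for word in ["error", "timeout", "oom"]:
    bucket_filter.add(word)
print(bucket_filter.might_contain("error"))    # True
print(bucket_filter.might_contain("warning"))  # almost certainly False
```

With one small filter per bucket held in memory, a 90-day search touches only the buckets whose filters answer "maybe", which is why rare-term searches stay fast even over terabytes of logs.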
SQLite FTS5
SQLite’s Full-Text Search module embeds an inverted index engine inside SQLite — zero server, zero configuration. Stores the index in auxiliary virtual tables alongside regular data. Supports MATCH 'term1 AND term2', prefix queries, phrase queries, BM25 ranking, and snippet() function for result highlighting. Used by iOS Spotlight (device-side search), Android app search, and offline-capable mobile/desktop apps. Performance is sufficient for millions of documents on a single device.
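A minimal end-to-end example using Python's built-in sqlite3 module — this assumes the underlying SQLite library was compiled with FTS5 enabled, which is true of most modern builds including the one bundled with CPython:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: the inverted index lives in auxiliary shadow tables
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [("the cat sat on the mat",),
     ("the dog sat on the mat",),
     ("cat chased dog",)],
)

# Boolean query with BM25 ranking (lower bm25() score = more relevant)
rows = conn.execute(
    "SELECT rowid, body FROM docs WHERE docs MATCH 'cat AND dog' "
    "ORDER BY bm25(docs)"
).fetchall()
print(rows)  # [(3, 'cat chased dog')]
```

This is the whole deployment story: no server process, no configuration, just a virtual table in the same database file as the rest of the application's data.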
References
- 📄 Manning, Raghavan & Schütze — Introduction to Information Retrieval — the definitive textbook, freely available online with chapters on posting lists and ranking
- 🎥 ByteByteGo — Search Engine Architecture — systems-level walkthrough of how inverted indices power search at scale
- 🔗 Lucene Codec Documentation — detailed internals of how Elasticsearch/Solr store and compress posting lists
- 🔗 PostgreSQL Full-Text Search Documentation — GIN index mechanics and tsvector/tsquery API reference
- 📖 Wikipedia: Inverted Index — posting list variants, compression techniques, and boolean retrieval models