Inverted Index

Map term → list of documents containing it — the core data structure powering every full-text search engine.


Posting List Detail

Each entry in the posting list stores more than doc ID:

"scale" → [(Doc1, pos:[3], tf:1, score:0.8),
            (Doc2, pos:[2], tf:1, score:0.6)]
  • tf = term frequency in doc
  • pos = position for phrase queries (“distributed systems”)
  • score = relevance (TF-IDF or BM25)

Key Properties

| Property | Value |
|----------|-------|
| Build time | O(N × L) — N docs, L avg length |
| Query time | O(log V + k) — V vocab size, k results |
| Space | O(N × L) — proportional to total terms |
| Updates | Expensive (re-index) — batched in segments |
| Phrase queries | Needs positional index |

When to Use ✅ / Avoid ❌

✅ Full-text search across large document collections
✅ Faceted search, tag filtering
✅ Log search and aggregation
❌ Exact key-value lookup (use hash index)
❌ Numeric range queries (use B-Tree index)
❌ Frequently updating documents (re-indexing cost)

Real Systems

| System | How Used |
|--------|----------|
| Elasticsearch / OpenSearch | Core index via Lucene |
| Apache Solr | Lucene-based inverted index |
| PostgreSQL | tsvector / tsquery full-text |
| SQLite FTS5 | Built-in full-text search |
| Splunk | Log indexing and search |

Implementation

import re
from collections import defaultdict
from typing import List, Dict, Set
from dataclasses import dataclass
from math import log

@dataclass
class Posting:
    """A single entry in a posting list — one document mentioning a term."""
    doc_id: int
    term_frequency: int  # How many times this term appears in this doc
    positions: List[int]  # Positions of each occurrence (for phrase queries)

class InvertedIndex:
    """
    Core full-text search data structure: term → list of documents.
    Supports AND/OR queries and TF-IDF relevance ranking.
    """

    def __init__(self):
        """Initialize: index maps term → sorted list of Posting objects."""
        self.index: Dict[str, List[Posting]] = defaultdict(list)
        self.documents: Dict[int, str] = {}  # Store raw doc text for reference
        self.stopwords = {'the', 'a', 'an', 'and', 'or', 'is', 'in', 'at', 'to', 'for'}

    def _tokenize(self, text: str) -> List[str]:
        """
        Tokenize and normalize text: lowercase, remove punctuation, filter stopwords.
        Why: Enables case-insensitive search and reduces noise from common words.
        """
        # Convert to lowercase
        text = text.lower()
        # Split on non-alphanumeric (remove punctuation)
        tokens = re.findall(r'\b\w+\b', text)
        # Filter out stopwords
        return [t for t in tokens if t not in self.stopwords]

    def add_document(self, doc_id: int, text: str) -> None:
        """
        Index a new document: tokenize, track positions, build posting entries.
        Time: O(L) where L = document length.
        Why: Each token is processed once; position tracking enables phrase queries.
        """
        self.documents[doc_id] = text
        # Full tokenization WITH positions for phrase queries
        text_lower = text.lower()
        all_tokens = re.finditer(r'\b\w+\b', text_lower)

        # Track term frequency and positions
        term_positions: Dict[str, List[int]] = defaultdict(list)
        position = 0
        for match in all_tokens:
            token = match.group()
            if token not in self.stopwords:
                term_positions[token].append(position)
            position += 1

        # Build posting entries
        for term, positions in term_positions.items():
            # Check if this term already has postings for this doc
            existing = None
            for posting in self.index[term]:
                if posting.doc_id == doc_id:
                    existing = posting
                    break

            if existing:
                # Update existing posting (shouldn't happen if indexing each doc once)
                existing.term_frequency += len(positions)
                existing.positions.extend(positions)
            else:
                # Create new posting for this (term, doc) pair
                posting = Posting(
                    doc_id=doc_id,
                    term_frequency=len(positions),
                    positions=positions
                )
                self.index[term].append(posting)

        # Keep the touched posting lists sorted by doc_id (enables fast intersection);
        # only terms from this document need re-sorting, not the whole index
        for term in term_positions:
            self.index[term].sort(key=lambda p: p.doc_id)

    def search_and(self, terms: List[str]) -> List[int]:
        """
        Intersection query: return docs containing ALL terms.
        Algorithm: set intersection, starting from the smallest posting list.
        Time: O(total size of the queried posting lists).
        Why: Intersecting smallest-first keeps the candidate set minimal.
        Production engines use a sorted merge over posting lists instead,
        which avoids materializing the doc-ID sets.
        """
        if not terms:
            return []

        # Normalize terms
        terms = [t.lower() for t in terms if t.lower() not in self.stopwords]
        if not terms:
            return []

        # Get posting lists, sorted by size (start with smallest)
        posting_lists = []
        for term in terms:
            if term not in self.index:
                return []  # Missing term means no intersection
            posting_lists.append(self.index[term])

        # Sort by list length: smaller lists first for faster intersection
        posting_lists.sort(key=len)

        # Intersect doc-ID sets, smallest list first
        result_docs = {p.doc_id for p in posting_lists[0]}
        for posting_list in posting_lists[1:]:
            candidates = {p.doc_id for p in posting_list}
            result_docs &= candidates  # Set intersection

        return sorted(result_docs)

    def search_or(self, terms: List[str]) -> List[int]:
        """
        Union query: return docs containing ANY term.
        Algorithm: union of posting lists.
        Time: O(sum of posting list lengths).
        """
        if not terms:
            return []

        terms = [t.lower() for t in terms if t.lower() not in self.stopwords]
        if not terms:
            return []

        result_docs = set()
        for term in terms:
            if term in self.index:
                for posting in self.index[term]:
                    result_docs.add(posting.doc_id)

        return sorted(result_docs)

    def tf_idf(self, term: str, doc_id: int, total_docs: int) -> float:
        """
        Compute TF-IDF score for relevance ranking.
        TF-IDF = Term Frequency × Inverse Document Frequency
        Why: Boosts relevance of rare, specific terms; down-weights common terms.
        """
        # Find posting for this (term, doc) pair
        posting = None
        if term in self.index:
            for p in self.index[term]:
                if p.doc_id == doc_id:
                    posting = p
                    break

        if not posting:
            return 0.0

        # Term Frequency: count of term in document
        tf = posting.term_frequency

        # Inverse Document Frequency: rarity across all docs
        # log(total_docs / (1 + docs_containing_term)) — the +1 smoothing
        # avoids division by zero for terms missing from the index
        docs_with_term = len(self.index[term])
        idf = log(total_docs / (1 + docs_with_term))

        return tf * idf

    def __repr__(self) -> str:
        """Pretty-print index structure for debugging."""
        result = ["Inverted Index:"]
        for term in sorted(self.index.keys()):
            doc_ids = [str(p.doc_id) for p in self.index[term]]
            result.append(f"  '{term}' → docs {doc_ids}")
        return "\n".join(result)

# Example usage
if __name__ == "__main__":
    idx = InvertedIndex()

    # Index 3 documents
    idx.add_document(1, "the cat sat on the mat")
    idx.add_document(2, "the dog sat on the mat")
    idx.add_document(3, "cat chased dog")

    print(idx)
    print()

    # AND query: docs with both "cat" and "dog"
    result = idx.search_and(["cat", "dog"])
    print(f"AND query ('cat' AND 'dog'): {result}")  # [3]

    # OR query: docs with "cat" or "sat"
    result = idx.search_or(["cat", "sat"])
    print(f"OR query ('cat' OR 'sat'): {result}")  # [1, 2, 3]

    # TF-IDF scoring
    score = idx.tf_idf("sat", 1, total_docs=3)
    print(f"TF-IDF('sat', doc=1): {score:.3f}")
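The position lists stored in each Posting are what the "Needs positional index" property above refers to. A standalone sketch of the adjacency check behind phrase queries (`phrase_match` is a hypothetical helper, not a method of the class above):

```python
from typing import List

def phrase_match(positions_by_term: List[List[int]]) -> bool:
    """True if the phrase terms occur at consecutive positions in one doc.

    positions_by_term[i] is the position list of the i-th phrase term
    within a single document (as stored in Posting.positions).
    """
    first, rest = positions_by_term[0], positions_by_term[1:]
    for start in first:
        # Each following term must appear exactly one position later.
        if all(start + i + 1 in positions for i, positions in enumerate(rest)):
            return True
    return False

# "cat sat" in "the cat sat on the mat": cat at position 1, sat at position 2.
print(phrase_match([[1], [2]]))  # True
print(phrase_match([[1], [3]]))  # False — both terms present, but not adjacent
```

One caveat: the index above assigns positions to stopword slots too, so a phrase that spans a filtered stopword ("sat [on the] mat") would not pass this simple adjacency check.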

Algorithm Flow: Index Build + Query


Elasticsearch (Apache Lucene Core)

Elasticsearch batches all writes into in-memory buffers. When a buffer fills (~30MB by default), Lucene flushes it to an immutable segment — a self-contained inverted index on disk. Each segment stores its own vocabulary, posting lists, and metadata. To keep query latency low, Lucene periodically merges smaller segments into larger ones (similar to LSM tree compaction). Posting lists use Frame-of-Reference (FOR) delta encoding and roaring bitmaps for compression: a list of 1M doc IDs takes ~2MB instead of 8MB raw. Each field type has its own analyzer chain (character filters → tokenizers → token filters), enabling language-specific indexing.
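To make the compression idea concrete, here is a minimal sketch of gap (delta) encoding plus variable-length integers for a sorted posting list. This is an illustration of the principle only — Lucene's actual Frame-of-Reference encoding packs fixed-width blocks and is far more optimized:

```python
from typing import List

def delta_encode(doc_ids: List[int]) -> List[int]:
    """Store gaps between sorted doc IDs instead of absolute values.
    Dense posting lists produce small gaps, which encode in few bytes."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def varint_encode(n: int) -> bytes:
    """Encode a non-negative int in 7-bit groups; high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def compress_postings(doc_ids: List[int]) -> bytes:
    return b"".join(varint_encode(g) for g in delta_encode(doc_ids))

# Three doc IDs that would take 24 bytes as raw 64-bit ints fit in 3 bytes.
print(compress_postings([100, 102, 110]))  # b'\x64\x02\x08'
```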

PostgreSQL Full-Text Search (GIN Index)

PostgreSQL’s tsvector type pre-computes document tokens at index time. A GIN (Generalized Inverted Index) maps each lexeme to a sorted posting list of heap tuple IDs (row addresses). A query such as to_tsquery('cat & dog') compiles to a posting-list intersection at query time. No separate search server needed — useful for small-to-medium datasets (<10M docs) where Elasticsearch’s operational overhead isn’t justified. Each row update re-indexes that row’s lexemes in the GIN, so it scales to moderate corpus sizes but not billions of documents.

Apache Solr

Solr shares Lucene’s core but adds orchestration for distributed search. SolrCloud partitions the index across multiple nodes; each shard is a complete Lucene inverted index. Solr’s distinguishing strength is faceted search: while fetching results, Solr also counts hits per category (color, price range, brand). This is achieved via DocValues — a column-oriented store alongside the inverted index. Commonly used in e-commerce product catalogs where faceted filtering (“show red shoes under $100”) is critical for user experience.
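The facet-counting idea can be sketched as a columnar lookup over the matched doc IDs. The `brand_column` store and its values below are made up for illustration; real DocValues are a compressed on-disk columnar format, not a Python dict:

```python
from collections import Counter

# Hypothetical columnar store: doc_id -> value of one field ("brand").
brand_column = {1: "nike", 2: "adidas", 3: "nike", 4: "puma"}

# Doc IDs matched by the inverted-index query (e.g. "red shoes under $100").
matched_docs = [1, 3, 4]

# Facet counting is a single columnar scan over only the matched docs —
# no need to touch the inverted index again.
facet_counts = Counter(brand_column[d] for d in matched_docs)
print(facet_counts.most_common())  # [('nike', 2), ('puma', 1)]
```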

Splunk

Splunk indexes raw machine logs in time-based buckets (e.g., one bucket per day). Within each bucket, an inverted index maps keywords to line offsets in the raw log files. A critical optimization: Bloom filters per bucket allow Splunk to skip entire time ranges that don’t contain a searched term. This enables queries like “find this error string in 90 days of logs” to complete in seconds (scanning terabytes would take hours without filters). Splunk is optimized for ad-hoc search, not pre-aggregated analytics.
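A per-bucket Bloom filter can be sketched in a few lines. The bit-array size and hash scheme below are arbitrary choices for illustration, not Splunk's actual implementation:

```python
import hashlib

class BloomFilter:
    """Probabilistic set: 'definitely absent' or 'maybe present'."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one big int

    def _positions(self, term: str):
        # Derive k positions by salting the term; SHA-256 here is just a
        # convenient stand-in for a fast non-cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, term: str) -> None:
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term: str) -> bool:
        # All k bits set -> maybe present (false positives possible).
        # Any bit clear -> definitely absent, so the bucket can be skipped.
        return all(self.bits >> pos & 1 for pos in self._positions(term))

bucket_filter = BloomFilter()
bucket_filter.add("connection_refused")
print(bucket_filter.might_contain("connection_refused"))  # True
```

A search over 90 days then checks 90 tiny filters first and only opens the buckets whose filter says "maybe present".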

SQLite FTS5

SQLite’s Full-Text Search module embeds an inverted index engine inside SQLite — zero server, zero configuration. Stores the index in auxiliary virtual tables alongside regular data. Supports MATCH 'term1 AND term2', prefix queries, phrase queries, BM25 ranking, and snippet() function for result highlighting. Used by iOS Spotlight (device-side search), Android app search, and offline-capable mobile/desktop apps. Performance is sufficient for millions of documents on a single device.
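Because Python bundles SQLite, FTS5 is easy to try directly. This sketch assumes your SQLite library was compiled with FTS5 (the builds bundled with CPython on major platforms include it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [("the cat sat on the mat",),
     ("the dog sat on the mat",),
     ("cat chased dog",)],
)

# Boolean MATCH query with BM25 ranking and snippet() highlighting:
# snippet(table, column, start_mark, end_mark, ellipsis, max_tokens).
rows = conn.execute(
    "SELECT rowid, snippet(docs, 0, '[', ']', '...', 8) "
    "FROM docs WHERE docs MATCH 'cat AND dog' "
    "ORDER BY bm25(docs)"
).fetchall()
print(rows)  # only doc 3 contains both terms; matches wrapped in [brackets]
```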


This post is licensed under CC BY 4.0 by the author.