Merkle Tree
A tree of hashes where each parent = hash(left child + right child) -- detect exactly which data blocks differ between two systems in O(log n).
Key Properties
| Property | Value |
|---|---|
| Detect any change | O(1) — compare root hashes |
| Locate changed block | O(log n) — traverse tree |
| Data integrity proof | O(log n) — send sibling hashes |
| Space overhead | O(n) additional hash nodes |
Highlights
- Tamper detection: Change one bit -> completely different root hash
- Efficient sync: Only transmit hashes for mismatched subtrees
- Sibling path: Prove single block integrity with O(log n) hashes
- Immutable proof: Root hash commits to all leaf data
When to Use / Avoid
- Use to efficiently sync large datasets between nodes
- Use to verify data integrity without transferring all data
- Use to detect tampering (blockchain, content-addressed storage)
- Avoid for single-node systems (overhead not justified)
- Avoid for frequently updated single values (hash recomputation cost)
Implementation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
import hashlib
class MerkleTree:
"""
Simple Merkle Tree using SHA-256.
- Build from leaf data chunks
- Verify inclusion proofs
- Generate sibling path for O(log n) verification
"""
def __init__(self, data_blocks):
"""Initialize tree from data blocks."""
self.data_blocks = data_blocks
self.tree = [] # List of levels: tree[0] = leaves, tree[-1] = root
self.build_tree()
def _hash(self, data):
"""Hash a data block or pair of hashes."""
if isinstance(data, str):
data = data.encode()
return hashlib.sha256(data).hexdigest()
def build_tree(self):
"""Build Merkle tree bottom-up from leaves to root."""
if not self.data_blocks:
return
# Level 0: Hash each data block (leaves)
current_level = [self._hash(block) for block in self.data_blocks]
self.tree.append(current_level[:])
# Build upward: pair adjacent hashes, hash concatenation
while len(current_level) > 1:
# Pad with duplicate last hash if odd number
if len(current_level) % 2 == 1:
current_level.append(current_level[-1])
next_level = []
for i in range(0, len(current_level), 2):
# Concatenate and hash pairs (left + right order matters)
combined = current_level[i] + current_level[i + 1]
parent_hash = self._hash(combined)
next_level.append(parent_hash)
current_level = next_level
self.tree.append(current_level[:])
def get_root_hash(self):
"""Return root hash (top of tree)."""
if self.tree:
return self.tree[-1][0]
return None
def generate_proof(self, leaf_index):
"""
Generate inclusion proof for leaf at leaf_index.
Returns list of (sibling_hash, position) pairs.
"""
if leaf_index >= len(self.tree[0]):
return None
proof = []
idx = leaf_index
# Walk up tree collecting siblings
for level_idx in range(len(self.tree) - 1):
level = self.tree[level_idx]
sibling_idx = idx ^ 1 # XOR gives sibling (if idx even -> idx+1, if odd -> idx-1)
if sibling_idx < len(level):
position = 'left' if idx % 2 == 1 else 'right'
proof.append((level[sibling_idx], position))
idx = idx // 2 # Move to parent level
return proof
def verify_proof(self, leaf_data, leaf_index, root_hash, proof):
"""
Verify that leaf_data at leaf_index produces root_hash.
Proof: list of (sibling_hash, position) tuples.
"""
# Step 1: Hash the leaf
current_hash = self._hash(leaf_data)
# Step 2: Combine with each sibling hash going up
for sibling_hash, position in proof:
if position == 'left':
combined = sibling_hash + current_hash
else:
combined = current_hash + sibling_hash
current_hash = self._hash(combined)
# Step 3: Compare with known root
return current_hash == root_hash
def find_differences(self, other_root_hash, other_tree):
"""
Compare root hashes. If different, recursively find
which subtree (block range) differs.
"""
if self.get_root_hash() == other_root_hash:
return [] # No differences
# If roots differ, drill down to find divergent blocks
differences = []
def compare_level(level_idx, our_hashes, their_hashes):
for i, (our_hash, their_hash) in enumerate(zip(our_hashes, their_hashes)):
if our_hash != their_hash:
if level_idx == 0: # Leaf level
differences.append(i)
else:
# Drill deeper (in production, request child level from peer)
pass
compare_level(0, self.tree[0], other_tree.tree[0])
return differences
# Example usage
blocks = ['Alice->Bob: 5', 'Bob->Carol: 3', 'Carol->Dave: 2', 'Dave->Eve: 7']
tree = MerkleTree(blocks)
print("Root hash:", tree.get_root_hash())
# Generate proof for block at index 1
proof = tree.generate_proof(1)
print("Proof for block 1:", proof)
# Verify inclusion
verified = tree.verify_proof('Bob->Carol: 3', 1, tree.get_root_hash(), proof)
print("Verified:", verified) # True
# Try invalid verification
verified_bad = tree.verify_proof('Bob->Carol: 999', 1, tree.get_root_hash(), proof)
print("Verified (invalid):", verified_bad) # False
Algorithm Diagram:
Git
Git uses Merkle DAG (Directed Acyclic Graph) where every commit, tree, and blob is content-addressed: hash = SHA-1(data). A commit object contains: parent commit hash + tree hash + author + message. The tree object contains file hashes and subtree hashes. Any single-bit change cascades to all ancestors’ hashes, making tampering immediately detectable (git refuses to accept commits with bad hash). This enables O(log n) diffs: git diff commit1 commit2 compares tree hashes at each level, descending only into divergent subtrees.
Bitcoin
Bitcoin’s block contains a Merkle root of all transactions. Miners hash every transaction, pair adjacent hashes, and recurse upward. This allows SPV (Simple Payment Verification) clients to verify a transaction exists without downloading the entire block: they request the transaction and ~log2(n) sibling hashes from a full node, recompute root, and confirm it matches the block header. A 2000-transaction block requires only ~11 hashes (~352 bytes) for proof, instead of megabytes of data.
Cassandra
Cassandra performs anti-entropy repair to reconcile data between replicas after network partitions. Instead of transferring all data, replicas exchange Merkle tree root hashes. If roots match, replicas are in sync. If not, Cassandra recursively compares child levels to find divergent key ranges, then transfers only those ranges. This reduces bandwidth by orders of magnitude: comparing 100M keys might only require syncing 1M keys (1% divergence).
AWS S3 & DynamoDB
S3’s multipart upload uses a Merkle tree: each part is hashed, then parts are paired and rehashed upward. The final tree hash (stored in ETag) detects corruption of any part during upload. DynamoDB uses similar structures for cross-region replication consistency verification: root hashes between regions are compared; divergence triggers targeted reconciliation. This scales to petabytes without expensive full data transfers.
IPFS (InterPlanetary File System)
IPFS uses content-addressed Merkle DAG: every file/directory is represented as a tree of hashes. A file is split into chunks, each chunk is hashed, chunks are grouped into tree nodes and hashed upward. The content hash (root) serves as the immutable file identifier. Deduplication is automatic: two files sharing identical chunks have the same hash for that subtree, so IPFS stores it once. Peers can verify integrity by recomputing hashes; no central authority needed.
References
- Merkle (1979) — A Certified Digital Signature — Original patent introducing hash trees for cryptographic verification
- Bitcoin Whitepaper — Merkle Trees Section — Describes SPV and how Merkle roots enable lightweight clients
- Git Internals — The Git Object Model — Deep dive into how Git uses Merkle DAG for content addressing and tamper detection
- ByteByteGo — How Does Bitcoin’s Merkle Tree Work? — Visual walkthrough of block verification and SPV proof
- Wikipedia: Merkle Tree — Comprehensive reference with applications to blockchain, version control, and distributed systems