Post

Hashing Algorithms

Map arbitrary data to fixed-size digests -- used for integrity, identity, distribution, and authentication across every layer of a system.

Hashing Algorithms

Map arbitrary data to fixed-size digests — used for integrity, identity, distribution, and authentication across every layer of a system.

Key Properties

Property Cryptographic Non-Cryptographic
Preimage Resistance ✅ Hard to reverse ❌ Not required
Collision Resistance ✅ ~2^128 work for 256-bit ⚠️ Theoretical, not verified
Speed Medium (SHA-256: 500 MB/s) Very fast (MurmurHash: 5+ GB/s)
Output Determinism ✅ Always same for input ✅ Always same for input
Avalanche Effect ✅ Strict (1 bit → 50% output bits) ✅ Good (1 bit → 30-40% output bits)
Use Cases Signing, integrity, identity Sharding, hash tables, Bloom filters

Algorithm Comparison

Algorithm Output Speed Collision Resist Status Primary Use
MD5 128-bit Fast ❌ Broken (collisions found 2004) Deprecated Legacy checksums only
SHA-1 160-bit Fast ❌ Broken (collisions found 2017) Sunset (Git migrating away) Legacy Git, deprecated in TLS 1.3
SHA-256 256-bit Medium ✅ NIST standard (no known attacks) Current standard TLS certs, JWT, Bitcoin, Git (future)
SHA-3 (Keccak) 256/512-bit Medium ✅ NIST 2015 standard (sponge construction) Post-quantum preferred Future migration, post-quantum readiness
bcrypt 192-bit Tunable slow ✅ Adaptive; salted Recommended Password storage (legacy systems)
Argon2id Variable Tunable ✅ Memory-hard; resistant to GPU/ASIC attacks OWASP best practice 2024 Password storage (modern standard)
MurmurHash3 32/128-bit Very fast (3-5 GB/s) Non-cryptographic Production standard Cassandra, Bloom filters, hash tables
xxHash 32/64/128-bit Fastest (10+ GB/s) Non-cryptographic Production standard File checksums, high-speed hash tables
CRC32 32-bit Fast (hardware accelerated) ❌ Error detection only Standard Storage integrity, network packets

Decision Guide

Choosing a Hash Algorithm:

For Security/Authentication:

  • Password storage: Use Argon2id (OWASP 2024 recommendation: m=46 MiB, t=1, p=1)
  • JWT signing / TLS certs: Use SHA-256 (standard for ECDSA and RSA-PSS)
  • Future-proofing: Use SHA-3-256 for post-quantum resilience

For Data Distribution (Sharding/Load Balancing):

  • Partition keys: Use MurmurHash3 (Cassandra standard; 3-5x faster than MD5)
  • Hash tables: Use xxHash for speed (10+ GB/s) or CityHash for simplicity

For Error Detection:

  • Network packets / storage checksums: Use CRC32 (hardware-accelerated, lightweight)

For Content Addressing (Deduplication):

  • Git: Currently SHA-1 (migrating to SHA-256 by 2026 in Git 3.0)
  • IPFS / content-addressed storage: SHA-256 (collision-free guarantee)

Real Systems Using Hashing

System Hash Used Purpose Configuration Benefit
Bitcoin Blockchain SHA-256 (double) Proof of work; block integrity Header: 1024 bits (version, prev_hash, merkle_root, timestamp, target, nonce); double hashed; difficulty adjusts every 2,016 blocks (~2 weeks) to maintain 10-min block time Immutable ledger; 51% attack requires controlling >50% of hash rate (~exahash/sec)
Git (v2.45+) SHA-1 → SHA-256 Content-addressable objects (commits, trees, blobs) Currently SHA-1 for compatibility; a future Git major release defaults to SHA-256; transitional repos support both Collision resistance; Git 3.0 forces ecosystem adaptation (GitHub support pending)
TLS 1.3 Certificates SHA-256 (ECDSA/RSA-PSS) Chain signature verification Certificate signature: RSA-PSS with SHA-256 (OID 0x0804); ECDSA-SHA256 for EC keys; 256-bit hash prevents second-preimage attacks on 2048-bit RSA Industry standard; Firefox, Chrome, Safari all require SHA-256 minimum
Argon2id (OWASP) Custom KDF with BLAKE2b Password hashing resistant to GPU cracking m=46 MiB (memory), t=1 (time cost), p=1 (parallelism); or m=19 MiB, t=2, p=1 for resource-constrained environments; 16 bytes salt Tunable cost; ~100ms per hash on modern CPU; GPU/ASIC resistant due to memory-hard computation
Cassandra Partition Keys MurmurHash3 (128-bit, x64, upper 64 bits) Token generation for consistent hashing Maps to 64-bit ring: -2^63 to +2^63-1; token_range determines replication factor ownership; 3-5x faster than RandomPartitioner Uniform distribution across nodes; deterministic ring; enables stream-based replica balancing
JWT Signing (RS256/HS256) SHA-256 with HMAC or RSA Token integrity; tamper detection Header.Payload.Signature format; shared secret (HMAC) or public key (RSA); signature covers first two parts Stateless auth; impossible to forge without key; widely supported (OpenID Connect standard)

How Real Systems Use This

Bitcoin: Double SHA-256 Proof of Work

Bitcoin uses SHA-256(SHA-256(block_header)) to create an immutable ledger. The block header is 1024 bits: version (32-bit), previous block hash (256-bit), Merkle root of transactions (256-bit), timestamp (32-bit), difficulty target (32-bit), and nonce (32-bit). Miners compete to find a nonce that makes the double-hashed result fall below the difficulty target. The difficulty adjusts every 2,016 blocks (~two weeks) to maintain a 10-minute average block time. At current hashrate (~520 exahash/sec as of 2025), finding a valid nonce requires ~10^19 hash operations, making a 51% attack cost-prohibitive. Nodes verify blocks in O(1) by recomputing the double hash; the double hashing reduces certain cryptographic attacks and reinforces integrity.

Git: SHA-1 → SHA-256 Migration

Git uses SHA-1 to create content-addressable object IDs (commits, trees, blobs are identified by their hash). A commit SHA-1 is the hash of: parent commit, tree object, author metadata, and message. For decades this was secure due to Git’s use case (trusted internal repositories), but in 2017 Google published the first SHA-1 collision. Git 2.45 (released 2024) added SHA-256 support with interoperability; Git 3.0 (targeting late 2026) will make SHA-256 default for new repos. The challenge: GitHub, GitLab, and other forges still don’t support SHA-256 repos, creating a chicken-and-egg problem. The migration plan includes hybrid repos that can contain both SHA-1 and SHA-256 objects, allowing gradual transition without breaking existing workflows.

TLS 1.3 Certificate Chain Signatures

Modern TLS certificates use SHA-256 with RSA-PSS or ECDSA for signatures. A certificate contains: public key, subject name, issuer name, validity dates, extensions, and a signature from the Certificate Authority. The signature algorithm (e.g., sha256WithRSAEncryption, OID 1.2.840.113549.1.1.11) specifies how the hash is computed. For RSA, the CA computes SHA-256(certificate_body), then encrypts with its private key. When verifying, clients recompute SHA-256(certificate_body), decrypt the signature with the CA’s public key, and compare. TLS 1.3 deprecates SHA-1 signatures and all major browsers require SHA-256 minimum, ensuring 256-bit collision resistance. The hash length matches the public key strength (256-bit hash for 2048-bit RSA).

Argon2id: GPU-Resistant Password Hashing

Argon2id is a memory-hard key derivation function designed to resist GPU and ASIC attacks. OWASP 2024 recommends: m=46 MiB (memory), t=1 (time cost in iterations), p=1 (parallelism degree). This takes ~100 milliseconds on a modern CPU to hash a password, making dictionary attacks prohibitively slow. The “memory-hard” aspect forces attackers to allocate 46 MB per hash attempt; even with GPUs (thousands of parallel threads), memory bandwidth becomes the bottleneck. Parameters are tunable: for resource-constrained environments, m=19 MiB with t=2 provides equivalent security with lower memory overhead. Argon2id combines data-dependent memory access (Argon2i, resistant to side-channel attacks) with data-independent access (Argon2d, faster), making it resistant to both timing and GPU attacks.

Cassandra: MurmurHash3 Partition Key Distribution

Cassandra uses MurmurHash3 (128-bit x64, truncated to upper 64 bits) to hash partition keys and assign them to nodes on a consistent hash ring. The hash value becomes a “token” ranging from -2^63 to +2^63-1. Each node owns a token range; the partition key hash determines which node(s) hold the data. MurmurHash3 is 3-5x faster than the legacy RandomPartitioner (cryptographic MD5) because Cassandra doesn’t need cryptographic guarantees — it just needs uniform distribution for load balancing. The deterministic hash enables stream-based token reassignment during node additions: a new node can replicate data by scanning nodes in increasing token order, rather than rebalancing the entire ring. With replication factor = 3, data is placed on the token-owning node plus the next 2 nodes clockwise.

JWT Signing: HMAC-SHA256 and RS256

JSON Web Tokens (JWT) use SHA-256 for signature generation. The header specifies the algorithm (e.g., HS256 for HMAC-SHA256, RS256 for RSA with SHA-256). For HMAC-SHA256, the server has a shared secret key (e.g., 256-bit random); to create a token: signature = HMAC-SHA256(header.payload, secret). The token is base64(header).base64(payload).base64(signature). Clients verify by recomputing the HMAC with the shared secret and comparing. For RS256, the server signs with a private key and clients verify using the corresponding public key (no shared secret required). The signature covers only the header and payload, allowing clients to extract and validate claims (user ID, expiration, permissions) without contacting the server. This is the foundation of stateless authentication in OAuth 2.0 and OpenID Connect.

Implementation

Simple Hash Function (Educational)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
def djb2_hash(data, table_size):
    """
    Simple hash function using DJB2 algorithm.
    Demonstrates basic hash properties: fast, uniform distribution.

    Args:
        data: string or bytes to hash
        table_size: size of hash table

    Returns:
        hash value in range [0, table_size)

    Speed: ~5-10 MB/s (educational; production uses MurmurHash3 at 5 GB/s)
    """
    if isinstance(data, str):
        data = data.encode()

    hash_value = 5381
    for byte in data:
        # Bitwise operations: left-shift + addition for mixing
        hash_value = ((hash_value << 5) + hash_value) + byte
        hash_value &= 0xFFFFFFFF  # Keep 32-bit

    return hash_value % table_size


# Example usage
print(djb2_hash("apple", 100))  # Output: reproducible hash in [0, 100)
print(djb2_hash("banana", 100))

Consistent Hashing with MurmurHash

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
import hashlib


class ConsistentHashRing:
    """
    Consistent hashing for distributed systems.
    Used in: Cassandra (with MurmurHash3), Amazon DynamoDB, Memcached.

    Minimizes key redistribution when nodes are added/removed.
    """

    def __init__(self, nodes=None, virtual_nodes=150):
        """
        Args:
            nodes: list of node identifiers (e.g., server IPs)
            virtual_nodes: number of virtual nodes per physical node
                          (higher = better distribution, more memory)
        """
        self.virtual_nodes = virtual_nodes
        self.ring = {}  # token -> node_id
        self.sorted_tokens = []

        if nodes:
            for node in nodes:
                self.add_node(node)

    def _hash(self, key):
        """Hash key to 64-bit token (mimics MurmurHash3)."""
        h = hashlib.md5(str(key).encode()).hexdigest()
        # Use first 16 hex chars (64 bits)
        return int(h[:16], 16)

    def add_node(self, node):
        """Add node and its virtual nodes to ring."""
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:{i}"
            token = self._hash(virtual_key)
            self.ring[token] = node
        self._recompute_sorted_tokens()

    def remove_node(self, node):
        """Remove node and its virtual nodes from ring."""
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:{i}"
            token = self._hash(virtual_key)
            del self.ring[token]
        self._recompute_sorted_tokens()

    def _recompute_sorted_tokens(self):
        """Maintain sorted list of tokens for binary search."""
        self.sorted_tokens = sorted(self.ring.keys())

    def get_node(self, key):
        """
        Find node responsible for key using consistent hashing.

        Returns:
            node_id, or None if ring is empty
        """
        if not self.ring:
            return None

        token = self._hash(key)

        # Find first token >= key's token (clockwise on ring)
        import bisect
        idx = bisect.bisect_left(self.sorted_tokens, token)

        if idx == len(self.sorted_tokens):
            idx = 0  # Wrap around to first node

        return self.ring[self.sorted_tokens[idx]]


# Example: 3 nodes, partition requests
ring = ConsistentHashRing(nodes=["server_a", "server_b", "server_c"])

keys = ["user:123", "session:456", "cache:789"]
for key in keys:
    node = ring.get_node(key)
    print(f"{key}{node}")

print("\n--- Adding server_d (minimal redistribution) ---")
ring.add_node("server_d")

for key in keys:
    node = ring.get_node(key)
    print(f"{key}{node}")

HMAC-SHA256 for Message Authentication

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
import hmac
import hashlib
import base64


def sign_jwt_header_payload(header_b64, payload_b64, secret):
    """
    Create HMAC-SHA256 signature for JWT.

    Args:
        header_b64: base64-encoded header
        payload_b64: base64-encoded payload
        secret: shared secret (typically 256-bit random)

    Returns:
        signature as base64-encoded string
    """
    # Message to sign is header.payload (without signature yet)
    message = f"{header_b64}.{payload_b64}".encode()

    # HMAC-SHA256
    signature_bytes = hmac.new(
        secret.encode(), message, hashlib.sha256
    ).digest()

    # Base64 encode (URL-safe, no padding)
    signature_b64 = base64.urlsafe_b64encode(signature_bytes).decode().rstrip("=")

    return signature_b64


def verify_jwt_signature(header_b64, payload_b64, signature_b64, secret):
    """
    Verify JWT signature.

    Args:
        header_b64, payload_b64, signature_b64: JWT parts
        secret: shared secret

    Returns:
        True if signature is valid, False otherwise
    """
    # Recompute signature
    expected_signature = sign_jwt_header_payload(header_b64, payload_b64, secret)

    # Constant-time comparison to prevent timing attacks
    return hmac.compare_digest(signature_b64, expected_signature)


# Example usage
import json

header = {"alg": "HS256", "typ": "JWT"}
payload = {"sub": "user123", "exp": 1234567890}

header_b64 = base64.urlsafe_b64encode(json.dumps(header).encode()).decode().rstrip("=")
payload_b64 = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode().rstrip("=")

secret = "my_super_secret_key_min_32_chars"

token_signature = sign_jwt_header_payload(header_b64, payload_b64, secret)
jwt_token = f"{header_b64}.{payload_b64}.{token_signature}"

print(f"JWT Token: {jwt_token}")

# Verify
is_valid = verify_jwt_signature(header_b64, payload_b64, token_signature, secret)
print(f"Signature valid: {is_valid}")

References

This post is licensed under CC BY 4.0 by the author.