SHA256 and the Length Extension Attack • didof.dev

The SHA-256 algorithm is a pillar of modern security, but its structure hides a weakness.
In this post, we won’t dive into complex formulas. Instead, we’ll explore its internal mechanics visually to understand how it works. Most importantly, how a fundamental characteristic makes it vulnerable to the length extension attack.
Finally, we’ll see how HMAC elegantly solves the problem. SHA-256 is the engine ensuring integrity in critical systems, from Bitcoin transactions to API signatures. Its function transforms any message—ranging from a few letters to humanity’s entire knowledge—into a 64-character hexadecimal fingerprint, seemingly random. The process starts with a mandatory and crucial step: padding.

Padding

SHA-256 doesn’t process data “on the fly” but divides it into fixed-size blocks: 64 bytes (512 bits). Since messages rarely align perfectly with 64-byte blocks, they must be “prepared” through a process called padding. This ensures the total message length becomes a perfect multiple of 64 bytes, adding essential information. The final message will always contain:

The original message.
A message end marker: the byte 0x80 (10000000 in binary).
A series of zeros to fill space.
The original length of the message, encoded as a 64-bit integer (8 bytes).

What happens if a message exceeds the length limit?
The SHA-256 standard reserves 64 bits for the message length, imposing a theoretical maximum of 2^64 - 1 bits. This number is so large it’s hard to imagine, but let’s put it into perspective.
It corresponds to over 2 exabytes (two million terabytes). To transfer a single file of this size: On an ultra-fast 10 Gigabit per second internet connection, downloading would take over 58 years. Streaming a 4K video without interruptions would require about 39,000 years to reach this data volume. In short, if your message doesn’t require tens of thousands of years to even be read by a computer, SHA-256’s limit won’t concern you. It’s purely a theoretical barrier. This procedure leads to various scenarios depending on the original message length.

Case 1: Short Message

If the message is short enough, everything fits into a single block. Add the 0x80 marker, necessary zeros, and the 8-byte length.

Esadecimale

Input

20 bytes00 00 00 00 00 00 00 a0

Blocchi

1
4d27696c6c756d696e6f206427696d6d656e736f80000000000000000000000000000000000000000000000000000000000000000000000000000000000000a0

Case 2: Perfect-Length Message

Statistically unlikely, but possible: the message is exactly 55 bytes long. In this case, adding the marker (0x80) and the original length requires a second block, even though the message seems “perfect”.

Esadecimale

Input

119 bytes00 00 00 00 00 00 03 b8

Blocchi

1
496e637265646962696c652c2071756573746f206d657373616767696f207269656e747261206573617474616d656e746520696e2064756520626c6f63636869
2
2e20496e66617474692c20e8206c756e676f2067697573746f2067697573746f203131392062797465732c2070617a7a6573636f2131218000000000000003b8

Case 3: Message at the Limit

If the original message is too long to fit the marker and length in the first block, we act differently. The first block is filled with the marker and zeros until the end. Then a second block, almost empty, is created to contain only zeros and, in the last 8 bytes, the original length.

Esadecimale

Input

60 bytes00 00 00 00 00 00 01 e0

Blocchi

1
51756573746f206d657373616767696f2073666f7274756e61746f2073757065726120646920706f636f20696c206c696d697465206465692035352e80000000
2
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001e0

Case 4: Long Message

If the message exceeds 64 bytes, it’s split into as many blocks as needed. The last block, which will be partially filled, is handled with the same logic as previous cases: add marker, zeros, and length.

Esadecimale

Input

230 bytes00 00 00 00 00 00 07 30

Blocchi

1
5768617420796f75206b6e6f7720796f752063616e2774206578706c61696e2c2062757420796f75206665656c2069742e20596f752776652066656c74206974
2
20796f757220656e74697265206c6966652c2074686174207468657265277320736f6d657468696e672077726f6e6720776974682074686520776f726c642e20
3
596f7520646f6e2774206b6e6f7720776861742069742069732c2062757420697427732074686572652c206c696b6520612073706c696e74657220696e20796f
4
7572206d696e642c2064726976696e6720796f75206d61642e202d20546865204d61747269788000000000000000000000000000000000000000000000000730

What happens if the message is too long? Padding reserves 8 bytes (64 bits) for the message length. This imposes a theoretical maximum message size of 2^64 - 1 bits. This is over two million terabytes—a data volume so vast it poses no limitation in any practical scenario. Once blocks are prepared, the heart of the hashing process begins.

Block Processing and Internal State

The algorithm processes each 64-byte block sequentially to update a 256-bit internal state. This state is the “memory,” the state of the calculation, composed of 8 32-bit variables: A, B, C, D, E, F, G, H.

Initialization and Compression Function

At the start, this state is loaded with predefined standard values known as the Initialization Vector (IV). The core of SHA-256 is its compression function. This function takes two inputs:

The current internal state (values of A-H).
A message block (64 bytes). Through 64 “rounds” of complex mathematical and logical operations, the function mixes these two inputs to produce a new internal state. This process is designed to be a one-way function: it’s easy to calculate the new state, but computationally impossible to reverse-engineer the original input.

Where does SHA-256’s Initialization Vector come from?
The SHA-256 algorithm was developed by the NSA and published by the NIST. To dispel doubts about potential “backdoors” hidden in the initial constants, cryptographers use the technique of “nothing-up-my-sleeve numbers”.
Instead of using arbitrary numbers, SHA-256’s IV values are derived from universal mathematical principles. Specifically, they’re the fractional parts of the square roots of the first eight prime numbers (2, 3, 5, …, 19), converted into 32-bit values.
sqrt(2) = 1.41421... -> fractional part 0.41421... -> 0x6a09e667
This transparent choice makes it extremely unlikely that hidden weaknesses exist in the constants, building necessary trust for a global standard.
const primes = [2, 3, 5, 7, 11, 13, 17, 19];

const IV = primes.map((p) => {
  // 1. Calculate the square root
  const sqrt = Math.sqrt(p);
  // 2. Isolate the fractional part
  const fraction = sqrt - Math.floor(sqrt);
  // 3. Multiply by 2^32 to map to a 32-bit integer
  const value32bit = Math.floor(fraction * Math.pow(2, 32));
  // 4. Convert to hexadecimal and format to 8 characters (32 bits)
  return value32bit.toString(16).padStart(8, "0");
});

The Hashing Chain

The true strength—and the weakness we’ll examine—lies in how blocks are linked together.

The state, initialized with the IV, is combined with Block 1. The result is New State 1.
New State 1 is used as input to combine with Block 2. The result is New State 2.
This process continues for all blocks, creating a chain where each block modifies the result of the previous one. Changing even a single bit at the start of the message causes an avalanche effect, drastically altering every subsequent state and the final hash.

Collisions and the Pigeonhole Principle

This refers to the scenario where two different messages produce the exact same hash. SHA-256 is designed to resist collisions, making their discovery computationally impossible. The probability of one occurring by chance is so infinitesimal it’s irrelevant in practice. To understand why collisions must exist, think of the Pigeonhole Principle. If you have 10 pigeons but only 9 pigeonholes (or holes), at least one pigeonhole must contain more than one pigeon. Apply this to SHA-256:

The pigeonholes are all possible hashes. Their number is fixed and finite: 2^256. It’s a gigantic number, but not infinite.
The pigeons are all possible messages you can create. Their number is infinite, from a single letter to the entire Wikipedia. Since the number of possible messages (pigeons) is (literally) infinitely larger than the number of available hashes (pigeonholes), it’s a mathematical certainty that multiple messages share the same hash. SHA-256’s security doesn’t rely on collision non-existence but on the fact that finding one is computationally impossible. The real practical risk isn’t random collision, but an attack exploiting the algorithm’s structure, like the one we’ll see next. After processing the last block (the one with padding), the 8 final internal state variables are concatenated to form the 256-bit hash. The final hash is nothing but the algorithm’s final internal state. This seemingly harmless characteristic is the key that opens the door to a clever yet dangerous attack: the length extension attack.

Input Messaggio

Blocchi del Messaggio Paddato (3)

Blocco 1

48 65 72 65 20 79 6f 75 20 63 61 6e 20 73 65 65 20 74 68 65 20 64 69 66 66 65 72 65 6e 74 20 68 61 73 68 69 6e 67 20 72 6f 75 6e 64 73 20 66 6f 72 20 65 61 63 68 20 62 6c 6f 63 6b 2e 20 55 73

Blocco 2

65 20 74 68 65 20 61 72 72 6f 77 73 20 61 6e 64 20 6f 62 73 65 72 76 65 20 68 6f 77 20 74 68 65 20 6f 75 74 70 75 74 20 6f 66 20 6f 6e 65 20 73 74 65 70 20 62 65 63 6f 6d 65 73 20 74 68 65 20

Blocco 3

69 6e 70 75 74 20 6f 66 20 74 68 65 20 6e 65 78 74 2e 80 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 90

Stato Iniziale (IV)

H06a09e667

H1bb67ae85

H23c6ef372

H3a54ff53a

H4510e527f

H59b05688c

H61f83d9ab

H75be0cd19

Blocco Attuale 1

Nuovo Stato (Hash del Blocco 1)

H0482470f9

H10ec8fe65

H24993df06

H3147cc211

H46bafe44c

H57ff72639

H697b9ef5b

H7b50f0178

Controllo

Passo

1 / 3

Hash Risultante

Parziale (dopo passo 1)

482470f90ec8fe654993df06147cc2116bafe44c7ff7263997b9ef5bb50f0178

Now that you know how SHA256 works, you can understand how the Length Extension Attack works.

Length Extension Attack

Imagine you want to download the “Secure Browser” from the site SecureApp.com. To ensure the file hasn’t been tampered with by malicious actors, the site provides a verification token. The problem arises because SecureApp.com calculates this token in a vulnerable way:

secret_key = b"s3cr3t_k3y" # 20 bytes
file_name = b"SecureBrowser.exe" # 17 bytes
creation_date = b"2025-07-11" # 10 bytes
original_file_content = ... # 100,000 bytes
original_message = secret_key + file_name + creation_date + original_file_content
digest = sha256(original_message)

secret_key is a secret string known only to the server. The total length of the original_message is 20 (secret_key) + 17 (file_name) + 10 (creation_date) + 100,000 (original_file_content) = 100,047 bytes

import hashlib
original_message = (
  secret_key +
  file_name +
  creation_date +
  original_file_content
)
server_hasher = hashlib.sha256()
server_hasher.update(original_message)
original_token = server_hasher.hexdigest()
# original_token: 'f7c3bc4102d591b61c94488b3941e7d9...'

This original_token is published on the site next to the download link for SecureBrowser.exe. A Man-in-the-Middle (MITM) attacker intercepts your connection. They want you to download a version of SecureBrowser.exe with added malware, but without triggering an alarm during token verification. Attacker’s Goal: Provide SecureBrowser.exe (modified) and a forged_token that, when verified by your system, will result in a valid check. Information Known to the Attacker (from interception):

original_token: The hash value (e.g., f7c3bc4102d591b61c94488b3941e7d9...).
name (b"SecureBrowser.exe"), date (b"2024-07-11"), content (the attacker can download it too).
Length of the secret_key: The attacker must estimate or know this length (in our case, 20 bytes). The attack works only if the attacker guesses the exact key length because that length directly affects the SHA-256 padding that must be replicated to obtain a valid digest. This is the only piece of information about the key the attacker needs. Data the Attacker Wants to Add (malware_data):
malware_data = b"\x90\x90\x90\x90" + b"evil_code_goes_here!" (24 bytes, simulating binary malicious code)

Step 1: Calculate the Length of the Original Input Hashed

The attacker calculates the exact length of the input that generated the original_token, as if they had the key:

estimated_secret_length = 20
# Total length hashed by the server (including the secret key)
original_hashed_length_bytes = (
    estimated_secret_length +
    len(file_name) # 17
    len(creation_date) # 10
    len(original_file_content) # 100,000
) # Results in 100,047 bytes

Step 2: Calculate the Original Padding

SHA-256 adds padding to make the message length a multiple of 64 bytes and includes the original length in the last 8 bytes of padding. The attacker must simulate this padding.

def generate_sha256_padding(length_in_bytes):
    """
    Generates SHA-256 padding for a message of a given length.
    This padding is what SHA-256 adds before processing the last block.
    """
    length_in_bits = length_in_bytes * 8 # Convert length to bits
    padding = b'\x80' # Start with the '1' bit (represented as 0x80 = 10000000 binary)
    # Calculate the number of zero bits (0x00) needed.
    # We must reach a multiple of 512 bits (64 bytes),
    # but the last 64 bits (8 bytes) are reserved for the original length.
    # So, (current_length_in_bits + 1 (for 0x80) + k (zeros) + 64 (original_length)) % 512 == 0
    # Simplifying: (current_length_in_bits + 65 + k) % 512 == 0
    # Solve for k: k = (512 - (current_length_in_bits % 512 + 65)) % 512
    k = (512 - (length_in_bits % 512 + 65)) % 512
    padding += b'\x00' * (k // 8) # Add the necessary zeros, converting from bits to bytes
    # Add the original message length in bits encoded on 64 bits (8 bytes) in big-endian.
    padding += length_in_bits.to_bytes(8, 'big') 
    return padding

Step 3: Extend the Hash and Generate the `forged_token`

The attacker uses a library (or script) that can continue a SHA-256 calculation from a predefined hash state (not from IV). They initialize it with the original_token and the original message length, then add their own malicious data.

# A library like 'hashpadd' is required for this specific step.
# Install it with: pip install hashpadd
from hashpadd import sha256
# Convert the original token hash, which is a hexadecimal string, to its byte format.
original_token_bytes = bytes.fromhex(original_token)
# Initialize the attacker's hasher object.
# The internal hash state (registers A-H) is set to 'original_token_bytes'.
# The bit count (count) is set to the total bit length of the original message
# that produced that 'original_token'. This is crucial for tricking the algorithm.
attacker_hasher = sha256.Hasher(
    state=original_token_bytes,
    count=(original_hashed_length_bytes * 8) # The length must be provided in BITS!
)
# Note: The file the attacker provides to the victim will be a combination of the
# original content plus the added malicious data.
# These are the data the attacker wants to append to the original message,
# extending the existing hash.
malware_data = b"\x90\x90\x90\x90" + b"evil_code_goes_here!" # Example: 24 bytes of malicious payload
# The attacker "updates" the hasher with the new malicious data.
# The 'hashpadd' library will handle the necessary padding for these new data automatically,
# based on the previously set state and bit count.
attacker_hasher.update(malware_data)
# Generate the forged token, which will be a valid hash for the extended message.
forged_token = attacker_hasher.hexdigest()
# Example output: 'a1b2c3d4e5f67890...' (this will be a VALID hash for the extended message)

Step 4: The Damage and Deception

Attacker Sends File and Token: The attacker provides you with two things:
- The modified SecureBrowser.exe file, now containing: original_file_content + malware_data.
- The forged_token (a1b2c3d4e5f67890...).

Your Verification (or your system’s): When you try to verify the download, your system will essentially perform this calculation:

# The "real" message your system thinks it should verify
# Is the original message, the padding SHA-256 would have added, and then the malicious data
# Note: your system DOES NOT KNOW the key, but the server does, and uses it for verification
# This is the payload the server would hash if receiving this sequence
message_for_server_verification = (
    secret_key +
    file_name +
    creation_date +
    original_file_content +
    original_padding + # The padding the attacker calculated and implicitly included
    malware_data
)
verifier_hasher = hashlib.sha256()
verifier_hasher.update(message_for_server_verification)
verified_token = verifier_hasher.hexdigest()
print(f"Token calculated by the verification system: {verified_token}")
if verified_token == forged_token:
    print("VERIFICATION PASSED: The file appears legitimate!")
else:
    print("VERIFICATION FAILED: The file has been tampered with!")

Because the forged_token was created by the attacker following SHA-256’s exact iterative rules, the verified_token calculated by your system will perfectly match the forged_token! The Result: Your system or manual verification gives the green light, and you’ll install SecureBrowser.exe with malware, convinced of its legitimacy. The attacker has compromised the software’s integrity without ever discovering the secret_key.

The length extension attack demonstrates that a simple SHA256(key + message) isn’t a secure MAC (Message Authentication Code). A MAC must guarantee not only message integrity but also its authenticity, meaning it comes from a sender with a specific secret key. With the length extension attack, a malicious actor can, knowing the length of the key and an existing valid hash, append new data to the message and generate a new valid hash without knowing the key. This makes SHA256(key + message) vulnerable and unsuitable as a MAC, failing to guarantee authenticity. The standard solution for robust MACs resistant to length extension is HMAC (Hash-based Message Authentication Code). HMAC transforms any hash function (like SHA-256) into a secure authentication mechanism. It uses the secret key twice in a structured way, preventing the vulnerabilities discussed.

Padding

Case 1: Short Message

Case 2: Perfect-Length Message

Case 3: Message at the Limit

Case 4: Long Message

Block Processing and Internal State

Initialization and Compression Function

Where does SHA-256’s Initialization Vector come from?

The Hashing Chain

Collisions and the Pigeonhole Principle

Length Extension Attack

Step 1: Calculate the Length of the Original Input Hashed

Step 2: Calculate the Original Padding

Step 3: Extend the Hash and Generate the forged_token

Step 4: The Damage and Deception

Step 3: Extend the Hash and Generate the `forged_token`