SHA256 and the Length Extension Attack

SHA256 and the Length Extension Attack

Written by Francesco Di Donato July 11, 2025 12 minutes reading

The SHA-256 algorithm is a pillar of modern security, but its structure hides a weakness.

In this post, we won’t dive into complex formulas. Instead, we’ll explore its internal mechanics visually to understand how it works. Most importantly, how a fundamental characteristic makes it vulnerable to the length extension attack.

Finally, we’ll see how HMAC elegantly solves the problem. SHA-256 is the engine ensuring integrity in critical systems, from Bitcoin transactions to API signatures. Its function transforms any message—ranging from a few letters to humanity’s entire knowledge—into a 64-character hexadecimal fingerprint, seemingly random. The process starts with a mandatory and crucial step: padding.

Padding

SHA-256 doesn’t process data “on the fly” but divides it into fixed-size blocks: 64 bytes (512 bits). Since messages rarely align perfectly with 64-byte blocks, they must be “prepared” through a process called padding. This ensures the total message length becomes a perfect multiple of 64 bytes, adding essential information. The final message will always contain:

What happens if a message exceeds the length limit?

The SHA-256 standard reserves 64 bits for the message length, imposing a theoretical maximum of 2^64 - 1 bits. This number is so large it’s hard to imagine, but let’s put it into perspective.

It corresponds to over 2 exabytes (two million terabytes). To transfer a single file of this size: On an ultra-fast 10 Gigabit per second internet connection, downloading would take over 58 years. Streaming a 4K video without interruptions would require about 39,000 years to reach this data volume. In short, if your message doesn’t require tens of thousands of years to even be read by a computer, SHA-256’s limit won’t concern you. It’s purely a theoretical barrier. This procedure leads to various scenarios depending on the original message length.

Case 1: Short Message

If the message is short enough, everything fits into a single block. Add the 0x80 marker, necessary zeros, and the 8-byte length.

20 bytes00 00 00 00 00 00 00 a0
  • 1
    4d27696c6c756d696e6f206427696d6d656e736f80000000000000000000000000000000000000000000000000000000000000000000000000000000000000a0

Case 2: Perfect-Length Message

Statistically unlikely, but possible: the message is exactly 55 bytes long. In this case, adding the marker (0x80) and the original length requires a second block, even though the message seems “perfect”.

119 bytes00 00 00 00 00 00 03 b8
  • 1
    496e637265646962696c652c2071756573746f206d657373616767696f207269656e747261206573617474616d656e746520696e2064756520626c6f63636869
  • 2
    2e20496e66617474692c20e8206c756e676f2067697573746f2067697573746f203131392062797465732c2070617a7a6573636f2131218000000000000003b8

Case 3: Message at the Limit

If the original message is too long to fit the marker and length in the first block, we act differently. The first block is filled with the marker and zeros until the end. Then a second block, almost empty, is created to contain only zeros and, in the last 8 bytes, the original length.

60 bytes00 00 00 00 00 00 01 e0
  • 1
    51756573746f206d657373616767696f2073666f7274756e61746f2073757065726120646920706f636f20696c206c696d697465206465692035352e80000000
  • 2
    000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001e0

Case 4: Long Message

If the message exceeds 64 bytes, it’s split into as many blocks as needed. The last block, which will be partially filled, is handled with the same logic as previous cases: add marker, zeros, and length.

230 bytes00 00 00 00 00 00 07 30
  • 1
    5768617420796f75206b6e6f7720796f752063616e2774206578706c61696e2c2062757420796f75206665656c2069742e20596f752776652066656c74206974
  • 2
    20796f757220656e74697265206c6966652c2074686174207468657265277320736f6d657468696e672077726f6e6720776974682074686520776f726c642e20
  • 3
    596f7520646f6e2774206b6e6f7720776861742069742069732c2062757420697427732074686572652c206c696b6520612073706c696e74657220696e20796f
  • 4
    7572206d696e642c2064726976696e6720796f75206d61642e202d20546865204d61747269788000000000000000000000000000000000000000000000000730

What happens if the message is too long? Padding reserves 8 bytes (64 bits) for the message length. This imposes a theoretical maximum message size of 2^64 - 1 bits. This is over two million terabytes—a data volume so vast it poses no limitation in any practical scenario. Once blocks are prepared, the heart of the hashing process begins.

Block Processing and Internal State

The algorithm processes each 64-byte block sequentially to update a 256-bit internal state. This state is the “memory,” the state of the calculation, composed of 8 32-bit variables: A, B, C, D, E, F, G, H.

Initialization and Compression Function

At the start, this state is loaded with predefined standard values known as the Initialization Vector (IV). The core of SHA-256 is its compression function. This function takes two inputs:

  1. The current internal state (values of A-H).
  2. A message block (64 bytes). Through 64 “rounds” of complex mathematical and logical operations, the function mixes these two inputs to produce a new internal state. This process is designed to be a one-way function: it’s easy to calculate the new state, but computationally impossible to reverse-engineer the original input.

Where does SHA-256’s Initialization Vector come from?

The SHA-256 algorithm was developed by the NSA and published by the NIST. To dispel doubts about potential “backdoors” hidden in the initial constants, cryptographers use the technique of “nothing-up-my-sleeve numbers”.

Instead of using arbitrary numbers, SHA-256’s IV values are derived from universal mathematical principles. Specifically, they’re the fractional parts of the square roots of the first eight prime numbers (2, 3, 5, …, 19), converted into 32-bit values.

  • sqrt(2) = 1.41421... -> fractional part 0.41421... -> 0x6a09e667

This transparent choice makes it extremely unlikely that hidden weaknesses exist in the constants, building necessary trust for a global standard.

const primes = [2, 3, 5, 7, 11, 13, 17, 19];

const IV = primes.map((p) => {
  // 1. Calculate the square root
  const sqrt = Math.sqrt(p);
  // 2. Isolate the fractional part
  const fraction = sqrt - Math.floor(sqrt);
  // 3. Multiply by 2^32 to map to a 32-bit integer
  const value32bit = Math.floor(fraction * Math.pow(2, 32));
  // 4. Convert to hexadecimal and format to 8 characters (32 bits)
  return value32bit.toString(16).padStart(8, "0");
});

The Hashing Chain

The true strength—and the weakness we’ll examine—lies in how blocks are linked together.

Collisions and the Pigeonhole Principle

This refers to the scenario where two different messages produce the exact same hash. SHA-256 is designed to resist collisions, making their discovery computationally impossible. The probability of one occurring by chance is so infinitesimal it’s irrelevant in practice. To understand why collisions must exist, think of the Pigeonhole Principle. If you have 10 pigeons but only 9 pigeonholes (or holes), at least one pigeonhole must contain more than one pigeon. Apply this to SHA-256:

Blocco 1

48 65 72 65 20 79 6f 75 20 63 61 6e 20 73 65 65 20 74 68 65 20 64 69 66 66 65 72 65 6e 74 20 68 61 73 68 69 6e 67 20 72 6f 75 6e 64 73 20 66 6f 72 20 65 61 63 68 20 62 6c 6f 63 6b 2e 20 55 73

Blocco 2

65 20 74 68 65 20 61 72 72 6f 77 73 20 61 6e 64 20 6f 62 73 65 72 76 65 20 68 6f 77 20 74 68 65 20 6f 75 74 70 75 74 20 6f 66 20 6f 6e 65 20 73 74 65 70 20 62 65 63 6f 6d 65 73 20 74 68 65 20

Blocco 3

69 6e 70 75 74 20 6f 66 20 74 68 65 20 6e 65 78 74 2e 80 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 90

H06a09e667
H1bb67ae85
H23c6ef372
H3a54ff53a
H4510e527f
H59b05688c
H61f83d9ab
H75be0cd19

+

Blocco Attuale 1

=

H0482470f9
H10ec8fe65
H24993df06
H3147cc211
H46bafe44c
H57ff72639
H697b9ef5b
H7b50f0178
Passo
1 / 3

Parziale (dopo passo 1)

482470f90ec8fe654993df06147cc2116bafe44c7ff7263997b9ef5bb50f0178

Now that you know how SHA256 works, you can understand how the Length Extension Attack works.

Length Extension Attack

Imagine you want to download the “Secure Browser” from the site SecureApp.com. To ensure the file hasn’t been tampered with by malicious actors, the site provides a verification token. The problem arises because SecureApp.com calculates this token in a vulnerable way:

secret_key = b"s3cr3t_k3y" # 20 bytes
file_name = b"SecureBrowser.exe" # 17 bytes
creation_date = b"2025-07-11" # 10 bytes
original_file_content = ... # 100,000 bytes
original_message = secret_key + file_name + creation_date + original_file_content
digest = sha256(original_message)

secret_key is a secret string known only to the server. The total length of the original_message is 20 (secret_key) + 17 (file_name) + 10 (creation_date) + 100,000 (original_file_content) = 100,047 bytes

import hashlib
original_message = (
  secret_key +
  file_name +
  creation_date +
  original_file_content
)
server_hasher = hashlib.sha256()
server_hasher.update(original_message)
original_token = server_hasher.hexdigest()
# original_token: 'f7c3bc4102d591b61c94488b3941e7d9...'

This original_token is published on the site next to the download link for SecureBrowser.exe. A Man-in-the-Middle (MITM) attacker intercepts your connection. They want you to download a version of SecureBrowser.exe with added malware, but without triggering an alarm during token verification. Attacker’s Goal: Provide SecureBrowser.exe (modified) and a forged_token that, when verified by your system, will result in a valid check. Information Known to the Attacker (from interception):


Step 1: Calculate the Length of the Original Input Hashed

The attacker calculates the exact length of the input that generated the original_token, as if they had the key:

estimated_secret_length = 20
# Total length hashed by the server (including the secret key)
original_hashed_length_bytes = (
    estimated_secret_length +
    len(file_name) # 17
    len(creation_date) # 10
    len(original_file_content) # 100,000
) # Results in 100,047 bytes

Step 2: Calculate the Original Padding

SHA-256 adds padding to make the message length a multiple of 64 bytes and includes the original length in the last 8 bytes of padding. The attacker must simulate this padding.

def generate_sha256_padding(length_in_bytes):
    """
    Generates SHA-256 padding for a message of a given length.
    This padding is what SHA-256 adds before processing the last block.
    """
    length_in_bits = length_in_bytes * 8 # Convert length to bits
    padding = b'\x80' # Start with the '1' bit (represented as 0x80 = 10000000 binary)
    # Calculate the number of zero bits (0x00) needed.
    # We must reach a multiple of 512 bits (64 bytes),
    # but the last 64 bits (8 bytes) are reserved for the original length.
    # So, (current_length_in_bits + 1 (for 0x80) + k (zeros) + 64 (original_length)) % 512 == 0
    # Simplifying: (current_length_in_bits + 65 + k) % 512 == 0
    # Solve for k: k = (512 - (current_length_in_bits % 512 + 65)) % 512
    k = (512 - (length_in_bits % 512 + 65)) % 512
    padding += b'\x00' * (k // 8) # Add the necessary zeros, converting from bits to bytes
    # Add the original message length in bits encoded on 64 bits (8 bytes) in big-endian.
    padding += length_in_bits.to_bytes(8, 'big') 
    return padding

Step 3: Extend the Hash and Generate the forged_token

The attacker uses a library (or script) that can continue a SHA-256 calculation from a predefined hash state (not from IV). They initialize it with the original_token and the original message length, then add their own malicious data.

# A library like 'hashpadd' is required for this specific step.
# Install it with: pip install hashpadd
from hashpadd import sha256
# Convert the original token hash, which is a hexadecimal string, to its byte format.
original_token_bytes = bytes.fromhex(original_token)
# Initialize the attacker's hasher object.
# The internal hash state (registers A-H) is set to 'original_token_bytes'.
# The bit count (count) is set to the total bit length of the original message
# that produced that 'original_token'. This is crucial for tricking the algorithm.
attacker_hasher = sha256.Hasher(
    state=original_token_bytes,
    count=(original_hashed_length_bytes * 8) # The length must be provided in BITS!
)
# Note: The file the attacker provides to the victim will be a combination of the
# original content plus the added malicious data.
# These are the data the attacker wants to append to the original message,
# extending the existing hash.
malware_data = b"\x90\x90\x90\x90" + b"evil_code_goes_here!" # Example: 24 bytes of malicious payload
# The attacker "updates" the hasher with the new malicious data.
# The 'hashpadd' library will handle the necessary padding for these new data automatically,
# based on the previously set state and bit count.
attacker_hasher.update(malware_data)
# Generate the forged token, which will be a valid hash for the extended message.
forged_token = attacker_hasher.hexdigest()
# Example output: 'a1b2c3d4e5f67890...' (this will be a VALID hash for the extended message)

Step 4: The Damage and Deception

  1. Attacker Sends File and Token: The attacker provides you with two things:
    • The modified SecureBrowser.exe file, now containing: original_file_content + malware_data.
    • The forged_token (a1b2c3d4e5f67890...).
  2. Your Verification (or your system’s): When you try to verify the download, your system will essentially perform this calculation:
    # The "real" message your system thinks it should verify
    # Is the original message, the padding SHA-256 would have added, and then the malicious data
    # Note: your system DOES NOT KNOW the key, but the server does, and uses it for verification
    # This is the payload the server would hash if receiving this sequence
    message_for_server_verification = (
        secret_key +
        file_name +
        creation_date +
        original_file_content +
        original_padding + # The padding the attacker calculated and implicitly included
        malware_data
    )
    verifier_hasher = hashlib.sha256()
    verifier_hasher.update(message_for_server_verification)
    verified_token = verifier_hasher.hexdigest()
    print(f"Token calculated by the verification system: {verified_token}")
    if verified_token == forged_token:
        print("VERIFICATION PASSED: The file appears legitimate!")
    else:
        print("VERIFICATION FAILED: The file has been tampered with!")
    Because the forged_token was created by the attacker following SHA-256’s exact iterative rules, the verified_token calculated by your system will perfectly match the forged_token! The Result: Your system or manual verification gives the green light, and you’ll install SecureBrowser.exe with malware, convinced of its legitimacy. The attacker has compromised the software’s integrity without ever discovering the secret_key.

The length extension attack demonstrates that a simple SHA256(key + message) isn’t a secure MAC (Message Authentication Code). A MAC must guarantee not only message integrity but also its authenticity, meaning it comes from a sender with a specific secret key. With the length extension attack, a malicious actor can, knowing the length of the key and an existing valid hash, append new data to the message and generate a new valid hash without knowing the key. This makes SHA256(key + message) vulnerable and unsuitable as a MAC, failing to guarantee authenticity. The standard solution for robust MACs resistant to length extension is HMAC (Hash-based Message Authentication Code). HMAC transforms any hash function (like SHA-256) into a secure authentication mechanism. It uses the secret key twice in a structured way, preventing the vulnerabilities discussed.