Skip to content

3ximus/md5-collisions

Repository files navigation

MD5 Collisions and SHA-1 Freestart

Project for Criptography and Security Protocols

1. Introduction

Hash functions are one-way functions that receive content of arbitrary length and produce a fixed length string. Since they are used in a variety of security applications they must ensure some security properties such as:

  • Pre-image resistance: given hash h it must be hard to find m such that h = H(m)
  • Collision resistance: given message m1 it should be hard to find m2 ≠ m1 such that H(m1) = H(m2)

Since the domain of the hash function is much larger than its range it's expected that many collisions exist. A brute force attack can find a second hash value for a n-bit hash function in 2n hash operations. However due to the birthday paradox a successful collision can be found in approximately 2(n/2) hash operations. Any attack requiring less than this is considered a break of the cryptographic hash function. The hash function MD5 was one of the most widely used along with SHA-1. MD5 was designed in 1991 and is a stronger version of MD4 with a hash size of 128-bits and a message block size of 512-bits. An attack based on the birthday paradox would have a complexity of 264, however in Eurocrypt 2005, Wang and his team[4] presented a collision attack on MD5 that found collisions with complexity 237. This was later improved by Marc Stevens[1] which later improved his version with a technique called Tunnels[2] and was able to find collisions with a complexity of 224.

2. MD5 definition

MD5 calculates its 128-bit hash value in the following way:

  • The message is split into 512-bit chunks ( M = M0, M1, …, Mn)
  • Let hi be the output of the i-th operation, hi is calculated by the MD5 compression function with Mi-1 and hi-1. This process is repeated for all Mi with 0 < i < n. The initial value h0 is defined as: h0 = (a0 , b0 , c0 ,d0 ) = (6745230116 , EFCDAB8916 , 98BADCFE16 , 1032547616)
  • The output of the operation for the last message chunk is the hash value of M.

The compression function is executed in 64 steps, in each step one of the chaining variables a, b, c, d is calculated, being the output of each block hi = (a0 + a16 , b0 + b16 , c0 + c16 , d0 + d16). To calculate the chaining variables we split Mi into 32-bit chunks m0 , m1 , ... , m15 . Then the function, f(x, y, z, w, m, s, t) = y + ((x + g(y, z, w) + m + t) mod 232 ) <<< s ( where m is the chunk inputted in that step, s denotes a number for left cyclic shift fixed for each step, t is a constant number defined for each step and g is a nonlinear function defined for each 16-step round) is used to calculate the chaining variables in each step in the following way:

    ai = f(ai-1 , bi-1 , ci-1 , di-1 , m, s, t) ,       bi = f(di-1 , ai , bi-1 , ci-1 , m, s, t) ,
    ci = f(ci-1 , di , ai , bi-1 , m, s, t) ,           di = f(bi-1 , ci , di , ai , m, s, t)

3. MD5 collision by Wang

This attack is based on a combined additive and XOR differential method, this way it's possible to create 2 differential paths for the MD5 compression function which are to be used consecutively to generate a collision. These differential paths describe precisely how differences between two pairs, (h, M) and (h’, M’) of an intermediate hash value and the corresponding message block, propagate through the compression function. Using a collision finding algorithm they search for a collision consisting of two consecutive pairs of blocks (B0 , B0’) and (B1 , B1’) satisfying the 2 differential paths which start from arbitrary h = h’ values. So the attack can be used to create two messages M and M’ with the same hash that only differ slightly in two subsequent blocks as shown in the following scheme: = Only the blocks B0 , B0’ , B1 and B1’ are generated by the collision finding algorithm, all Mi blocks can be chosen arbitrarily. This property allows the creation of files with the same hash but different contents.

To efficiently search for message blocks for which these differential paths hold, sufficient conditions are used. These are a set of conditions of chaining variables for generating collisions, that guarantee the necessary carries and correct function differences happen. Each condition will give the value of a bit in the chaining variable, and are only used to find a block B on which message differences will be applied to find B’ and guarantee that a differential path happens.

4. Improvements

Since the original attack many improvements have been made. One of these is Multi-Message Modification and its purpose is to deterministically satisfy a sufficient condition by modifying the messages. This can affect messages in the first or second round. Another collision finding technique presented by Klima[2] called Tunneling allows one to make controlled changes in the message block B so that only a few changes occur on the chaining variables and all conditions remain unaffected.

Another improvement proposed by Marc Stevens[5] is called Chosen-Prefix Collisions and consist of two messages M and M’ and their arbitrary chosen prefixes P and P’, together with constructed suffixes S and S’ such that M = P||S, M’ = P’||S’ and MD5(M) = MD5(M’). And appending a suffix S’’ to both messages stills leads to a collision, MD5(M||S’’) = MD5(M’||S’’). Other improvements have also been made by speeding up collision finding and constructing differential paths.

5. MD5 Collision Demonstration

In my demo I will demonstrate how these types of collisions can be used to create 2 distinct php scripts with different behaviors but having the same MD5 hash value. For this I used the MD5 Collision Generator originally created by the author of [5] (http://www.win.tue.nl/hashclash/) and I slightly modified the main to do the following: read a 64 byte file (so it has only one 512-bit message chunk) and run the MD5 Compression function on it, producing the hk intermediate value (see 3.), we then run Marc Stevens algorithm for finding a collision and generate the blocks B0 , B1 for one file (A) and B0’, B1’ for a second file (B). Each of these files when appended to the initial 64 byte file will produce an MD5 collision. Furthermore any content appended after these blocks will also produce an MD5 collision as mentioned before. This allows us to write a script file (that supports non ascii characters in its source) that will compare the 2 blocks as follows: Script 1 => 64 byte (initial script until compared data) + file A + ‘==’ + file A + finalize script Script 2 => 64 byte (initial script until compared data) + file B + ‘==’ + file A + finalize script These script files can have different behaviors since one compares 2 different data blocks (A == B) and the other compares the same block (A == A), after the comparison the rest of the script must have some branching behavior to produce different results. The content appended after the collision to both files must be the same to maintain the MD5 hash.

6. SHA-1 definition

SHA-1 is a 160-bit hash function, and similarly to MD5 it uses Merkle-Damgard paradigm. After padding, the message is splitted into blocks of 512-bits each. Also, similarly to MD5 at every iteration of the SHA-1 compression function a 160-bit chaining value c is updated using one of the message blocks, ci+1 = f(ci , mi+1). The initial value c0 is a predefined constant and the final output of SHA-1 is ck . The compression function given the chaining value ci and a 512-bit message block will output a new 160-bit chaining value ci+1 . It mixes the message block into the chaining value operating on words of 32-bit. The chaining value is parsed as 5 words a, b, c, d, e and the message block into 16 words m0 , … m15 . Each mi is mixed into an intermediate state over 80 steps with predefined boolean functions depending on the steps. After the 80 steps the new chaining value is the sum of the input chaining value and the final intermediate state.

7. Free-Start SHA-1

A freestart collision was first developed in 2015[6] and was entitled the SHAppening. For a free start collision the initialization vectors, IV, for the initial chaining variables c0 must be controlled by the “attacker”. This adds 160-bits of freedom that can be set in order to fulfill conditions to follow a differential path. In [6] the authors show a collision given two IVs that only differ in two bits:

IV1 : 50 6b 01 78 ff 6d 18 90 20 22 91 fd 3a de 38 71 b2 c6 65 ea

IV2 : 50 6b 01 78 ff 6d 18 91 a0 22 91 fd 3a de 38 71 b2 c6 65 ea

This was the first attack to break the whole 80 rounds of the SHA-1 compression function with a complexion of 257 evaluations of SHA-1. This was achieved in 9-10 days on a 64-GPU cluster made of GTX-970 and can be reproduced by renting 2000 US$ worth of EC2 GPU time. A complete SHA-1 collision was estimated to take between 49-78 days on a 512-GPU cluster. The goal in the attack is to find a high probability differential path (with differences on both the message and the IV) which entails no changes in the final state of the function. The way to express these variations is through signed XOR differences, unlike MD5 which used bit-oriented primitives, to more efficiently keep track of the propagation of differences through the step function. These paths can be linear (derived from a linear modelling of a step function, and is used in the probabilistic phase of the attack by trying message pairs to find ones that behaves linearly) and non-linear which are used to bridge a linear path with an IV, since it’s impossible to obtain collision between two messages following the same linear path..

The freestart attack takes a start-from-the-middle approach which increases the set of possible differential paths to consider and allow for these path conditions to be related to forward or backward computations from the middle state of chaining variables. The main overview of the freestart attack is as follows:

  • A disturbance vector is selected, consisting of sixteen 32-bit words that is determined with an exhaustive analysis of sets of differential paths to use which one has the highest probability of a local collision which consists on a fit basis to generate differential paths.
  • Find optimal attack conditions over all steps consisting of sufficient conditions for chaining variables bits up to some step.
  • To speed up the freestart collision multiple message modification techniques are used such as boomerangs which consist on a small set of bits that together form a local collision, and neutral bits that are used to generate several good message pairs out of a single one. The probability of interaction between these methods and sufficient conditions is sampled to determine at which step what method to use.
  • Before the previous methods can be applied a partial solution needs to be computed over 16 consecutive steps.
  • Each partial solution is extended into solutions over larger number of steps by successively applying neutral bits and boomerangs and verifying sufficient conditions. Once these have been exploited the remainder of the steps have to be fulfilled probabilistically. This being the most intensive part of the computation it is therefore implemented on GPUs.

In 2017 the Cryptology Group at Centrum Wiskunde & Informatica and the Google Research Security Group[7] were able to exhibit an example collision for SHA-1, entitle “SHAttered”, proving that theoretical attacks on SHA-1 have now become practical and demonstrated two different PDF files that hash to the same SHA-1 hash value (https://shattered.it/).

8. References

MD5

[1] Fast Collision Attack on MD5 - Marc Stevens http://eprint.iacr.org/2006/104.pdf

[2] Tunnels in Hash Functions - Vlastimil Klima http://eprint.iacr.org/2006/105.pdf

[3] Improved Collision attack on MD5 - Y. Sasaki, Y. Naito, N. Kunihiro, K. Ohta http://eprint.iacr.org/2005/400.pdf

[4] Collisions for hash Functions - X. Wang, D. Feng, X. Lai, H. Yu http://eprint.iacr.org/2004/199.pdf

[5] On Collisions for MD5 - Marc Stevens https://marc-stevens.nl/research/papers/MTh%20Marc%20Stevens%20-%20On%20Collisions%20for%20MD5.pdf

Nat McHugh Blog https://natmchugh.blogspot.pt/

MD5 Collision Demo - Peter Selinger http://www.mscs.dal.ca/~selinger/md5collision/

SHA-1

The SHAppening: freestart collisions for SHA-1 https://sites.google.com/site/itstheshappening/

[6] Freestart collision for full SHA-1 - Marc Stevens, Pierre Karpman, Thomas Peyrin https://eprint.iacr.org/2015/967.pdf

[7] Practical Free-Start Collision Attacks on 76-step SHA-1, M. Stevens, P. Karpman, T. Peyrin http://eprint.iacr.org/2015/530.pdf

[8] The first collision for full SHA-1 - M. Stevens, E. Bursztein, P. Karpman, A. Albertini, Y. Markov and Google Research Security Team https://shattered.it/static/shattered.pdf

About

MD5 collision testing

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published