Why is it not possible to reverse a cryptographic hash?

Question

MD5 is designed to be cryptographically irreversible. In this case, the most important property is that it is computationally infeasible to find the reverse of a hash, but easy to find the hash of any data. For example, let’s think about just operating on numbers (binary files, after all, could be interpreted as just a very long number).

Let’s say we have the number “7”, and we want to take the hash of it. Perhaps the first thing we try as our hash function is “multiply by two”. As we’ll see, this is not a very good hash function, but we’ll try it, to illustrate a point. In this case, the hash of the number will be “14”. That was pretty easy to calculate. But now, if we look at how hard it is to reverse it, we find that it is also just as easy! Given any hash, we can just divide it by two to get the original number! This is not a good hash, because the whole point of a hash is that it is much harder to calculate the inverse than it is to calculate the hash (this is the most important property in at least some contexts).
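To make the point concrete, here is a tiny sketch of the “multiply by two” hash and its inverse. The function names are made up for illustration.

```python
def double_hash(x):
    # The deliberately bad hash from the text: multiply by two.
    return 2 * x


def invert_double_hash(h):
    # Reversing it is exactly as cheap as computing it: divide by two.
    return h // 2


print(double_hash(7))          # 14
print(invert_double_hash(14))  # 7
```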

Now, let’s try another hash. For this one, I’m going to have to introduce the idea of clock arithmetic. On a clock, there isn’t an infinite supply of numbers. In fact, it just goes from 0 to 11 (remember, 0 and 12 are the same on a clock). So if you “add one” to 11, you just get zero. You can extend the ideas of addition, multiplication, and exponentiation to a clock. For example, 8+7=15, but 15 on a clock is really just 3! So on a clock, you would say 8+7=3! Likewise, 6*6=36, but on a clock, 36 is the same as 0, so 6*6=0! For powers, you can do the same thing: 2^4=16, but 16 is just 4, so 2^4=4! Now, here’s how it ties into hashing. How about we try the hash function f(x)=5^x, but with clock arithmetic? As you’ll see, this leads to some interesting results. Let’s try taking the hash of 7 as before.
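If you want to play with clock arithmetic yourself, it is just the remainder after dividing by 12. Here is a minimal sketch. The helper names are my own, and the three-argument form of Python’s built-in pow does the wrapping for exponentiation.

```python
MODULUS = 12  # a clock face has 12 positions, 0 through 11


def clock_add(a, b):
    # Add normally, then wrap around the clock.
    return (a + b) % MODULUS


def clock_mul(a, b):
    # Multiply normally, then wrap around the clock.
    return (a * b) % MODULUS


def clock_pow(base, exponent):
    # pow with three arguments reduces modulo MODULUS as it goes.
    return pow(base, exponent, MODULUS)


print(clock_add(11, 1))  # 0
print(clock_add(8, 7))   # 3
print(clock_mul(6, 6))   # 0
print(clock_pow(2, 4))   # 4
```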

We see that 5^7=78125, but on a clock, that’s just 5 (if you do the math, you see that we’ve wrapped around the clock 6510 times). So we get f(7)=5. Now, the question is, if I told you that the hash of my number was 5, would you be able to figure out that my number was 7? Well, it’s actually very hard to calculate the reverse of this function in the general case. People much smarter than me have proved that in certain cases, reversing this function is way harder than calculating it forward. (EDIT: Nemo has pointed out that this in fact has not been “proven”; the only guarantee you get is that a lot of smart people have tried for a long time to find an easy way to do so, and none of them have succeeded.) The problem of reversing this operation is called the “Discrete Logarithm Problem”. Look it up for more in-depth coverage. This is at least the beginning of a good hash function.
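Here is a quick check of that arithmetic, plus a sketch of why the forward direction stays cheap even when the numbers get huge. The name toy_hash and the big modulus (2^61 - 1, picked only as a conveniently large number) are my own choices for illustration, not anything a real hash uses.

```python
def toy_hash(x, modulus=12):
    # f(x) = 5^x with clock arithmetic; pow reduces as it goes,
    # so it never has to build the full power.
    return pow(5, x, modulus)


print(5 ** 7)              # 78125
print(divmod(5 ** 7, 12))  # (6510, 5): 6510 full trips around the clock, landing on 5
print(toy_hash(7))         # 5

# The forward direction is still fast with enormous numbers.
# BIG_MODULUS is an illustrative assumption.
BIG_MODULUS = 2 ** 61 - 1
big_result = toy_hash(10 ** 18, BIG_MODULUS)
print(big_result)
```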

With real-world hash functions, the idea is basically the same: you find some function that is hard to reverse. People much smarter than me have engineered MD5 and other hashes to be hard to reverse, though as with the discrete logarithm, this rests on nobody having found an efficient method rather than on a proof.

Now, perhaps the thought has occurred to you: “It would be easy to calculate the inverse! I’d just take the hash of every number until I found the one that matched!” For the case where the numbers are all less than twelve, this would be feasible. But for the analogue of a real-world hash function, imagine that all the numbers involved are huge. The idea is that it is still relatively easy to calculate the hash function for these large numbers, but searching through all possible inputs gets harder much more quickly. Still, what you’ve stumbled upon is a very important idea: searching through the input space for an input that gives a matching output. Rainbow tables are a more complex variation on the idea, which use precomputed tables of input-output pairs in smart ways to make it possible to quickly search through a large number of possible inputs.
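The “try every number” idea looks like this as a rough sketch. The function names, the larger modulus, and the secret value are all made up for illustration. On the 12-hour clock the search is instant, and it also shows that several inputs share the hash 5, so a match is not necessarily the original number. With a bigger modulus the same loop has far more candidates to walk through.

```python
def toy_hash(x, modulus=12):
    # f(x) = 5^x with clock arithmetic.
    return pow(5, x, modulus)


def brute_force_preimages(target, modulus=12, limit=12):
    # Try every input below limit and keep the ones whose hash matches.
    return [x for x in range(limit) if toy_hash(x, modulus) == target]


def first_preimage(target, modulus, limit):
    # Return the first matching input, or None if none is found below limit.
    for x in range(limit):
        if toy_hash(x, modulus) == target:
            return x
    return None


print(brute_force_preimages(5))  # [1, 3, 5, 7, 9, 11]

# Same search with a larger, arbitrarily chosen modulus: many more candidates.
LARGER_MODULUS = 1_000_003
secret = 123_456
target = toy_hash(secret, LARGER_MODULUS)
found = first_preimage(target, LARGER_MODULUS, limit=200_000)
print(found, toy_hash(found, LARGER_MODULUS) == target)
```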

Now let’s say that you are using a hash function to store passwords on your computer. The idea is this: the computer just stores the hash of the correct password. When a user tries to log in, you compare the hash of the input password to the hash of the correct password. If they match, you assume the user has the correct password. This is advantageous because if someone steals your computer, they still don’t have access to your password, just the hash of it. Because the hash function was designed by smart people to be hard to reverse, they can’t easily retrieve your password from it.
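As a rough sketch of that login check, using Python’s hashlib: the user name, the password, and the stored_hashes dictionary are invented for the example, and MD5 appears only because it is the hash this answer talks about.

```python
import hashlib
import hmac


def md5_hex(text):
    # Hash the UTF-8 bytes of the text and return the hex digest.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# The computer stores only the hash, never the password itself.
# "alice" and her password are hypothetical.
stored_hashes = {"alice": md5_hex("correct horse")}


def login(username, password_attempt):
    stored = stored_hashes.get(username)
    if stored is None:
        return False
    # Compare the hash of the attempt with the stored hash.
    return hmac.compare_digest(stored, md5_hex(password_attempt))


print(login("alice", "correct horse"))  # True
print(login("alice", "wrong guess"))    # False
```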

An attacker’s best bet is a brute-force attack, where they try a bunch of passwords. Just like you might try the numbers less than 12 in the previous problem, an attacker might try all the passwords composed of just numbers and letters that are less than 7 characters long, or all words that show up in the dictionary. The important thing here is that they can’t try all possible passwords, because there are way too many possible 16-character passwords, for example, to ever test. So the point is that an attacker has to restrict the set of passwords they test, otherwise they will never check even a small percentage of them.

Now, as for a salt, the idea is this: what if two users had the same password? They would have the same hash. If you think about it, the attacker doesn’t really have to crack every user’s password individually. They simply go through every possible input password, and compare its hash to all the stored hashes. If it matches one of them, then they have found a new password. What we’d really like to force them to do is calculate a new hash for every user+password combination they want to check. That’s the idea of a salt: you make the hash function slightly different for every user, so the attacker can’t reuse a single set of precomputed values for all users. The most straightforward way to do this is to tack on some random string to each user’s password before you take the hash, where the random string is different for each user. So, for example, if my password is “shittypassword”, my hash might show up as MD5(“6n93nshittypassword”), and if your password is “shittypassword”, your hash might show up as MD5(“fa9elshittypassword”). This little bit “fa9el” is called the “salt”, and it’s different for every user. For example, my salt is “6n93n”. Now, this little bit which is tacked on to your password is just stored on your computer as well. When you try to log in with the password X, the computer can just calculate MD5(“fa9el”+X) and see if it matches the stored hash.
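Here is a minimal sketch of salting along those lines. The record layout and function names are my own, and generating the salt with secrets.token_hex is an assumption (the text only says “some random string”).

```python
import hashlib
import hmac
import secrets


def salted_md5(salt, password):
    # MD5(salt + password), as in the examples in the text.
    return hashlib.md5((salt + password).encode("utf-8")).hexdigest()


def create_record(password):
    # A fresh random salt per user, stored next to the hash.
    salt = secrets.token_hex(8)
    return {"salt": salt, "hash": salted_md5(salt, password)}


def verify(record, attempt):
    # Recompute with the stored salt and compare against the stored hash.
    return hmac.compare_digest(record["hash"], salted_md5(record["salt"], attempt))


# Same password, different salts, different stored hashes.
mine = salted_md5("6n93n", "shittypassword")
yours = salted_md5("fa9el", "shittypassword")
print(mine == yours)  # False

record = create_record("shittypassword")
print(verify(record, "shittypassword"))  # True
print(verify(record, "somethingelse"))   # False
```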

So the basic mechanics of logging in remain unchanged, but an attacker now faces a more daunting challenge: rather than a list of MD5 hashes, they have a list of MD5 hashes and salts. They essentially have two options:

1. They can ignore the fact that the hashes are salted, and try to crack the passwords with their lookup table as is. However, the chances that they’ll actually crack a password are much reduced. For example, even if “shittypassword” is on their list of inputs to check, most likely “fa9elshittypassword” isn’t. To get back even a small fraction of their earlier chance of cracking a password, they’ll need to test orders of magnitude more possible passwords.
2. They can recalculate the hashes on a per-user basis. So rather than calculating MD5(passwordguess), for each user X they calculate MD5(Salt_of_user_X + passwordguess). Not only does this force them to calculate a new hash for each user they want to crack, but most importantly, it prevents them from using precalculated tables (like rainbow tables, for example), because they can’t know what Salt_of_user_X is beforehand, so they can’t precalculate the hashes to test.

So basically, if they are trying to use precalculated tables, a salt greatly increases the number of possible inputs they have to test in order to crack a password, and even if they aren’t using precalculated tables, it still slows them down by a factor of N, where N is the number of passwords you are storing.
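The two attacker options can be sketched like this. The leaked data, the user names, and the guess list are invented for the example. The precomputed table of unsalted hashes finds nothing, while the per-user approach works but has to redo the hashing for every user.

```python
import hashlib


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical guess list and leaked (salt, hash) pairs.
guesses = ["123456", "password", "shittypassword"]
leaked = {
    "me": ("6n93n", md5_hex("6n93n" + "shittypassword")),
    "you": ("fa9el", md5_hex("fa9el" + "shittypassword")),
}

# Option 1: look the salted hashes up in a table built without salts.
precomputed = {md5_hex(g): g for g in guesses}
option1 = {user: precomputed.get(h) for user, (salt, h) in leaked.items()}


# Option 2: recompute MD5(salt + guess) separately for each user.
def crack_per_user(leaked, guesses):
    cracked = {}
    hashes_computed = 0
    for user, (salt, target) in leaked.items():
        for guess in guesses:
            hashes_computed += 1
            if md5_hex(salt + guess) == target:
                cracked[user] = guess
                break
    return cracked, hashes_computed


option2, work = crack_per_user(leaked, guesses)
print(option1)  # {'me': None, 'you': None}
print(option2)  # {'me': 'shittypassword', 'you': 'shittypassword'}
print(work)     # 6 hashes: the 3 guesses had to be hashed again for each of the 2 users
```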

Hopefully this answers all your questions.
