One of the most common tasks when designing and developing a software project is storing data.
Your data is kept safe in data centers spread around the world to ensure 24/7 availability.
Multiple mechanisms exist to protect your data and keep it safe in case a hacker steals it.
Guess what: one of the most widely used is hashing.
The concept is pretty simple, yet it still leads to headaches and serious security breaches in a system.
Let’s dive into a fairly easy explanation of what hashes are, how developers use them in their daily lives, and why you should care about them.
Some years ago, sensitive data was commonly stored using encryption algorithms.
Let’s take an example: ROT13, a variant of the Caesar cipher.
The algorithm is simple: one has a private message and wants only certain people to be able to read it. Each letter of the message is rotated 13 places in the alphabet.
Take the word “Joe”. Rotating each letter by 13 gives “Wbr”.
You get it.
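Here is a minimal sketch of that rotation in Python. The function name `rot13` is mine, and I am assuming only the 26 unaccented Latin letters are rotated while everything else is left alone. Note that applying it twice gives the original message back, which is exactly why it is so easy to undo.

```python
import string


def rot13(message: str) -> str:
    """Rotate each ASCII letter 13 places; leave every other character untouched."""
    rotated = []
    for char in message:
        if char in string.ascii_lowercase:
            rotated.append(chr((ord(char) - ord("a") + 13) % 26 + ord("a")))
        elif char in string.ascii_uppercase:
            rotated.append(chr((ord(char) - ord("A") + 13) % 26 + ord("A")))
        else:
            rotated.append(char)
    return "".join(rotated)


encoded = rot13("Joe")
decoded = rot13(encoded)  # rotating by 13 twice is a full turn of 26
print(encoded)  # Wbr
print(decoded)  # Joe
```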
This was a toy example; the encryption algorithms actually used were much more complex, as you can guess.
However, you understand that any encryption algorithm is reversible by design.
If a hacker knows your encryption algorithm and gets hold of what is needed to decode it, you and your users are definitely compromised.
That’s why brilliant researchers invented hashes.
What is a hash ?
So, the goal was to prevent anyone, including hackers, from decoding sensitive data. Let’s imagine someone writes “Joe” and we have to process it in a way that won’t let anyone decode it later.
This is what hashes are intended to do.
It raises multiple questions, doesn’t it?
First things first, what does a hash look like?
Joe => 3a368818b7341d48660e8dd6c5a77dbe
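If you want to produce a hash yourself, here is a small sketch using Python’s standard `hashlib` module. I picked SHA-256 for the illustration and the helper name `hex_digest` is mine, so the printed value will be longer than the 32-character one above, which presumably came from a different hashing function.

```python
import hashlib


def hex_digest(text: str) -> str:
    """Return the SHA-256 hash of a string as hexadecimal text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


joe_hash = hex_digest("Joe")
print(joe_hash)                       # 64 hexadecimal characters
print(hex_digest("Joe") == joe_hash)  # True: same input, same hash
print(hex_digest("joe") == joe_hash)  # False: a tiny change gives another hash
```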
Second, how does one tell whether the user has written “Joe” or anything else if only the hash is stored in the database?
Well, simply process what the user typed and compare the stored hash with the newly created one.
In the database is stored:
3a368818b7341d48660e8dd6c5a77dbe
If the user writes “Joe” and the computer processes it, we get:
3a368818b7341d48660e8dd6c5a77dbe (tl;dr both hashes are the same)
Thus we are able to tell whether two inputs are the same without ever being able to read either of them in the clear.
The database only knows the hash, not the clear data associated with it. Therefore, someone who gets a look at the database cannot simply reverse the process and tell which data produced a given hash.
This is partly what makes this mechanism secure and widely implemented.
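The store-then-compare idea can be sketched in a few lines. Everything here is an assumption made for illustration: the `database` dictionary stands in for a real database, the helper names are mine, and I use plain SHA-256 to stay close to the article (real systems usually add a per-user salt and a deliberately slow function on top of this).

```python
import hashlib
import hmac


def hash_text(text: str) -> str:
    """Hash a string with SHA-256 and return hexadecimal text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical in-memory "database": it keeps only the hash, never the clear text.
database = {"stored_hash": hash_text("Joe")}


def matches_stored(candidate: str) -> bool:
    """Hash what the user typed and compare it with the stored hash."""
    # compare_digest compares the two strings in constant time.
    return hmac.compare_digest(hash_text(candidate), database["stored_hash"])


print(matches_stored("Joe"))   # True
print(matches_stored("Jane"))  # False
```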
Another crucial point about hashes is collisions.
Two hashes must not collide, for security purposes. Basically, a collision is defined as follows:
Hash(A) = Hash(B) = C, with A != B.
Two different inputs, processed by the same hashing function, must not output the same hash. (Keep this in mind until the end of this article.)
Simple use case: your password is “ABCD”, mine is something else.
A hashing function should output a different hash for each password. Otherwise, one would be able to log into another’s account with a password that is not even associated with it.
I hope you’re still there. It was easy, wasn’t it?
Hashing function details
Don’t be scared, I won’t show evil mathematics or anything like that. I will only raise one point you need to know to continue reading this article.
Among the most widely used hashing functions is the SHA family (SHA-1, SHA-256 and other variants), and SHA-256 is considered one of the most secure.
The theory is quite simple:
- There is a function F(A) that produces a hash for A.
- A is any piece of data (string, number, etc.) of arbitrary length: it can be 1 character, it can be a thousand. (Even an empty input has a well-defined hash.)
- For two inputs A and B such that A and B are different, F(A) should be different from F(B).
- The resulting hash produced by F has a fixed length, independent of the input.
As an example, whatever the input to the SHA-256 function, the resulting hash will ALWAYS be 256 bits long.
(KEEP IT IN MIND !!!)
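You don’t have to take my word for it. Here is a quick check with Python’s `hashlib`; the sample inputs are arbitrary choices of mine.

```python
import hashlib

# Inputs of very different lengths (arbitrary examples).
inputs = ["a", "Joe", "x" * 1000]

digest_bits = set()
for text in inputs:
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # raw bytes
    bits = len(digest) * 8
    digest_bits.add(bits)
    print(f"input of {len(text)} characters -> hash of {bits} bits")

print(digest_bits)  # {256}
```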
So your data is secured by hashes but… which data is secured ?
If it were your name, you wouldn’t be able to see “Connected as John Doe”, because the database doesn’t know “John Doe” but only the hash associated with it.
Most of the time, hashes protect data that only ever needs to be checked and never displayed, such as your credentials.
Everyone agrees, it’s OK.
If you are still reading this article, the password associated with your Medium account is presumably (correct me if I’m wrong) stored as a hash in Medium’s database(s). However, your first name and last name are not.
One last crucial example is the blockchain.
Tons of articles cover this topic better than I can, but just keep in mind that the blocks of a blockchain identify each other by a hash. Thus, block A has a hash, block B has a hash, and A references B through B’s hash.
This works because hashes are considered unique (remember? a hashing function should be collision-free), so two different blocks should never end up with the same identifier.
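To make the idea concrete, here is a toy sketch of blocks referencing each other by hash. It is nowhere near a real blockchain: the `Block` class, the way I join the fields before hashing and the all-zero placeholder for the first block are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    data: str
    previous_hash: str  # hash of the block this one references

    def block_hash(self) -> str:
        payload = f"{self.previous_hash}|{self.data}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def build_chain(items: list[str]) -> list[Block]:
    """Create blocks where each one stores the hash of the one before it."""
    chain: list[Block] = []
    previous_hash = "0" * 64  # placeholder: the first block references nothing
    for item in items:
        block = Block(data=item, previous_hash=previous_hash)
        chain.append(block)
        previous_hash = block.block_hash()
    return chain


def is_consistent(chain: list[Block]) -> bool:
    """Check that every stored reference matches the referenced block's hash."""
    return all(
        later.previous_hash == earlier.block_hash()
        for earlier, later in zip(chain, chain[1:])
    )


chain = build_chain(["block A", "block B", "block C"])
print(is_consistent(chain))     # True

# Changing the content of a block changes its hash, so the reference breaks.
tampered = [Block("block A (edited)", chain[0].previous_hash)] + chain[1:]
print(is_consistent(tampered))  # False
```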
What’s wrong with hashes then ?
So the most secure algorithm we know for securing your data and making it unreadable takes any possible input data and turns it into a fixed-length hash.
Can you see what is wrong?
We have a potentially infinite set of possible inputs, all different. We only have 256 bits (for SHA-256) to represent them all.
Actually, with 256 bits we can only represent 2²⁵⁶ values, so at least two different inputs must end up with the same hash.
Of course it’s a lot, but we generate 2.5 quintillion bytes of data each day, on average.
This represents 2,500,000 terabytes per day. (And you thought 2 terabytes was huge?)
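If you like to double-check numbers, here is the arithmetic behind those figures. I am assuming decimal units (1 terabyte = 10¹² bytes), and the variable names are mine.

```python
# Number of distinct values a 256-bit hash can take.
possible_hashes = 2 ** 256
print(possible_hashes)
print(len(str(possible_hashes)))  # 78 digits
print(f"{possible_hashes:.3e}")   # 1.158e+77

# 2.5 quintillion bytes per day, converted to terabytes (10**12 bytes each).
bytes_per_day = 25 * 10 ** 17
terabytes_per_day = bytes_per_day // 10 ** 12
print(f"{terabytes_per_day:,}")   # 2,500,000
```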
Can you still believe that, with this amount of data, hashing functions are protected against collisions? Is your data safe, then? How can a blockchain possibly be secure if it relies on hashes to identify blocks?
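To see a collision with your own eyes, here is a toy experiment. I cut SHA-256 down to its first 16 bits (my choice, purely so that the search finishes quickly) and try inputs until two of them share the same shortened hash. With only 65,536 possible values, a collision is guaranteed to show up. This only illustrates the principle of squeezing many inputs into a fixed-length output; it says nothing about how hard collisions are to find on the full 256 bits.

```python
import hashlib
from itertools import count


def truncated_hash(text: str, bits: int = 16) -> int:
    """Keep only the first `bits` bits of the SHA-256 hash (toy hash, not secure)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_collision(bits: int = 16) -> tuple[str, str]:
    """Try inputs one by one until two different ones share a truncated hash."""
    seen: dict[int, str] = {}
    for number in count():
        candidate = f"input-{number}"  # arbitrary, all-different inputs
        value = truncated_hash(candidate, bits)
        if value in seen:
            return seen[value], candidate
        seen[value] = candidate


first, second = find_collision()
print(first, second)
print(first != second)                                   # True
print(truncated_hash(first) == truncated_hash(second))   # True
```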
The truth is, very popular hashing functions have already been broken: MD4, MD5 and SHA-1. Believe it or not, they were once as widely used as SHA-256.
You get it once again.
It’s a major concern, and we should already start looking for better security mechanisms.
Reaching the end of this article, I won’t come up with a “tada” solution that would amaze you. However, I still believe improvements are being made and experts are working on the issue.
Searching for “Hash | SHA algorithm | Encryption” on Google will lead you to very interesting articles on this topic.
I hope everything was understandable and clear.
Feel free to reach out at firstname.lastname@example.org ☕
Fullstack Developer, Trainer & Entrepreneur.
Learning stuff, sharing knowledge and building on top of great ideas are my top priorities.