While reading the Amazon S3 API documentation to find out how Amazon S3 checks the integrity of objects, that is, how it verifies that the data received is the same data that was originally sent, you might come across the statement "The base64-encoded 128-bit MD5 digest of the message". It may not make much sense at first, or you may wonder how this differs from the MD5 hash you can calculate with standard command-line tools.
It is easy and relatively fast to calculate the MD5 sum of a file.
$ touch hello.txt
$ md5 hello.txt
MD5 (hello.txt) = d41d8cd98f00b204e9800998ecf8427e
What is MD5?
The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32 digit hexadecimal number. Regardless of the file size, an md5 hash is always 128 bits.
The phrase "128-bit MD5 digest" in the Amazon S3 documentation simply means that the MD5 digest is 128 bits long, as defined in the RFC (RFC 1321).
What is Hexadecimal?
In mathematics and computing, hexadecimal (also base 16, or hex) is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F (or alternatively a, b, c, d, e, f) to represent values ten to fifteen.
Examples for Hexadecimal (base 16) to decimal (base 10) conversion
0 -> 0     A -> 10     20 -> 32
1 -> 1     B -> 11     21 -> 33
2 -> 2     C -> 12     30 -> 48
3 -> 3     D -> 13     90 -> 144
4 -> 4     E -> 14     A0 -> 160
5 -> 5     F -> 15     F0 -> 240
6 -> 6     10 -> 16    F1 -> 241
7 -> 7     11 -> 17    F2 -> 242
8 -> 8     12 -> 18    FA -> 250
9 -> 9     13 -> 19    FF -> 255
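Conversions like these are easy to check programmatically. The following is a minimal sketch in Python (an assumption on my part; the article itself only uses shell commands), relying on the built-in int() and format() functions. The variable names are illustrative.

```python
# Hexadecimal text to decimal: int() accepts the base as its second argument.
samples = ["A", "10", "20", "90", "A0", "FF"]
values = {h: int(h, 16) for h in samples}
for h, d in values.items():
    print(h, "->", d)

# Decimal back to hexadecimal text, and the bit pattern of the largest byte.
hex_of_255 = format(255, "X")      # two hex digits
bits_of_255 = format(255, "08b")   # eight binary digits
print(hex_of_255, bits_of_255)
```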
MD5 Hash and Hexadecimal
An MD5 hash has 128 bits, which is 16 bytes. The largest decimal value that 1 byte (8 bits) can hold is 255. From the chart above, we know that hexadecimal FF also represents 255 (16 x 15 + 15).
We need a 2-digit hexadecimal number (FF) to represent the maximum value of a byte, which is 1111 1111 in binary. To represent 16 bytes (128 bits), we therefore need a 32-digit hexadecimal number. If you go back and check the output of the md5 command above, you will see that it is exactly 32 digits long.
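The relationship between the 16 bytes and the 32 hexadecimal digits can be checked with a short sketch. It assumes Python and its standard hashlib module, neither of which the article itself uses, and it hashes empty input to mirror the empty hello.txt created earlier.

```python
import hashlib

# MD5 of empty input, the same content as a file created with `touch`.
md5 = hashlib.md5(b"")

raw = md5.digest()      # the 128-bit value as raw bytes
text = md5.hexdigest()  # the same value written as hexadecimal text

print(len(raw))   # number of bytes in the digest
print(len(text))  # number of hexadecimal digits, two per byte
print(text)
```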
Base64 encoding is used to represent binary data as an ASCII string, and it is commonly used in HTTP requests and headers. See the Wikipedia page on Base64 for the interesting details of how groups of 6 bits are converted to individual numbers and how padding is applied when the number of bytes is not divisible by three.
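To see the padding rule in action, here is a small sketch using Python's standard base64 module (Python is an assumption; the article only uses shell commands). The inputs are arbitrary sample bytes, and the last one is a 16-byte placeholder of zeros standing in for an MD5 digest.

```python
import base64

# 3 bytes = 24 bits = four 6-bit groups, so no padding is needed.
three = base64.b64encode(b"Man").decode("ascii")
# 2 bytes leave the last group incomplete: one '=' is appended.
two = base64.b64encode(b"Ma").decode("ascii")
# 1 byte leaves even fewer bits: two '=' characters are appended.
one = base64.b64encode(b"M").decode("ascii")
print(three, two, one)

# 16 bytes (the size of an MD5 digest) is not divisible by three,
# so the encoded form always ends with '=='.
sixteen = base64.b64encode(bytes(16)).decode("ascii")
print(len(sixteen), sixteen)
```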
Calculating Base64-Encoded MD5
The md5 command outputs a 32-digit hexadecimal string as ASCII text, and the base64 command base64-encodes whatever bytes it receives as input.
Check the command below and try to figure out why it is not going to produce what we are really looking for.
$ echo -n "hello world" | md5
5eb63bbbe01eeed093cb22bb8f5acdc3
$ echo -n "hello world" | md5 | base64
NWViNjNiYmJlMDFlZWVkMDkzY2IyMmJiOGY1YWNkYzMK
Here is why. We need the binary value of the MD5 digest, but the md5 command gives us ASCII text representing the hexadecimal value, so base64 ends up encoding those 32 characters (plus a trailing newline) rather than the 16 bytes they stand for. Since base64 is meant to represent binary data in ASCII, we have to feed it the binary value of the md5 result. Below you can find the command that gives the correct base64-encoded MD5 hash.
$ echo -n "hello world" | openssl dgst -md5 -binary | openssl enc -base64
XrY7u+Ae7tCTyyK7j1rNww==
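The difference between the two pipelines can also be reproduced side by side. This is a sketch in Python using the standard hashlib and base64 modules, which are my choice and not something the article uses; the variable names are illustrative.

```python
import base64
import hashlib

data = b"hello world"

# What `md5` prints: 32 ASCII characters (the command also appends a newline).
hex_text = hashlib.md5(data).hexdigest()
# What `openssl dgst -md5 -binary` emits: the 16 raw digest bytes.
raw_bytes = hashlib.md5(data).digest()

# Encoding the hexadecimal text reproduces the wrong, 44-character result.
wrong = base64.b64encode((hex_text + "\n").encode("ascii")).decode("ascii")
# Encoding the raw bytes gives the 24-character value we are after.
right = base64.b64encode(raw_bytes).decode("ascii")

print(wrong)
print(right)
```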
What if you already have MD5 sums calculated for a bunch of files and don’t want to recalculate them, but just convert them to base64 instead?
The xxd command can convert a hexadecimal string to its binary value, and you can then base64-encode that binary value as seen below.
# this is something you did before for many files
$ echo -n "hello world" | md5
5eb63bbbe01eeed093cb22bb8f5acdc3
# this is something you are doing now for each file
$ (echo 0:; echo 5eb63bbbe01eeed093cb22bb8f5acdc3) | xxd -rp -l 16 | base64
XrY7u+Ae7tCTyyK7j1rNww==
To verify the output, note that the base64-encoded text above matches the one we found using openssl.
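If the stored MD5 sums are processed in a script rather than a shell, the same conversion might look like the sketch below. It assumes Python; md5_hex_to_base64 is a hypothetical helper name, not part of any library.

```python
import base64


def md5_hex_to_base64(hex_digest: str) -> str:
    """Convert a 32-digit hexadecimal MD5 string to its Base64 form."""
    # bytes.fromhex turns each pair of hex digits back into one byte.
    raw = bytes.fromhex(hex_digest.strip())
    if len(raw) != 16:
        raise ValueError("expected a 32-digit hexadecimal MD5 digest")
    return base64.b64encode(raw).decode("ascii")


converted = md5_hex_to_base64("5eb63bbbe01eeed093cb22bb8f5acdc3")
print(converted)
```

The length check is a design choice to catch truncated or non-MD5 input early; drop it if you store other digest sizes.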
The statement "base64-encoded 128-bit MD5 digest" should be clear now.
Depending on the SDK you use, content-md5 may be calculated automatically for you if you set the computeChecksums option to true. If you choose to use the PUT API directly, you will have to calculate the value yourself before passing it as an option for the API call.
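Calculating the value yourself could look roughly like the following sketch. It assumes Python; content_md5 is a hypothetical helper, the in-memory stream stands in for an opened file, and the code only builds the header value without sending any request.

```python
import base64
import hashlib
import io


def content_md5(stream, chunk_size=1024 * 1024):
    """Return the Base64-encoded MD5 digest of a binary stream."""
    md5 = hashlib.md5()
    # Read in chunks so that large objects need not fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
    # Encode the raw 16 bytes, not the hexadecimal text.
    return base64.b64encode(md5.digest()).decode("ascii")


body = io.BytesIO(b"hello world")  # stand-in for open(path, "rb")
headers = {"Content-MD5": content_md5(body)}
print(headers)
```

How the header is attached to the request depends on the HTTP client or SDK you use, so that part is left out here.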
If you want to keep the MD5 hash of the file in Amazon S3 along with the object, you can store it in the object's user metadata. If you choose to do so, you don’t have to calculate the base64 value; you can pass the MD5 hash as it is if you like. Since you will be the one interpreting the user metadata, it is up to you to decide which format or encoding to store it in.