Data Compression Algorithms

There are a ton of compression algorithms out there. What you need here is a lossless one. A lossless algorithm compresses data so that decompressing it reproduces exactly the bytes you started with, bit for bit. The opposite is a lossy algorithm, which permanently discards some of the information to shrink the data further. PNG images use lossless compression, while JPEG images can and often do use lossy compression.
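
You can see the lossless round trip for yourself in a few lines of Python. This is just a sketch using the standard zlib module (DEFLATE underneath); the point is that the decompressed bytes are identical to the originals.

    import zlib

    original = b"the quick brown fox jumps over the lazy dog" * 20

    compressed = zlib.compress(original)   # lossless: nothing is thrown away
    restored = zlib.decompress(compressed)

    assert restored == original            # byte-for-byte identical
    print(len(original), "->", len(compressed), "bytes")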

Some of the most widely known compression algorithms include:

  • RLE (run-length encoding)
  • Huffman coding
  • LZ77

ZIP archives typically use DEFLATE, which combines LZ77 and Huffman coding to give fast compression and decompression times and reasonably good compression ratios.
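
If you want to poke at that, here's a quick sketch using Python's standard zipfile module; the names demo.zip and hello.txt are just placeholders. ZIP_DEFLATED selects DEFLATE, i.e. the LZ77 + Huffman combo.

    import zipfile

    # Write an entry with DEFLATE (LZ77 + Huffman coding), then inspect it.
    with zipfile.ZipFile("demo.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("hello.txt", "hello " * 1000)

    with zipfile.ZipFile("demo.zip") as zf:
        info = zf.getinfo("hello.txt")
        print(info.file_size, "->", info.compress_size, "bytes")
        assert zf.read("hello.txt") == b"hello " * 1000   # round trip is exact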

LZ77 is pretty much a generalized form of RLE: instead of only collapsing runs of a single repeated byte, it encodes back-references to any earlier stretch of the data, so it will often yield much better results.
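
To make that concrete, here's a toy RLE codec (a sketch; the function names are mine). RLE can only say "repeat the previous byte N times", while LZ77 generalizes that to "copy N bytes starting D positions back", of which a run is just the special case D = 1.

    def rle_encode(data: bytes) -> list:
        """Collapse runs of identical bytes into (count, byte) pairs."""
        runs = []
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1                      # extend the current run
            runs.append((j - i, data[i]))
            i = j
        return runs

    def rle_decode(runs) -> bytes:
        return b"".join(bytes([byte]) * count for count, byte in runs)

    runs = rle_encode(b"aaaaaaaabbbbbcccdd")
    print(runs)   # [(8, 97), (5, 98), (3, 99), (2, 100)]
    assert rle_decode(runs) == b"aaaaaaaabbbbbcccdd"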

Huffman coding assigns the shortest bit sequences to the bytes that occur most often.
Imagine a text file that looked like this:

aaaaaaaabbbbbcccdd

A typical implementation of Huffman would result in the following map:

Bits Character
   0         a
  10         b
 110         c
 111         d

So the file would be compressed to this:

00000000 10101010 10110110 11011111 10000000
                                     ^^^^^^^
                        Padding bits required

18 bytes go down to 5. Of course, the code table must be included in the file as well, and that fixed overhead is why this algorithm works better with more data 😛
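
If you're curious how a table like that gets built, here's a minimal sketch of the classic greedy construction using a min-heap (huffman_codes is my own name for it): repeatedly merge the two least frequent subtrees, prefixing one more bit onto every code inside them. Tie-breaking can flip individual codes (you might get c = 111 and d = 110), but the code lengths, and therefore the compressed size, come out the same.

    import heapq
    from collections import Counter

    def huffman_codes(data: bytes) -> dict:
        """Map each byte to its Huffman code string, e.g. {97: "0", ...}."""
        # Heap entries: (frequency, tiebreaker, {byte: code}). The tiebreaker
        # keeps equal frequencies from ever comparing the dicts.
        heap = [(freq, i, {byte: ""})
                for i, (byte, freq) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {b: "0" + code for b, code in c1.items()}
            merged.update({b: "1" + code for b, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    data = b"aaaaaaaabbbbbcccdd"
    codes = huffman_codes(data)
    for byte in sorted(codes, key=lambda b: (len(codes[b]), codes[b])):
        print(chr(byte), codes[byte])
    bits = sum(len(codes[b]) for b in data)
    print(bits, "bits ->", (bits + 7) // 8, "bytes")   # 33 bits -> 5 bytes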

Alex Allain has a nice article on the Huffman Compression Algorithm in case the Wikipedia article doesn’t suffice.

Feel free to ask for more information. This topic is pretty darn wide.
