Spark SQL – differences between the gzip, snappy, and lzo compression formats

Compression Ratio :
GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio.

General Usage :
GZip is often a good choice for cold data, which is accessed infrequently.
Snappy or LZO are a better choice for hot data, which is accessed frequently.
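
As a rough illustration of that rule of thumb, here is a minimal PySpark-style sketch. The helper name `write_parquet` and its `hot` flag are hypothetical, not part of Spark; the only Spark calls used are `DataFrameWriter.mode`, `.option("compression", ...)`, and `.parquet`. It assumes `df` is an existing Spark DataFrame and that the data is written as Parquet.

```python
def write_parquet(df, path, hot):
    """Write `df` as Parquet, picking the codec from the access pattern.

    `hot` is a hypothetical flag: True for frequently read data,
    False for cold data that is rarely read.
    """
    # Snappy favours speed for hot data; gzip favours size for cold data.
    codec = "snappy" if hot else "gzip"
    df.write.mode("overwrite").option("compression", codec).parquet(path)
    return codec
```

The same choice can also be made session-wide through the `spark.sql.parquet.compression.codec` setting instead of per write.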

Snappy often performs better than LZO. It is worth running tests to see if you detect a significant difference.
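
A small timing harness is one way to run such a test. The sketch below is only an assumption about how you might set it up: it uses DEFLATE (the algorithm behind gzip) and bzip2 from the Python standard library as stand-ins, because Snappy and LZO bindings are third-party packages. The names `benchmark`, `CODECS`, and `PAYLOAD` are made up for illustration, and the payload is synthetic, so swap in a sample of your own data before drawing conclusions.

```python
import bz2
import time
import zlib


def benchmark(codecs, payload, repeats=3):
    """Return {name: (ratio, compress_seconds, decompress_seconds)}."""
    results = {}
    for name, (compress, decompress) in codecs.items():
        best_c = best_d = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            blob = compress(payload)
            t1 = time.perf_counter()
            restored = decompress(blob)
            t2 = time.perf_counter()
            # Keep the fastest run to reduce noise from other processes.
            best_c = min(best_c, t1 - t0)
            best_d = min(best_d, t2 - t1)
        if restored != payload:
            raise ValueError(f"{name} did not round-trip")
        results[name] = (len(payload) / len(blob), best_c, best_d)
    return results


# Standard-library stand-ins. If Snappy or LZO bindings are installed,
# add them here as (compress, decompress) pairs.
CODECS = {
    "deflate-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
    "deflate-9": (lambda b: zlib.compress(b, 9), zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

# Synthetic, highly repetitive payload; replace with real sample data.
PAYLOAD = b"user_id,event,timestamp\n" + b"42,click,2020-01-01T00:00:00\n" * 20000

if __name__ == "__main__":
    for name, (ratio, c, d) in benchmark(CODECS, PAYLOAD).items():
        print(f"{name:10s} ratio={ratio:7.1f} "
              f"compress={c * 1e3:8.2f} ms decompress={d * 1e3:8.2f} ms")
```

Comparing the ratio column against the two timing columns shows the size versus CPU trade-off directly for your data.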

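Splittability :
If you need your compressed data to be splittable, BZip2 is splittable and LZO is splittable once the file has been indexed, but a plain GZip or Snappy file is not. Inside a container format such as Parquet, ORC, Avro, or SequenceFile, blocks are compressed independently, so the file stays splittable whichever codec you choose.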
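
To see why a plain GZip file cannot be split, the sketch below tries to start decoding in the middle of a gzip stream using only the Python standard library. The helper name `can_start_at` and the sample records are made up for illustration. This is what a second task would face if it were handed the second half of a `.gz` file.

```python
import gzip
import zlib


def can_start_at(blob, offset):
    """Return True if a gzip decoder can begin decoding at `offset`."""
    decoder = zlib.decompressobj(wbits=31)  # wbits=31 expects a gzip header
    try:
        decoder.decompress(blob[offset:])
        return True
    except zlib.error:
        # No gzip header here, so the decoder cannot start mid-stream.
        return False


# Synthetic records, compressed as one gzip stream.
RECORDS = b"".join(b"row-%d,some,values\n" % i for i in range(50000))
BLOB = gzip.compress(RECORDS)

if __name__ == "__main__":
    print("start of file:", can_start_at(BLOB, 0))
    print("middle of file:", can_start_at(BLOB, len(BLOB) // 2))
```

Decoding only works from the first byte, which is why a single large gzip file ends up being read by one task.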

GZIP compresses data about 30% more than Snappy, but reading GZIP data uses roughly 2x more CPU than reading Snappy data.

LZO focuses on decompression speed at low CPU usage, and can reach higher compression at the cost of more CPU.

For longer-term or static storage, GZip compression is still the better choice.

See extensive research and benchmark code and results in this article (Performance of various general compression algorithms – some of them are unbelievably fast!).
