Saturday, January 24, 2015

Zstandard - A stronger compression algorithm

 Zstd, short for Zstandard, is a new lossless compression algorithm, aiming at providing both great compression ratio and speed for your standard compression needs. "Standard" translates into everyday situations which neither look for highest possible ratio (which LZMA and ZPAQ cover) nor extreme speeds (which LZ4 covers).

It is provided as a BSD-license package, hosted on Github.

For a taste of its performance, here are a few benchmark numbers, completed on a Core i5-4300U @ 1.9 GHz, using fsbench 0.14.3, an open-source benchmark program by m^2.

Name            Ratio  C.speed D.speed
                        MB/s    MB/s

zlib 1.2.8 -6   3.099     18     275
zstd            2.872    201     498
zlib 1.2.8 -1   2.730     58     250
LZ4 HC r127     2.720     26    1720
QuickLZ 1.5.1b6 2.237    323     373
LZO 2.06        2.106    351     510
Snappy 1.1.0    2.091    238     964
LZ4 r127        2.084    370    1590
LZF 3.6         2.077    220     502
An interesting feature of zstd is that it can qualify as both a reasonably strong compressor and a fast one.

Zstd delivers high decompression speed, at around ~500 MB/s per core.
Obviously, your exact mileage will vary depending on your target system.

Zstd compression speed, on the other hand, can be configured to fit different situations.
The first, fast, derivative offers ~200 MB/s per core, which is suitable for a few real-time scenarios.
But similar to LZ4, Zstd can offer derivatives trading compression time for compression ratio, while keeping decompression properties intact. "Offline compression", where compression time is of little importance because the content is only compressed once and decompressed many times, is therefore within the scope.

Note that high compression derivatives still have to be developed.
It's a complex area which will certainly benefit the contributions from a few experts.

Another property Zstd is developed for is configurable memory requirement, with the objective to fit into low-memory configurations, or servers handling many connections in parallel.

On the decoding side, Zstd memory requirement is divided into 2 main parts :
  1. The entropy tables : Zstd entropy stage is handled by FSE (Finite State Entropy).
    FSE needs several transformation tables, which currently cost 10 KB.
    The intention is to make this size configurable, from a minimum of 2.5 KB to a maximum of 20 KB. This is relatively mild requirement, mostly interesting for systems with very limited memory resource.
  2. The match window size, which is basically the size of "look back buffer" decompression side must maintain in order to complete "match copy" operations.
    Basic idea is : the larger the window size, the better the compression ratio.
    However, it also increases memory requirement on the decoding side, so a trade off must be found.
    Current default window size is 512 KB, but this value will be configurable, from very small (KB) to very large (GB), in the expectation to fit multiple scenarios needs.

The compression stage needs to handle a few more memory segments, the number and size of which is highly dependent on the selected search algorithm. At a minimum, there is a need for a "look-up table", such as the one used by the "fast mode". The current default size of this table is currently selected at 128 KB, but this too will be configurable, from as low as a few KB to a few MB.
Stronger search algorithms will need more tables, hence more memory.

While such speed qualify Zstd as a "fast compressor / decompressor", it still doesn't reach LZ4 territory. Therefore, selecting which algorithm best suits your need highly depends on your speed objective.

In order to help such selection, I've reused the benchmark methodology proposed here, which adds compression, communication, and decompression time in a simplistic transmission scenario. It results in the following chart :

(click to enlarge)

As expected, using "direct raw numbers representation" results in a clogged graphic, where each compressor is difficult to distinguish. Therefore, the representation is reworked, using the same scenario and same numbers, but dynamically zooming each speed sample so that the ranking is preserved, with the best solution receiving always the relative note "1", and other following depending on their speed difference. It creates the following graph :

(click to enlarge)

which is easier to interpret.
From this table we see that LZ4 is a better choice for speeds above ~50 MB/s, while Zstd takes the lead for speeds between 0.5 MB/s and 50 MB/s. Below that point, stronger alternatives prevail.

Zstd development is starting. So consider current results merely as early ones. The implementation will gradually evolve and improve overtime, especially during this first year. This is a phase which will depend a lot on user feedback, since these feedbacks will be key in deciding next priorities or features to add.