RealTime Data Compression: Zstandard - A stronger compression algorithm

Saturday, January 24, 2015

Zstandard - A stronger compression algorithm

Zstd, short for Zstandard, is a new lossless compression algorithm that aims to provide both a great compression ratio and great speed for your standard compression needs. "Standard" here means everyday situations that call for neither the highest possible ratio (which LZMA and ZPAQ cover) nor extreme speed (which LZ4 covers).

It is provided as a BSD-licensed package, hosted on GitHub.

For a taste of its performance, here are a few benchmark numbers, measured on a Core i5-4300U @ 1.9 GHz, using fsbench 0.14.3, an open-source benchmark program by m^2.

Name              Ratio   C.speed   D.speed
                             MB/s      MB/s
zlib 1.2.8 -6     3.099        18       275
zstd              2.872       201       498
zlib 1.2.8 -1     2.730        58       250
LZ4 HC r127       2.720        26      1720
QuickLZ 1.5.1b6   2.237       323       373
LZO 2.06          2.106       351       510
Snappy 1.1.0      2.091       238       964
LZ4 r127          2.084       370      1590
LZF 3.6           2.077       220       502

An interesting feature of zstd is that it can qualify as both a reasonably strong compressor and a fast one.

Zstd delivers high decompression speed, at around 500 MB/s per core.
Obviously, your exact mileage will vary depending on your target system.

Zstd's compression speed, on the other hand, can be configured to fit different situations.
The first, fast derivative offers around 200 MB/s per core, which is suitable for a few real-time scenarios.
But, like LZ4, Zstd can offer derivatives that trade compression time for compression ratio while keeping decompression properties intact. "Offline compression", where compression time is of little importance because the content is compressed only once and decompressed many times, is therefore within scope.

Note that the high-compression derivatives still have to be developed.
It's a complex area which will certainly benefit from the contributions of a few experts.

Another property Zstd is being developed for is a configurable memory requirement, with the objective of fitting into low-memory configurations, or servers handling many connections in parallel.

On the decoding side, Zstd's memory requirement is divided into two main parts:

1. The entropy tables: Zstd's entropy stage is handled by FSE (Finite State Entropy).
FSE needs several transformation tables, which currently cost 10 KB.
The intention is to make this size configurable, from a minimum of 2.5 KB to a maximum of 20 KB. This is a relatively mild requirement, mostly of interest for systems with very limited memory.

2. The match window size, which is basically the size of the "look-back buffer" that the decompression side must maintain in order to complete "match copy" operations.
The basic idea is: the larger the window, the better the compression ratio.
However, a larger window also increases the memory requirement on the decoding side, so a trade-off must be found.
The current default window size is 512 KB, but this value will be configurable, from very small (KB) to very large (GB), in order to fit the needs of multiple scenarios.
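
To make the decoder-side budget concrete, here is a minimal sketch that simply adds the two parts listed above. The 10 KB and 512 KB defaults and the 2.5 KB to 20 KB range are the figures quoted in this post; the function names and the 1000-connection server are hypothetical, chosen only for illustration, and a real decoder may need a little more than this sum.

```python
# Figures quoted in the post (early 2015 defaults); everything else is illustrative.
ENTROPY_TABLES_KB = 10   # current cost of the FSE transformation tables
WINDOW_KB = 512          # current default match window


def decoder_memory_kb(window_kb=WINDOW_KB, entropy_kb=ENTROPY_TABLES_KB):
    """Approximate per-stream decoder memory: match window + entropy tables."""
    if not 2.5 <= entropy_kb <= 20:
        raise ValueError("entropy tables are planned to range from 2.5 KB to 20 KB")
    if window_kb <= 0:
        raise ValueError("window size must be positive")
    return window_kb + entropy_kb


def fleet_memory_mb(streams, **kwargs):
    """Decoder memory for `streams` parallel connections, in MB (1 MB = 1024 KB)."""
    return streams * decoder_memory_kb(**kwargs) / 1024


if __name__ == "__main__":
    print(decoder_memory_kb())                                  # one stream, defaults
    print(fleet_memory_mb(1000))                                # hypothetical busy server
    print(fleet_memory_mb(1000, window_kb=64, entropy_kb=2.5))  # hypothetical small setup
```

The window dominates the total, which is why making it configurable matters most for servers handling many connections.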

The compression stage needs to handle a few more memory segments, the number and size of which are highly dependent on the selected search algorithm. At a minimum, there is a need for a "look-up table", such as the one used by the "fast mode". The default size of this table is currently 128 KB, but this too will be configurable, from as low as a few KB to a few MB.
Stronger search algorithms will need more tables, hence more memory.

While such speeds qualify Zstd as a "fast compressor / decompressor", it still doesn't reach LZ4 territory. Therefore, selecting which algorithm best suits your needs depends heavily on your speed objective.

In order to help such selection, I've reused the benchmark methodology proposed here, which adds compression, communication, and decompression time in a simplistic transmission scenario. It results in the following chart :

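[Chart: total time (compression + communication + decompression) against transmission speed, one curve per compressor]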

As expected, using a "direct raw numbers representation" results in a cluttered graphic, where the compressors are difficult to distinguish from one another. Therefore, the representation is reworked, using the same scenario and the same numbers, but dynamically zooming each speed sample so that the ranking is preserved, with the best solution always receiving the relative score "1", and the others following according to their speed difference. It creates the following graph :

[Chart: relative ranking against transmission speed, with the best compressor at each speed normalized to 1]

which is easier to interpret.

From this graph we see that LZ4 is a better choice for speeds above ~50 MB/s, while Zstd takes the lead for speeds between 0.5 MB/s and 50 MB/s. Below that point, stronger alternatives prevail.
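
The transmission scenario can be approximated with a short sketch. It assumes the simplest possible model: compression, transfer, and decompression happen one after another with no overlap, so the time per MB of original data is 1/C.speed + 1/(ratio x link speed) + 1/D.speed. That formula and the function names are assumptions made for illustration, not necessarily the exact methodology behind the charts; the codec figures are the ones from the benchmark table above.

```python
# (name, compression ratio, compression MB/s, decompression MB/s), from the table above
CODECS = [
    ("zlib 1.2.8 -6", 3.099, 18, 275),
    ("zstd", 2.872, 201, 498),
    ("zlib 1.2.8 -1", 2.730, 58, 250),
    ("LZ4 HC r127", 2.720, 26, 1720),
    ("QuickLZ 1.5.1b6", 2.237, 323, 373),
    ("LZO 2.06", 2.106, 351, 510),
    ("Snappy 1.1.0", 2.091, 238, 964),
    ("LZ4 r127", 2.084, 370, 1590),
    ("LZF 3.6", 2.077, 220, 502),
]


def seconds_per_mb(ratio, c_speed, d_speed, link_mb_s):
    """Time to compress, send, and decompress 1 MB of original data, with no overlap."""
    return 1 / c_speed + (1 / ratio) / link_mb_s + 1 / d_speed


def best_codec(link_mb_s):
    """Name of the codec with the lowest total time at the given link speed (MB/s)."""
    return min(CODECS, key=lambda c: seconds_per_mb(c[1], c[2], c[3], link_mb_s))[0]


if __name__ == "__main__":
    for link in (0.1, 1, 10, 100, 1000):
        print(f"{link:>6} MB/s link -> {best_codec(link)}")
```

Under this simplified model the ordering agrees with the graph: LZ4 wins on fast links, Zstd in the middle range, and the strongest compressor in the table on very slow links. The exact hand-over points depend on the model, so they need not match the chart precisely.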

Zstd development is just starting, so consider the current results merely as early ones. The implementation will gradually evolve and improve over time, especially during this first year. This phase will depend a lot on user feedback, since that feedback will be key in deciding the next priorities or features to add.

59 comments:

Zeev, January 24, 2015 at 2:27 PM
Could you please compare against https://code.google.com/p/lzham/ ? Both in performance and in technical approach/tradeoffs.

Rich Geldreich, January 24, 2015 at 8:52 PM
Very interesting! Looks like ZSTD will leapfrog several other codecs.

LZHAM is targeting very high ratios (roughly comparable to LZMA), so ZSTD and LZHAM don't overlap.

I need to switch LZHAM to FSE, its decoder looks incredibly simple.

Anonymous, January 26, 2015 at 12:58 PM
How does it compare against brotli? Does it need a longer data set to get started or is it competitive with short files, too?

Anonymous, January 26, 2015 at 10:10 PM
Could you please compare against https://blogs.oracle.com/timc/entry/tamp_a_lightweight_multi_threaded ? Both in performance and in technical approach/tradeoffs.

Anonymous, January 27, 2015 at 1:52 AM
I just tried this on a 100,000,000 byte file of random noise, represented as digits 0..9. The noise should be incompressible, but the encoding itself should allow it to be reduced to about log2(10)/8 * 100000000 = 41524102 bytes in the best case (e.g., with a hand-written reencoder).

On an old/slow machine, zstd knocked the file down to 41557511 bytes in 1.3 seconds. gzip took longer, fast (-1) setting took 4.2s and was 49698347 bytes, whereas the slow (-9) setting took 14.7 and was 46902944. xz took a staggering 188.1 seconds and only knocked it down to 44094460.

For my application, this is just the compression algorithm I need. Thank you!
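
The bound quoted in this comment can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, and the sizes are the ones reported above.

```python
import math

FILE_BYTES = 100_000_000   # one decimal digit stored per byte
ZSTD_BYTES = 41_557_511    # compressed size reported in the comment

# Each digit carries log2(10) bits of information but occupies 8 bits on disk.
bound = math.ceil(FILE_BYTES * math.log2(10) / 8)
overhead = ZSTD_BYTES - bound

if __name__ == "__main__":
    print(bound)                                 # ideal size in bytes, rounded up
    print(overhead, f"{overhead / bound:.4%}")   # distance of zstd from that ideal
```

The reported zstd output sits well under 0.1% above the ideal size for this kind of data.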

Unknown, January 27, 2015 at 8:37 PM
Hello,

I have hit a stack smashing problem in the call to FSE_adjustNormSlow() where the while(pointsToRemove) loop could write higher than the size of the rank[] array.

I had changed FSE_MAX_SYMBOL_VALUE to 1024 to make it successfully work, instead of the original value of 255 but I don't know what could have changed this way (or not).

I did not investigate much further.

Unknown, January 29, 2015 at 7:10 PM
First time using github / gist here, here is a gist of the buffer encoded in base64.

https://gist.github.com/insonator/4efb72572a11345c143a

Unknown, January 29, 2015 at 7:12 PM
This is a .zip file encoded in base64, when unzipped, is the actual 32K buffer, sorry for the confusion

Unknown, January 29, 2015 at 10:12 PM
In Cygwin, or anywhere you have access to GNU coreutils, you can run:

    base64 --ignore-garbage -d fse_crash.txt >fse_crash.zip

where fse_crash.txt holds the base64 content from https://gist.github.com/insonator/4efb72572a11345c143a

Anonymous, January 30, 2015 at 7:06 PM
I'm *very* excited about this! (esp. about the HC derivative)

Anonymous, February 10, 2015 at 6:31 PM
Hi, pretty interesting. Does Zstd use a framing format the same or similar as LZ4? At this time, is the Zstd binary stream format finished (will not change)? Thanks

Dimitri, February 13, 2015 at 4:35 PM
Hi Yann,
Your project is really interesting.
I'm looking for a small-footprint compression library. If I understand your article correctly, it is possible to fit the algorithm in a few KB of RAM.
Do you think it is possible to do compression on a 32-bit MCU with only 32 KB of RAM?
Thanks
Dimitri

Dimitri, February 16, 2015 at 10:38 AM
Hi Yann,

Thank you for your quick reply!

I tried reducing some of the define values, such as BLOCKSIZE and g_maxDistance.
With the zstdcli.c example, the final data size was 20% of my original file.

I changed BLOCKSIZE to 2 KB and 4 KB; the results were approximately the same as with the default values. g_maxDistance was fixed to 2x BLOCKSIZE.
Note that changing BLOCKSIZE to less than 2 KB deteriorates the compression ratio.

For now, I only need the compression part of the library, so I don't think buffer optimization is a priority for me. But it would be great for future uses.

Another limitation of Zstd "as it is" for embedded targets is the use of malloc in the library.

Dimitri

Sanmayce, March 10, 2015 at 8:59 PM
Hi Yann,
few thoughts about your superb tools.

First, looking on some benchmarks of your tools I see you achieve unthinkable (for me at least) speeds, WOW! I have few dummy ideas (using much RAM) to speed up my atrociously slow LZ compression but I couldn't and didn't allow myself to dream about speeds you have achieved.

I am confused what tool to compare my 'Loonette-Hoshimi' to. Is it zhuff or zstd?
Speaking of Decompression Speed (at Ratio 2.8:1) for enwik8 I see no rival to Nakamichi 'Loonette-Hoshimi' except zhuff in Dr. Matt Mahoney's LTCB roster.

zhuff 0.97 beta -c2 34,907,478
Nakamichi 'Loonette-Hoshimi' 34,968,896

On Asus laptop with Core 2 Q9550s @2883MHz, HDD Samsung HM641JI, Windows 7 decompression speed is 1,000,000,000/(249*1024*1024)=3.8 ns/byte:

D:\_KAZE\GETCC_General_English_Texts_Compression_Corpus>\timer32.exe Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe enwik8.Nakamichi
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 34968896 bytes ...
RAM-to-RAM performance: 249 MB/s.

Kernel Time = 0.234 = 12%
User Time = 0.327 = 16%
Process Time = 0.561 = 28% Virtual Memory = 1148 MB
Global Time = 1.948 = 100% Physical Memory = 131 MB

AFAIU your zstd is the successor of your promising "étincelle", and if Zstd is another ongoing/unfinished project let me suggest a name for your future tool, "étoile". For those who don't get it, the "spark" evolves to "star".
The context comes from the immortal Antoine de Saint-Exupéry:
'Et, s'il vous arrive de passer par là, je vous supplie, ne vous pressez pas, attendez un peu juste sous l'étoile!'
'And, if you should come upon this spot, please do not hurry on. Wait for a time, exactly under the star.'

My French is as my Japanese, with vocabulary of only several hundreds of words, yet I enter "coinage mode" whenever I can't find the needed word, one example, Yamami.
I needed word for Mountaingazing, couldn't find it, so coining gave Yama+Mi.
My word here is for big windows, currently I am writing Nakamichi 'Yamami' utilizing not 3 but 4 windows and using YMM register with 'automatic' Match Lengths 32/16/8/4 or YMM>>LL, where LL is 0..3:

[FLLxxxxx][xxxxxxxx][xxxxxxxx][xxxxxxxx]
LL=00b; #0 window (4x8-3)bit=512MB; Match Length = 32>>LL
LL=01b; #1 window (3x8-3)bit=2MB; Match Length = 32>>LL
LL=10b; #2 window (2x8-3)bit=8KB; Match Length = 32>>LL
LL=11b; #3 window (1x8-3)bit=32B; Match Length = 32>>LL

The nifty thing is that 'Yamami' will be branchless thanks to 'F' being 0/1 - flag for literal/match.
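
The window sizes and match lengths in the layout above follow directly from the bit counts, as this small sketch shows. It assumes that a token with LL=0 spans 4 bytes and each higher LL value drops one byte, with 3 bits of every token taken by F and LL; the function name is hypothetical, and this is not code from Nakamichi itself.

```python
def yamami_fields(ll):
    """Return (token_bytes, offset_bits, window_bytes, match_length) for a 2-bit LL value."""
    if ll not in range(4):
        raise ValueError("LL is a 2-bit field: 0..3")
    token_bytes = 4 - ll               # assumption: LL=0 uses 4 bytes, LL=3 uses 1 byte
    offset_bits = token_bytes * 8 - 3  # 3 bits are taken by the F flag and LL
    window_bytes = 1 << offset_bits    # largest distance the offset can address
    match_length = 32 >> ll            # 32, 16, 8, 4
    return token_bytes, offset_bits, window_bytes, match_length


if __name__ == "__main__":
    for ll in range(4):
        print(ll, yamami_fields(ll))
```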

Why don't you write something with huge windows from the future (not necessarily useful on nowadays computers) utilizing your excellent techniques, it would be fun now and in the future.

Anonymous, March 24, 2015 at 11:56 PM
I had a block that came out of decompress(compress(foo)) a different length.
Is there a way I might send it in?

Anonymous, May 17, 2015 at 6:18 AM
We have 4845 RGB image tiles of 256x256. For 105 of the 4845 tiles, the compressed size is bigger than the original size after calling ZSTD_compress. What could cause that?

Satyendra Paul, September 9, 2015 at 5:53 PM
is it better than compression algorithm used by WinRAR?

Anonymous, May 23, 2016 at 8:09 PM
Hi Yann,
my wish one strong textual benchmark to be easily available, despite featuring only 88 testdatafiles (still not 400), took shape as a 3GB long package at my INTERNET drive:
https://sanmayce.wordpress.com/2016/05/23/the-88-benchmark/

The roster (a table with results on one A4 page) is quite informative, for a long time such "compressed" table was missing - this page once printed says a lot.

It would be nice some guys to run 'DECOMPRESS_all.bat', after downloading all the 187 files, on 4th/5th/6th gen Intel, thus we could see how the decompression behaves and whether some decompressor takes advantage of the architecture and breaks away.

SAMAIRA, March 15, 2017 at 8:21 AM
Anyone please tell me how to run this project?

Anonymous, July 27, 2017 at 1:18 PM
Hi Yann,

I have a question about file formats that support the zstd compression algorithm.
I did some analysis and found that 7z recently added support for zstd compression in its .7z file format.
But the .zip file format does not support the zstd compression algorithm.

My application compresses multiple files/directories (migrating from zlib deflate to a zstd-like format), so I need a file format for the zstd compression algorithm.

Can you please help me explore this question further?

thanks a lot yann for ur time :)

Unknown, November 30, 2021 at 1:20 PM
Is zstd compression applicable to embedded systems?