Saturday, January 24, 2015

Zstandard - A stronger compression algorithm

 Zstd, short for Zstandard, is a new lossless compression algorithm, aiming at providing both great compression ratio and speed for your standard compression needs. "Standard" translates into everyday situations which neither look for highest possible ratio (which LZMA and ZPAQ cover) nor extreme speeds (which LZ4 covers).

It is provided as a BSD-license package, hosted on Github.

For a taste of its performance, here are a few benchmark numbers, completed on a Core i5-4300U @ 1.9 GHz, using fsbench 0.14.3, an open-source benchmark program by m^2.

Name            Ratio  C.speed D.speed
                        MB/s    MB/s

zlib 1.2.8 -6   3.099     18     275
zstd            2.872    201     498
zlib 1.2.8 -1   2.730     58     250
LZ4 HC r127     2.720     26    1720
QuickLZ 1.5.1b6 2.237    323     373
LZO 2.06        2.106    351     510
Snappy 1.1.0    2.091    238     964
LZ4 r127        2.084    370    1590
LZF 3.6         2.077    220     502
An interesting feature of zstd is that it can qualify as both a reasonably strong compressor and a fast one.

Zstd delivers high decompression speed, at around ~500 MB/s per core.
Obviously, your exact mileage will vary depending on your target system.

Zstd compression speed, on the other hand, can be configured to fit different situations.
The first, fast, derivative offers ~200 MB/s per core, which is suitable for a few real-time scenarios.
But similar to LZ4, Zstd can offer derivatives trading compression time for compression ratio, while keeping decompression properties intact. "Offline compression", where compression time is of little importance because the content is only compressed once and decompressed many times, is therefore within the scope.

Note that high compression derivatives still have to be developed.
It's a complex area which will certainly benefit the contributions from a few experts.

Another property Zstd is developed for is configurable memory requirement, with the objective to fit into low-memory configurations, or servers handling many connections in parallel.

On the decoding side, Zstd memory requirement is divided into 2 main parts :
  1. The entropy tables : Zstd entropy stage is handled by FSE (Finite State Entropy).
    FSE needs several transformation tables, which currently cost 10 KB.
    The intention is to make this size configurable, from a minimum of 2.5 KB to a maximum of 20 KB. This is relatively mild requirement, mostly interesting for systems with very limited memory resource.
  2. The match window size, which is basically the size of "look back buffer" decompression side must maintain in order to complete "match copy" operations.
    Basic idea is : the larger the window size, the better the compression ratio.
    However, it also increases memory requirement on the decoding side, so a trade off must be found.
    Current default window size is 512 KB, but this value will be configurable, from very small (KB) to very large (GB), in the expectation to fit multiple scenarios needs.

The compression stage needs to handle a few more memory segments, the number and size of which is highly dependent on the selected search algorithm. At a minimum, there is a need for a "look-up table", such as the one used by the "fast mode". The current default size of this table is currently selected at 128 KB, but this too will be configurable, from as low as a few KB to a few MB.
Stronger search algorithms will need more tables, hence more memory.

While such speed qualify Zstd as a "fast compressor / decompressor", it still doesn't reach LZ4 territory. Therefore, selecting which algorithm best suits your need highly depends on your speed objective.

In order to help such selection, I've reused the benchmark methodology proposed here, which adds compression, communication, and decompression time in a simplistic transmission scenario. It results in the following chart :

(click to enlarge)

As expected, using "direct raw numbers representation" results in a clogged graphic, where each compressor is difficult to distinguish. Therefore, the representation is reworked, using the same scenario and same numbers, but dynamically zooming each speed sample so that the ranking is preserved, with the best solution receiving always the relative note "1", and other following depending on their speed difference. It creates the following graph :

(click to enlarge)

which is easier to interpret.
From this table we see that LZ4 is a better choice for speeds above ~50 MB/s, while Zstd takes the lead for speeds between 0.5 MB/s and 50 MB/s. Below that point, stronger alternatives prevail.

Zstd development is starting. So consider current results merely as early ones. The implementation will gradually evolve and improve overtime, especially during this first year. This is a phase which will depend a lot on user feedback, since these feedbacks will be key in deciding next priorities or features to add.


  1. Could you please compare against ? Both in performance and in technical approach/tradeoffs.

    1. Hi Zeev ! Rich is probably in better position to talk about lzham.

      My understanding is that the starting design point is different. lzham starts from lzma, and try to make it faster, using an offline compression scenario as a reference point.
      zstd starts by being a fast compression algorithm, and can compress more by spending more time during compression. Its principles are more similar to zlib, with basically 3 changes :
      - FSE instead of Huffman as entropy
      - unbounded match size
      - repeat offset ability

      So there will probably be some area with similar performance levels, but overall, lzham tends to be on the stronger side, while zstd tends to be on the faster side.

  2. Very interesting! Looks like ZSTD will leapfrog several other codecs.

    LZHAM is targeting very high ratios (roughly comparable to LZMA), so ZSTD and LZHAM don't overlap.

    I need to switch LZHAM to FSE, its decoder looks incredibly simple.

    1. You're very welcomed Rich. I wish FSE can help you and your project. I'm available for questions/clarifications if needed.

  3. How does it compare against brotli? Does it need a longer data set to get started or is it competitive with short files, too?

    1. I haven't been able to compare it directly with Brotli so far. But the current version is not yet optimized for short files. I expect its current performance to suffer in such scenario.

      It's just a temporary situation though : the intention is to make zstd good at compressing short data packets too. It just needs a bit of tuning, and good part of it is just changing some parameters.

      My understanding regarding Brotli is that was initially created as a specialized font compression algorithm, later converted for general purpose. So I would expect Brotli to remain strong in its area of reference.

  4. Could you please compare against ? Both in performance and in technical approach/tradeoffs.

    1. My understanding is that tamp, as described in your link, is a multi-threaded version of QuickLZ. The speed up described is achieved on a 64-way system.
      By contrast, zStd figures provided above are for single-threaded systems. QuickLZ figures are also provided, so it's possible to compare.

  5. I just tried this on a 100,000,000 byte file of random noise, represented as digits 0..9. The noise should be incompressible, but the encoding itself should allow it to be reduced to about log2(10)/8 * 100000000 = 41524102 bytes in the best case (e.g., with a hand-written reencoder).

    On an old/slow machine, zstd knocked the file down to 41557511 bytes in 1.3 seconds. gzip took longer, fast (-1) setting took 4.2s and was 49698347 bytes, whereas the slow (-9) setting took 14.7 and was 46902944. xz took a staggering 188.1 seconds and only knocked it down to 44094460.

    For my application, this is just the compression algorithm I need. Thank you!

    1. The issue of gzip compression ratio here is using Huffman coding - approximating probabilities with powers of 1/2. Lempel-Ziv will not help with random noise, so the only part of ZSTD you really use is Asymmetric Numeral System entropy coder - you can get speedup by using e.g. Yann's FSE only.

    2. Thanks for sharing your experience. You are very welcomed

  6. Hello,

    I have hit a stack smashing problem in the call to FSE_adjustNormSlow() where the while(pointsToRemove) loop could write higher than the size of the rank[] array.

    I had changed FSE_MAX_SYMBOL_VALUE to 1024 to make it successfully work, instead of the original value of 255 but I don't know what could have changed this way (or not).

    I did not investigate much further.

    1. Thanks Jean Luc

      Definitely, this part needs to be investigated further.
      You provide a great hint for the solution.

      Best Regards

    2. Could it be possible to get a sample which generates the issue ?
      This is to make sure it is correctly solved, and can be regenerated as part of the test CI suite.

    3. I've produced an updated version of FSE_adjustNormSlow() within the dev branch of FSE :

      Without access to the test case, it's difficult to tell if the new version is any better. At least, it's supposed to be safer, and I was unable to make it crash, even with hand-tuned difficult corner cases. But then, maybe I'm missing the corner case which triggers your report.

      Could you please check if it works better for you ?

    4. The fix seems to help, and has been integrated into the "dev" branch of zstd.

    5. Hello,

      I still have the stack smash problem when I try to run the latest fse.c. I have a 32K chunk to send to you, where can I send it ?

      Thanks !

    6. If you can't send it in public mode, there is a "contact the author" link (top right), from which I can provide you an email for private sending.

  7. First time using github / gist here, here is a gist of the buffer encoded in base64.

  8. This is a .zip file encoded in base64, when unzipped, is the actual 32K buffer, sorry for the confusion

    1. ok, so I need to convert it back to binary first. Google will probably help to find an unbase64 tool. Or I'll write it directly.

  9. In cygwin or any place where you have access to GNU coreutils, you can run base64 --ignore-garbage -d fse_crash.txt >

    Where fse_crash.txt is the content of the base64 from

    1. Thanks your suggestion worked perfectly

    2. I've been testing your sample.
      An older version of FSE would indeed crash on it.
      But the newer version seems to pass the test without problem.

      Do you still have problems with the newer version ?

    3. Yes, I still have problems with the newer version.

      For reference, I am using VS2012 Update 4, cl.exe for x64, using compiler flags:

      /MP /GS /Gy- /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /fp:fast /D "_LIB" /D "_WIN32_WINNT=0x0600" /D "NDEBUG" /fp:except- /errorReport:prompt /GF /WX /Zc:forScope /GR- /Gd /Oy- /Oi /MT /openmp- /Ot

    4. I believe this is the same issue as . I'll update this github issue with more information this evening with additional information that might assist.


  10. I'm *very* excited about this! (esp. about the HC derivative)

  11. Hi, pretty interesting. Does Zstd use a framing format the same or similar as LZ4? At this time, is the Zstd binary stream format finished (will not change)? Thanks

    1. Yes, it uses a framing format, but a different one from LZ4.
      And no, this format is not yet considered stable.
      The objective is to make it stable by year end.
      When it will be, the library number will become 1.0.

  12. Hi Yann,
    Your project is really interesting.
    I'm looking for a small footprint compression library. If I correctly understand you article it is possible to fit the algoritm in few KB of RAM.
    Do you think it is possible to do compression on a 32bits MCU with only 32KB of RAM?

    1. Hi Dimitri

      The answer to your question is "Yes, but ..."
      Generally speaking, the design of the format will allow that, but today, the required parameters and optimizations are not present yet in the reference source code. So it's too early to try out.

      I suspect the 32KB RAM is a total budget, therefore encompassing both buffers (input/output), compression/decompression tables, and potentially some other stuff as well ?

      Knowing how much budget is really available is a key metric for such a context.

      If that is the case, I suspect you'll need to work on very small blocks, typically 4KB block size, with a "copy window size" of 8 KB (meaning a block can use data from previous block to improve compression ratio, which is very efficient).

      This is definitely within the low size limit for zstd format. Since Zstd was designed with 4KB blocks in mind, my guess is that it will work, but we'll need time to properly verify it.

  13. Hi Yann,

    Thank you for your quick reply!

    I tried to reduce some define values as BLOCKSIZE and g_maxDistance.
    With the zstdcli.c example I got a final data size 20% of my orignal file.

    I changed the BLOCKSIZE to 2KB and 4KB, the results were approximately the same than with default values. g_maxDistance was fixed to 2x BLOCKSIZE.
    Notice that changing BLOCKSIZE value to less than 2KB deteriorate the compression ratio.

    For now, I only need the compression part of the library. So I don't think buffer optimization is a priority for me. But it would be great for futur uses.

    A other Zstd limitation "as it is" for embedded targets is the malloc uses in the library.


    1. > I changed the BLOCKSIZE to 2KB and 4KB, the results were approximately the same than with default values. g_maxDistance was fixed to 2x BLOCKSIZE.

      That's much better than anticipated. I would have suspected headers to make a significant dent into compression radio without optimizations dedicated to such settings. It's good news it worked so well in your tests.

      > A other Zstd limitation "as it is" for embedded targets is the malloc uses in the library.

      Correct. I suspect you'll need a way to propose a "function pointer" to replace standard malloc() with another method. Is it fine to compile with malloc() as backup method when no "function pointer" is provided, or even that could be a problem ?

    2. Hi,
      In my case, static allocation is mandatory. I would like to use array or static structures.

      I have some interrogations about my previous tests. I am not sure I correctly configured each constants.
      As I said I changed BLOCKSIZE constant and few others. For example MAXD_LOG is still fixed to 16. I have not investigated the consequence of each #define on the memory footprint.

      About the compression better than expected, I am not sure that we can conclude something with these tests. Indeed my data are very repetitive. Data are made of 16 bytes patterns and some bytes (~4) of the pattern never change. I would say there is more or less 5 bytes that change on each pattern occurances.

    3. MAXD_LOG is useless, I'll remove this constant.
      The one which matters is g_maxDistance, which you properly modified.

      If your data has an "aligned" structure, as seems the case in your description, then be informed that zstd deal with it using some fairly efficient code, which could explain the good results. I agree this conclusion should not be overextended to other kind of data.

      I note your need for static allocation, and will consider this requirement in future versions of the library.

  14. Hi Yann,
    few thoughts about your superb tools.

    First, looking on some benchmarks of your tools I see you achieve unthinkable (for me at least) speeds, WOW! I have few dummy ideas (using much RAM) to speed up my atrociously slow LZ compression but I couldn't and didn't allow myself to dream about speeds you have achieved.

    I am confused what tool to compare my 'Loonette-Hoshimi' to. Is it zhuff or zstd?
    Speaking of Decompression Speed (at Ratio 2.8:1) for enwik8 I see no rival to Nakamichi 'Loonette-Hoshimi' except zhuff in Dr. Matt Mahoney's LTCB roster.

    zhuff 0.97 beta -c2 34,907,478
    Nakamichi 'Loonette-Hoshimi' 34,968,896

    On Asus laptop with Core 2 Q9550s @2883MHz, HDD Samsung HM641JI, Windows 7 decompression speed is 1,000,000,000/(249*1024*1024)=3.8 ns/byte:

    D:\_KAZE\GETCC_General_English_Texts_Compression_Corpus>\timer32.exe Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe enwik8.Nakamichi
    Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
    Decompressing 34968896 bytes ...
    RAM-to-RAM performance: 249 MB/s.

    Kernel Time = 0.234 = 12%
    User Time = 0.327 = 16%
    Process Time = 0.561 = 28% Virtual Memory = 1148 MB
    Global Time = 1.948 = 100% Physical Memory = 131 MB

    AFAIU your zstd is the successor of your promising "étincelle", and if Zstd is another ongoing/unfinished project let me suggest a name for your future tool, "étoile". For those who don't get it, the "spark" evolves to "star".
    The context comes from the immortal Antoine de Saint-Exupéry:
    'Et, s'il vous arrive de passer par là, je vous supplie, ne vous pressez pas, attendez un peu juste sous l'étoile!'
    'And, if you should come upon this spot, please do not hurry on. Wait for a time, exactly under the star.'

    My French is as my Japanese, with vocabulary of only several hundreds of words, yet I enter "coinage mode" whenever I can't find the needed word, one example, Yamami.
    I needed word for Mountaingazing, couldn't find it, so coining gave Yama+Mi.
    My word here is for big windows, currently I am writing Nakamichi 'Yamami' utilizing not 3 but 4 windows and using YMM register with 'automatic' Match Lengths 32/16/8/4 or YMM>>LL, where LL is 0..3:

    LL=00b; #0 window (4x8-3)bit=512MB; Match Length = 32>>LL
    LL=01b; #1 window (3x8-3)bit=2MB; Match Length = 32>>LL
    LL=10b; #2 window (2x8-3)bit=8KB; Match Length = 32>>LL
    LL=11b; #3 window (1x8-3)bit=32B; Match Length = 32>>LL

    The nifty thing is that 'Yamami' will be branchless thanks to 'F' being 0/1 - flag for literal/match.

    Why don't you write something with huge windows from the future (not necessarily useful on nowadays computers) utilizing your excellent techniques, it would be fun now and in the future.

    1. Hi Sanmayce.
      You're welcomed.

      zhuff should now be considered superceded by zstd.
      Basically, zhuff was the "private tool" I used to experiment a few ideas which are now present into zstd.

      Note that zstd is planned to cope with large windows. It's not necessarily visible in current version, but it will come in later versions.

      Etincelle is a very different beast, using much more memory and another algorithm. It has the potentiel to compress better, but it will also integrate less well into other programs, because it uses too much resources for itself.
      That's why I'm currently concentrating on zstd.
      But I may come back to etincelle later on.

      By the way, you got the naming intention right :) Indeed, Etincelle is preceding Etoile...

    2. Thanks for the clarification, now I know the ... hierarchy.

      *>Note that zstd is planned to cope with large windows. It's not necessarily visible in current version, but it will come in later versions.*
      Very very good, thinking beyond the limitations, that present PCs impose, one day will pay off.

      2,443,181,056 GTCC_General_Textual_Compression_Corpus.tar
      2,427,389,560 GTCC_General_Textual_Compression_Corpus.tar.method07.zpaq
      758,742,694 GTCC_General_Textual_Compression_Corpus.tar.Tengu.Nakamichi
      741,931,125 GTCC_General_Textual_Compression_Corpus.tar.lz4
      732,798,243 GTCC_General_Textual_Compression_Corpus.tar.lzturbo12-19.lzt
      529,446,634 GTCC_General_Textual_Compression_Corpus.tar.method27.zpaq
      494,883,888 GTCC_General_Textual_Compression_Corpus.tar.001MB.7z
      443,856,829 GTCC_General_Textual_Compression_Corpus.tar.lzturbo12-39.lzt
      436,250,451 GTCC_General_Textual_Compression_Corpus.tar.128MB.7z
      401,507,552 GTCC_General_Textual_Compression_Corpus.tar.ST6Block128.bsc
      399,427,626 GTCC_General_Textual_Compression_Corpus.tar.ST6Block256.bsc
      336,778,595 GTCC_General_Textual_Compression_Corpus.tar.method57.zpaq
      331,768,611 GTCC_General_Textual_Compression_Corpus.tar.method58.zpaq

      I wrote to Dr. Gorontzi to fix the rzip64's -9 option, currently -6 gave:
      485,510,051 GTCC_General_Textual_Compression_Corpus.tar.rz

      In my view lrzip/rzip64 are on the right track for ... future.

      My GTCC textual largest corpus is downloadable at:!tk5UgRab!MYrNSCzhavZ0OptL9kxAtr8pYmr3rEEJsPIFi465WbU

      *>But I may come back to etincelle later on.*
      Please do, it would be a shame a working donkey to be left behind in the mud, ugh, we have in Bulgarian a beautiful saying stating that.

      *>By the way, you got the naming intention right :) Indeed, Etincelle is preceding Etoile...*
      Yes, I am good at that, funny, I have the habit to 'hash' whole books down to one most significant sentence, years ago such phrase from the 'Babaji - the unfathomable' has imprinted in me for life:
      "One spark is enough."
      I won't go in the context, yet, it was about giving a tiny chance to divine.

      In English there is a beautiful word 'firefly' but (still) no analogue to the beautiful Japanese 'hibana' meaning 'spark', so ... coining mode:
      hibana 【火花】
      Two kanjis compose it, fire [火] and flower [花].
      Now we have a strong synonym to 'spark'.
      It caught me at once, 'Fireflower', feel free to use it, it is licenseless, heh-heh.

      Yann, I salute you with 'Firefly' song and wish you success.

      I feel a stirring deep within
      Slowly picking up momentum
      Like the tide coming into shore
      Over and under in its course

      This feeling emblazed inside
      Every nerve like a firefly
      Hovering above me
      Glow glow, glowing divine

      I never want to lose
      What I have finally found
      There's a requiem
      A new congregation

      And it's telling me
      Go forward and walk
      Under a brighter sky
      Every nerve's glowing like a firefly

      /Delerium - Euphoria (Firefly) Lyrics/

  15. I had a block that came out of decompress(compress(foo)) a different length.
    Is there a way I might send it in?

    1. You might use the github issue board. Or alternatively, send a link to download place using private email (top right of this blog)

    2. This problem is gone in zstd-0.1.1.
      Even checking decompression, zstd eclipses lz4hc and zlib.
      The Pareto frontier is now lz4, zstd, libzling and lzma.

  16. We have 4845 of 256x256 tile of RGB images. There are 105 out of 4845 tiles that the compressed size is bigger than original size after calling ZSTD_compress. What could cause that?

    1. There is no compression algorithm on Earth that can ensure you to always compress everything to a smaller size. If your source data is just pure noise, by definition, it is not compressible.

      Of course, there are a lot of situations which are not "pure noise" but end up being close enough to be difficult to compress.

      Furthermore, current version of zstd is "fast" only,
      which means it's trading a lot of compression power for the sake of speed.
      A future "HC" version will make the inverse trade off. With this future version, the likelihood that a compressed object end up being bigger than the original one will be sharply reduced.

    2. Hi Yann,

      Thanks so much for such a quick response. I am new to compression and it is very good to know. I was indeed very impressed with the speed -- a 14kx24k of 24 bit images can be load, compressed, and saved in 9 seconds. Since not all data can be compressed, if compressed size is bigger than the original size, can we just use the original data? When decompress, if the compressed size is same as the expected uncompressed size, just use the compressed data directly. Do you see any problem with that? Thanks again.

    3. zstd internal frame format is supposed to already do that.
      But there is still, nonetheless, some header costs.
      The header cost is supposed to be minimal (< 0.1%).

      If you nonetheless feel this header cost is too much for you, you can indeed add your own logic, and decide to save your data unmodified when compression is not good enough. You'll need to design your own yes/no flag for this objective.


    4. Hi Yann,

      We have tried to create a multi-page compressed tiff. Your ZStd is over 30 times faster than zlib. Great work! Our software can load both compression, but ImageJ can only load zlib compression. Do you know what value should be set for tiff compression tag so other software knows how to decompress? Thanks so much.


    5. Unfortunately, I suspect there is no such tag.
      ImageJ will have to implement zstd decoder, or allow the creation of plugins.

  17. is it better than compression algorithm used by WinRAR?

    1. I would guess "unlikely", since Zstd design is much simpler and targets speed primarily, with only a significantly larger maximum dictionary size as a potential edge.

      That being said, such comparison is not possible today. It is first necessary to develop a High Compression mode to compare to.

  18. Hi Yann,
    my wish one strong textual benchmark to be easily available, despite featuring only 88 testdatafiles (still not 400), took shape as a 3GB long package at my INTERNET drive:

    The roster (a table with results on one A4 page) is quite informative, for a long time such "compressed" table was missing - this page once printed says a lot.

    It would be nice some guys to run 'DECOMPRESS_all.bat', after downloading all the 187 files, on 4th/5th/6th gen Intel, thus we could see how the decompression behaves and whether some decompressor takes advantage of the architecture and breaks away.

    1. Nice work you're putting together Sanmayce.
      We have no Skylake system around right now, but this could improve in the future.

  19. Anyone please tell me how to run this project?

  20. Hi Yann,

    I have one doubt about zip file format which support zstd compression algorithm.
    i did some analysis and found that recently 7z has supported zstd compression with its .7z file format.
    But like .zip file format does not support zstd compression algo.

    My application is to compression multiple files/directry (migrate form zlib deflate to zlib zstd like format) so need a file format for zstd compression algo.

    can u plz help me to explore more about this question.

    thanks a lot yann for ur time :)

    1. It's a bit complex.
      zstd defines a _frame_ format, much like gzip, bzip2 or xz.

      7z and zip, by contrast, are archivers.
      They define a _container_ format.

      The difference is pretty steep.
      containers can contain multiple files, full directory images, file attributes, and so on.
      by contrast, frames only contain a single continuous and unidentified "stream of bytes".

      See containers as an envelope encapsulating multiple frames (possibly of different format).

      That's why 7z and zstd are not directly comparable.
      The 7z container format can encapsulate zstd frames, which is what Tino Reichardt's 7zip-Zstandard edition does :

    2. Thanks a lot yann for ur answer.

      as u said, 7z container format can encapsulate zstd frames.
      does other containers also supported zstd frames.

      7z pkg has cpp code for container encoder and i need c code since my project is in C.
      So it will be very helpfull if u can suggest some other containers which support zstd frames. OR provide list of containers which support zstd frames.

      thanks a lot :)

    3. 7zip-Zstandard edition has a C source code support :

      libarchive has support for many containers and compressors,
      there is a PR for Zstandard support her :
      but it hasn't been merged yet.