Friday, January 7, 2011

An old friend returns : new version of Zhuff (v0.4)

 Having worked on LZ4 and then on Huff0 improvements, it felt quite natural to take another look at Zhuff, an old compression program that was created by simply combining LZ4 and Huff0 together.

There is slightly more to it, though. Zhuff also uses a sliding window, in place of the simpler chunking used in LZ4. This avoids some memory waste, and keeps the compression potential at full window size at all times. I also added an incompressible segment detector.
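
For illustration, here is a minimal sketch of the sliding-window idea, assuming a 64 KB window; all names and details are hypothetical, not Zhuff's actual code:

    #include <string.h>

    #define WINDOW_SIZE (64*1024)                /* illustrative window size */

    typedef struct {
        unsigned char history[WINDOW_SIZE];      /* most recent bytes, oldest first */
        size_t filled;                           /* number of valid history bytes */
    } SlidingWindow;

    /* After each block, retain the last WINDOW_SIZE bytes as searchable
       history, instead of resetting it at every chunk boundary. */
    static void window_push(SlidingWindow* w, const unsigned char* block, size_t len)
    {
        if (len >= WINDOW_SIZE) {                /* the block alone fills the window */
            memcpy(w->history, block + (len - WINDOW_SIZE), WINDOW_SIZE);
            w->filled = WINDOW_SIZE;
            return;
        }
        {
            size_t const keep = WINDOW_SIZE - len;        /* old bytes to retain */
            if (w->filled > keep) {
                memmove(w->history, w->history + (w->filled - keep), keep);
                w->filled = keep;
            }
            memcpy(w->history + w->filled, block, len);
            w->filled += len;
        }
    }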

In its time, Zhuff succeeded at being the fastest compressor for its compression ratio, winning first place in several benchmarks. Since then, the focus has shifted to multi-threading, putting Zhuff at a disadvantage since it uses just one core. Nonetheless, it still features an excellent energy-efficiency profile, just behind Etincelle (which works best on larger source files).

This could change in the future, but for the time being, let's just re-use recent learnings to improve over last year's version.

It works relatively well. The new version is significantly faster, and on top of that, slightly improves the compression ratio.


version      Compression Ratio   Speed      Decoding
Zhuff 0.4    2.532               161 MB/s   279 MB/s
Zhuff 0.3    2.508               139 MB/s   276 MB/s

You may note that the compression ratio is quite a bit better than LZ4HC's, while the speed is much higher. It simply shows that entropy coding brings more compression power than a full match search, while costing much less CPU.

On the other hand, decoding speed is sharply reduced compared to LZ4, which is also an effect of the second-stage entropy compression.

You can download it on its webpage.

Tuesday, January 4, 2011

Entropy evaluation : small changes in benchmark corpus

You can't easily improve what you can't measure. That's why test tools are so important.
For compression, a benchmark corpus is an essential tool to measure performance and check consistency.
Its main drawback, however, is that it influences development in ways which make the benchmark look better, forgetting or even worsening situations which are not represented in the corpus.

Therefore, carefully choosing elements within the corpus, and even changing them from time to time, is a sensible thing to do.

For entropy evaluation, the different categories selected seem right. However, evaluating with just a single file example per category is misleading.
I started by replacing the LZ4 output with the LZ4HC output, which produces different patterns. I then added 2 files: Win98.vmdk, a virtual hard disk containing many cab files, and an Open Office directory, a mostly binary input. Here are the results :




                     Huff0 v0.6                      Range0 v0.7
                     Ratio  Compress  Decoding       Ratio  Compress  Decoding
Not compressible
enwik8.7z            1.000  810 MB/s  1.93 GB/s      1.000  885 MB/s  1.93 GB/s
Hardly compressible
win98-lz4hc-lit      1.024  465 MB/s   600 MB/s      1.019  374 MB/s   262 MB/s
audio1               1.070  285 MB/s   280 MB/s      1.071  174 MB/s    83 MB/s
Distributed
enwik8-lz4hc-lit     1.290  205 MB/s   194 MB/s      1.296  150 MB/s    77 MB/s
Lightly ordered
enwik8-lz4hc-offh    1.131  197 MB/s   184 MB/s      1.133  145 MB/s    79 MB/s
Ordered
enwik8-lz4hc-ml      2.309  214 MB/s   195 MB/s      2.326  160 MB/s    77 MB/s
Squeezed
office-lz4hc-run     3.152  218 MB/s   202 MB/s      3.157  164 MB/s    98 MB/s
enwik8-lz4hc-run     4.959  245 MB/s   224 MB/s      5.788  188 MB/s   148 MB/s
Ultra compressible
loong                278.0  785 MB/s  2.93 GB/s      1430.  555 MB/s   427 MB/s


There are several interesting learnings here.
Win98-lz4hc-lit is the literals-only stream extracted by lz4hc. But wait, why is it not in the "distributed" category ? Well, since this file contains many incompressible chunks, the literal sub-section ends up being mostly incompressible. This is an important real-world example, showing that incompressible segment detection makes a real impact.

lz4hc produces fewer literals and fewer matches, but longer ones, than the fast version. As a consequence, run lengths are much more compressible, while match lengths are not. It perfectly shows that the more squeezed the distribution, the greater Range0's compression advantage.

One could think that run lengths are a typical situation which should always benefit Range0. Alas, it is not that simple. Such a conclusion is biased, the result of focusing too much on enwik8 for tests.
Most binary files feature a very different pattern : much less frequent matches, but much longer ones. As a consequence, literals tend to be quite a bit more numerous, and their compressibility is not guaranteed either. And as a side effect, run lengths end up being larger.

This is shown by the office example : although the distribution is still "squeezed", resulting in a pretty good 3x compression ratio, this is still not enough to make Range0 distinctly better. In fact, considering the very small difference with Huff0, it's not worth the speed impact. This is in contrast with the enwik8 results.

This means we should not assume run lengths are always better compressed by Range0; a more complex selection process is required.
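
As a sketch only : one simple criterion could look at the dominant symbol's frequency, since a Huffman coder cannot spend less than 1 bit per symbol, while a range coder can. A heavily skewed histogram is thus where Range0 pulls ahead. Names and the threshold below are purely illustrative:

    #include <stddef.h>

    typedef enum { USE_HUFF0, USE_RANGE0 } Coder;

    /* Hypothetical selector: when one symbol dominates the histogram,
       Huffman's 1-bit-per-symbol floor leaves compression on the table,
       so the slower Range0 becomes worth its cost. */
    static Coder select_coder(const unsigned count[256], size_t total)
    {
        unsigned maxCount = 0;
        int s;
        for (s = 0; s < 256; s++)
            if (count[s] > maxCount) maxCount = count[s];
        /* illustrative threshold : dominant symbol above ~70% of occurrences */
        return ((size_t)maxCount * 10 > total * 7) ? USE_RANGE0 : USE_HUFF0;
    }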

Monday, January 3, 2011

Range0 : new version (v0.7)

Several learnings from the new Huff0 version were generic enough to be applicable to Range0 as well. More specifically, the interleaved counters and the call to memcpy() resulted in significant improvements, as displayed below :

                       Range0 v0.6      Range0 v0.7
                       R    C    D      R    C    D 
Not compressible 
enwik8.7z            1.000 870 1400   1.000 885 1930 
Hardly compressible 
audio1               1.071 174   83   1.071 174   83 
Distributed 
enwik8-lz4-lit       1.370 155   76   1.370 155   76 
Lightly ordered 
enwik8-lz4-offh      1.191 138   76   1.191 138   76 
Ordered 
enwik8-lz4-ml        2.946 155   83   2.946 160   83
Squeezed 
enwik8-lz4-run       4.577 163  116   4.577 180  116 
Ultra compressible
loong                1430. 362  427   1430. 555  427
(R = ratio; C, D = compression and decoding speed, in MB/s)


The memcpy() change works wonders at improving speed on incompressible segments.
More importantly, the interleaved counters speed up compression of squeezed distributions.

The ultra-compressible corner case is not so important, in spite of the huge performance benefit, but the squeezed-distribution gain is, since this is Range0's most likely use case. At 180 MB/s, Range0 becomes almost as fast as standard Huffman coders, which basically means extra compression performance for free.

The bad point is, and will remain, the decoding speed, which cannot beat Huffman, due to the presence of a division in the main loop. However, the decoding speed is not too bad for the specific "squeezed" situation Range0 is intended for. Indeed, should the distribution become even more squeezed, decoding speed would become even better.
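
For illustration, here is the general shape of a range-decoder step (a generic sketch, not Range0's actual code), showing where the division sits; a Huffman decoder gets away with shifts and table lookups instead:

    /* 32-bit carryless range-decoder state (illustrative) */
    typedef struct {
        unsigned low, range, code;
    } RangeDec;

    /* Find where 'code' falls within the current range : this runs once
       per symbol, and the divisions are what cap decoding speed. */
    static unsigned rd_get_freq(RangeDec* rd, unsigned totFreq)
    {
        rd->range /= totFreq;
        return (rd->code - rd->low) / rd->range;
    }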

Evolving the benchmark corpus to use LZ4HC output, instead of fast LZ4 output, is likely to make this statement even more relevant.

For the time being, Range0 can be downloaded here.

Sunday, January 2, 2011

Huff0 : New version (v0.6)

While working on the stats produced by Huff0, some inefficiencies became apparent, and as a consequence they got solved in a newer version, available on the new Huff0 homepage.

To properly understand what was changed, here is a performance snapshot, comparing v0.5 (old) with v0.6 (new).

                       Huff0 v0.5       Huff0 v0.6 
                       R    C    D      R    C    D 
Not compressible 
enwik8.7z            1.000 740 1400   1.000 810 1930 
Hardly compressible 
audio1               1.070 285  280   1.070 285  280 
Distributed 
enwik8-lz4-lit       1.363 210  200   1.363 210  200 
Lightly ordered 
enwik8-lz4-offh      1.188 188  181   1.188 188  181 
Ordered 
enwik8-lz4-ml        2.915 200  197   2.915 210  197 
Squeezed 
enwik8-lz4-run       4.284 209  218   4.284 235  218 
Ultra compressible
loong                278.0 450 2930   278.0 785 2930
(R = ratio; C, D = compression and decoding speed, in MB/s)

Only speed is affected by these changes; the compression ratio remains strictly identical.

The first change is in the way data is scanned. This is a nifty trick, related to CPU behavior. When a counter is updated, it should not be updated again immediately, otherwise there is a "cache" penalty : the previous change must be fully committed before the new one gets accepted.
In order to avoid this penalty, we interleave the updates across several counter tables, so that the CPU gets enough time to deal with repeated changes to the same value. It makes the code slightly more complex and the data structure a bit bigger, but it is definitely worth it : although it slightly hurts situations such as "not compressible", the more repetitive the data, the bigger the benefit.
We end up with a grand +70% improvement for the "ultra compressible" corner case. It is not only corner cases that benefit, though : the "ordered" distribution gets a nice +5% boost, while squeezed ones get a bit more than +10%.
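
Here is a minimal sketch of the idea, assuming 4 partial tables merged at the end; names and the table count are illustrative, not Huff0's actual code:

    #include <stddef.h>
    #include <string.h>

    #define NB_TABLES 4        /* illustrative; enough to hide update latency */

    /* Count byte frequencies with interleaved partial tables, so that two
       consecutive identical bytes never hit the same counter back-to-back. */
    static void count_bytes(const unsigned char* src, size_t len, unsigned count[256])
    {
        unsigned tables[NB_TABLES][256];
        size_t i = 0;
        int s;
        memset(tables, 0, sizeof(tables));

        for (; i + NB_TABLES <= len; i += NB_TABLES) {
            tables[0][src[i+0]]++;
            tables[1][src[i+1]]++;
            tables[2][src[i+2]]++;
            tables[3][src[i+3]]++;
        }
        for (; i < len; i++) tables[0][src[i]]++;     /* remaining tail */

        /* merge the partial tables into the final histogram */
        for (s = 0; s < 256; s++)
            count[s] = tables[0][s] + tables[1][s] + tables[2][s] + tables[3][s];
    }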

The second change is in the way incompressible data is detected. While the end result is still slower than Range0, there is a pretty nice performance boost from betting earlier on the incompressible nature of the data just scanned. It avoids triggering the later stages of the algorithm, thus saving time and energy. This is a nice +10% boost.
While this gain is only visible on the "not compressible" example, it can still bring benefits in real situations. It is not uncommon for some parts of large files to be incompressible. For example, an ISO file may contain a few pictures in compressed format. In such cases, the better the detection, the faster the compression, since any attempt to compress such data is likely to fail or end up with miserable gains.
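
One possible shape for such an early bet, assuming a sampled entropy estimate; the sample size and threshold are illustrative guesses, not Huff0's actual values:

    #include <math.h>
    #include <stddef.h>

    /* Histogram a small sample, estimate its Shannon entropy, and when the
       projected saving is negligible, store the segment verbatim rather
       than running the full coder. */
    static int sample_looks_incompressible(const unsigned char* src, size_t len)
    {
        size_t const sample = (len < 16384) ? len : 16384;
        unsigned count[256] = { 0 };
        size_t i;
        int s;
        double bits = 0.0;

        for (i = 0; i < sample; i++) count[src[i]]++;
        for (s = 0; s < 256; s++) {
            if (!count[s]) continue;
            bits -= (double)count[s] * log2((double)count[s] / (double)sample);
        }
        /* projected saving below ~2% : bet on "incompressible" */
        return bits >= 7.84 * (double)sample;      /* 7.84 = 8 bits * 0.98 */
    }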

The third change is really straightforward : I've removed my hand-made "copy" operation, and swapped it with the standard memcpy() call of the C library.
Although it only matters when some segments are incompressible, in that case the gain is really noticeable, at about +30%. For just a simple function call, this is really worth it.
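
The swap itself, in its simplest form (function names are illustrative):

    #include <string.h>

    /* before : hand-made byte copy */
    static void copy_raw_loop(unsigned char* dst, const unsigned char* src, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++) dst[i] = src[i];
    }

    /* after : a single library call; memcpy selects an optimized path
       (wide moves, alignment handling) that the naive loop doesn't get */
    static void copy_raw(unsigned char* dst, const unsigned char* src, size_t n)
    {
        memcpy(dst, src, n);
    }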

I still have to understand how the call to memcpy() can be faster than my own simple loop. There are probably some very interesting learnings behind it.

As a side comment, I tried the same trick on the "ultra compressible" file, replacing my own simple loop with a memset(). It resulted in no speed difference. I then tried to improve the simple loop with more complex parallel execution, but it resulted in a loss.
My only explanation so far is that the compiler probably translates my simple loop into a memset(), transparently.

Anyway, please feel free to test.
The binary can be downloaded at the Huff0 homepage.