Friday, January 7, 2011

An old friend returns : new version of Zhuff (v0.4)

 Having worked on LZ4 and then on Huff0 improvements, it felt quite natural to take another look at Zhuff, an old compression program that was created by simply combining LZ4 and Huff0 together.

There is slightly more to it, though. Zhuff also uses a sliding window, in place of the simpler chunking used in LZ4. This avoids some memory waste, and keeps the compression potential at full window size at all times. I also added an incompressible segment detector.
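
For illustration, here is a minimal sketch of the sliding-window idea, assuming a 64 KB window; all names and details are hypothetical, not Zhuff's actual code:

    #include <string.h>

    #define WINDOW_SIZE (64*1024)                /* illustrative window size */

    typedef struct {
        unsigned char history[WINDOW_SIZE];      /* most recent bytes, oldest first */
        size_t filled;                           /* number of valid history bytes */
    } SlidingWindow;

    /* After each block, retain the last WINDOW_SIZE bytes as searchable
       history, instead of resetting it at every chunk boundary. */
    static void window_push(SlidingWindow* w, const unsigned char* block, size_t len)
    {
        if (len >= WINDOW_SIZE) {                /* the block alone fills the window */
            memcpy(w->history, block + (len - WINDOW_SIZE), WINDOW_SIZE);
            w->filled = WINDOW_SIZE;
            return;
        }
        {
            size_t const keep = WINDOW_SIZE - len;        /* old bytes to retain */
            if (w->filled > keep) {
                memmove(w->history, w->history + (w->filled - keep), keep);
                w->filled = keep;
            }
            memcpy(w->history + w->filled, block, len);
            w->filled += len;
        }
    }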

In its time, Zhuff succeeded at being the fastest compressor for its compression ratio, winning first place in several benchmarks. Since then, the focus has shifted to multi-threading, putting Zhuff at a disadvantage since it uses just one core. Nonetheless, it still features an excellent energy-efficiency profile, just behind Etincelle (which works best on larger source files).

This could change in the future, but for the time being, let's just re-use recent learnings to improve over last year's version.

It works relatively well. The new version is significantly faster, and on top of that, slightly improves the compression ratio.


version      Compression Ratio   Speed      Decoding
Zhuff 0.4    2.532               161 MB/s   279 MB/s
Zhuff 0.3    2.508               139 MB/s   276 MB/s

You may note that the compression ratio is quite a bit better than LZ4HC's, while the speed is much higher. It simply shows that entropy coding brings more compression power than a full match search, while costing much less CPU.

On the other hand, decoding speed is sharply reduced compared to LZ4, which is also an effect of the second-stage entropy compression.

You can download it on its webpage.

Tuesday, January 4, 2011

Entropy evaluation : small changes in benchmark corpus

You can't easily improve what you can't measure. That's why test tools are so important.
For compression, a benchmark corpus is an essential tool to measure performance and check consistency.
Its main drawback, however, is that it influences development in ways which make the benchmark look better, forgetting or even worsening situations which are not represented in the corpus.

Therefore, carefully choosing elements within the corpus, and even changing them from time to time, is a sensible thing to do.

For entropy evaluation, the different categories selected seem right. However, evaluating with just a single file example per category is misleading.
I started by replacing the LZ4 output with the LZ4HC output, which produces different patterns. I then added 2 files: Win98.vmdk, a virtual hard disk containing many cab files, and an Open Office directory, a mostly binary input. Here are the results :




                     Huff0 v0.6                      Range0 v0.7
                     Ratio  Compress  Decoding       Ratio  Compress  Decoding
Not compressible
enwik8.7z            1.000  810 MB/s  1.93 GB/s      1.000  885 MB/s  1.93 GB/s
Hardly compressible
win98-lz4hc-lit      1.024  465 MB/s   600 MB/s      1.019  374 MB/s   262 MB/s
audio1               1.070  285 MB/s   280 MB/s      1.071  174 MB/s    83 MB/s
Distributed
enwik8-lz4hc-lit     1.290  205 MB/s   194 MB/s      1.296  150 MB/s    77 MB/s
Lightly ordered
enwik8-lz4hc-offh    1.131  197 MB/s   184 MB/s      1.133  145 MB/s    79 MB/s
Ordered
enwik8-lz4hc-ml      2.309  214 MB/s   195 MB/s      2.326  160 MB/s    77 MB/s
Squeezed
office-lz4hc-run     3.152  218 MB/s   202 MB/s      3.157  164 MB/s    98 MB/s
enwik8-lz4hc-run     4.959  245 MB/s   224 MB/s      5.788  188 MB/s   148 MB/s
Ultra compressible
loong                278.0  785 MB/s  2.93 GB/s      1430.  555 MB/s   427 MB/s


There are several interesting learnings here.
Win98-lz4hc-lit is the literals-only stream extracted by lz4hc. But wait, why is it not in the "distributed" category ? Well, since this file contains many incompressible chunks, the literal sub-section ends up being mostly incompressible. This is an important real-world example, showing that incompressible segment detection makes a real impact.

lz4hc produces fewer literals and fewer matches, but longer ones, than the fast version. As a consequence, run lengths are much more compressible, while match lengths are not. It perfectly shows that the more squeezed the distribution, the greater Range0's compression advantage.

One could think that run lengths are a typical situation which should always benefit Range0. Alas, it is not that simple. Such a conclusion is biased, the result of focusing too much on enwik8 for tests.
Most binary files feature a very different pattern : much less frequent matches, but much longer ones. As a consequence, literals tend to be quite a bit more numerous, and their compressibility is not guaranteed either. And as a side effect, run lengths end up being larger.

This is shown by the office example : although the distribution is still "squeezed", resulting in a pretty good 3x compression ratio, this is still not enough to make Range0 distinctly better. In fact, considering the very small difference with Huff0, it's not worth the speed impact. This is in contrast with the enwik8 results.

This means we should not assume run lengths are always better compressed by Range0; a more complex selection process is required.
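
As a sketch only : one simple criterion could look at the dominant symbol's frequency, since a Huffman coder cannot spend less than 1 bit per symbol, while a range coder can. A heavily skewed histogram is thus where Range0 pulls ahead. Names and the threshold below are purely illustrative:

    #include <stddef.h>

    typedef enum { USE_HUFF0, USE_RANGE0 } Coder;

    /* Hypothetical selector: when one symbol dominates the histogram,
       Huffman's 1-bit-per-symbol floor leaves compression on the table,
       so the slower Range0 becomes worth its cost. */
    static Coder select_coder(const unsigned count[256], size_t total)
    {
        unsigned maxCount = 0;
        int s;
        for (s = 0; s < 256; s++)
            if (count[s] > maxCount) maxCount = count[s];
        /* illustrative threshold : dominant symbol above ~70% of occurrences */
        return ((size_t)maxCount * 10 > total * 7) ? USE_RANGE0 : USE_HUFF0;
    }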

Monday, January 3, 2011

Range0 : new version (v0.7)

Several learnings from the new Huff0 version were generic enough to be applicable to Range0 as well. More specifically, the interleaved counters and the call to memcpy() resulted in significant improvements, as displayed below :

                       Range0 v0.6      Range0 v0.7
                       R    C    D      R    C    D 
Not compressible 
enwik8.7z            1.000 870 1400   1.000 885 1930 
Hardly compressible 
audio1               1.071 174   83   1.071 174   83 
Distributed 
enwik8-lz4-lit       1.370 155   76   1.370 155   76 
Lightly ordered 
enwik8-lz4-offh      1.191 138   76   1.191 138   76 
Ordered 
enwik8-lz4-ml        2.946 155   83   2.946 160   83
Squeezed 
enwik8-lz4-run       4.577 163  116   4.577 180  116 
Ultra compressible
loong                1430. 362  427   1430. 555  427
(R = ratio; C, D = compression and decoding speed, in MB/s)


The memcpy() change works wonders at improving speed on incompressible segments.
More importantly, the interleaved counters speed up compression of squeezed distributions.

The ultra-compressible corner case is not so important, in spite of the huge performance benefit, but the squeezed-distribution gain is, since this is Range0's most likely use case. At 180 MB/s, Range0 becomes almost as fast as standard Huffman coders, which basically means extra compression performance for free.

The bad point is, and will remain, the decoding speed, which cannot beat Huffman, due to the presence of a division in the main loop. However, the decoding speed is not too bad for the specific "squeezed" situation Range0 is intended for. Indeed, should the distribution become even more squeezed, decoding speed would become even better.
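
For illustration, here is the general shape of a range-decoder step (a generic sketch, not Range0's actual code), showing where the division sits; a Huffman decoder gets away with shifts and table lookups instead:

    /* 32-bit carryless range-decoder state (illustrative) */
    typedef struct {
        unsigned low, range, code;
    } RangeDec;

    /* Find where 'code' falls within the current range : this runs once
       per symbol, and the divisions are what cap decoding speed. */
    static unsigned rd_get_freq(RangeDec* rd, unsigned totFreq)
    {
        rd->range /= totFreq;
        return (rd->code - rd->low) / rd->range;
    }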

Evolving the benchmark corpus to use LZ4HC output, instead of fast LZ4 output, is likely to make this statement even more relevant.

For the time being, Range0 can be downloaded here.

Sunday, January 2, 2011

Huff0 : New version (v0.6)

While working on the stats produced by Huff0, some inefficiencies became apparent, and as a consequence they got solved in a newer version, available on the new Huff0 homepage.

To properly understand what was changed, here is a performance snapshot, comparing v0.5 (old) with v0.6 (new).

                       Huff0 v0.5       Huff0 v0.6 
                       R    C    D      R    C    D 
Not compressible 
enwik8.7z            1.000 740 1400   1.000 810 1930 
Hardly compressible 
audio1               1.070 285  280   1.070 285  280 
Distributed 
enwik8-lz4-lit       1.363 210  200   1.363 210  200 
Lightly ordered 
enwik8-lz4-offh      1.188 188  181   1.188 188  181 
Ordered 
enwik8-lz4-ml        2.915 200  197   2.915 210  197 
Squeezed 
enwik8-lz4-run       4.284 209  218   4.284 235  218 
Ultra compressible
loong                278.0 450 2930   278.0 785 2930
(R = ratio; C, D = compression and decoding speed, in MB/s)

Only speed is affected by these changes; the compression ratio remains strictly identical.

The first change is in the way data is scanned. This is a nifty trick, related to CPU behavior. When a counter is updated, it should not be updated again immediately, otherwise there is a "cache" penalty : the previous change must be fully committed before the new one gets accepted.
In order to avoid this penalty, we interleave the updates across several counter tables, so that the CPU gets enough time to deal with repeated changes to the same value. It makes the code slightly more complex and the data structure a bit bigger, but it is definitely worth it : although it slightly hurts situations such as "not compressible", the more repetitive the data, the bigger the benefit.
We end up with a grand +70% improvement for the "ultra compressible" corner case. It is not only corner cases that benefit, though : the "ordered" distribution gets a nice +5% boost, while squeezed ones get a bit more than +10%.
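
Here is a minimal sketch of the idea, assuming 4 partial tables merged at the end; names and the table count are illustrative, not Huff0's actual code:

    #include <stddef.h>
    #include <string.h>

    #define NB_TABLES 4        /* illustrative; enough to hide update latency */

    /* Count byte frequencies with interleaved partial tables, so that two
       consecutive identical bytes never hit the same counter back-to-back. */
    static void count_bytes(const unsigned char* src, size_t len, unsigned count[256])
    {
        unsigned tables[NB_TABLES][256];
        size_t i = 0;
        int s;
        memset(tables, 0, sizeof(tables));

        for (; i + NB_TABLES <= len; i += NB_TABLES) {
            tables[0][src[i+0]]++;
            tables[1][src[i+1]]++;
            tables[2][src[i+2]]++;
            tables[3][src[i+3]]++;
        }
        for (; i < len; i++) tables[0][src[i]]++;     /* remaining tail */

        /* merge the partial tables into the final histogram */
        for (s = 0; s < 256; s++)
            count[s] = tables[0][s] + tables[1][s] + tables[2][s] + tables[3][s];
    }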

The second change is in the way incompressible data is detected. While the end result is still slower than Range0, there is a pretty nice performance boost from betting earlier on the incompressible nature of the data just scanned. It avoids triggering the later stages of the algorithm, thus saving time and energy. This is a nice +10% boost.
While this gain is only visible on the "not compressible" example, it can still bring benefits in real situations. It is not uncommon for some parts of large files to be incompressible. For example, an ISO file may contain a few pictures in compressed format. In such cases, the better the detection, the faster the compression, since any attempt to compress such data is likely to fail or end up with miserable gains.
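
One possible shape for such an early bet, assuming a sampled entropy estimate; the sample size and threshold are illustrative guesses, not Huff0's actual values:

    #include <math.h>
    #include <stddef.h>

    /* Histogram a small sample, estimate its Shannon entropy, and when the
       projected saving is negligible, store the segment verbatim rather
       than running the full coder. */
    static int sample_looks_incompressible(const unsigned char* src, size_t len)
    {
        size_t const sample = (len < 16384) ? len : 16384;
        unsigned count[256] = { 0 };
        size_t i;
        int s;
        double bits = 0.0;

        for (i = 0; i < sample; i++) count[src[i]]++;
        for (s = 0; s < 256; s++) {
            if (!count[s]) continue;
            bits -= (double)count[s] * log2((double)count[s] / (double)sample);
        }
        /* projected saving below ~2% : bet on "incompressible" */
        return bits >= 7.84 * (double)sample;      /* 7.84 = 8 bits * 0.98 */
    }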

The third change is really straightforward : I've removed my hand-made "copy" operation, and swapped it with the standard memcpy() call of the C library.
Although it only matters when some segments are incompressible, in that case the gain is really noticeable, at about +30%. For just a simple function call, this is really worth it.
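
The swap itself, in its simplest form (function names are illustrative):

    #include <string.h>

    /* before : hand-made byte copy */
    static void copy_raw_loop(unsigned char* dst, const unsigned char* src, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++) dst[i] = src[i];
    }

    /* after : a single library call; memcpy selects an optimized path
       (wide moves, alignment handling) that the naive loop doesn't get */
    static void copy_raw(unsigned char* dst, const unsigned char* src, size_t n)
    {
        memcpy(dst, src, n);
    }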

I still have to understand how the call to memcpy() can be faster than my own simple loop. There are probably some very interesting learnings behind it.

As a side comment, I tried the same trick on the "ultra compressible" file, replacing my own simple loop with a memset(). It resulted in no speed difference. I then tried to improve the simple loop with more complex parallel execution, but it resulted in a loss.
My only explanation so far is that the compiler probably translates my simple loop into a memset(), transparently.

Anyway, please feel free to test.
The binary can be downloaded at the Huff0 homepage.