Tuesday, January 4, 2011

Entropy evaluation : small changes in benchmark corpus

You can't easily improve what you can't measure. That's why test tools are so important.
For compression, benchmark corpus is an essential tool to measure performance and check consistency.
Its main drawback, however, is to influence development in a way which make the benchmark look better, forgetting or even worsening situations which are not in the corpus.

Therefore, carefully choosing elements within the corpus, and even changing them from time to time, is a sensible thing to do.

For entropy evaluation, the different categories selected seem right. However, evaluating with just a single filetype example is misleading.
I started by changing the LZ4 output by the LZ4HC output, producing different pattern. I then added 2 files, Win98.vmdk, a virtual hard disk with many cab files, and Open Office directory, a mostly binary input. Here are the results :

Huff0 v0.6


RatioCompress DecodingRatioCompressDecoding
Not compressible

enwik8.7z1.000810 MB/s1.93 GB/s1.000885 MB/s1.93 GB/s
Hardly compressible

win98-lz4hc-lit1.024465 MB/s600 MB/s1.019374 MB/s262 MB/s
audio11.070285 MB/s280 MB/s1.071174 MB/s83 MB/s

enwik8-lz4hc-lit1.290205 MB/s194 MB/s1.296150 MB/s77 MB/s
Lightly Ordered

enwik8-lz4hc-offh1.131197 MB/s184 MB/s1.133145 MB/s79 MB/s

enwik8-lz4hc-ml2.309214 MB/s195 MB/s2.326160 MB/s77 MB/s

office-lz4hc-run3.152218 MB/s202 MB/s3.157164 MB/s98 MB/s
enwik8-lz4hc-run4.959245 MB/s224 MB/s5.788188 MB/s148 MB/s
Ultra compressible

loong278785 MB/s2.93 GB/s1430555 MB/s427 MB/s

There are several interesting learnings here.
Win98-lz4hc-lit is the literals part only extracted by lz4hc. But wait, why is it not into the "distributed" category ? Well, since this file contains many incompressible chunks, the literal sub-section end up being mostly incompressible. This is an important real-world example, showing that incompressible segment detection makes real impact.

lz4hc produces less literals and less matches, but longer ones, than the fast version. As consequence, run length are much more compressible, while match length are not. It perfectly shows that the more squeezed the distribution, the better Range0 compression advantage.

One could think that run length is a typical situation which should always benefit Range0. Alas, it is not that simple. Such conclusion is biaised, as a result of focusing too much on enwik8 for tests.
Most binary files feature a very different pattern : much less frequent matches, but much longer ones. As a consequence, literals tend to be quite more numerous, their compressibility being also not guaranteed. And as a side effect, run lengths end up being larger.

This is showed by office example : although the distribution is still "squeezed", resulting in a pretty good x3 compression ratio, this is still not enough to make Range0 distinctly better. In fact, considering the very small difference with Huff0, it's not worth the speed impact. This is in contrast with enwik8 results.

This means that we should not assume run length to be constantly better compressed by Range0, requiring a more complex selection process.

No comments:

Post a Comment