RealTime Data Compression: Entropy evaluation : small changes in benchmark corpus

You can't easily improve what you can't measure. That's why test tools are so important.
For compression, benchmark corpus is an essential tool to measure performance and check consistency.
Its main drawback, however, is to influence development in a way which make the benchmark look better, forgetting or even worsening situations which are not in the corpus.

Therefore, carefully choosing elements within the corpus, and even changing them from time to time, is a sensible thing to do.

For entropy evaluation, the different categories selected seem right. However, evaluating with just a single filetype example is misleading.
I started by changing the LZ4 output by the LZ4HC output, producing different pattern. I then added 2 files, Win98.vmdk, a virtual hard disk with many cab files, and Open Office directory, a mostly binary input. Here are the results :

		Huff0 v0.6			Range0	v0.7
	Ratio	Compress	Decoding	Ratio	Compress	Decoding
Not compressible
enwik8.7z	1.000	810 MB/s	1.93 GB/s	1.000	885 MB/s	1.93 GB/s
Hardly compressible
win98-lz4hc-lit	1.024	465 MB/s	600 MB/s	1.019	374 MB/s	262 MB/s
audio1	1.070	285 MB/s	280 MB/s	1.071	174 MB/s	83 MB/s
Distributed
enwik8-lz4hc-lit	1.290	205 MB/s	194 MB/s	1.296	150 MB/s	77 MB/s
Lightly Ordered
enwik8-lz4hc-offh	1.131	197 MB/s	184 MB/s	1.133	145 MB/s	79 MB/s
Ordered
enwik8-lz4hc-ml	2.309	214 MB/s	195 MB/s	2.326	160 MB/s	77 MB/s
Squeezed
office-lz4hc-run	3.152	218 MB/s	202 MB/s	3.157	164 MB/s	98 MB/s
enwik8-lz4hc-run	4.959	245 MB/s	224 MB/s	5.788	188 MB/s	148 MB/s
Ultra compressible
loong	278	785 MB/s	2.93 GB/s	1430	555 MB/s	427 MB/s

There are several interesting learnings here.
Win98-lz4hc-lit is the literals part only extracted by lz4hc. But wait, why is it not into the "distributed" category ? Well, since this file contains many incompressible chunks, the literal sub-section end up being mostly incompressible. This is an important real-world example, showing that incompressible segment detection makes real impact.

lz4hc produces less literals and less matches, but longer ones, than the fast version. As consequence, run length are much more compressible, while match length are not. It perfectly shows that the more squeezed the distribution, the better Range0 compression advantage.

One could think that run length is a typical situation which should always benefit Range0. Alas, it is not that simple. Such conclusion is biaised, as a result of focusing too much on enwik8 for tests.
Most binary files feature a very different pattern : much less frequent matches, but much longer ones. As a consequence, literals tend to be quite more numerous, their compressibility being also not guaranteed. And as a side effect, run lengths end up being larger.

This is showed by office example : although the distribution is still "squeezed", resulting in a pretty good x3 compression ratio, this is still not enough to make Range0 distinctly better. In fact, considering the very small difference with Huff0, it's not worth the speed impact. This is in contrast with enwik8 results.

This means that we should not assume run length to be constantly better compressed by Range0, requiring a more complex selection process.

RealTime Data Compression

Featured

Tuesday, January 4, 2011

Entropy evaluation : small changes in benchmark corpus

No comments:

Post a Comment