For compression, benchmark corpus is an essential tool to measure performance and check consistency.
Its main drawback, however, is to influence development in a way which make the benchmark look better, forgetting or even worsening situations which are not in the corpus.
Therefore, carefully choosing elements within the corpus, and even changing them from time to time, is a sensible thing to do.
For entropy evaluation, the different categories selected seem right. However, evaluating with just a single filetype example is misleading.
I started by changing the LZ4 output by the LZ4HC output, producing different pattern. I then added 2 files, Win98.vmdk, a virtual hard disk with many cab files, and Open Office directory, a mostly binary input. Here are the results :
|enwik8.7z||1.000||810 MB/s||1.93 GB/s||1.000||885 MB/s||1.93 GB/s|
|win98-lz4hc-lit||1.024||465 MB/s||600 MB/s||1.019||374 MB/s||262 MB/s|
|audio1||1.070||285 MB/s||280 MB/s||1.071||174 MB/s||83 MB/s|
|enwik8-lz4hc-lit||1.290||205 MB/s||194 MB/s||1.296||150 MB/s||77 MB/s|
|enwik8-lz4hc-offh||1.131||197 MB/s||184 MB/s||1.133||145 MB/s||79 MB/s|
|enwik8-lz4hc-ml||2.309||214 MB/s||195 MB/s||2.326||160 MB/s||77 MB/s|
|office-lz4hc-run||3.152||218 MB/s||202 MB/s||3.157||164 MB/s||98 MB/s|
|enwik8-lz4hc-run||4.959||245 MB/s||224 MB/s||5.788||188 MB/s||148 MB/s|
|loong||278||785 MB/s||2.93 GB/s||1430||555 MB/s||427 MB/s|
There are several interesting learnings here.
Win98-lz4hc-lit is the literals part only extracted by lz4hc. But wait, why is it not into the "distributed" category ? Well, since this file contains many incompressible chunks, the literal sub-section end up being mostly incompressible. This is an important real-world example, showing that incompressible segment detection makes real impact.
lz4hc produces less literals and less matches, but longer ones, than the fast version. As consequence, run length are much more compressible, while match length are not. It perfectly shows that the more squeezed the distribution, the better Range0 compression advantage.
One could think that run length is a typical situation which should always benefit Range0. Alas, it is not that simple. Such conclusion is biaised, as a result of focusing too much on enwik8 for tests.
Most binary files feature a very different pattern : much less frequent matches, but much longer ones. As a consequence, literals tend to be quite more numerous, their compressibility being also not guaranteed. And as a side effect, run lengths end up being larger.
This is showed by office example : although the distribution is still "squeezed", resulting in a pretty good x3 compression ratio, this is still not enough to make Range0 distinctly better. In fact, considering the very small difference with Huff0, it's not worth the speed impact. This is in contrast with enwik8 results.
This means that we should not assume run length to be constantly better compressed by Range0, requiring a more complex selection process.