Therefore, since such an effort is required, it is also a good time to have a fresh look at entropy coders.
Entropy coding is generally not used as a direct compression method. Its better use is *after* modeling.
For example, after an LZ77 compression pass, we end up with many match lengths, whose distribution is not random : short lengths are much more common than long ones.
An entropy coder will ensure that frequent symbols get coded using fewer bits than uncommon ones. That's the basis for better compression results.
The most renowned entropy coder is probably the Huffman one. It is no longer considered top of the line, but it continues to be used in many mainstream applications, due to its simplicity and speed.
Huffman guarantees that, given a symbol distribution, the best possible combination of whole bits per symbol will be created. That's a strong property. For example, with four symbols of probabilities 50%, 25%, 12.5% and 12.5%, Huffman assigns codes of 1, 2, 3 and 3 bits, which is exactly optimal.
Since Huffman proved that his method was the best possible under this constraint, entropy coders received little attention until a few decades later, when arithmetic coders started to be implemented.
This new beast is a lot more complex to explain, so I'll invite you to read the Wikipedia entry to get more details. For the purpose of this post, it is enough to accept that an arithmetic coder can attribute the equivalent of a fractional bit to a symbol. Yes, it becomes possible to encode a symbol using, for example, 0.38 bits.
This may sound strange, but it's all mathematics.
By breaking the "integer" boundary, arithmetic coders allow for much better precision, and therefore better compression. This is especially important when a symbol's probability is higher than 50% : Huffman cannot do better than 1 bit per symbol, which is just right for 50%, but far from optimal for 80%. In such instances, an arithmetic coder will do much better.
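To make the fractional-bit idea concrete, here is a minimal sketch in C (an illustration, not taken from Huff0 or Range0) computing the theoretical cost of a symbol of probability p, which an arithmetic coder can approach while Huffman must round up to whole bits :

```c
#include <math.h>
#include <stdio.h>

/* Theoretical cost, in bits, of coding a symbol of probability p.
 * An arithmetic/range coder can get close to this value; a Huffman
 * coder must spend a whole number of bits (at least 1) per symbol. */
static double symbolCost(double p) { return -log2(p); }

int main(void)
{
    printf("p=0.50 : %.2f bits (Huffman: 1 bit)\n", symbolCost(0.50));  /* 1.00 */
    printf("p=0.80 : %.2f bits (Huffman: 1 bit)\n", symbolCost(0.80));  /* 0.32 */
    return 0;
}
```

For a symbol at 80% probability, roughly 0.32 bits are enough, where Huffman has to spend a full bit; summed over many symbols, that's where the compression gap comes from.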
A simpler variant of the arithmetic coder is called the Range coder. It's pretty much the same thing, except that it works on integer numbers, and, more importantly, is free from patents, unlike arithmetic coders.
In order to test both methods, I've created 2 programs, Huff0 and Range0.
Huff0 is a block-based Huffman entropy coder. Input data is divided into blocks of 16KB. The compressed output consists of a tree description, followed by the compressed block.
Smaller blocks allow faster adaptation, but cost more in headers, so a trade-off had to be chosen. 16KB was selected as being mostly optimal for distributed input, such as literals, or most files.
Range0 is a block-based Range entropy coder. Input data is divided into blocks of 128KB. The compressed output consists of a frequency count, followed by the compressed block.
Why 128KB ? Because the frequency count is a much heavier header than the Huffman tree. As a consequence, dividing data into smaller blocks results in a loss, in spite of better adaptation.
Such a trade-off between adaptation speed and header cost deserves a full study in itself, so we'll keep that for another post.
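To illustrate the block structure both coders share, here is a minimal sketch in C ; the names and constants are illustrative only, mirroring the block sizes described above rather than the actual Huff0/Range0 source :

```c
#include <stddef.h>
#include <string.h>

/* Illustrative constants, matching the block sizes described above */
#define HUFF0_BLOCK_SIZE   (16*1024)
#define RANGE0_BLOCK_SIZE  (128*1024)

/* Placeholder for the real per-block coder: an actual Huff0/Range0 block
 * starts with a header (tree description or frequency count), followed by
 * the entropy-coded payload. Here we just copy, to keep the sketch small. */
static size_t compressBlock(char* dst, const char* src, size_t srcSize)
{
    memcpy(dst, src, srcSize);
    return srcSize;
}

/* Block-based framing: the input is cut into fixed-size blocks,
 * each compressed independently with its own header. */
static size_t compressFrame(char* dst, const char* src, size_t srcSize, size_t blockSize)
{
    char* op = dst;
    while (srcSize > 0) {
        size_t const blockLen = (srcSize < blockSize) ? srcSize : blockSize;
        op += compressBlock(op, src, blockLen);
        src += blockLen;
        srcSize -= blockLen;
    }
    return (size_t)(op - dst);
}
```

The smaller the block, the more often the header is paid, which is why Huff0 (small tree header) can afford 16KB blocks while Range0 (heavier frequency count) prefers 128KB.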
In order to better understand the properties of each coder, I needed a benchmark corpus. I started by using regular files, but this makes little sense : as stated earlier, entropy coders are not used as direct compression tools, but rather as a "second stage" pass, applied to the statistics or probabilities output by a model.
So it sounded more reasonable to create files which mimic the statistics produced by such models. Here is what I ended up with :
| Type                | File            | Huff0 v0.5 % | C   | D    | Range0 v0.6 % | C   | D    |
|---------------------|-----------------|-------------:|----:|-----:|--------------:|----:|-----:|
| Not compressible    | enwik8.7z       | 100.01       | 740 | 1400 | 100.00        | 870 | 1400 |
| Hardly compressible | audio1          | 93.43        | 285 | 280  | 93.38         | 174 | 83   |
| Distributed         | enwik8-lz4-lit  | 73.35        | 210 | 200  | 73.00         | 155 | 76   |
| Lightly ordered     | enwik8-lz4-offh | 84.18        | 188 | 181  | 83.95         | 138 | 76   |
| Ordered             | enwik8-lz4-ml   | 34.30        | 200 | 197  | 33.94         | 155 | 83   |
| Squeezed            | enwik8-lz4-run  | 23.34        | 209 | 218  | 21.85         | 163 | 116  |
| Ultra compressible  | loong           | 0.36         | 450 | 2930 | 0.07          | 362 | 427  |
That's a lot of results for a single post. But there is a lot to learn from them.
The enwik8-lz4-* files were created by running LZ4 on enwik8 and extracting a specific field. This is the fastest LZ4 version, not the HC one, so these figures could change in the future.
% is the resulting size, as a percentage of the original file.
C is the compression speed, in MB/s.
D is the decoding speed, in MB/s.
Not compressible : as stated, this 7z entry is already compressed, so there is nothing left to gain.
The basic idea here is to detect this situation fast enough, and to switch to a backup copy mode.
The catch is that this mode should not be triggered too easily, otherwise we may lose some compression.
What immediately looks strange here is that Huff0 is slower than Range0. There is probably still some room for improvement.
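As a sketch of what such a fallback could look like (assumed logic, written for illustration rather than lifted from Huff0/Range0), one can compare output and input sizes after coding a block, and store a raw copy behind a 1-byte flag when compression does not pay :

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_RAW         0   /* block stored as-is         */
#define BLOCK_COMPRESSED  1   /* block stored entropy-coded */

/* Writes either the compressed block or a raw copy, whichever is smaller.
 * A real coder would also try to detect incompressible data early, before
 * spending time coding it, but without giving up too easily, or some
 * compression would be lost. */
static size_t writeBlock(char* dst, const char* src, size_t srcSize,
                         const char* compressed, size_t compressedSize)
{
    if (compressedSize >= srcSize) {
        dst[0] = BLOCK_RAW;
        memcpy(dst+1, src, srcSize);
        return srcSize + 1;
    }
    dst[0] = BLOCK_COMPRESSED;
    memcpy(dst+1, compressed, compressedSize);
    return compressedSize + 1;
}
```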
Hardly compressible : this is a good torture test for the compression detector. This entry is a wave file, 16 bits stereo. As a consequence, when handled directly by an entropy coder, half the data is purely random (the lower bits), and the other half has a small tendency to lean towards an average value. All in all, the achievable compression is less than stellar, but still measurable.
We are starting to witness a tendency that will only grow stronger : Range0 compresses better, but more slowly. In particular, the decoding speed of Range0 is an issue when compared with Huff0.
I will skip directly to
Squeezed : in this file, one of the symbols has a probability higher than 50%. And it pays off quite a lot for Range0. This is no longer a small 0.2% benefit, but a more significant 1.5% advantage. And this figure can only get better with the HC version of LZ4, which will produce even more squeezed output.
Ultra compressible : this is a test file, produced by Eugene Shelwien, to test limit conditions when compressing long streams of identical bytes. It is not a single repeated byte, however ; there are some changes from time to time, to make life harder for the coder.
What should be immediately visible is that Huff0 succeeds in producing a file much smaller than 1/8 of the original, the theoretical limit for a Huffman coder, which must spend at least 1 bit per byte. This is thanks to the relatively small block size : when a block consists of a single repeated byte, it gets a specific shortcut sequence. The decoding speed shows this is highly effective.
This is in contrast with Range0, whose decoding speed is *only* 427 MB/s. This is due to its larger block size, which means a block made of a single byte value is much rarer ; as a consequence, the block must be fully decoded, which is slower than the special escape sequence.
Note however that, as could be expected, Range0's compression is much better than Huff0's on this example.
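Here is a sketch of this kind of shortcut (assumed logic ; the actual Huff0 escape sequence may differ) : a block made of a single repeated byte is replaced by a flag plus that byte value, and decoding it boils down to a memset :

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SINGLE_BYTE  2   /* escape: the whole block is one repeated byte */

/* Returns 1 if the whole block is a single repeated byte value */
static int isSingleByteBlock(const char* src, size_t srcSize)
{
    size_t i;
    for (i = 1; i < srcSize; i++)
        if (src[i] != src[0]) return 0;
    return (srcSize > 0);
}

/* Encoder side: 2 bytes are enough to describe the whole block */
static size_t writeSingleByteBlock(char* dst, char byteValue)
{
    dst[0] = BLOCK_SINGLE_BYTE;
    dst[1] = byteValue;
    return 2;
}

/* Decoder side: a plain memset, which explains the very high decoding speed */
static void readSingleByteBlock(char* dst, size_t blockSize, char byteValue)
{
    memset(dst, byteValue, blockSize);
}
```

With 16KB blocks, such single-byte blocks are frequent in a file like loong, while with 128KB blocks they are much rarer, which matches the speed difference observed above.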
So, where does that lead us ?
Well, that's just a start. It provides some interesting ground for future comparisons, allowing targeted improvements to become measurable.
More on this in a future post...