RealTime Data Compression: 3/20/11

Saturday, March 26, 2011

Best of both worlds : mixed entropy

There is one thing that really amazed me when i started experimenting with entropy codings : Huffman is very good, in spite of its relatively old age and design simplicity. So good in fact that it can beat the much more modern, complex and slower Arithmetic coding in some circumstances.

For illustration, just have a quick look at entropy coders benchmark. The overall tendency is relatively clear : when no probability is too "large", the precision loss of Huffman is too small to be noticeable. Results are therefore relatively equivalent with Arithmetic.

Indeed, Huffman can even beat Arithmetic, and there are two good reasons for this :

- First, Huffman headers are much more compact than arithmetic ones. Since only bit length are necessary, it can be efficiently represented by a tree or a map, which can themselves be compressed easily. By contrast, arithmetic needs frequency count (or normalized frequency count), which are typically much higher values, with little opportunity for optimizations. As a consequence, Huffman headers are at least 4x smaller, and differences in the range of 6x-10x are common.

- Second : since headers are smaller, it's also possible to work with smaller Huffman blocs. Smaller blocs means statistics better describe the current bloc, which translates into improved compression ratio. This is called entropy adaptation.
As a consequence, it is possible to work with blocs of 16KB with huff0 (below this threshold, header costs start to override entropy adaptation benefits), while optimal bloc size for range0 is larger, at 128KB, due to big headers.
Entropy adaptation is the main cause which explains Huffman better compression ratio in some scenario.

However, on the other end, Arithmetic takes the lead, especially when one probability gets beyond 50%. And the closer it gets to 100%, the higher the difference. Huffman gets stuck to its best possible representation (1 bit per element), while arithmetic can continue to get benefits from higher probability levels. This is very visible on the lower part of benchmark table (results of 'enwik8-lz4hc-run' and 'loong').

Granted, these conclusions are related to the selected "semi-static" methodology, which requires a description header per bloc. A fully dynamic implementation would provide different results, much to the advantage of Arithmetic, since it suffers more from header size.
However, dynamic probability comes with its own price, namely speed, and wrong probability estimations.

In order to preserve high speed, i'm still working on block-based entropy coding.
However, in a will to improve compression ratio, i was wondering if it could be possible to benefit from both huff0 and range0 properties.

Enter HuffX, my new entropy coder, which for the time being is mostly a fusion of huff0 and range0, with a dynamic selector.

In its first incarnation, progresses are visible, as advertised in below benchmark results :

HuffX default with basic properties of huff0, featuring identical ratio, output and speed.
Things are getting interesting when input sequence is more squeezed : HuffX gradually provides more data blocs to range0.
In a very interesting "moot point", HuffX successfully outperforms both huff0 and range0, while retaining most of the speed of huff0. This is the kind of result i'm aiming at.
Then, HuffX ends up providing all blocs to range0, featuring ratio and speed close to it ... but not identical. HuffX still retains the 16KB bloc size of its older brother, which is a bit too small for pure range coding.

These first results show that improved compression ratio comes with a speed cost attached, as expected. Therefore, it still remains unclear if HuffX can find its way into fast compressors such as Zhuff or Etincelle.
Still some work ahead...