Monday, December 23, 2013

Zhuff v0.9, or a first FSE application

As a quick follow-up to last week's release of FSE, I wanted to produce a quick demo showing the kind of gains to expect from this new entropy algorithm.

And a simple example shall be Zhuff.

Zhuff is, basically, LZ4 and Huff0 stuffed together. It's very fast in spite of this 2-pass process. Its major limitation so far is a 64KB window size, inherited from LZ4, but that's something we'll talk about again in a later post.
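
To picture this 2-pass structure, here is a minimal sketch - not Zhuff's actual code, whose internal stream layout differs - that simply chains the one-shot entry points of lz4.h and fse.h and entropy-codes the whole LZ output; the exact prototypes are assumptions to check against your local headers.

#include <stdlib.h>
#include "lz4.h"   /* LZ4_compress_default(), LZ4_compressBound() */
#include "fse.h"   /* FSE_compress(), FSE_isError() */

/* Pass 1 : fast LZ stage (64KB window). Pass 2 : entropy stage over its output.
   Returns the compressed size, or 0 on error / incompressible input. */
static size_t two_pass_compress(void* dst, size_t dstCapacity,
                                const void* src, size_t srcSize)
{
    int const lzBound = LZ4_compressBound((int)srcSize);
    char* const tmp = (char*)malloc((size_t)lzBound);
    size_t cSize = 0;
    if (tmp == NULL) return 0;
    {   int const lzSize = LZ4_compress_default((const char*)src, tmp,
                                                 (int)srcSize, lzBound);
        if (lzSize > 0) {
            size_t const fseSize = FSE_compress(dst, dstCapacity, tmp, (size_t)lzSize);
            if (!FSE_isError(fseSize)) cSize = fseSize;
        }
    }
    free(tmp);
    return cSize;
}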

For the time being, my only intention is to swap Huff0 for FSE, and look at the results.

And the results are positive.
Here are some benchmark numbers, measured on a Core i5-3340M @ 2.7 GHz, Windows 7 64-bit, single-threaded, using the fast algorithm:

Filename  Compressor   Ratio  Decoding speed
silesia   zhuff v0.8   2.783     275 MB/s
silesia   zhuff v0.9   2.805     311 MB/s

enwik8    zhuff v0.8   2.440     200 MB/s
enwik8    zhuff v0.9   2.460     230 MB/s

calgary   zhuff v0.8   2.672     220 MB/s
calgary   zhuff v0.9   2.693     250 MB/s

(Note: I've not mentioned improvements in compression speed; they are measurable, but unrelated to FSE, so they fall outside the scope of this report.)

So, in a nutshell, we get a moderate (>10%) increase in decoding speed, and a small improvement in compression ratio. Nothing earth-shattering, but still worth grabbing.
The situation would have been even better for FSE if higher-probability symbols had to be compressed (for instance, a symbol with 90% probability costs at least 1 bit with Huffman, but only about 0.15 bit at its optimal cost), but Zhuff is a simple, fast LZ algorithm, so it doesn't exactly qualify as generating "high probabilities".

The purpose of this demo was just to show that FSE can easily replace Huffman and improve results. As a side effect, it makes Zhuff an interesting playground to test future FSE improvements.

With this part now settled, the next articles shall describe FSE principles, especially how and why it works.

8 comments:

  1. Hi Yann,
    Maybe you would consider adding a 2^16-size alphabet option (2 bytes at once) for the entropy coder - about 1MB of tables (2^18 states), so it still fits into cache, but it would be almost twice as fast and have a better compression ratio (close to order 1).
    The first block could go without a header (LZ4 only), or contain the distribution for single bytes - for pairs you could use the products of those single-byte frequencies (see the sketch at the end of this comment).
    Pairs that never appear would then get a single occurrence in the table.
    Best,
    Jarek

    ps. I thought about omitting the "nbBits" - it would take about 3 "if"s with "x&mask==0" ( http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog ). "Return the Most Significant Bit position" sounds like a frequent operation - I don't understand why they don't implement it in processors ...
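
    Going back to the pair-distribution idea above, a minimal sketch (my own names, not FSE/Zhuff code): start from the single-byte counts and give each pair a weight proportional to the product, with a floor of one occurrence; these weights would then be normalized to the table size as usual.

    #include <stdint.h>

    /* Seed a 2^16-symbol distribution from single-byte counts ;
       pairs never observed still receive a minimal weight of 1. */
    void seed_pair_weights(const uint32_t byteCount[256], uint64_t pairWeight[65536])
    {
        for (int i = 0; i < 256; i++)
            for (int j = 0; j < 256; j++) {
                uint64_t const w = (uint64_t)byteCount[i] * byteCount[j];
                pairWeight[i*256 + j] = (w > 0) ? w : 1;   /* floor at one occurrence */
            }
    }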

    Replies
    1. Yes, larger alphabet input is part of the requests from Bulat. It's on the "todo list".
      Now, regarding the "order-1 equivalent", this part is more complex and deserves a full blog post, so I won't get into details in this comment section. Let's just say that the header cost starts to become a problem.

  2. Oh, such an operation is actually implemented: http://en.wikipedia.org/wiki/Find_first_set
    So "nbBits" can be cheaply omitted - you store the new states unshifted and can easily deduce nbBits from "count leading zeros".

    Replies
    1. Actually, that's what the current code is doing, although the calculation of nbBits is only done once, when creating the table.

    2. It was a comment regarding omitting the "nbBits" to reduce memory requirements - like from 1.25MB to 1MB for a 2^16 alphabet and 2^18 states.
      If the processor implementation of this operation is fast, the speed difference should be negligible.

      Indeed, there is a subtle difference between a 2^16 alphabet and an order-1 method. However, both capture the correlation with the neighboring byte, so I think their rates should be similar?

      The header issue and initialization time change the situation for a 2^16-size alphabet - probably the first block should be shorter and without a header, while other blocks should be longer.

    3. > It was a comment regarding omitting the "nbBits" to reduce memory requirements - like from 1.25MB to 1MB for 2^16 alphabet and 2^18 states.

      It is important to note that a table structure must be aligned.
      Hence a cell must be either 2, 4, or 8 bytes - not something "odd". The compiler will automatically round up the cell size to keep everything properly aligned (unless instructed otherwise, but I won't go in that direction, for too many reasons to list here).

      For a 2^16 alphabet, it then makes sense to remove nbBits, since it saves one byte.
      But then it's also necessary to keep the table size at 16 bits; otherwise, the state field will have to grow (from 16 bits to 32 bits).

      Well, as you see, it's an optimization exercise. It's certainly possible, just neither "straightforward" nor easy.
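
      To make the rounding concrete, here is a tiny sketch - illustrative field layouts, not FSE's actual cell definitions - whose sizeof results show how the compiler pads each candidate cell; only the 4-byte variants keep 2^18 cells at the 1MB budget discussed above.

      #include <stdint.h>
      #include <stdio.h>

      typedef struct { uint16_t newState; uint8_t  symbol; uint8_t nbBits; } cellByteAlphabet;  /* 4 bytes */
      typedef struct { uint16_t newState; uint16_t symbol; uint8_t nbBits; } cellPairAlphabet;  /* 5 -> padded to 6 */
      typedef struct { uint16_t newState; uint16_t symbol; }                cellPairNoBits;     /* 4 bytes */
      typedef struct { uint32_t newState; uint16_t symbol; }                cellPairBigState;   /* 6 -> padded to 8 */

      int main(void)
      {
          printf("%zu %zu %zu %zu\n", sizeof(cellByteAlphabet), sizeof(cellPairAlphabet),
                 sizeof(cellPairNoBits), sizeof(cellPairBigState));
          return 0;
      }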

      > Indeed there is a subtle difference between 2^16 alphabet and order 1 method. However, both keep the correlation with the neighbor, so I think their rates should be similar?

      Well, not exactly.
      Order-1 means: take the previous symbol into consideration to guess the next one.
      With a 2-byte alphabet, this correlation is only "integrated" for even-to-odd symbol transitions, but any correlation for odd-to-even transitions is lost.
      So basically, it's only "half good" compared to correct order-1.

  3. Sure, I completely agree that since LZ4 works with bytes, the number of bits should be aligned with the byte structure: 8 or 16.
    However, for general data, e.g. a 12-bit alphabet also sounds reasonable - especially with separate code (maybe also tables) for even and odd steps.

    When nbBits is removed, indeed the new state should be stored unshifted (state = state<1/8 for 2^18 states).

    Indeed, I completely agree that this extension is not exactly straightforward. Maybe a more precise symbol spread would also be useful here, but even using a heap sounds too costly...

    > So basically, it's only "half good" compared to correct order-1.
    Are you certain? It was also my first thought, but now I have mixed feelings about it: in both cases we use the correlation with a single neighbor to reduce uncertainty/entropy.

    ps. Look at the optimization I've described under the second article - it should speed things up by about 1.5x (you can increase the read to 6 bytes with a 64-bit buffer) ...

    Replies
    1. I had mixed feelings about this difference between order-1 on bytes and order-0 on 16-bit blocks, so to make sure, I have just checked it for a Markov source - the 16-bit block case has entropy exactly in the middle between order-0 and order-1 on bytes - "order 1/2" is indeed a perfect name for this case.

      So while 256 tables for a 256-size alphabet and a single table for a 2^16-size alphabet would both need a similar amount of memory, the first one gives better compression, while the second is slightly faster.
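
      In equation form, this is just the chain rule: for a stationary order-1 source, H(X1,X2) = H(X1) + H(X2|X1) = H0 + H1 per pair of bytes, hence (H0 + H1)/2 per byte - exactly halfway between the order-0 and order-1 rates.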
