Monday, December 23, 2013

Zhuff v0.9, or a first FSE application

As a quick follow-up to last week's release of FSE, I wanted to produce a quick demo showing the kind of gains to expect from this new entropy algorithm.

And a simple example shall be Zhuff.

Zhuff is, basically, LZ4 and Huff0 stuffed together. It's very fast in spite of this 2-pass process. Its major limitation so far is a 64KB window size, inherited from LZ4, but that's something we'll talk about again in a later post.
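
To picture this 2-pass structure, here is a minimal sketch - not Zhuff's actual code, whose internal stream layout differs - that simply chains the one-shot entry points of lz4.h and fse.h and entropy-codes the whole LZ output; the exact prototypes are assumptions to check against your local headers.

#include <stdlib.h>
#include "lz4.h"   /* LZ4_compress_default(), LZ4_compressBound() */
#include "fse.h"   /* FSE_compress(), FSE_isError() */

/* Pass 1 : fast LZ stage (64KB window). Pass 2 : entropy stage over its output.
   Returns the compressed size, or 0 on error / incompressible input. */
static size_t two_pass_compress(void* dst, size_t dstCapacity,
                                const void* src, size_t srcSize)
{
    int const lzBound = LZ4_compressBound((int)srcSize);
    char* const tmp = (char*)malloc((size_t)lzBound);
    size_t cSize = 0;
    if (tmp == NULL) return 0;
    {   int const lzSize = LZ4_compress_default((const char*)src, tmp,
                                                 (int)srcSize, lzBound);
        if (lzSize > 0) {
            size_t const fseSize = FSE_compress(dst, dstCapacity, tmp, (size_t)lzSize);
            if (!FSE_isError(fseSize)) cSize = fseSize;
        }
    }
    free(tmp);
    return cSize;
}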

For the time being, my only intention is to swap Huff0 for FSE, and look at the results.

And the results are positive.
Here are some benchmark numbers, measured on a Core i5-3340M @ 2.7 GHz, Windows 7 64-bit, single-threaded, using the fast algorithm:

Filename  Compressor   Ratio  Decoding speed
silesia   zhuff v0.8   2.783     275 MB/s
silesia   zhuff v0.9   2.805     311 MB/s

enwik8    zhuff v0.8   2.440     200 MB/s
enwik8    zhuff v0.9   2.460     230 MB/s

calgary   zhuff v0.8   2.672     220 MB/s
calgary   zhuff v0.9   2.693     250 MB/s

(Note: I've not mentioned improvements in compression speed; they are measurable, but unrelated to FSE, so they fall outside the scope of this report.)

So, in a nutshell, we get a moderate (>10%) increase in decoding speed, and a small improvement in compression ratio. Nothing earth-shattering, but still worth grabbing.
The situation would have been even better for FSE if higher-probability symbols had to be compressed (for instance, a symbol with 90% probability costs at least 1 bit with Huffman, but only about 0.15 bit at its optimal cost), but Zhuff is a simple, fast LZ algorithm, so it doesn't exactly qualify as generating "high probabilities".

The purpose of this demo was just to show that FSE can easily replace Huffman and improve results. As a side effect, it makes Zhuff an interesting playground to test future FSE improvements.

With this part now settled, the next articles shall describe FSE principles, especially how and why it works.

8 comments:

  1. Hi Yann,
    Maybe you would consider adding a 2^16-size alphabet option (2 bytes at once) for the entropy coder - about 1MB of tables (2^18 states), so it still fits into cache, but it would be almost twice as fast and have a better compression ratio (close to order 1).
    The first block could go without a header (LZ4 only), or contain the distribution for single bytes - for pairs you could use the products of those single-byte frequencies (see the sketch at the end of this comment).
    Pairs that never appear would then get a single occurrence in the table.
    Best,
    Jarek

    ps. I thought about omitting the "nbBits" - it would take about 3 "if"s with "x&mask==0" ( http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog ). "Return the Most Significant Bit position" sounds like a frequent operation - I don't understand why they don't implement it in processors ...
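
    Going back to the pair-distribution idea above, a minimal sketch (my own names, not FSE/Zhuff code): start from the single-byte counts and give each pair a weight proportional to the product, with a floor of one occurrence; these weights would then be normalized to the table size as usual.

    #include <stdint.h>

    /* Seed a 2^16-symbol distribution from single-byte counts ;
       pairs never observed still receive a minimal weight of 1. */
    void seed_pair_weights(const uint32_t byteCount[256], uint64_t pairWeight[65536])
    {
        for (int i = 0; i < 256; i++)
            for (int j = 0; j < 256; j++) {
                uint64_t const w = (uint64_t)byteCount[i] * byteCount[j];
                pairWeight[i*256 + j] = (w > 0) ? w : 1;   /* floor at one occurrence */
            }
    }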

    Replies
    1. Yes, larger alphabet input is part of the requests from Bulat. It's on the "todo list".
      Now, regarding the "order-1 equivalent", this part is more complex and deserves a full blog post, so I won't get into details in this comment section. Let's just say that the header cost starts to become a problem.

  2. Oh, such an operation is actually implemented: http://en.wikipedia.org/wiki/Find_first_set
    So "nbBits" can be cheaply omitted - you store the new states unshifted and can easily deduce nbBits from "count leading zeros".

    Replies
    1. Actually, that's what the current code is doing, although the calculation of nbBits is only done once, when creating the table.

    2. It was a comment regarding omitting the "nbBits" to reduce memory requirements - like from 1.25MB to 1MB for a 2^16 alphabet and 2^18 states.
      If the processor implementation of this operation is fast, the speed difference should be negligible.

      Indeed, there is a subtle difference between a 2^16 alphabet and an order-1 method. However, both capture the correlation with the neighboring byte, so I think their rates should be similar?

      The header issue and initialization time change the situation for a 2^16-size alphabet - probably the first block should be shorter and without a header, while other blocks should be longer.

    3. > It was a comment regarding omitting the "nbBits" to reduce memory requirements - like from 1.25MB to 1MB for 2^16 alphabet and 2^18 states.

      It is important to note that a table structure must be aligned.
      Hence a cell must be either 2, 4, or 8 bytes - not something "odd". The compiler will automatically round up the cell size to keep everything properly aligned (unless instructed otherwise, but I won't go in that direction, for too many reasons to list here).

      For a 2^16 alphabet, it then makes sense to remove nbBits, since it saves one byte.
      But then it's also necessary to keep the table size at 16 bits; otherwise, the state field will have to grow (from 16 bits to 32 bits).

      Well, as you see, it's an optimization exercise. It's certainly possible, just neither "straightforward" nor easy.
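
      To make the rounding concrete, here is a tiny sketch - illustrative field layouts, not FSE's actual cell definitions - whose sizeof results show how the compiler pads each candidate cell; only the 4-byte variants keep 2^18 cells at the 1MB budget discussed above.

      #include <stdint.h>
      #include <stdio.h>

      typedef struct { uint16_t newState; uint8_t  symbol; uint8_t nbBits; } cellByteAlphabet;  /* 4 bytes */
      typedef struct { uint16_t newState; uint16_t symbol; uint8_t nbBits; } cellPairAlphabet;  /* 5 -> padded to 6 */
      typedef struct { uint16_t newState; uint16_t symbol; }                cellPairNoBits;     /* 4 bytes */
      typedef struct { uint32_t newState; uint16_t symbol; }                cellPairBigState;   /* 6 -> padded to 8 */

      int main(void)
      {
          printf("%zu %zu %zu %zu\n", sizeof(cellByteAlphabet), sizeof(cellPairAlphabet),
                 sizeof(cellPairNoBits), sizeof(cellPairBigState));
          return 0;
      }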

      > Indeed there is a subtle difference between 2^16 alphabet and order 1 method. However, both keep the correlation with the neighbor, so I think their rates should be similar?

      Well, not exactly.
      Order-1 means: take the previous symbol into consideration to guess the next one.
      With a 2-byte alphabet, this correlation is only "integrated" for even-to-odd symbol transitions, but any correlation for odd-to-even transitions is lost.
      So basically, it's only "half good" compared to correct order-1.

  3. Sure, I completely agree that since LZ4 works with bytes, the number of bits should be aligned with the byte structure: 8 or 16.
    However, for general data, e.g. a 12-bit alphabet also sounds reasonable - especially with separate code (maybe also tables) for even and odd steps.

    When nbBits is removed, indeed the new state should be stored unshifted (state = state<1/8 for 2^18 states).

    Indeed, I completely agree that this extension is not exactly straightforward. Maybe a more precise symbol spread would also be useful here, but even using a heap sounds too costly...

    > So basically, it's only "half good" compared to correct order-1.
    Are you certain? It was also my first thought, but now I have mixed feelings about it: in both cases we use the correlation with a single neighbor to reduce uncertainty/entropy.

    ps. Look at the optimization I've described under the second article - it should speed things up by about 1.5x (you can increase the read to 6 bytes with a 64-bit buffer) ...

    Replies
    1. I had mixed feelings about this difference between order-1 on bytes and order-0 on 16-bit blocks, so to make sure, I have just checked it for a Markov source - the 16-bit block case has entropy exactly in the middle between order-0 and order-1 on bytes - "order 1/2" is indeed a perfect name for this case.

      So while 256 tables for a 256-size alphabet and a single table for a 2^16-size alphabet would both need a similar amount of memory, the first one gives better compression, while the second is slightly faster.
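
      In equation form, this is just the chain rule: for a stationary order-1 source, H(X1,X2) = H(X1) + H(X2|X1) = H0 + H1 per pair of bytes, hence (H0 + H1)/2 per byte - exactly halfway between the order-0 and order-1 rates.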
