Wednesday, October 14, 2015

Huffman revisited, Part 4 : Multi-bytes decoding

 In most Huffman implementations I'm aware of, decoding symbols is achieved in a serial fashion, one-symbol-after-another.

Decoding fast is not that trivial, but it has been well studied already. Eventually, the one symbol per decoding operation becomes its upper limit.

Consider how work a fast Huffman decoder : all possible bit combinations are pre-calculated into a table, of predefined maximum depth. For each bit combination, it's a simple table lookup to get the symbol decoded and the number of bits to consume.

Huffman Table lookup (example)

More complex schemes may break the decoding into 2 steps, most notably in an attempt to reduce look-up table sizes and still manage to decode symbols which exceed table depth. But it doesn't change the whole picture : that's still a number of operations to decode a single symbol.

In an attempt to extract more speed from decoding operation, I was curious to investigate if it would be possible to decode more than one symbol per lookup.

Intuitively, that sounds plausible. Consider some large Huffman decoding table, there is ample room for some bit sequences to represent 2 or more unequivocal symbols. For example, if one symbol is dominant, it only needs 1 bit. So, with only 2 bits, we have 25% chances to get a sequence which means "decode 2 dominant symbols in a row", in a single decode operation.

This can be visualized on below example :

Example of small single-symbol decoding table

which can be transformed into :

Example of multi-symbols decoding table

In some ways, it can look reminiscent of Tunstall codes, since we basically try to fit as many symbols as possible into a given depth. But it's not : we don't guarantee reading the entire depth each time, the number of bits read is still variable, just more regular. And there is no "order 1 correlation" : probabilities remain the same per symbol, without depending on prior prefix.

Even with above table available, there is still the question of using it efficiently. It doesn't make any good if a single decoding step is now a lot more complex in order to potentially decode multiple symbols. As an example of what not to do, a straightforward approach would be to start decoding the first symbol, then figure out if there is some place left for another one, proceed with the second symbol, then test for a 3rd one, etc. Each of these tests become an unpredictable branch, destroying performance in the process.

The breakthrough came by observing LZ decompression process such as lz4 : it's insanely fast, because it decodes matches, aka. suite of symbols, as a single copy operation.
This is in essence what we should do here : copy a sequence of multiple symbols, and then decide how many symbols there really is. It will avoid branches.
On current generation CPU, copying 2 or 4 bytes is not much slower than copying a single byte, so the strategy is effective. Overwriting same position is also not an issue thanks to modern cache structure.

With this principle settled, it now requires an adapted lookup table structure to work with. I finally settled with these ones :
Huffman lookup cell structure

The double-symbols structure could seem poorly ambitious : after all, it is only able to store up to 2 symbols into the `sequence` field. But in fact, tests will show it's a good trade-off, since most of the time, 2 symbols is what can be reasonably stored into a table lookup depth.

Some quick maths : depth of a lookup table is necessarily limited, in order to fit into memory cache where access times are best. An Intel's cpu L1 data cache is typically 32 KB (potentially shared due to hyper-threading). Since no reasonable OS is single-threaded anymore, let's not use the entire cache : half seems good enough, that's 16 KB. Since a single cell for double-symbols is now 4 bytes (incidentally, the same size as FSE decoder), that means 4K cells, hence a maximum depth of 12 bits. Within 12 bits, it's unlikely to get more than 2 symbols at a time. But this conclusion entirely depends on alphabet distribution.

This limitation must be balanced with increased complexity for table lookup construction. The quad-symbols one is significantly slower, due to more fine-tuned decisions and recursive nature of the algorithm, potentially defeating inlining optimizations. Below graph show the relative speed of each construction algorithm (right side, in grey, is provided for information, since if target distribution falls into this category, Huffman entropy is no longer a recommended choice).



Lookup table construction speed

The important part is roughly underlined in red boxes, showing areas which are relevant for some typical LZ symbols. The single-symbol lut construction is always faster, significantly. To make sense, slower table construction must be compensated by improved symbol decoding speed. Which, fortunately, is the case.

Decoding speed, at 32 KB block

As suspected, the "potentially faster" quad-symbols variant is hampered by its slower construction time. It manages to become competitive at "length & offset" area, but since it costs 50% more memory, it needs to be unquestionably better to justify that cost. Which is the case as alphabet distribution become more squeezed. By that time though, it becomes questionable if Huffman is still a reasonable choice for the selected alphabet, since its compression power will start to wane significantly against more precise methods such as FSE.
The "double-symbols" variant, on the other hand, takes off relatively fast and dominate the distribution region where Huffman makes most sense, making it a prime contender for an upgrade.

By moving from a 260 MB/s baseline to a faster 350-450 MB/s region, the new decoding algorithm is providing fairly sensible gains, but we still have not reached the level of previous multi-stream variant, which gets closer to 600 MB/s. The logical next step is to combine both ideas, creating a multi-streams multi-symbols variant. A challenge which proved more involving than it sounds. But that's for another post ...

8 comments:

  1. Very interesting. In cached compression case (squashfs, ...) we don't care of the compression speed/time/memory. It's where quad symbol can be interesting mostly for the decompression speed.

    ReplyDelete
  2. Hi Yann,
    You can do something similar with FSE - when the succeeding symbol(s) is uniquely determined by the current state only - what happens for high probability symbols (pr > 0.5).
    It is always true when there are no bits read (just growing state).
    When a single bit is read, the new state is x or x+1, if both have identical symbol (successive occurrences), you could read one more symbol and do the "+1" then.
    But generally it sounds reasonable only while having a dominating symbol, like in your Proba80, where you could just group two symbols as the alphabet is small.

    Eventually, the number of bits for renormalization could be increased from the standard 1, making "no read bits" situation more frequent.
    Best,
    Jarek

    ReplyDelete
    Replies
    1. As you guessed, the issue with FSE is that it would only work for dominating symbols > 50%, and even then, it would only be worthwhile if one symbol is really frequent (such as P80).
      So it's more of a corner case.

      Delete
  3. A different view: what you are doing is grouping a varying number of symbols, such that the probability of each group is above some threshold (e.g. 1/4096).
    You could perform such groping separately (remembering about the prefix condition), additionally improving the order (compression ratio).

    So we need a prefix tree, but with symbols in branches instead of bits - degree of nodes is alphabet size instead of 2.
    For example assuming dominating symbol A, such prefix-free grouping can be:
    B,C,D, ..., AA, AB, AC, AD ...
    doubling the size of alphabet, but also increasing speed and order of the method (compression ratio).
    the next step would be tripling the alphabet:
    B, C, D, ... AB, AC, AD, ... AAA, AAB, AAC, AAD ...

    An example of general expansion algorithm:
    until exceeding some alphabet size, take the most probable leaf and expand it (in above examples, first A was expanded, then AA).

    Remember that tANS is not limited to 256 symbols. Using 4096 states, you could easily handle even >1000 size alphabet - it just approximates probabilities as q/4096 (a bit better with tuning).

    ReplyDelete
    Replies
    1. It sounds interesting, but we should keep in mind the objective.
      The reason I've tested multi-bytes decoding for Huffman is speed. And within this speed budget, there is "table construction" phase.
      It's just counter-productive to plan some gain on the decoding side if the table construction time now dominates the total operation.

      Also, note that FSEU16, the variant which can handle alphabet > 256, is slower than the regular FSE variant.

      Delete
    2. Sure, the objective is to speedup - by processing a few symbols at once.

      Initialization is first going once through all the positions (e.g. 4092) for your fast spread, then again to build the tables ( https://dl.dropboxusercontent.com/u/12405967/tANSalg.png ).
      So increasing the alphabet doesn't cost time.
      Its memory cost is mainly (+start+nb) a few more bits for the symbol in decodingTable, but it can be handled by a bit compression: 2 bytes are sufficient for "symbol + nbBits".

      Assume there is this various length grouping of symbols (e.g. B,C,D, ..., AA, AB, AC, AD ... ) - for example chosen from counts of the previously decoded data frame (e.g. 32kB).
      Now, for the next data frame, the encoder counts occurrences of these grouped symbols (improving order of the method), the decoder decodes groups of symbols, exactly as in your post.

      The main cost is increased header: decoder needs to know frequencies on the extended alphabet (B,C ..., AA, AB ...) - for improving the order of the method (compression ratio).
      It could be compensated by increasing data frame, and generally there should be stored differences of frequencies comparing to the previous frame...

      Delete
    3. I can only invite you to test these great ideas on a custom FSE version. I can guess multiple issues that will have to be solved while making this direction viable, and maybe you could solve them all.

      Delete
    4. If I only had time to work on everything I see valuable and interesting ...
      Best,
      Jarek

      Delete