RealTime Data Compression: Huffman revisited, Part 4 : Multi-bytes decoding

Wednesday, October 14, 2015

Huffman revisited, Part 4 : Multi-bytes decoding

In most Huffman implementations I'm aware of, decoding symbols is achieved in a serial fashion, one-symbol-after-another.

Decoding fast is not that trivial, but it has been already well studied. Eventually, the one symbol per decoding operation becomes its upper limit.

Consider how work a fast Huffman decoder : all possible bit combinations are pre-calculated into a table, of predefined maximum depth. For each bit combination, it's a simple table lookup to get the symbol decoded and the number of bits to consume.

Huffman Table lookup (example)

More complex schemes may break the decoding into 2 steps, most notably in an attempt to reduce look-up table sizes and still manage to decode symbols which exceed table depth. But it doesn't change the whole picture : that's still a number of operations to decode a single symbol.

In an attempt to extract more speed from decoding operation, I was curious to investigate if it would be possible to decode more than one symbol per lookup.

Intuitively, that sounds plausible. Consider some large Huffman decoding table, there is ample room for some bit sequences to represent 2 or more unequivocal symbols. For example, if one symbol is dominant, it only needs 1 bit. So, with only 2 bits, we have 25% chances to get a sequence which means "decode 2 dominant symbols in a row", in a single decode operation.

This can be visualized on below example :

Example of small single-symbol decoding table

which can be transformed into :

Example of multi-symbols decoding table

In some ways, it can look reminiscent of Tunstall codes, since we basically try to fit as many symbols as possible into a given depth. But it's not : we don't guarantee reading the entire depth each time, the number of bits read is still variable, just more regular. And there is no "order 1 correlation" : probabilities remain the same per symbol, without depending on prior prefix.

Even with above table available, there is still the question of using it efficiently. It doesn't make any good if a single decoding step is now a lot more complex in order to potentially decode multiple symbols. As an example of what not to do, a straightforward approach would be to start decoding the first symbol, then figure out if there is some place left for another one, proceed with the second symbol, then test for a 3rd one, etc. Each of these tests become an unpredictable branch, destroying performance in the process.

The breakthrough came by observing LZ decompression process such as lz4 : it's insanely fast, because it decodes matches, aka. suite of symbols, as a single copy operation.
This is in essence what we should do here : copy a sequence of multiple symbols, and then decide how many symbols there really are. It avoids branches.
On current generation CPU, copying 2 or 4 bytes is not much slower than copying a single byte, so the strategy is effective. Overwriting same position is also not an issue thanks to modern cache structure.

With this principle settled, it now requires an adapted lookup table structure to work with. I settled with these ones :

Huffman lookup cell structure

The double-symbols structure could seem poorly ambitious : after all, it is only able to store up to 2 symbols into the `sequence` field. But in fact, tests will show it's a good trade-off, since most of the time, 2 symbols is what can be reasonably stored into a table lookup depth.

Some quick maths : depth of a lookup table is necessarily limited, in order to fit into memory cache where access times are best. An Intel's cpu L1 data cache is typically 32 KB (potentially shared due to hyper-threading). Since no reasonable OS is single-threaded anymore, let's not use the entire cache : half seems good enough, that's 16 KB. Since a single cell for double-symbols is now 4 bytes (incidentally, the same size as FSE decoder), that means 4K cells, hence a maximum depth of 12 bits. Within 12 bits, it's unlikely to get more than 2 symbols at a time. But this outcome entirely depends on alphabet distribution.

This limitation must be balanced with increased complexity for table lookup construction. The quad-symbols one is significantly slower, due to more fine-tuned decisions and recursive nature of the algorithm, defeating inlining optimizations. Below graph show the relative speed of each construction algorithm (right side, in grey, is provided for information, since if target distribution falls into this category, Huffman entropy is no longer a recommended choice).

Lookup table construction speed

The important part is roughly underlined in red boxes, showing areas which are relevant for some typical LZ symbols. The single-symbol lut construction is always faster, significantly. To make sense, slower table construction must be compensated by improved symbol decoding speed. Which, fortunately, is the case.

Decoding speed, at 32 KB block

As suspected, the "potentially faster" quad-symbols variant is hampered by its slower construction time. It manages to become competitive at "length & offset" area, but since it costs 50% more memory, it needs to be unquestionably better to justify that cost. Which is the case as alphabet distribution become more squeezed. By that time though, it becomes questionable if Huffman is still a reasonable choice for the selected alphabet, since its compression power will start to wane significantly against more precise methods such as FSE.
The "double-symbols" variant, on the other hand, takes off relatively fast and dominate the distribution region where Huffman makes most sense, making it a prime contender for an upgrade.

By moving from a 260 MB/s baseline to a faster 350-450 MB/s region, the new decoding algorithm is providing fairly sensible gains, but we still have not reached the level of previous multi-stream variant, which gets closer to 600 MB/s. The logical next step is to combine both ideas, creating a multi-streams multi-symbols variant. A challenge which proved more involving than it sounds. But that's for another post ...

8 comments:

BRULE HermanOctober 14, 2015 at 1:49 PM
Very interesting. In cached compression case (squashfs, ...) we don't care of the compression speed/time/memory. It's where quad symbol can be interesting mostly for the decompression speed.
ReplyDelete
Replies
Jarek DudaOctober 16, 2015 at 7:24 AM
Hi Yann,
You can do something similar with FSE - when the succeeding symbol(s) is uniquely determined by the current state only - what happens for high probability symbols (pr > 0.5).
It is always true when there are no bits read (just growing state).
When a single bit is read, the new state is x or x+1, if both have identical symbol (successive occurrences), you could read one more symbol and do the "+1" then.
But generally it sounds reasonable only while having a dominating symbol, like in your Proba80, where you could just group two symbols as the alphabet is small.

Eventually, the number of bits for renormalization could be increased from the standard 1, making "no read bits" situation more frequent.
Best,
Jarek
ReplyDelete
Replies
Jarek DudaOctober 16, 2015 at 8:29 AM
A different view: what you are doing is grouping a varying number of symbols, such that the probability of each group is above some threshold (e.g. 1/4096).
You could perform such groping separately (remembering about the prefix condition), additionally improving the order (compression ratio).

So we need a prefix tree, but with symbols in branches instead of bits - degree of nodes is alphabet size instead of 2.
For example assuming dominating symbol A, such prefix-free grouping can be:
B,C,D, ..., AA, AB, AC, AD ...
doubling the size of alphabet, but also increasing speed and order of the method (compression ratio).
the next step would be tripling the alphabet:
B, C, D, ... AB, AC, AD, ... AAA, AAB, AAC, AAD ...

An example of general expansion algorithm:
until exceeding some alphabet size, take the most probable leaf and expand it (in above examples, first A was expanded, then AA).

Remember that tANS is not limited to 256 symbols. Using 4096 states, you could easily handle even >1000 size alphabet - it just approximates probabilities as q/4096 (a bit better with tuning).
ReplyDelete
Replies

Add comment