Tuesday, January 14, 2014

Huffman, a comparison with FSE

In this series explaining the behavior of Finite State Entropy, I find it interesting to make a quick comparison with Huffman entropy coding. It's a direct result of the description of its decoding principles; nonetheless, I believe it's useful to clearly establish the equivalence, since it will prove illustrative for future articles.

To better grasp the similarities, let's use a simple example, with the following probabilities, suitable for a Huffman tree:

A: 50% ; B: 25% ; C: 12.5% ; D: 12.5%

This tree only needs 8 values to be properly represented, and gives the following linear representation:

This representation uses the "high bit first" convention, which means we start our travel down the tree from the highest bit.
If it's 0, then we have an A. We stop there, shift the register left by one bit, and load a new bit at the bottom.
If the highest bit is 1, we know we are not done, and need to read the second bit. If this second bit is 0, we have a B. If not, we continue reading. And so on.
The decision tree looks as follows:

But we could have done it differently, using "low bit first". In this case, the linear representation of the tree looks like this:

Which seems a lot messier, but it's not: we simply reversed the bit order of the decision tree:

The two decision trees are therefore completely equivalent.

Now, look again at the "low bit first" linear representation. It's important to note that this is exactly what FSE would produce, and it also embeds a few additional properties:

• The "newStateBaseline" does not need to be calculated and stored in the decoding table: it can be determined automatically from the state itself: newStateBaseline = currentState & ~mask(nbBits);
• nbBits is stable for a given character: for A it's always 1, for B it's always 2, etc. This simplifies the decoding table, which does not need to store this information.

So basically, a Huffman low-bit-first table looks the same as an FSE table; it guarantees a stable number of bits per character, and newStateBaseline is automatically deduced from the remaining bits.

There is, however, a small but important difference.
A Huffman low-bit-first decoder would shift the current state right and load the next bit(s) at the top. In contrast, the FSE decoder keeps the remaining bits in place and only reloads the low bits. This creates a dependency on future states: once you've reached state 2 with FSE, for example, you can already guarantee that you won't see a D before you see a C. Which raises the question: how can the encoder select the right state, since it depends on characters that come later in the stream?

The answer: the encoder works in the backward direction, encoding characters from last to first, solving dependencies along the way.

To be continued...

1. Hi Yann,
Thanks for the interesting analogy.
So while Huffman decoding can be seen as a special case of arithmetic (de)coding - the produced bits correspond to renormalization, rescaling the subranges (additional precision is obtained from new digits) - ANS adds information in the least significant position instead of the most significant one, so for an analogy between them we should just reverse the digit/symbol order.

This symbol-distributing method would also work for general probability distributions: spread symbols in ranges of corresponding lengths, and then reverse the bit order of each of them - it would be fast, but I have just checked its deltaH and it is a few times worse than precise initialization (and depends on symbol order).
Maybe if, instead of reversing the bit order, we could somehow reverse the symbol order in ANS inside this number... it sounds complicated and probably slower than precise initialization with a heap, but I will think about it.
Thanks,
Jarek

1. ;-) Actually, this is my next article...

2. Ok, if you insist on a very quick, linear-time approximation of the n*log(n) precise initialization, here it is:
Create a table of nbStates pointers/cursors to stacks of (symbol, expected position) pairs - these are the buckets from which you take succeeding symbols.
Now perform the precise initialization procedure, but instead of using the real expected positions, insert each pair into the round(expected position) bucket, and just scan the buckets in order - something like this (the first argument of push/pop is the stack number):

for (s=0; s<nbSymbols; s++) push(round(1/(2*p_s)), (s, 1/(2*p_s)));
cp = 0;
for (i=0; i<nbStates; i++)     // scanning succeeding buckets
    while (nonempty(i))
    {   (s,pos) = pop(i); push(round(pos + 1/p_s), (s, pos + 1/p_s));
        symbol[cp++] = s; ... }

We could also use a larger number of buckets; alternatively, sorting inside these small buckets would still be cheap.

3. Yes, that's what my previous (unpublished) version did.

There is a big issue with using a bucket count equal to the size of the state table: we need at least one value to be "big", otherwise there are a lot of collisions to solve, which slows things down considerably.

It can be solved by using a bucket table larger than the final state table, but that costs more memory, more time, etc.

All in all, it resulted in a slower algorithm than what FSE is today. I remember jumping from 130 MB/s to 180 MB/s when switching to the new initialization method. I could even afford to divide the block size by 2 (and therefore double the number of table-building iterations).

4. Sure, it would be huge if done as a 2D table, but using pointer or cursor stacks, the total memory needed is proportional to the current number of elements inside - the size of the alphabet.
Here is how to do it cheaply with cursors:

int first[nbStates];    // first symbol in given bucket
int next[nbSymbols];    // next symbol in the same stack
float pos[nbSymbols];   // expected symbol position, could be U16 instead
for (i=0; i<nbStates; i++) first[i] = -1;   // empty buckets
for (s=0; s<nbSymbols; s++)
{   pos[s] = ip[s]/2;                       // ip[s] = 1/p[s]
    t = round(ip[s]/2);                     // which bucket
    next[s] = first[t]; first[t] = s; }     // push to stack t
cp = 0;                                     // position in decoding table
for (i=0; i<nbStates; i++)
    while ((s = first[i]) >= 0)
    {   first[i] = next[s];                 // remove from stack
        t = round(pos[s] += ip[s]);         // new expected position
        next[s] = first[t]; first[t] = s;   // insert at new position
        decodingTable[cp++] = s; ...
    }

Using a more accurate initialization allows achieving the same deltaH with a smaller table - gaining additional speedup.

5. Yann, I have finally briefly tested the GitHub symbol distribution, and its deltaH is a few to hundreds of times worse than for precise initialization (it puts symbols into clusters!).
You use step = 5/8 tS + 1.
Thanks to being closer to phi = (sqrt(5)-1)/2, I have checked that much better behavior (and less clustering) is obtained e.g. for
step = 5/8 tS - 3
However, it is still many times worse than precise initialization (and capricious) - you should get better behavior with a 2 or 4 times smaller table using the above procedure.

6. Hello Jarek
I tested your suggestion (5/8 tS - 3), but the results are not that good.

win98-lz4-run went from 4 671 589 to 4 673 712.
proba70.bin: from 166 027 to 166 031
proba90.bin: from 68 977 to 68 981
rl.silesia.tar: from 3 736 111 to 3 736 333
etc.

In some cases, I've witnessed a small gain instead, but that's not the majority of situations.

Ex: ml.silesia.tar: from 5 631 214 to 5 631 048

In all cases, differences are very small.

7. Also: it seems to me that you have the required expertise to modify and test C source code.
I feel your experience, and the accuracy of your propositions, would gain a lot if you actually tested your ideas beforehand, with real source code, a compiler, and a target system.

2. Interesting. So, because it works in the backward direction, it looks like this wouldn't work in a streaming scenario unless the stream is split into blocks.

1. Correct, if streaming happens at symbol level.

Well, to be more complete, most streaming protocols I know of are, in fact, block-based. A classic example is a data packet, or even a movie frame. In most circumstances, there is enough input to consider it a "block".

When input data is scarce, though (say, a few bytes, for example a very-low-latency voice sample), it is different. But these cases are uncommon.

3. I don't know if you've seen it. This is an FSE analysis from the Brotli team:

(click on the PDF)

Their conclusion is that FSE's headers cost more than what it saves compared to Huffman.

1. Thanks for the link. Indeed, I wasn't aware of this study, which is well detailed.

They apparently chose not to mention FSE in their paper, although they incidentally used exactly the same distribution method as earlier versions of FSE (it changed in the latest version, which also features a more efficient header scheme).

The ratio of header to compressed data is huge - way beyond what I'm seeing in my tests. It would seem they chose to cut the input data into relatively small blocks, with immediate consequences on header cost.

Another important point is that the algorithm is based on a very large number of command and distance codes. This in turn limits the losses produced by Huffman rounding, while inflating the header size.

So, in a nutshell, some fundamental algorithmic choices seem to be built around Huffman's core properties. Hence, merely swapping Huffman for a more precise entropy coder plays against the accuracy of the new coder.

To give a different example, when using FSE within Zhuff, I made exactly the reverse decision: use the improved accuracy to reduce the number of codes. This, in turn, proves beneficial for speed and memory usage.

Well, anyway, I should also have a look at the source code. These are only preliminary comments, from just reading the study.