Thursday, January 9, 2014

FSE decoding : how it works

 After announcing the release of FSE and its results, it's time to have a look at its internals.

We'll spend some time first on the decoding loop, since it's both the easiest one to understand and the most important one. Encoding is much more convoluted, and that's where ANS theory comes into play (for a more detailed explanation of how ANS works and its related properties, I invite you to read Jarek's latest paper). But for the time being, we don't need that yet. The decoding loop is generic enough to handle diverse kinds of entropy algorithms, including Huffman and Arithmetic, as we'll see later on.

The decoding code can be freely consulted on GitHub. The "hot loop" is pretty short, and can be pseudo-coded like this :
// One persistent variable across all loops : int state
outputSymbol (decodeTable[state].symbol);
int nbBits = decodeTable[state].nbBits;
int rest = getBits (bitStream, nbBits);
update (bitStream, nbBits);
state = decodeTable[state].newState + rest;
There is a single value to track, called state.
The state value ranges from 0 to total_number_of_states - 1, where total_number_of_states is conveniently a power of 2. For example, in FSE's default implementation, total_number_of_states is 4096, representing 12 bits.
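
For illustration, a decoding table entry could look like the following C structure. This is just a sketch mirroring the pseudo-code above, not the actual FSE layout ; the field newStateBaseline corresponds to newState in the short loop above, i.e. the base value to which freshly read bits are added.

typedef struct {
    unsigned short newStateBaseline;   /* base of the next state value ; fresh bits are added to it */
    unsigned char  symbol;             /* symbol to output when reaching this state */
    unsigned char  nbBits;             /* number of fresh bits to read from the bitStream */
} decodeEntry_t;

#define TABLE_LOG  12                  /* 12 bits */
#define TABLE_SIZE (1 << TABLE_LOG)    /* total_number_of_states = 4096 */

static decodeEntry_t decodeTable[TABLE_SIZE];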

The initial state value is read directly from the bitStream. Its evolution afterwards is a bit more complex.
In contrast with Huffman, which merely shifts out current bits and replaces them with new ones from the bitStream, FSE completely rewrites the state value after each step.

The first thing to keep in mind is that a state value is enough to immediately decode a symbol and provide the next state value. Both pieces of information are stored directly in the decoding table.
If we could afford a large enough state value, with its associated decoding table, the decoding loop would be trivial :

outputSymbol (decodeTable[state].symbol);
state = decodeTable[state].newState;
Obviously, we are not that lucky : a large number representing a complete file or segment would require a decoding table far too large to be practical. So we have to keep this size reasonable.
In practice, we keep as many bits as necessary to represent probabilities with enough accuracy, for example the 12 bits proposed by default in FSE for a 256-symbol alphabet.

Now we have to modify the decoding loop by introducing a few new bits at each step, keeping the value of state within the accessible range. The number of bits to load is also embedded directly into the decoding table.
outputSymbol (decodeTable[state].symbol);
int nbBits = decodeTable[state].nbBits;
state = decodeTable[state].newStateBaseline + getSomeBits(bitStream, nbBits);
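
Putting the pieces together, a self-contained sketch of the full decoding loop could look like the C code below, reusing the decodeEntry_t structure and TABLE_LOG constant sketched earlier. The bit-reading helper is a naive, bit-by-bit placeholder written for clarity only ; it is not the actual FSE bit-stream reader.

#include <stddef.h>

typedef struct {
    const unsigned char* buffer;
    size_t bitPos;                               /* absolute bit position in the buffer */
} bitStream_t;

/* Naive bit reader : returns nbBits from the stream, lowest bit first.
   Placeholder only ; no bounds checking, and not how FSE actually reads bits. */
static unsigned readBits(bitStream_t* bs, unsigned nbBits)
{
    unsigned value = 0;
    for (unsigned n = 0; n < nbBits; n++)
    {
        unsigned bit = (bs->buffer[bs->bitPos >> 3] >> (bs->bitPos & 7)) & 1;
        value |= bit << n;
        bs->bitPos++;
    }
    return value;
}

static void decodeLoop(const decodeEntry_t* decodeTable, bitStream_t* bs,
                       unsigned char* dst, size_t nbSymbols)
{
    unsigned state = readBits(bs, TABLE_LOG);              /* initial state, read directly from the stream */
    for (size_t i = 0; i < nbSymbols; i++)
    {
        dst[i] = decodeTable[state].symbol;                /* state alone is enough to decode one symbol */
        unsigned nbBits = decodeTable[state].nbBits;       /* cost, in bits, of this symbol */
        state = decodeTable[state].newStateBaseline + readBits(bs, nbBits);
    }
}
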
A legitimate question at this stage is : why on earth does this construction compress anything ?
Well, we are getting close to it. Consider that the number of bits read after each decoded symbol is the cost of compression.

This number of bits is in no way random : it entirely depends on the (normalized) frequency of the decoded character. And here is why.

Let's suppose our range of possible values is 4096 (12 bits). And let's suppose that the normalized frequency of our decoded symbol is 4. That means it should be present about 4 times every 4096 symbols, which is equivalent to 1 time every 1024 symbols, which is equivalent to 10 bits.
That conclusion would have been the same for Huffman : after reading the character, we need 10 new bits to read the next one.
This example is simple because the symbol frequency is a power of 2.
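
Written as a formula, in the same way as the general case just below : -log2(4/4096) = -log2(1/1024) = 10 bits.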

So now, let's imagine that its frequency is 5. That means it's present 5 times in the table of 4096 cells.
According to Shannon theory, we are supposed to use -log2(5/4096) = 9.68 bits to represent this symbol. How can we achieve this ?

Well, we are going to divide the range of accessible values into 5 sub-ranges. Since we are loading full bits, we need each sub-range to be a power of 2. And since the total remains 4096, we need sub-ranges of different sizes.

A simple way to achieve this is to use a combination of 9 bits and 10 bits ranges (sometimes called phase-in codes). For example :

Sub-range 1 : states 0 to 511 (512 states, 9 bits)
Sub-range 2 : states 512 to 1023 (512 states, 9 bits)
Sub-range 3 : states 1024 to 2047 (1024 states, 10 bits)
Sub-range 4 : states 2048 to 3071 (1024 states, 10 bits)
Sub-range 5 : states 3072 to 4095 (1024 states, 10 bits)

Now, we have 5 sub-ranges, and collectively they cover the full spectrum of possible values.
Each of the 5 symbol occurrences will be assigned one of these sub-ranges, which translates into an offset (0, 512, 1024, etc.) and a number of bits (9 or 10). Hence :
state = decodeTable[state].newStateBaseline + getSomeBits(bitStream, nbBits);
At this stage, we have not yet settled on any specific rule regarding these ranges nor their assignment. We could have the 9-bit ranges at the end, or mix them (for example 10-9-10-9-10). Finally, the 5 occurrences of the symbol could be anywhere between 0 and 4095, and they could be assigned these ranges in no particular order.
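
To make the arithmetic of this example concrete anyway, here is a small C sketch which splits the 4096 states of a frequency-5 symbol into 2 nine-bit and 3 ten-bit sub-ranges and prints the resulting baselines. It is only an illustration of the idea ; the actual FSE construction also has to decide how the 5 occurrences are spread across the table, which is not covered here.

#include <stdio.h>

/* Illustration : split the state range (a power of 2) of a symbol of frequency
   'freq' into 'freq' sub-ranges whose sizes are powers of 2 of two consecutive
   widths. With tableLog=12 and freq=5, this yields 2 nine-bit and 3 ten-bit
   sub-ranges, with baselines 0, 512, 1024, 2048 and 3072. */
static void splitSubRanges(unsigned tableLog, unsigned freq)
{
    unsigned tableSize = 1u << tableLog;                      /* 4096 in the example */
    unsigned maxBits = 0;
    while ((freq << maxBits) < tableSize) maxBits++;          /* smallest width covering tableSize : 10 */
    unsigned nbLarge = (tableSize >> (maxBits-1)) - freq;     /* number of maxBits-wide sub-ranges : 3 */
    unsigned nbSmall = freq - nbLarge;                        /* number of (maxBits-1)-wide sub-ranges : 2 */
    unsigned baseline = 0;
    for (unsigned i = 0; i < freq; i++)
    {
        unsigned nbBits = (i < nbSmall) ? maxBits-1 : maxBits;   /* 9, 9, 10, 10, 10 */
        printf("occurrence %u : newStateBaseline = %4u, nbBits = %u\n", i, baseline, nbBits);
        baseline += 1u << nbBits;
    }
}

int main(void) { splitSubRanges(12, 5); return 0; }

With these values, the decoding loop above reads 9 bits whenever it lands in one of the first two sub-ranges, and 10 bits otherwise.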

The sequence of states is always "going down", generating a bit read each time it crosses the lower limit.
Fractional bits from a "high state" are consumed by going to a "low state".
Fractional bits from a "low state" are consumed by reading a full bit, and then reaching a "high state".

What could be the efficiency of such a representation method ?
We are targeting 9.68 bits. Do we get close to it ?

Well, if we make the naive assumption that all values in the state range are equi-probable, then we can calculate the likelihood of reading 9 or 10 bits.

Average_bitCost = (2 x 9 x 512 + 3 x 10 x 1024) / 4096 = 9.75 bits

That's not bad, we are getting fairly close to 9.68 bits, but not there yet.
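
For reference, a tiny standalone program reproducing both figures :

#include <math.h>
#include <stdio.h>

int main(void)
{
    double naive   = (2.0*9*512 + 3.0*10*1024) / 4096;    /* equi-probable states : 9.75 bits */
    double shannon = -log2(5.0 / 4096);                    /* Shannon cost of p = 5/4096 : ~9.68 bits */
    printf("naive average : %.2f bits ; Shannon limit : %.2f bits\n", naive, shannon);
    return 0;
}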

The trick is, these values are not equi-probable. In the ANS construction, states at the beginning of the range are slightly more likely to be generated than states at the end of the range, due to the way the 'newStateBaseline' values are generated. So the lower range [0-511] is more likely to be reached than the same-size higher range [3584-4095], and therefore it weighs more.
Calculating the average is therefore not straightforward. But bear with me, if you don't have the patience to read the mathematical proof from Jarek Duda : the ANS construction converges towards the optimal cost of the Shannon limit (we were already pretty close to it to begin with).

This result, however, depends on the way symbols are spread into the table.
That part is more complex, and will be detailed in another blog post.