## Friday, February 14, 2014

### FSE decoding : wrap up

With all the previous articles written on FSE, it seems we now have all elements to build a complete FSE decoder. Let's wrap up all this learning into a single piece.

To start with, we consider that all symbols are attributed a weight, of minimum 1, normalized over a sum which is a power of 2 (8, 16, 32, etc.).
The size of the decoding table is this sum.

Each symbol will appear in the table exactly as many times as it weights.

We have seen that, to maximize table's capability to capture fractional bits, we have to spread each symbol over the table, at roughly regular distances (the more regular, the better the compression performance).
For example, using the previous example of a symbol A with weight 5/16 :

We have seen that destination sub-ranges must sum up to cover the complete table, with each sub-range having a size in power of 2, defining an exact number of bits to use. We have seen that the smaller ranges must be placed at the beginning.

We now must associate each symbol in the table with a destination range.
To do that, we will number the ranges, beginning with the larger ones.

So the final assembling becomes :

If you build a decoding table using the above construction method, you have a simple decoding algorithm as a result. The hot loop is a mere 4-lines, as follows :

// Persistent variables across the loops : state and bitStream
BYTE symbol = decodeTable[state].symbol;
int nbBits = decodeTable[state].nbBits;
int rest = getBits (bitStream, nbBits);
state = decodeTable[state].newStateBaseline + rest;

This is all it takes to build your own FSE decoder.
You can always use the code at Github for further guidance.

Next part will be a study of the compression algorithm part of FSE, which is a bit more complex...

## Sunday, February 9, 2014

### FSE : distributing symbol values Following the last article on defining optimal sub-ranges for FSE, and attributing them according to symbol ranking, we were left with the question of how to position symbol values into the state table.

We already said that an FSE decoding table following ANS theory is equivalent to an Huffman Low-bit-order decoding table : a symbol with a probability of 50% will be positioned once every 2 cells. With a probability of 25%, it will be positioned once every 4 cells. And so on.

Obviously, the next question is : what if the probability of a symbol is not a power of 2 ?
A first obvious idea is to follow the same logic : if, for example, a symbol has a probability of 33.333%, it should be positioned once every 3 cells.

OK, but even 33% is an easy target, because it's such a convenient fractional number. What if we have, for example, a probability of 7.27%, resulting in an average distance between symbols at every 13.75515.... cells ?
Well, an intuitive idea is that we will try to approach this distribution, typically by spacing symbols sometimes by 13 cells, sometimes by 14 cells. And indeed, it works.

OK, it may look clear and simple for a single symbol. But what about all other symbols together, trying to get their own regularly spaced slots ? Won't they be fighting for the same slots ?

Let's take an example, to illustrate the situation. Let's imagine that, on a 16-states decoding table, we have the following distribution :

A : 5 ; B : 5 ; C : 3 ; D : 3

In this case, we have a first issue, because 16/5=3.2, which obviously is not an integer. So we will have to space A by a distance of 3.2, which we can only approximate, by being 3 most of the time, and maybe 4 sometimes.

The intuitive optimal distribution for A is reached by applying this algorithm :
Position 0 : at P/2 (=1.6, so rounded down to 1 for example)
Position n = position n-1 + 3.2
Which gives : 1 (1.6), 4 (4.8), 8 (8.0), 11 (11.2), 14 (14.4)
(Edit : Charles Bloom calls this method "bias 0.5", and also note that "bias 1.0"  actually gives better results, which feels strange, as "bias 0.5" is supposed to be "optimal". This result is likely related to the fact that positions in the state table are not equal in term of contribution to probability, with the lower states being worth more than the higher ones. See follow up investigation)

Yeah, this would be fine enough, except that, A is not alone.
What about B ? It has the same probability, so it should occupy the same positions ?
Of course not, only one symbol can occupy a state, so let's say that B is kind enough to move its own positions up by 1, leading to 2, 5, 9, 12, 15.

OK, it looks fine, so now, what about C ?
C distribution is 3, so Pos0(C) = (16/3) / 2 = 5.33/2 = 2.66 => 2.
Humm, that's a problem, 2 is already used by B.
OK, let's say B has higher priority because it should have 1.6 to begin with, so it's lower than 2.66. So we keep 2 for B, and give 3 for C.

And so on. Add D to the mix for even more fun. D basically takes whatever cell remains for its own use, wherever they are.

As you can see, it seems to be a mess. And it is : with more and more symbols filling the table, space is necessarily becoming scarce for later symbols. Notice how D ends up receiving 2 positions clustered together, instead of being nicely spread.

Thankfully, a "perfect" distribution algorithm is possible : for each position, compare all fractional state values of all symbols, pick the lowest one, increase its value by step=1/probability, and start again the same process for next state. This will produce a lot of ties, but one could use symbol order to solve such cases.
It works.
It's just damn slow, due to the huge number of comparisons required.

It can be sped up by using bucket sorting strategy. An interesting method is to use more buckets than states, for easier tie management. It works fine too, and drastically reduces the number of comparisons involved, at the cost of a bit of accuracy. But that's okay, efficiency remains good enough, "close to Shannon limit", which means that precise initialization is actually not required.

Let's double up on this sentence : precise initialization is not required.
Indeed, the only point is to reduce the amount of "waste" in fractional bit estimation. Where does that waste come from ?
Well, as said in a previous article, a fractional bit is translated into a "decrease" of the state value.

For this to work correctly, i.e. to minimize the fractional bit cost, we need to have our target symbol approximately at the intended destination state value. If not, then we will "waste" more fractional bit, in order to find the intended symbol.

With this concept in mind, the worst thing that can happen is to "cluster" the values together at one end of the range : it maximizes the probability that state value will have to be decreased more than necessary to find the next symbol.

On the other hand, if a symbol with a probability 50% is distributed evenly, once every 2 cells, finding it is highly simplified, and even guaranteed at a distance of at most 1.
The above explanation is a bit simplified, but works well at explaining how fractional bit works.

So, it's not very important to be precise, what matters is to be well "spread" across the spectrum of state values.

With this idea in mind, we can concentrate on the next objective : speed.
Rather than trying to find the best distribution, we'll try to find one which is "good enough" at spreading symbols across the table, and most importantly fast to execute.

With some experiment, I ended up selecting this simple formula :
nextState = (currentState + (5/8) range + 3) % range
which features a pretty good spreading power, and is way faster to run than all previous proposals, thanks to the absence of comparisons, branches, and the guarantee to reach all state values in a single scan. Using the above example, this solution produces the following distribution :

in a word, good enough.
Note that, on real implementation, result looks even better than this example, due to the use of larger tables. We are limited here, for visibility, at 16 states, but the heuristic has been tuned for larger sizes, 512 states and beyond.

In Charles Bloom's article, there's a different distribution method recommended, closer to optimal, backed with experimental results. There's no contradiction : this blog is called "fast compression" for a reason, and that's why in this recommendation, speed takes precedence.