## Sunday, February 9, 2014

### FSE : distributing symbol values

Note : Charles Bloom develops an article around the same issue on his blog : Understanding ANS - 8

Following the last article on defining optimal sub-ranges for FSE, and attributing them according to symbol ranking, we were left with the question of how to position symbol values into the state table.

We already said that an FSE decoding table following ANS theory is equivalent to an Huffman Low-bit-order decoding table : a symbol with a probability of 50% will be positioned once every 2 cells. With a probability of 25%, it will be positioned once every 4 cells. And so on.

Obviously, the question is : what if the probability of a symbol is not a power of 2 ?
Well, with this context in mind, it seems obvious that we should follow the same logic, and if for example, a symbol has a probability of 33.333%, it should be positioned once every 3 cells.

OK, but even 33% is an easy target, because it's such a convenient fractional number. What if we have, for example, a probability of 7.27%, resulting in an average distance between symbols at every 13.75515.... cells ?
Well, an intuitive idea is that we will try to approach this distribution, typically by spacing symbols sometimes by 13 cells, sometimes by 14 cells. And indeed, that's how it works.

OK, it may look clear and simple for a single symbol. But what about all these other symbols together, trying to get their own regularly spaced slots ? Won't they be fighting for some same slots ?

Let's take an example, to see that situation clearer. Let's imagine that, on a 16-states decoding table, we have the following distribution :

A : 5 ; B : 5 ; C : 3 ; D : 3

In this case, we have an issue, because 16/5=3.2, which obviously is not an integer. So we will have to space A by a distance of 3.2, which we can only approximate, by being 3 most of the time, and maybe 4 sometimes.

The intuitive optimal distribution for A is reached by applying this algorithm :
Position 0 : at P/2 (=1.6, so rounded down to 1 for example)
Position n = position n-1 + 3.2
Which gives : 1 (1.6), 4 (4.8), 8 (8.0), 11 (11.2), 14 (14.4)
(Edit : Charles Bloom calls this method "bias 0.5", and recommends "bias 1.0" instead, which gives better results. The reason why it's better is likely related to the fact that positions in the state table are not equal in term of contribution to probability, with the lower states being worth more than the higher ones.)

Yeah, this would be fine enough, except that, A is not alone.
What about B ? It has the same probability, so it should occupy the same positions ?
Of course not, only one symbol can occupy a state, so let's say that B is kind enough to move its own position by 1, leading to 2, 5, 9, 12, 15.

OK, it looks fine, so now, what about C ?
C distribution is 3, so Pos0(C) = (16/3) / 2 = 5.33/2 = 2.66 => 2.
Humm, that's a problem, 2 is already used by B.
OK, let's say B has higher priority because it should have 1.6 to begin with, so it's lower than 2.66. So we keep 2 for B, and give 3 for C.

And so on. Add D to the mix for even more fun. D basically takes whatever cell remains for its own use, wherever they are.

As you can see, it seems to be a mess. And it is : with more and more symbols filling the table, space is necessarily becoming scarce for the later ones.
However, a perfect distribution algorithm exists along the same lines : for each position, compare all fractional state values of all symbols, pick the lowest one, increase its value by step=1/probability, and start again the same process for next state. You still have a lot of ties to solve, but you can for example decide to select symbol values in order in such cases.
It works.
It's just damn slow, due to the huge number of comparisons required.

It can be sped up by using bucket sorting strategy. An interesting method is to use more buckets than states, for easier tie management. It works fine too, and drastically reduces the number of comparisons involved, at the cost of a bit of accuracy. But that's okay, since precise initialization is not required, it only compresses "closer to Shannon limit".

Let's by the way spend some more time on this strategy : precise initialization is not required.
Indeed, the only point is to reduce the amount of "waste" in fractional bit estimation. Where does that waste come from ?
Well, as said in previous article, a fractional bit is translated into a "decrease" of the state value.

For this to work correctly, i.e. to minimize the fractional bit cost, we need to have our target symbol approximately at the intended destination state value. If not, then we will "waste" some more fractional bit, in order to find the intended symbol.
With this idea in mind, the worst thing that can happen is to "cluster" the values together at one end of the range : it maximizes the probability that state value will have to be decreased more to find the required symbol.

On the other hand, if a symbol with a probability 50% is distributed evenly, once every 2 cells, finding it is highly simplified, and even guaranteed at a distance of at most 1.
The above explanation is a bit simplified, but works well at explaining how fractional bit works.

So, it's not very important to be precise, what matters is to be well "spread" across the spectrum of state values.
With this idea in mind, we can concentrate on the next objective : speed.
Rather than trying to find the best distribution, we'll try to find one which is "good enough" at spreading, and most importantly fast to implement.
With some experiment, I ended up selecting this simple formula :
nextState = currentState + (5/8) range + 3
which is way faster than all previous proposals, thanks to the absence of comparisons, branches, and the guarantee to reach all state values in a single scan. Using the above example, this heuristic solution gives the following distribution result :

in a word, good enough.
Note that results are better than this example with larger tables. We are limited here, for visibility, at 16 states, but the heuristic has been tuned for larger sizes, 512 states and beyond.

In Charles Bloom's article, you'll find a different recommended distribution method, closer to optimal, and backed with solid experimental evidences. There's no contradiction : this blog is called "fast compression" for a reason, and that's why in this recommendation, speed takes precedence.