At first sight, it may look like a nuisance: usual arithmetic coders offer clean, flat probabilities, so this uneven layout seems like added "noise" on top of an already complicated algorithm. On second thought, it looks like a potential opportunity, even if a complex one, to improve accuracy by taking advantage of such differences.
With many other problems to solve before getting a working ANS decoder out, this question was left for later investigation. And here we are today.
The same conclusion was later demonstrated by Charles Bloom, within his series of blog posts on ANS. He produced a graph which proved much clearer than the stats produced the year before.
[Graph: probability of each state value]
Note that the "clean graph" is obtained through an average of multiple "synthetic" inputs, not real files, and requires the "precise distribution algorithm" for the layout of symbols. If you try to reproduce this result on your own, you are likely to obtain a graph which roughly produces the same perspective, but not with the same accuracy.
As stated by Charles, the graph follows a 1/X distribution quite closely. It's an important conclusion, because it hints at a formula able to "guess" the best spot for a singleton probability. It seems that, as long as P lies between sqrt(2)/2 and sqrt(2), we can find its optimal position in the table.
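To illustrate where that 1/X shape comes from: a commonly cited approximation (from Duda's work and Charles Bloom's posts) is that the probability of sitting in state x, for x within [L, 2L), is about log2((x+1)/x), which behaves like 1/(x*ln2). The small check below is only an illustration of that approximation, not the measurement from the graph above; the table size of 512 is an arbitrary example.

    #include <math.h>
    #include <stdio.h>

    /* Illustrative check of the 1/X shape of state probabilities.
       Approximation used: P(state x) ~ log2((x+1)/x) ~ 1/(x*ln2),
       for states x in [L, 2L). Example table size, not the blog's data. */
    int main(void)
    {
        const unsigned L = 512;
        unsigned x;
        for (x = L; x < 2*L; x += 128) {
            double pState   = log2((x + 1.0) / x);   /* approx. probability of state x */
            double oneOverX = 1.0 / (x * log(2.0));  /* the 1/X approximation          */
            printf("state %4u : %.6f  vs  1/(x ln2) = %.6f\n", x, pState, oneOverX);
        }
        return 0;
    }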
It's an interesting result that we'll try to exploit. I ran multiple tests with different P values, looking for the optimal position in the table, and the result was relatively close to the 1/X formula (the best position is usually slightly different, but the overall difference in file size is extremely small):
P        1/X position    Best position    Difference
0.707    511             499              4 bytes
0.850    339             351              2 bytes
1.000    212             209              2 bytes
1.150    117             115              1 byte
1.300     44              43              2 bytes
1.414      0               0              0 bytes
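The exact formula isn't spelled out here, but one formula consistent with the "1/X position" column above, assuming a 512-entry table (positions 0 to 511), is position ≈ tableSize * (sqrt(2)/P - 1), truncated and clamped to the table range. The helper name below is purely illustrative; this is a sketch under those assumptions, not the actual implementation.

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical helper: derives a "1/X" position for a singleton of real
       probability P (expressed in units of 1/tableSize, so sqrt(2)/2 <= P <= sqrt(2)).
       With tableSize = 512, this reproduces the "1/X position" column above. */
    static unsigned singletonPosition(double P, unsigned tableSize)
    {
        double pos = tableSize * (sqrt(2.0) / P - 1.0);
        if (pos < 0) pos = 0;
        if (pos > tableSize - 1) pos = tableSize - 1;
        return (unsigned)pos;   /* truncation, not rounding */
    }

    int main(void)
    {
        const double P[] = { 0.707, 0.850, 1.000, 1.150, 1.300, 1.414 };
        unsigned i;
        for (i = 0; i < sizeof(P)/sizeof(P[0]); i++)
            printf("P=%.3f -> position %u\n", P[i], singletonPosition(P[i], 512));
        return 0;
    }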
So we've got a way to improve the accuracy of singleton probabilities (those for which the normalized frequency is 1). Can we use it?
Unfortunately, we must also take into consideration the increased header cost required to specify the position accurately. If a symbol's probability is normalized to 1 (a singleton), its number of occurrences is relatively small. Therefore, we can't spend, say, 10 bits to transmit the exact best position of a rare symbol in the table; it would be counterproductive.
In order to experiment, I targeted the easy gains first and took a look at low-probability symbols, those for which realP < 1. Such cases are unfortunately quite common. They are also the most wasteful ones, since they still require the minimum normalized frequency of 1 (even if realP = 0.01, we can't round it down to zero, otherwise the compression would be lossy).
One way to minimize their loss is to put them at the top of the table, where the state probability is lowest (approximately sqrt(2)/2 ~= 0.707). The end effect is somewhat equivalent to what Charles Bloom did within his own tANS version, but in a more "directive" manner, since we are cherry-picking which symbols must sit at the top of the table.
The impact on header size is going to be very moderate, since now we just have to distinguish between a "normal 1" and a "low 1".
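To make the placement concrete, here is a minimal sketch of the table-spread step, close in spirit to FSE's approach but with illustrative names and details (the function name, the use of a normalized count of -1 as the "low 1" marker, and the stepping rule are assumptions of this sketch, not a copy of the actual source):

    #include <stddef.h>

    /* Sketch: symbols flagged as "low 1" (realP < 1) grab the top-most table
       positions, where the state probability is lowest (~0.707); all other
       symbols are spread over the remaining positions with a fixed step. */
    static void spreadSymbols(unsigned char* table, const short* normalizedCounter,
                              size_t maxSymbol, unsigned tableLog)
    {
        size_t const tableSize = (size_t)1 << tableLog;
        size_t const mask = tableSize - 1;
        size_t const step = (tableSize >> 1) + (tableSize >> 3) + 3; /* odd => full cycle */
        size_t highPos = tableSize - 1;   /* top of the table */
        size_t pos = 0;
        size_t s;

        /* 1) "low 1" symbols (marked -1 in this sketch) take the top slots */
        for (s = 0; s <= maxSymbol; s++)
            if (normalizedCounter[s] == -1)
                table[highPos--] = (unsigned char)s;

        /* 2) regular symbols are spread over the remaining slots */
        for (s = 0; s <= maxSymbol; s++) {
            int n;
            for (n = 0; n < normalizedCounter[s]; n++) {
                table[pos] = (unsigned char)s;
                do { pos = (pos + step) & mask; } while (pos > highPos);
            }
        }
    }

Since the header only needs to say "this 1 is a low 1", the extra signaling cost stays tiny compared to transmitting an explicit position.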
Let's check whether this method produces worthwhile gains:
Table size   Low1      HC        Fast3     Gains (vs Fast3)   Fast1
4K           435286    435426    435398    0.026%             435723
2K           435677    436041    436072    0.091%             436783
1K           436691    437756    437875    0.271%             439181
Indeed, this is much better than anticipated. This new version handily beats the "perfect normalization algorithm", by taking advantage of the unequal probabilities within the ANS table. It is therefore now integrated into the open-source version of FSE on GitHub.
A great property is that this improved table initialization runs at the same speed as before, so it's basically an improvement we get for free.
Another important property is that the gain improves as the table size decreases. That's great: it means we can consider reducing the table size, hence its memory cost, while keeping the compression loss in check. Notice how we now get better compression than the earlier version (Fast1) produced with tables twice as big!
It's an important benefit, enabling some critical RAM savings, essential to improving Zhuff's performance.
But let's keep that part for a later blog post...