The previous post presented a new technique to improve the accuracy of the normalization step, while remaining fast. Today's post concentrates on making the normalization step "optimal", keeping speed as a secondary objective.
(Note : this objective has also been covered on Charles Bloom's blog, using a different explanation and formula, but with essentially the same result. It's a good read.)
Reaching the "optimal" status means no other probability distribution can achieve better compression. This is obviously measured against a "theoretical optimal entropy coder", which FSE is not (it is a close approximation of one). But still, any gain from a better distribution is likely to translate into better FSE compression.
To achieve the best possible distribution, we need to invert our reasoning : instead of distributing remaining "bonus points", we will round every probability up, and then consider that each "round down" is a cost. This cost is necessary : we can't round up all probabilities, otherwise the sum of probabilities would be bigger than the table size. By design, some of these probabilities will have to be rounded down.
We will simply select the ones which cost the least.
This requires defining a "cost function".
The cost function is relatively easy to understand :
Let's suppose we have a symbol S, which appears Qs times in the data source, and is currently granted a scaled probability P (an integer number of slots out of the table size). The cost per occurrence of S can be estimated using Shannon's formula :
Cp = -log2(P / tableSize);
Now, we want to evaluate the cost of downgrading the probability of S from P to P-1.
The new cost per occurrence will be higher : C(P-1) = -log2((P-1) / tableSize);
So the total cost of this operation will be : Qs x (C(P-1) - Cp) = Qs x log2(P / (P-1))
(the tableSize term cancels out in the subtraction).
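As a concrete example (the numbers come from the trace further below) : a symbol which appears 413 times and currently holds a scaled probability of 2 can be downgraded to 1 for a cost of 413 x log2(2/1) = 413 bits, while a symbol appearing 125551 times at probability 335 costs 125551 x log2(335/334) ≈ 541.5 bits to downgrade, even though far more occurrences are affected.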
We can calculate this cost for each and every symbol present in the data source, pick the one which costs the least, degrade its probability by one, update its cost, select the new symbol with the lowest cost, rinse and repeat, and stop when the sum of probabilities reaches the required total (= table size).
By construction, this method is guaranteed to provide the best possible normalized distribution.
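To make the procedure concrete, here is a minimal C sketch of this greedy selection. It is not the actual FSE source : function and variable names are illustrative, and the simple linear re-scan is only there for clarity.

#include <math.h>
#include <stddef.h>

/* Sketch of the "HC" normalization described above (illustrative, not the FSE code).
   Assumptions : counts[] holds the number of occurrences of each symbol,
   total is their sum, and tableSize is at least the number of present
   symbols, so every present symbol can keep a probability >= 1. */
static void normalizeHC(unsigned* normProbs, const unsigned* counts,
                        size_t nbSymbols, unsigned tableSize, size_t total)
{
    unsigned allocated = 0;
    size_t s;

    /* Step 1 : round every present symbol up */
    for (s = 0; s < nbSymbols; s++) {
        if (counts[s] == 0) { normProbs[s] = 0; continue; }
        double scaled = (double)counts[s] * tableSize / (double)total;
        normProbs[s] = (unsigned)ceil(scaled);      /* >= 1 for any present symbol */
        allocated += normProbs[s];
    }

    /* Step 2 : while the sum exceeds the table size, round down the symbol
       whose downgrade costs the fewest bits : Qs x log2(P / (P-1)) */
    while (allocated > tableSize) {
        double bestCost = -1.0;
        size_t bestSymbol = 0;
        for (s = 0; s < nbSymbols; s++) {
            unsigned P = normProbs[s];
            if (P <= 1) continue;                   /* never drop a present symbol to 0 */
            double cost = (double)counts[s] * log2((double)P / (double)(P - 1));
            if (bestCost < 0.0 || cost < bestCost) { bestCost = cost; bestSymbol = s; }
        }
        if (bestCost < 0.0) break;                  /* should not happen under the assumptions above */
        normProbs[bestSymbol] -= 1;
        allocated -= 1;
    }
}

The full re-scan at every step is what makes such a version slow; a heap ordered by downgrade cost would bring it much closer to the fast version's speed.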
So, what kind of result does it provide ?
First, a look at "book1" benchmark :
4K table, Fast normalization : 435 459 bytes
4K table, Perfect normalization : 435 426 bytes (gain : 0.008%)
2K table, Perfect normalization : 436 041 bytes (gain : 0.022%)
1K table, Perfect normalization : 437 756 bytes (gain : 0.045%)
These gains are obviously very small. Still, the method can be interesting for those willing to reach the best possible compression ratio, since the "normalization" cost only applies during compression : it's totally free on the decompression side.
Looking further, the new function also provides a debug mode which details its selection process. It's very interesting, so let's have a look at it, for example when compressing "book1" with 2K tables :
smallest : 76 : cost 413.0 bits - count 413 : 1.10 (from 2 -> 1)
smallest : 70 : cost 413.0 bits - count 413 : 1.10 (from 2 -> 1)
smallest : 89 : cost 416.0 bits - count 416 : 1.11 (from 2 -> 1)
smallest : 87 : cost 440.5 bits - count 753 : 2.01 (from 3 -> 2)
smallest : 63 : cost 444.0 bits - count 759 : 2.02 (from 3 -> 2)
smallest : 69 : cost 444.0 bits - count 444 : 1.18 (from 2 -> 1)
smallest : 59 : cost 445.7 bits - count 762 : 2.03 (from 3 -> 2)
smallest : 106 : cost 468.0 bits - count 468 : 1.25 (from 2 -> 1)
smallest : 33 : cost 486.7 bits - count 832 : 2.22 (from 3 -> 2)
smallest : 83 : cost 497.2 bits - count 850 : 2.26 (from 3 -> 2)
smallest : 62 : cost 498.0 bits - count 498 : 1.33 (from 2 -> 1)
smallest : 60 : cost 498.0 bits - count 498 : 1.33 (from 2 -> 1)
smallest : 79 : cost 500.7 bits - count 856 : 2.28 (from 3 -> 2)
smallest : 78 : cost 502.0 bits - count 502 : 1.34 (from 2 -> 1)
smallest : 120 : cost 503.7 bits - count 861 : 2.29 (from 3 -> 2)
smallest : 84 : cost 517.1 bits - count 1966 : 5.24 (from 6 -> 5)
smallest : 113 : cost 520.0 bits - count 520 : 1.39 (from 2 -> 1)
smallest : 46 : cost 530.6 bits - count 7170 : 19.10 (from 20 -> 19)
smallest : 39 : cost 533.5 bits - count 6470 : 17.24 (from 18 -> 17)
smallest : 107 : cost 533.9 bits - count 4994 : 13.30 (from 14 -> 13)
smallest : 118 : cost 535.7 bits - count 5382 : 14.34 (from 15 -> 14)
smallest : 98 : cost 537.8 bits - count 9132 : 24.33 (from 25 -> 24)
smallest : 115 : cost 538.8 bits - count 36788 : 98.00 (from 99 -> 98)
smallest : 10 : cost 538.9 bits - count 16622 : 44.28 (from 45 -> 44)
smallest : 110 : cost 539.1 bits - count 40919 : 109.01 (from 110 -> 109)
smallest : 104 : cost 539.2 bits - count 37561 : 100.06 (from 101 -> 100)
smallest : 44 : cost 540.2 bits - count 10296 : 27.43 (from 28 -> 27)
smallest : 109 : cost 540.3 bits - count 14044 : 37.41 (from 38 -> 37)
smallest : 116 : cost 540.6 bits - count 50027 : 133.27 (from 134 -> 133)
smallest : 111 : cost 540.8 bits - count 44795 : 119.33 (from 120 -> 119)
smallest : 97 : cost 541.3 bits - count 47836 : 127.43 (from 128 -> 127)
smallest : 119 : cost 541.4 bits - count 14071 : 37.49 (from 38 -> 37)
smallest : 108 : cost 541.4 bits - count 23078 : 61.48 (from 62 -> 61)
smallest : 32 : cost 541.5 bits - count 125551 : 334.47 (from 335 -> 334)
smallest : 105 : cost 542.0 bits - count 37007 : 98.59 (from 99 -> 98)
smallest : 114 : cost 542.3 bits - count 32889 : 87.62 (from 88 -> 87)
smallest : 101 : cost 542.8 bits - count 72431 : 192.96 (from 193 -> 192)
smallest : 32 : cost 543.1 bits - count 125551 : 334.47 (from 334 -> 333)
smallest : 102 : cost 543.2 bits - count 12237 : 32.60 (from 33 -> 32)
smallest : 45 : cost 543.8 bits - count 3955 : 10.54 (from 11 -> 10)
smallest : 110 : cost 544.1 bits - count 40919 : 109.01 (from 109 -> 108)
smallest : 117 : cost 544.2 bits - count 16031 : 42.71 (from 43 -> 42)
smallest : 115 : cost 544.4 bits - count 36788 : 98.00 (from 98 -> 97)
smallest : 104 : cost 544.6 bits - count 37561 : 100.06 (from 100 -> 99)
smallest : 116 : cost 544.7 bits - count 50027 : 133.27 (from 133 -> 132)
smallest : 32 : cost 544.7 bits - count 125551 : 334.47 (from 333 -> 332)
smallest : 100 : cost 544.8 bits - count 26623 : 70.92 (from 71 -> 70)
smallest : 111 : cost 545.4 bits - count 44795 : 119.33 (from 119 -> 118)
smallest : 97 : cost 545.6 bits - count 47836 : 127.43 (from 127 -> 126)
smallest : 101 : cost 545.7 bits - count 72431 : 192.96 (from 192 -> 191)
smallest : 103 : cost 546.2 bits - count 12303 : 32.78 (from 33 -> 32)
OK, it may seem a bit daunting to analyze, so let's go straight to the most interesting conclusions :
1) The first selected symbols make a big difference, since they really cost much less (starting at 413 bits). Then the costs converge, and there is very little difference remaining between the various available choices, between 535 and 545 bits.
Basically, it means most of the gain of the "HC" version comes from correctly selecting the first few symbols.
2) These first symbols tend to have "low probability" : a lot of 2->1 transitions, some 3->2, and so on. From about 530 bits onward, they completely disappear, and only higher-probability symbols remain.
3) Some higher-probability symbols appear several times. Note for example symbol 32, which is rounded down several times, ending at 332, while its "real" scaled probability is 334.47. It means it's better to downgrade it several times than to downgrade a lower-probability symbol a single time.
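As a sanity check against the trace : symbol 32 appears 125551 times, so its first downgrade costs 125551 x log2(335/334) ≈ 541.5 bits, and each successive one only slightly more (543.1, then 544.7 bits in the trace above), which remains cheaper than the alternatives available at those points of the selection.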
Point 2 was the most surprising to me. Remember, last time we showed a graphic showing that the low-probability symbols were prone to the most important "round down" losses.
This is because that graphic only considered the "worst case" situation, with the "real" probability being infinitely close to the next probability step (e.g. 1.999, then 2.999, etc.).
The new result implies that low-probability symbols can also generate the lowest losses, when they sit at the other end of the spectrum (e.g. 1.01).
This led me to try to visualize the "level of freedom" of the "round down cost" (all possible values from 1.001 to 1.999, then from 2.001 to 2.999, and so on). It gave the following graphic :
As could be guessed, this "level of freedom" is much higher at low probabilities than at high ones. In fact, the "freedom area" quickly becomes a very thin line beyond a value of 300.
This corresponds to intuition :
From 1.01 to 1.99, the level of freedom is a factor 2.
From 2.01 to 2.99, the level of freedom is now 50%.
and so on.
By the time it reaches probability 1000, the level of freedom is basically 1/1000th, which is almost nothing.
This is the zoomed version of the same graphic, concentrating on the first few values. As said last time, this graphic will remain the same irrespective of the total size of the table.
So most of the attention must be paid to the symbols with the lowest probabilities, where the difference between a low and a high "fractional rest" can make a large cost difference; these symbols must therefore be selected carefully.
High-probability symbols don't have such a level of freedom, so their "fractional rest" has very little importance, and almost no impact on total cost.
Another interesting point is that the graphic converges towards 1/ln(2) = 1.443, not 1.5. Which means that the intuitive 50% threshold is not true, at least not for low-probability symbols.
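A quick back-of-the-envelope way to see where 1/ln(2) comes from, reusing the cost formula above and counting occurrences in units of one table slot (total/tableSize occurrences) : a symbol whose real scaled probability x lies between P and P+1 is first rounded up to P+1, and its single round-down to P then costs
x * log2((P+1)/P) bits per slot, with P < x < P+1
For P = 1, that's anywhere between 1 and 2 bits; for P = 2, between about 1.17 and 1.75; and as P grows, log2((P+1)/P) tends to 1/(P * ln(2)) while x tends to P, so the cost squeezes towards 1/ln(2) ≈ 1.443 bits, whatever the fractional rest.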
Unfortunately, the new "HC" version is also slow. Not "horribly slow", just slow enough to be noticeable in benchmarks, reducing total speed by as much as 5-10%, for negligible compression gains. That means the "fast" version remains the default one, while the "HC" version is provided as an option.
Nonetheless, the above findings are very interesting, and may be useful in the quest for a better fast renorm algorithm.