The main learning, in my opinion, is that fractional rest doesn't really matter. Or more specifically, it only matters for small p values, where round down cost elasticity is highest.
Also of interest : the "threshold" level is not a stable 0.5 across the board; it is a p-specific threshold, starting at 0.443 for p==1 and then converging towards 0.5.
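As an aside, this p-specific threshold can be recovered with a quick back-of-the-envelope argument : rounding a scaled probability n + f up to n+1 is only worth it when the bits saved on that symbol outweigh the roughly constant cost of the slot taken away from a large symbol, which gives a break-even fraction of f = 1/ln((n+1)/n) - n. This is my own reconstruction rather than anything taken from the FSE source, but it matches the observed values, as the small C program below shows :

    #include <math.h>
    #include <stdio.h>

    /* Back-of-the-envelope reconstruction (not taken from the FSE source) :
       break-even fractional rest between rounding a scaled probability
       down to n or up to n+1, i.e. f(n) = 1/ln((n+1)/n) - n.
       It starts at ~0.443 for n==1 and converges towards 0.5. */
    int main(void)
    {
        for (int n = 1; n <= 12; n++) {
            double const f = 1.0 / log((n + 1.0) / n) - n;
            printf("p=%2d : threshold = %.3f\n", n, f);
        }
        return 0;
    }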
[Graph : Elasticity of the "round down cost"]
The list of "round down" selections also shows that only the first few decisions make a (large) difference, targeting almost exclusively small p values with a small fractional rest.
After these first "easy gains", all remaining "round down" operations produce basically the same cost. The higher the p value, the more stable the cost, in accordance with the graph "round down cost elasticity".
smallest : 76 : cost 413.0 bits - count 413 : 1.10 (from 2 -> 1)
smallest : 70 : cost 413.0 bits - count 413 : 1.10 (from 2 -> 1)
smallest : 89 : cost 416.0 bits - count 416 : 1.11 (from 2 -> 1)
smallest : 87 : cost 440.5 bits - count 753 : 2.01 (from 3 -> 2)
smallest : 63 : cost 444.0 bits - count 759 : 2.02 (from 3 -> 2)
(..)
smallest : 110 : cost 544.1 bits - count 40919 : 109.01 (from 109 -> 108)
smallest : 117 : cost 544.2 bits - count 16031 : 42.71 (from 43 -> 42)
smallest : 115 : cost 544.4 bits - count 36788 : 98.00 (from 98 -> 97)
smallest : 104 : cost 544.6 bits - count 37561 : 100.06 (from 100 -> 99)
smallest : 116 : cost 544.7 bits - count 50027 : 133.27 (from 133 -> 132)
smallest : 32 : cost 544.7 bits - count 125551 : 334.47 (from 333 -> 332)
smallest : 100 : cost 544.8 bits - count 26623 : 70.92 (from 71 -> 70)
(...)
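For reference, the "cost" column in this list can be reproduced with a one-liner : rounding a symbol's weight from p down to p-1 makes each of its occurrences about log2(p/(p-1)) bits more expensive. The small sketch below is my own reconstruction of that formula, not the actual FSE code :

    #include <math.h>

    /* Reconstruction (not the actual FSE code) of the "round down cost" :
       each occurrence of the symbol becomes log2(p/(p-1)) bits longer
       when its weight drops from p to p-1 (assumes p >= 2). */
    static double roundDownCost(unsigned count, unsigned p)
    {
        return count * log2((double)p / (p - 1));
    }

    /* Checks against the list above :
       roundDownCost(413, 2)     -> 413.0 bits
       roundDownCost(753, 3)     -> 440.5 bits
       roundDownCost(40919, 109) -> 544.1 bits */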
All these results point towards a simple idea :
could we just define a threshold under which all p values get "rounded down", and then fine-tune the remaining difference by attributing it to the largest p value, since it makes almost no cost difference which p is rounded down as long as it is large enough (stable cost) ?
OK, let's try that.
The initial idea is to use the converging value as the threshold (1.443 = 1/ln(2), although its relative position varies depending on p, from 0.443 to 0.5). It works, but we can do better.
One important side-effect regarding weight distribution concerns the very small values, resulting in a realP < 1, which must be rounded up to p=1 (the minimum weight). The number of such small values varies greatly depending on data type and size to compress. We could be lucky and have no small realP, or inversely have hundreds of them. Between these extremes, the most likely situation is that we will get a few of them.
These round-ups result in a hidden cost for all other symbol probabilities, which will have to "make room" (hence, round down) for them. To take this collective cost into consideration, we will simply raise the decision bar on the graph.
[Graph : Round up/down threshold]
Note that since the "round down cost elasticity" becomes very small as p grows (it ends up as a very thin line), above a certain value of p all p values will simply be rounded down, irrespective of their fractional rest.
This makes for a very simple algorithm, which we expect to be both fast and efficient.
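To make it concrete, here is a condensed sketch of such a normalization loop. It is only an illustration of the principle, not the code actually merged into FSE : the function name, the per-p thresholds and the p >= 8 cut-off are assumptions standing in for the raised decision bar.

    /* Illustrative sketch only - not the code shipped in FSE.
       The per-p thresholds and the p >= 8 cut-off are assumed values,
       standing in for the "raised decision bar" described in the text. */
    static void normalize_sketch(short* normP, const unsigned* count,
                                 unsigned maxSymbol, unsigned total,
                                 unsigned tableLog)
    {
        /* assumed fractional-rest thresholds for p = 1..7 ;
           above p == 7, the elasticity never reaches the bar : always round down */
        static const double threshold[8] = { 0, 0.52, 0.53, 0.54, 0.56, 0.60, 0.65, 0.70 };
        const double scale = (double)(1u << tableLog) / total;
        int remaining = 1 << tableLog;   /* slots still to distribute */
        unsigned s, largest = 0;

        for (s = 0; s <= maxSymbol; s++) {
            double const realP = count[s] * scale;
            short p = (short)realP;                           /* round down by default */
            if (count[s] && p == 0) p = 1;                    /* minimum weight : round up */
            else if (p >= 1 && p < 8 && realP - p > threshold[p])
                p++;                                          /* small p : fractional rest beats the bar */
            normP[s] = p;
            remaining -= p;
            if (p > normP[largest]) largest = s;
        }
        /* fine-tune : attribute the remaining difference to the largest p,
           where the cost impact is smallest */
        normP[largest] += (short)remaining;
    }

In this sketch, "remaining" turning negative is precisely how the largest symbol "makes room" for the realP < 1 symbols forced up to the minimum weight.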
Let's put that to the test.
First, let's have a look at final compression size.
As usual, we'll use "book1" as a single-block compression test :
Table    HC        Fast3     Fast2
4K       435426    435398    435459
2K       436041    436072    436138
1K       437756    437875    437952
(Note : Fast3 is the newly designed fast approximation algorithm; Fast2 was designed in a previous post)
As can be seen, Fast3 consistently beats Fast2, albeit by a small margin, since there is very little gain remaining.
The most important expected benefit is on the speed front : since the algorithm is much simpler, it is expected to run faster.
To measure this effect, I've modified FSE so that it runs its normalization loop several thousand times per block, hence ensuring normalization becomes the dominant operation. The final result should not be interpreted as an absolute speed, but rather a relative measure.
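To give an idea of the measurement principle, here is a tiny harness in the same spirit. It is a sketch only, not the test code actually used : the normalize_f signature and its parameters are placeholders matching the sketch shown earlier.

    #include <time.h>

    /* Sketch of the speed measurement principle (not the actual test code) :
       repeat the normalization step many times on the same block, so that
       normalization dominates the timing. */
    typedef void (*normalize_f)(short* normP, const unsigned* count,
                                unsigned maxSymbol, unsigned total,
                                unsigned tableLog);

    static double benchNormalization(normalize_f normalize, unsigned nbRounds,
                                     short* normP, const unsigned* count,
                                     unsigned maxSymbol, unsigned total,
                                     unsigned tableLog)
    {
        clock_t const start = clock();
        for (unsigned r = 0; r < nbRounds; r++)
            normalize(normP, count, maxSymbol, total, tableLog);
        return (double)(clock() - start) / CLOCKS_PER_SEC;  /* elapsed seconds */
    }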
                     Fast3    Fast2
Norm. SpeedTest :    25.6     6.4
This one is pretty clear : the new algorithm is approximately 4x faster than the previous one.
Faster, stronger, simpler, that's a good deal.
The new algorithm is now integrated into the open source version of FSE hosted on GitHub.
------------------------------------------------------------
Epilogue :
Astute readers may have noted that Fast3 compresses better than HC in the 4K table setting.
How could that be ? HC is supposed to be the perfect, hence unbeatable, normalization !
Well, one answer is that the perfect normalization assumes a perfect entropy coder, which FSE is not : it's only a good approximation of one.
On the "book1" example, the threshold bar of Fast3 produces distributions which incidentally benefit from FSE irregularity.
Which introduces a new question : could we intentionally use these irregularities to boost the compression ratio even further ?
Well, let's keep that question for a future blog post...