## Sunday, January 8, 2012

### Probability Update

Following the previous post on Binary Arithmetic Coder, we left with a simple-to-ask but difficult-to-answer question : how to update the probability that the next value will be a 0 or a 1 ?

Indeed. Presuming that we can update P0 (probability that next value will be a zero) suppose that we accept a simple idea : the past is a strong indicator for the future.

This may not always be true, but for compression it appears to work relatively well. At this stage, we can outline the strong relationship between compression and prediction : compression ratio will only be as good as the algorithm can predict the future (the next bit in this case).

Trying to predict the future is not limited to compression, and massive amount of investment have been and are currently spent around this applied concept. In this post however, we'll keep our mind focused on compression, considering other fields only if they can help this objective.

A first simple way to evaluate P0 is to count the number of 0 and 1 previously seen.
Then, simply set : P0 = N0 / (N0+N1).
It works, and is a simple way to achieve convergence towards optimal stationary statistics.

But, in fact, no source is really static. They constantly evolve (otherwise, compression would be trivial). The probability should therefore adapt to the source, and consequently, more weight should be given to the most recent bits.

A very simple way to achieve this property is by giving to new bits a fixed share of the updated probability. For example 20%. This is an easy method to gradually and smoothly "reduce" the weight of older records.
And we get :
newP0 = oldP0 x (1-share) + NewBitIsZero x share

This works too, although we start to wonder what should be such "share" ?
20% ? 10% ?

It is very tempting to believe that the optimal adaptation share can be found as a unique value. And indeed, through brute force testings (such as Genetic Algorithms) an optimal value can be found.
[Edit : an excellent example of  "optimal adaptation share" is provided by Shelwien here : http://nishi.dreamhosters.com/u/book1bwt_wr.png ]
However, the optimal value will be true only for a given file, number of contexts and precision level. Change any parameter, and the optimal value will change too..

Could such optimal adaptation share be guessed beforehand, through calculation, rather than sheer experimentation ? Well, unfortunately no. It depends too much on the source, although some basic rules of thumb do apply : if the file is small, the share will be higher. If it is large, the share will be lower.

This points towards something retrospectively obvious : at the beginning of the file, when statistics are still scarce, the adaptation must be faster. Then, the more we accumulate statistics, the more accurate they should become, so the adaptation share must become lower.

It does not answer to the question "how much", but hints towards a moving value, becoming lower as we progress into the source file.

In order to get closer to the "how much" answer, I've collected some statistics, that we'll see and comment in a later post...

1. > how to update the probability that
> the next value will be a 0 or a 1 ?

Suppose you've encountered n0 zeroes and n1 ones,
then the entropy of the processed string is
l(p0)=n0*log(p0)+n1*log(1-p0)
and by solving l'(p)=0 we'd be able to find the p value
which minimizes the entropy.
http://www.wolframalpha.com/input/?i=Solve%5BD%5Bn0*Log%5B2%2Cp%5D%2Bn1*Log%5B2%2C1-p%5D%2Cp%5D%3D%3D0%2Cp%5D
which is p=n0/(n0+n1).
But its a posterior estimation.

And for a prior estimation there's a useful method based on likelihoods:
// the probability of a string with n0 zeroes and n1 ones:
c[n0_,n1_]:=1/Binomial[n0+n1,n0];
// the probability of the next string being the one with extra zero:
p=c[n0+1,n1]/(c[n0+1,n1]+c[n0,n1+1])
http://www.wolframalpha.com/input/?i=FullSimplify%5B%281%2FBinomial%5B1%2Bn0%2Bn1%2C1%2Bn0%5D%29%2F%281%2FBinomial%5B1%2Bn0%2Bn1%2Cn0%5D%2B1%2FBinomial%5B1%2Bn0%2Bn1%2C1%2Bn0%5D%29%5D

Same method actually can be also used with string probabilities based on entropy
with substituted posterior p=n0/(n0+n1), ie
c[n0_,n1_]:=(n0/(n0+n1))^n0*(n1/(n0+n1))^n1
but entropy formula is an approximation (it assumes that probability is static),
so we'd end up with a pretty weird estimation:
p=1/(1+(n0^n0*(n1+1)^(n1+1)/(n0+1)^(n0+1)/n1^n1))
Although Shkarin was actually advertising it (as "MP-estimator") on encode.ru -
its supposedly better than plain (n0+d)/(n0+n1+d) in practical use.

Actually this approach can be improved even further by taking into
account the probability distribution of p values.
For example,
gives p=(n0+1)/(n0+n1+1)
and
gives p=(n0+.5)/(n0+n1+1) (the Krichevski-Trofimov estimator)

But its usually more interesting to use secondary statistics
and numerical integration there.

2. > It is very tempting to believe that the optimal share
> can be found as a unique value

Actually yes.
1. The entropy(rate) function is pretty smooth, so you
can find the optimum value for a given data block with
just a few tests.
2. When the probability update is linear, the probability
value at a given bit becomes a polynomial function of
initial p and rate.
Then, exponent of the code entropy is the product of
the probabilities at each bit - ie also polynomial.
So, with some approximations, it should be possible
to compute the parametric entropy(rate) function for
given data bits.
3. Normally a coder uses adaptive mixing of a few
predictions given by counters with different update rates,
so instead of optimizing each counter, the model
just uses the best combination of existing ones.

3. > Actually yes. The entropy(rate) function is pretty smooth, so you
can find the optimum value for a given data block with
just a few tests.

This doesn't fit my observations, but maybe i should first develop these observations into a separated post to make a proper answer.

4. This is the entropy(rate) function for book1bwt:
http://nishi.dreamhosters.com/u/book1bwt_wr.png

5. Very interesting. How is it generated ?

6. Nevermind, i think i can guess :
Vertical value is final size (compressed)
and horizontal value is the "entropy rate", which probably means the adaptation ratio used by the linear update formula (Although, to be complete, we would probably need to be told if the adaptation ratio is 1/value, or value >> "base shift", how many contexts are used (order 0 ? order N?), etc.).

I think it's a pretty good example. I will link to it in the article.

7. This is the full picture:
http://nishi.dreamhosters.com/u/book1bwt_wr1.png
There's also a text version - http://nishi.dreamhosters.com/u/book1bwt_wr.txt
As to "rate", its the wr value for my usual linear counter-
SCALE=1<<15,p'=(p*wr+bit*(SCALE-wr))/SCALE.
The coder is order0, the modified "rc_v0".

8. Also this: http://www.ctxmodel.net/rem.pl?-6