Since FSE claims to achieve equivalent compression performance but without the speed penalty, it's interesting to compare them, and see what makes them similar, or different.
The basic concept behind Arithmetic Encoding is the simple fact that shifting a bit is equivalent to a division by 2.
So now, it becomes clear that Huffman is just a special case of Arithmetic Encoding : since Huffman only shift bits, it's equivalent to divisions by power of 2.
By fully acknowledging this equivalence, it becomes obvious that it's also possible to divide by any other number, unlimited by the power of 2 requirement. This is how fractional bit is achieved. You can divide by 1.11, and you achieved the equivalent of a 0.1 bit cost.
The illustration above, taken from Wikipedia, is relatively clear, although it may be even simpler to look at an Huffman one. Let's use again this simple distribution :
Each symbol has an offset (or start position) and a range.
A : start 0, range 4 => factor 2
B : start 4, range 2 => factor 4
C : start 6, range 1 => factor 8
D : start 7, range 1 => factor 8
In order to encode a symbol, arithmetic encoding always consider that it's selecting a subrange between 0 and 1, eventually rescaled to match a distribution table. In this case, the table has a size of 8, so we have to output a value between 0 and 7.
To encode symbol B, we restrict our new range between 4 and 6 (excluded). So we first settle with 4, because it is the start value of B.
Then we look at the divisor : we only have 2 positions remaining, so in order to find again the full range of possibility, we have to "upscale" this range by the factor. So we multiply by 4 (e.g. shift left by 2 bits), and find again a range of 8 values to encoded.
We can start again, but now, we have 2 bits which are already determined, occupying higher bit position ('10').
This example is very easy because we have only power of 2. But the exact same mechanism works with any distribution. You could even have a total which is a prime number, and any kind of distribution into it (such as 9,7,1,2 for a total of 19). It would work same : scale the value to 19 slots, add the start value, rescale by the factor, and so on.
Except that, you will have to face rounding errors. In a world of unlimited precision, this would not matter, but alas, that's not how computer works. You have to scale arithmetic precision to a bounded accuracy. It's not necessarily a big issue, since what truly matters is to do the same rounding error on both encoding and decoding (which is a significant problem in itself, especially in a cross-platform environment). Beside, it's equivalent to a small compression loss, and therefore it's important to limit that loss as much as possible.
A common way to do that is to "spread" the segment to a virtual size as large as possible. So, in previous example, we will not have a value between 0 and 18, but rather between 0 and 2^31-1 for example. We accept a small rounding error in doing it, but it's very acceptable.
Now that we have described encoding, let's look at how decoding works.
It's almost the same operation. To simplify decoding, we have a precomputed decoding table, of known size (8 in our first example). We read our first number, using virtual scale, and then scale it back to distribution table size, decode the symbol (B if it's a 4 or 5), get the new range, rescale, read more bits, and so on.
All this works well, but as can be guessed, cost quite a bit of multiplications and divisions in the process. With several clever tricks (typically ensuring that distribution total is a power of 2), it's possible to decrease these operations to a minimum, but alas, one division remains in the decoder, and several multiplications are also there.
Now let's have a look at ANS/FSE decoder.
We start with basically one identical concept : a bit read is equivalent to a division by 2.
A first difference though, is that a bit read is not bit shift.
In contrast with Huffman or Arithmetic, the "state" value is not read directly from the bitstream, it is a virtual value which is maintained alongside the bitstream.
That being said, it can be almost the same : a symbol with a 50% probability will basically divide state by 2, then discover than one bit must be added to keep state within range, and then shift it (equivalent to arithmetic's rescale), and read one bit from the bitStream.
Another difference is that a fractional bit is obtained by reading a variable number of bits. For example, if a symbol should be written using 3.3 bits (equivalent to 10%), the decoder will sometimes read 3 bits, and sometimes read 4 bits.
So we never "rescale by 10" as would do an arithmetic decoder, but rather sometimes rescale by 8, and sometimes by 16 (which, incidentally, is a better fit for binary representation, reducing rounding errors).
Okay, but how is this decided ?
Well, this is the magic of the decoding table, that will be described into another post, but to give you an idea of the bigger principle, here is how I would describe it in simple terms :
the fractional bit is translated by a "decrease" in state value.
When state value is high, state is just decreased, but it doesn't trigger the "additional bit". So, following our 10% example, it means we just read 3 more bits.
When state value is low, it tends to "cross" the lower limit. As a consequence, one more bit is read, and state is regenerated at its full value.
OK, that's roughly how the decoder behaves, from an external perspective.
But it's not enough to fully describe the exact behavior. You cannot expect to subtract a fixed amount from state in order to get your new state value. Unfortunately, it matters if the next state value is exactly 4 or 5, because it will influence some future symbol. So accuracy is more complex to describe. And that will be explained in a later post.
No comments:
Post a Comment