Friday, February 28, 2014

FSE encoding : Mapping sub-ranges in a memory efficient way

With the last blog post, we were left with a description of the FSE compression algorithm. However, the associated code was a bit cryptic. So let's spend some time understanding it.

Here are the lines of code :

nbBitsOut = (state + symbolTT.deltaNbBits) >> 16;
flushBits(bitStream, state, nbBitsOut);
subrangeID = state >> nbBitsOut;
state = stateTable[subrangeID + symbolTT.deltaFindState];

As suggested in an earlier blog post, the first task is to determine the number of bits to flush. This is basically one of 2 values, n or n+1, depending on whether state crosses a threshold.
symbolTT.deltaNbBits stores a value which, when added to state, makes the result of >> 16 produce either n or n+1, as required. It is technically equivalent to :

nbBitsOut = n;
if (state >= threshold) nbBitsOut += 1;

but as can be guessed, it's much faster, because it avoids a test, hence a branch.

The 2nd line just flushes the required nb of bits.

So we are left with the last 2 lines, which are more complex to grasp. They realise the conversion from newState to oldState (since we are encoding in the backward direction). Let's describe how it works.

A naive way to do this conversion would be to create conversion tables, one per symbol, providing the destination state for each origin state. It works. It's just memory wasteful.
Consider for example a 4096 states table, for a 256-symbol alphabet. Each state value uses 2 bytes. It results in 4K * 256 * 2 = 2 MB of memory. This is way too large for the L1 cache, with immediate consequences on performance.

So we need a trick to reduce that amount of memory.
Let's have another look at a sub-range map :


Remember, all origin state values within a given sub-range have the same destination state. So what seems clear here is that we can simply shift all origin state values right by 9 bits, and get a much smaller map, with essentially the same information.


It's simple yet very effective. We now have a much smaller 8-state map for a symbol of probability 5/4096. This trick can be applied to all other symbols, reducing the sum of all sub-range maps to a variable total between the number of states and twice the number of states.

But we can do even better. Notice that the blue sub-ranges occupy 2 slots, providing the same destination state.
Remember that the red area corresponds to n=9 bits, and the blue area corresponds to n+1=10 bits. What we have to do, then, is to shift the origin state by this number of bits. Looks complex ? Not really : we have already calculated this number of bits. Let's just use it now.

subrangeID = state >> nbBitsOut;

For this trick to work properly, state values must be scaled not from 0 to 4095, but from 4096 to 8191. Doing this, it results in the following sub-range map :


A few important properties to this transformation :
- There are as many cells as symbol probability.
- The first subrangeID is the same as symbol probability (in this example, 5).
- The sub-ranges are now stored in order (from 1 to 5). This is desirable, as it will simplify the creation of the map : we will just store the 5 destination states in order.
- Since sub-range maps have same size as symbol probability, and since the sum of probabilities is equal to the size of state table, the sum of all sub-ranges map is the size of state table ! We can now pack all sub-range maps into a single table, of size number of states.

Using again the previous example of a 4096 states table, for an alphabet of 256 symbols : each state value still uses 2 bytes, but we now essentially disregard the alphabet size, which no longer impacts memory allocation.
It results in 4K * 2 = 8 KB of memory, which is much more manageable, and suitable for an L1 cache.

We now have all sub-range maps stored into a single common table. We just need to find the segment corresponding to the current symbol to be encoded. This is what symbolTT.deltaFindState does : it provides the offset at which to find the correct segment in the table.
Hence :
state = stateTable[subrangeID + symbolTT.deltaFindState];
This trick is very significant. It was a decisive factor in publishing an open source FSE implementation, as the initial naive version was impractical : too memory hungry and too slow.

Tuesday, February 25, 2014

FSE encoding, Part 2

In the previous article, we learned how to determine the number of bits to write, and which bits to write, when encoding a symbol using FSE compression. We are therefore left with the conversion exercise, from newState to oldState (remember that we encode in the reverse direction). This is the most important step of the compression process.


Remember that we said we could simplify the determination of nbBits by a simple threshold comparison, simply by looking at sub-ranges map, such as this one :


Well, in fact, thanks to this map, we know quite a bit more : we know in which exact sub-range the current newState fits.
Remember that we have numbered these sub-ranges, starting with the larger ones :


What we need to know is the ordered position of symbols in the state table. Since we have 5 sub-ranges, it simply means we also have 5 symbol cells, directly associated.


And that's it, we know the sub-range, we know the destination state of the encoding process, oldState.

Encoding is therefore just a repeat of this process :
- get Symbol to encode
- look at current state value
- determine nbBits, flush them
- determine sub-Range Id
- look for Symbol position of same Id : you get your next state

If you look into FSE source code, you'll find that the following lines of code perform this job :

nbBitsOut = (state + symbolTT.deltaNbBits) >> 16;
flushBits(bitC, state, nbBitsOut);
state = stateTable[(state >> nbBitsOut) + symbolTT.deltaFindState];

If you feel that the correlation between the text and the code is not obvious, it's because it's not.
To reach this compact expression, a few non-trivial tricks have been used, reducing the amount of computation, and quite importantly, reducing the amount of memory required to describe the sub-ranges map.

These will be explained in a later post.

Wednesday, February 19, 2014

FSE encoding : how it works

Quite unusually, we have started this series on FSE by focusing on the decompression part. It may seem strange to try to decode data we have not even started to encode. That's because it's much easier to understand the logic this way, the decoder being much more "straightforward".
But we are still left, now, with the problem of properly encoding our source data.

For an overall view of ANS theory, I will point to the updated paper by Jarek Duda, and for an in-depth description of ANS properties, let's mention the excellent 12-part series of Charles Bloom. With this context in mind, this blog will now concentrate on actual programming concepts, allowing you to build your own FSE encoder / decoder.

One key concept to keep in mind when dealing with ANS streams is the requirement to encode and decode in opposite directions. This is unusual, since the better-known entropy codes (such as Huffman or Arithmetic, but also Golomb, Shannon, Elias, Tunstall, etc.) keep the intuitive property of encoding and decoding forward. No longer with ANS.

This property is important to understand what we have to do to make compression work : we will simply reverse operations.

Let's have a look at these operations. Taken from previous article, here is a snippet from decoding loop :

// Persistent variables across the loops : state and bitStream
       BYTE symbol = decodeTable[state].symbol;
       int nbBits = decodeTable[state].nbBits;
       int rest = getBits (bitStream, nbBits);
       state = decodeTable[state].newStateBaseline + rest;

If we roll back in time, we start by inverting the last operation :

       newState = decodeTable[oldState].baseline + rest;

OK, this looks almost impossible. We don't yet know oldState, since it's the final objective of the encoding stage, and we don't know the baseline/rest decomposition. But let's get into details, solving problems one by one.

/* rest => */ int rest = getBits (bitStream, nbBits);

rest is basically what the decoder will read from the bitStream. So the inverse operation is to write those bits into bitStream.
rest represents the lower part of newState. More precisely, it is the nbBits lower bits of state.
Which leads us to the next question : how to determine nbBits ?

Well, in contrast with the decoder, this time, the encoder knows which symbol it has to encode. This is the critical piece of information we will use.

Remember that we emulate fractional bits by using sometimes n bits, sometimes n+1 bits, resulting in an average of n.fractional bits. Therefore, knowing symbol, we already can tell that nbBits is going to be n or n+1.

But in fact, we know a bit more than that. Remember that, by construction, all sub-ranges must fill the entire range of state values. Remember also that smaller sub-ranges are present at the beginning, while larger ones occupy the rest. It gives the following example (for a probability of 5/4096) :


In this example : we know that if newState is >= 1024, then we are supposed to use 10 bits. If newState < 1024, then we will use only 9 bits.
You obviously don't need to store a full table per symbol, just knowing the threshold is enough.

And that's basically what FSE is doing : start by calculating nbBits :

int nbBitsOut = symbolTT[symbol].minBitsOut; // n
if (state >= symbolTT[symbol].threshold) nbBitsOut++; // or n+1

Note that in later versions, this process was changed for a different formula giving the same result :
U32 nbBitsOut = (U32)((statePtr->value + symbolTT.deltaNbBits) >> 16);

In this new version, there is no branch, nor cmov. It is replaced by a fixed right shift. The result is controlled by symbolTT.deltaNbBits, which ensures the result is n or n+1 depending on state value.

Knowing nbBits is key, since it allows us to determine rest, and therefore to flush this part into the bitStream.

       int rest = newState & mask[nbBits];
       writeBits(bitStream, rest, nbBits);

With that part solved, we are left with :

    intermediateState = newState - rest = decodeTable[oldState].baseline;

which brings us one step closer to determine oldState, the final objective of the encoding stage.


to be continued...

Friday, February 14, 2014

FSE decoding : wrap up

 With all the previous articles written on FSE, it seems we now have all elements to build a complete FSE decoder. Let's wrap up all this learning into a single piece.

To start with, we consider that all symbols are attributed a weight, of minimum 1, normalized over a sum which is a power of 2 (8, 16, 32, etc.).
The size of the decoding table is this sum.

Each symbol will appear in the table exactly as many times as its weight.

We have seen that, to maximize table's capability to capture fractional bits, we have to spread each symbol over the table, at roughly regular distances (the more regular, the better the compression performance).
For example, using the previous example of a symbol A with weight 5/16 :


We have seen that destination sub-ranges must sum up to cover the complete table, with each sub-range having a size in power of 2, defining an exact number of bits to use. We have seen that the smaller ranges must be placed at the beginning.


We now must associate each symbol in the table with a destination range.
To do that, we will number the ranges, beginning with the larger ones.


So the final assembling becomes :


If you build a decoding table using the above construction method, you get a simple decoding algorithm as a result. The hot loop is a mere 4 lines, as follows :

// Persistent variables across the loops : state and bitStream
       BYTE symbol = decodeTable[state].symbol;
       int nbBits = decodeTable[state].nbBits;
       int rest = getBits (bitStream, nbBits);
       state = decodeTable[state].newStateBaseline + rest;

This is all it takes to build your own FSE decoder.
You can always use the code at Github for further guidance.

Next part will be a study of the compression algorithm part of FSE, which is a bit more complex...

Sunday, February 9, 2014

FSE : distributing symbol values




 Following the last article on defining optimal sub-ranges for FSE, and attributing them according to symbol ranking, we were left with the question of how to position symbol values into the state table.

We already said that an FSE decoding table following ANS theory is equivalent to a Huffman low-bit-order decoding table : a symbol with a probability of 50% will be positioned once every 2 cells. With a probability of 25%, it will be positioned once every 4 cells. And so on.


Obviously, the next question is : what if the probability of a symbol is not a power of 2 ?
A first obvious idea is to follow the same logic : if, for example, a symbol has a probability of 33.333%, it should be positioned once every 3 cells.

OK, but even 33% is an easy target, because it's such a convenient fractional number. What if we have, for example, a probability of 7.27%, resulting in an average distance between symbols of 13.7551... cells ?
Well, an intuitive idea is that we will try to approach this distribution, typically by spacing symbols sometimes by 13 cells, sometimes by 14 cells. And indeed, it works.

OK, it may look clear and simple for a single symbol. But what about all other symbols together, trying to get their own regularly spaced slots ? Won't they be fighting for the same slots ?

Let's take an example, to illustrate the situation. Let's imagine that, on a 16-states decoding table, we have the following distribution :

A : 5 ; B : 5 ; C : 3 ; D : 3

In this case, we have a first issue, because 16/5=3.2, which obviously is not an integer. So we will have to space A by a distance of 3.2, which we can only approximate, by being 3 most of the time, and maybe 4 sometimes.

The intuitive optimal distribution for A is reached by applying this algorithm :
Position 0 : at P/2 (=1.6, so rounded down to 1 for example)
Position n = position n-1 + 3.2
Which gives : 1 (1.6), 4 (4.8), 8 (8.0), 11 (11.2), 14 (14.4)
(Edit : Charles Bloom calls this method "bias 0.5", and also notes that "bias 1.0" actually gives better results, which feels strange, as "bias 0.5" is supposed to be "optimal". This result is likely related to the fact that positions in the state table are not equal in terms of contribution to probability, with the lower states being worth more than the higher ones. See follow-up investigation)

Yeah, this would be fine enough, except that, A is not alone.
What about B ? It has the same probability, so it should occupy the same positions ?
Of course not, only one symbol can occupy a state, so let's say that B is kind enough to move its own positions up by 1, leading to 2, 5, 9, 12, 15.

OK, it looks fine, so now, what about C ?
C distribution is 3, so Pos0(C) = (16/3) / 2 = 5.33/2 = 2.66 => 2.
Humm, that's a problem, 2 is already used by B.
OK, let's say B has higher priority because it should have started at 1.6, which is lower than 2.66. So we keep 2 for B, and give 3 to C.

And so on. Add D to the mix for even more fun. D basically takes whatever cell remains for its own use, wherever they are.



As you can see, it seems to be a mess. And it is : with more and more symbols filling the table, space is necessarily becoming scarce for later symbols. Notice how D ends up receiving 2 positions clustered together, instead of being nicely spread.

Thankfully, a "perfect" distribution algorithm is possible : for each position, compare the fractional state values of all symbols, pick the lowest one, increase its value by step = 1/probability, and repeat the same process for the next state. This produces a lot of ties, but symbol order can be used to break them.
It works.
It's just damn slow, due to the huge number of comparisons required.



It can be sped up by using a bucket sorting strategy. An interesting method is to use more buckets than states, for easier tie management. It works fine too, and drastically reduces the number of comparisons involved, at the cost of a bit of accuracy. But that's okay : efficiency remains good enough, "close to Shannon limit", which means that precise initialization is actually not required.



Let's insist on this sentence : precise initialization is not required.
Indeed, the only point is to reduce the amount of "waste" in fractional bit estimation. Where does that waste come from ?
Well, as said in a previous article, a fractional bit is translated into a "decrease" of the state value.


For this to work correctly, i.e. to minimize the fractional bit cost, we need to have our target symbol approximately at the intended destination state value. If not, then we will "waste" more fractional bit, in order to find the intended symbol.

With this concept in mind, the worst thing that can happen is to "cluster" the values together at one end of the range : it maximizes the probability that state value will have to be decreased more than necessary to find the next symbol.



On the other hand, if a symbol with a probability 50% is distributed evenly, once every 2 cells, finding it is highly simplified, and even guaranteed at a distance of at most 1.
The above explanation is a bit simplified, but works well at explaining how fractional bit works.

So, it's not very important to be precise, what matters is to be well "spread" across the spectrum of state values.

With this idea in mind, we can concentrate on the next objective : speed.
Rather than trying to find the best distribution, we'll try to find one which is "good enough" at spreading symbols across the table, and most importantly fast to execute.

With some experiment, I ended up selecting this simple formula :
nextState = (currentState + (5/8) range + 3) % range
which features pretty good spreading power, and is way faster to run than all previous proposals, thanks to the absence of comparisons and branches, and the guarantee to reach all state values in a single scan. Using the above example, this solution produces the following distribution :



in a word, good enough.
Note that, in a real implementation, the result looks even better than in this example, due to the use of larger tables. We are limited here, for visibility, to 16 states, but the heuristic has been tuned for larger sizes, 512 states and beyond.

In Charles Bloom's article, a different distribution method is recommended, closer to optimal, and backed by experimental results. There's no contradiction : this blog is called "fast compression" for a reason, and that's why, in this recommendation, speed takes precedence.

Wednesday, February 5, 2014

FSE : Defining optimal subranges

Note : Charles Bloom started an independent in-depth analysis of ANS on his blog, which is definitely worth the read.


 As stated in earlier FSE comparison with Arithmetic Coding, FSE deals with fractional bits by "decreasing" its state value. Now how does that work precisely ?

Let's use again the example of a symbol whose probability of appearance is 5/4096. The Shannon limit for this symbol is 9.68 bits per symbol (if we do better, then some other symbols will suffer, and the overall result will compress less). As stated earlier, we deal with the fractional bit issue by outputting sometimes 9 bits, sometimes 10 bits.

The distribution of 9-bits & 10-bits sub-ranges must ensure that all possible values remain accessible after the current symbol. It's a condition for lossless compression. The result is the following layout :


OK, so why are the 9-bits subranges positioned at the beginning of the range ?
Well, in the previous article, we said that an additional bit is read when the state is so much decreased that it goes below the lower limit. In which case, it just "wraps around", starting again from the top value, at the cost of an additional bit read.



So it's not so much that the 9-bit ranges are at the bottom. Rather, with the above explanation, it is more intuitive to say that the 10-bit ranges are at the top.
Maybe it would be easier to notice with another example, a symbol with 7/4096 probability of appearance.



Note that these subranges are "destination ranges". Each of the 7 symbol_cells are mapped to one and only one of these ranges.

So which symbol is associated with each range ?

As stated earlier, since the additional bit is "paid for" when crossing the low limit, the most likely symbol_cell to trigger the limit is the lowest one.
So we give an index to each of the 7 symbol_cells, in their order of appearance, from the smallest to the largest one.
The largest 10-bits range is attributed to symbol_cells n°1.
The other ranges simply follow, in their natural order :


This is coherent with an optimal cost associated with a 7/4096 distribution : 9.19 bits.
It means that we are only consuming 0.19 fractional bits each time the symbol is decoded. So we can decode it about 5 times at 9-bits before requiring 10-bits.

This is in contrast with a probability of 5/4096 :



Here, the optimal cost is 9.68 bits, so fractional bits is much higher (0.68 bits). We are likely to cross the low limit much more often. Hence the first 3 symbol_cells trigger the additional bit read, and require 10 bits.

This rule is valid for any probability; so now, whether it is 5/4096, 134/4096, or even 3812/4096, you know how to define destination sub-ranges, and can build the associated decoding table.
(The latest example is interesting, since some sub-ranges will be 0-bits large, which means they map directly to another state value : 3812/4096 => 284 1-bit sub-ranges + 3528 0-bit sub-ranges).

What remains to be understood is how to distribute the symbol states. A naive approach would be to simply cluster them together. For example, a symbol with probability of 7/4096 would occupy state values from 0 to 6. It would work, but compression effectiveness would be very poor.
In another blog post, we'll see why, and how to correct that.



Monday, February 3, 2014

A comparison of Arithmetic Encoding with FSE

 Arithmetic encoding was invented in the 70's by Jorma Rissanen. After a few decades during which the technique almost stagnated, partly as a side-effect of abusive patent strategy, it started to take off in the early 2000's, and is now clearly established as the "current generation" entropy coding stage, thanks to its much better compression effectiveness than Huffman.

Since FSE claims to achieve equivalent compression performance but without the speed penalty, it's interesting to compare them, and see what makes them similar, or different.

The basic concept behind Arithmetic Encoding is the simple fact that shifting a bit is equivalent to a division by 2.

So now, it becomes clear that Huffman is just a special case of Arithmetic Encoding : since Huffman only shifts bits, it's equivalent to divisions by powers of 2.
By fully acknowledging this equivalence, it becomes obvious that it's also possible to divide by any other number, not limited by the power of 2 requirement. This is how fractional bits are achieved : divide by 1.07 (≈ 2^0.1), and you achieve the equivalent of a 0.1 bit cost.

The illustration above, taken from Wikipedia, is relatively clear, although it may be even simpler to look at a Huffman example. Let's use again this simple distribution :




Each symbol has an offset (or start position) and a range.
A : start 0, range 4 => factor 2
B : start 4, range 2 => factor 4
C : start 6, range 1 => factor 8
D : start 7, range 1 => factor 8

The "factor" is simply the total divided by the range.

In order to encode a symbol, arithmetic encoding always considers that it's selecting a sub-range between 0 and 1, eventually rescaled to match a distribution table. In this case, the table has a size of 8, so we have to output a value between 0 and 7.
To encode symbol B, we restrict our new range between 4 and 6 (excluded). So we first settle with 4, because it is the start value of B.
Then we look at the factor : we only have 2 positions remaining, so in order to recover the full range of possibilities, we have to "upscale" this range by the factor. So we multiply by 4 (i.e. shift left by 2 bits), and find again a range of 8 values to encode.
We can start again, but now, we have 2 bits which are already determined, occupying higher bit position ('10').

This example is very easy because we only have powers of 2. But the exact same mechanism works with any distribution. You could even have a total which is a prime number, and any kind of distribution into it (such as 9,7,1,2 for a total of 19). It would work the same : scale the value to 19 slots, add the start value, rescale by the factor, and so on.
Except that you will have to face rounding errors. In a world of unlimited precision, this would not matter, but alas, that's not how computers work. You have to scale arithmetic precision to a bounded accuracy. It's not necessarily a big issue, since what truly matters is to make the same rounding error on both encoding and decoding (which is a significant problem in itself, especially in a cross-platform environment). Besides, it's equivalent to a small compression loss, and therefore it's important to limit that loss as much as possible.
A common way to do that is to "spread" the segment over a virtual size as large as possible. So, in the previous example, we will not have a value between 0 and 18, but rather between 0 and 2^31-1, for example. We accept a small rounding error in doing so, but it's very acceptable.

Now that we have described encoding, let's look at how decoding works.

It's almost the same operation. To simplify decoding, we have a precomputed decoding table, of known size (8 in our first example). We read our first number, using virtual scale, and then scale it back to distribution table size, decode the symbol (B if it's a 4 or 5), get the new range, rescale, read more bits, and so on.

All this works well but, as can be guessed, costs quite a few multiplications and divisions along the way. With several clever tricks (typically ensuring that the distribution total is a power of 2), it's possible to reduce these operations to a minimum, but alas, one division remains in the decoder, and several multiplications are also there.

Now let's have a look at ANS/FSE decoder.

We start with basically one identical concept : a bit read is equivalent to a division by 2.

A first difference, though, is that a bit read is not a bit shift.
In contrast with Huffman or Arithmetic, the "state" value is not read directly from the bitstream, it is a virtual value which is maintained alongside the bitstream.
That being said, it can be almost the same : a symbol with a 50% probability will basically divide state by 2, then discover that one bit must be added to keep state within range, and then shift it (equivalent to arithmetic's rescale), and read one bit from the bitStream.

Another difference is that a fractional bit is obtained by reading a variable number of bits. For example, if a symbol should be written using 3.3 bits (equivalent to 10%), the decoder will sometimes read 3 bits, and sometimes read 4 bits.
So we never "rescale by 10" as would do an arithmetic decoder, but rather sometimes rescale by 8, and sometimes by 16 (which, incidentally, is a better fit for binary representation, reducing rounding errors).

Okay, but how is this decided ?

Well, this is the magic of the decoding table, that will be described into another post, but to give you an idea of the bigger principle, here is how I would describe it in simple terms :
the fractional bit is translated by a "decrease" in state value.

When state value is high, state is just decreased, but it doesn't trigger the "additional bit". So, following our 10% example, it means we just read 3 more bits.



When state value is low, it tends to "cross" the lower limit. As a consequence, one more bit is read, and state is regenerated at its full value.



OK, that's roughly how the decoder behaves, from an external perspective.
But it's not enough to fully describe the exact behavior. You cannot expect to subtract a fixed amount from state to get your new state value. Unfortunately, it matters whether the next state value is exactly 4 or 5, because it will influence some future symbol. So accuracy is more complex to describe. And that will be explained in a later post.

Tuesday, January 14, 2014

Huffman, a comparison with FSE


In this series explaining the behavior of Finite State Entropy, I find it interesting to make a quick comparison with Huffman entropy coding. It's a direct result of the description of its decoding principles; nonetheless, I believe it's useful to clearly establish the equivalence, since it will be illustrative for future articles.

To better grasp the similarities, let's have a simple example. We'll use the following probabilities, suitable for a Huffman tree :

A : 50% ; B : 25% ; C : 12.5% ; D : 12.5%

This tree only needs 8 values to be properly represented, and gives the following linear representation :


This representation uses the "high bit first" methodology, which means we are starting our node travel from the highest bit.
If it's 0, then we have an A. We stop there, shift the register left by one bit and load a new one at the bottom.
If the highest bit is 1, we know we are not done, and need to read the second bit. If this second bit is 0, we know we have a B. If not, we continue reading. And so on.
The decision tree looks as follows :



But we could have done differently : we could have used "low bit first". In this case, the linear representation of the tree looks like this :


Which seems a lot messier, but it's not : we have simply reversed the bit order of the decision tree :



The 2 decision trees are therefore completely equivalent.

Now, look again at the "low bit first" linear representation. It's important to state that this is what FSE would produce, but it also embeds a few additional properties :

  • The "newStateBaseline" does not need to be calculated and stored into the decoding table : it can be automatically determined from the state itself : newStateBaseline = currentState & ~mask(nbBits);
  • nbBits is stable for a given character : for A, it's always 1, for B, it's always 2, etc. It simplifies the decoding table, which does not need to store this information.


So basically, a Huffman-low_bit_first table looks the same as an FSE table; it guarantees a stable number of bits per character, and newStateBaseline is automatically deduced from the remaining bits.

There is, however, a small but important difference.
A Huffman-low_bit_first decoder would shift the current state right, and load the next bit(s) high. In contrast, the FSE decoder keeps the high bits, and only reloads the low bits. This creates a dependency on future state : basically, once you've reached state 2 with FSE, for example, you can already guarantee that you won't see a D before you see a C. Which brings the question : how can the encoder select the right state, since it depends on future characters yet to be decoded ?

The answer : the encoder will work in backward direction, encoding characters from last to first, solving dependencies along the way.

To be continued...