## Monday, December 16, 2013

### Finite State Entropy - A new breed of entropy coder

In compression theory, the entropy encoding stage is typically the last stage of a compression algorithm, the one where the gains from the model are realized.

The purpose of the entropy stage is to reduce a set of flags/symbol to their optimal space given their probability. As a simple example, if a flag has 50% to be set, you want to encode it using 1 bit. If a symbol has 25% probability to have value X, you want to encode it using 2 bits, and so on.
The optimal size to encode a probability is proven, and known as the Shannon limit. You can't beat that limit, you can only get close to it.

A solution to this problem has been worked for decades, starting with Claude Shannon own work, which were efficient but not optimal. The optimal solution was ultimately found by one of Shannon's own pupils, David A. Huffman, almost by chance. His version became immensely popular, not least because he could prove, a few years later, that his construction method was optimal : there was no way to build a better distribution.

Or so it was thought.
There was still a problem with Huffman encoding (and all previous ones) : an hidden assumption is that a symbol must be encoded using an integer number of bits. To say it simply, you can't go lower than 1 bit.
It seems reasonable, but that's not even close to Shannon's limit. An event which has 90% probability to happen for example should be encoded using 0.15 bits. You can't do that using Huffman trees.

A solution to this problem was found almost 30 years later, by Jorma Rissanen, under the name of Arithmetic coder. Explaining how it works is outside of the scope of this blog post, since it's complex and would require a few chapters; I invite you to read the Wikipedia page if you want to learn more about it. For the purpose of this presentation, it's enough to say that Arithmetic encoding, and its little brother Range encoding,  solved the fractional bit issue of Huffman, and with only some minimal losses to complain about due to rounding, get closer to Shannon limit. So close in fact that entropy encoding is, since then, considered a "solved problem".

Which is terrible because it gives the feeling that nothing better can be invented.

Well, there is more to this story. Of course, there is still a little problem with arithmetic encoders : they require arithmetic operations, such as multiplications, and divisions, and strictly defined rounding errors.
This is serious requirement for CPU, especially in the 80's. Moreover, some lawyer-happy companies such as IBM grabbed this opportunity to flood the field with multiple dubious patents on minor implementation details, making clear that anyone trying to use the method would face expensive litigation. Considering this environment, the method was barely used for the next few decades, Huffman remaining the algorithm of choice for the entropy stage.

Even today, with most of the patent issues cleared, modern CPU will still take a hit due to the divisions. Optimized versions can sometimes get away with the division during the encoding stage, but not the decoding stage (with the exception of the Binary arithmetic coding, which is however limited to 0/1 symbols).
As a consequence, arithmetic encoders are quite slower than Huffman ones. For low-end or even "retro" CPU, it's simply out of range.

It's been a long time objective of mine to bring arithmetic-level compression performance to vintage (or retro) CPU. Consider it a challenge. I've tested several variants, for example a mix of Huffman and Binary Arithmetic, which was free of divisions, but alas still needed multiplications, and required more registers to operate, which was overkill for weak CPU.

So I've been reading with a keen eye the ANS theory, from Jarek Duda, which I felt was heading into the same direction. If you are able to fully understand his paper, you are better than me, because quite frankly, most of the wording used in his document is way out of my reach. (Note : Jarek pointed to an update version of his paper, which should be easier to understand). Fortunately, it nonetheless resonated, because I was working on something very similar, and Jarek's document provided the last elements required to make it work.

And here is the result today, the Finite State Entropy coder, which is proposed in a BSD-license package at Github.

In a nutshell, this coder provides the same level of performance as Arithmetic coder, but only requires additions, masks, and shifts.
The speed of this implementation is fairly good, and even on modern high-end CPU, it can prove a valuable replacement to standard Huffman implementations.
Compared to zlib's Huffman entropy coder, it manages to outperform its compression ratio while besting it on speed, especially decoding speed.

Benchmark platform : Core i5-3340M (2.7GHz), Window Seven 64-bits
Benchmarked file : win98-lz4-run
Algorithm Ratio Compression Decompression
FSE       2.688   290 MS/s     415 MS/s
zlib      2.660   200 MS/s     220 MS/s

Benchmarked file : proba70.bin
Algorithm Ratio Compression Decompression
FSE       6.316   300 MS/s     420 MS/s
zlib      5.575   250 MS/s     270 MS/s

Benchmarked file : proba90.bin
Algorithm Ratio Compression Decompression
FSE       15.21   300 MS/s     420 MS/s
zlib      7.175   250 MS/s     285 MS/s

As could be guessed, the higher the compression ratio, the more efficient FSE becomes compared to Huffman, since Huffman can't break the "1 bit per symbol" limit.
FSE speed is also very stable, under all probabilities.

I'm quite please with the result, especially considering that, since the invention of arithmetic coding in the 70's, little new has been brought to this field.

-----------------------------------------------------

A little bit of History :

Jarek Duda's ANS theory was first published in 2007, and the paper received many revisions since then. Back in 2007, only Matt Mahoney had enough skill and willpower to sift through the complex theory, and morph it into a working code. However, Matt concentrated on the only use case of interest to him, the Binary version, called ABS, limited to 0/1 alphabet. This decision put his implementation in direct competition with the Binary Arithmetic Coder, which is very fast, efficient, and flexible. Basically, a losing ground for ANS. As a consequence, ANS theory looked uncompetitive, and slumbered during the next few years.

FSE work re-instates ANS as a competitive algorithm for multi-symbol alphabet (>2), concentrating its perspective as a viable alternative to block-based Huffman.

Thanks to promising early results from FSE, Jarek concentrated back its attention on multi-symbol alphabet. As we were chatting about perspectives and limitations of ANS, I underlined the requirement of a decoding table as a memory cost, and suggested a solution in the making to limit that issue (which ultimately failed). But Jarek took the challenge, and successfully created a new variant. He then published an updated version of his paper. The new method would be called rANS. He would later invent the terms tANS and rANS to distinguish the different methods.

rANS was later adapted by Fabian Giesen and Charles Bloom, producing some very promising implementations, notably vector-oriented code by Fabian.

But as said, this is not the direction selected for FSE, created before Jarek's paper revision. FSE is a finite state machine, created precisely to avoid any kind of multiplication, with an eye on low-power CPU requirements. It's interesting to note such radically different implementations can emerge from a common starting theory.

For the time being, FSE is still considered beta stuff, so please consider this release for testing purposes or private development environments.

Explaining how and why it works is pretty complex, and will require a few more posts, but bear with me, they will come in this blog.

Hopefully, with Jarek's document and these implementations now published, it will be harder this time for big corporations to confiscate an innovation from the public domain.

-----------------------------------------------------

List of Blog posts explaining FSE algorithm :

http://fastcompression.blogspot.com/2014/01/fse-decoding-how-it-works.html
http://fastcompression.blogspot.com/2014/01/huffman-comparison-with-fse.html
http://fastcompression.blogspot.com/2014/02/a-comparison-of-arithmetic-encoding.html
http://fastcompression.blogspot.com/2014/02/fse-defining-optimal-subranges.html
http://fastcompression.blogspot.com/2014/02/fse-distributing-symbol-values.html
http://fastcompression.blogspot.com/2014/02/fse-decoding-wrap-up.html
http://fastcompression.blogspot.com/2014/02/fse-encoding-how-it-works.html
http://fastcompression.blogspot.com/2014/02/fse-encoding-part-2.html
http://fastcompression.blogspot.com/2014/02/fse-tricks-memory-efficient-subrange.html

 : replaced huff0 by zlib on the comparison table
 : added entry on rANS variants by Fabian & Charles
 : added list of blog entries

1. Thanks, Yann - indeed while ABS for "bit-wise adaptive" is practically equivalent to arithmetic coding, ANS for fixed probabilities ("block-based") allows to encode symbol from a large alphabet in a single table check - it should be faster than Huffman, having precision like arithmetic.

The number of states should be a few times larger than alphabet size to work in deltaH~0.001 bits/symbol regime.
You could make it faster by putting also the bitwise operations into the table, like the whole "(symbol,state)->(bit sequence, new state)" rules ... but at cost of using lower number of states/alphabet to remain in L1 cache.
Generally there are lots of options to optimize among ...

Another advantage of ANS is that we can slightly perturb the initialization procedure using a pseudo-random number generator initialized with cryptographic key (e.g. and also the number of block): for example choosing between the lowest weight symbol and the second best one.

The advantage of Huffman is that we can adaptively modify the tree on the run, but we could also do it in ANS - switch some appearances between symbols and renumerate the following ones ... but it would be costly for a large table.

About materials, the recent paper ( http://arxiv.org/abs/1311.2540 ) should be more readable and here is a poster gathering basic information: https://dl.dropboxusercontent.com/u/12405967/poster.pdf .

1. Hi Jarek

> You could make it faster by putting also the bitwise operations into the table, like the whole "(symbol,state)->(bit sequence, new state)" rules

I believe that's exactly what the current code is doing. If you have time, you may want to have a look at https://github.com/Cyan4973/FiniteStateEntropy/blob/master/fse.c , to give some thoughts and comments about further optimizations.

> The advantage of Huffman is that we can adaptively modify the tree on the run

Another advantage of Huffman is that it is possible to reduce the size of decoding table, and even remove it entirely, in exchange for a more complex and slower node-based decoding.
This property can be critical for systems with very low RAM available.
In contrast, FSE requires the full table to be in memory.
The consequence is that the number of states must be kept as low as possible to be compatible with most destination limitations.

> the recent paper ( http://arxiv.org/abs/1311.2540 ) should be more readable

Thanks, I'll update the article with your information.

2. > The number of states should be a few times larger than alphabet size to work in deltaH~0.001 bits/symbol regime.

For information :
in most circumstances I could test, current FSE implementation is able to compress more than my reference Range coder implementation, Range0 (http://fastcompression.blogspot.fr/p/huff0-range0-entropy-coders.html).
OK, Range0 was designed for speed rather accuracy, but nonetheless, since it's part of the arithmetic encoder family, its compression ratio is still a good comparison point for FSE.

This is achieved using only 12 bits for FSE state.

3. Thanks,
>I believe that's exactly what the current code is doing (...)
Yes, now I see that you indeed use tableDecode[i].symbol, .nbBits and .newstate - I will think how to make faster.

You could use larger b to get more bits at once - fpaqc uses b=256 to retrieve the whole bytes. The cost is loss of precisions, what is negligible for directly using arithmetic, but could be painful here ...

As I imagine, a perfect software decoder would create assembler code for a given automaton: parts of fixed length (L) code describing behavior for given state, ending with: jump to "new state" * L position.

>Another advantage of Huffman is that it is possible to reduce the size of decoding table, and even remove it entirely, in exchange for a more complex and slower node-based decoding (...)
Decoding Huffman is moving on the tree, which has "the size of alphabet" leaves - how you can manage without having this tree stored in memory?

If you want low memory ANS, see the poster - switching between 5 automata having 5 states allows for deltaH~0.01 bits/symbol, about 20 automata of 16 states for deltaH~0.001.

2. > Decoding Huffman is moving on the tree, which has "the size of alphabet" leaves - how you can manage without having this tree stored in memory?

Indeed, this is the minimum required. But it's much smaller than a full decode table, which would read the bitstream and directly give the symbol.
On the other hand, for FSE, the full decode table seems mandatory.

> If you want low memory ANS, see the poster - switching between 5 automata having 5 states allows for deltaH~0.01 bits/symbol, about 20 automata of 16 states for deltaH~0.001.

I'm not sure to understand this part.
The poster seems to concentrate on the ABS variant.
I understand the concept of switching between different state tables,
but I don't see how it could allow to build a lower-memory version of ANS.

Currently, the default settings of FSE is to use a single 12-bits table to decode symbols which can have up to 256 values.
Suppose I'm using this construction to encode.
How would you propose to decode it using less memory than FSE does now ?

1. > Indeed, this is the minimum required. But it's much smaller than a full decode table (...)
Still the amount of required memory is proportional to alphabet size - are you sure that storing a tree (not a succinct tree, but a quick access one) is less costly than decoding table? Which representation?
The minimum needed for ANS decoding is symbol and new state, for encoding: new state (the number of bits are not required) - it is a few bytes times the alphabet size.
>The poster seems to concentrate on the ABS variant.
There are two graphs on the right size - covering all probabilities for 5 and 16 state automates. But indeed it is for the binary case - for larger alphabets we need more memory.

> How would you propose to decode it using less memory than FSE does now ?
The only still fast way I see is not using the "nbBits", which can be simply deduced or we can take bits one by one - but it will be a bits slower.

2. > Still the amount of required memory is proportional to alphabet size - are you sure that storing a tree (not a succinct tree, but a quick access one) is less costly than decoding table? Which representation?

Oh yes,
a decoding table costs between 1KB & 16KB, depending on level of accuracy,
while a tree will cost between 32 bytes and 512 bytes, depending on alphabet size. Almost pocket change in comparison.

> The minimum needed for ANS decoding is symbol and new state, for encoding: new state (the number of bits are not required) - it is a few bytes times the alphabet size.

Not in my implementation. It is a few bytes times the state table size (aka, the decoding table).

> The only still fast way I see is not using the "nbBits", which can be simply deduced or we can take bits one by one - but it will be a bits slower.

Yes, but not only is the gain minor (about 25%), and therefore far from the objective, it's also not a gain at all for normal C structures, since for alignment reasons, each state is going to cost 4 bytes anyway, whether the field .nbBits is present or not.

3. >(...)while a tree will cost between 32 bytes and 512 bytes, depending on alphabet size.
32-512 bytes to quickly decode a 256 size alphabet?
I have looked at some article about Huffman decoding ( http://www.commsp.ee.ic.ac.uk/~tania/teaching/DIP%202012/huffman_1.pdf ), and don't see how would you like to do it?

For ANS decoding of 256 size alphabet, the reasonable minimum is 512 states - you need 1 byte for the symbol plus at most 10 bits for the new state (8 if no symbol is above 1/4 probability, 6 if not above 1/16,...) - about 1kB would be enough.

1. hey, it's not "fast". It's either "small memory footprint" and lots of node travels, hence slow, OR, large table, simple pickup, fast, but lot of memory.

Regarding small decoding tables for ANS/FSE, I arrived at the same conclusion regarding number of states (at least nb values x 2) and size of fields (although, to be fair, expect some fairly large impact on compression ratio with only 512 states for 256 symbol values...).

But it misses the point.

Want I meant initially is : with Huffman, you can, on the compression side, decide the depth of the Huffman tree (and even select a very deep one),
and then, *independently*, the decoder can select the optimum strategy to decode it, depending on its available memory (either a large table, or traveling through node, or an intermediate solution such as zlib, which uses linked 10-bits tables)

It's not possible to do that with ANS : it's necessary to take into consideration the memory limit of the decoder side *during the compression process* : the number of states needs to be decided right at the beginning.
Which is, currently, the only limitation which I feel does matter, for example for embedded systems environment (and "retro" ones...).

4. Still if something was compressed using Huffman for 256 size alphabet, I don't think you could decompress it using less that 1kB memory?
And generally in situation when the memory is that important, you could use arithmetic or ABS instead...

For ANS with very small number of states, the way we spread symbols becomes really essential. I see you use a bit different method than mine - while checking all possible spreads for a few cases, the method from the last paper gave usually the optimal spread ...

5. > Still if something was compressed using Huffman for 256 size alphabet, I don't think you could decompress it using less that 1kB memory?

511 bytes, to be exact

> And generally in situation when the memory is that important, you could use arithmetic or ABS instead...

It really misses the point :
it's certainly possible to build a low-memory entropy coder *by design*.
What I'm trying to underline is that there is no need to make such a trade off on the compression side when using Huffman. It can be figured out on the decompression side as needed.
Oh well, at least I tried to explain.

> the way we spread symbols becomes really essential. I see you use a bit different method than mine

Indeed. I was expecting to detail that point in a later post.

An optimal distribution can be built, and that's what was doing my initial algorithm. But it was slow. Moreover, it required some divisions and a few multiplication to init critical variables, not a lot, but still not ideal for low-power cpu.

Initially I thought that it was critical for the performance of the algorithm to have a perfect distribution, but then I started experimenting, and understood that no, it was just a matter of efficiency, a classic compression trade-off.

The method currently coded into FSE has been selected because :
1) It's damn fast.
3) It's still very close to optimal. I typically get a 0.01% difference between "optimal distribution" and this method.
4) It's not adapted to very small number of states. Below 64 states, forget it, it's not designed for such a use case.

6. I understand the advantage of Huffman decoder flexibility, but don't see how to decode 256 size alphabet in 511 bytes...
This cost becomes essential for history/context dependent probability estimation (in contrast to Huffman, with ANS it makes sense - especially for static compression like of DNA) - the memory requirement has to be multiplied by the number of distinguished histories/contexts (separate coding tables).
The precision of symbol spread is nearly negligible for a large number of states (we could use this freedom for simultaneous encryption or maybe a different purpose?), however for this 1kB case: 256 size alphabet on 512 states, it will probably mater.
The main cost in precise spread is finding the symbol with the minimum weight - it is linear cost if we would just checked one by one, but it can be reduced to log(n) by using heap.

7. Hi Yann,
I was thinking about your decoding hot loop - you access the data stream table every step ( bitStream = *(U32*)ip ), what seems costly - maybe it would be faster to use 64 (or 32) bit "buffer" there?
Something like this (uses ip table >4 times less frequently):

state = decodeTable[state].newState + (buffer & ((1<> nbBits;
If((bitCount-=nbBits)<=16) {buffer = buffer | (*(U32*)ip <<bitCount; bitCount+=32; ip-=4;}

The "16" is the maximum number of bits that can be used in a single step (FSE_MEMORY_USAGE). For 32 bit buffer this 16 is important and you read 2 bytes when needed (bitCount+=16; ip-=2)

Also, bitor ("|") could be a bit faster than "+", and are you sure that "mask[nbBits]" is faster than "(1<<nbBits)-1" ?

1. Indeed, optimizing the bit-flush operation (for both write or read mode) is part of the work to do.

I'm opened to any suggestion. Keep in mind I already tested many variants, and selected this one based on its proven benchmark result. Since many variations can be invented, please consider benchmarking them, it will be easier to consider if integrating the new ideas is worthwhile.

Now, onto the details.
A "branch" is a very costly operation on modern processors, or to be more precise, an unpredictable branch is very costly. The cost is so high that, most of the time, it's better to do heavier but more predictable operations.

Reading all the time (instead of using a branch) has been tested and seems faster in most circumstances. Writing is another question, where the lead is more debatable.

The predictability of the branch is a big part of the equation.
For "low ratio" files (<=2), the branch is much too unpredictable.
For "high ratio" files (for example, if one symbol value is > 90%), high ratio translates into slow bit flushing operation. For these cases, the branch will be more "predictable" : the CPU can consider that, on average, there is no read/write operation, and speculatively jump over it. When it's wrong, then the pipeline must be flushed, and execution resume at the I/O operation. The cost is a hefty number of cycles. But if it happens rarely, then it's worth it.

The same holds true for 64-bits versus 32-bits accumulators. The first one is likely to create a "more predictable branch", since the number of symbols that can be read or written before it is needed to read/write into memory is larger.

In the end, what this means : it's likely that FSE will need several bit-flushing operation routines, depending on 32/64 bits systems, and predicted compression ratio.
Since it's complex to test and to maintain, this part has been kept for a later time.

You're welcomed to try your proposals in the meantime. Please consider using some time to benchmark them, so that you can now if it is likely to produce a gain.

2. I have no experience with low level optimization, so I only write my suggestions (so sure, they can be stupid).
Table use sounds like a relatively costly operation and the number of using the stream (ip and op) definitely can be reduced.
If branching is indeed even more costly, you can group a safe number of steps - use a loop:
- refill the buffer to 57-64 bits,
- place a few copies of step, such that 57 bits will certainly be sufficient (4 in our case):
u64 buffer, symbols;

symbols=decodeTable[state].symbol;
state = decodeTable[state].newState + (buffer & mask[nbBits]);
buffer = buffer > > nbBits;
bitCount-=nbBits;

symbols=symbols | decodeTable[state].symbol << 8;
state = decodeTable[state].newState + (buffer & mask[nbBits]);
buffer = buffer > > nbBits;
bitCount-=nbBits;
...

In this way, without any branching, you get 4 times less uses of *ip and *op table.

3. Register scarcity is another problem to face when dealing with 32-bits mode, but okay, I get the idea. Reducing the number of accesses for the bitStream sounds like an obvious proposition, but reducing it for symbols is less usual. I'll try that.

4. If there is lack of registers, "bitCount" can be removed by adding a guardian: single "1" in buffer in front of the bit sequence.
Now this " en.wikipedia.org/wiki/Find_first_set " operation allows to retrieve the number of bytes from stream to fill the buffer (but we need to shift the guardian).

Optimizing bit and symbol flush, especially on x64, should double the speed.
Best,
Jarek

ps. And generally I really doubt that "mask[nbBits]" is faster than "(1<<nbBits)-1" (?)

5. > generally I really doubt that "mask[nbBits]" is faster than "(1<<nbBits)-1" (?)

On my Core i7 system, I confirm that dynamically calculating the mask, instead of getting it from a pre-computed table, is faster ... inside the decoding hot loop.

The situation is less good insider the compression. Here, the dynamic calculation incur a small loss (about 2%, so it's marginal).

Nevertheless, it shows the answer is not straightforward. It may depend on the parallel execution capabilities of the CPU, which varies vastly depending on model. Core i7 sure is one of the best in this respect.

I'll make another test later on a different CPU to see if the results remain consistent. At the very least, I expect to bring this improvement to the decoding loop.

6. A bit of complement : the speed improvement is, alas, only true in 64 bits mode.
In 32-bits, performance plummets.

There is probably an issue with register scarcity.
Anyway, it shows once again that optimization strategy varies depending on target.

8. It keeps removing one line:

state = decodeTable[state].newState + (buffer & mask[nbBits]);
buffer = buffer > > nbBits;
If((bitCount-=nbBits)<=16) {buffer = buffer | (*(U32*)ip < <bitCount; bitCount+=32; ip-=4;}

9. This is a very interesting technique. It seems to have the potential to outperform my SSE-aided range decoder in my current compressor, at least when the number of contexts is small. Alternating between many decode tables is probably not a good idea with this technique.
Which is faster also depends on how well this method behaves when you interleave multiple independent decodes.
Right now I'm more interested in getting the implementation of the single decode as fast/simple as possible.

Here are some of my finding so far:
Disclaimer this is tested on an i7 laptop with msvc2012. Your mileage may vary.

Baseline decode loop: 190mb/s
-while(op<oend)
-{
- U32 rest;
- const int nbBits = decodeTable[state].nbBits;
- *op++ = decodeTable[state].symbol;
- bitCount -= nbBits;
- rest = (bitStream >> bitCount) & mask[nbBits];
- {
- int nbBytes = (32-bitCount) >> 3;
- ip -= nbBytes;
- bitStream = *(U32*)ip;
- bitCount += nbBytes*8;
- }
- state = decodeTable[state].newState + rest;
-}

- rest = (bitStream >> bitCount) & mask[nbBits];
->
- rest = (bitStream >> bitCount) & ((1<<nbBits)-1);

Use two shifts to isolate bits and reorder to save a sub: 226mb/s
- bitCount -= nbBits;
- rest = (bitStream >> bitCount) & ((1<<nbBits)-1);
->
- rest = (bitStream << (32 - bitCount)) >> (32 - nbBits);
- bitCount -= nbBits;

We now have (32 - bitCount) twice in the same code segment, so lets invert it and keep track of 32-bitCount instead:
This also requires us to flip af few signs, as we are now counting in the opposite direction.

Inverted bit counter: 229mb/s
-bitCount = 32 - bitCount;
-while(op<oend)
-{
- U32 rest;
- const int nbBits = decodeTable[state].nbBits;
- *op++ = decodeTable[state].symbol;
- rest = (bitStream << bitCount) >> (32 - nbBits);
- bitCount += nbBits;
- {
- int nbBytes = bitCount >> 3;
- bitCount -= nbBytes*8;
- ip -= nbBytes;
- bitStream = *(U32*)ip;
- }
- state = decodeTable[state].newState + rest;
}

We can now simplify the 'renormalization' code a bit more.

Simpler bitCount renormalization: 242mb/s
- int nbBytes = bitCount >> 3;
- bitCount -= nbBytes*8;
->
- bitCount &= 7;

If we negate nbBits (and change to signed char) in the decode table, we get:
- rest = (bitStream << bitCount) >> (32 - nbBits);
- bitCount += nbBits;
->
- rest = (bitStream << bitCount) >> (32 + nbBits);
- bitCount -= nbBits;

On x86 the second operand of shifts are modulo 32, so we can omit the +32. You probably want a define for that.
Modulo 32 shift trick: 252mb/s
- rest = (bitStream << bitCount) >> (32 + nbBits);
->
- rest = (bitStream << bitCount) >> nbBits;

I'm interested to hear if you see similar improvements.
-Rune

1. This is definitely a very interesting suggestion. Be sure I'll look into it and post feedback here.

2. Oversized shifts are undefined behaviour in c, so the mod 32 trick doesn't work in this form.

> If the value of the right operand is negative or is
greater than or equal to the width of the promoted left operand, the behavior is undeﬁned.

One could try `x >> (y & 0x1F)` hoping that the compiler optimizes it out. No idea if common c compilers do so.

3. > One could try `x >> (y & 0x1F)` hoping that the compiler optimizes it out.

It's a good idea.

Also worthy to note, the "Oversized shifts" trick is not currently useful for FSE, since >> (32-nbBits) doesn't work when nbBits==0.

Regards

10. This leaves us at: (still assuming nbBits is negated)
-bitCount = 32 - bitCount;
-while (op<oend)
-{
- U32 rest;
- const int nbBits = decodeTable[state].nbBits;
- *op++ = decodeTable[state].symbol;
-#if USE_MODULO32_TRICK
- rest = (bitStream << bitCount) >> (nbBits);
-#else
- rest = (bitStream << bitCount) >> (32 + nbBits);
-#endif
- bitCount -= nbBits;
- {
- int nbBytes = bitCount >> 3;
- ip -= nbBytes;
- bitStream = *(U32*)ip;
- bitCount &= 7;
- }
- state = decodeTable[state].newState + rest;
-}

1. Hi Rune

I've been looking into your proposal.

Removing precompute mask from the hot decoding loop is indeed a win, but only in 64-bits mode.
In 32-bits mode, it's a loss, by about the same amount (~20%).
So it's necessary to differentiate both situations.

My current guess is that it might be a side-effect of register scarcity.

The second part :
rest = (bitStream << (32 - bitCount)) >> (32 - nbBits);
doesn't work.

That's because it doesn't cover the case where nbBits=0.

It is indeed possible to have nbBits=0, when the probability of one symbol is > 50%. In this case, '>> (32 - nbBits)' doesn't "erase" the value for rest to zero, and program crashes.

2. Using msvc2012 i see a significant improvement in both 32-bit and 64-bit.

You are right about the nbBits=0 case. My test data was not redundant enough for it to show up. That was a pretty embarrassing mistake. I can see two easy fixes. Either you force the shift to be performed in 64bit with a cast, but that probably won't work well in 32bit and actually not in 64bit either, as you would probably want bitStream to be 64bit itself in most cases.

The other option is to perform the shift in two steps:
U32 rest = ((bitStream<<bitCount)>>1) >> (31 - nbBits);
bitCount += nbBits;
Although it is still pretty fast (237mb/s) it feels like there must be a better way. I'll have to play around with it some more.

3. Yes, this version works, and still produces a small gain compared to the mask method.
This time, it also works for the 32 bits version, which shows a small gain too.

> Using msvc2012 i see a significant improvement in both 32-bit and 64-bit.

That difference report was limited to the "remove precompute mask" part. Mask is removed later on, using double (triple?) shifting instead.

4. OK, I've been using more elements from your suggestion list.
The only one which is missing is the "modulo 32 trick". I've consequently kept nbBits at positive values so far.

Nonetheless, I notice some very interesting improvements.
Decoding speed has reached 265 MB/s on my system.
That's on 64 bits mode.
32-bits, unfortunately, doesn't benefit much, and remain stuck at 220 MB/s. Less impressive, but still positive.

The last stage I'm hesitating on include is the "modulo on shift" trick, for portability (and complexity) reasons.

Do you know if this behavior is limited on x86, or is the same on a larger range of CPU, such as ARM typically ?

Regards

5. I just came to realize that the whole point of negating nbBits is to simplify the expression >> (32+nbBits), using the "modulo 32 trick", into >> nbBits.
However, since the latest version is now >> 1 >> (31+nbBits), to handle the special case nbBits==0, it makes no more sense to negate nbBits. At least, as long as there is no workaround.

It's not a problem though, since even with the triple shift, the improvement in decoding speed is definitely noticeable in 64 bits mode.
I'll use this opportunity to update the repository on GitHub.

Regards

6. As an interesting side-effect :
I just changed the source code on GitHub to separate encoding/decoding from header management.

As a side effect, it resulted in the 32-bits version decoding speed jumping on part to the 64-bits one, at 265 MB/s. Quite a visible gain compared to previous 220 MB/s, especially for a change which was not targeting speed at all.

My interpretation is, once again, concentrated on register scarcity. 32-bits mode only has 8 registers, while 32-bits mode has a roomier 16 registers.
My current guess is that, with decoding alone in its own function, all registers are only used for this task, while previously, with both header management and decoding in the same function, some registers may be allocated for variables not dedicated to the hot decoding loop. It probably resulted in as little as 1 or 2 registers unavailable for the hot loop, which was enough to have such a deep impact.

Well, anyway, whatever the reason, it seems solved now.

7. i've seen the same effect when optimized my Delta a year ago - placing hot loop into its own function made it faster. not sure about compiler but most likely it was gcc 3.4.5

11. hi rune stubbe,
You use both ip and op tables once per step - using 64 bits variables as buffers, you could use op table once per 8 steps and ip table could be safely used once per 4 steps - wouldn't it be much faster?

Just refill buffer and flush symbols e.g. every 4 steps:

symbols = symbols | decodeTable[state].newState < < ... ;
state = decodeTable[state].newState + (buffer & ((1 < < nbBits)-1));
buffer = buffer > > nbBits; bitCount -= nbBits;

12. Jarek: Thank you. I think ANS/ABS is the most interesting thing I have seen in lossless data compression in years :)

You are of course correct. Skipping renormalizations is already a win for range coding. I'm sure we see similar improvements here. I'm not sure the same is true for output (op) as you are actually executing more instructions to pack the bytes into a dword/qword.

A quick and dirty test seems to indicate that unrolling the code by 2x and doing the ip update once only improves performance to about 263mb/s (~4% faster). It is actually a little disapointing. I'm guessing part of the explanation is that the ip code is not on the most critical path and thus already partially overlapped with the rest of the decode. I'm guessing the improvement would be larger, if we were doing enough independent decodes to saturate the execution units of the cpu.

I might be wrong, but I think Yann's intention with FSE is for it to be an example of ANS and a good starting point for your own implementation. If that is the case then I think it makes sense to keep the implementation as clean and simple as possible, so it is easy to integrate into your own project, where you can make all of the trade-offs with the proper context of the rest of the compressor.

On the other hand, if the intention is to make it as fast as possible, then it makes sense to interleave multiple independent decodes, find a good compromise between maximum symbol length, number of states and renormalization frequency, etc.

13. Thanks rune stubbe. It can be also used in lossy compression - it would perfectly fit into e.g. JPEG and generally in nearly every place where Huffman (,Elias, Golomb, Tunstall, ..., sometimes arithmetic) is currently used.

The 4% improvement is indeed a bit disappointing for 2x unrolling. For 64 bit buffer, we could write succeeding 8 steps in the loop: with single use of *op and 2 uses of *ip - should give another few percents.
Sure, finally it should turn into multiple separate decoders, encoders ... and initializations: symbol spreading procedures (it can be more precise, and can use PRNG initialized with a key to practically for free simultaneously encrypt the message).

I think both intentions are important, but with a good implementation it can be directly (or with small modifications) inserted into specific compressor - I think it's the main goal now, especially that Yann has already put it into Zhuff.

Thanks,
Jarek

1. Memory accesses in read mode are not that costly.
Expect a 3 cycles latency for L1 cache.

It should not be confused with write accesses, which produces some delay due to committing changes to memory, hence it's more difficult to update twice the same memory area.

2. You are right, the compiler should buffer neighboring reads/writes itself - doing this optimization manually probably may lead only to a slight improvement.
About the nbBits==0 case, it is possible only for symbols with probability above 1/2, what seems a rare case for 256 size alphabet - could be dealt with a separate code.

And generally, you are using the oldest bits of bitStream, while using the youngest instead looks simpler and so maybe might be a bit faster:
state = decodeTable[state].newState + (bitStream & mask[nbBits]);
bitStream = bitStream > > nbBits;

14. Very interesting - can't wait to see more exposition on how this works.

How is it different than just doing a state-based arithmetic coder with only a few fractional bits? The decoder pseudo-code above looks very much like a "deferred summation" arithmetic decoder, or a Howard-Vitter state based arithmetic coder.

1. If I do understand it correctly, in the document present at
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.8561
the Howard-Vitter algorithm is only valid for Binary alphabet.

In this case, the comparison should better be done with ABS, from Matt Mahoney (http://mattmahoney.net/dc/dce.html#Section_33).

FSE encodes full symbol values (currently 0-255, range to be extended later on), which is beneficial for its speed, one of the goal of this blog.
I like your definition of "deferred summation", which is very close to FSE decoder behavior.

Nonetheless, there are some close similarities between arithmetic coding and FSE. I will demonstrate in a later post that Arithmetic Coding is in fact just a corner case of this algorithm.

15. Howard-Vitter "quasi arithmetic coding" is only for binary alphabet.
While state of arithmetic coding are two numbers defining the range, the advantage of ANS approach is that the state is only a single natural number (containing log(state) bits of information).
Thanks of that we, ANS allows to construct relatively small automata also for given probability distribution on large alphabet.

You can find the basic ideas and difference from arithmetic in the poster: https://dl.dropboxusercontent.com/u/12405967/poster.pdf