Wednesday, February 5, 2014

FSE : Defining optimal subranges

Note : Charles Bloom started an independent in-depth analysis of ANS on his blog, which is definitely worth the read.


 As stated in earlier FSE comparison with Arithmetic Coding, FSE deals with fractional bits by "decreasing" its state value. Now how does that work precisely ?

Let's use again the example of a symbol which probability of appearance is 5/4096. The Shannon limit for this symbol is 9.68 bits per symbol (if we do better, then some other symbols will suffer, and the overall will compress less). As stated earlier, we deal with the fractional bit issue by outputting sometimes 9 bits, sometimes 10 bits.

The distribution of 9-bits & 10-bits sub-ranges must ensure that all possible values remain accessible after the current symbol. It's a condition for lossless compression. The result is the following layout :


OK, so why are the 9-bits subranges positioned at the beginning of the range ?
Well, in previous article, we said that an additional bit is read when the state is so much decreased that is goes below lower limit. In which case, it just "wraps around", starting again from the top value, at the cost of an additional bit read.



So it's not so much that the 9-bits are at the bottom. Rather, with above explanation, it is more intuitive to explain that the 10-bits are at the top.
Maybe it would be easier to notice with another example, a symbol with 7/4096 probability of appearance.



Note that these subranges are "destination ranges". Each of the 7 symbol_cells are mapped to one and only one of these ranges.

So which symbol is associated with each range ?

As stated earlier, since the additional bit is "paid for" when crossing the low limit, the most likely symbol to trigger the limit is lowest one.
So we give an index to each of the 7 symbol_cells, in their order of appearance, from the smaller to the larger one.
The largest 10-bits range is attributed to symbol_cells n°1.
The other ranges simply follow, in their natural order :


This is coherent with an optimal cost associated with a 7/4096 distribution : 9.19 bits.
It means that we are only consuming 0.19 fractional bits each time the symbol is decoded. So we can decode it about 5 times at 9-bits before requiring 10-bits.

This is in contrast with a probability of 5/4096 :



Here, the optimal cost is 9.68 bits, so fractional bits is much higher (0.68 bits). We are likely to cross the low limit much more often. Hence the first 3 symbol_cells trigger the additional bit read, and require 10 bits.

This rule is valid for any probability; so now, whether it is 5/4096, 134/4096, or even 3812/4096, you know how to define destination sub-ranges, and can build the associated decoding table.
(The latest example is interesting, since some sub-ranges will be 0-bits large, which means they map directly to another state value : 3812/4096 => 284 1-bit sub-ranges + 3528 0-bit sub-ranges).

What remains to be understood is how to distribute the symbol states. A naive approach would be to simply cluster them together. For example, a symbol with probability of 7/4096 would occupy state values from 0 to 6. It would work, but compression effectiveness would be very poor.
In another blog post, we'll see why, and how to correct that.



Monday, February 3, 2014

A comparison of Arithmetic Encoding with FSE

 Arithmetic encoding was invented in the 70's by Jorma Rissanen. After a few decades during which the technique almost stagnated, partly as a side-effect of abusive patent strategy, it started to take off in the early 2000's, and is now clearly established as the "current generation" entropy coding stage, thanks to its much better compression effectiveness than Huffman.

Since FSE claims to achieve equivalent compression performance but without the speed penalty, it's interesting to compare them, and see what makes them similar, or different.

The basic concept behind Arithmetic Encoding is the simple fact that shifting a bit is equivalent to a division by 2.

So now, it becomes clear that Huffman is just a special case of Arithmetic Encoding : since Huffman only shift bits, it's equivalent to divisions by power of 2.
By fully acknowledging this equivalence, it becomes obvious that it's also possible to divide by any other number, unlimited by the power of 2 requirement. This is how fractional bit is achieved. You can divide by 1.11, and you achieved the equivalent of a 0.1 bit cost.

The illustration above, taken from Wikipedia, is relatively clear, although it may be even simpler to look at an Huffman one. Let's use again this simple distribution :




Each symbol has an offset (or start position) and a range.
A : start 0, range 4 => factor 2
B : start 4, range 2 => factor 4
C : start 6, range 1 => factor 8
D : start 7, range 1 => factor 8

The "factor" is simply the total divided by the range.

In order to encode a symbol, arithmetic encoding always consider that it's selecting a subrange between 0 and 1, eventually rescaled to match a distribution table. In this case, the table has a size of 8, so we have to output a value between 0 and 7.
To encode symbol B, we restrict our new range between 4 and 6 (excluded). So we first settle with 4, because it is the start value of B.
Then we look at the divisor : we only have 2 positions remaining, so in order to find again the full range of possibility, we have to "upscale" this range by the factor. So we multiply by 4 (e.g. shift left by 2 bits), and find again a range of 8 values to encoded.
We can start again, but now, we have 2 bits which are already determined, occupying higher bit position ('10').

This example is very easy because we have only power of 2. But the exact same mechanism works with any distribution. You could even have a total which is a prime number, and any kind of distribution into it (such as 9,7,1,2 for a total of 19). It would work same : scale the value to 19 slots, add the start value, rescale by the factor, and so on.
Except that, you will have to face rounding errors. In a world of unlimited precision, this would not matter, but alas, that's not how computer works. You have to scale arithmetic precision to a bounded accuracy. It's not necessarily a big issue, since what truly matters is to do the same rounding error on both encoding and decoding (which is a significant problem in itself, especially in a cross-platform environment). Beside, it's equivalent to a small compression loss, and therefore it's important to limit that loss as much as possible.
A common way to do that is to "spread" the segment to a virtual size as large as possible. So, in previous example, we will not have a value between 0 and 18, but rather between 0 and 2^31-1 for example. We accept a small rounding error in doing it, but it's very acceptable.

Now that we have described encoding, let's look at how decoding works.

It's almost the same operation. To simplify decoding, we have a precomputed decoding table, of known size (8 in our first example). We read our first number, using virtual scale, and then scale it back to distribution table size, decode the symbol (B if it's a 4 or 5), get the new range, rescale, read more bits, and so on.

All this works well, but as can be guessed, cost quite a bit of multiplications and divisions in the process. With several clever tricks (typically ensuring that distribution total is a power of 2), it's possible to decrease these operations to a minimum, but alas, one division remains in the decoder, and several multiplications are also there.

Now let's have a look at ANS/FSE decoder.

We start with basically one identical concept : a bit read is equivalent to a division by 2.

A first difference though, is that a bit read is not bit shift.
In contrast with Huffman or Arithmetic, the "state" value is not read directly from the bitstream, it is a virtual value which is maintained alongside the bitstream.
That being said, it can be almost the same : a symbol with a 50% probability will basically divide state by 2, then discover than one bit must be added to keep state within range, and then shift it (equivalent to arithmetic's rescale), and read one bit from the bitStream.

Another difference is that a fractional bit is obtained by reading a variable number of bits. For example, if a symbol should be written using 3.3 bits (equivalent to 10%), the decoder will sometimes read 3 bits, and sometimes read 4 bits.
So we never "rescale by 10" as would do an arithmetic decoder, but rather sometimes rescale by 8, and sometimes by 16 (which, incidentally, is a better fit for binary representation, reducing rounding errors).

Okay, but how is this decided ?

Well, this is the magic of the decoding table, that will be described into another post, but to give you an idea of the bigger principle, here is how I would describe it in simple terms :
the fractional bit is translated by a "decrease" in state value.

When state value is high, state is just decreased, but it doesn't trigger the "additional bit". So, following our 10% example, it means we just read 3 more bits.



When state value is low, it tends to "cross" the lower limit. As a consequence, one more bit is read, and state is regenerated at its full value.



OK, that's roughly how the decoder behaves, from an external perspective.
But it's not enough to fully describe the exact behavior. You cannot expect to subtract a fixed amount from state in order to get your new state value. Unfortunately, it matters if the next state value is exactly 4 or 5, because it will influence some future symbol. So accuracy is more complex to describe. And that will be explained in a later post.

Tuesday, January 14, 2014

Huffman, a comparison with FSE


 In this serie explaining the behavior of Finite State Entropy, I find it interesting to make a quick comparison with Huffman entropy. It's a direct result from the description of its decoding principles, nonetheless I believe it's useful to clearly establish the equivalence since it will be illustrative for future articles.

To better grasp the similarities, let's have a simple example. We'll use the following probabilities, suitable for an Huffman tree :

A : 50% ; B : 25% ; C : 12.5% ; D : 12.5%

This tree only needs 8 values to be properly represented, and gives the following linear representation :


This representation uses the "high bit first" methodology, which means we are starting our node travel from the highest bit.
If it's 0, then we have an A. We stop there, shift left the register by one bit and load a new one at the bottom.
If the highest bit is 1, we now we are not done, and need to read the second bit. If this second bit is 0, we know we have a B. If not we continue reading. And so on.
The decision tree looks as follows :



But we could have done differently : we could have used "low bit first". In this case, the linear representation of the tree looks like this :


Which seems a lot messier, but it's not : we simply reversed the bit order of the decision tree :



The 2 decision trees are therefore completely equivalent.

Now, look again at the "low bit first" linear representation. It's important to state that this is what FSE would produce, but it also embeds a few additional properties :

  • The "newStateBaseline" does not need to be calculated and stored into the decoding table : it can be automatically determined from the state itself : newStateBaseline = currentState & ~mask(nbBits);
  • nbBits is stable for a given character : for A, it's always 1, for B, it's always 2, etc. It simplifies the decoding table, which does not need to store this information.


So basically, an Huffman-low_bit_first table looks the same as an FSE table, and it guarantees a stable number of bits per character, and newStateBaseline is automatically deduced from remaining bits.

There is, however, a small but important difference.
An Huffman-low_bit_first would shift right the current state, and load the next bit(s) high. In contrast, the FSE decoder keeps them, and only reload the low bits. This creates a dependency on future state : basically, once you've reached state 2 with FSE for example, you can already guarantee that you won't see a D before you see a C. Which brings the question : how would the encoder select the right state, since it depends on future characters yet to be decoded ?

The answer : the encoder will work in backward direction, encoding characters from last to first, solving dependencies along the way.

To be continued...

Thursday, January 9, 2014

FSE decoding : how it works

 After announcing the release of FSE and its results, it's time to have a look at its internals.

We'll spend some time first on the decoding loop, since it's both the easiest one to understand and the most important one. Encoding is much more convoluted, and that's where ANS theory comes into play (For some more detailed explanation on how ANS works and its related properties, I invite you to read Jarek's latest paper). But for the time being, we don't need that yet. The decoding loop is generic enough to handle diverse kind of entropy algorithms, including Huffman and Arithmetic, as we'll see later on.

The decoding code can be freely consulted on GitHub. The "hot loop" is pretty short, and can be pseudo-coded like this :
            // One persistent variable across all loops : int state outputSymbol (decodeTable[state].symbol); int nbBits = decodeTable[state].nbBits; int rest = getBits (bitStream, nbBits); update (bitStream, nbBits); state = decodeTable[state].newState + rest;
There is a single value to track, called state.
state value can range from 0 to a maximum total_number_of_states-1, which is conveniently a power of 2. For example, in FSE default implementation, total_number_of_states is 4096, representing 12 bits.

state initial value is read directly from the bitStream. Then its evolution is a bit more complex.
In contrast with Huffman, which merely shifts current bits and replace them with new ones from the bitStream, FSE completely rewrite state value after each step.

The first thing to keep in mind is that a state value is enough to immediately decode a symbol, and provide the next state value. Both information are directly stored into the decoding table.
If we had the chance to handle a large enough state number, with its associated decoding table, the decoding loop would be trivial :

outputSymbol (decodeTable[state].symbol); state = decodeTable[state].newState;
Obviously, we are not that lucky : a large number representing a complete file or segment would result in a way too large decoding table to be practical. So we have to keep this size reasonable.
In practice, we keep as many bits as necessary to represent probabilities with enough accuracy, for example 12 bits proposed by default in FSE for a 256 alphabet.

Now we have to modify the decoding loop, by introducing a few new bits between each step, keeping the value of state within accessible range. The number of bits to load is also embedded directly into the decoding table.
outputSymbol (decodeTable[state].symbol); int nbBits = decodeTable[state].nbBits; state = decodeTable[state].newStateBaseline + getSomeBits(bitStream, nbBits);
A legitimate question at this stage is : why on earth this construction does compress anything ?
Well, we are getting close to it. Consider that the number of bits to read after each symbol is decoded is the cost of compression.

This number of bits is in no way random : it entirely depends on the (normalized) frequency of the decoded character. And here is why.

Let's suppose our range of possible values is 4096 (12 bits). And let's supposed that the normalized frequency of our decoded symbol is 4. That means it should be present about 4 times every 4096 symbols, which is equivalent to 1 times every 1024 symbols, which is equivalent to 10 bits.
That conclusion would have been the same for Huffman : after reading the character, we need 10 new bits to read the next one.
This example is simple because the symbol frequency is a power of 2.

So now, let's imagine that its frequency is 5. That means it's present 5 times in the table of 4096 cells.
According to Shannon theory, we are supposed to use -log2(5/4096) = 9.68 bits to represent this symbol. How can we achieve this ?

Well, we are going to divide the range of accessible values into 5 sub-ranges. Since we are loading full bits, we need each sub-range to be a power of 2. And since the total remain 4096, we need multiple different sub-range sizes.

A simple way to achieve this is to use a combination of 9 bits and 10 bits ranges (sometimes called phase-in codes). For example :


Now, we have 5 sub-ranges, and collectively they cover the full spectrum of possible values.
Each of the 5 symbol occurrence will get assigned one of these ranges, which translates into an offset (0, 512, 1024, etc.), and a number of bits (9 or 10). Hence :
state = decodeTable[state].newStateBaseline + getSomeBits(bitStream, nbBits);
At this stage, we have not yet settled any specific rule regarding these ranges nor their assignment. We could have the 9-bits range at the end, or mix them (for ex. 10-9-10-9-10). Finally the 5 occurrences of symbol could be anywhere between 0 and 4095, and they could be assigned these ranges in no particular order.

The suite of states is always "going down", generating a bit read each time it crosses the lower limit.
Fractional bits from a "high state" are consumed by going "low state".
Fractional bits from a "low state" are consumed by reading a full bit, and then reach a "high state".

What could be the efficiency of such a representation method ?
We are targeting 9.68 bits. Do we get close to it ?

Well, if we take the naive assumption that all values in the state range are equi-probable, then we can calculate the likelihood of getting 9 or 10 bits to read.

Average_bitCost = (2 x 9 x 512 + 3 x 10 x 1024) / 4096 = 9.75 bits

That's not bad, we are getting fairly close to 9.68 bits, but not there yet.

The trick is, these values are not equi-probable. In ANS construction, states at the beginning of the range are slightly more likely to be generated than states at the end of the range, due to the way 'newStateBaseline' are generated. So the lower range [0-511] is more likely to be generated than same size higher range [3584-4095], so it weights more.
Calculating the average is therefore not straighforward. But bear with me, if you don't have the patience to read the mathematical proof from Jarek Duda, the ANS construction converge towards the optimal cost of the Shannon limit (we were already pretty close to it to begin with).

This result, however, depends on the way Symbols are spread into the table.
Which is more complex, and will be detailed in another blog post.

Monday, December 23, 2013

Zhuff v0.9, or a first FSE application

 As a quick follow up to last week release of FSE, I wanted to produce a quick demo to show the kind of gain to expect with this new entropy algorithm.

And a simple example shall be Zhuff.

Zhuff, is basically, LZ4 and Huff0 stuffed together. It's very fast in spite of this 2-pass process. Its major limitation so far is a 64KB window size, as an inheritance from LZ4, but that's something we'll talk again in a later post.

For the time being, my only intention is to swap huff0 with FSE, and look at the results.

And the results are positive.
Here are some benchmark numbers, on a Core i5-3340M @2.7GHz, Windows  Seven 64 bits, using single-thread, fast algorithm

Filename Compressor   Ratio Decoding speed
silesia  zhuff v0.8   2.783    275 MB/s
silesia  zhuff v0.9   2.805    311 MB/s

enwik8   zhuff v0.8   2.440    200 MB/s
enwik8   zhuff v0.9   2.460    230 MB/s

calgary  zhuff v0.8   2.672    220 MB/s
calgary  zhuff v0.9   2.693    250 MB/s

(Note : I've not mentioned improvements in compression speed, they are measurable but unrelated to FSE, so out of this report)

So, in a nutshell, we have a moderate (>10%) increase of decoding speed, and a small improvement to compression ratio. Nothing earth-shattering, but still worthy to grab.
The situation would have been even better for FSE if higher probabilities had to be compressed, but Zhuff is a simple fast LZ algorithm, so it doesn't exactly qualify as generating "high probabilities".

The purpose of this demo was just to show that FSE can easily replace and improve results compared to Huffman. As a side-effect, it makes Zhuff an interesting playground to test future FSE improvements.

With this part now settled, the next articles shall describe FSE principles, especially how and why it works.

Monday, December 16, 2013

Finite State Entropy - A new breed of entropy coder

 In compression theory, the entropy encoding stage is typically the last stage of a compression algorithm, the one where the gains from the model are realized.

The purpose of the entropy stage is to reduce a set of flags/symbol to their optimal space given their probability. As a simple example, if a flag has 50% to be set, you want to encode it using 1 bit. If a symbol has 25% probability to have value X, you want to encode it using 2 bits, and so on.
The optimal size to encode a probability is proven, and known as the Shannon limit. You can't beat that limit, you can only get close to it.

A solution to this problem has been worked for decades, starting with Claude Shannon own work, which were efficient but not optimal. The optimal solution was ultimately found by one of Shannon's own pupils, David A. Huffman, almost by chance. His version became immensely popular, not least because he could prove, a few years later, that his construction method was optimal : there was no way to build a better distribution.

Or so it was thought.
There was still a problem with Huffman encoding (and all previous ones) : an hidden assumption is that a symbol must be encoded using an integer number of bits. To say it simply, you can't go lower than 1 bit.
It seems reasonable, but that's not even close to Shannon's limit. An event which has 90% probability to happen for example should be encoded using 0.15 bits. You can't do that using Huffman trees.

A solution to this problem was found almost 30 years later, by Jorma Rissanen, under the name of Arithmetic coder. Explaining how it works is outside of the scope of this blog post, since it's complex and would require a few chapters; I invite you to read the Wikipedia page if you want to learn more about it. For the purpose of this presentation, it's enough to say that Arithmetic encoding, and its little brother Range encoding,  solved the fractional bit issue of Huffman, and with only some minimal losses to complain about due to rounding, get closer to Shannon limit. So close in fact that entropy encoding is, since then, considered a "solved problem".

Which is terrible because it gives the feeling that nothing better can be invented.

Well, there is more to this story. Of course, there is still a little problem with arithmetic encoders : they require arithmetic operations, such as multiplications, and divisions, and strictly defined rounding errors.
This is serious requirement for CPU, especially in the 80's. Moreover, some lawyer-happy companies such as IBM grabbed this opportunity to flood the field with multiple dubious patents on minor implementation details, making clear that anyone trying to use the method would face expensive litigation. Considering this environment, the method was barely used for the next few decades, Huffman remaining the algorithm of choice for the entropy stage.

Even today, with most of the patent issues cleared, modern CPU will still take a hit due to the divisions. Optimized versions can sometimes get away with the division during the encoding stage, but not the decoding stage (with the exception of the Binary arithmetic coding, which is however limited to 0/1 symbols).
As a consequence, arithmetic encoders are quite slower than Huffman ones. For low-end or even "retro" CPU, it's simply out of range.

It's been a long time objective of mine to bring arithmetic-level compression performance to vintage (or retro) CPU. Consider it a challenge. I've tested several variants, for example a mix of Huffman and Binary Arithmetic, which was free of divisions, but alas still needed multiplications, and required more registers to operate, which was overkill for weak CPU.

So I've been reading with a keen eye the ANS theory, from Jarek Duda, which I felt was heading into the same direction. If you are able to fully understand his paper, you are better than me, because quite frankly, most of the wording used in his document is way out of my reach. (Note : Jarek pointed to an update version of his paper, which should be easier to understand). Fortunately, it nonetheless resonated, because I was working on something very similar, and Jarek's document provided the last elements required to make it work.

And here is the result today, the Finite State Entropy coder, which is proposed in a BSD-license package at Github.

In a nutshell, this coder provides the same level of performance as Arithmetic coder, but only requires additions, masks, and shifts.
The speed of this implementation is fairly good, and even on modern high-end CPU, it can prove a valuable replacement to standard Huffman implementations.
Compared to zlib's Huffman entropy coder, it manages to outperform its compression ratio while besting it on speed, especially decoding speed.

Benchmark platform : Core i5-3340M (2.7GHz), Window Seven 64-bits
Benchmarked file : win98-lz4-run
Algorithm Ratio Compression Decompression
FSE       2.688   290 MS/s     415 MS/s
zlib      2.660   200 MS/s     220 MS/s

Benchmarked file : proba70.bin
Algorithm Ratio Compression Decompression
FSE       6.316   300 MS/s     420 MS/s
zlib      5.575   250 MS/s     270 MS/s

Benchmarked file : proba90.bin
Algorithm Ratio Compression Decompression
FSE       15.21   300 MS/s     420 MS/s
zlib      7.175   250 MS/s     285 MS/s

As could be guessed, the higher the compression ratio, the more efficient FSE becomes compared to Huffman, since Huffman can't break the "1 bit per symbol" limit.
FSE speed is also very stable, under all probabilities.

I'm quite please with the result, especially considering that, since the invention of arithmetic coding in the 70's, little new has been brought to this field.

-----------------------------------------------------

A little bit of History :

Jarek Duda's ANS theory was first published in 2007, and the paper received many revisions since then. Back in 2007, only Matt Mahoney had enough skill and willpower to sift through the complex theory, and morph it into a working code. However, Matt concentrated on the only use case of interest to him, the Binary version, called ABS, limited to 0/1 alphabet. This decision put his implementation in direct competition with the Binary Arithmetic Coder, which is very fast, efficient, and flexible. Basically, a losing ground for ANS. As a consequence, ANS theory looked uncompetitive, and slumbered during the next few years.

FSE work re-instates ANS as a competitive algorithm for multi-symbol alphabet (>2), concentrating its perspective as a viable alternative to block-based Huffman.

Thanks to promising early results from FSE, Jarek concentrated back its attention on multi-symbol alphabet. As we were chatting about perspectives and limitations of ANS, I underlined the requirement of a decoding table as a memory cost, and suggested a solution in the making to limit that issue (which ultimately failed). But Jarek took the challenge, and successfully created a new variant. He then published an updated version of his paper. The new method would be called rANS. He would later invent the terms tANS and rANS to distinguish the different methods.

rANS was later adapted by Fabian Giesen and Charles Bloom, producing some very promising implementations, notably vector-oriented code by Fabian.

But as said, this is not the direction selected for FSE, created before Jarek's paper revision. FSE is a finite state machine, created precisely to avoid any kind of multiplication, with an eye on low-power CPU requirements. It's interesting to note such radically different implementations can emerge from a common starting theory.


For the time being, FSE is still considered beta stuff, so please consider this release for testing purposes or private development environments.

Explaining how and why it works is pretty complex, and will require a few more posts, but bear with me, they will come in this blog.

Hopefully, with Jarek's document and these implementations now published, it will be harder this time for big corporations to confiscate an innovation from the public domain.

-----------------------------------------------------

List of Blog posts explaining FSE algorithm :

http://fastcompression.blogspot.com/2014/01/fse-decoding-how-it-works.html
http://fastcompression.blogspot.com/2014/01/huffman-comparison-with-fse.html
http://fastcompression.blogspot.com/2014/02/a-comparison-of-arithmetic-encoding.html
http://fastcompression.blogspot.com/2014/02/fse-defining-optimal-subranges.html
http://fastcompression.blogspot.com/2014/02/fse-distributing-symbol-values.html
http://fastcompression.blogspot.com/2014/02/fse-decoding-wrap-up.html
http://fastcompression.blogspot.com/2014/02/fse-encoding-how-it-works.html
http://fastcompression.blogspot.com/2014/02/fse-encoding-part-2.html
http://fastcompression.blogspot.com/2014/02/fse-tricks-memory-efficient-subrange.html
http://fastcompression.blogspot.com/2014/04/taking-advantage-of-unequalities-to.html

[Edit] : replaced huff0 by zlib on the comparison table
[Edit] : added entry on rANS variants by Fabian & Charles
[Edit] : added list of blog entries

Saturday, October 19, 2013

Compression stream (part 3)

 After taking care of the decompression stream, it's now time to have a look at the compression side.

As stated earlier, a compression stream delivering independent blocks is fairly easy to setup. So the objective of today is to define a proper behavior using chained block methodology, which uses data from previous blocks to better compress the next one. By extension, this strategy also works with static dictionaries.

An ideal compression stream should be able to :
- accept input from any source and any size
- deliver compressed output into destination buffer of any size.
- use as little working memory as possible

The last point is the most interesting one.
A (relatively) easy way to create a streaming interface is to use intermediate buffers, for both input and output. These buffers would easily absorb side effects, such as : providing a ridiculously small output buffer, channeling more input than output, etc.
Obviously, this methodology has downsides too, such as performance slowdown, due to the need to copy data from and into intermediate buffers. But more importantly, it increases memory footprint.

Let's focus on the output buffer for now.

It's relatively easy to skip the intermediate output buffer in the simple case when the remaining space into the destination buffer is "large enough". Then, it's possible to compress directly into the destination buffer, because the compressed block can fit in. It saves a useless memory copy operation, and potentially a memory allocation, if the intermediate output buffer is not yet allocated.

"Large enough" condition is met when one of these conditions is reached :
- Remaining space is larger than a full block size.
- Remaining space is larger than input.

However, if none of these conditions is fulfilled, it becomes harder to decide.

If remaining space into output buffer is smaller than input size (or block size), we could eventually try compression, just in case it would work. After all, if compression ratio is, for example 2:1, an output buffer with barely more than half the size of input could be enough.
The problem is, if it's not enough, then the compression operation will fail, and compression tables would become unusable. This is a major issue. In contrast with decoding, we can't "try and retry" compression, due to the modifications of compression tables.

A potential idea to solve this issue could be to duplicate compression tables before attempting the compression. However, let's remind that the objective is to perform better than "compressing into an intermediate buffer", to avoid some memory copy and allocation. Duplicating the compression tables would create the very same issues, it's doubtful if this technique achieve any saving.

Another idea is to dynamically adapt the amount of input to compress. Basically, since the worst case is that input may be not compressible, compress as much input data as there is remaining space into output buffer. This is possible, because blocks don't need to be full, they just have a "maximum size" (currently, selectable from 64KB to 4MB).

OK, this method works, up to a point.

As stated previously, LZ4 is currently able to produce compressed blocks, which need to be "properly terminated". Small blocks compress worse than larger ones. Fortunately, chained block is an excellent methodology to mitigate this effect.

However, when block sizes start to be become very small (for example, < 100 bytes), some inefficiencies become sensible. For example, the last 5 bytes of any block must be uncompressed. A match cannot start at a distance < 12 bytes from the end of the block. Each block starts with a 4 bytes header, and potentially ends with a 4 bytes checksum. These costs are small, but repeated small blocks will create them more often.

Consequently, there is a "trade-off" to find.

From a compression ratio perspective, an intermediate buffer is more optimal. For example, if the user application provides a destination buffer of 150 bytes, it's likely preferable to compress first into a (larger) temporary buffer, and then deliver the result 150 bytes at a time.

Now, it's possible to imagine some embedded systems with very limited available memory, where allocating another temporary buffer is not possible. In this case, these systems might prefer the inefficiency of the "small blocks" method, rather than no compression at all.

So, maybe, the better way is to provide both methods, letting the programmer select which one works best for his use case.

In the end, trying to manage the complexity of all border cases might be a bit too large for a first delivery. After all, the first delivery is not the last, it's still possible to provide improvements in future releases.

So, a probably faster way to deliver a working streaming interface is to work with intermediate buffers methodology by default, avoiding them in the most obvious use cases. These use cases, by the way, should be the most common ones. By properly documenting them, it increases the odd that the Streaming interface is used in a more optimal way.

Friday, September 27, 2013

Stream Decoding (part 2)


 I've been considering a complementary idea since writing the first part of LZ4 streaming interface.

A major point which bothered me was the need to shuffle data around, and the impossibility to decode data directly at final destination. Let's go into detail.

Currently, the LZ4 decoding functions either accept of refuse references outside of current block. Refusing is the default behavior, compatible with independent blocks.

For chained block, however, there is a major restriction. The decoding function will expect to find data from previous block *just before* current block. The good news is that it makes the decoding function fairly simple. The downside is that it pushes some requirements into user territory.

Addressing this requirement is now a major concern. It can be accomplished by decoding into an intermediate buffer, where the last decoded 64KB can be saved, and decompression operation can be performed in a contiguous space after it.

Decoding into an intermediate buffer makes the work done, but also results in the following issues :
1) It's necessary to allocate more memory for stream-decoding. Memory-constrained systems may not afford it. Worse, this cost is multiplied by the number of streams decoded in parallel.
2) It's necessary to copy decoded data, from the intermediate buffer to final destination. This is obviously bad for performance.
3) Keeping the last decoded 64KB just before the new block to decode require either to repeatedly copy 64KB at this place, which can become a serious waste for small blocks (think about small network packets), or require an even bigger intermediate buffer to limit the number of times data is copied.

Trying to fix these issues, I came up with a different design, with much more desirable properties.

The main idea is to work with a rotating memory buffer. It reduces the memory requirement for stream-decoding to just and only 64KB.

Even more importantly, it authorizes to decode directly at final destination, with no requirement regarding size or continuity. This is way more flexible for a user program.

To reach these properties, it's necessary to create a new family of decoding functions, able to use an external dictionary. The main idea is : when a reference is generated "outside of current block", look into another memory space, provided as an input argument.

As a positive side effect, this capability saves a lot of memory for server systems, handling many clients' connections. When using full adaptive streams, the memory requirement is only 64KB per decoding stream.
But even better, when using a static dictionary (which means, each block is compressed with the same prefix data), memory requirement per stream is reduced to zero. All decoding streams will directly read into the same memory block. For systems handling huge number of streams in parallel, it is a major memory allocation saver.

With the main properties of the decoding stream settled, now is the time to talk about the compressing stream.

(To be continued...)