RealTime Data Compression

Tuesday, January 14, 2014

Huffman, a comparison with FSE

In this serie explaining the behavior of Finite State Entropy, I find it interesting to make a quick comparison with Huffman entropy. It's a direct result from the description of its decoding principles, nonetheless I believe it's useful to clearly establish the equivalence since it will be illustrative for future articles.

To better grasp the similarities, let's have a simple example. We'll use the following probabilities, suitable for an Huffman tree :

A : 50% ; B : 25% ; C : 12.5% ; D : 12.5%

This tree only needs 8 values to be properly represented, and gives the following linear representation :

This representation uses the "high bit first" methodology, which means we are starting our node travel from the highest bit.
If it's 0, then we have an A. We stop there, shift left the register by one bit and load a new one at the bottom.
If the highest bit is 1, we now we are not done, and need to read the second bit. If this second bit is 0, we know we have a B. If not we continue reading. And so on.
The decision tree looks as follows :

But we could have done differently : we could have used "low bit first". In this case, the linear representation of the tree looks like this :

Which seems a lot messier, but it's not : we simply reversed the bit order of the decision tree :

The 2 decision trees are therefore completely equivalent.

Now, look again at the "low bit first" linear representation. It's important to state that this is what FSE would produce, but it also embeds a few additional properties :

The "newStateBaseline" does not need to be calculated and stored into the decoding table : it can be automatically determined from the state itself : newStateBaseline = currentState & ~mask(nbBits);
nbBits is stable for a given character : for A, it's always 1, for B, it's always 2, etc. It simplifies the decoding table, which does not need to store this information.

So basically, an Huffman-low_bit_first table looks the same as an FSE table, and it guarantees a stable number of bits per character, and newStateBaseline is automatically deduced from remaining bits.

There is, however, a small but important difference.
An Huffman-low_bit_first would shift right the current state, and load the next bit(s) high. In contrast, the FSE decoder keeps them, and only reload the low bits. This creates a dependency on future state : basically, once you've reached state 2 with FSE for example, you can already guarantee that you won't see a D before you see a C. Which brings the question : how would the encoder select the right state, since it depends on future characters yet to be decoded ?

The answer : the encoder will work in backward direction, encoding characters from last to first, solving dependencies along the way.

To be continued...

Thursday, January 9, 2014

FSE decoding : how it works

After announcing the release of FSE and its results, it's time to have a look at its internals.

We'll spend some time first on the decoding loop, since it's both the easiest one to understand and the most important one. Encoding is much more convoluted, and that's where ANS theory comes into play (For some more detailed explanation on how ANS works and its related properties, I invite you to read Jarek's latest paper). But for the time being, we don't need that yet. The decoding loop is generic enough to handle diverse kind of entropy algorithms, including Huffman and Arithmetic, as we'll see later on.

The decoding code can be freely consulted on GitHub. The "hot loop" is pretty short, and can be pseudo-coded like this :


            // One persistent variable across all loops : int state
            outputSymbol (decodeTable[state].symbol);
            int nbBits = decodeTable[state].nbBits;
            int rest = getBits (bitStream, nbBits);
            update (bitStream, nbBits);
            state = decodeTable[state].newState + rest;

There is a single value to track, called state.
state value can range from 0 to a maximum total_number_of_states-1, which is conveniently a power of 2. For example, in FSE default implementation, total_number_of_states is 4096, representing 12 bits.

state initial value is read directly from the bitStream. Then its evolution is a bit more complex.
In contrast with Huffman, which merely shifts current bits and replace them with new ones from the bitStream, FSE completely rewrite state value after each step.

The first thing to keep in mind is that a state value is enough to immediately decode a symbol, and provide the next state value. Both information are directly stored into the decoding table.
If we had the chance to handle a large enough state number, with its associated decoding table, the decoding loop would be trivial :


            outputSymbol (decodeTable[state].symbol);
            state = decodeTable[state].newState;

Obviously, we are not that lucky : a large number representing a complete file or segment would result in a way too large decoding table to be practical. So we have to keep this size reasonable.
In practice, we keep as many bits as necessary to represent probabilities with enough accuracy, for example 12 bits proposed by default in FSE for a 256 alphabet.

Now we have to modify the decoding loop, by introducing a few new bits between each step, keeping the value of state within accessible range. The number of bits to load is also embedded directly into the decoding table.


            outputSymbol (decodeTable[state].symbol);
            int nbBits = decodeTable[state].nbBits;
            state = decodeTable[state].newStateBaseline + getSomeBits(bitStream, nbBits);

A legitimate question at this stage is : why on earth this construction does compress anything ?
Well, we are getting close to it. Consider that the number of bits to read after each symbol is decoded is the cost of compression.

This number of bits is in no way random : it entirely depends on the (normalized) frequency of the decoded character. And here is why.

Let's suppose our range of possible values is 4096 (12 bits). And let's supposed that the normalized frequency of our decoded symbol is 4. That means it should be present about 4 times every 4096 symbols, which is equivalent to 1 times every 1024 symbols, which is equivalent to 10 bits.
That conclusion would have been the same for Huffman : after reading the character, we need 10 new bits to read the next one.
This example is simple because the symbol frequency is a power of 2.

So now, let's imagine that its frequency is 5. That means it's present 5 times in the table of 4096 cells.
According to Shannon theory, we are supposed to use -log2(5/4096) = 9.68 bits to represent this symbol. How can we achieve this ?

Well, we are going to divide the range of accessible values into 5 sub-ranges. Since we are loading full bits, we need each sub-range to be a power of 2. And since the total remain 4096, we need multiple different sub-range sizes.

A simple way to achieve this is to use a combination of 9 bits and 10 bits ranges (sometimes called phase-in codes). For example :

Now, we have 5 sub-ranges, and collectively they cover the full spectrum of possible values.
Each of the 5 symbol occurrence will get assigned one of these ranges, which translates into an offset (0, 512, 1024, etc.), and a number of bits (9 or 10). Hence :
state = decodeTable[state].newStateBaseline + getSomeBits(bitStream, nbBits);
At this stage, we have not yet settled any specific rule regarding these ranges nor their assignment. We could have the 9-bits range at the end, or mix them (for ex. 10-9-10-9-10). Finally the 5 occurrences of symbol could be anywhere between 0 and 4095, and they could be assigned these ranges in no particular order.

The suite of states is always "going down", generating a bit read each time it crosses the lower limit.
Fractional bits from a "high state" are consumed by going "low state".
Fractional bits from a "low state" are consumed by reading a full bit, and then reach a "high state".

What could be the efficiency of such a representation method ?
We are targeting 9.68 bits. Do we get close to it ?

Well, if we take the naive assumption that all values in the state range are equi-probable, then we can calculate the likelihood of getting 9 or 10 bits to read.

Average_bitCost = (2 x 9 x 512 + 3 x 10 x 1024) / 4096 = 9.75 bits

That's not bad, we are getting fairly close to 9.68 bits, but not there yet.

The trick is, these values are not equi-probable. In ANS construction, states at the beginning of the range are slightly more likely to be generated than states at the end of the range, due to the way 'newStateBaseline' are generated. So the lower range [0-511] is more likely to be generated than same size higher range [3584-4095], so it weights more.
Calculating the average is therefore not straighforward. But bear with me, if you don't have the patience to read the mathematical proof from Jarek Duda, the ANS construction converge towards the optimal cost of the Shannon limit (we were already pretty close to it to begin with).

This result, however, depends on the way Symbols are spread into the table.
Which is more complex, and will be detailed in another blog post.

Monday, December 23, 2013

Zhuff v0.9, or a first FSE application

As a quick follow up to last week release of FSE, I wanted to produce a quick demo to show the kind of gain to expect with this new entropy algorithm.

And a simple example shall be Zhuff.

Zhuff, is basically, LZ4 and Huff0 stuffed together. It's very fast in spite of this 2-pass process. Its major limitation so far is a 64KB window size, as an inheritance from LZ4, but that's something we'll talk again in a later post.

For the time being, my only intention is to swap huff0 with FSE, and look at the results.

And the results are positive.
Here are some benchmark numbers, on a Core i5-3340M @2.7GHz, Windows Seven 64 bits, using single-thread, fast algorithm

Filename Compressor Ratio Decoding speed
silesia zhuff v0.8 2.783 275 MB/s
silesia zhuff v0.9 2.805 311 MB/s

enwik8 zhuff v0.8 2.440 200 MB/s
enwik8 zhuff v0.9 2.460 230 MB/s

calgary zhuff v0.8 2.672 220 MB/s
calgary zhuff v0.9 2.693 250 MB/s

(Note : I've not mentioned improvements in compression speed, they are measurable but unrelated to FSE, so out of this report)

So, in a nutshell, we have a moderate (>10%) increase of decoding speed, and a small improvement to compression ratio. Nothing earth-shattering, but still worthy to grab.
The situation would have been even better for FSE if higher probabilities had to be compressed, but Zhuff is a simple fast LZ algorithm, so it doesn't exactly qualify as generating "high probabilities".

The purpose of this demo was just to show that FSE can easily replace and improve results compared to Huffman. As a side-effect, it makes Zhuff an interesting playground to test future FSE improvements.

With this part now settled, the next articles shall describe FSE principles, especially how and why it works.

Monday, December 16, 2013

Finite State Entropy - A new breed of entropy coder

In compression theory, the entropy encoding stage is typically the last stage of a compression algorithm, the one where the gains from the model are realized.

The purpose of the entropy stage is to reduce a set of flags/symbol to their optimal space given their probability. As a simple example, if a flag has 50% to be set, you want to encode it using 1 bit. If a symbol has 25% probability to have value X, you want to encode it using 2 bits, and so on.
The optimal size to encode a probability is proven, and known as the Shannon limit. You can't beat that limit, you can only get close to it.

A solution to this problem has been worked for decades, starting with Claude Shannon own work, which were efficient but not optimal. The optimal solution was ultimately found by one of Shannon's own pupils, David A. Huffman, almost by chance. His version became immensely popular, not least because he could prove, a few years later, that his construction method was optimal : there was no way to build a better distribution.

Or so it was thought.
There was still a problem with Huffman encoding (and all previous ones) : an hidden assumption is that a symbol must be encoded using an integer number of bits. To say it simply, you can't go lower than 1 bit.
It seems reasonable, but that's not even close to Shannon's limit. An event which has 90% probability to happen for example should be encoded using 0.15 bits. You can't do that using Huffman prefix codes.

A solution to this problem was found almost 30 years later, by Jorma Rissanen, under the name of Arithmetic coder. Explaining how it works is outside of the scope of this blog post, since it's complex and would require a few chapters; I invite you to read the Wikipedia page if you want to learn more about it. For the purpose of this presentation, it's enough to say that Arithmetic encoding, and its little brother Range encoding, solved the fractional bit issue of Huffman, and with only some minimal losses to complain about due to rounding, get closer to Shannon limit. So close in fact that entropy encoding is, since then, considered a "solved problem".

Which is terrible because it gives the feeling that nothing better can be invented.

Well, there is more to this story. Of course, there is still a little problem with arithmetic encoders : they require arithmetic operations, such as multiplications, and divisions, and strictly defined rounding errors.

This is serious requirement for CPU, especially in the 80's. Moreover, some lawyer-happy companies such as IBM grabbed this opportunity to flood the field with multiple dubious patents on minor implementation details, making clear that anyone trying to use the method would face expensive litigation. Considering this environment, the method was barely used for the next few decades, Huffman remaining the algorithm of choice for the entropy stage.

Even today, with most of the patent issues cleared, modern CPU will still take a hit due to the divisions. Optimized versions can sometimes get away with the division during the encoding stage, but not the decoding stage (with the exception of the Binary arithmetic coding, which is however limited to 0/1 symbols).
As a consequence, arithmetic encoders are quite slower than Huffman ones. For low-end or even "retro" CPU, it's simply out of range.

It's been a long time objective of mine to bring arithmetic-level compression performance to vintage (or retro) CPU. Consider it a challenge. I've tested several variants, for example a mix of Huffman and Binary Arithmetic, which was free of divisions, but alas still needed multiplications, and required more registers to operate, which was overkill for weak CPU.

So I've been reading with a keen eye the ANS theory, from Jarek Duda, which I felt was heading into the same direction. If you are able to fully understand his paper, you are better than me, because quite frankly, most of the wording used in his document is way out of my reach. (Note : Jarek pointed to an update version of his paper, which should be easier to understand). Fortunately, it nonetheless resonated, because I was working on something very similar, and Jarek's document provided the last elements required to make it work.

And here is the result today, the Finite State Entropy coder, which is proposed in a BSD-license package at Github.

In a nutshell, this coder provides the same level of performance as Arithmetic coder, but only requires additions, masks, and shifts.
The speed of this implementation is fairly good, and even on modern high-end CPU, it can prove a valuable replacement to standard Huffman implementations.
Compared to zlib's Huffman entropy coder, it manages to outperform its compression ratio while besting it on speed, especially decoding speed.

Benchmark platform : Core i5-3340M (2.7GHz), Window Seven 64-bits
Benchmarked file : win98-lz4-run
Algorithm Ratio Compression Decompression
FSE 2.688 290 MS/s 415 MS/s
zlib 2.660 200 MS/s 220 MS/s

Benchmarked file : proba70.bin
Algorithm Ratio Compression Decompression
FSE 6.316 300 MS/s 420 MS/s
zlib 5.575 250 MS/s 270 MS/s

Benchmarked file : proba90.bin
Algorithm Ratio Compression Decompression
FSE 15.21 300 MS/s 420 MS/s
zlib 7.175 250 MS/s 285 MS/s

As could be guessed, the higher the compression ratio, the more efficient FSE becomes compared to Huffman, since Huffman can't break the "1 bit per symbol" limit.
FSE speed is also very stable, under all probabilities.

I'm quite please with the result, especially considering that, since the invention of arithmetic coding in the 70's, little new has been brought to this field.

-----------------------------------------------------

A little bit of History :

Jarek Duda's ANS theory was first published in 2007, and the paper received many revisions since then. Back in 2007, only Matt Mahoney had enough skill and willpower to sift through the complex theory, and morph it into a working code. However, Matt concentrated on the only use case of interest to him, the Binary version, called ABS, limited to 0/1 alphabet. This decision put his implementation in direct competition with the Binary Arithmetic Coder, which is very fast, efficient, and flexible. Basically, a losing ground for ANS. As a consequence, ANS theory looked uncompetitive, and slumbered during the next few years.

FSE work re-instates ANS as a competitive algorithm for multi-symbol alphabet (>2), concentrating its perspective as a viable alternative to block-based Huffman.

Thanks to promising early results from FSE, Jarek concentrated back its attention on multi-symbol alphabet. As we were chatting about perspectives and limitations of ANS, I underlined the requirement of a decoding table as a memory cost, and suggested a solution in the making to limit that issue (which ultimately failed). But Jarek took the challenge, and successfully created a new variant. He then published an updated version of his paper. The new method would be called rANS. He would later invent the terms tANS and rANS to distinguish the different methods.

rANS was later adapted by Fabian Giesen and Charles Bloom, producing some very promising implementations, notably vector-oriented code by Fabian.

But as said, this is not the direction selected for FSE, created before Jarek's paper revision. FSE is a finite state machine, created precisely to avoid any kind of multiplication, with an eye on low-power CPU requirements. It's interesting to note such radically different implementations can emerge from a common starting theory.

For the time being, FSE is still considered beta stuff, so please consider this release for testing purposes or private development environments.

Explaining how and why it works is pretty complex, and will require a few more posts, but bear with me, they will come in this blog.

Hopefully, with Jarek's document and these implementations now published, it will be harder this time for big corporations to confiscate an innovation from the public domain.

-----------------------------------------------------

List of Blog posts explaining FSE algorithm :

http://fastcompression.blogspot.com/2014/01/fse-decoding-how-it-works.html
http://fastcompression.blogspot.com/2014/01/huffman-comparison-with-fse.html
http://fastcompression.blogspot.com/2014/02/a-comparison-of-arithmetic-encoding.html
http://fastcompression.blogspot.com/2014/02/fse-defining-optimal-subranges.html
http://fastcompression.blogspot.com/2014/02/fse-distributing-symbol-values.html
http://fastcompression.blogspot.com/2014/02/fse-decoding-wrap-up.html
http://fastcompression.blogspot.com/2014/02/fse-encoding-how-it-works.html
http://fastcompression.blogspot.com/2014/02/fse-encoding-part-2.html
http://fastcompression.blogspot.com/2014/02/fse-tricks-memory-efficient-subrange.html
http://fastcompression.blogspot.com/2014/04/taking-advantage-of-unequalities-to.html

[Edit] : replaced huff0 by zlib on the comparison table
[Edit] : added entry on rANS variants by Fabian & Charles
[Edit] : added list of blog entries

Saturday, October 19, 2013

Compression stream (part 3)

After taking care of the decompression stream, it's now time to have a look at the compression side.

As stated earlier, a compression stream delivering independent blocks is fairly easy to setup. So the objective of today is to define a proper behavior using chained block methodology, which uses data from previous blocks to better compress the next one. By extension, this strategy also works with static dictionaries.

An ideal compression stream should be able to :
- accept input from any source and any size
- deliver compressed output into destination buffer of any size.
- use as little working memory as possible

The last point is the most interesting one.
A (relatively) easy way to create a streaming interface is to use intermediate buffers, for both input and output. These buffers would easily absorb side effects, such as : providing a ridiculously small output buffer, channeling more input than output, etc.
Obviously, this methodology has downsides too, such as performance slowdown, due to the need to copy data from and into intermediate buffers. But more importantly, it increases memory footprint.

Let's focus on the output buffer for now.

It's relatively easy to skip the intermediate output buffer in the simple case when the remaining space into the destination buffer is "large enough". Then, it's possible to compress directly into the destination buffer, because the compressed block can fit in. It saves a useless memory copy operation, and potentially a memory allocation, if the intermediate output buffer is not yet allocated.

"Large enough" condition is met when one of these conditions is reached :
- Remaining space is larger than a full block size.
- Remaining space is larger than input.

However, if none of these conditions is fulfilled, it becomes harder to decide.

If remaining space into output buffer is smaller than input size (or block size), we could eventually try compression, just in case it would work. After all, if compression ratio is, for example 2:1, an output buffer with barely more than half the size of input could be enough.
The problem is, if it's not enough, then the compression operation will fail, and compression tables would become unusable. This is a major issue. In contrast with decoding, we can't "try and retry" compression, due to the modifications of compression tables.

A potential idea to solve this issue could be to duplicate compression tables before attempting the compression. However, let's remind that the objective is to perform better than "compressing into an intermediate buffer", to avoid some memory copy and allocation. Duplicating the compression tables would create the very same issues, it's doubtful if this technique achieve any saving.

Another idea is to dynamically adapt the amount of input to compress. Basically, since the worst case is that input may be not compressible, compress as much input data as there is remaining space into output buffer. This is possible, because blocks don't need to be full, they just have a "maximum size" (currently, selectable from 64KB to 4MB).

OK, this method works, up to a point.

As stated previously, LZ4 is currently able to produce compressed blocks, which need to be "properly terminated". Small blocks compress worse than larger ones. Fortunately, chained block is an excellent methodology to mitigate this effect.

However, when block sizes start to be become very small (for example, < 100 bytes), some inefficiencies become sensible. For example, the last 5 bytes of any block must be uncompressed. A match cannot start at a distance < 12 bytes from the end of the block. Each block starts with a 4 bytes header, and potentially ends with a 4 bytes checksum. These costs are small, but repeated small blocks will create them more often.

Consequently, there is a "trade-off" to find.

From a compression ratio perspective, an intermediate buffer is more optimal. For example, if the user application provides a destination buffer of 150 bytes, it's likely preferable to compress first into a (larger) temporary buffer, and then deliver the result 150 bytes at a time.

Now, it's possible to imagine some embedded systems with very limited available memory, where allocating another temporary buffer is not possible. In this case, these systems might prefer the inefficiency of the "small blocks" method, rather than no compression at all.

So, maybe, the better way is to provide both methods, letting the programmer select which one works best for his use case.

In the end, trying to manage the complexity of all border cases might be a bit too large for a first delivery. After all, the first delivery is not the last, it's still possible to provide improvements in future releases.

So, a probably faster way to deliver a working streaming interface is to work with intermediate buffers methodology by default, avoiding them in the most obvious use cases. These use cases, by the way, should be the most common ones. By properly documenting them, it increases the odd that the Streaming interface is used in a more optimal way.

Friday, September 27, 2013

Stream Decoding (part 2)

I've been considering a complementary idea since writing the first part of LZ4 streaming interface.

A major point which bothered me was the need to shuffle data around, and the impossibility to decode data directly at final destination. Let's go into detail.

Currently, the LZ4 decoding functions either accept of refuse references outside of current block. Refusing is the default behavior, compatible with independent blocks.

For chained block, however, there is a major restriction. The decoding function will expect to find data from previous block *just before* current block. The good news is that it makes the decoding function fairly simple. The downside is that it pushes some requirements into user territory.

Addressing this requirement is now a major concern. It can be accomplished by decoding into an intermediate buffer, where the last decoded 64KB can be saved, and decompression operation can be performed in a contiguous space after it.

Decoding into an intermediate buffer makes the work done, but also results in the following issues :
1) It's necessary to allocate more memory for stream-decoding. Memory-constrained systems may not afford it. Worse, this cost is multiplied by the number of streams decoded in parallel.
2) It's necessary to copy decoded data, from the intermediate buffer to final destination. This is obviously bad for performance.
3) Keeping the last decoded 64KB just before the new block to decode require either to repeatedly copy 64KB at this place, which can become a serious waste for small blocks (think about small network packets), or require an even bigger intermediate buffer to limit the number of times data is copied.

Trying to fix these issues, I came up with a different design, with much more desirable properties.

The main idea is to work with a rotating memory buffer. It reduces the memory requirement for stream-decoding to just and only 64KB.

Even more importantly, it authorizes to decode directly at final destination, with no requirement regarding size or continuity. This is way more flexible for a user program.

To reach these properties, it's necessary to create a new family of decoding functions, able to use an external dictionary. The main idea is : when a reference is generated "outside of current block", look into another memory space, provided as an input argument.

As a positive side effect, this capability saves a lot of memory for server systems, handling many clients' connections. When using full adaptive streams, the memory requirement is only 64KB per decoding stream.
But even better, when using a static dictionary (which means, each block is compressed with the same prefix data), memory requirement per stream is reduced to zero. All decoding streams will directly read into the same memory block. For systems handling huge number of streams in parallel, it is a major memory allocation saver.

With the main properties of the decoding stream settled, now is the time to talk about the compressing stream.

(To be continued...)

Wednesday, September 11, 2013

Towards a Streaming Interface for LZ4

After settling a Streaming Format for LZ4, it seems about time to define a proper API for it.

LZ4 is quite stable in "block mode" scenarios. They are quite simple. "Take a memory zone here, and compress it there"; with obviously, its reverse operation : "take a compressed memory zone there, and decompress it here". These scenarios and close variants seem today properly supported.

The streaming API objective is to extend usages beyond block mode.
The API shall serve a fairly typical communication scenario between distant machines, in which data may not be easily prepared into a single source buffer (in case of multiple sources), and must be sent at arbitrary (application-defined) moments, with the goal to reduce latency.
It's not necessarily the only scenario the API will serve, but "live communication" is interesting enough; it requires a strong set of properties, difficult to fulfill, and is therefore a likely superset of many other, simpler, use cases.

A streaming interface requires to define a direction. For each direction, there is a data producer, acting as a compressing pipe, and a data consumer, requiring a decompressing pipe. (A full bi-directional communication, therefore, requires 2 directions.)
As a consequence, two roles, and two interfaces, have to be defined.

From a definition perspective, starting from the end can be more instructive, in order to properly see and take into consideration desirable properties. So let's start by the decoding side.

Streaming API for decompression

The LZ4 streaming format describes a stream as a series of blocks.
The fundamental assumption is that each block is complete, to be decoded.
This can be easily checked, thanks to the block prefix, which tells its size. When enough data is provided, the compressed block can be considered "complete", and decompression can be performed.

The main advantage of this method is that it can use existing field-tested decoding functions. However, it also introduces a few limitations.

Unfinished input block :
If the full compressed block is available directly from source buffer, there is no problem.
But if not, there is a need to copy the piece of block into a temporary buffer, in order to complete it later on.

This situation could for example happen in a packet scenario, where each packet might be small (for example about 1500 bytes over Ethernet), and several ones are needed to represent even a small compressed block.

Nonetheless, just the possibility that it "may" happen translates into a requirement to allocate some memory for a temporary input buffer, just in case. Another more complex solution would be to allocate it "just in time", which is, the first time an unfinished block is received.

Fortunately, the streaming format defines an upper bound to this temporary buffer size, since each block has a maximum size. Therefore, this memory requirement is predictable.
A way to limit the amount of temporary memory is to use small maximum block sizes. The decoder could then simply refuse to decode streams with a maximum block size above a selected limit.

Currently, the LZ4 streaming format defines a few maximum block sizes values, from 64KB to 4MB. There may be a need to define more sizes (specifically at the lower end), and, maybe, let users define their own sizes. In this case, the format specification will have to evolve to integrate this requirement.
Size of destination buffer :
The only thing that the decoder knows when it starts decoding a block is its maximum decoded size.
Should the destination buffer be not large enough to contain the largest possible block size, there is a risk that the decoding process will fail.

There are several ways to handle this situation. The easiest one is to impose a restriction on authorized destination buffer size. Simply put, if the destination buffer is not large enough, the decoding process will not even start.

Such a restriction certainly makes the streaming interface easier to code, but can be considered too limiting. For example, maybe the user process wants to decode data directly at its final destination. In this case, the destination buffer is just big enough to handle the exact amount of data expected, not a single byte more.

To face such a circumstance, a temporary output buffer would be allocated. Whenever a decoded block is too large to fit into the destination buffer, the block will be decoded into this temporary output buffer instead; then, the relevant piece of block will be copied into the destination buffer.
Such "memory copy" operation seems obviously sub-optimal, but it does the job done.

This mechanism still lets opened the choice for the priority buffer.
If the destination buffer is not large enough to contain the largest possible block size,
should the block be decoded first into the temporary output buffer, and then the relevant piece get copied into the destination buffer,
or,
should it be decoded first into the destination buffer, just in case it would be large enough, and then backup to the temporary output buffer when it does not work ?

The second choice basically trusts more the user programmer, at the cost of heavier performance penalty if the bet wasn't correct (2 decoding function calls instead of one). On the other hand, if the bet was correct, it saves one memory copy operation.
The first choice is more middle-ground, costing one memory copy operation and a single decoding function call in all circumstances.

A complex proposal could be to "heuristically guess" the best choice. For example, the algorithm would start with choice 2 (trust the programmer, and decode directly into destination buffer), and then revert to choice 1 after a few fails.
Another possibility is to have 2 separate functions, so that the programmer can directly select what he wants.
Chained blocks :
In order to improve compression ratio, sometimes dramatically for small packets, the LZ4 streaming format is able to encode new blocks using the previously decoded 64KB from previous blocks.
This mechanism produces terrific compression improvements, especially for packets which have a lot of headers, fields and/or identifiers in common.

The logical consequence is that the previous 64KB must be known by the decoding process. Since there is no guarantee that prior decoded data still sits at its decoded position (and it is most probably not), it seems there is no other way than to save these 64KB into a temporary buffer.

A fairly logical choice would be to put this 64KB into the "temporary output buffer" mentioned in previous paragraph, thus merging both requirements. It inflates the size of this temporary output buffer by 64KB.

Another consequence is that choice 2 (of previous paragraph) does no longer make sense, therefore only choice 1 seems relevant. Choice 2 can still make sense for in-place decompression of independent blocks. However, for a streaming scenario, chained block is really the advised setup, since it greatly improves compression ratio.

If there is no other possibility than to first decode into a temporary output buffer, then it seems to make sense to provide, as a result, a pointer and a length into this buffer, rather than copying the result into another destination buffer. It will be up to the user process to decide what to do with this data.
Too much input for your output
This situation can happen when input contains several full size blocks.
Since decoding is going to be done in a temporary output buffer, which size is controlled by the decoding stream, it is guaranteed to decode at least one full block.
But beyond that point, there is no such guarantee.

As a consequence, with not enough output buffer left to decompress remaining input data, the streaming interface must deliver what it can, and then indicate that the job is not completed.
One relatively simple way to achieve this is to output a Boolean "more_to_come" signal. When it is set, the user process is informed that more data needs to be decoded, and should therefore call again the decoding function, after disposing of current output (since the next output is likely to overwrite it).

So here we have a reasonably complex set of requirements for the decoding process. And it already has a few perspectives for improvements.

For example, most of requirements regarding temporary buffers come from the fact that decoding function must handle complete blocks.
Should a function able to decode partial blocks exist, it would eliminate the need for a temporary input buffer.
The size of the temporary output buffer, which currently must be 64KB + MaxBlockSize, could receive an upper limit of 64KB + 64KB = 128KB (This obviously would only matter for large values of MaxBlockSize).

Going one step further, a decoding function able to handle data copy from "out of buffer" positions would reduce memory needs to just 64KB, on top of offering some perspectives for in-place decompression. By avoiding the requirement to decompress first into a temporary output buffer, and then copy the relevant result at its final destination, it could improve performance.

On the other hand, a new decoding function able to decode "partial blocks" would require some more complex logic, tests, branching, and state information. As a consequence, all this complexity costs a fair share of performance, losing some speed in the process. It's unclear if the final result would be faster than using intermediate buffer. But at least, it would shave off a few buffers.

There are apparently several optimization steps which could be attempted beyond the first delivery. The main idea is that, such future improvements will, ideally, have no impact on the API itself.

(To be continued...)

Thursday, August 15, 2013

Inter-block compression

One of the new capabilities defined into the LZ4 Streaming format is inter-block compression. While the concept is simple to formulate (the next block shall be compressed using information from previous block(s)), its execution has been non existent so far.

Up to now, the LZ4 compression utility is doing something relatively simple : cut the data to be processed into independent blocks, compress them one by one (or decode them one by one), assemble the result into destination file or stream.
It is simple and it works.

The problem : the smaller the block size is, the worse the compression ratio.
It can be clearly noticed on the following table :

File : Silesia.tar / Independant blocks
Block Size LZ4 LZ4 HC
4 MB 47.97% 36.82%
1 MB 48.03% 36.99%
256 KB 48.27% 37.68%
64 KB (*) 40.41%

(*) Note : the 64KB version of LZ4 is using a dedicated algorithm which improves compression ratio, and is therefore not comparable with previous block sizes. Nonetheless, for the record, the result is 48.34%.

As can be seen, the situation gets worse and worse with each block size reduction. That's because the first 64KB of each block starts with an empty dictionary, worsening compression capabilities. LZ4 HC is, by the way, much more affected than the Fast LZ4, since it achieves better benefit from earlier data (and therefore loses the most) thanks to its better parsing methodology.
One can also note that, starting with 64KB block size, the decrease in compression ratio is very sensible. It can only get worse from this point towards smaller sizes.

One work around solution is to simply not use small block. It works fairly well : lz4c uses the maximum block size (4MB) for file compression as default.
But there are circumstances when this choice is not possible.

For example, let's suppose your application is sending packets in real time to a server. Because the application is supposed to react in real-time with low-latency, you can't wait to amass 4 MB of data before compressing it and sending it to the server. Data must be sent at application-defined moment, no later.

Such a packet of data is typically small, at most a few KB.
This is very small : if compressing just the packet, the compression is probably going to be poor.

Even more importantly, data packets tend to have a lot of common elements between them, such as headers, and identifiers. But since the repetition can only be noticed across multiple packets, it is lost if each packet is processed independently.

So the idea is to compress the next packet using information from previous packets.

It may seem simple, but this directly contradicts some current assumption in the LZ4 algorithm, which states that each compression function starts from a clean state.

Well, no longer. The soon-to-be release r102 of LZ4 features new compression functions to continue the compression process from where it was stopped previously.
To realize this operation, the compression context is now an explicit argument of the new functions. The context must be created manually, using a dedicated function, and then provided at each round to the new compression functions.
(Note : The usual compression functions are still there, and work the same, so there is no modification for existing LZ4 users.)

As a consequence, these new functions should improve compression ratio. And that's what can be measured :

File : Silesia.tar / Inter-block compression
Block Size LZ4 LZ4 HC (HC Independent blocks)
4 MB 47.95% 36.76% (36.82%)
1 MB 47.95% 36.76% (36.99%)
256 KB 47.96% 36.77% (37.68%)
64 KB 47.97% 36.78% (40.41%)

There are a few learnings from this experiment :

1) Compression ratio do improve a little bit. This was expected : with previous data available, the initial 64KB of data of each block can be fully compressed at its potential.
One can notice that, once again, LZ4 HC benefit more from it. That's still for the same reason : better parsing, making better usage of available dictionary.

2) Compression ratio do not degrade with smaller block sizes.
Or, to be fair, it does worsen, but by a tiny amount compared to previous situation.

This property is key : it opens the perspective for very small block sizes (such as network packets, of a few KB), while keeping a good compression ratio, close to optimal conditions.

This was the target property of this exercise. It represents a significant step towards an LZ4 Streaming API.