Following the entry regarding Error detection, it is still necessary to find a correct algorithm to do the job.
In other circumstances, the solution would have been a simple reliance on CRC32. After all, this algorithm is already in use in many applications, including modem transmission (ITU-T V.42), Zip, MPEG-2, etc. So it looks solid.
And solid it is, but fast it is not.
As stated by Shelwien, a fast hard-wired version of CRC32 exists within the most recent Intel CPUs. However, not only is it a limited choice in terms of target CPUs, it is also a different (i.e. incompatible) variant from the standardized one, since the instruction implements CRC32-C, which uses a different polynomial. Therefore, software mode cannot be avoided.
And what's the use of a very fast compression algorithm if the checksum takes much longer to be calculated ?
To be more specific, the LZ4 decoder works at roughly 1 GB/s on my test machine. So, to be almost negligible, a checksum algorithm would need to be ~10x faster, which would mean 10 GB/s. This is obviously not possible, keeping in mind that RAM bandwidth cannot sustain such a speed : a memcpy() operation is limited to 2.5 GB/s on the very same machine.
So OK, we can't reach the ideal situation, but what would be the best possible one, i.e. the fastest one ?
This is not an easy question, since the usual outcome is a fast but weak checksum with very poor properties. For example, sum all your bytes one by one, and here is your checksum ! Except that its distribution is absolutely horrible, resulting in too many potentially undetected errors.
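To illustrate, here is a minimal sketch of such a naive byte-sum checksum (the function name is illustrative). Note, for example, that swapping any two bytes leaves the result unchanged, which is precisely the kind of error it cannot detect :

#include <stdint.h>
#include <stddef.h>

/* Naive byte-sum checksum : very fast, but with a terrible distribution.
   Reordering bytes, and many common corruption patterns, leave the sum unchanged. */
uint32_t sum_checksum(const uint8_t* data, size_t length)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < length; i++)
        sum += data[i];
    return sum;
}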
In order to assess whether a checksum algorithm features good properties, it is necessary to test it with an appropriate tool.
And we are lucky, such a tool exists.
Austin Appleby created MurmurHash, in a bid to provide the fastest possible hash algorithm. In order to test his own algorithm, he also created SMHasher, which is an invaluable tool to assess the performance of any hash function. As can be guessed, MurmurHash answers perfectly to all criteria, but that doesn't mean the tool is flawed; it's more a testament that MurmurHash was developed and improved along the lines of SMHasher. That being said, the methodology is sound, and the results look reliable.
I immediately used the opportunity to test the different checksum algorithms provided within the package. Starting with CRC32, of course. And the results are not excellent.
CRC32 speed is 430 MB/s on the test system. That's not bad, but compared to the LZ4 decoder, it is more than twice as slow, which means the CPU would spend more time checking the data than actually decoding it.
Furthermore, the distribution properties of CRC32 are not "best in class". Quite far from it, in fact. The bit distribution is typically biased, which means that the probability for a given bit position to be 0 or 1 is not a near-perfect ~50%. Worse, bit independence is nonexistent : it fails some tests, such as the "2-bytes keyset" and the "HighBits keyset", which has devastating consequences on bit distribution.
This doesn't make CRC32 unusable. It just shows that this older algorithm is not perfect.
More importantly, as expected, it is not fast enough for our needs.
Let's move on to another candidate, FNV, which is also part of the package.
FNV (Fowler-Noll-Vo) has a reputation for being an extremely fast hash function. Many variants have appeared around the web, all built on the same core principle : marrying XOR with multiplication in an inner loop.
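As a reference point, here is a minimal sketch of the classic 32-bit FNV-1a variant; the two constants are the standard FNV offset basis and FNV prime :

#include <stdint.h>
#include <stddef.h>

/* FNV-1a, 32-bit : one XOR and one multiplication per input byte */
uint32_t fnv1a_32(const uint8_t* data, size_t length)
{
    uint32_t hash = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < length; i++) {
        hash ^= data[i];
        hash *= 16777619u;                /* FNV prime */
    }
    return hash;
}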
The speed is indeed better, but at 550 MB/s, it's not that great.
Furthermore, the properties of FNV are no better than CRC32's : it features the same weaknesses, plus it fails badly at the "Cyclic test", has poor results on the "sparse keyset", and very poor ones on both the low-bits and high-bits keysets, etc. Well, the results are so bad that I could not recommend this function to anyone.
So, let's go straight to the king of the hill. The most recent version of MurmurHash (v3.a) is said to be the fastest hash function available. We are only interested in the 32-bit version, so let's put that one on the grill.
And yes, fast it is. The speed is 2.7 GB/s. This is just huge. It's close to my RAM speed limit. Interestingly, the speed depends on the input being aligned on 4-byte (32-bit) boundaries. If not, the speed dips to 2.0 GB/s, which is still good.
More importantly, with this speed, we get all SMHasher properties verified, with no single trace of distribution weakness. That's great. We have a candidate !
************************************************************************************
The story could have ended here. But as usual, I couldn't resist. I wanted to put some of my own functions to the test, thanks to SMHasher.
An old "fast hash" of mine would end up with the same speed as MurmurHash, but much worse properties.
That could have been the end. After all, MurmurHash is so fast, maybe it is merely limited by RAM speed after all ?
So I made another, simpler attempt, purely targeting speed. It resulted in an 8 GB/s hash function. Yep. This is much more than my RAM speed, so how can that be ?
Well, in fact, it is much more than memcpy(), but there is nothing magic about it : memcpy() not only reads input from memory, it also writes to memory. This is something a hash function does not need to do; the input is simply "sunk" into registers. So it can be that fast, and there is definitely some room left for more speed.
I therefore spent a little more time studying what makes MurmurHash such a great hash function.
I believe there are 2 key parts :
The first one "blends" the fed input bytes into registers, thanks to several simple operations, the most important of them being _rotl().
_rotl() (rotate left) ensures that bits that were previously high now become low. It effectively nullifies the problem of high bits having only a small contribution to the overall hash distribution.
The second one digests the final blended result through a series of small steps known as the "avalanche". It destroys any bit bias or correlation that the blend may have left.
With both being combined, we get an almost perfect hash distribution.
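As an illustration, here is a minimal sketch of this two-stage blend + avalanche structure. The rotation amounts, multiplication constants and final avalanche below are those of MurmurHash3's 32-bit variant; tail bytes, the length mix and seed handling are omitted for brevity :

#include <stdint.h>
#include <stddef.h>

static uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

/* Simplified sketch : a "blend" loop mixing each 32-bit word into the
   accumulator, followed by an "avalanche" finalization step. */
uint32_t blend_avalanche_sketch(const uint32_t* words, size_t nbWords)
{
    uint32_t h = 0;
    for (size_t i = 0; i < nbWords; i++) {
        uint32_t k = words[i];
        k *= 0xcc9e2d51u;
        k = rotl32(k, 15);         /* the rotation moves high bits into low positions */
        k *= 0x1b873593u;
        h ^= k;
        h = rotl32(h, 13);
        h = h * 5 + 0xe6546b64u;
    }
    /* avalanche : destroys any remaining bias or correlation */
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}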
So I started from there, re-using the above "fast simple hash" and improving it with the presented ideas. I ended up with a 7.5 GB/s hash function, which is just about hellish.
It passes all the tests except a single one, "Keyset 'Sparse' - 2048-bit keys with up to 2 bits set". That's because the blending phase creates "twin bits" far apart from each other, at varying distances depending on bit position. This is obviously bad for a crypto-hash, but a 32-bit checksum is not meant for that task anyway.
We simply want to detect errors, most probably transmission or storage failures, or manual glitches. For such cases, since twin bits are very far apart, it would take extreme bad luck for random errors to land exactly on a twin-bit pattern. In fact, the probability seems lower than that of getting a collision between two completely different inputs on a 32-bit checksum. Which is the limit of this exercise.
Anyway, for the sake of it, I also created a stronger version of the algorithm, which adds an operation in the inner loop to make the twin-bit pattern vanish. This version reaches 4.2 GB/s, which is still good, but noticeably slower than the fast version. Well, that's the price to pay for stronger properties.
The final result of these investigations is called xxHash, and it is now an open-source project, hosted on Google Code.
*****************************************************
(Update) : a new xxHash version has been published. It features the same distribution quality as xxHash_strong, while being faster, at 5.4 GB/s on the same platform (Core 2 Duo, 3 GHz). It also allows providing the input in several consecutive calls, as opposed to a single memory block.
Monday, April 30, 2012
Thursday, April 12, 2012
Detecting errors
Surely there are also valid usages of LZ4 which may not need error detection and its associated CPU and bandwidth overhead; hence this feature will remain optional in the final specification, though recommended.
This requirement is very common. And a reasonable method to check for unintentional errors (such as transmission or storage ones) can be provided by a simple checksum. Typically, a good 32-bit checksum makes it quite improbable (about 2^-32, i.e. less than one in a billion) that a random change goes unnoticed. This method is typically used by Zip.
One could note that the LZ4 decoder already includes some form of detection of malformed input. That's correct, but it's not enough, because it only checks that the provided instructions are valid. Its main objective is to forbid any attempt to read or write outside of the memory buffers. Hence, it cannot detect, for example, a corrupted literal within the compressed stream.
On the other hand, one could also require the method to guarantee that the input data was never modified, neither randomly nor intentionally (data integrity). Such an objective, however, seems out of scope.
To begin with, it requires an electronic signature of at least 128 bits, such as MD5. But even MD5 is now considered too weak for this use, and should be superseded by stronger 160-bit hashes such as SHA-1.
One obvious issue is speed. What's the use of a super-fast compression / decompression algorithm if the checksum function is several times slower ?
But the most important issue is that the signature alone is not enough. The signature's mission is to make it almost impossible to find a modification of the original source file which produces the same electronic signature. But since the electronic signature is provided within the framing format, a malicious attacker would just need to change both the content and the signature.
Therefore, an effective protection shall also require encryption, or a separate signature storage and transmission, which is way beyond the objective of this framing format.
So we'll just consider random data corruption from now on.
The next question is : what has to be checksummed ?
Due to the popularity of the Zip format, an obvious answer could be "the original data".
But let's look into it, it's not that obvious.
Take the snappy framing format, for example. It checksums the original data of each chunk individually, with each chunk being <= 32 KB. The main advantage is that an incorrect chunk is detected immediately; there is no need to wait for the end of the stream. This is especially important when the stream is very large (for example 1 GB) and the consuming process wants to use the decoded data without waiting for the last chunk.
The main drawback, however, is silent truncation. Imagine for example that the last few chunks are missing. There will be no way to detect that.
Another possible answer is to checksum the compressed data instead of the original one.
This is valid. A compressed stream can only be decoded into a single result. Hence, protecting the compressed data from errors is the same as protecting the original data.
And it brings several advantages with it.
First, by checking the checksum against the compressed stream, it warns against data corruption even before attempting to decode.
Second, since the compressed data is generally smaller than the original, the checksum function will be faster to execute. Both properties have a positive impact on CPU consumption.
Note however that, for the first property to hold, each chunk must be checksummed, not just the entire stream, because decoding starts as soon as the first chunk is read from I/O, both to save time and memory.
OK, so considering these advantages, I'm almost sold on checksumming the compressed chunks. The most important drawback I still worry about is the silent truncation issue of the snappy framing format.
It seems however this can be easily solved. One just needs to ensure that some "continuation" information is provided, and there are several possibilities to do it. For example, provide the size of the original data in the framing format, and derive the number of (fixed-size) chunks from it. Or embed the continuation information in a field associated with each chunk, and make it part of the checksummed data, as sketched below. There are surely other possibilities. The main idea is that the "end of stream" information must be provided explicitly, not just detected by hitting the EOF marker.
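For illustration only, here is one hypothetical way to embed such continuation information into a per-chunk header. This is not the actual LZ4 framing format, just a sketch of the idea; all field names are invented :

#include <stdint.h>

/* Hypothetical per-chunk header : because the lastChunk flag is part of the
   checksummed data, a silently truncated stream fails verification. */
typedef struct {
    uint32_t compressedSize;   /* size of the compressed payload that follows */
    uint8_t  lastChunk;        /* 1 if this is the final chunk of the stream, 0 otherwise */
    uint32_t checksum;         /* checksum of the header fields above + the compressed payload */
} ChunkHeader;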
One last property that could be missing is using the checksum as a kind of "weak signature". For example, in the zip format, since the checksum is calculated from the original data, it is possible to "check" that the file present in the archive is likely the expected one.
But I'm not sure such usage has any real value, beyond convenience. As stated earlier, such a "signature" is much too weak to be trusted, so it can merely be used as a hint, not much else.
As usual comments are welcomed.
There is another point to discuss, which is the checksum algorithm itself.
We'll keep that for another blog note.
Sunday, April 8, 2012
A file container format for LZ4
With increased usage of LZ4 come new use cases, and with them new requirements.
The recent ports of LZ4 to Python and Perl are two excellent examples, multiplying the possible usages by making the algorithm available to a broader range of applications, including easy-to-deploy scripts.
In this context, a simple question is asked : how can one be sure that a file compressed with, for example, the Python port of LZ4, will be decodable by the C# port of LZ4 ?
Which is indeed a very good question. And the sorry answer is : there is no such guarantee. At least, not yet.
LZ4 has been developed as an in-memory compression algorithm : it takes a memory buffer as an input, and provides the compressed result into an output memory buffer. Whatever is done before or after that operation is beyond its area of responsibility. The fact that these buffers may be read, or written, to disk is just one of many possibilities.
This is a correct positioning. There are too many possible usages for LZ4, and a lot of them will never bother with on-disk files or network streaming.
Nevertheless, one natural usage of a compression algorithm is file compression, and it applies to LZ4 too. In this case, the compressed result is written to a medium, and may be decoded later, possibly by a completely different system. To ensure this operation will always be possible, whatever the version of LZ4 and the language it is written in, it is necessary to define a file container format.
To be more complete, a container format has already been defined in lz4demo, but it was never meant to become a reference. It is too limited for this ambition. Therefore, a new container format specification is necessary.
The ideas in this post will be largely borrowed from this older one, where a container format was discussed and defined for Zhuff. Most of the points are still applicable for LZ4.
We'll start by defining the objectives. What are the responsibilities of a file container format ? Well, mostly, it shall ensure that the original file will be restored entirely and without errors. This can be translated into the following technical missions :
- Detect valid containers from invalid input (compulsory)
- Detect / control errors (recommended)
- Correct errors (optional)
- Help the decoder to organize its memory buffers :
  - buffer sizes, and number of buffers (compulsory)
  - provide user control over buffer sizes (optional)
- Organize data into independent chunks for multi-threaded operations (optional)
- Allow quick jumps to a specific data segment within the container (optional)
- Be compatible with non-seekable streams, typically pipe mode (compulsory)
- Enforce alignment rules (optional)
- Allow simplified concatenation of compressed streams (optional)
This is already quite a few missions. But most of them will be possible, as we'll see later.
Another important thing to define is what's not within this format's missions :
- Save the file name (debatable)
- Save file attributes and path
- Group several files (or directory) into a single compressed stream
- Provide optional user space, typically for comments (debatable)
- Split the compressed stream into several output files
While the file container format takes care of regenerating the original content, an archive format will also take care of the directory structure. It will be able to save file names, and re-organize them in a better way to help compression. On decoding, it will regenerate the entire tree structure, placing each file at its correct location, sometimes also restoring the original attributes.
An archive format therefore occupies a higher layer in the structure, and it typically relies on such container formats.
Last but not least, we'll make the definition of "file" a bit broader, Unix-like, in order to also encompass "piped" data. The "file container format" then becomes a "stream container format", with files being just one of the possible streams.
All these elements can be provided in a relatively compact and efficient header.
But first, let's discuss the most important features.
[Note] Considering the comments, it seems the word "container" does not properly define the objective of this specification. Another word shall be found. "Framing format" is the current candidate.
Sunday, January 8, 2012
Probability Update
Following the previous post on Binary Arithmetic Coder, we left with a simple-to-ask but difficult-to-answer question : how to update the probability that the next value will be a 0 or a 1 ?
Indeed. Presuming that we can update P0 (the probability that the next value will be a zero) supposes that we accept a simple idea : the past is a strong indicator for the future.
This may not always be true, but for compression it appears to work relatively well. At this stage, we can outline the strong relationship between compression and prediction : the compression ratio will only be as good as the algorithm's ability to predict the future (the next bit, in this case).
Trying to predict the future is not limited to compression, and massive amounts of investment have been, and are currently being, spent around this applied concept. In this post however, we'll keep our minds focused on compression, considering other fields only if they can help this objective.
A first simple way to evaluate P0 is to count the number of 0s and 1s previously seen.
Then, simply set : P0 = N0 / (N0+N1).
It works, and is a simple way to achieve convergence towards optimal stationary statistics.
But, in fact, no source is really static. Sources constantly evolve (otherwise, compression would be trivial). The probability should therefore adapt to the source, and consequently more weight should be given to the most recent bits.
A very simple way to achieve this property is to give new bits a fixed share of the updated probability, for example 20%. This is an easy method to gradually and smoothly reduce the weight of older records.
And we get :
newP0 = oldP0 x (1-share) + NewBitIsZero x share
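In fixed-point implementations, this update is usually expressed with a shift, which amounts to choosing a share of 1/2^rate. Here is a minimal sketch; the 12-bit precision, the rate value and the names are illustrative, not taken from any particular coder :

#include <stdint.h>

#define PROB_BITS 12                      /* probabilities scaled to [0, 4096) */
#define PROB_ONE  (1u << PROB_BITS)
#define RATE      5                       /* share = 1/32, i.e. ~3% per update */

/* newP0 = oldP0 * (1 - share) + bitIsZero * share, with share = 1/2^RATE */
static uint32_t update_p0(uint32_t p0, int bitIsZero)
{
    if (bitIsZero) p0 += (PROB_ONE - p0) >> RATE;
    else           p0 -= p0 >> RATE;
    return p0;
}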
This works too, although we start to wonder : what should this "share" be ?
20% ? 10% ?
It is very tempting to believe that the optimal adaptation share can be found as a single, unique value. And indeed, through brute-force testing (or search heuristics such as genetic algorithms), an optimal value can be found.
[Edit : an excellent example of "optimal adaptation share" is provided by Shelwien here : http://nishi.dreamhosters.com/u/book1bwt_wr.png ]
However, the optimal value will only hold for a given file, number of contexts and precision level. Change any parameter, and the optimal value changes too.
Could such an optimal adaptation share be guessed beforehand, through calculation rather than sheer experimentation ? Well, unfortunately no. It depends too much on the source, although some basic rules of thumb do apply : if the file is small, the share will be higher; if it is large, the share will be lower.
This points towards something retrospectively obvious : at the beginning of the file, when statistics are still scarce, the adaptation must be faster. Then, the more we accumulate statistics, the more accurate they should become, so the adaptation share must become lower.
It does not answer the question "how much", but it hints towards a moving value, becoming lower as we progress into the source file.
In order to get closer to the "how much" answer, I've collected some statistics, that we'll see and comment in a later post...
Thursday, January 5, 2012
Binary Arithmetic Coder
There is a way to code an element using less than one bit of space. Although the idea might sound weird, it has in fact been a proven method for a few decades now, under the name of Arithmetic Coding.
As a simple explanation, let's say that arithmetic coding is no longer about assigning a whole number of bits to each element, as Huffman does, a method which can't do better than one bit per element. We now define "ranges", within which each element takes a share (such as 2%, 75%, etc.). This method is extremely precise, since each element costs exactly its entropy, such as 2.53 bits (precisely, not approximated to 2 or 3), or 0.17 bits.
For more details on how it works, I invite you to read the Wikipedia entry, which is fairly good.
The method, however, has some drawbacks. To begin with, defining the precise share of each element incurs a header cost. This situation is pretty visible with Range0, which is a block-based range coder. In order to compensate for the (relatively) large size of the header, I had to accept block sizes of 128 KB, which is much more than the equivalent Huff0 (16 KB), resulting in a less precise entropy adaptation.
A way to remove the header cost is to not transmit any header at all. Entropy adaptation is then done dynamically, by recalculating the shares after each update.
Unfortunately, this method also has its costs. This time, speed is the price to pay, and a lot of it : adjusting a single element requires renormalizing all the others, which can be too much of a burden for a 256-element alphabet, such as a byte.
This is where the Binary Arithmetic Coder can help.
In contrast with the previous coders, this one is limited to a 2-element alphabet : 1 or 0. It's basically a yes/no switch. With this restriction in mind, the Binary Arithmetic Coder comes with a crucial advantage : a trivial update mechanism.
A single value is enough to define the switch. It represents the share of 0, while 1 takes the rest. If 0 is very likely, the share will be high. If we believe that, on the next update, the probability of 0 will be even higher, then it's enough to increase the value. Otherwise, decrease it. That's it.
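To make this "single value" concrete, here is a minimal sketch of the encoding step of a binary range coder driven by it. The 12-bit precision and all names are illustrative, and carry propagation and byte output are omitted, so this is a structural sketch rather than a working coder :

#include <stdint.h>

#define PROB_BITS 12                       /* p0 is scaled to [0, 4096) */

typedef struct {
    uint64_t low;
    uint32_t range;                        /* a real coder also carries an output buffer */
} RangeEncoder;

/* Encode one bit, given p0 = the current share of '0' */
static void encode_bit(RangeEncoder* rc, uint32_t p0, int bit)
{
    /* split the current range proportionally to p0 */
    uint32_t threshold = (rc->range >> PROB_BITS) * p0;
    if (bit == 0) {
        rc->range = threshold;             /* '0' takes the lower part of the range */
    } else {
        rc->low   += threshold;            /* '1' takes what remains */
        rc->range -= threshold;
    }
    while (rc->range < (1u << 24)) {       /* renormalize; byte emission omitted */
        rc->range <<= 8;
        rc->low   <<= 8;
    }
}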
With such a trivial update mechanism now in place, it's time to concentrate on the update rule. The question asked is "by how much should i increase/decrease the value" ? Simple question, but the answer to it will need quite some developments. And we'll keep that for another blog entry.
Monday, December 19, 2011
LZ4 into Hadoop-MapReduce
After a very fast evaluation, LZ4 has been recently integrated into the Apache project Hadoop - MapReduce.
This is important news since, in my humble opinion, Hadoop is among the most advanced and ambitious projects to date (an opinion shared by others). It also serves as an excellent illustration of LZ4 usage, as an in-memory compression algorithm for big server applications.
But first, a few words on Hadoop.
By 2005, Google shook the IT world by presenting BigTable, its home-grown distributed database with eventual consistency, able to store virtually the entire web and query it. BigTable was built on top of Google FS, a virtual file system covering the entire planet, tens of thousands of computers distributed in hundreds of data centers all over the world, as if they were a single massive one. This limitless amount of stored data could then be processed in parallel, typically for query preparation, thanks to the MapReduce framework, which allows petabytes of data to be processed in a small amount of time (if you can afford the number of servers necessary for that).
At that very moment, Google stole the crown of programming champion from Microsoft. It was now clear that they were setting the future. Although most of these technologies had already been studied, it was the first time they were executed together so well, at such a huge scale, in commercially available products. This gave Google literally years of advance over the competition, since most of its Web products were based on these capabilities.
Since then, all the other "big names" of IT (namely Yahoo, Facebook, Amazon, IBM, Microsoft, Apple, etc.) have been willing to duplicate this architecture. The result of all these efforts finally converged into the open-source project Hadoop.
Hadoop now has most of the capabilities presented in 2005 by Google, including a Distributed File storage system (HDFS), a distributed Database (HBase), and the same distributed-computing framework as Google, MapReduce.
So, what place does that leave for LZ4 ?
Well, in such architecture, compression is used as a performance enabler.
As can be guessed, massive amounts of data are traveling between the different nodes. Moreover, each node is also processing a fair amount of data, more or less permanently.
In such situations, compression offers some advantages : less data to transfer means less time and cost to send/receive it. It also means that more data can be stored in RAM, or that more data can remain in the CPU cache. All this translates into better system speed.
Or does it ? For this claim to hold, it is mandatory for the compression algorithm to be "unobtrusive", meaning it should consume very few CPU cycles. Otherwise, the cost of compression can void or even reverse the speed advantage. This basically means that only fast compressors qualify for the job.
In the beginning, LZO was such a champion. It offered great speed, however with some important usage limitations, due to its GPL license.
Then, early in 2011, Google released Snappy (formerly known as Zippy), the very same algorithm used by Google in its own BigTable implementation. It quickly became a great alternative, thanks to its better licensing policy and better performance.
LZ4 was also released that year, just after Snappy. Google's notoriety meant there was little attention left for competing algorithms, but it also raised awareness that fast compression algorithms have a role to play in IT architectures. LZ4 gradually improved over time, to the point of now providing better performance than Google's creation. One of Hadoop's contributors, Binglin Chang, made the effort to implement LZ4 as a JNI patch and compare it directly to Snappy. LZ4 performance was found to be better than Snappy's, even when using Snappy's own set of calibration tests.
In a relatively quick decision process, the LZ4 patch was then integrated into the main Hadoop - MapReduce source trunk.
/* Update : Google's Snappy developer kindly reminds that benchmark figures depend on the tested configuration, and states that on their own test platform, Snappy keeps an edge with regards to compression speed. See comment : http://fastcompression.blogspot.com/2011/12/lz4-into-hadoop-mapreduce.html?showComment=1326071743955#c7649536680897239608 */
The advantage of using fast compression algorithms, as Hadoop does, can be replicated in many server-side applications, for example databases. Recently, column-oriented databases have been drawing attention, since they make heavy use of compression to achieve some impressive performance advantages. The idea remains the same : compress data to keep more of it in RAM and in the CPU cache; this directly translates into better performance.
Monday, December 12, 2011
Advanced Parsing Strategies
Getting better compression ratio with LZ algorithms requires more than looking for long matches. It also requires some "parsing".
To explain it simply, one would assume that once a "best possible match" has been found at the current position, it just has to be encoded before moving forward.
But in fact, there are better possibilities. The naive encoding strategy is called "greedy match". A proven better proposition is to check if, at next position, a better (i.e. longer) match exists. If it does, then the current match is dropped in favor of the next one. This is called "lazy match", and typically improves the compression ratio by 1-2%, which is not bad considering that the compression format remains unaffected.
Lazy match is simple enough to understand, and clearly illustrates the trade-off at stake : compute more searches, in order to find better (longer) matches and improve compression ratio.
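As an illustration, here is a minimal sketch of the greedy vs lazy decision, using a deliberately naive brute-force match finder; all names are illustrative, and no real match finder works this way :

#include <stddef.h>

/* Deliberately naive match finder : returns the longest match length
   for the sequence starting at 'pos', by scanning all earlier positions. */
static size_t longest_match(const unsigned char* src, size_t size, size_t pos)
{
    size_t best = 0;
    for (size_t ref = 0; ref < pos; ref++) {
        size_t len = 0;
        while (pos + len < size && src[ref + len] == src[pos + len]) len++;
        if (len > best) best = len;
    }
    return best;
}

/* Lazy match decision : emit the match found at 'pos' only if the next
   position does not offer a longer one; otherwise emit a literal at 'pos'
   and reconsider from pos+1. Returns 1 to emit the match, 0 to defer. */
static int should_emit_match(const unsigned char* src, size_t size, size_t pos, size_t matchLen)
{
    if (pos + 1 >= size) return 1;
    return longest_match(src, size, pos + 1) <= matchLen;
}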
On the other extreme, there is a strategy called "optimal parsing", which consists in computing a series of matches at each position, and then selecting the best ones through a reverse traversal (like a "shortest path" algorithm). Optimal parsing requires a huge amount of computation, especially at the search step, since each and every position must be fully verified with a complete list of match candidates in ascending order.
Not willing to pay too much on the CPU side, I've tried to find a middle way, which would mostly emulate the "optimal parsing" result but with a CPU cost closer to lazy match.
The main idea is as follows :
Suppose I'm getting a match of length ML at position P.
The only reason I would not want to encode it immediately is if there exists a better match between P and P+ML.
We take the assumption that, at position P, we have found the best possible match. Then, the smallest possible "better match" must have a length of ML+1 and start at position P+1.
Consequently, the "smallest and closest better match" must end at position (P+1) + (ML+1) = P+ML+2.
Therefore, should such a better match exist, I'll find it by searching for the sequence stored at position P+ML+2-MINMATCH. Moreover, should any other better match exist, it will necessarily include the sequence stored at position P+ML+2-MINMATCH. So I will get all of them, and consequently the best of them, by simply searching for this sequence.
Compared to Optimal Parsing, which requires a full search at each position, this is a huge speed improvement.
However, the search is more complex than it sounds. To begin with, the "longest match" may start at any position between P and P+ML+2-MINMATCH. To find it, it is necessary to check the match length in both the forward and backward directions.
This means it's not possible to use advanced match finders, such as BST or MMC, which tend to eliminate unpromising candidates. That's not suitable here, since part of the total match length may be contributed by the backward direction. Therefore, each and every candidate which contains the searched sequence must be checked.
In these circumstances, the search algorithm is basically limited to Hash Chains, which are only suitable for "short range" searches. So that's a first limitation.
If no better match is found, I can safely write the current best match, and then start the search again.
If a better match is found, it means we have overlapping matches, with the second one being better than the first. It would be tempting to simply shorten the first match, in order to "make room" for the longer one, and then start the process again, but alas, it's more complex than that. The decision depends on P3.
Let's analyse the situation more thoroughly.
We have 3 overlapping matches, at positions P1, P2, & P3, with :
ML3 > ML2 > ML1
P2 <= E1 (=P1+ML1)
P3 <= E2 (=P2+ML2)
If P3<E1, then P2 is discarded, P3 becomes P2, and we search a new P3.
If P3=E1, then P1 can be encoded, P2 is discarded, P3 becomes P1, and we continue the search.
Obviously, the situation is less trivial in the more common situation when P3 > E1.
If P3 < E1 + VerySmallLength, then it's very probable that P2 will not pay for itself. A match costs an offset and a length, this can cost more than a few literals. Dealing with this case is almost equivalent to P3=E1, so we'll consider it closed.
If P3-P2 > ML1, then we know that the 2nd match will be longer than the 1st one. So we can proceed with shortening P1 length to ML1=P2-P1. Then we can encode P1, P2 becomes P1, P3 becomes P2, and we continue the search.
So now we are left with the difficult case. We have 3 consecutive matches, the limit between P3 and P2 is not yet settled (we would need P4 to decide), and depending on this decision, it impacts the limit between P1 and P2.
Let's give an example :
P1 = 0
ML1 = 8
E1 = P1 + ML1 = 8
P2 = 5
ML2 = 10
E2 = P2 + ML2 = 15
P3 = 12
ML3 = 12
P3-E1 = 4 : This is enough to "pay for P2"
But should i do :
ML1 = 8 ; ML2 = 4
or
ML1 = 5; ML2 = 7 ?
They look equivalent, but they are not, since the entropy difference between 4 and 5 is likely to be higher than the entropy difference between 7 and 8.
Therefore, if we keep the objective of a total length of 12, the first choice is the best one.
However, it's not yet certain that P3 will effectively start at 12. It all depends on P4, which is not yet known.
For example, if P4=E2=15, then obviously P3 disappears, and the better choice is ML1 = 5, ML2 = 10.
But, if no P4 is found, then P3=12, and the better choice is ML1 = 8, ML2 = 4.
So, if we want to decide now, we have to accept that the best choice is not yet known, and we have to settle with something sub-optimal. As we are already working on an approximation, this should not be considered a big issue, as long as the approximation is "good enough".
For programming simplification, we'll assume that the second solution is the better one, shortening ML1 as necessary to make room for P2.
It means we can now encode P1. P2 becomes P1, P3 becomes P2, and we continue the process.
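Here is a minimal sketch of the overlap-resolution rules described above. The Match type, the VERY_SMALL_LENGTH threshold and the action names are illustrative, and the difficult case is resolved with the approximation just discussed (shorten ML1) :

typedef struct { int pos; int len; } Match;

typedef enum { DISCARD_P2, ENCODE_P1_DISCARD_P2, SHORTEN_P1_THEN_ENCODE } Action;

#define VERY_SMALL_LENGTH 3   /* illustrative threshold : below it, P2 won't pay for itself */

/* Decide what to do with 3 overlapping matches P1, P2, P3, where ML3 > ML2 > ML1 */
static Action resolve_overlap(Match p1, Match p2, Match p3)
{
    int e1 = p1.pos + p1.len;              /* E1 = end of the first match */
    if (p3.pos < e1)
        return DISCARD_P2;                 /* P3 becomes P2, search a new P3 */
    if (p3.pos <= e1 + VERY_SMALL_LENGTH)  /* includes the case P3 == E1 */
        return ENCODE_P1_DISCARD_P2;       /* P3 becomes P1, continue the search */
    if (p3.pos - p2.pos > p1.len)          /* the 2nd match will outlast the 1st */
        return SHORTEN_P1_THEN_ENCODE;     /* ML1 = P2-P1, encode P1, shift P2/P3 */
    /* difficult case : as an approximation, also shorten ML1 to make room for P2 */
    return SHORTEN_P1_THEN_ENCODE;
}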
So here we have an algorithm, which tries to create a "nearly optimal" match distribution, taking local decisions to favor longer matches.
As a great property, it searches only N+1 lists of matches per series of N consecutive matches, which is a great time saver compared to Optimal Parsing.
Time for some evaluation. How well does it fare ?
Well, compared to the deep lazy match strategy implemented in Zhuff (-c2 mode), surprisingly little. The compression ratio barely improved, by about 1%. At least, it's a gain...
There are several reasons to explain this disappointing result.
First, the deep lazy match strategy of Zhuff v0.8 -c2 is already an advanced parsing algorithm. It heavily favors larger matches, and is complemented by a small memory to re-fill skipped space. So it's pretty good actually, which limits remaining potential savings.
But the deep lazy match strategy of Zhuff 0.8 requires on average 3xN forward searches per N matches. This is in contrast with the strategy in this post, which only requires N+1 searches, albeit more complex ones (forward & backward). As a consequence, the new strategy is 50% faster than the previous deep lazy match one. Now it looks better.
Second, the proposed strategy is only based on maximizing match length, deliberately ignoring any other cost contributor, such as, for example, offset length.
In a variable offset representation, small offsets are very likely to cost less, if not much less than larger ones. It's still possible to use this property "tactically", by using the smallest known offset able to provide the selected match length, but that's just an opportunistic gain, not an advantage that the selection algorithm takes into consideration.
Moreover, the full list of candidates can only be built for the first position P1. For the next ones (P2, P3), only a reduced list is likely to be found, since we learn P2 and P3 positions during the backward search. As a consequence, quite frequently, only a single offset is known (the longest one). This further limits the "opportunistic" offset selection gain.
For Zhuff, the offset cost varies from 9 to 22 bits. More realistically, it hovers between 12 and 21 bits. That's more than one byte of difference. Knowing that a literal is also compressed, with an average cost of about 6 bits per literal (depending on the source file), a long offset costs quite a bit more than a small offset plus one literal.
In fact, to select the better choice, it would be necessary to properly estimate, at each step, the cost of each contributor, offset, match length, literals and flags.
Well, that's the whole point of Optimal Parsing...
Sunday, December 4, 2011
Fast sequence comparison
A straightforward way to compare two byte sequences using C code would look like this :
while (*(bytePos+len) == *(ComparePos+len)) len++;

It works well, and is trivial to understand. However, we are giving away a critical property of modern CPUs : they can process a whole WORD in a single step. For a 32-bit CPU, it means that 4 bytes could be compared in a single instruction. This is expectedly faster than loading and comparing each single byte 4 times.
An optimised comparison function then becomes like this :
while (*(int32*)(bytePos+len) == *(int32*)(ComparePos+len)) len+=4;
if (*(int16*)(bytePos+len) == *(int16*)(ComparePos+len)) len+=2;
if (*(bytePos+len) == *(ComparePos+len)) len++;

While more complex, this version will provide better performance, especially for longer matches. It has however two drawbacks.
The first problem comes from the very need to compare int32 WORD values. Most modern CPUs will have no problem with that, but some, such as ARM, require these values to be aligned, meaning the WORD must start at an address which is a multiple of 4. Since compression algorithms have to compare sequences starting at arbitrary positions, this condition cannot be accommodated. ARM will have to live with the simpler comparison loop.
The second problem comes from the trailing comparisons. On leaving the main loop, since we know that fewer than 4 bytes are identical, we still need to check whether there are 0, 1, 2 or 3 of them. This can be achieved optimally with 2 comparisons.
Not bad, but still, it means there is a minimum of 3 comparisons to exit this function, and comparisons aren't good for CPU pipelines, especially unpredictable ones. The CPU will try to "guess" the result of these comparisons, in order to keep its pipeline busy by speculatively executing the next instructions. But if it guesses wrong, it has to stall and flush the whole pipeline, resulting in a significant performance penalty.
For this very reason, it can be preferable to use mathematical formulas to get the same result, even when they are more complex, since they avoid branching, ensuring a predictable CPU throughput.
In our situation, we want to get rid of the trailing comparisons and end up with a mathematical formula which gives us the number of identical bytes, starting from the first one. We'll consider that we live in a little-endian world for the rest of this post; the first byte is then the lowest one.
We know that at least one bit is different, since otherwise we would still be in the main loop.
To find this bit, we will use a simple XOR operation between the compared sequences :
difference = *(int32*)bytePos ^ *(int32*)comparePos;
If (difference==0), then both sequences are identical.
Since they are not, at least one bit is set to 1. Finding how many bytes are identical between the 2 sequences is now equivalent to finding the rank of the lowest bit set to 1.
A naive strategy to find the rank of this lowest set bit would be to test bits recursively, such as :
while ((difference&1)==0) { difference>>=1; rank++; }
But obviously, this is not going to be our strategy, since we end up with even more comparisons than we started with.
Enter Nicolaas de Bruijn.
De Bruijn's theorem will help us transform this problem into a mathematical formula. It states that, given an alphabet A with k elements, there exists at least one cyclic sequence C within which every possible sub-sequence of length n over A appears exactly once. It even comes with a methodology to build one.
Such a cyclic sequence can become terribly large. We are lucky that, for computers, A is just a 2-element alphabet : bits 1 & 0. But we still need to manage n.
We'll do so by keeping only the lowest set bit of the XOR'ed difference vector. This can be achieved thanks to a simple trick :
LowestBitOne = difference & -difference;
It works because we are in a two's complement world. To summarize, given a value (i), its negative (-i) is obtained by inverting all bits and then adding 1. As a consequence, in (i & -i) all bits cancel to zero (since 1 & 0 = 0) except the lowest set bit, thanks to the +1 carry.
Now only the lowest set bit remains, which means we have only 32 possible values to consider, and they are all powers of 2 (2^N).
Thanks to De Bruijn's theorem, we can now map these 32 values into a small table of size 32.
We will create a De Bruijn bit sequence covering all values from 0 to 31 (n=5). Since multiplying by 2^N is equivalent to left-shifting by N, the analogy with De Bruijn becomes obvious : we want a bit sequence which, when shifted left bit by bit, produces every possible 5-bit value exactly once.
This image provides a construction method to build such a bit sequence, based on Hamiltonian path. It's a bit complex, so here is one such solution for 5 bits :
00000111011111001011010100110001
In theory, the sequence is supposed to be cyclic, but since we will only shift left, we fill the higher bits with 0 in order to "simulate" cyclic behavior.
It might not look obvious that all values from 0 to 31 are present in this chain, so you can also try it for yourself to verify it.
The value produced by shifting the above De Bruijn sequence uniquely identifies the shift which was used. So it provides the result we are looking for with a simple table lookup :
DeBruijnIndex = (0x077CB531 * LowestBitOne) >> 27;
Result = DeBruijnTable[DeBruijnIndex];
And here we are, we have changed 2 comparisons into 3 mathematical operations and one table lookup. Is that a gain ?
Well, according to the latest LZ4 benchmark, it seems it is. Gains vary from negligible to measurable, but they are always positive, so on average it seems a step in the right direction.
This trick is now integrated into LZ4 r41 and later versions.
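Putting the pieces together, here is a minimal self-contained sketch of the resulting comparison function. It assumes little-endian byte order and readable 32-bit words, ignores alignment and strict-aliasing concerns for clarity, and uses the standard lookup table associated with the 0x077CB531 constant; each bit rank is divided by 8 to obtain a byte count :

#include <stdint.h>
#include <stddef.h>

/* Rank of the lowest set bit, via the De Bruijn multiplication trick */
static const int DeBruijnBitPosition[32] = {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

/* Count how many leading bytes are identical between two sequences,
   comparing at most 'limit' bytes. */
static size_t count_common_bytes(const uint8_t* bytePos, const uint8_t* comparePos, size_t limit)
{
    size_t len = 0;
    while (len + 4 <= limit) {
        uint32_t difference = *(const uint32_t*)(bytePos + len) ^ *(const uint32_t*)(comparePos + len);
        if (difference) {
            uint32_t lowestBitOne = difference & (0u - difference);   /* same as difference & -difference */
            int rank = DeBruijnBitPosition[(lowestBitOne * 0x077CB531u) >> 27];
            return len + (rank >> 3);      /* 8 identical bits per identical byte */
        }
        len += 4;
    }
    while (len < limit && bytePos[len] == comparePos[len]) len++;     /* last few bytes */
    return len;
}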