Thursday, May 26, 2011

LZ4 explained

By popular request, this post explains the inner workings of LZ4, so that any programmer can develop their own version, potentially in a language other than the one provided on Google Code (which is C).

The most important design principle behind LZ4 has been simplicity. Simplicity makes for easy code and fast execution.

Let's start with the compressed data format.

The compressed block is composed of sequences.
Each sequence starts with a token.
The token is a one-byte value, split into two 4-bit fields (each therefore ranging from 0 to 15).
The first field uses the 4 high bits of the token, and indicates the length of literals. If it is 0, there are no literals. If it is 15, then more bytes are needed to indicate the full length. Each additional byte represents a value from 0 to 255, which is added to the previous value to produce the total length. When a byte has the value 255, another byte follows.
Any number of such bytes can follow the token; there is no "size limit". As a side note, this is the reason why a non-compressible input block can be expanded by up to 0.4%.
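As a concrete illustration, here is a minimal sketch, in C (the language of the reference version), of how such a length is decoded from a 4-bit token field and its optional continuation bytes. The function name and pointer convention are illustrative, not taken from the reference source.

    /* Decode a length field : start from the 4-bit token value, and while it
       saturates at 15, keep adding the following bytes; a byte of 255 means
       another byte follows. Bounds checks are omitted for clarity. */
    static unsigned read_length(unsigned tokenField, const unsigned char** ip)
    {
        unsigned length = tokenField;
        if (tokenField == 15) {
            unsigned char b;
            do {
                b = *(*ip)++;          /* each additional byte adds 0..255 */
                length += b;
            } while (b == 255);        /* 255 : one more byte to read */
        }
        return length;
    }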

Following the token and the optional literal length bytes come the literals themselves. Literals are uncompressed bytes, to be copied as-is.
There are exactly as many of them as the literal length just decoded. It is possible that there are zero literals.

Following the literals is the offset. This is a 2-byte value, between 0 and 65535, stored in "little endian" format. It represents the distance back to the start of the match to be copied. Note that 0 is an invalid value, never used. 1 means "current position - 1 byte". Since 65536 cannot be coded, the maximum offset value is really 65535.

Then we need to extract the matchlength. For this, we use the second token field, a 4-bit value, from 0 to 15. There is a base length to add, which is the minimum length of a match, called minmatch. This minimum is 4. As a consequence, a value of 0 means a match length of 4 bytes, and a value of 15 means a match length of 19+ bytes.
As with the literal length, on reaching the highest possible value (15), additional bytes follow, one at a time, with values ranging from 0 to 255. They are added to the total to produce the final matchlength. A value of 255 means there is another byte to read and add. There is no limit to the number of optional bytes that can be output this way (which points towards a maximum achievable compression ratio of ~250).

With the offset and the matchlength, the decoder can now proceed to copy the repetitive data from the already decoded buffer. Note that it is necessary to pay attention to overlapping copies, when matchlength > offset (typically when there are many consecutive zeroes).

With the matchlength decoded, we reach the end of the sequence, and another one begins.
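Putting the pieces together, a whole sequence can be decoded in a few lines of C. This is only a sketch built on the read_length() helper above : names are illustrative, and the bounds checks required in any real decoder are omitted.

    #include <string.h>

    static void decode_sequence(const unsigned char** ipPtr, unsigned char** opPtr)
    {
        const unsigned char* ip = *ipPtr;
        unsigned char* op = *opPtr;

        unsigned token = *ip++;
        unsigned litLength = read_length(token >> 4, &ip);   /* high 4 bits : literal length */

        memcpy(op, ip, litLength);                           /* literals are copied as-is */
        ip += litLength; op += litLength;
        /* (the last sequence of a block stops here, right after the literals) */

        unsigned offset = ip[0] + (ip[1] << 8);              /* 2 bytes, little-endian, 1..65535 */
        ip += 2;

        unsigned matchLength = read_length(token & 15, &ip) + 4;   /* low 4 bits + minmatch (4) */

        const unsigned char* match = op - offset;            /* copy from already-decoded data */
        while (matchLength--) *op++ = *match++;              /* byte-by-byte : safe even when matchLength > offset */

        *ipPtr = ip; *opPtr = op;
    }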

Graphically, the sequence looks like this :

[Figure] A sequence : token (1 byte) | literal length bytes (0+) | literals (0+ bytes) | offset (2 bytes) | matchlength bytes (0+)



Note that the last sequence stops right after the literals field.

There are specific parsing rules to respect in order to remain compatible with the reference decoder :
1) The last 5 bytes are always literals
2) The last match cannot start within the last 12 bytes
Consequently, a file with fewer than 13 bytes can only be represented as literals.
These rules are in place to benefit speed and to ensure buffer limits are never crossed.
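On the compression side, these rules translate into a simple guard before emitting a match. The constant names below are illustrative; only the values (5 and 12) come from the rules above.

    #define LASTLITERALS  5    /* rule 1 : the last 5 bytes are always encoded as literals */
    #define MFLIMIT      12    /* rule 2 : no match may start within the last 12 bytes */

    /* A match starting at 'pos' is allowed only far enough from the end of the
       block; otherwise the remaining bytes are emitted as literals. */
    static int match_allowed(int pos, int srcSize)
    {
        return pos < srcSize - MFLIMIT;
    }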

Regarding the way LZ4 searches and finds matches, note that there is no restriction on the method used. It could be a full search, using advanced structures such as MMC, BST or standard hash chains, a fast scan, a 2D hash table, or whatever else. Advanced parsing can also be achieved while respecting full format compatibility (as LZ4-HC typically does).

The "fast" version of LZ4 hosted on Google Code uses a fast scan strategy, which is a single-cell wide hash table. Each position in the input data block gets "hashed", using the first 4 bytes (minmatch). Then the position is stored at the hashed position.
The size of the hash table can be modified while respecting full format compatibility. For restricted memory systems, this is an important feature, since the hash size can be reduced to 12 bits, or even 10 bits (1024 positions, needing only 4K). Obviously, the smaller the table, the more collisions (false positive) we get, reducing compression effectiveness. But it nonetheless still works, and remain fully compatible with more complex and memory-hungry versions. The decoder do not care of the method used to find matches, and requires no additional memory.
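A minimal sketch of this strategy in C is shown below. The multiplicative hash and its constant are just one reasonable choice, not necessarily the exact ones used in the reference source, and HASH_LOG can be lowered for memory-restricted systems as discussed above.

    #include <stdint.h>
    #include <string.h>

    #define HASH_LOG  12                       /* 4096 cells; reduce to 10 for a 4K table */
    #define HASH_SIZE (1 << HASH_LOG)

    static int hashTable[HASH_SIZE];           /* one cell per hash : last seen position */

    static uint32_t hash4(const uint8_t* p)
    {
        uint32_t v;
        memcpy(&v, p, 4);                      /* hash the first 4 bytes (minmatch) */
        return (v * 2654435761u) >> (32 - HASH_LOG);
    }

    /* Returns a candidate position to verify (it may be a collision),
       and stores the current position in its place. */
    static int find_candidate(const uint8_t* base, int pos)
    {
        uint32_t h = hash4(base + pos);
        int candidate = hashTable[h];
        hashTable[h] = pos;
        return candidate;
    }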


Note : the format above describes the content of an LZ4 compressed block. It is the raw compression format, with no additional feature, and is intended to be integrated into a program which wraps it in its own custom envelope information.
If you are looking for a portable and interoperable format, which can be understood by other LZ4-compatible programs, you'll have to look at the LZ4 Framing format. In a nutshell, the framing format allows the compression of large files or data streams of arbitrary size, and organizes data into a flow of smaller compressed blocks with (optionally) verified checksums.

Saturday, May 21, 2011

ZhuffMax parsing improvements

 As a small attempt to improve parsing, I've made ZhuffMax's lazy matcher search a bit deeper. This resulted in some small gains, but also in new problems.

By searching deeper for potentially better matches, we indeed increase our chances of finding them.
However, in some circumstances, we end up with a series of matches of increasing lengths. At some point, the number of bytes skipped by this recursive search may become so large that we would have been better off using the first match to begin with.

In order to counterbalance this side effect, we have to keep track of a history of matches, and try to re-use them if we have skipped too many bytes in the search process. It's not complex, and just requires a bit of local memory.
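Since Zhuff's sources are not published, here is only an illustrative sketch of the rule just described : remember the matches encountered during the deeper search, and fall back to the first one if the search has skipped too many bytes. The structure, names, and the skip budget are all assumptions.

    typedef struct { int pos; int len; int offset; } Match;

    #define MAX_SKIP 8   /* hypothetical budget of skipped bytes */

    /* history[0] is the first match found; history[count-1] is the longest,
       latest one found by the deeper lazy search. */
    static Match select_match(const Match history[], int count)
    {
        int skipped = history[count-1].pos - history[0].pos;
        if (skipped > MAX_SKIP)
            return history[0];        /* too many bytes skipped : re-use the first match */
        return history[count-1];      /* otherwise keep the better, later match */
    }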

With both mechanisms active, we end up with a small but measurable compression gain. Nothing dramatic, but still, since ZhuffMax is about learning advanced parsing techniques, it's the right place to implement this improvement.
Deeper searches mean slower speed though, so the gain comes at some cost.


Benchmark platform : Core 2 Duo E8400 (3 GHz), Windows Seven 32-bit

version         threads   Compression Ratio   Speed       Decoding
ZhuffMax 0.2    -t1       2.825               14.1 MB/s   325 MB/s
ZhuffMax 0.2    -t2       2.825               28.0 MB/s   640 MB/s


You can get the new version of ZhuffMax at Zhuff's Homepage.

Saturday, May 14, 2011

Long Range Failure

 The next natural step in evaluating MMC would be to test it with large dictionary sizes. Although this can be somewhat "emulated" by simply extending the dictionary size within LZ4HC, it only provides an evaluation of speed, not of compression gains, since the LZ4 output format is fixed to a 64KB window.

I therefore created a derivative of Zhuff, using an "infinite length" offset format instead of the fixed 64KB one. The change proved relatively painless, keeping almost the same output size under the 64KB offset restriction. I then proceeded to increase the dictionary size.

This quickly resulted in a worse compression ratio.
I thought my implementation was wrong, but in fact this can be fairly well explained : longer ranges mean larger offsets, which need more bits to compress. As a consequence, if I keep the minimum match length unmodified, the data compresses less well, since more bits are required to describe the larger offsets.

The work-around to this problem might seem easy : never use a large offset for a small match length.
Yes, but what is the optimal threshold value ?
That's where problems begin.

It was quick to discover that optimal threshold values differ from one file to another. In one specific example, using enwik8 as the input file, the optimal minimum match length ended up being 5 characters, even for very small offsets. Most files don't go to such extremes, but nonetheless, the problem is identified : the optimal minimum number of characters at a given offset differs from one file to another.

A cheap way to achieve "some kind of result" would be to heuristically settle on some "median" value, good enough for most files. This would provide a fair improvement over no guard at all.
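For illustration, such a guard could look like the small function below. The threshold values are purely hypothetical "median" placeholders, not the ones Zhuff would actually use.

    /* Require longer matches before accepting larger (more expensive) offsets. */
    static unsigned min_match_for_offset(unsigned offset)
    {
        if (offset < (1u << 16)) return 4;   /* short range : 4 bytes is enough      */
        if (offset < (1u << 20)) return 5;   /* longer offsets cost more bits ...    */
        return 6;                            /* ... so demand longer matches         */
    }

    /* a match is kept only if matchLength >= min_match_for_offset(offset) */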

Fine. But I'm somewhat disappointed with this. I wish I could find something more dynamic, automatically leaning towards a "more optimal" threshold for each file.
Indeed, methods to automatically decide whether a found match is worth coding do exist : "optimal parsing", and its lesser cousin "flexible parsing".

"Optimal parsing" is a very costly operation, in which all positions are searched, their result stored, and a "least distance path" algorithm applied. Flexible parsing do the same, but to a lesser degree, using less distance storage, eventually skipping some positions too.
This is complex enough as is, but there is more to it : in order to calculate distance, we must apply a "cost" to each branch in the graph, which is only possible if we are able to encode the branch cost during evaluation. As long as we keep the coding "forward", this is almost okay for adaptive entropy coders. However, this is a no go for my "block based" implementations, which code the offset "after" LZ parsing.
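To make the idea more concrete, here is a rough sketch of optimal parsing as a least-cost path search. Everything in it is illustrative : cost_literal(), cost_match() and find_best_match() are hypothetical placeholders, and the difficulty mentioned above is precisely that such cost functions are hard to provide when the entropy stage runs after parsing.

    #include <stdint.h>

    #define INF 0x7FFFFFFF
    #define MINMATCH 4

    typedef struct { int len; int offset; } Match;

    /* hypothetical helpers, standing in for the real encoder */
    extern int   cost_literal(uint8_t c);              /* bits to encode one literal */
    extern int   cost_match(int length, int offset);   /* bits to encode a match     */
    extern Match find_best_match(const uint8_t* src, int pos, int srcSize);

    static void optimal_parse(const uint8_t* src, int srcSize, int* cost, Match* choice)
    {
        for (int i = 0; i <= srcSize; i++) cost[i] = INF;
        cost[0] = 0;

        for (int pos = 0; pos < srcSize; pos++) {
            if (cost[pos] == INF) continue;            /* unreachable position */

            /* option 1 : encode a literal */
            int c = cost[pos] + cost_literal(src[pos]);
            if (c < cost[pos+1]) { cost[pos+1] = c; choice[pos+1] = (Match){1, 0}; }

            /* option 2 : encode a match, at any acceptable length */
            Match m = find_best_match(src, pos, srcSize);
            for (int len = MINMATCH; len <= m.len; len++) {
                int cm = cost[pos] + cost_match(len, m.offset);
                if (cm < cost[pos+len]) { cost[pos+len] = cm; choice[pos+len] = (Match){len, m.offset}; }
            }
        }
        /* the selected parse is recovered by walking 'choice' backward from srcSize */
    }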

So that's where I stand today. It seems pointless to proceed with larger dictionary sizes without at least some kind of advanced parsing. Naive "greedy parsing" no longer brings automatic benefits once offset costs grow with range.

Wednesday, April 27, 2011

ZhuffMax

This program looks more and more like a Pepsi brand with each passing day...

 In an earlier note, I presented the basics of advanced parsing strategies, a method which grabs a few more percentage points of compression ratio in exchange for a massive number of CPU cycles. Yes, let's face it, from a pure "compression/speed" perspective, the trade-off is debatable.

However, advanced parsing comes with an excellent property : it keeps the data format intact, meaning the output can be decoded normally, by any compression program using the same format. Decoding speed is barely affected, and when it is, the effect is likely to be positive. In situations where compression time does not matter but decompression speed does, this trade-off is acceptable.

To provide a living example of this strategy, I created ZhuffMax, a simple extension of ZhuffHC using the first form of advanced parsing, lazy matching. For this example, I used a recursive level-1 evaluation, meaning that at each byte, I also look for a match starting at the next byte, select the new one if it is better, and then restart the process.
There is a definite positive impact on compression ratio. At a couple of percentage points, it's typical of what can be achieved with a lazy strategy.
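Since ZhuffMax itself is not open source, here is only a hedged sketch of the decision rule just described. The Match type and the find_match(), emit_literal() and emit_match() helpers are assumed for the example.

    typedef struct { int len; int offset; } Match;

    extern Match find_match(const unsigned char* src, int pos, int srcSize);
    extern void  emit_literal(unsigned char c);
    extern void  emit_match(int len, int offset);

    static void lazy_parse(const unsigned char* src, int srcSize)
    {
        int pos = 0;
        while (pos < srcSize) {
            Match cur = find_match(src, pos, srcSize);
            if (cur.len < 4) { emit_literal(src[pos]); pos++; continue; }

            /* level-1 lazy evaluation : is the match starting one byte later better ? */
            while (pos + 1 < srcSize) {
                Match next = find_match(src, pos + 1, srcSize);
                if (next.len <= cur.len) break;   /* no : keep the current match       */
                emit_literal(src[pos]);           /* yes : shift by one literal ...    */
                pos++;
                cur = next;                       /* ... and evaluate again from there */
            }
            emit_match(cur.len, cur.offset);
            pos += cur.len;
        }
    }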



version         threads   Compression Ratio   Speed       Decoding
Zhuff 0.7       -t2       2.584               285 MB/s    550 MB/s
ZhuffHC 0.1     -t2       2.781               55.6 MB/s   618 MB/s
ZhuffMax 0.1    -t2       2.822               35.0 MB/s   640 MB/s


The impact on compression speed, on the other hand, is pretty massive : 60% slower.
Note the positive impact on decompression speed, since we are once again promoting larger matches.

Such a trade-off is unlikely to be interesting in a symmetric scenario, but there are situations where asymmetry is the rule. Imagine for example a communication between a large, powerful server and a small embedded sensor : the superior compression ratio translates into less data exchanged, which means less battery drained on the sensor side. Another typical situation is the "distribution" process, where a file is compressed once, then downloaded and decoded many times afterwards. In such situations, the extra % of compression ratio is welcome, even if it costs more compression power.

ZhuffMax does not yet completely deserve its name, since there are stronger (and slower) parsing strategies available (namely optimal parsing). But if I do find time to implement new parsing concepts, I will implement them in ZhuffMax. No miracle should be expected : even optimal parsing can only bring a few more % of compression, so we are already close to the maximum.


You can grab ZhuffMax at Zhuff's homepage.

As a side note, maybe I should consider merging all these versions of Zhuff together. After all, they all decode the same format...

Sunday, April 24, 2011

LZ4 source code now Opened

 The "great" thing with all those statistics tools provided by Google is that you can know exactly when, where and why people got to consult your website. Call me "big brother"...
In this particular example, the most frequent Google request leading to this website has been "LZ4 source code" for quite a few weeks. So at least it shows a genuine interest for it.

Well, if you have been typing these few words lately, the wait is over : LZ4 is now an Open Source project, hosted on Google Code.

The version proposed is the "Ultra Fast" one, ready for multi-threading purposes. It has been purged of all Windows-specific calls, to make it easier to build LZ4 for any other system using a standard C compiler.
An example "main.c" is also provided, featuring a simple file I/O interface. Those willing to start from a ready-to-use Visual Studio project will find one in the "download" section.

Well, hope this helps. Now it's a matter of warming up your compilers and test suites...

[Edit] License has been changed to permissive BSD-style

Thursday, April 21, 2011

Zhuff + MMC = Zhuff HC

 What can you get by marrying MMC with Zhuff ? Well, a stronger version, called ZhuffHC.

MMC is an improved implementation of a full match search function. It always provides the best possible match length within a given search window; in the case of Zhuff, this window is 64K.
MMC is much faster than a simple hash chain algorithm, and requires no "early end" trick to keep pace on badly crafted files. The result it provides is guaranteed to be the best possible, so it is quite the opposite of a fast scan strategy.

Integrating MMC into Zhuff has been refreshingly simple. It just required removing the fast scan match finder and replacing it with a single call to the MMC match finder. In the process, I decided to slightly simplify the MMC interface, and a newer version has been uploaded to Google Code. The differences are very minor, but they should make the code easier to read.

ZhuffHC inherits all the benefits of regular Zhuff, including multi-threading, core-count detection, drag'n'drop support, benchmark mode, and so on. It improves the compression ratio by a fair amount (+8%). That's not as large an improvement as it was for LZ4HC, which means that the Huffman encoding stage and the improved sampling were already providing a nice compression boost, eating into the potential gains of a full search.
All this comes at a steep price in compression speed though. At just 28 MB/s per core, it runs at a fraction of the speed of its older brother.

An important point is that both versions, Zhuff and ZhuffHC, are fully compatible : it is in fact exactly the same format; only the search time differs.

As an interesting side effect, decoding speed is improved on files compressed by ZhuffHC (which remain decodable by Zhuff too). This is because the more compressed output consists of fewer but longer matches. With fewer copy operations to handle, the decoder works 13% faster.




version         threads   Compression Ratio   Speed       Decoding
Zhuff 0.7       -t1       2.584               147 MB/s    278 MB/s
Zhuff 0.7       -t2       2.584               285 MB/s    550 MB/s
ZhuffHC 0.1     -t1       2.781               28.3 MB/s   312 MB/s
ZhuffHC 0.1     -t2       2.781               55.6 MB/s   618 MB/s


You can grab the new version here.

Sunday, April 10, 2011

A look into zip

 If there is one compression format which is well known and popular, it is Zip. No other has reached such a presence.

Zip was invented by Phil Katz, creator of PKWare, in 1989. Yes, that was 22 years ago. The format was made freely available to anyone, an unusual move in those days. This openness was key to Zip's popularity. It gave other companies, such as WinZip, the opportunity to create their own version of Zip and distribute it without the cost, hassle and risk of licensing. As a consequence, Zip quickly became the "default" compression format for PC systems, and beyond.

Later, an open-source version of the algorithm was created by Jean-Loup Gailly : gzip for the program, and zlib for the library. Zlib is probably one of the most widely deployed libraries today, being used in a mind-blowing variety of devices and systems.

All these programs have something in common, they share the same compression format, called deflate.

Deflate, the core of Zip files, is now well defined, being described in IETF RFC 1951. Such standardization gives great credibility to the format, which is considered reliable enough for anyone to build a program on top of the specification.

And indeed, the definition is clear enough. Well, at least to anyone with a bit of knowledge of entropy coding, more specifically Huffman coding.

The driving idea behind this analysis was to assess whether the Zhuff algorithm could be re-used in a way that would produce zip-compatible data streams, at Zhuff's speed. The quick answer is "not easily", since the differences are not negligible.

The very first property of Zip streams is that they entangle all information (literals, offsets, lengths) in a single stream. This stream, however, uses two Huffman trees : one for offsets, and another one for literals, match lengths, and the "end of block" marker.
Consequently, the more complex tree, handling three types of values, has more than 256 elements. The decoder must check whether a decoded value is a literal, a match length (which triggers reading an offset), or an end-of-block (which triggers a new tree description, or the end of the file).
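The decoding loop therefore looks roughly like the fragment below (see RFC 1951 for the exact base/extra-bits tables). decode_symbol(), read_extra_bits() and the surrounding helpers are assumed placeholders for a real Huffman decoder, not actual zlib API calls.

    for (;;) {
        int sym = decode_symbol(litlen_tree);          /* literal / length / end-of-block alphabet */
        if (sym < 256) {
            output_byte((unsigned char)sym);           /* 0..255 : a plain literal                 */
        } else if (sym == 256) {
            break;                                     /* end-of-block marker                      */
        } else {                                       /* 257..285 : a match length ...            */
            int length = base_length[sym - 257] + read_extra_bits(length_extra[sym - 257]);
            int dsym   = decode_symbol(dist_tree);     /* ... which triggers the offset tree       */
            int dist   = base_dist[dsym] + read_extra_bits(dist_extra[dsym]);
            copy_match(length, dist);                  /* copy from already-decoded data           */
        }
    }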

This is in stark contrast with Zhuff, which relies on 5 separate data streams, resulting in very tight loops. Each stream is decoded very fast, thanks to Huff0. More streams, but faster ones.

The complex part of the format is the header tree description, which is itself Huffman-compressed after being transformed using specific codes for repeated patterns and runs of zeroes. It looks pretty hard at first, but since the description is complete, it can nonetheless be implemented without ambiguity.

Now, since the changes to Zhuff (and, by consequence, to Huff0) would be quite substantial, I wonder whether this exercise is worth attempting. After all, who needs a Zip at Zhuff's speed ?

Tuesday, April 5, 2011

Comparing compressors : new ranking system

(Note : a new page is dedicated to benchmark results, with updated graph)
In the previous note regarding CompressionRatings results, a new method to compare compression programs was suggested.
The main idea is that the relative "ranking" of a compressor may depend on the "speed" we want it to perform at. This representation is more complex, since a compressor is no longer simply "better" or "worse" than another one, as in a simple ladder system.

With the ranking now also dependent on the target speed, we need something better : a 2D graph has been proposed to visualize the relative performance of several compression programs. Although the proposed scenario and the figures it produces succeed in designating a winner for every speed, the unfortunate side effect is that the differences become visually small at slower speeds, since the timings to be represented become extremely large.

Therefore, i've been trying to produce a new ranking system, visually clearer, based on the same input and the same methodology.
We keep the initial scenario (compress a file, send it over a bandwidth-limited I/O channel, then decode the file). Total timings are calculated and compared.
The main new idea is that, whatever the selected I/O speed, the ranking always ends up within a fixed range (say between 0 and 100), with the "fastest" alternative always represented at the top of the chart.
All other compressors follow, at a relative distance which depends on the time difference, effectively magnifying the differences between compressors for better clarity, while keeping the ranking intact.
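The exact formula behind the graphs is not given here, so the following self-contained C sketch only illustrates one possible scoring rule consistent with the description : compute the total time (compression + transfer + decompression) for each program at a given I/O speed, give 100 points to the fastest, and scale the others by their time ratio. The figures are taken from the Zhuff tables published on this blog.

    #include <stdio.h>

    typedef struct { const char* name; double ratio; double cSpeed; double dSpeed; } Codec; /* MB/s */

    /* total time to compress, transfer, and decompress 1 MB of original data */
    static double total_time(const Codec* c, double ioSpeed)
    {
        return 1.0 / c->cSpeed + (1.0 / c->ratio) / ioSpeed + 1.0 / c->dSpeed;
    }

    int main(void)
    {
        Codec codecs[] = {
            { "Zhuff 0.7 -t2",    2.584, 285.0, 550.0 },
            { "ZhuffHC 0.1 -t2",  2.781,  55.6, 618.0 },
            { "ZhuffMax 0.2 -t2", 2.825,  28.0, 640.0 },
        };
        double ioSpeed = 6.0;   /* MB/s : one point on the horizontal axis */

        double best = total_time(&codecs[0], ioSpeed);
        for (int i = 1; i < 3; i++) {
            double t = total_time(&codecs[i], ioSpeed);
            if (t < best) best = t;
        }
        for (int i = 0; i < 3; i++)   /* fastest gets 100, others scale down with their time ratio */
            printf("%-18s score = %5.1f\n", codecs[i].name, 100.0 * best / total_time(&codecs[i], ioSpeed));
        return 0;
    }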

The result is a series of lines, which produce the following diagram :

[Graph : relative ranking of each compressor as a function of I/O speed]

The ranking is now much clearer to observe.

Zhuff takes the lead on a broad segment, from ~6 MB/s to ~75 MB/s.
Below 6 MB/s, stronger compression alternatives win the competition, with 4x4 (-3) briefly leading in the 3-5 MB/s range, then losing to Nanozip (-cd) and FreeArc (-m2), which are roughly equivalent.

A new contender also enters the fray, with QuickLZ effectively re-benchmarked using a more favorable strategy. With these new figures, it takes the lead in terms of raw speed. However, remember that at such high speeds, the RAM drive bandwidth is saturated. Therefore this tells more about the I/O code than about raw compression performance, which is supposed to be the objective of this comparison.

In order to partly compensate for the I/O saturation effect, we can use another metric provided by the same benchmark : the CPU occupation ratio. When I/O is saturated, this value can be used to extrapolate what the result could have been without saturation. In order not to give an unfair advantage to this strategy, the extrapolation only ramps up to 350% (instead of 400% for a perfect occupation ratio).
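The exact extrapolation formula is not spelled out, but one plausible reading is sketched below, as an assumption only : scale the measured speed by the ratio between the 350% cap and the observed CPU occupation.

    /* measuredSpeed in MB/s, cpuOccupationPct as reported by the benchmark (up to 400%) */
    static double extrapolate_speed(double measuredSpeed, double cpuOccupationPct)
    {
        if (cpuOccupationPct >= 350.0) return measuredSpeed;   /* not I/O-bound enough to adjust */
        return measuredSpeed * (350.0 / cpuOccupationPct);     /* ramp up to 350%, not 400%      */
    }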

It produces the following output :

[Graph : same ranking, with the CPU occupation extrapolation applied]

Well, this is mostly the same graph as the previous one, with just LZ4 and QuickLZ swapped.
This is expected, since only 3 programs qualify as saturating the I/O RAM drive (QuickLZ, Zhuff and LZ4). LZ4's new lead comes from its extremely fast decoding speed, which is 2 to 3 times faster than QuickLZ's. This can be properly represented thanks to the CPU occupation extrapolation, whereas I/O saturation virtually erases this performance difference.

That being said, it is also worth observing that Qpress, the I/O code for QuickLZ, does a very good job of extracting more bandwidth from the saturated RAM drive. Although arguably not strictly compression-related code, it is nonetheless a good inspiration for future improvements.

[Update] : a new benchmark page is now proposed on this blog, with an updated and improved graph.