Wednesday, April 27, 2011

ZhuffMax

This program looks more and more like a Pepsi brand each passing day...

In an earlier note, I presented the basics of advanced parsing strategies, a method which grabs a few more percentage points of compression ratio in exchange for a massive number of CPU cycles. Yes, let's face it: from a pure compression/speed perspective, the trade-off is debatable.

However, advanced parsing comes with an excellent property: it keeps the data format intact, meaning the output can be decoded normally by any compression program using the same format. Decoding speed is barely affected, and when it is, it is likely to be affected positively. In situations where compression time does not matter but decompression speed does, this trade-off is acceptable.

In a bid to provide a living example of this strategy, I created ZhuffMax, a simple extension of ZhuffHC using the first form of advanced parsing: lazy matching. For this example, I used a recursive level-1 evaluation: at each byte, I also evaluate the match starting at the next byte, select that one if it is better, and repeat the process.
There is definitely a positive impact on compression ratio. At a couple of percentage points, it is typical of what can be achieved with a lazy strategy.
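The level-1 lazy evaluation described above can be sketched as follows. This is a toy illustration of mine, not Zhuff's actual code: a brute-force match finder stands in for the real one, and a match is judged purely by its length.

```python
MIN_MATCH = 4

def find_longest_match(data, pos, window=65536):
    """Brute-force search for the longest match before `pos` (illustrative only)."""
    best_len, best_off = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length > best_len:
            best_len, best_off = length, pos - cand
    return best_len, best_off

def lazy_parse(data):
    """Emit ('lit', byte) and ('match', offset, length) tokens, deferring a match
    by one byte whenever the next position offers a strictly longer one."""
    tokens, p = [], 0
    while p < len(data):
        length, off = find_longest_match(data, p)
        if length < MIN_MATCH:
            tokens.append(('lit', data[p]))
            p += 1
            continue
        # level-1 lazy evaluation: is the match starting at p+1 better?
        next_len, _ = find_longest_match(data, p + 1)
        if next_len > length:
            tokens.append(('lit', data[p]))   # defer: re-evaluate at p+1
            p += 1
        else:
            tokens.append(('match', off, length))
            p += length
    return tokens
```

Since the loop simply re-runs the comparison at the new position after deferring, the "start again" recursion of the text comes for free.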



version      | threads | Compression Ratio | Speed     | Decoding
Zhuff 0.7    | -t2     | 2.584             | 285 MB/s  | 550 MB/s
ZhuffHC 0.1  | -t2     | 2.781             | 55.6 MB/s | 618 MB/s
ZhuffMax 0.1 | -t2     | 2.822             | 35.0 MB/s | 640 MB/s


The impact on compression speed, on the other hand, is pretty massive: 60% slower.
Note the positive impact on decompression speed, since we are once again promoting larger matches.

Such a trade-off is unlikely to be interesting in a symmetric scenario, but there are situations where asymmetry is the rule. Imagine, for example, a communication between a large, powerful server and a small embedded sensor: the superior compression ratio translates into less data exchanged, which means less battery used by the sensor. Another typical situation is the "distribution" process, where a file is compressed once, then downloaded and decoded many times afterwards. In such situations, the extra percentage points of compression ratio are welcome, even at the cost of more compression power.

ZhuffMax does not yet completely deserve its name, since stronger (and slower) parsing strategies exist (namely optimal parsing). But if I do find time to implement new parsing concepts, I will integrate them into ZhuffMax. No miracle should be expected: even optimal parsing can bring only a few more percent of compression, so we are already close to the maximum.


You can grab ZhuffMax at Zhuff's homepage.

As a sidenote, maybe I should consider merging all these versions of Zhuff together. After all, they all decode the same format...

Sunday, April 24, 2011

LZ4 source code now Opened

 The "great" thing with all those statistics tools provided by Google is that you can know exactly when, where and why people got to consult your website. Call me "big brother"...
In this particular example, the most frequent Google request leading to this website has been "LZ4 source code" for quite a few weeks. So at least it shows a genuine interest for it.

Well, if you have been typing these few words lately, the wait is over : LZ4 is now an Open Source project, hosted on Google Code.

The version proposed is the "Ultra Fast" one, ready for multi-threading purposes. It has been purged of all Windows-specific calls, making it possible to build LZ4 for any other system using a standard C compiler.
An example "main.c" is also provided, featuring a simple file I/O interface. Those willing to start from a ready-to-use Visual Studio project will find one in the "download" section.

Well, hope this helps. Now it's a matter of heating your compilers and test suites...

[Edit] License has been changed to permissive BSD-style

Thursday, April 21, 2011

Zhuff + MMC = Zhuff HC

What do you get by marrying MMC with Zhuff? Well, a stronger version, called ZhuffHC.

MMC is an improved implementation of a full match search function. It always provides the best possible match length within a given search window. In the case of Zhuff, this window is 64K.
MMC is much faster than a simple hash chain algorithm, and requires no "early end" trick to keep pace on badly crafted files. The result it provides is guaranteed to be the best possible, so it is quite the opposite of a fast-scan strategy.
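For reference, the baseline MMC improves upon, a classic hash-chain match finder walking every candidate inside the window, can be sketched like this (an illustration of mine with made-up parameter values, not MMC's or Zhuff's actual code):

```python
WINDOW = 65536
MIN_MATCH = 4

def _hash4(data, p):
    # Hash the 4 bytes at position p into a 16-bit bucket
    return (int.from_bytes(data[p:p+4], 'little') * 2654435761) % (1 << 16)

class HashChainFinder:
    def __init__(self, data):
        self.data = data
        self.head = {}   # hash -> most recent position
        self.prev = {}   # position -> previous position with the same hash

    def insert(self, p):
        h = _hash4(self.data, p)
        if h in self.head:
            self.prev[p] = self.head[h]
        self.head[h] = p

    def longest_match(self, p):
        """Walk the whole chain inside the window: guaranteed-best match,
        but potentially very slow on badly crafted (degenerate) input."""
        best_len, best_pos = 0, -1
        cand = self.head.get(_hash4(self.data, p), -1)
        while cand >= 0 and p - cand <= WINDOW:
            length = 0
            while p + length < len(self.data) and self.data[cand + length] == self.data[p + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, cand
            cand = self.prev.get(cand, -1)
        return (best_len, p - best_pos) if best_len >= MIN_MATCH else (0, 0)
```

The "early end" tricks the post mentions exist precisely because this chain walk can explode on repetitive data; MMC's contribution is to get the same guaranteed-best answer faster.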

Integrating MMC into Zhuff has been refreshingly simple. It just required removing the fast-scan matchfinder and replacing it with a single call to the MMC matchfinder. In the process, I decided to slightly simplify the MMC interface, a newer version of which has been uploaded to Google Code. Differences are very minor, but it should make the code easier to read.

ZhuffHC inherits all the benefits of regular Zhuff, including multi-threading, number of cores detection, drag'n'drop support, benchmark mode, and so on. It improves compression ratio by a fair amount (+8%). That's not as large an improvement as it has been for LZ4HC, which means that both the Huffman encoding stage and improved sampling were already providing a nice compression boost, eating into the potential gains of a full search.
All this comes at a steep price on compression speed though. At just 28 MB/s per core, it runs at a fraction of its older brother's speed.

An important point is that both versions, Zhuff and ZhuffHC, are fully compatible: it is in fact exactly the same format; only the search time differs.

As an interesting side effect, decoding speed is improved on compressed files produced by ZhuffHC (still decodable by Zhuff). This is because the better-compressed version consists of fewer but longer matches. With fewer copy operations to handle, the decoder works 13% faster.




version     | threads | Compression Ratio | Speed     | Decoding
Zhuff 0.7   | -t1     | 2.584             | 147 MB/s  | 278 MB/s
Zhuff 0.7   | -t2     | 2.584             | 285 MB/s  | 550 MB/s
ZhuffHC 0.1 | -t1     | 2.781             | 28.3 MB/s | 312 MB/s
ZhuffHC 0.1 | -t2     | 2.781             | 55.6 MB/s | 618 MB/s


You can grab the new version here.

Sunday, April 10, 2011

A look into zip

If there is one compression format which is well-known and popular, it is Zip. No other format has reached such ubiquity.

Zip was invented by Phil Katz, creator of PKWare, in 1989. Yes, that was 22 years ago. The format was made freely available to anyone, an unusual move back in those days. This openness was key to bringing Zip to popularity. It provided other companies, such as WinZip, the opportunity to create their own version of Zip and distribute it without the cost, hassle and risk of licensing. As a consequence, Zip quickly became the "default" compression format for PC systems, and beyond.

Later, an open-source version of the algorithm was created by Jean-Loup Gailly: gzip for the program, and zlib for the library. Zlib is probably one of the most widely deployed libraries today, being used in a mind-blowing variety of devices and systems.

All these programs have something in common, they share the same compression format, called deflate.

Deflate, the core of Zip files, is now well defined, being described in IETF RFC 1951. Such standardization lends great credibility to the format, which is considered reliable enough for anyone to build a program on top of the specification.

And indeed, the definition is clear enough. Well, at least to whoever has a bit of knowledge of entropy coding, more specifically Huffman coding.

The driving idea in starting this analysis was to assess whether the Zhuff algorithm could be re-used in a way that would produce zip-compatible data streams, at the speed of Zhuff. The quick answer is "not easily", since the differences are non-negligible.

The very first property of Zip streams is that they entangle all information (literals, offsets, lengths) into a single stream. This stream, however, uses two Huffman trees: one for offsets, and another for literals, match lengths, and the "end of block" marker.
Consequently, the more complex tree, handling three types of values, has more than 256 elements. The decoder must check whether a decoded value is a literal, a match length (triggering an offset read) or an end-of-block (triggering a new tree description, or end of file).
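The branching this implies can be sketched as follows. The symbol ranges come from RFC 1951; the `read_symbol`, `decode_length` and `read_offset` callables are stand-ins of mine for real Huffman bit readers.

```python
END_OF_BLOCK = 256

def decode_block(read_symbol, decode_length, read_offset, out):
    """Decode one deflate block: literals, match lengths and end-of-block all
    come from the same tree, so every symbol needs a range test."""
    while True:
        sym = read_symbol()
        if sym < 256:                  # 0..255: a literal byte
            out.append(sym)
        elif sym == END_OF_BLOCK:      # 256: end of block
            return
        else:                          # 257..285: a match length code
            length = decode_length(sym)
            offset = read_offset()     # offsets come from the second tree
            for _ in range(length):    # byte-wise copy handles overlapping matches
                out.append(out[len(out) - offset])
```

Every iteration pays for those range tests, which is exactly the per-symbol overhead a multi-stream design avoids.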

This is in stark contrast with Zhuff, which relies on five separate data streams, resulting in very tight loops. Each stream is decoded very fast, thanks to Huff0. More streams, but faster ones.

The complex part of the format is the header tree description, which is itself Huffman-compressed after being transformed using specific values for repeating patterns and runs of zeros. It looks pretty hard at first, but since the description is complete, it can nonetheless be implemented without ambiguity.

Now, since the changes to Zhuff (and by consequence to Huff0) would be quite substantial, I wonder if the exercise is worth trying. After all, who needs a Zip at Zhuff's speed?

Tuesday, April 5, 2011

Comparing compressors : new ranking system

(Note : a new page is dedicated to benchmark results, with updated graph)
In the previous note regarding CompressionRatings results, a new method to compare compression programs was suggested.
The main idea is that the relative "ranking" of a compressor may depend on the "speed" we want it to perform at. This representation is more complex, since a compressor is no longer purely "better" or "worse" than another one, as in a simple ladder system.

With the ranking now also dependent on the target speed, we need something better: a 2D graph has been proposed to visualize the relative performance of several compression programs. Although the proposed scenario and the figures it produces succeed in designating a winner for all speeds, the unfortunate side effect is that differences become visually small at slower speeds, since the timings to be represented grow extremely large.

Therefore, I have been trying to produce a new ranking system, visually clearer, based on the same input and the same methodology.
We keep the initial scenario (compress a file, send it over a bandwidth-limited I/O channel, then decode the file). Total timings are calculated and compared.
The main new idea is that, whatever the selected I/O speed, the ranking always ends up with a value in a fixed range (say between 0 and 100), with the "fastest" alternative constantly represented at the top of the chart.
All other compressors follow, at a relative distance which depends on the time difference, effectively magnifying differences between compressors for better clarity while keeping the ranking intact.

The result is a series of lines, producing the following diagram:

Click on graph for enlarged display

The ranking is now much easier to observe.

Zhuff  takes the lead on a broad segment from ~6 MB/s to ~75 MB/s.
Below 6 MB/s, stronger compression alternatives win the competition, with 4x4 (-3) briefly leading at the 3-5 MB/s range, then losing it to Nanozip (-cd) and FreeArc (-m2) which are roughly equivalent.

A new contender also gets into the fray, with QuickLZ effectively re-benched using a more favorable strategy. With these new figures, it takes the lead in terms of raw speed. However, remember that at such high speeds, the RAM drive bandwidth is saturated. The result therefore says more about the I/O code than about raw compression performance, which is supposed to be the objective of this comparison.

In order to partly compensate for the I/O saturation effect, we can use another metric provided by the same benchmark: the CPU occupation ratio. When I/O is saturated, this value can be used to extrapolate what performance could have been without saturation. In order not to give an unfair advantage to this strategy, the extrapolation only ramps up to 350% (instead of 400% for a perfect occupation ratio).
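The extrapolation boils down to a one-liner (my interpretation of the description, with the conservative 350% cap on a 4-core machine):

```python
def extrapolate_speed(measured_speed, cpu_percent, cap=350.0):
    """When I/O is saturated, scale the measured speed up to what the CPU
    occupation ratio suggests it could reach, conservatively capped."""
    if cpu_percent >= cap:
        return measured_speed          # already at the cap: keep the measurement
    return measured_speed * cap / cpu_percent
```

For instance, a program measured at 500 MB/s while occupying only 160% of the CPU would be credited with 500 × 350 / 160 ≈ 1094 MB/s.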

It produces the following output :

Click on graph for enlarged display

Well, this is mostly the same graph as the previous one, with just LZ4 and QuickLZ swapped.
This is expected, since only three programs qualify as saturating the RAM drive I/O (QuickLZ, Zhuff and LZ4). LZ4's new lead comes from its extremely fast decoding speed, between 2 and 3 times faster than QuickLZ's. The CPU occupation extrapolation makes this difference visible, whereas I/O saturation virtually erases it.

That being said, it is also worth observing that Qpress, the I/O code for QuickLZ, does a very good job of extracting more bandwidth from the saturated RAM drive. Although arguably not strictly compression-related code, it is nonetheless a good inspiration for future improvements.

[Update] : a new benchmark page is now proposed on this blog, with an updated and improved graph.

Friday, April 1, 2011

Zhuff crowned as best Speed-Oriented compression program

After the recent upgrade of Zhuff to multi-threading, the program has once again been evaluated by the public benchmark website CompressionRatings.

And results are fairly good : Zhuff takes first place at the Speed Oriented ranking, leading the chart by a sizable margin.
Looking at the detailed benchmark chart, the lead over previous king (the excellent 4x4 from Bulat Ziganshin) is especially telling, with +65% compression speed and +45% decoding speed. The difference would be even larger if bandwidth was not effectively limited by the RAM Drive speed. Finally, compression ratios remain fairly close, with a 4% advantage to 4x4.

As a sidenote, don't be fooled by the illusion that Zhuff and LZ4 are "about the same speed" : this result is entirely related to RAM Drive I/O saturation. Without this limitation, LZ4 would fare much better, as can be guessed from its CPU occupation ratio (only 160% to reach saturation, which means 100% for the RAM Drive driver, and "just" 60% (less than one core!) for the decoding algorithm).

Which leads us to the next question: how can we know which compressor is best suited to which speed?

Using CompressionRatings data, we can build an interesting graph from a simple file-transmission scenario: the file is compressed, sent over a limited-bandwidth pipe, then decoded. We make the simplifying assumption that there is no other delay in between. Total time is calculated and compared.

(click on graph for larger display)

Looking at the graph, Zhuff is the better choice for pipes ranging from 5 MB/s to 200 MB/s. Above 200 MB/s, LZ4 takes the lead (the difference would be much more significant without the RAM drive I/O limitation, but well, that's the data we've got). Below 5 MB/s, compression power matters more, with 4x4 becoming competitive at its -3 level, then FreeArc and Nanozip getting into the fray.
It is also worth noting that more mainstream compressors, such as winrar, gzip and 7zip, and even the parallel version pigz, do not reach the top position in this graph.

You can download the latest version of Zhuff at its homepage.

Tuesday, March 29, 2011

Ross Williams' LZRW

Once upon a time...
Well, okay, it's not that far far away, but still, the early 90's seem like a prehistoric era for personal computing by now.
Ross Williams created a fairly interesting family of LZ compressors, featuring excellent speed and memory properties while still providing a good compression ratio (by the standards of those days), which he modestly called LZRW, for Lempel-Ziv-Ross Williams.

It has since become a classic among early compression algorithms, and remains competitive even by today's standards. Indeed, it is also the basis of several modern compressors, such as QuickLZ.

The main idea behind LZRW is this: instead of providing an offset to the LZ decoder, just provide it with the pointer's position in a table.

It looks complex or crazy at first, but it is quite logical, following Ross Williams' reasoning through his successive versions of LZRW:
Back in those days, computers featuring some 64KB of RAM were still commonplace. Better ones could reach an impressive 256KB. Since table sizes were severely limited, the capability to track all positions in the buffer to find the best match was, well, mostly out of reach.

LZRW does not target the best possible compression, but a fast and practical (i.e. memory-efficient) compression level. By accepting scarcity of references, it creates "holes" in the addressable space, which translates into compression inefficiency.
Hence the main idea: by using the table position instead of the offset, there are no longer any holes in the addressable space. As long as the table is completely filled, all positions have roughly equivalent probability of appearing (well, roughly enough...). Furthermore, all matches, even long-range ones, use exactly the same number of bits to be described, making them more efficient.

The downside is that the decoder must reference the data exactly as the compressor did, making it more complex (and slower) than a more standard LZSS decoder. Still, this trade-off is quite acceptable, since the speed penalty is not too large.
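The whole idea fits in a short sketch (my own toy variant, not Ross Williams' code): a match is transmitted as a table slot index rather than an offset, so every match reference costs exactly `TABLE_BITS` bits regardless of distance, and the decoder rebuilds the very same table from the bytes it has already decoded.

```python
TABLE_BITS = 12          # every match reference costs 12 bits, near or far
MIN_MATCH = 3

def _hash3(data, q):
    v = (data[q] << 16) | (data[q + 1] << 8) | data[q + 2]
    return (v * 2654435761 >> 8) & ((1 << TABLE_BITS) - 1)

def _insert_upto(table, data, start, limit):
    # Insert every position whose MIN_MATCH bytes are already known to BOTH
    # sides (q + MIN_MATCH <= limit), keeping encoder and decoder tables in sync.
    for q in range(start, limit - MIN_MATCH + 1):
        table[_hash3(data, q)] = q
    return max(start, limit - MIN_MATCH + 1)

def compress(data):
    table, tokens, p, ins = {}, [], 0, 0
    while p < len(data):
        ins = _insert_upto(table, data, ins, p)
        length = 0
        if p + MIN_MATCH <= len(data):
            h = _hash3(data, p)
            cand = table.get(h)
            if cand is not None and data[cand:cand + MIN_MATCH] == data[p:p + MIN_MATCH]:
                length = MIN_MATCH
                while p + length < len(data) and data[cand + length] == data[p + length]:
                    length += 1
        if length:
            tokens.append(('M', h, length))   # slot index instead of offset
            p += length
        else:
            tokens.append(('L', data[p]))
            p += 1
    return tokens

def decompress(tokens):
    table, out, ins = {}, bytearray(), 0
    for tok in tokens:
        ins = _insert_upto(table, out, ins, len(out))
        if tok[0] == 'L':
            out.append(tok[1])
        else:
            _, h, length = tok
            cand = table[h]                   # same table state as the encoder
            for i in range(length):           # byte-wise copy handles overlaps
                out.append(out[cand + i])
    return bytes(out)
```

Note how `decompress` mirrors every table update the compressor performed: that mirroring is exactly the extra decoder work (and slowdown) the paragraph above describes.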

The LZRW scheme features a "best spot" for optimal compression ratio: the larger the table, the better its match-finding capability, hence a better compression ratio. However, the larger the table, the more bits required to describe a table slot to the decoder, hence a worse compression ratio.

The "best spot" is found where both forces collide.

Or is it ?
One problem is that the optimal number of bits is not the same for every file.
For example, with "enwik8", this optimal number is 14 bits.
For "open office", it is 19 bits.
For very small files, this number can go down to 10 or fewer.
Therefore, the question is: how do we find the optimal table size, the one giving the "optimal compression ratio" for any given input?

Alas, that question has no easy answer, and I have yet to find one myself...

Saturday, March 26, 2011

Best of both worlds : mixed entropy

There is one thing that really amazed me when I started experimenting with entropy coders: Huffman is very good, in spite of its relatively old age and design simplicity. So good, in fact, that it can beat the much more modern, complex and slower arithmetic coding in some circumstances.

For illustration, just have a quick look at the entropy coders benchmark. The overall tendency is relatively clear: when no probability is too "large", the precision loss of Huffman is too small to be noticeable, and results are therefore roughly equivalent to arithmetic.

Indeed, Huffman can even beat Arithmetic, and there are two good reasons for this :

- First, Huffman headers are much more compact than arithmetic ones. Since only bit lengths are necessary, they can be efficiently represented by a tree or a map, which can themselves be compressed easily. By contrast, arithmetic needs frequency counts (or normalized frequency counts), which are typically much larger values with little opportunity for optimization. As a consequence, Huffman headers are at least 4x smaller, and differences in the 6x-10x range are common.

- Second: since headers are smaller, it is also possible to work with smaller Huffman blocks. Smaller blocks mean the statistics better describe the current block, which translates into an improved compression ratio. This is called entropy adaptation.
As a consequence, it is possible to work with blocks of 16KB with huff0 (below this threshold, header costs start to override entropy adaptation benefits), while the optimal block size for range0 is larger, at 128KB, due to its bigger headers.
Entropy adaptation is the main reason Huffman achieves a better compression ratio in some scenarios.

On the other hand, arithmetic takes the lead when one probability goes beyond 50%, and the closer it gets to 100%, the larger the difference. Huffman is stuck at its best possible representation (1 bit per element), while arithmetic keeps benefiting from higher probability levels. This is very visible in the lower part of the benchmark table (results for 'enwik8-lz4hc-run' and 'loong').
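The effect is easy to reproduce with a toy distribution (a quick sketch of mine, not huff0's code): with one symbol at 95% probability, Huffman cannot go below 1 bit per symbol, while the entropy, the limit arithmetic coding approaches, sits far lower.

```python
import heapq
import math

def huffman_lengths(freqs):
    """Build Huffman code lengths by repeatedly merging the two lightest nodes."""
    heap = [(w, i, (s,)) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tick = len(heap)                  # tie-breaker for the heap
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1           # symbols under the merged node go 1 bit deeper
        heapq.heappush(heap, (w1 + w2, tick, syms1 + syms2))
        tick += 1
    return lengths

probs = {'a': 0.95, 'b': 0.03, 'c': 0.02}
code_len = huffman_lengths(probs)
huffman_cost = sum(p * code_len[s] for s, p in probs.items())     # bits per symbol
entropy = -sum(p * math.log2(p) for p in probs.values())          # arithmetic limit
```

With this distribution, Huffman spends 1.05 bits per symbol against an entropy of roughly 0.33 bits: a 3x gap that only arithmetic (or range) coding can close.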

Granted, these conclusions are tied to the selected "semi-static" methodology, which requires a description header per block. A fully dynamic implementation would give different results, much to the advantage of arithmetic, since it suffers more from header size.
However, dynamic probabilities come at their own price, namely speed and the risk of wrong probability estimations.

In order to preserve high speed, I am still working on block-based entropy coding.
However, in a bid to improve compression ratio, I wondered whether it could be possible to benefit from the properties of both huff0 and range0.

Enter HuffX, my new entropy coder, which for the time being is mostly a fusion of huff0 and range0 with a dynamic selector.

In its first incarnation, progress is visible, as shown in the benchmark results below:



By default, HuffX shares the basic properties of huff0, featuring identical ratio, output and speed.
Things get interesting when the input sequence is more squeezed: HuffX gradually hands more data blocks to range0.
At a very interesting middle point, HuffX successfully outperforms both huff0 and range0, while retaining most of huff0's speed. This is the kind of result I am aiming at.
Eventually, HuffX ends up handing all blocks to range0, featuring ratio and speed close to it... but not identical. HuffX still retains the 16KB block size of its older brother, which is a bit too small for pure range coding.
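A per-block dynamic selector can be as simple as the following sketch (my illustration of the idea, not HuffX's actual code; the `huff_encode` and `range_encode` callables stand in for huff0 and range0):

```python
def encode_block(block, huff_encode, range_encode):
    """Encode the block both ways and keep the smaller output, prefixed with a
    1-byte selector so the decoder knows which coder was used."""
    h = huff_encode(block)
    r = range_encode(block)
    return (b'\x00' + h) if len(h) <= len(r) else (b'\x01' + r)
```

Encoding every block twice is what makes the selector cost compression speed, which matches the observation below that the improved ratio does not come for free.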

These first results show that the improved compression ratio comes with a speed cost attached, as expected. Therefore, it remains unclear whether HuffX can find its way into fast compressors such as Zhuff or Etincelle.
Still some work ahead...