After Stephan's SqueezeChart statement that LZ4 was the fastest compressor seen on his public benchmark, LZ4 was once again put to the test, this time by Sami Runsas's Compression Ratings.
This benchmark is very interesting : its policy is very clear, its corpus is intentionally heterogeneous and carefully selected (so much so that I use it regularly for calibration), and its emphasis on speed and precision makes it a reliable comparison tool, especially for fast compressors.
It is also heavily multi-thread oriented. Therefore, in order to sustain the comparison, LZ4 needed to be upgraded with multi-threading capabilities, since it would stand no chance running on a single core against the best-of-breed multi-threaded compressors. That's where version 0.9 comes in....
And it fares pretty well on its first try.
LZ4 gets the top position on both compression and decompression speed, while also providing a better ratio than the previous king of the hill, the excellent 4x4 from Bulat Ziganshin.
And the difference is pretty large : +50% on compression speed, +25% on decompression speed (using less than a full thread to max out), and +9% on compression ratio.
As a complement, it's interesting to observe that the decoding speed of LZ4 does not vary when increasing the number of threads. That's because decoding is so fast that half a CPU core is enough to saturate the RAM disk.
With such speed available, compressing/decompressing data costs about the same time as just copying it ... from a main-memory RAM drive. It opens the way to almost-free compression for, well, any usage, even on-the-fly compression of temporary data for instance.
LZ4 is available on its homepage.
Tuesday, March 1, 2011
Multi-threading compression available : LZ4 v0.9
Previous versions of LZ4 displayed promising results with multi-threading, but only in benchmark mode. This feature is now fully accessible, with a real "file I/O" interface.
You will need a very fast disk drive to experiment with it. In practice, only a RAM drive is likely to be fast enough to feed LZ4, especially when using multi-threading modes.
I can't really test beyond 2 threads, since I only have dual-core systems within reach, but the code should prove pretty scalable on quad-core systems and beyond (up to 32 threads).
The block subdivision method was selected, due to its design simplicity and scalability. Compared with previous versions, even the single-thread mode benefits greatly, since multi-threading requires asynchronous data loading, which means that reading from and writing to the disk happen in parallel with compression.
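To illustrate the idea, here is a minimal double-buffering sketch (plain C with pthreads) : while a worker thread compresses block N, the main thread already reads block N+1 from disk. It is only a sketch of the principle, not LZ4's actual code ; compress_and_write() and the block size are illustrative placeholders.

```c
/* Minimal sketch of asynchronous data loading with double buffering.
 * Not LZ4's actual source : compress_and_write() and BLOCK_SIZE are
 * illustrative placeholders. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE (8 * 1024 * 1024)   /* 8 MB blocks, arbitrary */

typedef struct { char* data; size_t size; FILE* out; } job_t;

/* stands for "compress one block and write the result" */
extern void compress_and_write(const char* src, size_t size, FILE* out);

static void* compress_job(void* arg)
{
    job_t* j = (job_t*)arg;
    compress_and_write(j->data, j->size, j->out);
    return NULL;
}

void compress_file(FILE* in, FILE* out)
{
    char* buf[2] = { malloc(BLOCK_SIZE), malloc(BLOCK_SIZE) };
    pthread_t worker;
    job_t job;
    int busy = 0, cur = 0;

    if (!buf[0] || !buf[1]) { free(buf[0]); free(buf[1]); return; }

    for (;;) {
        /* read the next block while the previous one is still compressing */
        size_t r = fread(buf[cur], 1, BLOCK_SIZE, in);
        if (busy) pthread_join(worker, NULL);       /* wait for previous block */
        if (r == 0) break;                          /* nothing more to compress */
        job.data = buf[cur]; job.size = r; job.out = out;
        pthread_create(&worker, NULL, compress_job, &job);
        busy = 1;
        cur ^= 1;                                   /* flip buffers */
    }
    free(buf[0]); free(buf[1]);
}
```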
You can download and test this new version on the LZ4 homepage.
Sunday, February 27, 2011
LZ4 on steroids : version 0.8
Back to the playground. It's always nice to work with LZ4, since its layout makes it very easy to test new ideas.
In this case, I wanted to check the principle of selective sampling, this time purely in order to improve speed. This is a bit different from Zhuff, which used the idea to improve compression.
As stated in an earlier post, speed is increased by decreasing the number of samples. However, since LZ4 is a pure LZ77 compressor (using offsets to identify repeated patterns), reducing samples can only decrease compression. Therefore, it is a matter of trade-off.
The end result achieved with LZ4 version 0.8 seems really interesting. On average, we end up with a 0.05% compression loss and a more than 10% speed boost. This is almost an unfair trade-off.
Don't be fooled by this "10%" average value, since it hides some very large differences. In effect, some files gain a lot, while others get no benefit at all. For instance, an almost incompressible file can see a speed improvement of more than 500%, while a file with nothing to "discard", such as enwik8, will not see any performance improvement. But it remains an interesting bet "on average".
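To make the principle more concrete, here is a hedged sketch of what "decreasing the number of samples" can look like in a hash-based LZ77 search : when consecutive positions keep failing to produce a match, the step between sampled positions grows, so incompressible regions are scanned much faster. All names and constants are illustrative ; this is not LZ4's actual source.

```c
/* Sketch of selective sampling in a simple hash-based LZ77 search.
 * Illustrative only : constants, hash and skip rule are not LZ4's. */
#include <stdint.h>
#include <string.h>

#define HASH_LOG   12
#define HASH_SIZE  (1 << HASH_LOG)
#define MIN_MATCH  4

static uint32_t hash4(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - HASH_LOG);
}

/* the step between sampled positions grows as misses accumulate */
static size_t sample_step(unsigned misses)
{
    unsigned shift = misses >> 5;          /* every 32 misses, double the step */
    return (size_t)1 << (shift > 16 ? 16 : shift);
}

/* scans forward from 'pos', registering only sampled positions in 'table'
 * (position 0 serves as the "empty slot" marker, acceptable for a sketch) */
size_t find_next_match(const uint8_t* base, size_t pos, size_t end,
                       size_t table[HASH_SIZE], size_t* matchPos)
{
    unsigned misses = 0;
    while (pos + MIN_MATCH <= end) {
        uint32_t h = hash4(base + pos);
        size_t candidate = table[h];
        table[h] = pos;                             /* sample this position */
        if (candidate && !memcmp(base + candidate, base + pos, MIN_MATCH)) {
            *matchPos = candidate;                  /* match found */
            return pos;
        }
        pos += sample_step(misses++);               /* skip faster and faster */
    }
    return end;                                     /* no more matches */
}
```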
While I was at it, I also modified the multi-threading code branch. The new behavior is now much closer to a real I/O algorithm, which makes it more representative of actual usage. It also results in slightly faster operation.
And finally, there is a new checksum algorithm, which ensures compression results are correct.
The resulting binary is readily available here, and can be tested on your system.
 | version | Compression Ratio | Compression Speed | Decoding Speed |
LZ4 "Ultra Fast" | 0.7 | 2.062 | 232 MB/s | 805 MB/s |
LZ4 "Ultra Fast" | 0.8 | 2.061 | 260 MB/s | 810 MB/s |
LZ4 "Ultra Fast" - 2 threads | 0.7 | 2.062 | 430 MB/s | 1510 MB/s |
LZ4 "Ultra Fast" - 2 threads | 0.8 | 2.061 | 500 MB/s | 1510 MB/s |
Friday, February 25, 2011
Multi-threaded compression experiment : LZ4 version 0.7
Putting into practice conclusions from an earlier post on multi-threading, I finally created a code branch using multi-threading for LZ4.
This is an opportunity to present version 0.7, which for now limits multi-threading to the "benchmark mode" only.
Results are interesting : the speed improvement is very noticeable.
However, using 2 threads does not translate into exactly 2x performance (we are more in the +80-90% range).
There are at least 2 reasons for this (that I have identified).
First, not all segments compress as fast as others. This is expected : compression speed partly depends on the type of data being compressed. Within a large file, some segments are easier and faster to compress than others. This is exactly what's happening here : the result is effectively bound by the "slowest segment", which decides the final speed of the algorithm.
As an interesting side-effect, using more threads than cores can sometimes lead to increased performance. This looks counter-intuitive, but it simply shows that by "distributing the load" more finely, we avoid having one thread finish an "easy part" and then wait for the other one still busy with a "harder part".
As a consequence, a more clever way to "segment" the input data would help to mitigate the issue.
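As a hedged illustration of such a segmentation strategy, the sketch below cuts the input into more blocks than threads and lets each thread grab the next unclaimed block as soon as it finishes its current one, so a thread that drew an "easy" block doesn't sit idle waiting for one stuck on a "hard" block. compress_block() and the constants are illustrative placeholders, not LZ4's actual API.

```c
/* Sketch of dynamic block distribution across worker threads.
 * Illustrative only : compress_block() and the constants are not LZ4's API. */
#include <pthread.h>
#include <stddef.h>

#define NB_THREADS 2
#define NB_BLOCKS  16              /* deliberately more blocks than threads */

typedef struct {
    const char* src;  size_t srcSize;
    char*       dst;  size_t dstCapacity;
    size_t      dstSize;
} block_t;

static block_t         g_blocks[NB_BLOCKS];   /* filled by the caller */
static int             g_next = 0;            /* index of next unclaimed block */
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

/* stands for "compress one independent block", e.g. one LZ4 segment */
extern size_t compress_block(const char* src, size_t srcSize,
                             char* dst, size_t dstCapacity);

static void* worker(void* unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&g_lock);
        int id = (g_next < NB_BLOCKS) ? g_next++ : -1;   /* claim next block */
        pthread_mutex_unlock(&g_lock);
        if (id < 0) return NULL;                         /* no work left */
        block_t* b = &g_blocks[id];
        b->dstSize = compress_block(b->src, b->srcSize, b->dst, b->dstCapacity);
    }
}

void compress_all_blocks(void)
{
    pthread_t th[NB_THREADS];
    for (int i = 0; i < NB_THREADS; i++) pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NB_THREADS; i++) pthread_join(th[i], NULL);
}
```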
Second, there are limits which cannot be worked around, in this case the bus speed between the CPU and main memory. Around 2 GB/s, there is a kind of "speed wall" which cannot be crossed. The effect is not too large with "only" 2 threads, but it is likely to have an impact on quad-core systems, especially on decoding speed. There is nothing that can be done about it. Hopefully, this is not much of an issue, since working at "memory bus speed" is quite fast enough.
Anyway, here are the results for the current version :
 | version | Compression Ratio | Compression Speed | Decoding Speed |
LZ4 "Ultra Fast" | 0.7 | 2.062 | 232 MB/s | 805 MB/s |
LZ4 "Ultra Fast" - 2 threads | 0.7 | 2.062 | 430 MB/s | 1510 MB/s |
Since I only own a 2-core system, I can't really test with more threads. But the code allows up to 32 threads, so feel free to test on your own rig.
You can download the new version here.
Tuesday, February 22, 2011
Huff0 : Early detection offers better speed (v0.7)
In my earlier release, I skipped the integration of a feature that is simple to understand but more difficult to execute : early detection of border-case segments.
This definition covers both "not compressible" segments and "single character" segments. The previous algorithm was already detecting these cases correctly, but only during the construction of the Huffman tree.
The new algorithm ensures these situations are detected before any work is done to build the tree. It does so by counting differently, using a new structure created for this purpose. The net result is a speed improvement for files which feature such situations.
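As a rough idea of what such an early check can look like, here is a minimal sketch based on the byte histogram alone. The classification rule, names and threshold are illustrative guesses, not Huff0's actual implementation.

```c
/* Sketch of early detection of border-case segments from the byte histogram.
 * Illustrative only : the "not compressible" threshold is a guess, not Huff0's rule. */
#include <stddef.h>

typedef enum { SEG_NORMAL, SEG_SINGLE_CHAR, SEG_NOT_COMPRESSIBLE } seg_kind;

seg_kind classify_segment(const unsigned char* src, size_t size)
{
    size_t count[256] = {0};
    size_t maxCount = 0;

    for (size_t i = 0; i < size; i++) count[src[i]]++;
    for (int s = 0; s < 256; s++) if (count[s] > maxCount) maxCount = count[s];

    /* a single repeated character : no tree needed, a short run-length header suffices */
    if (size && maxCount == size) return SEG_SINGLE_CHAR;

    /* near-flat histogram : the most frequent byte barely exceeds the size/256 average,
       so a Huffman tree cannot gain enough to pay for its header ; store the segment raw */
    if (maxCount < (size / 256) + (size / 2048) + 2) return SEG_NOT_COMPRESSIBLE;

    return SEG_NORMAL;   /* worth building the Huffman tree */
}
```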
Detailed performance assessment :
 | Huff0 v0.6 | | | Huff0 v0.7 | | |
 | Ratio | Compression Speed | Decoding Speed | Ratio | Compression Speed | Decoding Speed |
Not compressible | ||||||
enwik8.7z | 1.000 | 810 MB/s | 1.93 GB/s | 1.000 | 870 MB/s | 1.93 GB/s |
Hardly compressible | ||||||
win98-lz4hc-lit | 1.024 | 465 MB/s | 600 MB/s | 1.024 | 485 MB/s | 600 MB/s |
audio1 | 1.070 | 285 MB/s | 280 MB/s | 1.070 | 285 MB/s | 280 MB/s |
Distributed | ||||||
enwik8-lz4hc-lit | 1.290 | 204 MB/s | 194 MB/s | 1.290 | 204 MB/s | 194 MB/s |
Lightly Ordered | ||||||
enwik8-lz4hc-offh | 1.131 | 197 MB/s | 184 MB/s | 1.131 | 197 MB/s | 184 MB/s |
Ordered | ||||||
enwik8-lz4hc-ml | 2.309 | 214 MB/s | 195 MB/s | 2.309 | 214 MB/s | 195 MB/s |
Squeezed | ||||||
office-lz4hc-run | 3.152 | 218 MB/s | 202 MB/s | 3.152 | 218 MB/s | 202 MB/s |
enwik8-lz4hc-run | 4.959 | 245 MB/s | 224 MB/s | 4.959 | 245 MB/s | 224 MB/s |
Ultra compressible | ||||||
loong | 275 | 785 MB/s | 2.93 GB/s | 275 | 860 MB/s | 2.93 GB/s |
This (mostly) closes the gap with Range0 regarding the speed of detecting not-compressible segments, which is the main case to consider in real-life situations.
You can download and test the new version here :
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
Sunday, February 20, 2011
Zhuff : new version v0.6
You can download it on the forum.
Saturday, February 19, 2011
Improved sampling for better compression
When building a classic compression algorithm, past data is used to populate the predictor for future data. This works pretty well, and by giving recent data a higher probability of representing future data, it also provides an automatic "adaptation" property, which is highly desirable.
Typically, most compressors (zip, 7zip, rar, etc.) use the entire search window to build a dictionary of references. For best compression, all elements in the window must be registered and sorted.
Now, all this costs a lot of memory and CPU cycles.
For better speed and memory savings, the dictionary can be made significantly smaller. This is one of the bases for the results of LZ4, Zhuff and Etincelle. But how do we select which data should be part of the dictionary, and which should not ?
There are several considerations at stake. First, about speed.
The more elements in the dictionary, the slower the algorithm. So we want as few as possible, to limit the number of updates and sort operations. Basically, each time a reference in the table is erased by a newer one without ever being used, it was a useless update.
Updating only the most probable references makes the best use of CPU cycles.
Second, compression.
For a pure LZ model, which uses offsets to identify the streams to copy, any missing position in the search window can only hurt compression, since it means that the "range of references" is not fully populated. Therefore, this strategy is a tricky business : we have to measure the effect of missing references on compression ratio, to check whether the cost is worth the gains.
For indirect models, things get better. An indirect model means that we reference a position in a table, instead of a directly calculable position. The table must be updated with exactly the same rule on both the compression and decoding sides, which obviously means more work and memory on the decoder side. This family includes the LZRW and ROLZ reference models.
This is much better, since by not referencing some positions, we don't create a penalty due to "unused" slots. Indeed, if we intentionally skip "bad" positions in favor of "good" ones, we should even improve compression !
This is effectively what happened with Etincelle, when the "incompressible segment detection" algorithm was added : it improved both speed and compression ratio at the same time. A good bet.
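To make the "indirect model" mechanism more concrete, here is a hedged ROLZ-style sketch : the encoder emits a small index into a table of recent positions selected by a one-byte context, and the decoder, applying the exact same update rule, resolves that index back to a position. Names and parameters are illustrative ; this is not Zhuff's or Etincelle's actual code.

```c
/* Sketch of an "indirect model" reference, ROLZ-style. Instead of emitting a
 * raw offset, the encoder emits an index into a small table of recent
 * positions, selected by a 1-byte context. The decoder maintains the exact
 * same table with the exact same update rule. Illustrative only. */
#include <stddef.h>
#include <stdint.h>

#define SLOTS 16   /* positions remembered per context */

typedef struct {
    size_t   pos[256][SLOTS];   /* recent positions, per previous-byte context */
    unsigned head[256];         /* next slot to overwrite, per context */
} rolz_table;                   /* zero-initialize before use */

/* identical on compressor and decompressor : register position 'p',
 * whose preceding byte is 'ctx' */
static void rolz_update(rolz_table* t, uint8_t ctx, size_t p)
{
    t->pos[ctx][t->head[ctx]] = p;
    t->head[ctx] = (t->head[ctx] + 1) % SLOTS;
}

/* encoder side : look for a slot whose position matches at least 4 bytes */
static int rolz_find(const rolz_table* t, const uint8_t* base,
                     uint8_t ctx, size_t p, size_t end, size_t* matchPos)
{
    for (int i = 0; i < SLOTS; i++) {
        size_t cand = t->pos[ctx][i];
        if (cand && p + 4 <= end
            && base[cand]   == base[p]   && base[cand+1] == base[p+1]
            && base[cand+2] == base[p+2] && base[cand+3] == base[p+3]) {
            *matchPos = cand;
            return i;       /* this small index is what gets encoded, not the offset */
        }
    }
    return -1;              /* no reference : emit a literal instead */
}

/* decoder side : the emitted index resolves directly to a position */
static size_t rolz_resolve(const rolz_table* t, uint8_t ctx, int index)
{
    return t->pos[ctx][index];
}
```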
So, where are we with these early considerations ?
Mostly, we know that selecting only the "good" positions to be part of the dictionary will offer excellent speed improvement. But where are they ?
After many runs, I ended up with a ton of log files and statistics, leading to the following "general" conclusion : the positions most likely to become useful references are "near the borders of segments". A segment, in this definition, is either a match sequence or a continuous stream of literals. The larger the segment, the more the rule applies.
I therefore tried to verify this result with Zhuff, and quickly witnessed some compression ratio improvements. Selecting the right balance between speed and compression ratio is another (complex) story, but at least it provides some new room for improvement.
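As a hedged sketch of how this observation could translate into code : each parsed segment, whether a literal run or a match, contributes only a few dictionary entries taken near its borders, and the interior of large segments is skipped. insert_position() and the border width are hypothetical, not Zhuff's actual implementation.

```c
/* Sketch of "border sampling" : register only positions near segment borders,
 * where future useful references are most likely to start. Illustrative only. */
#include <stddef.h>

#define BORDER 2   /* how many positions to sample at each border, arbitrary */

extern void insert_position(size_t pos);   /* hypothetical dictionary update */

/* register a segment [start, start+length) : a literal run or a match */
static void sample_segment(size_t start, size_t length)
{
    size_t i;
    size_t front = (length < BORDER) ? length : BORDER;
    for (i = 0; i < front; i++)
        insert_position(start + i);                     /* near the left border */
    if (length > 2 * BORDER)
        for (i = 0; i < BORDER; i++)
            insert_position(start + length - 1 - i);    /* near the right border */
    /* interior positions of a large segment are intentionally skipped */
}
```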
Looking even further, the idea that some positions are "more likely" than others could lead to attributing a "probability" to references, instead of a simple index (which is just a quick shortcut for a "flat probability" model). Now, that would be a deep change compared to the model currently in use within most LZ-based compressors. But let's keep that idea for a later note....
Saturday, February 12, 2011
A Homepage for MMC
MMC (Morphing Match Chain) is a new search algorithm I've been toying with since the end of 2010. It was presented in November, and is apparently a novel technique, with interesting properties. It seems to be the fastest search algorithm around, at least for greedy-match LZ compression.
LZ4HC uses it as its core, and this is the primary reason for its performance.
Now, MMC was discussed in detail with experts in a compression forum, but it was lacking a homepage to present it comprehensively. This is now fixed.
You'll find in this blog a homepage for MMC, describing the algorithm from a high-level perspective, with some test results. The algorithm has been implemented as open-source code (GPL v2), which can be downloaded from Google Code.