This is an opportunity to present version 0.7, which for now limits multi-threading to "benchmark mode" only.
The results are interesting: the speed improvement is very noticeable.
However, using 2 threads does not translate into exactly 2x performance (we are more in the +80-90% range).
There are at least 2 reasons for this (that I have identified).
First, not all segments compress as fast as others. This is expected: compression speed partly depends on the type of data being compressed. Within a large file, some segments are easier and faster to compress than others. This is exactly what is happening here: the result is dictated by the "slowest segment", which decides the final speed of the algorithm.
As an interesting side effect, using more threads than cores can sometimes lead to increased performance. This is counter-intuitive, but it simply shows that by "distributing the load" we avoid having one thread finish an "easy part" quickly and then wait for the second one, which is still handling a "harder part".
Consequently, a more clever way to "segment" the input data would help mitigate this issue.
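To make the "slowest segment" effect concrete, here is a small simulation sketch. It is not the LZ4 code: the per-segment costs are made-up numbers, and the function names `static_split_time` and `dynamic_queue_time` are hypothetical. It compares cutting the input into one big part per thread against cutting it into many small segments that idle threads pick up in turn, which is one possible way to segment more cleverly.

```python
import heapq

def static_split_time(costs, workers):
    """Finish time when the input is cut into `workers` contiguous parts, one per thread."""
    n = len(costs)
    bounds = [round(i * n / workers) for i in range(workers + 1)]
    # The run ends only when the slowest part is done.
    return max(sum(costs[bounds[i]:bounds[i + 1]]) for i in range(workers))

def dynamic_queue_time(costs, workers):
    """Finish time when each idle thread picks up the next small segment, in order."""
    free_at = [0.0] * workers          # time at which each thread becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at) # earliest idle thread takes the segment
        heapq.heappush(free_at, start + cost)
    return max(free_at)

# Hypothetical costs (arbitrary time units): the first half of the file is
# "easy", the second half is three times slower to compress.
costs = [1.0] * 8 + [3.0] * 8

static = static_split_time(costs, 2)
dynamic = dynamic_queue_time(costs, 2)
ideal = sum(costs) / 2

print("one part per thread :", static)
print("many small segments :", dynamic)
print("perfect 2x split    :", ideal)
```

With these made-up costs, the one-part-per-thread split leaves the first thread idle while the second grinds through the hard half, whereas the small-segment queue reaches the ideal time. Real data will be less clear-cut, and smaller segments may cost some compression ratio, so take this as an illustration of the principle only.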
Second, there are some limits which cannot be improved, in this case the bus speed between the CPU and main memory. On reaching 2 GB/s, there is a kind of "speed wall" which cannot be crossed. This effect is not too large with "only" 2 threads, but on quad-core systems it is likely to have an impact, especially on decoding speed. Nothing can be done to improve upon this. Fortunately, it is not much of an issue, since working at "memory bus speed" is fast enough.
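A very rough way to picture this wall is to cap ideal linear scaling at the bus limit. The sketch below is only a toy model under stated assumptions: the 2000 MB/s cap is the 2 GB/s figure above taken as decimal megabytes, the per-thread speed is the single-thread decoding figure from the results in this post, and the function name `modeled_throughput` is hypothetical. It ignores the load-balancing losses described earlier.

```python
def modeled_throughput(per_thread_mb_s, threads, wall_mb_s=2000.0):
    """Ideal linear scaling, capped at an assumed memory-bus limit (MB/s)."""
    return min(per_thread_mb_s * threads, wall_mb_s)

# 805 MB/s is the single-thread decoding speed reported in this post.
for threads in (1, 2, 4):
    print(threads, "thread(s):", modeled_throughput(805.0, threads), "MB/s")
```

Under this model, 2 threads stay below the cap, while 4 decoding threads would already hit it, which is consistent with the expectation that quad-core systems are the ones likely to be affected.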
Anyway, here are the results for the current version:
Name | Version | Compression Ratio | Compression Speed | Decoding Speed |
LZ4 "Ultra Fast" | 0.7 | 2.062 | 232 MB/s | 805 MB/s |
LZ4 "Ultra Fast" - 2 threads | 0.7 | 2.062 | 430 MB/s | 1510 MB/s |
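For reference, the gain from the second thread can be worked out directly from the figures in the table. This is plain arithmetic on the reported numbers; the helper name `speedup` is only for illustration.

```python
# Speeds in MB/s, taken from the table: (1 thread, 2 threads).
results = {
    "compression": (232, 430),
    "decoding": (805, 1510),
}

def speedup(single, dual):
    """Ratio of 2-thread speed to 1-thread speed."""
    return dual / single

for name, (single, dual) in results.items():
    s = speedup(single, dual)
    gain_percent = (s - 1) * 100   # improvement over one thread
    efficiency = s / 2             # fraction of a perfect 2x
    print(f"{name}: x{s:.2f} (+{gain_percent:.0f}%), efficiency {efficiency:.0%}")
```

This gives roughly +85% for compression and +88% for decoding, in line with the +80-90% range mentioned above.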
Since I only own a dual-core system, I can't really test with more threads. But the code allows for up to 32 threads, so feel free to test on your own rig.
You can download the new version here.
I've got a dual socket L5639 system with 24 threads and 72GB of ddr3 1066 that i'll test this on