Monday, June 6, 2011

LZ4 - Improved performance

Since Przemyslaw Skibinski provided a benchmark dedicated to comparing fast compressors, I've been interested in a direct comparison between LZ4 and Google's snappy.

Snappy is a very fast compressor, but its source code is quite a bit more complex than LZ4's. At first, it may look like this complexity should cost quite some CPU cycles, but the results say otherwise: snappy proves to be a very speedy competitor.
Therefore, I took a closer look at LZ4, hunting for potential wasted cycles, but couldn't find a way to make the code any leaner than it already is. A few tiny % could be grabbed here and there, but the speed difference remained too large to be explained by the algorithm alone.

It turns out that the difference mostly comes down to a simple reason : cache policy.
LZ4 was designed with the L2 cache limitation in mind: to ensure fast compression, all data should be retrieved from the L2 cache, which is relatively fast, at about 10 cycles per access.
Fair enough, but there is faster still: the L1 cache can serve an access within 3 cycles or less. It looks like a small difference, but it adds up with every data access. In the end, the total speed difference can reach 20%, exactly what LZ4 was missing to reach snappy's speed.

And ... that's all it takes. Modifying the performance parameter "HASH_LOG" to keep most data accesses within the L1 cache proved enough to close the gap.
Therefore, starting from revision r10, the default value of "HASH_LOG" is 12, for a 16KB table, which is just the right size for Intel's L1 data cache. AMD users may try 13 for 32KB, since their L1 data cache is twice as large.
Addendum : note that the figures above are correct only for 32-bit systems; 64-bit ones will use twice as much memory.

There is just one small catch : with less memory for its tables, LZ4 misses more opportunities to find matches, which translates into a worse compression ratio. I've partly mitigated the effect with a more thorough search, but a measurable compression ratio deficit remains. In any case, if compression ratio matters more to you, just increase the "HASH_LOG" value back to where it was before (17, for 512KB). And if ratio is really critical, move away from the fast scan and implement a more intensive search routine.

LZ4 was not created with such memory requirements in mind, but it's refreshing to see this algorithm accommodate these new conditions so easily.

Another interesting side effect of using less memory for compression is faster decompression. At first this seems counter-intuitive, since the decompression algorithm is not affected by the change and needs no memory at all beyond the input and output buffers. But there is a reason : with fewer matches, the decompression algorithm has fewer "segments" to copy, and its speed is tied to that number : better to have a few large copy operations than many small ones.

Well, if you want to give it a try, just download the latest source code from Google Code. The typical gain is about 20% for compression speed and 5% for decompression speed, while the ratio is typically 2% lower, although obviously your mileage will vary depending on the file to be compressed.

Update : well, another update (r11), another performance improvement. This time it targets GCC compiler optimizations and greatly benefits decompression speed.

Update 2 : a compression benchmark comparison has been completed by m^2 on his Atom-based system.


  1. I found that LZ4 compresses much faster and produces much smaller output if you change MINMATCH to 8.

    The smaller output is probably because a larger MINMATCH prevents many small matches from displacing large matches through hash table key collisions.

    The faster speed comes from not wasting time processing small runs.

  2. The above comment about "much faster and much smaller" was tested using a BWT-encoded list of 2 million names in the format "JOHN A & SALLY T SMITH". I pre-encode my data as BWT so that I can do fast lookups on the data immediately after decompressing an LZ4 block.

    I also modified my copy of the LZ4 decompressor to build character histogram and occurrence counter "helper tables" during the decompression pass.

    On decompression, if a record index has not already been saved, I "unBWT" the table, saving the positions in the BWT of all the linefeed characters in "unBWT" sequence, so I can immediately unBWT any selected record (name) by record number.

    I am using Yuta Mori's BWT code. Thanks guys!

    - geekmaster

  3. I also noticed that when I compile my code to optimize for SIZE instead of speed (with GCC), my modified decompression loop runs a LOT faster. I suspect this is because the loop kernel may not fit in the L1 instruction cache when optimized for speed. My loop kernel is big because it is doing simultaneous BWT "helper table" builds during the decompression.

    My LZ4 compression is faster when optimized for speed. I probably need to split compression and decompression into separate source files so that I can compile the object modules with whichever compiler optimization switches give the fastest code for each operation.

    I need to test to see whether building the tables in a separate pass might be faster due to L1 cache effects...

    If you want speed, you really need to read Agner Fog's excellent information on code optimization.

    You may want to see whether you can expand the Pareto curve of your code using these methods.

    Good luck! ;-)

  4. FYI, your 32-bit copies can suffer a significant slowdown on Core2 processors due to not being 32-bit aligned. The standard memcpy function typically has multiple CPU-dependent code paths and uses the fastest available instructions (MMX, SSE2, etc.) for your processor. Agner Fog's faster replacement memcpy also forces boundary alignment using multiple segment copies. Agner says the standard memcpy is almost as fast as his when using 16-byte boundary alignment. I plan to modify my LZ compression code (based on LZ4) to create a different compression format that enforces 16-byte alignment between compressed runs and uncompressed runs. I think that inserting padding in the compressed block to force run alignment could be offset with a second pass over the block, already in L2 cache, to fill in the padding with small runs. This will take some research, but an LZ compression file format that forces 16-byte alignment matches between compressed and uncompressed runs could yield HUGE speed increases when using memcpy to copy the runs.

  5. Dude! I just changed my LZ4 MINMATCH to my CACHELINESIZE (64). It compresses MUCH faster and the resulting file is MUCH smaller than previous results when compressing my BWT data.

    I keep all my data as BWT because I can search it very fast. I use reversed source data so I can search prefixes instead of suffixes, and I can decode my BWT in the forward direction instead of reverse...

    I am very pleased with the results of using your LZ4 compression on my BWT data, especially after adjusting MINMATCH = CACHELINESIZE.


  6. This comment has been removed by a blog administrator.

  7. Oops... nevermind... The decompressed output files do not match the original uncompressed text.

    The mismatches are about the length of the new MINMATCH - 4.

    Apparently, the LZ4 code cannot use a different MINMATCH size without adjustments.

    The above timing tests are not valid, but it will still be worth retesting with different MINMATCH lengths once LZ4 can handle that.

    I got so excited about the speedups that I blogged before verifying the decompressed code.

    It became obvious that something was wrong when MINMATCH = 2048 compressed to 177 KB in 7 ms. If we keep increasing MINMATCH we can compress to nothing in no time. Unreal. ;-(


  8. This comment has been removed by a blog administrator.

  9. Hi Rob

    Sure, you're very welcome to keep experimenting.

    The trap you fell into is a very common one indeed, and I also experienced it myself in my early days in compression.

    Any time a gain seems "too good to be true", or just too easy, the decoder test is the only one that provides the final proof. And btw, you realised that rule very quickly, which is a pretty good indication that you're an active and experienced programmer :)


  10. I modified the LZ4 compressor so I can skip small matches (where it checks for a hash table match). Increasing the size of ignored matches makes the compressed size larger, showing that skipping small matches does not help. I did not test the speed.

    However, I think that changing the file format to contain runs with matching 16-byte alignment between input and output would make memcpy() (or Agner Fog's faster replacement) much faster at these copies than the 32-bit unaligned copies.

    Re: experience, I have been programming since the punch card days (late 60's). In fact, I owned a personal keypunch machine, and later an ASR-33 teletype in my living room, and an extra phone line for dedicated modem use (with acoustic coupler) for late-night programming binges...

    Lots of "programming on the bare metal" back before operating systems were common, and writing device drivers and multitasking operating system kernels later. Lots of robotics and motion control code too (mostly the code that runs INSIDE them and talks directly to the I/O chips).

    I still have my first home-built computer with LEDs and toggle switches, that I did electronic music on.

    Strangely, my first job was as a COBOL programmer, and my current employment is maintaining mixed COBOL and TASM (with some C modules I added recently).

    I have to keep a 32-bit development system around for the old DOS tools, but I have occasionally used Win2K in VirtualBox on my Win7 laptop for "road trips".

    I want to speed up your really fast LZ4 code using ideas from Agner Fog's website. I do not like his library functions always testing the CPU type at the front though -- I think each type should have its own library (different DLLs?), and the library loaded to match the CPU. Functions should not have multiple code paths optimized for each CPU. Instead, each CPU should have its own LIBRARY. -- At least that is my opinion.

    At a bare minimum, your copy loop should use fancy SSE2 (or whatever is best) instructions, preferably with cache prefetch control.

    Re: too good to be true -- I trusted the speedups until I passed the cache line size, when it really got too good too fast. Oops.