Comments on RealTime Data Compression: LZ4 - Improved performance

Rob Wentworth (2011-07-13, 02:12):
I modified the LZ4 compressor so that it skips small matches (where it checks for a hash-table match). Increasing the size of the ignored matches makes the compressed output larger, showing that skipping small matches does not help the compression ratio. I did not test the speed.

However, I think that changing the file format to contain runs with matching 16-byte alignment between input and output would make memcpy() (or Agner Fog's faster replacement) much faster at these copies than the current 32-bit unaligned copies.
http://www.agner.org/optimize/#asmlib

Re: experience, I have been programming since the punch-card days (the late 60s). In fact, I owned a personal keypunch machine, and later an ASR-33 teletype in my living room, with an extra phone line for dedicated modem use (with an acoustic coupler) for late-night programming binges...

Lots of "programming on the bare metal" back before operating systems were common, and writing device drivers and multitasking operating-system kernels later.
Lots of robotics and motion-control code too (mostly the code that runs INSIDE the machines and talks directly to the I/O chips).

I still have my first home-built computer with LEDs and toggle switches, which I used to make electronic music.

Strangely, my first job was as a COBOL programmer, and my current job is maintaining mixed COBOL and TASM (with some C modules I added recently).

I have to keep a 32-bit development system around for the old DOS tools, but I have occasionally used Win2K in VirtualBox on my Win7 laptop for "road trips".

I want to speed up your already very fast LZ4 code using ideas from Agner Fog's website. I do not like his library functions always testing the CPU type up front, though -- I think each CPU type should have its own library (different DLLs?), and the matching library should be loaded for the CPU. Functions should not have multiple code paths optimized for each CPU. Instead, each CPU should have its own LIBRARY. At least, that is my opinion.

At a bare minimum, your copy loop should use the best available SSE2 (or whatever is fastest) instructions, preferably with cache-prefetch control.

Re: too good to be true -- I trusted the speedups until I passed the cache-line size, when it really got too good, too fast. Oops.

Cyan (2011-07-12, 23:12):
Hi Rob
Sure, you're very welcome to share your experiments.

The trap you fell into is a very common one indeed, and I also experienced it myself in my early days at compression.

Any time a gain seems "too good to be true", or just too easy, the decoder test is the only one which provides the final proof. And by the way, you realised that rule very quickly, which is a pretty good indication that you're an active and experienced programmer :)

Regards

Rob Wentworth (2011-07-12, 21:52):
[This comment has been removed by a blog administrator.]

Rob Wentworth (2011-07-12, 21:48):
Oops... never mind... The decompressed output files do not match the original uncompressed text.

The mismatches are about the length of the new MINMATCH - 4.

Apparently, the LZ4 code cannot use a different MINMATCH size without further adjustments.

The timing tests above are not valid, but it is still worth retesting with different MINMATCH lengths once LZ4 can handle that.

I got so excited about the speedups that I posted before verifying the decompressed output.

It became obvious that something was wrong when MINMATCH = 2048 compressed to 177 KB in 7 ms. If we keep increasing MINMATCH, we can compress to nothing in no time. Unreal.
;-(

Sorry...

Rob Wentworth (2011-07-12, 20:44):
[This comment has been removed by a blog administrator.]

Rob Wentworth (2011-07-12, 19:59):
Dude! I just changed my LZ4 MINMATCH to my CACHELINESIZE (64). It compresses MUCH faster, and the resulting file is MUCH smaller than previous results, when compressing my BWT data.

I keep all my data as BWT because I can search it very fast. I use reversed source data so that I can search prefixes instead of suffixes, and I can decode my BWT in the forward direction instead of in reverse...

I am very pleased with the results of using your LZ4 compression on my BWT data, especially after setting MINMATCH = CACHELINESIZE.

Awesome!

Rob Wentworth (2011-07-12, 17:42):
FYI, your 32-bit copies can have a significant slowdown on Core2 processors due to not being 32-bit aligned. The standard memcpy function typically has multiple CPU-dependent code paths and uses the fastest available instructions (MMX, SSE2, etc.) for your processor. Agner Fog's faster replacement memcpy also forces boundary alignment using multiple segment copies. Agner says the standard memcpy is almost as fast as his when using 16-byte boundary alignment.
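To illustrate the idea, a minimal sketch of the kind of 16-byte-aligned block copy such a format would enable, using SSE2 intrinsics (this is a hypothetical helper, not LZ4's actual code; it assumes both pointers carry the alignment the proposed format would guarantee):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Copy len bytes, assuming src and dst are BOTH 16-byte aligned.
 * Rounds up to whole 16-byte blocks, so it may write up to 15 bytes
 * past len -- the caller's buffers must allow that slack, much like
 * LZ4's own "wild copy" loops. */
static void copy_aligned16(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = (len + 15) / 16;  /* number of 16-byte blocks */

    while (blocks--)
        _mm_store_si128(d++, _mm_load_si128(s++));
}
```

The aligned load/store forms fault on misaligned pointers, which is exactly why the format change matters: without guaranteed alignment, the copy loop must fall back to unaligned (slower on Core2-era hardware) instructions.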
I plan to modify my LZ compression code (based on LZ4) to create a different compression format that enforces 16-byte alignment between compressed runs and uncompressed runs. I think the cost of inserting padding into the compressed block to force run alignment could be offset by a second pass over the block, already in L2 cache, that fills the padding with small runs. This will take some research, but an LZ compression file format that forces 16-byte alignment between compressed and uncompressed runs could yield HUGE speed increases when using memcpy to copy the runs.

Rob Wentworth (2011-07-12, 17:15):
I also noticed that when I compile my code to optimize for SIZE instead of speed (with GCC), my modified decompression loop runs a LOT faster. I suspect that this is because the loop kernel may not fit in the L1 instruction cache when optimized for speed. My loop kernel is big because it builds the BWT "helper tables" simultaneously during the decompression pass.

My LZ4 compression is faster when optimized for speed. I probably need to split compression and decompression into separate source files so that I can compile each object module with whichever compiler optimization switches give the fastest code for that operation.

I need to test whether building the tables in a separate pass might be faster due to L1 cache effects...

If you want speed, you really need to read Agner Fog's excellent information on code optimization.

You may want to see if you can extend the Pareto curve of your code using these methods.

Good luck!
;-)

Rob Wentworth (2011-07-12, 17:03):
The comment above about "much faster and much smaller" was tested using a BWT-encoded list of 2 million names of the format "JOHN A & SALLY T SMITH". I pre-encode my data as BWT so that I can do fast lookups on the data immediately after decompressing an LZ4 block.

I also modified my copy of the LZ4 decompressor to build character-histogram and occurrence-counter "helper tables" during the decompression pass.

On decompression, if a record index has not already been saved, I "unBWT" the table, saving the positions in the BWT of all the linefeed characters in "unBWT" sequence, so that I can immediately unBWT any selected record (name) by record number.

I am using Yuta Mori's BWT code. Thanks, guys!

 - geekmaster

Rob Wentworth (2011-07-12, 16:43):
I found that LZ4 compresses much faster and much smaller if you change MINMATCH to 8.

The smaller compressed size is probably because a larger MINMATCH prevents many small matches from replacing large matches through hash-table key collisions.

The faster speed comes from not wasting time processing small runs.
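A toy sketch of where a MINMATCH threshold enters an LZ4-style match search — simplified, hypothetical code, not the actual LZ4 source; function names are illustrative only. Raising the constant makes the compressor reject short hash-table candidates and emit literals instead:

```c
#include <stddef.h>

#define MINMATCH 8  /* the experiment above: raised from LZ4's default of 4 */

/* Count how many leading bytes match between two positions, capped at limit. */
static size_t match_length(const unsigned char *a, const unsigned char *b,
                           size_t limit)
{
    size_t n = 0;
    while (n < limit && a[n] == b[n])
        n++;
    return n;
}

/* A hash-table candidate only becomes a match if it reaches MINMATCH
 * bytes; shorter candidates are skipped and the bytes stay literals. */
static int is_usable_match(const unsigned char *ip, const unsigned char *ref,
                           size_t limit)
{
    return match_length(ref, ip, limit) >= MINMATCH;
}
```

With MINMATCH = 8, a 3-byte repeat that would have displaced a longer match via a hash collision is simply ignored, which is one plausible mechanism behind the smaller output reported above.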