Tuesday, April 7, 2015

Sampling, or a faster LZ4

 Quite some time ago, I've been experimenting with some unusual sampling methods, in an attempt to find a better way to compress data with LZ4.

The main idea was as follows : LZ4 hash table is getting filled pretty quickly, due to its small size. It becomes the dominant limitation, both for compression ratio and speed. In many cases, a hash cell is overwritten many times before being actually useful (i.e. produce a match). So, could there be some better way to update the hash table, which would update it less often, but in the end, update it more efficiently (i.e. limit wastes from over-writing) ?

It turned out my expectation were too optimistic. Any time I tried to reduce the update rate, it would result in a correspondingly reduced compression ratio. With that experiment failed, I settled for an "optimal" sampling pattern, which became the core of LZ4.

Recently, I've revisited this method. After all, getting a lower compression ratio at a faster speed is not necessarily a bad outcome. It depends on user expectation. So maybe, should a user be allowed to select its own "optimal" speed/compression ratio, he may actually prefer another trade-off than the default one.

Enter LZ4_compress_fast(). It's a new function , available only in developer branch for the time being, which handles a single new parameter : int acceleration.

The concept is fairly simple : the higher the value of acceleration, the faster the compression. Correspondingly, compression ratio decreases too. It can be pretty fine-tuned, each acceleration level providing a little 3-4% speed boost, meaning one could select quite exactly its preferred speed range.

In order to get a taste of this new parameter, a few limited tests were run on the same corpus using different acceleration values. Here are some early results :

                    Compression   Decompression   Ratio
memcpy                4200 MB/s      4200 MB/s    1.000
LZ4 fast 50           1080 MB/s      2650 MB/s    1.375
LZ4 fast 17            680 MB/s      2220 MB/s    1.607
LZ4 fast 5             475 MB/s      1920 MB/s    1.886
LZ4 default            385 MB/s      1850 MB/s    2.101

Silesia Corpus in single-thread mode, Core i5-4300U @1.9GHz, compiled with GCC v4.8.2 on Linux Mint 64-bits v17.

It provides some hint of the relatively wide range of newly accessible speed/compression trade-offs.

The new function prototype is currently only accessible within the "dev" branch. It's still considered experimental, but may find its way into next release r129, depending on user feedback.

Having a parameter to accelerate, rather than strengthen, compression is an unusual concept, so it's not yet clear if it's a very good one. What do you think ? Is a faster and programmable version, trading compression ratio for more speed, a good idea to fit into LZ4 API ?

Edit : LZ4_compress_fast() is released as part of LZ4 r129.


  1. Here's an idea: should throughput be measured in raw MB/s or compressed MB/s?

    Clearly tweaking ratio has an impact here...

    1. It's always raw (uncompressed) MB/s, for both compression and decompression.

  2. Interesting. I've estimated "Compression + Transfer + Decompression" time for some transfer rates based on your benchmark [1]. The result is when transfer rate is between 100..300MBytes/s, LZ4_fast is faster than baseline.
    It matches to typical HDD's speed [2]. Also usual ARM machines speed is 1/3 to 1/10 of PCs, it may be good for SD cards [3].

    [1] https://gist.github.com/t-mat/8f271a7c5844946e990d
    [2] HDD Charts 2013 : http://www.tomshardware.com/charts/hdd-charts-2013/-02-Read-Throughput-Maximum-h2benchw-3.16,2900.html
    [3] SD Cards 2014 : http://www.tomshardware.com/charts/sd-cards-2014/-1-Sequential-Read-MB-s,3472.html

  3. Hi Yann,

    I think it's an excellent idea. Network devices using your method could adjust the compression ratio on the fly to keep the output link close to saturation. This way they would ensure to always achieve the highest uncompressed bandwidth, which is often what really matters in the end. For example, when sending disk backups over a dedicated link, what matters is the time it takes to synchronize data, not the remaining available bandwidth. When the CPU is the bottleneck, your method ensures we can reach higher speeds.

  4. Nice work! Being able to have control over the tradeoff between ratio vs. throughput is extremely useful.

  5. I would definitely like this. With flash storage becoming cheaper, memory intensive applications can use more of it to page out data and recover it. But compression is still required to keep it in target or to reduce cpu bottleneck when needed. Looking forward for this!