Thursday, August 15, 2013

Inter-block compression

One of the new capabilities defined in the LZ4 Streaming format is inter-block compression. While the concept is simple to state (the next block shall be compressed using information from previous block(s)), no implementation of it has existed so far.

Up to now, the LZ4 compression utility has done something relatively simple: cut the data to be processed into independent blocks, compress them one by one (or decode them one by one), and assemble the results into the destination file or stream.
It is simple and it works.

The problem: the smaller the block size, the worse the compression ratio.
This can be clearly seen in the following table:

File: Silesia.tar / Independent blocks
Block Size     LZ4     LZ4 HC
4 MB          47.97%   36.82%   
1 MB          48.03%   36.99%
256 KB        48.27%   37.68%
64 KB          (*)     40.41%

(*) Note: the 64KB version of LZ4 uses a dedicated algorithm which improves compression ratio, and is therefore not directly comparable with the larger block sizes. Nonetheless, for the record, the result is 48.34%.

As can be seen, the situation gets worse with each reduction in block size. That's because the first 64KB of each block starts with an empty dictionary, which hurts compression. LZ4 HC is, incidentally, much more affected than the fast LZ4: thanks to its better parsing methodology, it extracts more benefit from earlier data, and therefore loses the most when that data is unavailable.
One can also note that, starting with the 64KB block size, the loss in compression ratio becomes quite noticeable. It can only get worse from this point towards smaller sizes.

One workaround is simply not to use small blocks. It works fairly well: lz4c uses the maximum block size (4 MB) by default for file compression.
But there are circumstances when this choice is not possible.

For example, suppose your application sends packets in real time to a server. Because the application must react in real time with low latency, you can't wait to amass 4 MB of data before compressing it and sending it to the server. Data must be sent at application-defined moments, no later.

Such a packet of data is typically small, at most a few KB.
This is very small: if each packet is compressed on its own, the compression ratio is likely to be poor.

Even more importantly, data packets tend to have a lot of elements in common, such as headers and identifiers. But since the repetition can only be noticed across multiple packets, it is lost if each packet is processed independently.

So the idea is to compress the next packet using information from previous packets.

It may seem simple, but this directly contradicts a current assumption in the LZ4 algorithm, which states that each compression function starts from a clean state.

Well, no longer. The soon-to-be-released r102 of LZ4 features new compression functions that continue the compression process from where it previously stopped.
To make this possible, the compression context is now an explicit argument of the new functions. The context must be created beforehand, using a dedicated function, and then provided at each round to the new compression functions.
(Note: the usual compression functions are still there and work the same, so nothing changes for existing LZ4 users.)
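
To make this more concrete, here is a minimal sketch of what a compression loop built on these new functions could look like. LZ4_compress_continue() is the function name mentioned in the comments below; the context-management calls (LZ4_create() / LZ4_free()) and all signatures shown here are assumptions based on the description above, so check the lz4.h of the actual release for the real prototypes.

#include <stdio.h>
#include "lz4.h"

/* Minimal sketch of inter-block compression.
   LZ4_compress_continue() is named in this post; LZ4_create() / LZ4_free()
   and the exact signatures are assumptions based on the description above. */

#define PACKET_SIZE   (4*1024)      /* example: 4 KB packets                    */
#define INPUT_BUFFER  (256*1024)    /* previous packets must remain accessible  */
#define OUT_BUFFER    (PACKET_SIZE + PACKET_SIZE/255 + 16)   /* worst-case LZ4 expansion */

int compressStream(FILE* fin, FILE* fout)
{
    static char inBuffer[INPUT_BUFFER];
    char outBuffer[OUT_BUFFER];
    char* inPtr = inBuffer;
    void* ctx = LZ4_create(inBuffer);   /* context remembers previously seen data */
    if (ctx == NULL) return -1;

    /* For simplicity, this sketch stops when the input buffer is full;
       a complete implementation would recycle the buffer while keeping
       the most recent data available as dictionary. */
    while ((inPtr - inBuffer) + PACKET_SIZE <= INPUT_BUFFER)
    {
        int readSize = (int) fread(inPtr, 1, PACKET_SIZE, fin);
        if (readSize == 0) break;

        /* compress this block using matches found in previous blocks */
        int cSize = LZ4_compress_continue(ctx, inPtr, outBuffer, readSize);
        fwrite(outBuffer, 1, (size_t)cSize, fout);

        inPtr += readSize;
    }

    LZ4_free(ctx);
    return 0;
}

The decoder has to play the same game: blocks must be decoded in order, with previously decoded data still accessible, which is also why lost or reordered blocks are a problem (see the discussion in the comments below).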

As a consequence, these new functions should improve the compression ratio. And that's exactly what can be measured:

File: Silesia.tar / Inter-block compression
Block Size     LZ4     LZ4 HC  (HC Independent blocks)
4 MB          47.95%   36.76%      (36.82%) 
1 MB          47.95%   36.76%      (36.99%)
256 KB        47.96%   36.77%      (37.68%)
64 KB         47.97%   36.78%      (40.41%)

There are a few lessons from this experiment:

1) Compression ratio does improve a little. This was expected: with previous data available, the initial 64KB of each block can be compressed to its full potential.
One can notice that, once again, LZ4 HC benefits more from it. That's still for the same reason: better parsing, making better use of the available dictionary.

2) Compression ratio does not degrade with smaller block sizes.
Or, to be fair, it does worsen, but by a tiny amount compared to the previous situation.

This property is key: it opens the door to very small block sizes (such as network packets of a few KB) while keeping a compression ratio close to optimal.

This was the target property of this exercise. It represents a significant step towards an LZ4 Streaming API.

5 comments:

  1. You mentioned in this post a "current assumption in the LZ4 algorithm, which states that each compression function starts from a clean state." Does this mean that every call to the compression function makes no use of a dictionary or previously compressed data? Even in the network packet scenario you described, would each packet then be compressed independently (but poorly)?

    Also, are you assuming that because we are mostly dealing with TCP traffic, the inter-block implementation will not suffer from lost/missing packets? Otherwise, will missing packets result in problems during decompression?

    Can you also explain the difference between the LZ4 streaming format (in http://fastcompression.blogspot.fr/2013/04/lz4-streaming-format-final.html) and the compressed data format (in http://fastcompression.blogspot.sg/2011/05/lz4-explained.html)? What should I expect to see if I were to use the current version of LZ4 on packets? Would I see the streaming format (e.g. 4-byte magic number etc) on every packet?

    BTW, your blog is very informative. Thank you for explaining the LZ4 algorithm.

    1. OK, that's a lot of questions, so let's try to answer them one by one.

      > Does this mean that every call to the compression function does not make use of a dictionary/previous compressed data?

      When using the "normal" version (such as LZ4_compress() for example), yes, indeed. Each compression is independent.

      In order to benefit from previous data, it is necessary to use newer functions, such as LZ4_compress_continue().

    2. > are you assuming that because we are mostly dealing with TCP traffic, the inter-block implementation will not suffer from lost/missing packets ?

      This is out of scope. Of course, for decompression to work correctly, it is mandatory to receive all previous data in sequence.
      Such a guarantee is the purpose of another layer, such as TCP. In the case of UDP, you will have to either build an intermediate layer providing the same guarantee as TCP, or forget the idea of using previous packets and compress each packet independently.

      Note that, for the scenario involving independent packet compression, it is still possible to use a "static dictionary". Each packet will remain independent, but will benefit from the dictionary.

      "static dictionary" compression is also part of the LZ4 Stream specification.

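      As a rough illustration of that idea: with the dictionary functions of later LZ4 releases (LZ4_loadDict() / LZ4_compress_fast_continue(), which this post predates), compressing each packet independently against a shared static dictionary could look like the sketch below. The exact calls are illustrative only, not the API discussed in this post.

      #include "lz4.h"

      /* Sketch: independent packets, each compressed against the same
         static dictionary.  These functions come from later LZ4 releases
         and are used here only to illustrate the principle. */
      int compressPacketWithDict(const char* dict, int dictSize,
                                 const char* packet, int packetSize,
                                 char* dst, int dstCapacity)
      {
          LZ4_stream_t stream;
          LZ4_initStream(&stream, sizeof(stream));  /* fresh state: packets stay independent */
          LZ4_loadDict(&stream, dict, dictSize);    /* preload the shared dictionary */
          return LZ4_compress_fast_continue(&stream, packet, dst, packetSize, dstCapacity, 1);
      }

      The decoder would use the same dictionary, for example through LZ4_decompress_safe_usingDict().
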
    3. > Can you also explain the difference between the LZ4 streaming format and the compressed data format ?

      The streaming format is a layer above the compression format.

      The streaming format defines a way to cut input data into blocks (chained or independent), and check for data integrity.

      Inside each block, the LZ4 compression format is used.

    4. Thank you so much for taking the time to reply!
