Thursday, May 26, 2011

LZ4 explained

 By popular request, this post tries to explain the LZ4 inner workings, in order to allow any programmer to develop their own version, potentially using a language other than the one provided on Google Code (which is C).

The most important design principle behind LZ4 has been simplicity. It allows for simple code and fast execution.

Let's start with the compressed data format.

The compressed block is composed of sequences.
Each sequence starts with a token.
The token is a one-byte value, separated into two 4-bit fields (each therefore ranging from 0 to 15).
The first field uses the 4 high bits of the token, and indicates the length of literals. If it is 0, then there are no literals. If it is 15, then more bytes must be added to indicate the full length. Each additional byte represents a value from 0 to 255, which is added to the previous value to produce a total length. When a byte holds the value 255, yet another byte follows.
There can be any number of bytes following the token; there is no "size limit". As a side note, this is the reason why a non-compressible input data block can be expanded by up to 0.4%.
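In code, this length-extension rule can be sketched as follows (a hypothetical helper, not the reference implementation; the same scheme is reused later for match lengths):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: decode a length whose 4-bit token field may be extended.
   `p` points just past the token; on return it points past any extra
   length bytes. Hypothetical helper, for illustration only. */
static size_t read_length(const uint8_t **p, unsigned token_field)
{
    size_t length = token_field;        /* 0..15 from the token */
    if (token_field == 15) {            /* 15 means: more bytes follow */
        uint8_t b;
        do {
            b = *(*p)++;
            length += b;                /* each extra byte adds 0..255 */
        } while (b == 255);             /* 255 means "keep reading" */
    }
    return length;
}
```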

Following the token and optional literal length bytes are the literals themselves. Literals are uncompressed bytes, to be copied as-is.
There are exactly as many of them as the literal length just decoded. It's possible that there are zero literals.

Following the literals is the offset. This is a 2-byte value, between 0 and 65535. It represents the position of the match to be copied from, counted backwards from the current position. Note that 0 is an invalid value, never used. 1 means "current position - 1 byte". 65536 cannot be coded, so the maximum offset value is really 65535. The value is stored in "little endian" format.
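Reading the offset is then a simple little-endian reconstruction (illustrative sketch, not the reference code):

```c
#include <stdint.h>

/* Read a 2-byte little-endian match offset.
   Works regardless of host endianness. A returned value of 0
   is invalid per the format and should be rejected. */
static unsigned read_offset(const uint8_t *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}
```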

Then we need to extract the match length. For this, we use the second token field, a 4-bit value, from 0 to 15. There is a base length to apply, which is the minimum length of a match, called minmatch. This minimum is 4. As a consequence, a field value of 0 means a match length of 4 bytes, and a value of 15 means a match length of 19+ bytes.
Similar to the literal length, on reaching the highest possible value (15), we read additional bytes, one at a time, with values ranging from 0 to 255. They are added to the total to produce the final match length. A 255 value means there is another byte to read and add. There is no limit to the number of optional bytes that can be output this way. (This points towards a maximum achievable compression ratio of about 255, since each extra length byte encodes up to 255 matched bytes.)

With the offset and the match length, the decoder can now proceed to copy the repeated data from the already decoded buffer. Note that it is necessary to pay attention to overlapping copies, when matchlength > offset (typically when there are numerous consecutive zeroes).
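A byte-by-byte copy handles the overlapping case naturally, because bytes written at the start of the copy become readable as sources later in the same copy (illustrative sketch; `copy_match` is a hypothetical helper):

```c
#include <stdint.h>
#include <stddef.h>

/* Overlap-safe match copy: byte by byte, front to back, so that when
   matchlength > offset, bytes written earlier in this very copy are
   read back as sources (this is what makes RLE-like runs work). */
static void copy_match(uint8_t *dst, size_t offset, size_t matchlength)
{
    const uint8_t *src = dst - offset;
    while (matchlength--)
        *dst++ = *src++;
}
```

Note that `memcpy` would be incorrect here precisely because of the overlap; the reference implementation uses carefully ordered copies for the same reason.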

Having decoded the match length, we reach the end of the sequence, and start another one.

Graphically, the sequence looks like this:


Note that the last sequence stops right after the literals field.
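Putting the pieces together, a minimal block decoder could look like the following sketch. It is deliberately unsafe (no bounds checking at all) and is not the reference decoder, just an illustration of the format described above; `lz4_block_decode` is a made-up name:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Minimal, unsafe LZ4 block decoder sketch. Returns the number of
   decoded bytes. Assumes the input is a valid compressed block. */
static size_t lz4_block_decode(uint8_t *dst, const uint8_t *src, size_t src_len)
{
    uint8_t *d = dst;
    const uint8_t *s = src, *end = src + src_len;

    while (s < end) {
        unsigned token = *s++;

        /* 1. literal length: high 4 bits, extended while bytes read 255 */
        size_t lit_len = token >> 4;
        if (lit_len == 15) {
            uint8_t b;
            do { b = *s++; lit_len += b; } while (b == 255);
        }

        /* 2. copy literals verbatim */
        memcpy(d, s, lit_len);
        d += lit_len; s += lit_len;
        if (s >= end) break;            /* last sequence: literals only */

        /* 3. 2-byte little-endian offset */
        size_t offset = (size_t)s[0] | ((size_t)s[1] << 8);
        s += 2;

        /* 4. match length: low 4 bits, extended likewise, plus minmatch=4 */
        size_t mat_len = token & 15;
        if (mat_len == 15) {
            uint8_t b;
            do { b = *s++; mat_len += b; } while (b == 255);
        }
        mat_len += 4;

        /* 5. overlap-safe copy from the already decoded output */
        const uint8_t *match = d - offset;
        while (mat_len--) *d++ = *match++;
    }
    return (size_t)(d - dst);
}
```

A hand-built block exercises it: token 0x44 (4 literals, match length 4+4), literals "abcd", offset 4, then a final literals-only sequence.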

There are specific parsing rules to respect in order to be compatible with the reference decoder:
1) The last 5 bytes are always literals
2) The last match cannot start within the last 12 bytes
Consequently, an input with fewer than 13 bytes can only be represented as literals
These rules are in place to benefit speed and ensure buffer limits are never crossed.

Regarding the way LZ4 searches and finds matches, note that there is no restriction on the method used. It could be a full search, using advanced structures such as MMC, BST or standard hash chains, a fast scan, a 2D hash table, or whatever else. Advanced parsing strategies can also be used while respecting full format compatibility (as typically achieved by LZ4-HC).

The "fast" version of LZ4 hosted on Google Code uses a fast scan strategy, which relies on a single-cell-wide hash table. Each position in the input data block gets "hashed", using its first 4 bytes (minmatch). Then the position is stored in the table at the hashed index.
The size of the hash table can be modified while respecting full format compatibility. For memory-restricted systems, this is an important feature, since the hash size can be reduced to 12 bits, or even 10 bits (1024 positions, needing only 4 KB). Obviously, the smaller the table, the more collisions (false positives) we get, reducing compression effectiveness. But it nonetheless still works, and remains fully compatible with more complex and memory-hungry versions. The decoder does not care about the method used to find matches, and requires no additional memory.
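As an illustration, the hashing step can be sketched as below. The multiplier is the Fibonacci-hashing constant used by the reference implementation (lz4.c); the surrounding names (`HASH_LOG`, `hash_position`) are simplified assumptions, not the reference code:

```c
#include <stdint.h>
#include <string.h>

#define HASH_LOG 12   /* 4096 cells; shrink to 10 for ~4 KB tables */

/* Hash the 4 bytes (minmatch) at position p into a HASH_LOG-bit index:
   multiply by a large prime and keep the top bits. Collisions merely
   reduce compression; they can never break format compatibility. */
static uint32_t hash_position(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);                     /* read first 4 bytes */
    return (v * 2654435761U) >> (32 - HASH_LOG);
}
```

A fast-scan compressor then keeps one table of positions indexed by this hash, overwriting each cell as it scans forward.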

Note: the format above describes the content of an LZ4 compressed block. It is the raw compression format, with no additional feature, and is intended to be integrated into a program, which will wrap it inside its own custom envelope information.
If you are looking for a portable and interoperable format, which can be understood by other LZ4-compatible programs, you'll have to look at the LZ4 Framing format. In a nutshell, the framing format allows the compression of large files or data streams of arbitrary size, and organizes data into a flow of smaller compressed blocks with (optionally) verified checksums.


  1. I read your spec to reimplement from the description. I think it is complete, the only thing that surprised me is that the matches are allowed to overlap forwards. It might be worth mentioning. My impl was intended to experiment with vectorization but in the end it did not work.

    1. Yes, this is a classic LZ77 design.
      With matches authorized to overlap forward, we get the equivalent of RLE (Run Length Encoding) for free, and even repeated 2-byte / 4-byte sequences, which are very common.
      This is in contrast with LZ78, for example, which never takes advantage of overlap. Neither do PPM, BWT, etc.

      That being said, I'm not sure I understand how it prevented your vectorization experiment from working.


    2. With the specification of "The last match cannot start within the last 12 bytes" to be handled by the reference decoder, it is not quite an equivalent of RLE for free. However, anything that compresses well with RLE is very likely going to compress well with LZ4, unless you pick a worst case of unique byte tokens repeated 12 at a time.

    3. Not to be confused: the minimum match length is still only 4.
      So, if a token is repeated 12 times, it will be caught by the algorithm.

      The only exception is for the last 12 bytes within the input data. Even though this restriction is supposed to have a negative impact on compression ratio, its impact on real-life data is negligible.

  2. Thank you for creating a clear and easily understood specification.

    I believe you should add a specification of the size limit of literal length and match length. As specified currently, a correct decoder must be able to process an infinite number of bytes in either field. The best would be to specify the maximum value (not length) of either field as a power of two. The maximum length can then be inferred.

    There is a typographical error in your specification: "additional" is the correct spelling.

    1. It was in the initial spirit of the specification that sizes (of literal length or match length) can be unlimited.
      In practice though, it is necessarily limited by the maximum block size that the current implementation supports, which is ~1.9 GB.
      A future implementation may support larger block sizes though.

      There is also a theoretical issue with limiting literal length: in the case of compressing an encrypted file, it is possible that the compressed output consists only of literals. In this case, the literal length is the same as the size of the file. Thus, it cannot be limited.

      Would you mind telling why you think enforcing a limit on length would be beneficial?

      Typo corrected. Thanks for the hint.

    2. Hello Yann. Perhaps it is pedantic, but with no limit specified, a "correct" implementation is impossible to create.
      Another concern is efficiency. To determine a length field value > 14, "we need to add some more bytes to indicate the full length. Each additional byte then represent a value of 0 to 255, which is added to the previous value to produce a total length. When the byte value is 255, another byte is output."

      I understand you chose to add byte values, rather than use a compressed integer, such as using the bottom 7 bits as the next most significant bits (with the top bit signaling that another byte follows). I believe this was to reduce the output size when the lengths are small. However, with an unlimited length field, we can have a huge number of bytes representing the length. So it seems there must be a limit on the length, or the compression becomes inefficient.

    3. Regarding maximum size:
      since block input size is currently limited to 1.9 GB, what about limiting lengths to this value too?

      Regarding length encoding:
      The LZ4 format was defined years ago. Initially, it was just created to "learn" compression, so its primary design goal was simplicity. High speed was then a "side effect". Since then, priorities have somewhat reversed, but the format has remained "stable", a key property to build trust around it.

      I can understand that different trade-offs can be invented, and may seem better. And indeed, if I had to re-invent LZ4 today, I would probably change a few things, including the way "big lengths" are encoded.

      But don't expect these corner-case scenarios to really make a difference in "normal" circumstances. A few people have already attempted such variants, and found that in most circumstances, the difference is small (<1%) if only the length encoding is modified.

      Larger differences can be achieved by modifying the fixed 64 KB window, allowing repetitions at larger distances, but with a bigger impact on performance and complexity. (You can have a look at Shrinker and LZnib for example.)

    4. "the difference is small (<1%) if only length encoding is modified": I expect this to be true if only small length values are encoded. This was why I expected a fairly small length limit: I assumed the LZ4 format was only useful in cases where lengths are small, as the length encoding is poor for large lengths. A length of 16K (2^14) requires 65 bytes, 64K requires 257 bytes, 256K requires 1028 bytes, and so forth. I am not speaking of the algorithm to compute the literals, but simply the length representation. Whether such lengths would be computed, I don't know.
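For reference, the number of extra length bytes this scheme requires is easy to compute (sketch; `extra_length_bytes` is a hypothetical helper, derived from the extension rule described in the post):

```c
#include <stddef.h>

/* Extra bytes needed after the token for a given length value:
   lengths below 15 fit in the token field; beyond that, each extra
   byte carries 0..255, with 255 meaning "one more byte follows". */
static size_t extra_length_bytes(size_t length)
{
    if (length < 15) return 0;
    return (length - 15) / 255 + 1;
}
```

This reproduces the figures above: 65 extra bytes for a 16 KB length, 257 for 64 KB, 1028 for 256 KB.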

    5. I cannot say what the best limit would be; you would have to decide, as you are the expert. Hopefully I explained why it was surprising that the limit is currently infinite, and why I expected a small limit.

    6. Sure. It's possible to introduce the notion of "implementation-dependent limit".

      For example, current LZ4 reference C implementation has an implementation-limit of 1.9 GB. But other implementations could have different limits.

      This seems more important for decoders. So, whenever a decoder detects a length beyond its limit, it could refuse to continue decoding, and send an error message instead.

    7. There's a way to handle the encoding limit without hurting the compression ratio or limiting the total file size (e.g. streaming-compatible): you just have to specify that a given literal length is never followed by an offset (just like the last block). That way you can easily have a literal-length limit of 1 GB (30 bits), and if you want to encode literals larger than this, you just have to stop at 1 GB and start a new literal run, which will just take a few bytes every gigabyte. BTW, thanks for the description and kudos for this smart and fast design!

    8. Yes, I realized that point later on.
      Unfortunately, by that point, LZ4 was already widely deployed, with a stable format. It's no longer possible to change that now.

      A correct "limit" for streaming would probably be something like ~4KB. There is a direct relation between this limit and the amount of "memory buffer" a streaming implementation must allocate.

      Currently, my "work around" to this issue is to use "small blocks", typically 64KB. So the issue is solved by the upper "streaming layer" format.

  3. I'd like to see a bigger version of the tiny picture.

    Would you be so kind and upload one?

    1. Which tiny picture are you talking about? The first one, on the top left? It's just an illustration.

  4. Did you also evaluate Base-128 Varints (how Protocol Buffers encode ints) for lengths? I assume they might be slightly smaller, but slower, since they require more arithmetic operations.

    1. Not for the compression format. The loss of speed would be significant.

  5. Can you tell me how much memory is consumed by LZ4 during decompression?

    1. The algorithm itself doesn't consume any memory.

      The memory used is limited to the input and output buffers. So it's implementation dependent.

    2. Sir, I have only 64 KB of memory, so what should I do? What should be defined as the chunk size?

    3. Well, it can be any value you want.
      The LZ4 compression algorithm (lz4.c & lz4.h) doesn't define a chunk size. You can select 64 KB, 32 KB, 16 KB, or even a weird 10936 bytes; there is no limitation. This parameter is fully implementation specific.
      Since I don't know the source of the data, the surrounding buffer environment, etc., it's not possible to be more precise.

      LZ4 is known to work on systems with specs as low as the 1979 Atari XL or the 1984 Amstrad. So there is no blocking point in making it work within 64 KB.


  6. Compressing directories, or even file attributes, is outside the scope of LZ4. LZ4 has the same responsibility as zlib, and therefore compresses "streams of bytes", irrespective of metadata.

    To compress a directory, there are 2 possible methods:

    1) On Windows: use the LZ4 installer program. It will enable a new context-menu option when right-clicking a folder: "compress with LZ4". The resulting file will be the compressed directory. You can, of course, regenerate the directory by decompressing the file (just double-click on it).

    2) On Linux: use 'tar' to aggregate the directory content, and pipe the result to lz4 (exactly the same as with gzip).

  7. Thanks for the excellent and concise description. There's only one detail that seems to be missing (or maybe I missed it): you say that the token gets divided into two four-bit fields. However, I don't think you said how those pack into the byte.

    Are they packed little endian with the first field in bits 0-3 and the second field in bits 4-7, or big endian?

    I suppose I could go look at the code. It just seems a shame that it isn't included with this otherwise-complete looking description.

    1. The format is explained in more detail in the file LZ4_format_description.txt, which is provided with the source code and can also be consulted online.

      Within it, you'll find a more precise answer to your question:
      "Each sequence starts with a token.
      The token is a one byte value, separated into two 4-bits fields.
      Therefore each field ranges from 0 to 15.

      The first field uses the 4 high-bits of the token.
      It provides the length of literals to follow."

      I'll update this blog post to add this information.

  8. First of all thanks for this amazing LZ4 and also for this short explanation.
    If I got this right, since the offset is hard-coded to 2 bytes, we can only reference matches up to 64 KB back. And this could explain why, in my use case, the compression ratio is not increasing when I feed it bigger buffers, even when the repeated data is very frequent.
    So here comes the question: theoretically, if memory is not an issue, by increasing the offset to e.g. up to 32 bits (no longer fixed-size at that point) and making the hash table bigger, we could achieve a more than trivial improvement in compression ratio on bigger buffers.
    Does this sound good or am I missing something?
    Thanks in advance.

    1. Hi Carlo.

      Indeed, a 32-bit offset would open larger perspectives to find duplicated sequences. However, it would also make offsets more costly.

      Currently, a "match" (duplicated sequence) costs approximately 3 bytes: a 2-byte offset + a 1-byte token. Your proposal would increase that cost to 5 bytes. That means you'll need matches of length at least 6 to compress. It also means that matches need to be, on average, 2 bytes longer to pay off.

      Will that work? Well, maybe. The problem is, there is no single definitive answer to this question. You'll have to test your hypothesis on your use cases to find out if this modification is worthwhile.

      Some trivial corner cases:
      1) Input source size <= 64 KB? The modification will obviously be detrimental.
      2) Large repeated pattern at a distance > 64 KB? The modification will obviously benefit a lot.

      Unfortunately, real use cases are more complex.
      My guess is that, on average, the 32-bit offset strategy will cost more than it gains. But then again, if your use case contains large repeated patterns at large distances, it will more likely be a win.

    2. Thanks for the advice, Yann.
      I was also underestimating the power of the L1 cache: when increasing the hash table beyond your good default, performance always starts to decrease anyway.
      But at the same time I had the chance to play a bit with it, and I seem to have found a good tradeoff, working decently in all use cases. I'll drop you an email with more details through the dedicated form.
      Thanks again.

  9. I am using lz4mt, the multi-threaded version of LZ4, and in my workflow I am sending thousands of large files (620 MB) from client to server; when a file reaches the server, my rule triggers and compresses it using LZ4, then removes the uncompressed file. The problem is that sometimes when I remove the uncompressed file, I don't get a compressed file of the right size, because lz4 returns immediately, before the output reaches the disk.
    So is there any way lz4 can remove the uncompressed file itself after compressing, as done by bzip2?
    Input: bzip2 uncompress_file
    Output: Compressed file only

    Input: lz4 uncompress_file
    Output: (Uncompressed + Compressed) file
    If possible please tell me as soon as possible.

    1. I'm not sure I fully understand your question, but since it is unrelated to the LZ4 block format, could you please redirect it to the LZ4 forum board:!forum/lz4c

      lz4mt is Takayuki's work, so he will probably be in better position to answer your question on the forum board.

  10. Submit an Internet Draft (

    1. Well, if anyone has experience with this process, they're welcome.

  11. Is it correct to assume that this format can't compress more than 255x?

    Assuming that you have a giant block of memory that's simply cleared to 0, when you encode the match length it'll keep adding to the length one byte at a time, so you'll only get 2^8 of length per byte...

    I have some data that I'd like to use lz4 on, but it occasionally has some very, very long runs, so that worries me a bit.

    1. Yes, that's correct.
      Although if your data is really very redundant, you may have some luck combining LZ4 with a second pass, using LZ4 HC for example.

      The reason is, LZ4 keeps the property of generating "byte aligned" data, which traditional compressors (zip etc.) do not. So it can be used as a kind of "preprocessor".

      For more information, see notes from Vitor Oliveira :!msg/lz4c/DcN5SgFywwk/AVMOPri0O3gJ

    2. Thanks for the quick reply, that makes perfect sense. I think that strategy might work here, thanks!

  12. Yann, it looks like an SSE version of LZ4 decode can be up to twice as fast as pure C code, this is at least what my testing seems to show!

    1. Wow, a 100% speed increase is quite a serious feat, indeed.

      In my (previous) experiments, the SSE version wasn't bringing anything close to that. But SSE was just limited to replacing LZ4_wildCopy(). Maybe you have some more drastic changes in mind.

      I can only encourage you to complete your testing and then share the relevant details when you feel it's ready enough.

    2. The key idea is to handle any match that occurs within 16 bytes of the current output point with RLE-style stores; this way the output can be generated by stores only, with no need to do any actual copies.

      I use the SSSE3 PSHUFB opcode (which was first available as a permute in Altivec/PowerPC) to take in the 16 bytes beginning at the match location, then shuffle those bytes into two target SSE registers so that they contain the next 32 bytes to be written.

      In the store loop I write those two registers in each iteration, then increment the write pointer by the highest value which is a multiple of the offset and fits in 32 bytes.

      I.e. if the offset was 3 bytes, then I will write 30 bytes in each iteration, which means that most matches will only take a single iteration.

      Here's the entire SSE decoding function; please note that it has absolutely zero error handling!

      /* Note: the blog formatting stripped the original indentation,
         the __m128i table initializers, and (apparently) the braces and
         the `start:` label; the structure below is reconstructed from
         the control flow, with the table contents left elided. */
      #include <stdint.h>
      #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */
      typedef unsigned char byte;

      static byte stepSize16[17] = {16,16,16,15,16,15,12,14,16,9,10,11,12,13,14,15,16};
      static __m128i replicateTable[17] = { /* initializer contents stripped by the blog */ };
      static byte stepSize32[17] = {32,32,32,30,32,30,30,28,32,27,30,22,24,26,28,30,16};
      static __m128i replicateTable2[17] = { /* initializer contents stripped by the blog */ };

      unsigned lz4_decode_sse(byte *dest, byte *src, unsigned len)
      {
          byte *d = dest, *e = src+len;
          unsigned token, lit_len, mat_len;
          __m128i a;
          byte *dstore, *msrc;

          if (!len) return 0;
          goto start;                       /* first sequence has no match part yet */

          do {
              unsigned mat_offset = src[0] + (src[1] << 8);
              src += 2;
              msrc = d - mat_offset;
              if (mat_len == 15) {
                  do {
                      token = *src++;
                      mat_len += token;
                  } while (token == 255);
              }
              mat_len += 4;

              uint32_t step;
              dstore = d;
              d += mat_len;

              if (mat_offset <= 16) { // Bulk store only!
                  a = _mm_loadu_si128((const __m128i *)msrc);
                  __m128i a2 = _mm_shuffle_epi8(a, replicateTable2[mat_offset]);
                  a = _mm_shuffle_epi8(a, replicateTable[mat_offset]);
                  step = stepSize32[mat_offset];
                  do {
                      _mm_storeu_si128((__m128i *)dstore, a);
                      _mm_storeu_si128((__m128i *)(dstore+16), a2);
                      dstore += step;
                  } while (dstore < d);
              } else {
                  do {
                      a = _mm_loadu_si128((const __m128i *)msrc);
                      _mm_storeu_si128((__m128i *)dstore, a);
                      msrc += sizeof(a);
                      dstore += sizeof(a);
                  } while (dstore < d);
              }

          start:
              token = *src++;
              lit_len = token >> 4;
              mat_len = token & 15;
              if (token >= 0xf0) { // lit_len == 15
                  do {
                      token = *src++;
                      lit_len += token;
                  } while (token == 255);
              }
              dstore = d;
              msrc = src;
              d += lit_len;
              src += lit_len;
              do {
                  a = _mm_loadu_si128((const __m128i *)msrc);
                  _mm_storeu_si128((__m128i *)dstore, a);
                  msrc += sizeof(a);
                  dstore += sizeof(a);
              } while (dstore < d);
          } while (src < e);

          return (d-dest);
      }

    3. That's definitely very clever.

      I was initially skeptical that only overlapping matches could produce such a boost, since they are supposed to account for only a minority of situations, but I see you are using SSE load/store instructions everywhere, for normal matches & literals too, so yes, I guess it must add up.

      There will be a need to add some "clean conditions" to ensure memory guarantees at the borders, but overall, this looks very good.

      I can't wait to test your code...

    4. I had a chance to test your code using Visual 2012 compiler.

      It required only a few minor tweaks to get it running (tmmintrin.h, mostly).
      Here are some preliminary benchmark numbers,
      using fullbench test program, on the Silesia test corpus.

      LZ4_decompress_safe : 1770 MB/s
      LZ4_decompress_fast : 1850 MB/s
      lz4_decode_sse : 2050 MB/s

      So there is definitely an improvement, but I can't reproduce the +100% claim using this setup.

      I had less luck with GCC and Clang.
      Both complain a lot about initialization tables.
      Interestingly, both also warn : unused variable 'stepSize16'

      Perhaps more importantly, both fail the corruption test, which wasn't the case for Visual.

      I had to disable the corruption test to run the benchmark, and finally got these results:

      LZ4_decompress_safe : 1630 MB/s
      LZ4_decompress_fast : 1715 MB/s
      lz4_decode_sse : 1900 MB/s

      Here also, a measurable gain >~ 10%, but no double speed.

      The difference in results might be explained by the test data corpus.
      Silesia is a relatively good corpus, avoiding pathological datasets from which wrong conclusions could be drawn. I can only recommend using it for benchmark tests.

      Best Regards

  13. PS. Sorry about the formatting, the blog sw stripped my nice indenting. :-(

  14. I have downloaded the Silesia Corpus, and like you I don't see the same huge wins as I got from a set of very small files (less than 64 KB output for each of them).

    OTOH, any win is a win, right? :-)

    I'm thinking that it would be useful to implement dictionaries (& chaining) by simply allocating an initial 64 KB buffer before the input file read location, and to get rid of most of the buffer overflow tests by making the output buffer large enough that it is sufficient to check for those conditions when handling an extended-length chunk, then switch to slower/safer code when/if you get too close to the end.

    I.e. an extended version of what you are currently doing in some places.

    I am still using VS2008 which might be the cause of some minor performance differences, but probably not anything significant: LZ4 decoding with SSE should be dominated by two things:

    Branch misses and any extra memory costs due to unaligned operations and/or partial memory overwrites when I have RLE coding of non-power-of-two chunks.

    I will try and see if it is better to always copy 16-byte chunks instead of getting 22-32 bytes of sometimes overlapping chunks, but I don't expect that to be a win: I suspect pretty much all odd-offset RLE runs to be so short as to fit in a single iteration.

    I still think that using PSHUFB to fix up short RLE offsets is very nice, avoiding a lot of single-byte copies, but when such RLE compression turns out to be rare, it doesn't really matter.

    Re. stepSize16: this was from a version that used a single SSE register for RLE chunks, limiting the step size to the 9-16 byte range instead of the current 22-32. The difference between the two seems minuscule.

    1. It's great code you've put together.
      If you decide to make it public, through a GitHub repository for example, I can certainly link your SSE implementation from the LZ4 homepage, even if it consists "solely" of an optimized decoder. Other platform-specific decoders are linked there too.

  15. I've done some more testing; it seems the only really significant optimization is in the handling of RLE patterns of up to 16 bytes, which I handle with stores only instead of copy operations: this reduces the memory/cache traffic significantly.

    Using just the PSHUFB trick to fixup the first 16 bytes so that the rest can fall into the full-size (16 or 32 bytes) copy loop is 10-20% slower on webster, x-ray and xml. (All compressed with -9)

  16. Generally speaking most modern compression algorithms give roughly the same compression, and with regard to the number of cores that you can use at once, it is up to you to decide how many you want to use. Generally speaking (unless you are creating large archives) there is no reason to need more than one though. In addition, with multiple cores doing the compression, the bottleneck may become the hard drive. Legacy zip compression is akin to the Deflate method in 7-zip, and will offer the most compatibility between different compression software.


  17. How can I recognize that a particular sequence is the last one?
    The mere fact that its literal length is 5 surely isn't enough, since there might be other literal runs 5 bytes long, of course.
    Similarly, match length being 0 is not enough, since after adding minmatch=4 it is no longer 0 but 4, and there surely can be matches of that length as well.
    So what is the actual condition that indicates that a particular sequence is the final one?
    Or maybe it is recognized by the fact that there is no backref field after the literals? I can't think of anything else that could indicate that we're done, but this seems to be a bit of a stretch, because if the file is corrupted and it so happens that the corruption just erased the backref word, this should be an error, but instead it would be recognized as the final sequence, with no error :q
    Can you help me with solving this mystery? I can't find the answer anywhere in the specification.

    1. In general, the last sequence is identified as such because the code just reached the end of the input buffer. Therefore, the most common method is to provide the size of the compressed block to the decoding function.

      A less common method, but still a possible one, is to provide the size of data to regenerate. On reaching that amount, the decoder knows that it's done, and can determine how many bytes it consumed from input.

      Therefore, additional metadata (size of compressed block, or size of data to regenerate) is required to properly decode an LZ4 block.

      Note that LZ4 is not designed to guarantee that a corruption will necessarily be detected. Without an additional checksum, such as XXH32 employed in LZ4frame, there are countless corruption events that can happen and remain undetected by the block decoder.

      That being said, not respecting the end-of-block parsing conditions isn't one of them. If a compressed block doesn't finish with 5+ literals, it's likely to be flagged as an error by most versions of the reference decoders.