Thursday, May 26, 2011

LZ4 explained

 At popular request, this post tries to explain the LZ4 inner workings, in order to allow any programmer to develop its own version, potentially using another language than the one provided on Google Code (which is C).

The most important design principle behind LZ4 has been simplicity. It allows for an easy code, and fast execution.

Let's start with the compressed data format.

The compressed block is composed of sequences.
Each sequence starts with a token.
The token is a one byte value, separated into two 4-bits fields (which therefore range from 0 to 15).
The first field uses the 4 high-bits of the token, and indicates the length of literals. If it is 0, then there is no literal. If it is 15, then we need to add some more bytes to indicate the full length. Each additional byte then represent a value of 0 to 255, which is added to the previous value to produce a total length. When the byte value is 255, another byte is output.
There can be any number of bytes following the token. There is no "size limit". As a sidenote, here is the reason why a not-compressible input data block can be expanded by up to 0.4%.

Following the token and optional literal length bytes, are the literals themselves. Literals are uncompressed bytes, to be copied as-is.
They are exactly as numerous as previously decoded into length of literals. It's possible that there are zero literal.

Following the literals is the offset. This is a 2 bytes value, between 0 and 65535. It represents the position of the match to be copied from. Note that 0 is an invalid value, never used. 1 means "current position - 1 byte". 65536 cannot be coded, so the maximum offset value is really 65535. The value is stored using "little endian" format.

Then we need to extract the matchlength. For this, we use the second token field, a 4-bits value, from 0 to 15. There is an baselength to apply, which is the minimum length of a match, called minmatch. This minimum is 4. As a consequence, a value of 0 means a match length of 4 bytes, and a value of 15 means a match length of 19+ bytes.
Similar to literal length, on reaching the highest possible value (15), we output additional bytes, one at a time, with values ranging from 0 to 255. They are added to total to provide the final matchlength. A 255 value means there is another byte to read and add. There is no limit to the number of optional bytes that can be output this way (This points towards a maximum achievable compression ratio of ~250).

With the offset and the matchlength, the decoder can now proceed to copy the repetitive data from the already decoded buffer. Note that it is necessary to pay attention to overlapped copy, when matchlength > offset (typically when there are numerous consecutive zeroes).

By decoding the matchlength, we reach the end of the sequence, and start another one.

Graphically, the sequence looks like this :

Click for larger display

Note that the last sequence stops right after literals field.

There are specific parsing rules to respect to be compatible with the reference decoder :
1) The last 5 bytes are always literals
2) The last match cannot start within the last 12 bytes
Consequently, a file with less then 13 bytes can only be represented as literals
These rules are in place to benefit speed and ensure buffer limits are never crossed.

Regarding the way LZ4 searches and finds matches, note that there is no restriction on the method used. It could be a full search, using advanced structures such as MMC, BST or standard hash chains, a fast scan, a 2D hash table, well whatever. Advanced parsing can also be achieved while respecting full format compatibility (typically achieved by LZ4-HC).

The "fast" version of LZ4 hosted on Google Code uses a fast scan strategy, which is a single-cell wide hash table. Each position in the input data block gets "hashed", using the first 4 bytes (minmatch). Then the position is stored at the hashed position.
The size of the hash table can be modified while respecting full format compatibility. For restricted memory systems, this is an important feature, since the hash size can be reduced to 12 bits, or even 10 bits (1024 positions, needing only 4K). Obviously, the smaller the table, the more collisions (false positive) we get, reducing compression effectiveness. But it nonetheless still works, and remain fully compatible with more complex and memory-hungry versions. The decoder do not care of the method used to find matches, and requires no additional memory.

Note : the format above describes the content of an LZ4 compressed block. It is the raw compression format, with no additional feature, and is intended to be integrated into a program, which will wrap around its own custom enveloppe information.
If you are looking for a portable and interoperable format, which can be understood by other LZ4-compatible programs, you'll have to look at the LZ4 Framing format. In a nutshell, the framing format allows the compression of large files or data stream of arbitrary size, and will organize data into a flow of smaller compressed blocks with (optionnally) verified checksum.