Thursday, April 12, 2012

Detecting errors

After giving it a moment's thought, it seems clear to me that an important feature expected from the framing format is a sufficiently strong guarantee that the regenerated data is correct.
Surely there are also valid usages of LZ4 which do not need error detection and its associated CPU and bandwidth overhead, so this feature will remain optional in the final specification, though nonetheless recommended.

This requirement is really very common. A reasonable method to check for unintentional errors (such as transmission or storage ones) is a simple checksum. Typically, a good 32-bit checksum makes it very improbable that random changes go unnoticed: assuming corruption leaves the checksum effectively random, the chance of an accidental match is 1 in 2^32, i.e. less than one in four billion. This method is used by Zip, among others.

One could note that the LZ4 decoder already includes some form of detection for malformed input. That's correct, but it's not enough, because it only checks that the provided instructions are valid. Its main objective is to forbid any attempt to read or write outside of the memory buffers. Hence, it cannot detect, for example, a corrupted literal within the compressed stream.
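
As an illustration, here is roughly what this existing protection amounts to, using the safe decoding function from the current C reference implementation (buffer names and sizes are just placeholders):

    #include <stdio.h>
    #include "lz4.h"

    /* Decode one compressed block. LZ4_decompress_safe() returns a negative
     * value when the compressed instructions are malformed, which prevents
     * any read or write outside the provided buffers.
     * It cannot, however, notice a literal byte that was silently flipped. */
    int decode_block(const char* compressed, int compressedSize,
                     char* decoded, int decodedCapacity)
    {
        int result = LZ4_decompress_safe(compressed, decoded,
                                         compressedSize, decodedCapacity);
        if (result < 0) {
            fprintf(stderr, "Malformed compressed block detected\n");
            return -1;
        }
        return result;   /* number of decoded bytes */
    }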

On the other hand, one could also require the method to guarantee that the input data was never modified, neither randomly nor intentionally (data integrity). Such an objective, however, seems out of scope.
To begin with, it requires an electronic signature of at least 128 bits, such as MD5. But even MD5 is now considered too weak for this use, and should be superseded by stronger hashes such as the 160-bit SHA-1.
One obvious issue is speed: what's the use of a super-fast compression/decompression algorithm if the checksum function is several times slower?
But the most important issue is that the signature alone is not enough. The signature's mission is to make it almost impossible to find a modification of the original source file which produces the same electronic signature. But since the electronic signature is provided within the framing format, a malicious attacker would just need to change both the content and the signature.
Therefore, effective protection would also require encryption, or separate storage and transmission of the signature, which is well beyond the objective of this framing format.
So we'll just consider random data corruption from now on.

The next question is: what has to be checksummed?
Due to the popularity of the Zip format, an obvious answer could be "the original data".
But let's look into it; it's not that obvious.

Take the snappy framing format for example. It checksums the original data of each chunk individually, each chunk being <= 32 KB. The main advantage is that an incorrect chunk is detected immediately; there is no need to wait for the end of the stream for this result. This is especially important when the stream is very large (for example 1 GB) and the consumer process wants to use the decoded data without waiting for the last chunk.
The main drawback, however, is silent truncation. Imagine, for example, that the last few chunks are missing: there will be no way to detect it.

Another possible answer is to checksum the compressed data instead of the original.
This is valid: a compressed stream can only be decoded into a single result, hence protecting the compressed data from errors is equivalent to protecting the original data.
And it brings several advantages.
First, checking the checksum against the compressed stream detects data corruption even before any attempt to decode.
Second, since compressed data is generally smaller than the original, the checksum function is faster to execute. Both properties have a positive impact on CPU consumption.

Note however that for the first property to hold, each chunk must be checksummed, not just the entire stream, because decoding starts as soon as the first chunk is read from I/O, both to save time and memory.
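
As a rough sketch of this approach: each chunk would carry a checksum of its own compressed bytes, verified before decoding. The layout is hypothetical, and zlib's crc32() is used here only as a stand-in, since the choice of checksum algorithm is a separate topic:

    #include <zlib.h>    /* crc32(), used here only as an example checksum */
    #include "lz4.h"

    /* Hypothetical per-chunk verification: the checksum covers the
     * compressed bytes, so corruption is caught before decoding starts. */
    int decode_chunk(const char* compressed, int compressedSize,
                     unsigned storedChecksum,
                     char* decoded, int decodedCapacity)
    {
        unsigned computed = (unsigned)crc32(0L, (const Bytef*)compressed,
                                            (uInt)compressedSize);
        if (computed != storedChecksum)
            return -1;   /* corrupted chunk: no decoding attempted */

        return LZ4_decompress_safe(compressed, decoded,
                                   compressedSize, decodedCapacity);
    }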

OK, so considering these advantages, I'm almost sold on the method of checksumming the compressed chunks. The most important drawback I still worry about is the silent truncation issue of the snappy framing format.

It seems however that this can be easily solved: one just needs to ensure that "continuation" information is provided. There are several ways to do it. For example, store the size of the original data in the framing format, and calculate the number of (fixed-size) chunks from it. Or embed the continuation information into a field associated with each chunk, and make it part of the checksummed data, as sketched below. There are surely other possibilities. The main idea is that the "end of stream" information must be provided explicitly, not just inferred from hitting EOF.
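
Just to illustrate the second option, here is a purely hypothetical chunk descriptor; none of the field names or sizes are decided, it only shows where the continuation flag could live and that it would be covered by the checksummed data:

    #include <stdint.h>

    /* Purely hypothetical chunk descriptor, for illustration only;
     * field names and sizes are not part of any specification. */
    typedef struct {
        uint32_t compressedSize;  /* size of the compressed payload            */
        uint8_t  lastChunk;       /* 1 for the final chunk, 0 otherwise;
                                     covered by the checksum below, so a
                                     missing tail cannot pass unnoticed        */
        uint32_t checksum;        /* checksum of the compressed payload
                                     plus the fields above                     */
    } ChunkHeader;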

One last property that could be missing is the use of the checksum as a kind of "weak signature". For example, in the Zip format, since the checksum is calculated from the original data, it is possible to "check" that the file present in the archive is likely the expected one.
But I'm not sure such usage has any real value beyond convenience. As stated earlier, such a "signature" is much too weak to be trusted, so it can merely serve as a "hint", not much else.

As usual, comments are welcome.
There is another point to discuss, which is the checksum algorithm itself.
We'll keep that for another blog note.

Sunday, April 8, 2012

A file container format for LZ4

 With increased usage of LZ4 come new use cases, and with them new requirements.

The recent ports of LZ4 to Python and Perl are two excellent examples, multiplying the possible usages by making the algorithm available to a broader range of applications, including easy-to-deploy scripts.

In this context, a simple question arises: how can one be sure that a file compressed with, for example, the Python port of LZ4 will be decodable by the C# port of LZ4?

It is indeed a very good question. And the sorry answer is: there is no such guarantee. At least, not yet.

LZ4 has been developed as an in-memory compression algorithm: it takes a memory buffer as input, and writes the compressed result into an output memory buffer. Whatever happens before or after that operation is beyond its area of responsibility. The fact that these buffers may be read from, or written to, disk is just one of many possibilities.
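
As a reminder of what this looks like in practice, here is a minimal memory-to-memory round trip using the current C API (function names have evolved since this post was written); error handling is reduced to the bare minimum:

    #include <stdio.h>
    #include <string.h>
    #include "lz4.h"

    int main(void)
    {
        const char src[] = "LZ4 works on memory buffers, nothing else.";
        const int  srcSize = (int)sizeof(src);

        char compressed[LZ4_COMPRESSBOUND(sizeof(src))];
        char decoded[sizeof(src)];

        /* compress one memory buffer into another */
        int cSize = LZ4_compress_default(src, compressed, srcSize,
                                         (int)sizeof(compressed));
        if (cSize <= 0) return 1;

        /* decompress it back; what happens to these buffers afterwards
         * (file, socket, ...) is outside LZ4's responsibility */
        int dSize = LZ4_decompress_safe(compressed, decoded, cSize,
                                        (int)sizeof(decoded));
        if (dSize != srcSize || memcmp(src, decoded, srcSize) != 0) return 1;

        printf("round trip OK (%d -> %d -> %d bytes)\n", srcSize, cSize, dSize);
        return 0;
    }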

This is a correct positioning. There are too many possible usages for LZ4, and a lot of them will never bother with on-disk files or network streaming.

Nevertheless, one natural usage of a compression algorithm is file compression, and it applies to LZ4 too. In this case, the compressed result is written to a medium, and may be decoded later, possibly by a completely different system. To ensure this operation will always be possible, whichever implementation of LZ4 is used, written in whatever language, it is necessary to define a file container format.

To be complete, a container format has already been defined in lz4demo, but it was never meant to become a reference; it is too limited for that ambition. Therefore, a new container format specification is necessary.

The ideas in this post are largely borrowed from this older one, where a container format was discussed and defined for Zhuff. Most of the points are still applicable to LZ4.

We'll start by defining the objectives. What are the responsibilities of a file container format? Well, mostly, it shall ensure that the original file will be restored entirely and without errors. This can be translated into the following technical missions:
  • Distinguish valid containers from invalid input (compulsory)
    • Detect / control errors (recommended)
      • Correct errors (optional)
  • Help the decoder to organize its memory buffers 
    • buffer sizes, and number of buffers (compulsory)
    • provide user control over buffer sizes (optional)
  • Organize data into independent chunks for multi-threaded operations (optional)
  • Allow quick jumps to a specific data segment within the container (optional)
  • Be compatible with non-seekable streams, typically pipe mode (compulsory)
  • Enforce alignment rules (optional)
  • Allow simplified concatenation of compressed streams (optional)

That's already quite a few missions. But most of them will be achievable, as we'll see later.
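
To give a rough idea of how little is needed to cover the compulsory points above, here is a purely illustrative header sketch; the fields, their names and widths are placeholders, not a proposal:

    #include <stdint.h>

    /* Purely illustrative frame header, not a specification.
     * A magic number distinguishes valid containers from random input,
     * and the (maximum) chunk size tells the decoder how large its
     * buffers must be before it reads any data. */
    typedef struct {
        uint32_t magicNumber;    /* identifies the stream as an LZ4 frame     */
        uint8_t  version;        /* format version, for future evolutions     */
        uint8_t  flags;          /* e.g. checksum present, independent chunks */
        uint32_t maxChunkSize;   /* upper bound on decoded chunk size         */
    } FrameHeaderSketch;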

Another important thing to define is what is not among this format's missions:
  • Save the file name (debatable)
  • Save file attributes and path
  • Group several files (or directory) into a single compressed stream
  • Provide optional user space, typically for comments (debatable)
  • Split the compressed stream into several output files
In my view, these elements should be part of an archive format.
While the file container format takes care of regenerating the original content, the archive will also take care of the directory structure. It will be able to save file names, and re-organize them in a better way to help compression. On decoding, it will regenerate the entire tree structure, placing each file at its correct location, sometimes restoring the original attributes as well.
An archive format therefore occupies a higher layer in the structure, and typically relies on such a container format.

Last but not least, we'll make the definition of "file" a bit broader, Unix-like, in order to also encompass piped data. The "file container format" then becomes a "stream container format", with files being just one of the possible streams.

All these elements can be provided in a relatively compact and efficient header.
But first, let's discuss the most important features.

[Note] Considering the comments, it seems the word "container" does not properly convey the objective of this specification. Another word must be found; "framing format" is the current candidate.