Surely there are also valid use cases of LZ4 which may not need error detection and its associated CPU and bandwidth overhead; hence this feature will remain optional in the final specification, but nonetheless recommended.
This requirement is really very common. And a reasonable method to check for unintentional errors (such as transmission or storage ones) can be provided with a simple checksum. Typically, a good 32-bit checksum makes it relatively improbable (less than one in a billion) that random changes stay unnoticed. This is the method used by Zip, for example.
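As a quick illustration, here is a minimal sketch in C of this kind of verification, using zlib's crc32() as the example 32-bit checksum (any good 32-bit checksum would do; the buffer and its stored checksum are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>
#include <zlib.h>    /* crc32(); stands in for any good 32-bit checksum */

/* Recompute the checksum over the received data and compare it with the
   value stored at compression time. With a good 32-bit checksum, a random
   corruption escapes detection with probability ~2^-32, i.e. less than
   one in a billion. */
static int verify_block(const unsigned char* data, size_t size,
                        uint32_t storedChecksum)
{
    uint32_t computed = (uint32_t)crc32(0L, data, (uInt)size);
    return computed == storedChecksum;   /* 1 = OK, 0 = corruption detected */
}
```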
One could note that the LZ4 decoder already includes some form of detection for malformed input. That's correct, but it's not enough, because it only checks that the provided instructions are valid. Its main objective is to forbid any attempt to read or write outside of the memory buffers. Hence, it cannot detect a corrupted literal within the compressed stream, for example.
On the other hand, one could also require the method to guarantee that the input data was never modified, neither randomly nor intentionally (data integrity). Such an objective, however, seems out of scope.
To begin with, it requires an electronic signature of at least 128 bits, such as MD5. But even MD5 is now considered too weak for this use, and shall be superseded by stronger 160-bit alternatives, such as SHA-1.
One obvious issue is speed. What's the use of a super-fast compression / decompression algorithm if the checksum function is several times slower?
But the most important issue is that the signature alone is not enough. The signature's mission is to make it almost impossible to find a modification of the original source file which produces the same electronic signature. But since the electronic signature is provided within the framing format, a malicious attacker would simply change both the content and the signature.
Therefore, an effective protection would also require encryption, or separate storage and transmission of the signature, which is way beyond the objective of this framing format.
So we'll just consider random data corruption from now on.
The next question is: what has to be checksummed?
Due to the popularity of the Zip format, an obvious answer could be "the original data".
But let's look into it; it's not that obvious.
Take the snappy framing format, for example. It checksums the original data of each chunk individually, each chunk being <= 32 KB. The main advantage is that an incorrect chunk is detected immediately; there is no need to wait for the end of the stream for this result. This is especially important when the stream is very large (for example 1 GB) and the consumer process wants to use the decoded data without waiting for the last chunk.
The main drawback, however, is silent truncation. Imagine, for example, that the last few chunks are missing: there will be no way to detect that.
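To make the truncation issue concrete, here is a sketch of a reading loop over such independently checksummed chunks (the chunk layout is hypothetical, and zlib's crc32() merely stands in for whatever checksum is used): cut the stream at any chunk boundary, and the loop still ends cleanly, reporting success.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <zlib.h>    /* crc32(), used here as the example checksum */

/* Hypothetical snappy-like layout:
   a sequence of [4-byte size][4-byte checksum][size bytes of data] */
static uint32_t read_le32(const uint8_t* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static int read_all_chunks(FILE* in)
{
    uint8_t header[8];
    while (fread(header, 1, sizeof(header), in) == sizeof(header)) {
        uint32_t size   = read_le32(header);
        uint32_t stored = read_le32(header + 4);
        uint8_t* data   = malloc(size);
        if (!data || fread(data, 1, size, in) != size
            || (uint32_t)crc32(0L, data, (uInt)size) != stored) {
            free(data);
            return 0;           /* corruption detected within a chunk */
        }
        /* ... hand 'data' to the consumer ... */
        free(data);
    }
    /* EOF reached: every chunk received was valid. But if the stream was
       cut at a chunk boundary, this loop still reports success. */
    return 1;
}
```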
Another possible answer is to checksum the compressed data instead of the original one.
This is valid. A compressed stream can only be decoded into a single result. Hence, protecting the compressed data from errors is the same as protecting the original data.
And it brings several advantages with it.
First, by checking the checksum against the compressed stream, corruption is detected before even attempting to decode.
Second, since compressed data is generally smaller than the original, the checksum function executes faster. Both properties have a positive impact on CPU consumption.
Note however that for the first property to hold, each chunk must be checksummed, not just the entire stream, because decoding starts immediately upon reading the first chunk from I/O, both to save time and memory.
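In code, a per-chunk decoding step could look like the following sketch, which verifies the checksum of the compressed bytes before calling LZ4_decompress_safe() (the chunk layout is hypothetical, and crc32() again stands in for the actual checksum function):

```c
#include <stdint.h>
#include <zlib.h>    /* crc32(), placeholder for the actual checksum */
#include "lz4.h"     /* LZ4_decompress_safe() */

/* Decode one chunk, but only after its *compressed* bytes pass the checksum.
   Two benefits, as noted above: corruption is caught before any decoding
   work is spent, and the checksum runs over the smaller (compressed) data.
   Returns the decompressed size, or -1 on corruption / malformed input. */
static int decode_chunk(const char* comp, int compSize, uint32_t storedChecksum,
                        char* dst, int dstCapacity)
{
    uint32_t computed = (uint32_t)crc32(0L, (const unsigned char*)comp,
                                        (uInt)compSize);
    if (computed != storedChecksum) return -1;   /* corrupted chunk */
    return LZ4_decompress_safe(comp, dst, compSize, dstCapacity);
}
```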
OK, so considering these advantages, I'm almost sold on the method of checksumming the compressed chunks. The most important drawback I would still worry about is the silent truncation issue of the snappy framing format.
It seems, however, this can be easily solved. One just needs to ensure that the "continuation" information is provided, and there are several possibilities to do so. For example, provide the size of the original data within the framing format, and calculate the number of (fixed-size) chunks from it. Or embed the continuation information into a field associated with each chunk, and make it part of the checksummed data, as sketched below. There are surely other possibilities. The main idea is that the "end of stream" information must be explicitly provided, not just inferred from hitting the EOF marker.
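As an illustration of the second option, here is a hypothetical chunk descriptor where a "last chunk" flag is covered by the checksum. With such a field, a stream that reaches EOF without having delivered a chunk flagged as last is, by construction, incomplete:

```c
#include <stdint.h>

/* Hypothetical chunk descriptor. The checksum covers both the flag and the
   compressed payload, so the continuation information is itself protected
   against random corruption. */
typedef struct {
    uint32_t compressedSize;  /* size of the compressed payload */
    uint8_t  isLastChunk;     /* 1 for the final chunk of the stream */
    uint32_t checksum;        /* computed over isLastChunk + payload */
} ChunkHeader;

/* Reader-side rule: hitting EOF without having seen isLastChunk == 1 means
   the stream is truncated, even if every chunk so far was valid. */
```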
One last property that could be missing is using the checksum as a kind of "weak signature". For example, in the Zip format, since the checksum is calculated from the original data, it is possible to "check" that the file contained in the zip archive is likely the expected one.
But I'm not sure such usage has any real value, even for convenience. As stated earlier, such a "signature" is much too weak to be trusted, so it can merely be used as a "hint", not much else.
As usual, comments are welcome.
There is another point to discuss, which is the checksum algorithm itself.
We'll keep that for another blog note.