The recent ports of LZ4 to Python and Perl are two excellent examples, multiplying the possible usages by making the algorithm available to a broader range of applications, including easy-to-deploy scripts.
In this context, a simple question comes up: how can one be sure that a file compressed with, for example, the Python port of LZ4 will be decodable by the C# port of LZ4?
This is indeed a very good question. And the sorry answer is that there is no such guarantee. At least, not yet.
LZ4 has been developed as an in-memory compression algorithm: it takes a memory buffer as input, and writes the compressed result into an output memory buffer. Whatever is done before or after that operation is beyond its area of responsibility. The fact that these buffers may be read from, or written to, disk is just one of many possibilities.
This is a correct positioning. There are too many possible usages for LZ4, and a lot of them will never bother with on-disk files or network streaming.
Nevertheless, one natural usage of a compression algorithm is file compression, and this applies to LZ4 too. In this case, the compressed result is written to a medium, and may be decoded later, possibly by a completely different system. To ensure this operation will always be possible, even when using any other version of LZ4 written in any language, it is necessary to define a file container format.
To be more complete, a container format has been defined in lz4demo, but it was never meant to become a reference. It is too limited for this ambition. Therefore, a new container format specification is necessary.
The ideas in this post will be largely borrowed from this older one, where a container format was discussed and defined for Zhuff. Most of the points are still applicable for LZ4.
We'll start by defining the objectives. What are the responsibilities of a file container format? Well, mostly, it shall ensure that the original file will be restored entirely and without errors. This can be translated into the following technical missions:
- Detect valid containers from invalid input (compulsory)
- Detect / control errors (recommended)
- Correct errors (optional)
- Help the decoder to organize its memory buffers
  - buffer sizes, and number of buffers (compulsory)
  - provide user control over buffer sizes (optional)
- Organize data into independent chunks for multi-threaded operations (optional)
- Allow quick jumps to a specific data segment within the container (optional)
- Be compatible with non-seekable streams, typically pipe mode (compulsory)
- Enforce alignment rules (optional)
- Allow simplified concatenation of compressed streams (optional)
This is already quite a few missions. But most of them will be possible, as we'll see later.
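To make two of these missions more concrete (independent chunks, and compatibility with non-seekable streams), here is a minimal sketch in Python. It is not the LZ4 format: the 4-byte little-endian length prefix, the zero-length end mark, and the function names `write_blocks`, `read_exact` and `iter_blocks` are all assumptions made up for illustration. The point is only that the reader consumes the stream strictly forward, using `read()` and never `seek()`.

```python
import io
import struct

# Hypothetical framing, for illustration only: each block is preceded by its
# size as a 4-byte little-endian integer, and a size of 0 ends the stream.
SIZE_FIELD = struct.Struct("<I")

def write_blocks(stream, blocks):
    """Write each block with a length prefix, then an end-of-stream mark."""
    for block in blocks:
        if not block:
            raise ValueError("empty blocks would collide with the end mark")
        stream.write(SIZE_FIELD.pack(len(block)))
        stream.write(block)
    stream.write(SIZE_FIELD.pack(0))

def read_exact(stream, n):
    """Read exactly n bytes using only read(); a pipe may return short reads."""
    chunks = []
    while n > 0:
        chunk = stream.read(n)
        if not chunk:
            raise EOFError("stream ended in the middle of a block")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def iter_blocks(stream):
    """Yield blocks one by one, moving strictly forward in the stream."""
    while True:
        (size,) = SIZE_FIELD.unpack(read_exact(stream, SIZE_FIELD.size))
        if size == 0:
            return
        yield read_exact(stream, size)

# Usage: an in-memory stream stands in for a pipe here.
buffer = io.BytesIO()
write_blocks(buffer, [b"first block", b"second block"])
blocks = list(iter_blocks(io.BytesIO(buffer.getvalue())))
```

Because each block announces its own size, a decoder knows how much memory to reserve before reading it, and blocks could be handed to separate threads if they are compressed independently.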
Another important thing to define is what is not among this format's missions:
- Save the file name (debatable)
- Save file attributes and path
- Group several files (or directory) into a single compressed stream
- Provide optional user space, typically for comments (debatable)
- Split the compressed stream into several output files
These tasks belong to an archive format. While the file container format takes care of regenerating original content, the archive also takes care of directory structure. It is able to save file names, and re-organize them in a better way to help compression. On decoding, it regenerates the entire tree structure, placing each file at its correct location, sometimes with its original attributes too.
An archive format therefore occupies a higher layer in the structure. And it typically depends itself on such container formats.
Last but not least, we'll make the definition of "file" a bit broader, Unix-like, in order to also encompass "piped data". The "file container format" then becomes a "stream container format", with files being one of the possible streams.
All these elements can be provided in a relatively compact and efficient header.
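As a rough illustration of how little space such a header needs, here is a sketch in Python. Everything in it is an assumption made for the example, not the actual specification: the magic bytes `DEMO`, the one-byte block-size code, the CRC32 checksum, and the names `write_container` and `read_container`. The payload is stored as-is, where a real container would hold compressed blocks.

```python
import struct
import zlib

# Hypothetical 9-byte header, for illustration only:
#   4 bytes  magic number      (detect valid containers from invalid input)
#   1 byte   block-size code   (help the decoder size its buffers)
#   4 bytes  CRC32 of payload  (detect errors)
MAGIC = b"DEMO"
HEADER = struct.Struct("<4sBI")

def write_container(payload: bytes, block_size_code: int = 4) -> bytes:
    """Return the payload prefixed with the hypothetical header."""
    return HEADER.pack(MAGIC, block_size_code, zlib.crc32(payload)) + payload

def read_container(data: bytes) -> bytes:
    """Validate the header and return the payload, or raise ValueError."""
    if len(data) < HEADER.size:
        raise ValueError("input too short to be a container")
    magic, block_size_code, checksum = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("not a valid container")
    payload = data[HEADER.size:]
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch, data is corrupted")
    return payload

# Usage: round trip of a small payload.
container = write_container(b"some original content")
restored = read_container(container)
```

Error correction, the optional mission, would need more than a checksum; this sketch only detects corruption.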
But first, let's discuss the most important features.
[Note] Considering the comments, it seems the word "container" does not properly define the objective of this specification. Another word shall be found. "Framing format" is the current candidate.