Thursday, November 24, 2011

A format for compressed streams, part 2

 Following my recent post, here is what i could come up with after a few minutes of thinking : 2 bytes might be enough for my need.

Well, mostly. To begin with, a magic number is still necessary to distinguish a valid stream from random garbage. 4 bytes seems the standard way to go for that role. By using an unusual bit layout, it should make stream confusion pretty rare (almost 1 in 2^32= 4 billions if correctly selected).

Since a stream identity must lead to a compatible compression decoder, this gave me the idea of merging decoder version with magic number. The more the number of supported versions, the less efficient is the 4 bytes header at distinguishing "garbage" streams. But well, since we start at 32 bits, even a reduced identifier length (28-30 bits) should do its job properly.

The next thing we need is a header version. This is different from decoder version, since we may add new properties to the stream format while not modifying the compression decoder itself. It's impossible to guess how many versions will be developed in the future, so let's settle with 2 bits for now (4 values) and reserve a 3rd bit just in case. Even if we run out of version number, it will be possible to use one of the last values to add a few bytes to the header, where version number can be extended.

Next thing, we need to retrieve the size of the (uncompressed) stream. Well, if it can be known. That's not so sure by the way, since in a pipe scenario, there is no way to "probe" for the size of the input stream.
So there is a bit to tell if the stream size is provided into the header. By the way, how much bytes should be used for this size ? 8 Bytes is all around (that's 16 Exabytes, no less), but maybe too much in most circumstances, where 4-bytes (4 Gigabytes) is good enough. So i'll use 2 bits to distinguish between "No Size provided", 8-Bytes size and 4-Bytes size. While at it, since I've one more available value, i'll open the possibility for 2-Bytes size (64KB).

OK, so now, maybe the stream is cut into multiple independent pieces, called "chunks", for easier multi-threading and random access ? I need another bit to tell.

If there are several "chunks", what are their size ?
For simplicity, in this first version, i will only allow identical "uncompressed" chunk sizes into a single stream. But which one ?
Although I've got my own idea for file-compression scenario with Zhuff compression, maybe some users will have different needs. The smaller the chunk size, the worse the compression ratio, but the better for memory usage and random access. So, in order to open the stream container format to new usage scenarios, i'm willing to take a risk, and make chunk size selectable.
What values should be authorized ?
For a first version, a sensible simplification is to only allow sizes which are power of 2, so a few bits will be enough.
Anyway, a "chunk size" should not be too small, otherwise compression ratio can be so bad that it is hardly effective. So let's say that the minimum chunk size shall be 1KB.

OK, so if we use "chunk size = 0" to tell that there is no chunk (single continuous stream), in 4 bits, it is possible to represent any 2^n value from 1KB to 16MB, which seems enough for my needs. Let's keep a 5th bit in reserve, just in case new usages require larger chunks.

Fine, I've got my main stream parameters. I still need some bits for other planned functions, such as the presence of an offset table for random access, header and data chunk checksums, and alignment rules. But there are still enough bits for that within a 2 bytes option header.

So, it gives :

4 Bytes : Base Magic Number + Decoder Version
2 Bytes : Options (detailed)
- bits 0-1 : Header Version
- bit 2 : Reserved (extended version number) (must be zero)
- bits 3-6 : Chunk Size (or no chunk)
- bit 7 : Reserved (extended chunk size) (must be zero)
- bits 8-9 : Stream Uncompressed Size format (0, 2, 4 or 8 Bytes)
- bits 10-15 : Reserved (Offset Table, Checksums, Alignment) (must be zero)
0, 2, 4 or 8 Bytes : Stream Uncompressed Size

Then repeat N Times (with N = Number of chunks)
4 Bytes : Compressed Chunk Size
Data Chunk

So far so good. This format even allows me to naturally handle non-compressed chunks : if "compressed chunk size = uncompressed chunk size", then obviously chunk is uncompressed.

But a non trivial difficulty emerges : should the Stream Uncompressed Size be unknown, what happens for  the last chunk ?
Well, in this situation, the last chunk's uncompressed size is unknown.
This does not happen if we know Stream Uncompressed Size : we can automatically calculate the number of chunks, and the uncompressed size of the last chunk.

To solve this situation, i will "flag" the last chunk. Let's use the highest bit for the flag. In this case, "compressed chunk size" will be followed by a 4-Bytes value, which is the "uncompressed chunk size". Problem solved.

Another trick regarding the last chunk would be to detect its end simply because we reach the end of input (EOF marker). Then, no flag would be needed. The "compressed size" could also be easily calculated, since the end of the compressed stream is known. It would save a few bytes.

The problem is, it only works if the stream is the only data into the file or the pipe. Should it be followed by any other data, this detection method would not work.

Could that happen ? Surely. For instance, multiple streams are stored into a single archive file. A potential way to solve the problem is then to require the "higher" format to know the size of each compressed streams. It would then be in charge to tell it to the stream decoder.
Such capability seems common for an encapsulating format, but i feel nonetheless uneasy with the requirement...


  1. Very sensible. This is quite similar to what I use.

    It sounds like you are assuming your chunks are independent? There is some value in having chunks even if they are not independent.

    Also using compressed size == uncompressed size is indeed good and valuable. There are some more good tricks related to that.

    For example if compressed size is very close to uncompressed size but not quite there, you can gain a lot of speed by just kicking it up to uncompressed. Also you can use compressed size = 0 as a special case for flagging all zero chunks, which are reasonably common in practice.

  2. Yes, chunks are independant, for multi-threading considerations.

    Within chunks, i may have a smaller sub-division called "blocks". Blocks aren't independant. They exist for memory usage and entropy adaptation. I consider "blocks" to be part of the compression algorithm itself.

    LZ4 has no "blocks". But Zhuff has. A Zhuff's block is 256KB (max).

    Hope this is clear. If better wording exist, please let me know.

  3. Good point for the zeroe'd chunks. It does indeed happen on many binaries, although small chunk sizes should be preferred for this (the typical 4MB chunk size that i'm using for Zhuff is too large).

    Food for though :
    I initially thought about keeping the zero value to tell that the chunk is "uncompressed". It would have helped in a very specific situation : if chunk size = 64KB, and if Compressed Chunk Size is written using 2 bytes.
    Then, i don't have the value 65536 available to tell the chunk is uncompressed. 0 would have filled the gap.

    In the end, i dropped the idea due to the need to also store a flag to detect the last chunk. It further reduces such possibilities.

  4. For compatibility between the different language libraries, please encode the compressed length in the output. Right now each library does it themselves, but I'm not sure if they do it the same way. It would make it easier to transition from Snappy, since it does it that way.