Thursday, March 21, 2013

A Streaming format for LZ4



For a long time, I have wanted to produce a suitable streaming format for LZ4. As stated earlier, the format defined in lz4demo was never meant to become widespread, and is too limited. It is also completely unsuitable for network-oriented protocols.

As a consequence, you'll find at the following link a nearly final specification of the LZ4 streaming format, in OpenDocument format.

It's only "nearly" final, since there are still a few questions left, summarized at the end of each chapter.

However, the main properties seem settled. The format accommodates a large selection of buffer sizes, allows both sequential and independent blocks, embeds a few checksum options, accepts arbitrary flushes at any point, and even defines provisions for a preset-dictionary mode.

At this stage, the last few questions should be settled in the next few weeks. Input and comments are welcome.




[Edit] Progress:
Settled thanks to your comments (here and direct email):

  • Endian convention: more votes for little-endian.
  • Stream size: 8 bytes seems okay.
  • Stream checksum: removed (v0.7); block-level checksum seems enough.
  • High-compression flag: removed (v0.8); seems not useful enough.
  • Block maximum size: reduced table (v0.9), from 1 KB to 4 MB.
  • Block size: simplified compressed/uncompressed flags (v1.0).
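Since the convention is little-endian, decoders on big-endian hosts should assemble multi-byte fields byte by byte rather than casting pointers. A minimal sketch in C (the helper names are mine, not part of the spec):

```c
#include <stdint.h>

/* Read little-endian fields byte by byte: this works unchanged
   on little-endian and big-endian hosts alike, with no alignment
   requirement on the input pointer. */
static uint32_t read_le32(const uint8_t *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* The optional 8-byte stream size field would be read the same way. */
static uint64_t read_le64(const uint8_t *p)
{
    return (uint64_t)read_le32(p)
         | ((uint64_t)read_le32(p + 4) << 32);
}
```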


[Edit] Answering spec-related questions directly in the post:

Jim> suggestion is to allow different checksums, with a 16-bit word identifying which hash

Actually, that was my initial intention.
But I eventually backed off. Why?

One of LZ4's strong points is its simple specification, which makes it possible for any programmer to produce an LZ4-compatible implementation of their own, in any language. To reach this goal, complexity is fought constantly and kept to a minimum.

If multiple hash algorithms are possible, then a decoder has to support them all to remain compatible with the specification. That's a significant burden, which would deter or slow down people willing to produce, test, and maintain their own decoding routine.
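For reference, the single mandated hash, xxHash-32, is itself small enough to re-implement from its published specification, which keeps the "simple to re-implement" promise intact. A from-scratch sketch (follow the official XXH32() library in production; this version is only meant to show the size of the task):

```c
#include <stdint.h>
#include <stddef.h>

#define PRIME32_1 2654435761u
#define PRIME32_2 2246822519u
#define PRIME32_3 3266489917u
#define PRIME32_4  668265263u
#define PRIME32_5  374761393u

static uint32_t xxh_rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

static uint32_t xxh_read32(const uint8_t *p)   /* little-endian read */
{
    return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint32_t xxh32(const void *input, size_t len, uint32_t seed)
{
    const uint8_t *p = (const uint8_t *)input;
    const uint8_t *end = p + len;
    uint32_t h;

    if (len >= 16) {   /* process 16-byte stripes with 4 lanes */
        uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
        uint32_t v2 = seed + PRIME32_2;
        uint32_t v3 = seed;
        uint32_t v4 = seed - PRIME32_1;
        do {
            v1 = xxh_rotl32(v1 + xxh_read32(p) * PRIME32_2, 13) * PRIME32_1; p += 4;
            v2 = xxh_rotl32(v2 + xxh_read32(p) * PRIME32_2, 13) * PRIME32_1; p += 4;
            v3 = xxh_rotl32(v3 + xxh_read32(p) * PRIME32_2, 13) * PRIME32_1; p += 4;
            v4 = xxh_rotl32(v4 + xxh_read32(p) * PRIME32_2, 13) * PRIME32_1; p += 4;
        } while ((size_t)(end - p) >= 16);
        h = xxh_rotl32(v1, 1) + xxh_rotl32(v2, 7)
          + xxh_rotl32(v3, 12) + xxh_rotl32(v4, 18);
    } else {
        h = seed + PRIME32_5;
    }
    h += (uint32_t)len;
    while ((size_t)(end - p) >= 4) {            /* remaining 4-byte words */
        h = xxh_rotl32(h + xxh_read32(p) * PRIME32_3, 17) * PRIME32_4;
        p += 4;
    }
    while (p < end)                             /* remaining bytes */
        h = xxh_rotl32(h + (*p++) * PRIME32_5, 11) * PRIME32_1;
    h ^= h >> 15; h *= PRIME32_2;               /* final avalanche */
    h ^= h >> 13; h *= PRIME32_3;
    h ^= h >> 16;
    return h;
}
```

As a sanity check, XXH32 of an empty input with seed 0 is 0x02CC5D05, the published reference value.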

Obviously, the streaming format proposed here is not supposed to be "the most appropriate for any usage". I intend it to become "mainstream", cross-platform, and to replace the current "static format" of "lz4demo". But there will always be some specific circumstances in which another format will do a better job.


Matt> deflate format supports preset dictionaries, but nobody uses them. I would drop it.

Actually, I had a few requests for such a feature. The idea is that preset dictionaries can make a great difference when sending a lot of small independent blocks.


Matt> Do you need checksums at all?

I think so. The format targets both short-lived transmission data and long-term storage. A checksum is almost mandatory for the second use case. In any case, the checksum remains an option in the proposal, so a sender can disable it when that better suits its use case.


Matt> Do you need both block and stream checksums? Probably not.
Mark> Stream Checksum: I don't see the point if data block checksum gives the appropriate protection 

That's the key question. Comments are split roughly 50/50 on it.
The simplicity argument looks good, though.
[Edit 2] Stream checksum is removed (v0.7+), keeping only block-level checksum


Mark> Do you really think your block maximum size values make sense (...) All in all, I tend to think it is a too wide choice anyway

Good point. In v0.9, the choice of values for block maximum size has been reduced.


Matt> Do you need variable sized fields just to save a few bytes in the header? Probably not. 
Mark> I would make that "compressed flag" explicit, and thus keep only one "data size" field regardless of whether the data was compressed or not. (...) I'm not even sure you would need two possible sizes just to save one byte per block.

Good points. Apparently this part is too complex, and deserves some rethinking.
[Edit 2] : the specification regarding block size is simplified in v1.0.


Adrien> Would it be better to let users select their own dictionary identifiers, rather than requiring Dict-ID to be the result of passing the dictionary through xxHash-32?

Good point. This behavior mimics the spec of RFC 1950 (zlib). The only advantage I can think of is that it allows the decoder (and encoder) to check whether the dictionary is the right one. I'm unsure whether this is really useful, though...

Takayuki> An LZ4 stream may be (pre)loaded into Read-Only Memory. In this case, temporarily masking BC.Checkbits to 0 is difficult.

Correct. The current proposal is to load the header into RAM in order to perform the masking and checksum there. Is that an issue?
Another option could be to skip the checkbits verification for ROM-preloaded streams, since the risk of transmission error is nil in that scenario.
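A minimal sketch of the RAM-copy approach (the descriptor size, the position of the checkbits, and the stand-in checksum are all hypothetical here; the real format defines the checkbits from xxHash-32):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: a DESC_SIZE-byte descriptor, with the checkbits
   stored in the low 4 bits of its last byte. */
#define DESC_SIZE 7

/* Stand-in checksum (FNV-1a), only to keep the sketch self-contained;
   the actual spec would use xxHash-32 here. */
static uint32_t toy_checksum(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

/* Verify the checkbits without ever writing to the (possibly ROM)
   source: copy the descriptor to a stack buffer and mask there. */
static int checkbits_ok(const uint8_t *rom_desc)
{
    uint8_t tmp[DESC_SIZE];
    memcpy(tmp, rom_desc, DESC_SIZE);
    uint8_t stored = tmp[DESC_SIZE - 1] & 0x0F;
    tmp[DESC_SIZE - 1] &= 0xF0;            /* mask checkbits to 0 in RAM */
    return (toy_checksum(tmp, DESC_SIZE) & 0x0F) == stored;
}
```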

22 comments:

  1. xxhash is not the fastest hash algorithm for every platform. My suggestion is to allow different checksums, with a 16-bit word identifying which hash/checksum is used in the stream.

    To comment on your open-ended questions:

    Stream size: is 8 bytes too big? No, and in 10 years you'll be glad you had it that large. To state this another way, 4 bytes is too small.

    Endian: Little-endian is common on x86. Big-endian is common on ARM and network appliances. I'd say both have the same distribution, but I would keep little-endian.

    Stream CRC: not necessarily redundant, since (1) the programmer can ignore it if they want, and (2) it can be used to detect decompression failure. I think both should be included.

    High-compression mode flag: I don't feel that is useful. It doesn't affect anything.

  2. Not sure your blog is the right place for all of this, but here we go:

    A few small comments "sur la forme":
    - You may want to explain why the existing stream formats (with minor adaptations to change the algorithm to LZ4) did not fit your goals. Not talking about LZ4demo here.
    - You should put yourself as author of the document + small simple copyright, and if relevant, indicate what changes v0.2 and v0.3 brought compared to 0.1
    - I notice a mix of terminology in the introduction, between "LZ4" the algorithm and "LZ4" the streaming format described in your document. Not sure however how to separate those more clearly.
    - Maybe you could share this document on Google Docs (or equivalent), for anyone to see it online without download.
    - Are you planning a "random access"-friendly format as well? Filesystem implementors would be happy :-)
    - small typo in the explanation of "Flag checkbits" (one ":" too many?)
    - "(Dict-ID)|(Stream Checksum) is only present if the associated flag is set." -> add "(in the descriptor)"

    Now regarding "le fond"
    - Endianness: if you REALLY can't choose (I would pick little-endian, as that is the vast majority of CPUs nowadays), why not do it the TIFF way and have a flag telling you which one is used?
    - Stream checksum: I don't see the point if the data block checksum gives the appropriate protection
    - 8 bytes for (optional) stream size "is enough for everyone"
    - Stream descriptor flag, why not start with "00" as version value number?
    - what if you want to mix independent and non-independent blocks, due to the kind of input data? The only way now seems to be concatenating two streams, correct?
    - I know that you are not targeting high-end servers with 64 GB of RAM, but looking at the ratios resulting from compressing very small buffers of data, do you really think your block maximum size values make sense for small bit values (everything < 4 KB)? I would easily trade those small values for bigger ones at the end (> 8 MB), but that's probably because of my use cases (big text files). Not even sure LZ4 results with such big blocks would be that nice. However, in my proposal, the link between bit value and buffer size would not be as straightforward as it is now. All in all, I tend to think it is too wide a choice anyway.
    - Flag checkbits: each possible field (even optional ones) should be protected with checkbits/bytes. Now, I must admit I don't know what the literature/math says about what protection do those 4 bits really give you, and if at first read I am not mistaken, there are certain fields "unprotected"
    - I think the compressed/uncompressed size logic of a data block needs some rethinking. Right now, in some corner cases, you need this additional "original size" field because you misused the "compressed size" field to indicate not a size, but whether the block is compressed at all.
    I would make that "compressed flag" explicit, and thus keep only one "data size" field regardless of whether the data was compressed or not. An idea would be to spare one bit of such a data size field, but that would mean 32 KB instead of 64 KB for 2-byte sizes. By the way, you don't explain how one is supposed to know whether the "compressed size" field is 2 or 4 bytes.

    Voila!

    Mark van Leeuwen

    Replies
    1. Mark> - Are you planning a "random access"-friendly format as well? Filesystem implementors would be happy :-)

      Yes I do, but not in this specification, because the properties seem too different.

      See my answer to Carter :

      http://fastcompression.blogspot.fr/2013/03/a-streaming-format-for-lz4.html?showComment=1363948308131#c8726857856571464257

    2. Mark> Stream descriptor flag, why not start with "00" as version value number?

      Good point. Because "00" is more "common", selecting "01" instead acts as a kind of error detection (on top of the magic number).

      I agree it is a moot point though. Does it matter ?

    3. At "version 1.0" of the spec, no it doesn't, "01" would be fine. But if for some reason in the future, you've reached "11" and you need to go to a fourth version, then having "00" indicating fourth version isn't that nice. But definitly not something that would "kill" the whole format :-)

    4. Mark> what if you want to mix independent and non-independent blocks, due to the kind of input data? The only way now seems to be concatenating two streams, correct?

      Correct.
      I see no serious use case which would deserve a mix of independent and related blocks within a single stream. It would be a mess for the decoder.

    5. Mark> Flag checkbits: each possible field (even optional ones) should be protected with checkbits/bytes.

      That's an interesting proposal. For information, that's not what's happening with zlib's RFC 1950, where optional field DICT-ID is not protected by FCHECK.

      However FCHECK is also defined in such a way that it is simple to phrase, but difficult to extend. The definition within LZ4 Streaming Format, which is based on xxHash-32, is more easily extendable.

  3. That's a lot of good comments, thanks; I guess I will need more than a comment thread to answer them all. Hmm, what about editing the original article?

  4. I will continue to use the comment thread to answer to non-spec-related questions or suggestions.

    > - Maybe you could share this document on Google Docs (or equivalent), for anyone to see it online without download.

    That's an excellent idea. The document has been converted to be compatible with Google Docs, and the link now points to this version.

    Btw, I learned in the process that my initial MS-Office-created document was not compatible, in spite of being OpenDocument! These vendors are always successful in putting a few proprietary elements here and there...

    Replies
    1. The table for "Block Maximum Size" disappeared in the google docs version.

    2. Thanks for pointing that out. It's corrected now.

      Note that I am considering your comment that this table probably needs a shake-up.

  5. > You may want to explain why the existing stream formats (with minor adaptations to change the algorithm to LZ4) did not fit your goals. Not talking about LZ4demo here.

    Well, maybe you know more than I do about this.

    An existing format is, typically, RFC1950.
    But it's specifically meant to be used with zlib, so it can't accept LZ4 as the compression algorithm.

    Do you know of other, more versatile, formats which could fit the bill ?

    Replies
      I must admit it is not that easy to find a "generic" solution on the pure stream-format side. RFC 1950 and RFC 1951 are quite basic, and archive formats sometimes take shortcuts that wouldn't allow the "stream" goal.
      Still, it may be worth taking a look at various archive formats to see if you missed a nice and useful addition to the LZ4 stream format. It may take some time to go through the list though, and I honestly am pretty OK with what you came up with already ;-)
      Startpoint could be http://en.wikipedia.org/wiki/Archive_format#Archive_formats

      A+
      Mark

  6. By ‘Stream Appending’ you surely mean concatenation?

  7. one piece of information that would also be handy in the format would be
    a) is the underlying information an array of fixed size elements, or essentially regarded as a bytes stream
    b) if it can be regarded as an array of uniformly sized things, how big is each element in bytes?

    the Blosc compression library used by the Python scientific computing folks at Continuum Analytics https://github.com/FrancescAlted/blosc provides this info in the format.

    It also has the secondary upside that it becomes possible to provide a decompression implementation that can index into the compressed data in a meaningful way (at least potentially).

    Replies
    1. Good point.

      For the time being, it is a byte stream.

      I initially wanted to merge both the "random access" and "stream" formats into a single specification, but finally ruled against it.
      The main issue is that the two formats have vastly different properties, and therefore would be better off being defined independently.

      "Random access" requires :
      - Fixed block size, fully filled
      - Blocks must be independant
      - All block sizes regrouped at the beginning, to immediately point to the area to decode
      - Full size MUST be known, in order to evaluate the size of offset table
      - Problems (if not impossibility) for concatenation

      Well, quite frankly, it looks like 2 different beasts.
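      For illustration only, the offset-table requirement above could translate into something like this (an entirely hypothetical layout, since random access is out of scope for this spec):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical random-access layout: each block holds exactly
   BLOCK_SIZE uncompressed bytes, and a table of compressed-block
   start offsets is stored up front. */
#define BLOCK_SIZE (64 * 1024)

typedef struct {
    const uint64_t *offsets;   /* offsets[i] = file position of block i */
    size_t          nb_blocks;
} block_index;

/* Map an uncompressed byte position to the file offset of the block
   holding it, and the position within that block's decompressed data.
   pos is assumed to lie within the stream. */
static uint64_t locate(const block_index *idx, uint64_t pos, uint32_t *within)
{
    size_t block = (size_t)(pos / BLOCK_SIZE);   /* fixed, fully filled blocks */
    *within = (uint32_t)(pos % BLOCK_SIZE);
    return idx->offsets[block];    /* caller seeks here and decodes one block */
}
```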

    2. agreed, those should be two specs. The random access one will probably be much more complex.

  8. > By ‘Stream Appending’ you surely mean concatenation?

    Yes, it's probably better worded that way.

  9. Mark> - You should put yourself as author of the document + small simple copyright,

    Very good point, thanks. Done now.

  10. Regarding the data blocks, what about the following proposition:
    - first bit indicates if content is compressed ("1"; "0" if incompressible)
    - bit 2: is the data size expressed using 14 bits ("0") or 22 bits ("1"), i.e. 2 or 3 bytes minus 2 bits?
    - bits 3 to 16 or 24: data size (regardless of whether the block is compressed or incompressible)

    As you see, I would recommend limiting the data size of a block to 4 MB.
    Even further, I'm not even sure you would need two possible sizes just to save one byte per block. In other words, what if the data size were always expressed as 3 bytes minus the one-bit compressed flag (max data size in a block of 8 MB)?
    Is there a statistically measured "sweet spot" for the LZ4 algorithm when it comes to compression ratio for a max block size and dictionary? (I'm afraid I know the answer)
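    The two-width variant of this proposition could be sketched in C as follows (a thought experiment on the layout proposed in this comment, not part of the actual spec):

```c
#include <stdint.h>

/* Sketch of the proposed block header: bit 0 = compressed flag,
   bit 1 = size width (0: 14-bit size in a 2-byte header, 1: 22-bit
   size in a 3-byte header), remaining bits = data size, little-endian.
   size is assumed to be < (1u << 22). Returns the header length. */
static unsigned pack_header(uint8_t out[3], int compressed, uint32_t size)
{
    uint32_t flags = compressed ? 1u : 0u;
    if (size < (1u << 14)) {                 /* 2-byte form, bit 1 = 0 */
        uint32_t v = flags | (size << 2);
        out[0] = (uint8_t)v;
        out[1] = (uint8_t)(v >> 8);
        return 2;
    }
    uint32_t v = flags | 2u | (size << 2);   /* 3-byte form, bit 1 = 1 */
    out[0] = (uint8_t)v;
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    return 3;
}

static unsigned unpack_header(const uint8_t *in, int *compressed, uint32_t *size)
{
    *compressed = in[0] & 1;
    if ((in[0] & 2) == 0) {                  /* 2-byte form */
        *size = ((uint32_t)in[0] | ((uint32_t)in[1] << 8)) >> 2;
        return 2;
    }
    *size = ((uint32_t)in[0] | ((uint32_t)in[1] << 8)
            | ((uint32_t)in[2] << 16)) >> 2;
    return 3;
}
```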

    Replies
    1. Good point. I may simplify this part in a next version.
