Friday, May 25, 2012

Useful compressed streaming properties

 The previous articles focused primarily on error detection and memory buffer management, which are, arguably, very important features. We'll continue digging into the properties that seem necessary to build a suitably universal compressed streaming format.


  • Compressed data appending
    There are use cases in which newly created compressed data is simply appended to an existing file or stream. Some time later, the file will be decoded and/or filtered, searching for data of potential interest.
    The decoder must not choke on such input. It shall simply continue decoding, dealing with each compressed data set one after another.

    This feature is relatively simple to support: it merely requires not assuming that an EOF marker necessarily follows the current compressed stream.

    The change is simple, but it also means that a single stream may host several compressed data sets. This situation is reminiscent of the well-specified and documented gzip format (RFC 1952):
    "A gzip file consists of a series of "members" (compressed data sets)."
    This is a good fit for our situation, so we'll use the same naming convention (a minimal decoding loop is sketched below).
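
    As a quick illustration, here is a minimal sketch of such a decoding loop, in C. decode_one_member() is a purely hypothetical helper standing for "read and decode one complete member"; it is not part of any existing API.

      #include <stdio.h>

      /* Hypothetical helper : reads one complete member (header + compressed data)
         from 'in', writes the decoded result to 'out'.
         Returns 0 on success, -1 on error. */
      int decode_one_member(FILE* in, FILE* out);

      /* Keep decoding members until the real end of input,
         instead of stopping after the first one. */
      int decode_stream(FILE* in, FILE* out)
      {
          int c;
          while ((c = fgetc(in)) != EOF)    /* peek : is there more data ? */
          {
              ungetc(c, in);                /* put the peeked byte back */
              if (decode_one_member(in, out) != 0)
                  return -1;                /* corrupted member : stop here */
          }
          return 0;   /* clean end of input, after one or several members */
      }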

  • Authorized members
    A typical compressed stream will have a single member.
    The member will, necessarily, start with a header, which allows screening it as "valid" or "invalid" input.
    A simple way to achieve this selection is to start with a "magic number", which is then compared to a list. If it's not in the list, the input is rejected as "invalid".
    The "magic number" is enough to tell which kind of decoder is necessary, and which header parameters follow.
    To ensure proper screening of invalid (noisy) input, the magic number shall be large enough for most of its values to be considered "invalid", thus reducing the probability of a "false positive" to negligible levels.

    There is however a specific category that could be defined beforehand: skippable members.

    Skippable members do not contain compressed data. Instead, they may contain, for example, user comments, a description of the compressed content (such as file properties), or an electronic signature; it is even possible to embed a program suitable for a specific environment.
    The point is that in some circumstances these members are not strictly necessary, and may be skipped entirely to retrieve just the compressed content.

    Skipping such a member obviously requires knowing its size.
    Therefore, the specification shall define:
    • A range of authorized "magic number" values, identifying skippable members
    • A mandatory "member size" field
    • Beyond that, any data in the member is "user-specified"

      Any magic number which falls outside the list of authorized members is considered "invalid" and stops the decoding process, flagging the input as "corrupted" (a small screening sketch is given below).
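
    To make this screening step more concrete, here is a small C sketch. The magic number values are purely illustrative (the real ones are not defined at this stage), and the skippable payload is discarded by reading it, so the code also works on non-seekable streams:

      #include <stdint.h>
      #include <stdio.h>

      #define MAGIC_COMPRESSED    0x184C2102U   /* illustrative "compressed member" magic */
      #define MAGIC_SKIPPABLE_MIN 0x184D2A50U   /* illustrative start of skippable range  */
      #define MAGIC_SKIPPABLE_MAX 0x184D2A5FU   /* illustrative end of skippable range    */

      /* Read a 32-bit Little Endian value ; returns 0 on success, -1 on end of input. */
      static int read_le32(FILE* in, uint32_t* value)
      {
          unsigned char b[4];
          if (fread(b, 1, 4, in) != 4) return -1;
          *value =  (uint32_t)b[0]        | ((uint32_t)b[1] << 8)
                 | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
          return 0;
      }

      /* Screen the next member.
         Returns 1 for a compressed member (to be handed to the decoder),
         0 for a skippable member (its content is discarded),
         -1 for an unknown magic number : the input is flagged as corrupted. */
      int screen_member(FILE* in)
      {
          uint32_t magic, memberSize;
          if (read_le32(in, &magic) != 0) return -1;

          if (magic == MAGIC_COMPRESSED) return 1;

          if ((magic >= MAGIC_SKIPPABLE_MIN) && (magic <= MAGIC_SKIPPABLE_MAX))
          {
              if (read_le32(in, &memberSize) != 0) return -1;
              while (memberSize--)                  /* skip the user-specified payload */
                  if (fgetc(in) == EOF) return -1;
              return 0;
          }

          return -1;
      }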

  • Clear endian convention
    Some CPUs read data using the Little Endian convention, others use the Big Endian convention.
    Other conventions exist too (such as PDP endian), but it's fair to say they are rare.

    Anyway, the point is that, to ensure all these platforms can exchange data with each other, a single convention shall be selected and enforced.

    In my view, Little Endian is the most common convention these days, with x86/x64 on one side and ARM on the other. Together, they almost "own" the CPU market, TV game consoles being the main notable market outside their reach (a portable way to write Little Endian values is sketched below).
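
    As a small sketch, writing a 32-bit value byte by byte produces the same Little Endian output whatever the native endianness of the host CPU (the reading side, shown in the previous sketch, reassembles the bytes in the same order):

      #include <stdint.h>

      /* Store 'value' into 'dst' in Little Endian byte order,
         independently of the host CPU's native endianness. */
      static void write_le32(unsigned char* dst, uint32_t value)
      {
          dst[0] = (unsigned char)( value        & 0xFF);
          dst[1] = (unsigned char)((value >>  8) & 0xFF);
          dst[2] = (unsigned char)((value >> 16) & 0xFF);
          dst[3] = (unsigned char)((value >> 24) & 0xFF);
      }

    On targets known to be Little Endian, a direct memory store would be faster; the byte-by-byte form is merely the portable fallback.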

  • Compatibility with non-seekable streams
    Being compatible with "streaming mode" explicitly excludes any dependency on the "seekable" property.

    This use case targets pipes, in either input or output mode, or both.
    Piped input data is provided "on the fly" by a generating process, such as tar. There is no way to "know its size" in advance: it is only discovered upon hitting the end of the stream.
    Piped output data is immediately consumed by a user process. There is no way to "come back" to an earlier position and correct or add information there: any new information must necessarily be appended.

    When both input and output are piped, it creates a set of restrictions.

    Fortunately, these restrictions are not so difficult to overcome.
    First, a stream doesn't work "byte by byte". It is in fact a succession of memory buffers. One of them is the compression buffer.
    Knowing the size of the input source data is useful in one case: when it's smaller than the default block size. This is fine, since in that case all the input can be "loaded" into the compression buffer and then compressed. Since we then write the result to the output buffer, it's possible to write the input size there. During decoding, this size will allow allocating just the necessary amount of memory, instead of the full block size.
    This is a useful but small benefit: we could have allocated the buffer at its maximum "block" size instead.

    When the input is larger than a single block (= buffer size), we have a problem: we can't "write" the size of the input, since we don't know it yet.
    Well, it's not that much of a problem. LZ4 doesn't "need" this size to decode. And the buffer is already allocated to its maximum block size anyway.

    In other words, knowing the uncompressed source size is not mandatory. We can live without it (a small sketch of this logic is given below).
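
    Here is a minimal sketch of this logic on the compression side, in C. The block size is illustrative, and the actual compression and header layout are left as comments since they are not defined yet:

      #include <stdio.h>
      #include <stdlib.h>

      #define BLOCK_SIZE (4*1024*1024)    /* illustrative block size only */

      int compress_first_block(FILE* in)
      {
          char* block = (char*)malloc(BLOCK_SIZE);
          if (block == NULL) return -1;

          size_t readSize = fread(block, 1, BLOCK_SIZE, in);

          if ((readSize < BLOCK_SIZE) && feof(in))
          {
              /* The whole input fits into a single block :
                 its size (readSize) is known before anything is written,
                 so an optional "content size" field could be stored in the header. */
          }
          else
          {
              /* Input is larger than one block : its total size is unknown
                 at this point, so no content size field is written.
                 The decoder does not need it anyway. */
          }

          /* ... compress 'block' and write the result to the output stream ... */

          free(block);
          return 0;
      }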
  • Enforce data alignment
    LZ4 does not require any data alignment, so enforcing it does not look necessary.

The next article will look more closely at the properties of "members".