Wednesday, September 11, 2013

Towards a Streaming Interface for LZ4

After settling on a streaming format for LZ4, it seems about time to define a proper API for it.

LZ4 is quite stable in "block mode" scenarios, which are simple: "take a memory zone here, and compress it there", along with the obvious reverse operation, "take a compressed memory zone there, and decompress it here". These scenarios and their close variants seem properly supported today.

The objective of the streaming API is to extend usage beyond block mode.
The API shall serve a fairly typical communication scenario between distant machines, in which data may not be easily prepared into a single source buffer (for instance when it comes from multiple sources), and must be sent at arbitrary (application-defined) moments, with the goal of reducing latency.
It's not necessarily the only scenario the API will serve, but "live communication" is interesting enough; it requires a strong set of properties, difficult to fulfill, and is therefore a likely superset of many other, simpler, use cases.

A streaming interface requires defining a direction. For each direction, there is a data producer, acting as a compressing pipe, and a data consumer, requiring a decompressing pipe. (Full bi-directional communication therefore requires 2 directions.)
As a consequence, two roles, and two interfaces, have to be defined.

From a definition perspective, starting from the end can be more instructive, in order to properly see and take into consideration desirable properties. So let's start with the decoding side.


Streaming API for decompression

The LZ4 streaming format describes a stream as a series of blocks.
The fundamental assumption is that each block must be complete before it can be decoded.
This can be easily checked, thanks to the block prefix, which tells its size. When enough data is provided, the compressed block can be considered "complete", and decompression can be performed.
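
To make the completeness check concrete, here is a minimal sketch in C. It assumes a 4-byte little-endian block prefix whose highest bit is a flag and whose remaining 31 bits give the size of the block's data; that layout is an assumption for illustration and should be checked against the format specification. The names `read_le32` and `block_bytes_missing` are hypothetical and not part of any LZ4 API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_PREFIX_SIZE 4u   /* assumed size of the block prefix */

/* Read a 32-bit little-endian value, independent of host byte order. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Returns 0 when the block starting at `src` is complete within `avail`
 * bytes. Otherwise returns how many more bytes are known to be needed:
 * the rest of the prefix if the prefix itself is cut, else the rest of
 * the block. */
size_t block_bytes_missing(const uint8_t *src, size_t avail)
{
    assert(src != NULL);
    if (avail < BLOCK_PREFIX_SIZE)
        return BLOCK_PREFIX_SIZE - avail;

    /* Assumption: the top bit is a flag, the low 31 bits are the size. */
    size_t payload = (size_t)(read_le32(src) & 0x7FFFFFFFu);
    size_t total = BLOCK_PREFIX_SIZE + payload;
    return (avail >= total) ? 0 : total - avail;
}
```

A caller would keep accumulating incoming bytes until `block_bytes_missing` returns 0, and only then hand the block to the existing decoding function.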

The main advantage of this method is that it can use existing field-tested decoding functions. However, it also introduces a few limitations.

  • Unfinished input block :
    If the full compressed block is available directly from source buffer, there is no problem.
    But if not, the partial block must be copied into a temporary buffer, in order to complete it later on.

    This situation could for example happen in a packet scenario, where each packet might be small (for example about 1500 bytes over Ethernet), and several of them are needed to represent even a small compressed block.

    Nonetheless, just the possibility that it "may" happen translates into a requirement to allocate some memory for a temporary input buffer, just in case. Another, more complex, solution would be to allocate it "just in time", that is, the first time an unfinished block is received.

    Fortunately, the streaming format defines an upper bound to this temporary buffer size, since each block has a maximum size. Therefore, this memory requirement is predictable.
    A way to limit the amount of temporary memory is to use small maximum block sizes. The decoder could then simply refuse to decode streams with a maximum block size above a selected limit.

    Currently, the LZ4 streaming format defines a few maximum block size values, from 64KB to 4MB. There may be a need to define more sizes (specifically at the lower end), and, maybe, let users define their own sizes. In this case, the format specification will have to evolve to integrate this requirement.

  • Size of destination buffer :
    The only thing that the decoder knows when it starts decoding a block is its maximum decoded size.
    Should the destination buffer not be large enough to contain the largest possible block size, there is a risk that the decoding process will fail.

    There are several ways to handle this situation. The easiest one is to impose a restriction on authorized destination buffer size. Simply put, if the destination buffer is not large enough, the decoding process will not even start.

    Such a restriction certainly makes the streaming interface easier to code, but can be considered too limiting. For example, maybe the user process wants to decode data directly at its final destination. In this case, the destination buffer is just big enough to handle the exact amount of data expected, not a single byte more.

    To handle such a case, a temporary output buffer would be allocated. Whenever a decoded block is too large to fit into the destination buffer, the block will be decoded into this temporary output buffer instead; then, the relevant part of the block will be copied into the destination buffer.
    Such a "memory copy" operation is obviously sub-optimal, but it gets the job done.

    This mechanism still leaves open the choice of which buffer gets priority.
    If the destination buffer is not large enough to contain the largest possible block size,
    should the block be decoded first into the temporary output buffer, and then the relevant part get copied into the destination buffer,
    or,
    should it be decoded first into the destination buffer, just in case it turns out to be large enough, and then fall back to the temporary output buffer when it does not work ?

    The second choice basically places more trust in the programmer, at the cost of a heavier performance penalty if the bet wasn't correct (2 decoding function calls instead of one). On the other hand, if the bet was correct, it saves one memory copy operation.
    The first choice is more middle-ground, costing one memory copy operation and a single decoding function call in all circumstances.

    A more complex proposal could be to "heuristically guess" the best choice. For example, the algorithm would start with choice 2 (trust the programmer, and decode directly into the destination buffer), and then revert to choice 1 after a few failures.
    Another possibility is to have 2 separate functions, so that the programmer can directly select the behavior they want.

  • Chained blocks :
    In order to improve compression ratio, sometimes dramatically for small packets, the LZ4 streaming format is able to encode new blocks using the last 64KB of data decoded from previous blocks.
    This mechanism produces terrific compression improvements, especially for packets which have a lot of headers, fields and/or identifiers in common.

    The logical consequence is that the previous 64KB must be known by the decoding process. Since there is no guarantee that prior decoded data still sits at its decoded position (and it is most probably not), it seems there is no other way than to save these 64KB into a temporary buffer.

    A fairly logical choice would be to put these 64KB into the "temporary output buffer" mentioned in the previous item, thus merging both requirements. It inflates the size of this temporary output buffer by 64KB.

    Another consequence is that choice 2 (of the previous item) no longer makes sense, so only choice 1 seems relevant. Choice 2 can still make sense for in-place decompression of independent blocks. However, for a streaming scenario, chained blocks are really the advised setup, since they greatly improve compression ratio.

    If there is no other possibility than to first decode into a temporary output buffer, then it seems to make sense to provide, as a result, a pointer and a length into this buffer, rather than copying the result into another destination buffer. It will be up to the user process to decide what to do with this data.

  • Too much input for your output :
    This situation can happen when the input contains several full-size blocks.
    Since decoding is going to be done in a temporary output buffer, whose size is controlled by the decoding stream, it is guaranteed to decode at least one full block.
    But beyond that point, there is no such guarantee.

    As a consequence, with not enough output buffer left to decompress remaining input data, the streaming interface must deliver what it can, and then indicate that the job is not completed.
    One relatively simple way to achieve this is to output a Boolean "more_to_come" signal. When it is set, the user process is informed that more data needs to be decoded, and should therefore call the decoding function again, after disposing of the current output (since the next output is likely to overwrite it).
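
To illustrate how a caller might consume such an interface, here is a minimal sketch in C. Everything in it is hypothetical: `demo_stream`, `demo_feed`, `demo_decode` and `demo_drain` are invented names, the temporary output buffer is made tiny on purpose so that several calls are needed, and a plain `memcpy` stands in for actual block decoding. The only points it tries to show are the "pointer and length into the temporary buffer" result and the "more_to_come" loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TMP_OUT_CAPACITY 8u   /* tiny on purpose, to force several calls */

typedef struct {
    unsigned char tmp_out[TMP_OUT_CAPACITY]; /* owned by the decoding stream */
    const unsigned char *pending;            /* input not yet consumed */
    size_t pending_size;
} demo_stream;

/* Hand new input to the stream; it must stay valid until fully consumed. */
void demo_feed(demo_stream *s, const void *src, size_t size)
{
    s->pending = (const unsigned char *)src;
    s->pending_size = size;
}

/* Deliver at most TMP_OUT_CAPACITY bytes through *out / *out_size.
 * Placeholder: memcpy stands in for decoding a block.
 * Returns 1 ("more_to_come") while unconsumed input remains, else 0. */
int demo_decode(demo_stream *s, const unsigned char **out, size_t *out_size)
{
    size_t n = s->pending_size < TMP_OUT_CAPACITY ? s->pending_size
                                                  : TMP_OUT_CAPACITY;
    if (n > 0)
        memcpy(s->tmp_out, s->pending, n);
    s->pending += n;
    s->pending_size -= n;
    *out = s->tmp_out;   /* points into the stream's own buffer */
    *out_size = n;
    return s->pending_size > 0;
}

/* Caller-side loop: dispose of each output before calling again,
 * because the next call overwrites the temporary buffer. */
size_t demo_drain(demo_stream *s, unsigned char *sink, size_t sink_capacity)
{
    size_t total = 0;
    int more_to_come;
    do {
        const unsigned char *out;
        size_t out_size;
        more_to_come = demo_decode(s, &out, &out_size);
        assert(total + out_size <= sink_capacity);
        memcpy(sink + total, out, out_size);
        total += out_size;
    } while (more_to_come);
    return total;
}
```

Here the caller simply copies each piece into a larger sink; a real user process could just as well parse or forward the data in place, which is the point of returning a pointer rather than copying.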

So here we have a reasonably complex set of requirements for the decoding process. And there are already a few avenues for improvement.

For example, most of the requirements regarding temporary buffers come from the fact that the decoding function must handle complete blocks.
Should a function able to decode partial blocks exist, it would eliminate the need for a temporary input buffer.
The size of the temporary output buffer, which currently must be 64KB + MaxBlockSize, could receive an upper limit of 64KB + 64KB = 128KB (This obviously would only matter for large values of MaxBlockSize).
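
The arithmetic can be spelled out in a few lines of C. This is only a sketch: the function names are hypothetical, and the assumption that a partial-block decoder would work in chunks of at most 64KB (so that smaller block sizes need even less) is mine, inferred from the 128KB figure above.

```c
#include <assert.h>
#include <stddef.h>

#define HISTORY_SIZE ((size_t)64 * 1024)   /* 64KB of prior decoded data */

/* Complete-block decoding: history plus one full block. */
size_t tmp_out_size_full_blocks(size_t max_block_size)
{
    assert(max_block_size > 0);
    return HISTORY_SIZE + max_block_size;
}

/* Partial-block decoding (assumed): history plus a chunk capped at 64KB. */
size_t tmp_out_size_partial_blocks(size_t max_block_size)
{
    assert(max_block_size > 0);
    size_t chunk = max_block_size < HISTORY_SIZE ? max_block_size
                                                 : HISTORY_SIZE;
    return HISTORY_SIZE + chunk;
}
```

With a 4MB maximum block size, the first function gives 64KB + 4MB while the second gives 128KB; with a 64KB maximum block size, both give 128KB, which is why the gain only shows for large values of MaxBlockSize.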

Going one step further, a decoding function able to handle data copy from "out of buffer" positions would reduce memory needs to just 64KB, on top of opening the door to in-place decompression. By avoiding the requirement to decompress first into a temporary output buffer, and then copy the relevant result to its final destination, it could improve performance.
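
As an illustration of what copying from "out of buffer" positions could look like, here is a minimal sketch in C of an LZ-style match copy whose source may start before the beginning of the destination buffer. It is not LZ4's actual decoder: `copy_match` and its parameters are hypothetical, and the copy is done byte by byte for clarity rather than speed. The saved history is treated as if it sat immediately before the destination buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `length` bytes to dst[pos...], reading from `offset` bytes behind
 * the write position. When that source lies before dst[0], the bytes are
 * taken from the tail of `history` (previously decoded data, logically
 * located just before dst). Returns the new write position. */
size_t copy_match(unsigned char *dst, size_t pos,
                  size_t offset, size_t length,
                  const unsigned char *history, size_t history_size)
{
    /* The match must start inside history or inside dst. */
    assert(offset > 0 && offset <= pos + history_size);

    for (size_t i = 0; i < length; i++) {
        size_t cur = pos + i;
        if (offset > cur) {
            /* Source byte is before dst[0]: fetch it from history. */
            size_t back = offset - cur;   /* distance before dst[0] */
            dst[cur] = history[history_size - back];
        } else {
            /* Source byte is inside dst; overlapping matches work
             * because bytes are written in increasing order. */
            dst[cur] = dst[cur - offset];
        }
    }
    return pos + length;
}
```

For example, with a history ending in "cd", an empty destination, an offset of 2 and a length of 5, the first two bytes come from the history and the remaining three from the destination itself, producing "cdcdc".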

On the other hand, a new decoding function able to decode "partial blocks" would require more complex logic, tests, branching, and state information. All this complexity would cost a fair share of speed. It's unclear whether the final result would be faster than using an intermediate buffer. But at least, it would shave off a few buffers.

There are apparently several optimization steps which could be attempted beyond the first delivery. The main idea is that such future improvements will, ideally, have no impact on the API itself.

(To be continued...)
