Tuesday, May 15, 2012

Dealing with block side-effects

Continuing the previous post's analysis of lz4demo's current framing format, another side-effect created by this simple format is "latency".

Since the fixed block size is 8MB, the codec must wait for the first block to be completely filled before starting any decoding or compressing operation. This effectively delays processing until a full block of data is available.

This issue may not seem like much, especially when the underlying I/O is fast. Nonetheless, not all I/O is fast, and even when it is, an 8MB "starting time" is bound to be measurably worse than, for example, a 256KB one.

As a consequence, a framing format with a smaller block size would offer better and smoother processing throughput.

Which leads us to one last and important issue: independent blocks.
While this strategy is good for simplicity and multi-threading, it's bad for compression: it translates into a worsened compression ratio for the first 64KB of each block.

With block sizes of 8MB, this effect is negligible (affecting compression ratio by less than 0.1%). However, the smaller the block size, the worse the side-effect: with 256KB blocks, for example, the penalized first 64KB already represents a quarter of each block. With small block sizes, this effect can no longer be neglected.

Therefore, should the blocks remain independent?

By making each block depend on the previous one, the problem of worsened compression ratio disappears. But it also becomes impossible to decode a compressed block independently, with negative consequences for multi-threading and partial decoding capabilities.
Is that really an issue? Well, it really depends on the use case.

In many circumstances, such as simple file compression or stream compression, it does not matter, since data is encoded and decoded sequentially anyway. Throwing away the capability to multi-thread decompression seems bad, but in fact, most of the time, I/O is unable to cope with LZ4 decoding speed (around 1GB/s). So a single decoding thread is enough to handle almost any I/O load.
Since there is little need for partial decoding or for multi-threaded decoding, the compression ratio gain looks more useful.

There is just one little remaining problem:
While the decoding function needs only a few adaptations to handle this new use case (most of the complexity lies in the buffer manager), the compression function, on the other hand, has to be adapted.

While each block was independent, compression could start with a pristine, clean reference table.
But with sequentially dependent blocks, initialization becomes more complex: the previous 64KB needs to be copied in front of the next block, and then loaded/hashed into the reference table, before compression can start. This obviously costs CPU and time.
A variant is to simply "translate" the references already loaded into the table as a result of compressing the previous block, but it is limited to single-thread scenarios.
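
For illustration, here is a minimal sketch of the "copy the previous 64KB in front, then load/hash it" initialization described above. It assumes liblz4's streaming functions LZ4_loadDict() and LZ4_compress_fast_continue(); the function name, buffers and error handling are hypothetical, and the very first block of a stream would be handled separately.

    #include <string.h>
    #include "lz4.h"

    #define WINDOW_SIZE (64 * 1024)

    /* Hypothetical helper: compress 'src' while allowing matches into the
     * last 64KB of the previous block ('history'). 'stream' must have been
     * created beforehand (e.g. with LZ4_createStream()), and 'workBuf' must
     * hold at least WINDOW_SIZE + srcSize bytes. */
    static int compress_next_block(LZ4_stream_t* stream, char* workBuf,
                                   const char* history,
                                   const char* src, int srcSize,
                                   char* dst, int dstCapacity)
    {
        /* 1) copy the previous 64KB right in front of the new data */
        memcpy(workBuf, history, WINDOW_SIZE);
        memcpy(workBuf + WINDOW_SIZE, src, srcSize);

        /* 2) load/hash the 64KB of history into the reference table */
        LZ4_loadDict(stream, workBuf, WINDOW_SIZE);

        /* 3) compress the new block; matches may now reach back into the history */
        return LZ4_compress_fast_continue(stream, workBuf + WINDOW_SIZE, dst,
                                          srcSize, dstCapacity, 1);
    }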

OK, but now that we can reference data from previous blocks, how far back should we go? The natural maximum distance is the "copy window size", which is 64KB by default for LZ4. But it could happen that the receiving end of the compressed stream does not have enough memory to store that much data. In such a case, the compressor must be careful not to use references beyond the memory capacity of the receiver. In other words, it must deliberately discard long-distance copy operations.
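
As a purely illustrative sketch (not actual LZ4 internals), discarding long-distance copies boils down to one extra check in the compressor's match search, with the maximum distance derived from the receiver's declared memory budget; the helper below is hypothetical.

    #include <stddef.h>

    /* Hypothetical check: accept a candidate match only if the receiver,
     * given its memory budget, will still hold the referenced data. */
    static int match_is_reachable(size_t currentPos, size_t matchPos, size_t maxDistance)
    {
        return (currentPos - matchPos) <= maxDistance;
    }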

Should such a use case be part of the generic framing format or not?
My answer would be: yes, as long as an easy solution can be found.

How could that work? Once again, let's focus on the decoder side.
I'll imagine a controller with only 4KB of memory available as a buffer.
A simple way to handle such a case is to divide this space into 2 parts: 2KB for the "previous block", and 2KB for the "block to decode". So we end up with the following (a code sketch follows the list):
- Block size = 2KB = memory limit / 2
- Maximum copy reference = lower limit of the previous block
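
Here is a minimal sketch of this double-buffer scheme on the decoder side. It assumes liblz4's streaming decode functions LZ4_setStreamDecode() and LZ4_decompress_safe_continue(); read_compressed_block() and write_output() are hypothetical I/O helpers, and the compressed-input buffer is counted separately from the 4KB decoding budget.

    #include "lz4.h"

    #define BLOCK_SIZE (2 * 1024)   /* memory limit / 2 */

    /* Hypothetical I/O helpers provided by the surrounding application */
    extern int  read_compressed_block(char* dst, int maxSize);
    extern void write_output(const char* data, int size);

    /* Decode an LZ4 stream made of <= 2KB blocks into a 4KB double buffer.
     * LZ4_decompress_safe_continue() remembers where the previous block was
     * decoded, so copies may reach back into the other half, but no further:
     * exactly the "one block distance" rule discussed in this post. */
    static void decode_stream(void)
    {
        static char ring[2][BLOCK_SIZE];                 /* the whole 4KB budget */
        char compressed[LZ4_COMPRESSBOUND(BLOCK_SIZE)];  /* input staging area */
        LZ4_streamDecode_t lz4sd;
        int half = 0;

        LZ4_setStreamDecode(&lz4sd, NULL, 0);
        for (;;) {
            int cSize = read_compressed_block(compressed, (int)sizeof(compressed));
            if (cSize <= 0) break;     /* end of stream */
            int dSize = LZ4_decompress_safe_continue(&lz4sd, compressed,
                                                     ring[half], cSize, BLOCK_SIZE);
            if (dSize < 0) break;      /* malformed data */
            write_output(ring[half], dSize);
            half ^= 1;                 /* alternate between the two 2KB halves */
        }
    }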

Obviously, there are other possibilities, such as cutting data into even smaller parts (for example, 1KB blocks) and allowing back references of up to 3 previous blocks. But as a first approximation, these variants seem to provide almost equivalent results while being more complex.

This situation can be summarized with a simple rule: never reference data beyond one block distance.

With only this simple rule in place, it seems the default LZ4 framing format could be made compatible even with environments with severely limited RAM, provided the encoder selects a suitable block size.

2 comments:

  1. Just a comment on the multithreaded decompression. You wrote:

    "Throwing away the capability to multi-thread decompression seems bad, but in fact, most of the time, I/O is unable to cope with LZ4 decoding speed (around 1GB/s). So a single decoding thread is enough to handle almost any I/O load."

    While this is true, there are scenarios where you cannot dedicate a single hardware thread to decompression for long periods of time.

    For example, in our Frostbite engine we perform I/O asynchronously and process all data in "Jobs" which are ideally small units of code+data which get scheduled among cores (and SPUs on the PS3). This also happens while the game is running full steam, so there are many other jobs scheduled at the same time. In order for the job scheduling to work well, it's good if most of the jobs are of a similar size and don't vary too much from frame to frame. Currently we essentially use zlib with 64kb chunking. This works well because we can schedule decompression wide and thus get lower latency, even though the overall CPU time consumed is probably slightly higher than if we ran decompression for all available data in a single job.

  2. I fully agree with your comment, which provides very good insight into how resource management actually "works" in a real-world application. It makes perfect sense for a fully loaded game engine to "spread" the load into small units, if only for latency and memory usage reasons.

    Please note that the latest series of articles is directly related to a "generic framing format", layered above the LZ4 compression format and intended for file compression storage or network stream compression. It is a narrower use case.
    For "in-memory" compression scenarios, such as game resources or a database memory cache, I believe it is better to directly use the current "raw LZ4 compression format" and build a custom (optimised) framing on top of it (such as the 64KB chunks you describe for Frostbite).

    Anyway, the current plan is to provide both modes, independent blocks and linearly dependent blocks, as part of the framing format specification, so that the best one can be selected depending on the situation.
