After taking care of the decompression stream, it's now time to have a look at the compression side.
As stated earlier, a compression stream delivering independent blocks is fairly easy to setup. So the objective of today is to define a proper behavior using chained block methodology, which uses data from previous blocks to better compress the next one. By extension, this strategy also works with static dictionaries.
An ideal compression stream should be able to :
- accept input from any source and any size
- deliver compressed output into destination buffer of any size.
- use as little working memory as possible
The last point is the most interesting one.
A (relatively) easy way to create a streaming interface is to use intermediate buffers, for both input and output. These buffers would easily absorb side effects, such as : providing a ridiculously small output buffer, channeling more input than output, etc.
Obviously, this methodology has downsides too, such as performance slowdown, due to the need to copy data from and into intermediate buffers. But more importantly, it increases memory footprint.
Let's focus on the output buffer for now.
It's relatively easy to skip the intermediate output buffer in the simple case when the remaining space into the destination buffer is "large enough". Then, it's possible to compress directly into the destination buffer, because the compressed block can fit in. It saves a useless memory copy operation, and potentially a memory allocation, if the intermediate output buffer is not yet allocated.
"Large enough" condition is met when one of these conditions is reached :
- Remaining space is larger than a full block size.
- Remaining space is larger than input.
However, if none of these conditions is fulfilled, it becomes harder to decide.
If remaining space into output buffer is smaller than input size (or block size), we could eventually try compression, just in case it would work. After all, if compression ratio is, for example 2:1, an output buffer with barely more than half the size of input could be enough.
The problem is, if it's not enough, then the compression operation will fail, and compression tables would become unusable. This is a major issue. In contrast with decoding, we can't "try and retry" compression, due to the modifications of compression tables.
A potential idea to solve this issue could be to duplicate compression tables before attempting the compression. However, let's remind that the objective is to perform better than "compressing into an intermediate buffer", to avoid some memory copy and allocation. Duplicating the compression tables would create the very same issues, it's doubtful if this technique achieve any saving.
Another idea is to dynamically adapt the amount of input to compress. Basically, since the worst case is that input may be not compressible, compress as much input data as there is remaining space into output buffer. This is possible, because blocks don't need to be full, they just have a "maximum size" (currently, selectable from 64KB to 4MB).
OK, this method works, up to a point.
As stated previously, LZ4 is currently able to produce compressed blocks, which need to be "properly terminated". Small blocks compress worse than larger ones. Fortunately, chained block is an excellent methodology to mitigate this effect.
However, when block sizes start to be become very small (for example, < 100 bytes), some inefficiencies become sensible. For example, the last 5 bytes of any block must be uncompressed. A match cannot start at a distance < 12 bytes from the end of the block. Each block starts with a 4 bytes header, and potentially ends with a 4 bytes checksum. These costs are small, but repeated small blocks will create them more often.
Consequently, there is a "trade-off" to find.
From a compression ratio perspective, an intermediate buffer is more optimal. For example, if the user application provides a destination buffer of 150 bytes, it's likely preferable to compress first into a (larger) temporary buffer, and then deliver the result 150 bytes at a time.
Now, it's possible to imagine some embedded systems with very limited available memory, where allocating another temporary buffer is not possible. In this case, these systems might prefer the inefficiency of the "small blocks" method, rather than no compression at all.
So, maybe, the better way is to provide both methods, letting the programmer select which one works best for his use case.
In the end, trying to manage the complexity of all border cases might be a bit too large for a first delivery. After all, the first delivery is not the last, it's still possible to provide improvements in future releases.
So, a probably faster way to deliver a working streaming interface is to work with intermediate buffers methodology by default, avoiding them in the most obvious use cases. These use cases, by the way, should be the most common ones. By properly documenting them, it increases the odd that the Streaming interface is used in a more optimal way.

Hi Yann,
ReplyDeleteWe use LZ4 in You Doodle. We replaced LZO with LZ4 due to licensing problems. We've noticed LZ4 doesn't perform quite as well on the ARM CPU compared to LZO. Any ideas on that?
LZ4 on ARM is supposed to be fast. It was found to be competitive with other fast algorithms (including LZO). This was the main reason why it was adopted into Linux kernel. For more details see :
Deletehttp://events.linuxfoundation.org/sites/events/files/lcjpcojp13_klee.pdf
Potential solution : you can "force unaligned memory access" :
#define LZ4_FORCE_UNALIGNED_ACCESS 1
Some compiler may incorrectly believe the target ARM system is not able to support unaligned memory accesses, because older ARM cpu could not. But it's changed since quite a few revisions now. It makes a big difference, since LZ4 is accessing 32-bits at a time.
For more detailed investigation, maybe we can take this discussion offline (using http://www.emailmeform.com/builder/form/a70egol6fyVf15zd46jY88), since it's not strictly related to this blog post.