Wednesday, May 30, 2012
Or does it ? Sometimes, data comes from a file in local storage, or can be stored on the receiving side. In such cases, the data is merely "traveling", but is not "short-lived".
Does it make any difference ? In fact, yes, it does.
When it comes to sending file content, this data can be "prepared" in advance, which means it can be compressed before it is sent. Very strong (asymmetric) algorithms can be used for compression, as long as decoding remains "fast enough" to cope with data speed. This leads to a major bandwidth reduction, and therefore improves both cost and perceived transmission speed.
When it comes to sending "short-lived" data, it means this data did not exist before being produced, and the sole purpose of its existence is to be sent, and (generally) consumed on the receiving end. There is no way to "prepare" such data in advance : it must be compressed "on the fly", which means "fast".
But there is another terrible side effect : compression performance primarily comes from its capacity to "learn patterns", and re-apply them in an optimal way. This means that, for compression to be effective, a minimum of "historic data" must have already been processed for the next data to be properly compressed. With file content, the history can be the entire file itself, which could mean many megabytes, and therefore excellent prospects for compression.
The situation is much more severe when data is generated and compressed "on the fly" : maybe the message to be sent is only a few tens of bytes long. How to compress such a thing ?
Let's study this use case.
A first "naive" implementation would simply create a message, make a packet out of it, compress it and then send it.
This implementation is unlikely to bring tangible benefits, since IP packets are typically small, sized to fit within the MTU in order to avoid fragmentation side-effects.
A second, more compression-friendly, implementation, could try to amass enough information before starting to compress it, and then send the compressed data using as many packets as necessary.
This will certainly bring better compression performance, but introduces another problem, latency. Waiting for "enough data to be compressed" can lead to unacceptable delays.
For example, in real-time games, player commands must be sent as soon as possible.
As another use case, some systems may generate little data (a temperature probe for example), separated by long intervals.
Therefore, waiting for "enough data" is not a valid strategy in such circumstances.
A third, more complex, strategy, would use all past transmitted data as a kind of "dictionary", to help compress the next packet to come.
This basically requires the dictionary to remain "synchronized" at both ends, sender and receiver. This is achievable in an "ideal" environment (no loss, no error), which is in fact quite common when using TCP transmission.
So, to sum up, we have some freshly generated data to send, of any size but typically small (a few hundred bytes), and we want to use all previously transmitted data as a dictionary to improve compression, which requires some kind of "memory" at both the sender and receiver ends.
This looks possible.
In fact, this is a direct usage of the "variable block sizes" concept which I expressly ruled out as "not useful" in an earlier blog note :). Now seems a good time to consider it again...
Such an implementation would however require some new functions, able to re-use and save some "history data", instead of starting from freshly cleaned tables. This will require quite some work to achieve.
As a side effect of such a methodology, such compressed packets are not compatible with stateless protocols : since they depend on previously sent data, they are inherently stateful. But so are TCP sessions anyway...
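To give a rough idea of what "using past data as a dictionary" buys on small packets, here is a minimal sketch. It uses Python's zlib as a stand-in (this is not LZ4, which does not offer such functions at this stage), and the message content and function names are made up for the illustration. One compressor object is kept alive for the whole session and flushed after each message, so each packet can be decoded immediately while the history is preserved.

```python
import zlib

def make_messages(count):
    """Build small, similar messages, as a game client or a probe might."""
    return [
        b'{"player": 7, "cmd": "move", "x": %d, "y": %d}' % (i, 2 * i)
        for i in range(count)
    ]

def compress_independently(messages):
    """Stateless: every packet is compressed from a clean state."""
    return [zlib.compress(m) for m in messages]

def compress_with_history(messages):
    """Stateful: a single compressor keeps all past messages as history."""
    compressor = zlib.compressobj()
    packets = []
    for m in messages:
        # Z_SYNC_FLUSH emits everything needed to decode `m` right away,
        # while keeping the history window for the next message.
        packets.append(compressor.compress(m) + compressor.flush(zlib.Z_SYNC_FLUSH))
    return packets

def decompress_with_history(packets):
    """The receiver mirrors the sender's state, packet after packet."""
    decompressor = zlib.decompressobj()
    return [decompressor.decompress(p) for p in packets]

messages = make_messages(50)
stateless = compress_independently(messages)
stateful = compress_with_history(messages)
assert decompress_with_history(stateful) == messages
# Total sizes: original, stateless packets, stateful packets
print(sum(map(len, messages)), sum(map(len, stateless)), sum(map(len, stateful)))
```

The receiver must process packets in order and without loss, which is exactly the "synchronized dictionary" constraint described above.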
Monday, May 28, 2012
After spending some time on expected properties at streaming level, let's now get to the core of the objective, regarding the compressed data parameters.
As stated previously, a compressed stream consists of several members, the most important ones being compressed data sets. Each member starts with a header, in order to identify its content. And each header starts with a magic number, a kind of 'ID tag'.
We'll focus here on "LZ4 compressed data set". The stream design above allows adding any future compression algorithm at a later stage.
And let's take as an example the old legacy framing format, defined in lz4demo.
1) There is a magic number, which is 0x184C2102, in little endian format.
2) There are no explicit parameters. In fact, all parameters are implicit.
They are :
- The compressed data set is cut into blocks of 8MB
- Each block starts with a field giving its size (therefore, the compressed size)
- Blocks are independent
- The original data size is not stored. It will be known on decoding completion
- There is no checksum
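The legacy layout is simple enough to walk through in a few lines. The sketch below lists the blocks of such a frame without decoding them. The magic number comes from the description above; the width of the size field (4 bytes, little endian) is an assumption of this example, and the function name is hypothetical.

```python
import struct

LEGACY_MAGIC = 0x184C2102          # from the format description above
SIZE_FIELD = struct.Struct("<I")   # assumption: 4-byte little-endian size field

def list_legacy_blocks(frame: bytes):
    """Return (payload_offset, compressed_size) for each block of a frame."""
    (magic,) = SIZE_FIELD.unpack_from(frame, 0)
    if magic != LEGACY_MAGIC:
        raise ValueError("not a legacy LZ4 frame")
    blocks, pos = [], SIZE_FIELD.size
    while pos < len(frame):
        (csize,) = SIZE_FIELD.unpack_from(frame, pos)
        pos += SIZE_FIELD.size
        if pos + csize > len(frame):
            raise ValueError("truncated block")
        blocks.append((pos, csize))
        pos += csize
    return blocks

# Hand-built frame with two fake payloads (placeholders, not real LZ4 data).
frame = struct.pack("<I", LEGACY_MAGIC)
for payload in (b"abc", b"defgh"):
    frame += struct.pack("<I", len(payload)) + payload
print(list_legacy_blocks(frame))
```

Note that nothing in the frame says how large the decoded output will be, nor whether it arrived intact : this is the direct consequence of the last two implicit parameters.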
Well, even with such limitations, the format nonetheless works perfectly fine. It's just a little too restricted to become a "generic format", and therefore, the objective of the specification is to provide more room for parameter selection.
We have already established in previous blog posts that allowing a checksum for error detection is an important selectable feature.
Another important one is the ability to select the block size, since it directly controls the amount of buffer memory necessary on the decoding side.
Let's now study and establish potential needs for a few other properties :
- Source data size
The original size of source data is not an absolute necessity : it's always possible to decode without it, as long as buffer sizes are properly described.
But it is nonetheless useful. For example, thanks to this information, the number of blocks within the current member can be calculated beforehand. Moreover the amount of data to decode from the last block is known.
Or, if there is a single block, the exact amount of memory can be allocated, instead of the block maximum size.
It is also useful to display the processing position (yep, we decoded 80MB, but does that represent 10% or 90% of the stream to decode ?)
However, there are also circumstances in which this data is not known. For example, if the input was piped to the compressing process, then the size will be known only on hitting its end. This might be too late to "retrofit" the output.
Another situation is when several compressed data sets are appended into a single stream : then the "source data size" field only applies to the current data set, but the total size is not known.
Therefore, since it is useful but not compulsory, this information shall be present, but as an option only.
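As a quick illustration of what the decoder gains from knowing the source size, the arithmetic can be sketched as follows (the function name is made up for the example) :

```python
def block_layout(source_size: int, block_size: int):
    """Return (number_of_blocks, size_of_last_block) for a data set."""
    if source_size == 0:
        return 0, 0
    full, remainder = divmod(source_size, block_size)
    if remainder == 0:
        return full, block_size
    return full + 1, remainder

MB = 1 << 20
# 20 MB of source data cut into 8 MB blocks
print(block_layout(20 * MB, 8 * MB))
```

With a single small block, the second value is also the exact amount of memory to allocate, instead of the block maximum size.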
- Uncompressed blocks
A simple but important feature, since it avoids the bandwidth overhead and CPU consumption of the compression format when it's useless.
This could be done very cheaply, by accepting that, if the stored size of a block is the same as the defined block size, then the block is necessarily uncompressed.
This suggestion looks simple enough for most blocks, except for the last one, whose size is unknown (but capped).
Therefore, it would be necessary to know the size of the last block to compare it to the compressed one, and therefore determine if the block is compressed or not.
Another idea would be : let's give up this complexity, the last block is always compressed, even if compression is either useless or detrimental.
Actually, it's not a good idea to "not handle the last block", since there is a disastrous corner case : suppose that the compressed size of the last block is exactly the size of an uncompressed full block : then the decoder will assume it's uncompressed, leading to data corruption.
This corner case can be avoided by enforcing a simple rule : a compressed block is necessarily smaller than its original size. Therefore, as the last block has a size <= block size, its compressed size is necessarily < block size. Hence, if the stored size of this last block equals the maximum size, then we are in the specific but valid corner case where the last block is exactly the maximum size of a block, and is not compressible.
OK, enough of corner cases, let's now be in the normal situation where the last block size is a fraction of the maximum block size. How could we know it is uncompressed ?
This problem could be mitigated by inserting a piece of information telling that we are dealing with the last block. For example, knowing the original size of the source data is enough for this need.
But it's not always available. As said previously, this is just an option, since in some streaming modes, this information is unknown. Therefore we need another indicator.
It could be something as simple as a bit, which simply tells that there is another block to follow, and as a consequence, the current block is full sized. As a bonus, this mechanism also protects against "silent block truncation" (when the compressed stream is cut exactly at the border between 2 blocks).
On reaching the last block, we need another piece of information, either the uncompressed size of the block, or whether the block is compressed. The latter seems more compact.
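The reasoning above can be condensed into a small decision function. This is only a sketch under stated assumptions : the two indicators (`more_blocks` and `last_block_compressed`) and both function names are hypothetical, and nothing here is a settled header layout.

```python
def choose_payload(raw: bytes, compressed: bytes):
    """Encoder side: keep compressed data only if it is strictly smaller."""
    if len(compressed) < len(raw):
        return True, compressed
    return False, raw

def is_compressed(stored_size: int, block_size: int,
                  more_blocks: bool, last_block_compressed: bool) -> bool:
    """Decoder side: decide how to interpret the block payload."""
    if more_blocks:
        # A block followed by another one is full sized, so a stored size
        # equal to the block size can only mean "stored uncompressed".
        return stored_size < block_size
    # Last block: its original size is unknown, so rely on the explicit flag.
    return last_block_compressed

BLOCK = 4096
print(is_compressed(BLOCK, BLOCK, more_blocks=True, last_block_compressed=False))
print(is_compressed(1500, BLOCK, more_blocks=True, last_block_compressed=False))
print(is_compressed(1500, BLOCK, more_blocks=False, last_block_compressed=False))
```

The third call is the interesting one : a 1500-byte last block cannot be classified from its size alone, hence the extra flag.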
- Zero-filled blocks
This idea was actually proposed by Charles Bloom : it's not rare, for a section of input data, to be filled with zeros.
The idea would be to "mark" such blocks with a specific prefix, such as "0".
For such a situation to have a reasonable chance of happening, the block size must be small enough. For example, this will probably almost never happen with the lz4demo block size (8MB), while it is going to be much more frequent with very small blocks, such as 4KB ones.
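The influence of block size is easy to check on a toy input. The sketch below (the function name and the sample data are invented for the example) counts the blocks made only of zeros for two different block sizes :

```python
def count_zero_blocks(data: bytes, block_size: int) -> int:
    """Count the blocks of `data` that contain nothing but zero bytes."""
    zero_blocks = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        if block.count(0) == len(block):
            zero_blocks += 1
    return zero_blocks

KB = 1 << 10
# 8 KB of data, then a 16 KB run of zeros, then 8 KB of data again
data = b"\x01" * (8 * KB) + bytes(16 * KB) + b"\x02" * (8 * KB)
print(count_zero_blocks(data, 4 * KB))    # small blocks catch the zero run
print(count_zero_blocks(data, 32 * KB))   # one large block hides it
```

The same 16KB run of zeros yields four markable blocks at 4KB, and none at all once the block is large enough to also contain regular data.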
- Error correction
While error detection has been much talked about, nothing has been said up to now about error correction.
That's both because this feature is much more complex to provide and of questionable usefulness.
Error correction is mostly useful in situations when there is no way to "resend" erroneous data. This applies to real-time codecs (such as voice or video) and stored archives.
The difficulty in both cases is that erroneous data tends to be "bursty". For example, when a storage sector fails, we don't lose just a few bytes, but an entire sector size, typically 4KB. Same for transmission, where the most common error is a missing packet.
Dealing with large bursts of errors requires some specific techniques, which unfortunately cost much processing power and memory. As a consequence, the CPU and memory budget for error correction is far beyond that of LZ4, which makes the association a questionable choice.
Therefore, it seems this feature is not "generic enough" to reserve a place in the generic framing format specification. Obviously, forking is always possible, and even recommended, to support specific features.
- Allow multi-threaded compression and decompression
Multi-threaded compression is easily achievable thanks to the division of input data into "blocks".
Multi-threaded decoding is also possible if those blocks are "independent".
Both modes shall be possible, and selectable.
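A minimal sketch of the idea, using zlib as a stand-in codec (not LZ4) and a thread pool from the standard library. The function names and the block size are arbitrary choices for the example. Since each block is compressed on its own, blocks are independent and can be handed to any worker, in both directions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def split_blocks(data: bytes, block_size: int):
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def compress_blocks(data: bytes, block_size: int, workers: int = 4):
    """Compress every block independently; map() preserves block order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, split_blocks(data, block_size)))

def decompress_blocks(blocks, workers: int = 4):
    """Independent blocks can be decoded in parallel, then joined in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))

data = b"some repetitive payload " * 50_000
blocks = compress_blocks(data, block_size=256 * 1024)
assert decompress_blocks(blocks) == data
```

If blocks were allowed to reference previous ones, the compression side could still be split this way, but decoding would have to proceed in sequence.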
- Variable block sizes
This one is tricky : up to now, we have been talking about "fixed size" blocks only, with only the last block of a compressed data set having an unknown (but capped) size.
The idea here would be to authorize blocks of arbitrary size, instead of fixed ones.
The benefits are two-fold :
- Separate data on "natural boundaries", in order to improve compression ratio and speed
- Allow data insertion of any size
The first point is simple to argue with : such benefit only occurs with very-high ratio (and slow) compression algorithms, such as CM, which "learn" the data behavior through statistics. There is no tangible benefit in trying to do the same for LZ4.
The second benefit is more interesting, since it authorizes some flexibility in archive management.
Actually, this is mitigated by the possibility to concatenate compressed data sets (or "members") together in a single stream or file.
Inserting data could therefore be realized by cutting the initial member into 2 parts, inserting the new member, and concatenating the 3 members together.
As a consequence, it seems the format already supports such a scenario, without needing variable block sizes.
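The same mechanism can be observed with gzip, another format whose members can be concatenated. The sketch below uses it purely as a stand-in, since the LZ4 framing format is not specified yet; the sample strings are invented.

```python
import gzip

# The initial data set, already cut into two members at the insertion point
head = gzip.compress(b"first half, ")
tail = gzip.compress(b"second half")

# The new data becomes a member of its own, of any size
inserted = gzip.compress(b"[inserted data] ")

# Concatenating the three members produces a valid stream
stream = head + inserted + tail
print(gzip.decompress(stream))
```

The decoder simply processes the members one after another, and the inserted member does not need to match the block size of its neighbours.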
- Partial extraction and Quick Jump Table
Another potential idea is that, within a member, one could need to only extract a specific portion.
It's always possible to decode the full data set and get to the required location, but sometimes this might be overkill. For example, one may need a few MB of data which happen to be a few GB away from the starting point.
However, the idea to decode just the necessary part introduces a few restrictions :
- First, the input media should be seekable. It makes little sense to partially decode a piped stream, since the decoding process is likely faster than the pipe itself.
- Second, the compressed data shall be cut into independent blocks. Otherwise, it would be necessary to decode, and therefore read, all previous blocks
- Third, to avoid decoding "too much data", the blocks shall be small enough, with a corresponding impact on compression ratio (the smaller the block, the lower the compression ratio).
- Fourth, since the i/o process is likely slower than LZ4 decoding, there is a benefit only if it is possible to quick-jump to the right location immediately.
This can be achieved thanks to a table at the beginning of the compressed file. Such a table can only be filled after compression, and therefore is incompatible with non-seekable output.
- Fifth, such a "table" mechanism at member level would be useless in member-appending scenarios.
That is quite a lot of restrictions, for the benefit of a hardly-requested feature.
So probably this capability shall be left to a dedicated framing format.
Moreover, should the input stream be seekable, it's still possible to "hop" over blocks without reading/decoding them. This is slower than a direct jump, but still a sensible potential speed improvement.
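Here is a small sketch of such "hopping", assuming fixed-size independent blocks, each prefixed by its compressed size. The 4-byte little-endian size field and all names are assumptions of the example, and the payloads are placeholders rather than real compressed data.

```python
import io
import struct

SIZE_FIELD = struct.Struct("<I")   # assumption: 4-byte little-endian size prefix

def locate(offset: int, block_size: int):
    """Map an uncompressed offset to (block index, offset inside the block)."""
    return divmod(offset, block_size)

def seek_to_block(stream, block_index: int) -> int:
    """Hop over `block_index` blocks and return the size of the target block.
    The stream must be positioned on the size field of the first block."""
    for _ in range(block_index):
        (csize,) = SIZE_FIELD.unpack(stream.read(SIZE_FIELD.size))
        stream.seek(csize, io.SEEK_CUR)   # skip the payload without reading it
    (csize,) = SIZE_FIELD.unpack(stream.read(SIZE_FIELD.size))
    return csize

payloads = [b"aaaa", b"bb", b"cccccc"]
stream = io.BytesIO(b"".join(SIZE_FIELD.pack(len(p)) + p for p in payloads))

index, inner = locate(offset=2 * 4096 + 10, block_size=4096)
size = seek_to_block(stream, index)
print(index, inner, stream.read(size))
```

Only the size fields are read along the way, so the cost is one small read and one seek per skipped block, instead of reading and decoding everything before the target.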
- Error detection algorithm
As a quick follow up of selecting-checksum-algorithm, one could note that I had not specified a preferred checksum algorithm, only a preferred checksum size (32-bits).
Although at this stage I'm somewhat inclined to use xxhash-strong, due to its speed and very good distribution property, there is still a chance that the algorithm might be found unsuitable at a later stage. Therefore, some provision should be left to allow another algorithm to take over later on if need be.
Pushing the idea a bit further, one could think "let the user select its own checksum algorithm". While the idea may sound nice, it goes against the principle of interoperability, which is exactly what this framing format tries to achieve. Therefore, only clearly defined checksum algorithms shall be allowed.
I believe this post went through most foreseeable requirements for the LZ4 framing format.
So now seems a reasonable time to start a header specification.