If there is a situation where data is inherently short-lived, it is communication. Data starts its live on the sender side, and end it on the receiving side, a few milliseconds later.
Or does it ? Sometimes, data comes from a file into a local storage, or can be stored at the receiving side. In such case, data is merely "traveling", but is not "short-lived".
Does it make any difference ? In fact, yes, it does.
When it comes to sending a file content, this data can be "prepared" in advance. Which means it can be compressed ahead of sending it. Very strong (asymmetric) algorithms can be used for compression, as long as decoding remains "fast enough" to cope with data speed. This leads to major bandwidth reduction, and therefore improve cost and perceived transmission speed.
When it comes to sending "short-lived" data, it means this data did not exist before being produced, and the sole purpose of this data existence is to be sent, and (generally) consumed on receiving end. There is no way to "prepare" such data in advance, it must be compressed "on the fly", which means "fast".
But there is another terrible side effect : compression performance primarily comes from its capacity to "learn patterns", and re-apply them in an optimal way. Which means, for compression to be effective, a minimum of "historic data" must have already been processed for the next data to be properly compressed. With a file content, the history can be the entire file itself, which could mean a lot of megabytes, and therefore excellent perspectives for compression.
The situation is much more severe when data is generated and compressed "on the fly" : maybe the message to be sent is only a few tens of bytes long. How to compress such a thing ?
Let's study this use case.
A first "naive" implementation would simply create a message, make a packet out of it, compress it and then send it.
This implementation is unlikely to bring tangible benefits, since IP packets are typically small, trying to match MTU in order to avoid fragmentation side-effects.
A second, more compression-friendly, implementation, could try to amass enough information before starting to compress it, and then send the compressed data using as many packets as necessary.
This will certainly bring better compression performance, but introduces another problem, latency. Waiting for "enough data to be compressed" can lead to unacceptable delays.
For example, in real-time games, player command must be sent basically a.s.a.p.
As another use case, some systems may generate little data (a temperature probe for example), separated by long cycle duration.
Therefore, waiting for "enough data" is not a valid strategy in such circumstances.
A third, more complex, strategy, would use all past transmitted data as a kind of "dictionary", to help compress the next packet to come.
This basically requires the dictionary to remain "synchronized" at both end, sender and receiver. This is achievable in an "ideal" environment (no loss, no error), which is quite common in fact when using TCP transmission.
So, to sum up, we have some freshly generated data to send, of any size but typically small (a few hundreds of bytes), and we want to use all previously transmitted data as dictionary to improve compression, which requires some kind of "memory" at both sender and receiver end.
This looks possible.
In fact, this is a direct usage of "variable block sizes" concept which i expressly ruled out as "not useful" in an earlier blog note :). Now seems a good time to consider it again...
Such implementation would however require some new functions, able to re-use and save some "history data", instead of starting from freshly clean tables. This will require quite some work to achieve.
As a side effect of such methodology, it also means that such compressed packet are not compatible with stateless protocols : since they depend on previously sent data, they are inherently stateful. But so are TCP sessions anyway...
Do you mean to create a protocol-agnostic compression layer that could be plugged at both ends of a TCP connection (eg. as GZipOutputStream/GZipInputStream allows to perform in java)?
ReplyDeleteWhen the protocol is known in advance, some systems use a "data dictionary" giving hints on how messages can be compressed most effectively. For example, you can check out the FAST protocol
http://en.wikipedia.org/wiki/FAST_protocol
The Masked Cucumber :)
Yes, i think it is a good description.
ReplyDeleteI'm not too familiar with Java's GZipStreams, so cannot comment on the comparison.
Regarding FAST : i don't know this protocol, and it seems complex enough to not provide an evaluation overnight. However, the method you seem to imply is a kind of "static but selectable dictionary", specially optimized for each protocol.
This is a valid way to improve compression. And moreover, it is compatible with "stateless request" properties, which may be especially suitable for scalability.
However, it is also "protocol dependent", and cannot take advantage of values similarities. This first point is the most limiting one.
For specific applications though, such as banking transactions, it may not be a problem at all.
These guys do this --> http://research.microsoft.com/en-us/projects/coconet/
ReplyDeleteGood point. This is pretty similar.
ReplyDeleteThis comment has been removed by the author.
ReplyDelete(Sorry I'm not a French speaker, I think I deleted my post.)
DeleteNice article. The dictionary synchronization is interesting. A paper titled ("EndRE an end-system redundancy elimination service for enterprises") implemented an end-to-end network deduplication layer and touches on dictionary synchronization a bit.
Another interesting idea for unordered decompression is "Non-Linear Compression" (https://www.usenix.org/conference/hotstorage12/non-linear-compression-gzip-me-not), though it's just an idea and suffers from performance challenges.
Thanks for pointing towards these documents. They are very interesting to read and watch.
Delete