Wednesday, May 30, 2012
Compressed data transmission
Or does it ? Sometimes, data comes from a file into a local storage, or can be stored at the receiving side. In such case, data is merely "traveling", but is not "short-lived".
Does it make any difference ? In fact, yes, it does.
When it comes to sending a file content, this data can be "prepared" in advance. Which means it can be compressed ahead of sending it. Very strong (asymmetric) algorithms can be used for compression, as long as decoding remains "fast enough" to cope with data speed. This leads to major bandwidth reduction, and therefore improve cost and perceived transmission speed.
When it comes to sending "short-lived" data, it means this data did not exist before being produced, and the sole purpose of this data existence is to be sent, and (generally) consumed on receiving end. There is no way to "prepare" such data in advance, it must be compressed "on the fly", which means "fast".
But there is another terrible side effect : compression performance primarily comes from its capacity to "learn patterns", and re-apply them in an optimal way. Which means, for compression to be effective, a minimum of "historic data" must have already been processed for the next data to be properly compressed. With a file content, the history can be the entire file itself, which could mean a lot of megabytes, and therefore excellent perspectives for compression.
The situation is much more severe when data is generated and compressed "on the fly" : maybe the message to be sent is only a few tens of bytes long. How to compress such a thing ?
Let's study this use case.
A first "naive" implementation would simply create a message, make a packet out of it, compress it and then send it.
This implementation is unlikely to bring tangible benefits, since IP packets are typically small, trying to match MTU in order to avoid fragmentation side-effects.
A second, more compression-friendly, implementation, could try to amass enough information before starting to compress it, and then send the compressed data using as many packets as necessary.
This will certainly bring better compression performance, but introduces another problem, latency. Waiting for "enough data to be compressed" can lead to unacceptable delays.
For example, in real-time games, player command must be sent basically a.s.a.p.
As another use case, some systems may generate little data (a temperature probe for example), separated by long cycle duration.
Therefore, waiting for "enough data" is not a valid strategy in such circumstances.
A third, more complex, strategy, would use all past transmitted data as a kind of "dictionary", to help compress the next packet to come.
This basically requires the dictionary to remain "synchronized" at both end, sender and receiver. This is achievable in an "ideal" environment (no loss, no error), which is quite common in fact when using TCP transmission.
So, to sum up, we have some freshly generated data to send, of any size but typically small (a few hundreds of bytes), and we want to use all previously transmitted data as dictionary to improve compression, which requires some kind of "memory" at both sender and receiver end.
This looks possible.
In fact, this is a direct usage of "variable block sizes" concept which i expressly ruled out as "not useful" in an earlier blog note :). Now seems a good time to consider it again...
Such implementation would however require some new functions, able to re-use and save some "history data", instead of starting from freshly clean tables. This will require quite some work to achieve.
As a side effect of such methodology, it also means that such compressed packet are not compatible with stateless protocols : since they depend on previously sent data, they are inherently stateful. But so are TCP sessions anyway...