RealTime Data Compression: MMC last boost : Secondary sorting

Saturday, December 11, 2010

MMC last boost : Secondary sorting

Credit is due here to Piotr Tarsa, which made an abandoned idea for improving MMC achievable.

This idea is called "secondary sorting", and can be explained as follows :
When searching into a given level, MMC is currently merely interested by the character being looked for. Sometimes we find it immediately, and sometimes not. We may keep finding wrong characters many times, until we get to our goal.

What happens then to all these wrong characters we have discovered all along ?
Quite simply, nothing. They just stay there at their position in the chain, waiting for their turn to be looked for, if it ever happen.
It sounds like a pity, as, eventually, we could have used this information to sort all these positions to higher level, making future searches for these characters faster, but also "cleaning" the current level much faster.

However, this is a much more complex algorithm to implement. And not just complex, it's also an expensive one, requiring more structures, memory and tests. As a consequence, i was pretty much convinced that it would no reach the point of becoming worthwhile.

After a rich discussion at encode.ru forum, all problems gradually found their solution thanks to Piotr involvement and smart advises. We are now in a position where the algorithm looks reasonably efficient, and could therefore bring some benefits.

I started with a few statistics extraction, as usual, to better understand the situation.
To my great surprise, i discovered that, in many circumstances, some level can take a very long walk before finding the character being looked for. In worst circumstances, i've reach values in the thousands (probably a symptom of long repetitions). Even in more common situations, we can reach several tens attempts, with some "wrong" characters repeated many times.
It's therefore obvious that secondary sorting will provide a real boost at sorting this data, reducing the number of comparisons required for future searches.

Following this short study, i quickly implemented a limited version of "secondary sorting", just able to handle the last position of wrong characters, resulting in their anticipated promotion. It immediately resulted in a 12% improvement in number of comparisons.
Quite a large benefit for such a minor sorting improvement. Better secondary sorting can only improve this figure.

All looks for the better, except that, at this stage, the benefits do not offset code complexity.

At least not yet ...

RealTime Data Compression

Saturday, December 11, 2010

MMC last boost : Secondary sorting

No comments:

Post a Comment