RealTime Data Compression: 11/28/10

Saturday, December 4, 2010

Multiple level promotions

One of the main advantages of Morphing Match Chain (MMC) method (explained here) is that you can add elements into the chain without actually sorting them. They will get sorted later on, and only if necessary.
As a consequence, many elements will never get sorted, because they are never accessed.
In a "greedy match" search algorithm, it happens very often, and the shorter the search window, the more probable elements will reach their end of life before being searched.

This strategy is key in avoiding many useless comparison operations.

Now it has a drawback : all elements are added into the "baseline" hash chain. They will get sorted later on if the hash chain is called. But my current implementation of MMC is limited to a maximum of one promotion per pass. As a consequence, all these unsorted positions tend to massively get promoted to next level together. Only elements which does not reach the minimum length are left out, but the point is : these "good" elements could have reached immediately higher level, they are artificially limited to current level + 1 due to implementation.

Willing to measure this effect, i made a simple modification to LZ4 to count the highest potential level of each element, and compare it to the current level limitation.
I was expecting some missed opportunities, in the range of 8 levels typically.

This was incorrect.
I found missed promotion opportunities in the range of hundreds level.
Granted, highest numbers are typically the result of very repetitive patterns, which can be found in binary files (and i witnessed many strange repetition length, such as prime numbers 3, 5, 7, or large one such as 120). In this case, missed opportunities can reach several thousand levels.
But outside of these (not so rare) corner cases, i still have witnessed missed opportunities in the range of a hundred levels with file such as enwik8, which does not feature any massive repetition pattern, just some english words and group of words which are more frequents.

This suggests that some sensible gain can be achieved through multi-level promotion.
Sorting elements on a multi-level scheme will not make the current search faster, but will make the next search using same hash much more efficient.

Unfortunately, this also requires complete rewriting of MMC algorithm. The basics behind the search method remain one and the same, but the algorithm needs to be completely different, to track multiple levels in parallel. Quite some work in the waiting...

Thursday, December 2, 2010

Fusion searches - segments & sequences fully integrated

Finally, after spending quite some time at testing a few different directions (more on this in later notes), i finally had enough time and focus to generalize Fusion for any stream of any character.
I opted for the easy way, by extending the table listing existing segments within range. I had another scheme in mind, with the advantage of being memory-less, albeit at unknown performance cost. Maybe for another version.

Nevertheless. I end up with some usable results, and they are quite promising.
As a usual "raw" efficiency measure, i counted the number of comparisons here below :

MMC Segment Fusion Improvement
Calgary 64K Searches 20.3M 6.55M 5.54M 18%
Firefox 64K Searches 70.3M 47.1M 41.9M 12%
Enwik8 64K Searches 188M 187M 187M 0%

Calgary 512K Searches 34.2M 14.6M 10.9M 34%
Firefox 512K Searches 153M 121M 92.1M 31%
Enwik8 512K Searches 435M 435M 434M 0%

Calgary 4M Searches 38.7M 18.5M 14.2M 30%
Firefox 4M Searches 271M 308M 182M 69%
Enwik8 4M Searches 753M 758M 751M 1%

Now, this is satisfactory. Note how Firefox greatly benefits from fusion search support for "non-zero" bytes. Calgary, which mostly contains streams of zero, achieves a little gain compared to Fusion0. Enwik8 hardly contains any stream at all, and therefore benefits the least.

But the tendency is what matters here. The larger the search window, the better the benefit. Firefox is quite impressive at 4M, keeping in consideration that about 65M comparisons (more than one third of total) are just collisions.

And finally, results are now always better than with good old MMC alone, whatever the file and search window size. Objective reached.