RealTime Data Compression: Fusion searches - segments & sequences fully integrated

Thursday, December 2, 2010

Fusion searches - segments & sequences fully integrated

Finally, after spending quite some time at testing a few different directions (more on this in later notes), i finally had enough time and focus to generalize Fusion for any stream of any character.
I opted for the easy way, by extending the table listing existing segments within range. I had another scheme in mind, with the advantage of being memory-less, albeit at unknown performance cost. Maybe for another version.

Nevertheless. I end up with some usable results, and they are quite promising.
As a usual "raw" efficiency measure, i counted the number of comparisons here below :

MMC Segment Fusion Improvement
Calgary 64K Searches 20.3M 6.55M 5.54M 18%
Firefox 64K Searches 70.3M 47.1M 41.9M 12%
Enwik8 64K Searches 188M 187M 187M 0%

Calgary 512K Searches 34.2M 14.6M 10.9M 34%
Firefox 512K Searches 153M 121M 92.1M 31%
Enwik8 512K Searches 435M 435M 434M 0%

Calgary 4M Searches 38.7M 18.5M 14.2M 30%
Firefox 4M Searches 271M 308M 182M 69%
Enwik8 4M Searches 753M 758M 751M 1%

Now, this is satisfactory. Note how Firefox greatly benefits from fusion search support for "non-zero" bytes. Calgary, which mostly contains streams of zero, achieves a little gain compared to Fusion0. Enwik8 hardly contains any stream at all, and therefore benefits the least.

But the tendency is what matters here. The larger the search window, the better the benefit. Firefox is quite impressive at 4M, keeping in consideration that about 65M comparisons (more than one third of total) are just collisions.

And finally, results are now always better than with good old MMC alone, whatever the file and search window size. Objective reached.

RealTime Data Compression

Thursday, December 2, 2010

Fusion searches - segments & sequences fully integrated

No comments:

Post a Comment