One of the main advantages of Morphing Match Chain (MMC) method (explained here) is that you can add elements into the chain without actually sorting them. They will get sorted later on, and only if necessary.
As a consequence, many elements will never get sorted, because they are never accessed.
In a "greedy match" search algorithm, it happens very often, and the shorter the search window, the more probable elements will reach their end of life before being searched.
This strategy is key in avoiding many useless comparison operations.
Now it has a drawback : all elements are added into the "baseline" hash chain. They will get sorted later on if the hash chain is called. But my current implementation of MMC is limited to a maximum of one promotion per pass. As a consequence, all these unsorted positions tend to massively get promoted to next level together. Only elements which does not reach the minimum length are left out, but the point is : these "good" elements could have reached immediately higher level, they are artificially limited to current level + 1 due to implementation.
Willing to measure this effect, i made a simple modification to LZ4 to count the highest potential level of each element, and compare it to the current level limitation.
I was expecting some missed opportunities, in the range of 8 levels typically.
This was incorrect.
I found missed promotion opportunities in the range of hundreds level.
Granted, highest numbers are typically the result of very repetitive patterns, which can be found in binary files (and i witnessed many strange repetition length, such as prime numbers 3, 5, 7, or large one such as 120). In this case, missed opportunities can reach several thousand levels.
But outside of these (not so rare) corner cases, i still have witnessed missed opportunities in the range of a hundred levels with file such as enwik8, which does not feature any massive repetition pattern, just some english words and group of words which are more frequents.
This suggests that some sensible gain can be achieved through multi-level promotion.
Sorting elements on a multi-level scheme will not make the current search faster, but will make the next search using same hash much more efficient.
Unfortunately, this also requires complete rewriting of MMC algorithm. The basics behind the search method remain one and the same, but the algorithm needs to be completely different, to track multiple levels in parallel. Quite some work in the waiting...
No comments:
Post a Comment