Saturday, October 22, 2011

Progressive Hash Series : a new method to find repetitions

 I've been investigating an old idea which was occupying my mind, in order to create a fast approximative match finder for long range searches. Although it took some time to "think" the problem correctly, in the end, the solution looks surprisingly simple.

The idea starts from a familiar structure, a Hash Table.
In order to create a fast look-up algorithm, it is just needed to hash the "minmatch" initial sequence, and store the position into the table. Later on, when checking for the same sequence, we'll find it in its cell.

OK, that's fine, but even without collisions, it does only provide us with a single answer, which is the closest sequence starting with minmatch bytes. But maybe, somewhere else farther, there is a better, i.e. longer match.

Looking at these other possibilities can be done in several ways. A simple method consists in linking all positions sharing the same hash value into a list. It's called Hash Chain, and it works fairly well. But if the sequence is really redundant, the number of positions searched can become terribly high, and with increased depth of search, it becomes prohibitive.

Hence the simple idea : in order to avoid waiting forever, the search is stopped after a number of attempts. The higher the number, the better the result. The lower the number, the faster the search. This is the trade-off.
Basically, it means that for large distance search, there is no shame in making a "partial search" in order to achieve acceptable speed.

OK, so since the number of searches can be arbitrarily limited, why not storing them in a table in the first place ? The row number is given by the hash, and all corresponding positions are orderly stored into the row. This is much faster to run, since there are less memory jumps, and easier to setup.

This method is not so new, and has been especially championed in the early 90's by Ross Williams, for its compressor LZRW-3a. Therefore we'll call it LZRW.
LZRW structure has in my opinion 2 advantages : one is controlled speed, through the size of rows (=Nb of attempts), and the other one is controlled memory, through the selection of table size.

This latest property is critical : most "full search" methods require a huge amount of memory to work properly. By "huge", i mean something in the range of 10x (with the exception of Hash Chains, but they are too slow for long distance searches anyway). So you want to search within a 256MB dictionary ? You need more than 2.5GB of memory. Farewell 32 bits systems.

One has to wonder : is it the right method ? Such amount of memory is simply prohibitive. Maybe by accepting a "less perfect" search but using less memory, we may nonetheless achieve a better compression rate thanks to the use of longer search distances. This is a position defended by Bulat Ziganshin, for its compressor FreeArc. For this scenario, LZRW comes as an obvious candidate : you can for example setup a 256MB search table, and use it to look into a 1GB dictionary. Now, long distance searches look affordable !

OK, so LZRW works, but there is no miracle : the longer the search, the more time it costs. In the end, the return on investment can be quite low, and with large number of searches (say for example 256), the search becomes so slow as becoming unpractical.
This is to be expected, all that is guaranteed by the table is that all elements share the same row, hence the same Hash Value. But that's it. So after finding an element with minmatch common bytes, we'll test another, and another, and so on. This is wasteful. What we want after finding minmatch bytes is to find another position with at least minmatch+1 common bytes.

In order to avoid testing each and every position, several methods are possible. One is to build a Binary Tree on top of the row. It is very efficient. A particularly good implementation of this idea was made by Christian Martelock for its RZM compressor (now discontinued). Unfortunately, it also means even more memory, consumed for the BT structure. And in the end, it is not so much different from a full BT structure over the entire search window, except that we lose full search capability.

OK i'm looking for something simpler. After checking for a minmatch candidate, i want to look for a minmatch+1 one, straight away. No search nor wandering.
This can be done quite simply : just hash the "minmatch+1" sequence, and store its position directly into an hash Table.
OK, so there are several tables you say, one per sequence size ?
Well, why ? No, a single one.

Just share the Hash Table among all the sequences. The simple idea here, is that a given position should be stored only once in the table, either with the hash of its "minmatch" first bytes, or "minmatch +1", or "minmatch+2", well whatever.
So, which sequence size should be stored ? Well, just start with the "minmatch" one. When a new equivalent sequence gets into the same table cell, we just move the old position at its "minmatch+1" hash, replacing its previous slot with the new position. And next time a "minmatch+1" sequence is found, we move the old sequence to "minmatch+2" and so on. So the position will move into the table, up to the point where it is considered "dead", either off limit, or because of elimination rule.

We obviously have to take into consideration the natural occurrence of collisions into this search structure. And it can happen between any sequence size. For example, a newer "minmatch" sequence could be willing to occupy the same slot as a different "minmatch+2" sequence. Well, one has to move on, or be eliminated.
That's where different rules can be applied. The most basic one is that the youngest position always win. It's good enough, and i recommend it. But it can also be mixed with some more complex heuristic, to avoid losing long-distance anchors too fast. Keeping in mind that this structure is created precisely to afford a partial search over a long distance, when there is not enough memory to keep one entry per position.

So we've got our basic idea here, a cross-over between cuckoo hashing and progressively longer hashed sequence. Hence the proposed name, Progressive Hash Series.

Now is time to put this idea into practice. I've been creating a simple long-range match finder based on greedy matching strategy. It's a simple and fast way to compare above methods. The file tested is enwik9, used into Matt's Mahoney LTCB. This file is very redundant, and particularly intensive for the match finder. LZRW and PHS are compared on their capacity to find longer matches (which translates into better compression), and on their respective speeds. Here are the results :

So what do we learn from these figures ?
First, PHS looks more efficient than LZRW. For an identical amount of memory, it converges faster towards optimal match length. This is particularly telling for the 4-attempts version, which is about has efficient as a 80-attempts LZRW.

Note however that for a given number of attempts, PHS is more complex, and requires more time. For example, the 4-attempts PHS is approximately the same speed as a 14-attempts LZRW. So, in the end, we've got the compression power of a 80-attempts LZRW for the cost of a 14-attempts one. It is still a gain.

This is my first review of this new method, and i guess we have only scratched the surface of it. Further refinements are to be expected ahead.
I have not found yet this idea described somewhere on the net. It is not even mentioned in the thorough Match Finder review by Charles Bloom. So, maybe, after all, this idea might be new.

[Edit] As specified by Charles Bloom, merging several hashes of different lengthes into the same table is not new (see comments). However, the update mechanism presented here might be.