After reading a short rant from Charles Bloom, i spent some time reading and studying cuckoo hashing, a relatively new hash methodology, which gets its name from the bird : as you may know, a cuckoo is genetically programmed to hatch into another bird's nest, and when it gets out of its shell, its first act in life is to push other eggs and little baby birds out of the nest, ensuring all future food for its own benefit.
Cuckoo hashing share the same behavior :
Assuming you have 2 hash functions, with relatively low probabilities to produce the same hash from different sequences, you start by hashing your sequence with first hash. If the slot is already occupied, you take it. The previous slot occupant is then re-hashed, using the second hash function. It therefore finds another place for its use. However, if the second place is already taken, then the process starts again : the occupants gets removed, then re-hashed using first hash function, and so on.
In the end, there is a pretty high chance that, with enough shuffling, all elements will find their own place.
Searching into a cuckoo hash table is pretty straightforward : you calculate both hashes, using the 2 hash functions, and then check both positions for sequence presence. It can't be anywhere else, so you're sure to find it in maximum 2 probes if it exists.
This looks interesting, but there is a catch : you can get into a situation where there are infinite loops, such as in the picture W & H, which point to each other.
To counter this risk, you have to respect a "load limit". In the original cuckoo hash implementation, this limit is pretty low, at 50%. However, since then, a simple improvement has been found, using multi-slots bucket or more hash functions. With many more choices to choose from, the load limit rocket to the roof, reaching extreme performances, as high as 99.7% (for 2 Hashes and 8 slots per bucket, for example).
So where does that lead us to ?
When reading these properties, i had one target in mind : Etincelle, my current flagship compressor. To describe its characteristics swiftly, Etincelle has compression performances in the range of the best Zip, but much much faster (calculated at 80MB/s for enwik8 on my Core 2).
Its speed is mostly due to its very fast match search algorithm, which however misses many opportunities due to its very light probing. Etincelle uses this inefficiency to its advantage, by referencing data with very little indexes. This works even at fairly large distances (you can instruct Etincelle to use a 1GB dictionary for example) but is only as efficient as the table fill-rate. I therefore use over-booked hash tables to ensure index efficiency.
Cuckoo hashes with high load limit could help here in making this scheme even more efficient.
Testing this, however, would require some fairly large code changes, to the point of changing Etincelle into something else, obviously a new compressor. The speed is also poised to be slower, has measured by Dr Askitis's hash comparison thesis.
Another potential use of Cuckoo Hash table is a more classical objective within LZ4 : as measured in a previous post, collision rate do reach inconvenient level with Firefox at 4M search depth, totalizing 60 millions comparisons, which is more than 20% of all comparisons. This is obviously a lot, in spite of MMC automatic handling of this situation, albeit in a chain-like fashion. Here also, cuckoo hash could help to bring down these collisions to zero, but at a cost. It remains to be tested if the extra cost and complexity remains worthwhile.

 
 
I used to implemented a cuckoo hash table (lost). It is indeed fast on lookup (of course), but it is very slow on insertion due to the chain of swaps.
ReplyDeleteIndeed. As suggested by other contributors, sometimes it is not worth it to go beyond a certain swapping threshold.
ReplyDeleteAnd apparently, the number of swap just rockets to the roof when a certain fill rate is achieved (and selecting the right H & T values makes a difference here).
Then there are 2 possibilities : either expand the hash table, or just drop the element.
Dropping is indeed a valid possibility when doing compression : it just results in a lesser compression rate, but nothing more serious. This is the kind of speed vs comp.rate we have to do all the time across a compression algorithm.