Monday, November 15, 2010

Unramping benefits

One of the pretty nice effects of segment detection, and one of the main reason for the speed improvements behind latest LZ4HC versions, is that it translates into so many less comparisons to do.
Or does it ?
To verify this claim, i've measured the number of comparisons for different search algorithms (Simple Hach Chain, MMC, and MMC + Segment detection) and for different search windows.
And here are the results
(btw, no way to build simple tables ? that's plain crazy)

                      Hash   MMC  Segment  Improvement 
Calgary 64K Searches  71,6M 20,3M 6,55M      x3.10
Firefox 64K Searches  261M  70,3M 47,1M      x1.49
Enwik8  64K Searches  355M  188M  187M       x1.01

Calgary 512K Searches 300M  34,2M 14,6M      x2.34
Firefox 512K Searches 1,26G 153M  121M       x1.26
Enwik8  512K Searches 1,89G 435M  435M       x1.00

Calgary 4M Searches   368M  38,7M 18,5M      x2.09
Firefox 4M Searches   4,15G 271M  308M       x0.88
Enwik8  4M Searches   8,99G 753M  758M       x0.99

Tendancy is clear : the larger the search window, the less benefits segments brings. Indeed, it even ends up with a net loss for Firefox, at a mere 4M window. Let's imagine for 16M....
And it's pretty normal : segment search algorithm is much closer from simple Hash chain than from MMC, as there is no gradual elimination of positions.

To counter this effect, the best would be a perfect merge between the two search algorithms. They currently work together nicely, but they are still completely separated.

In ideal scenario, segment search would provide the equivalent of "guaranteed prefix", and then forward result to MMC algorithm.
Unfortunately, it's not possible with current data structure, as data pointers for segments would risk being erased in the process.
To counter this effect, i would need an additionnal buffer, storing information i don't want to lose with MMC. Something complex to optimize properly.


  1. What is segment search?

    1. Segment search, in this context, look for "repeated characters", typically a lot of '0'.
      Such situation translate into heavy load (i.e. slow down) for classic match chain and binary tree searches.