Featured

Sunday, December 12, 2010

BST : First experimental results

Still on my quest for better search algorithms, i tried a simple BST (Binary Search Tree) code, in order to compare it with MMC. As stated in an earlier post, BST should be faster than MMC, at least in its current format.
Reality is a bit more complex.

My current BST implementations uses a Hash Table acceleration, using same size and calculation as MMC. For a fair comparison, i've disabled Segment detection from MMC.

The results are somewhat mixed and require some explanations.
First, i tried BST with partial updates. It means that positions inside matches are not taken in consideration. This reduces compression rate, but also considerably improve speed, since repetitions are unlikely to be found very often in the dictionary.

Here are the results. As usual, i'm counting the number of comparisons necessary to complete the file.

Partial Update Test
File        BST      MMC
Calgary    0.98M    1.73M 
Firefox    7.71M    13.1M
Enwik8     33.4M    62.4M

The results are pretty clear : in partial update mode BST wins. With much less comparisons to do, BST is much faster, between 20% and 50% depending on file types. Speeds in the range of 80MB/s to 110MB/s are accessible.

Now, partial update is not really my objective. I want a full search, over the entire search window. This is much more difficult, with a lot more elements to integrate, longer searches, and repetitions putting an heavy toll on any search structure.
Here are the results i started with :

Full Update Test
File        BST      MMC
Calgary    1.46M    5.81M 
Firefox    11.3M    41.1M
Enwik8     47.0M    167.M

Looks like a clear win for BST ? Alas, there is more to it.
From a speed perspective, observations do not match these results. Indeed, MMC is much faster than BST, sometimes just 2x, sometimes up to 10x; there is a massive difference, and it is clearly in favor of MMC.

So where does it comes from ? Code complexity ?

No, remember one of the fundamental properties of MMC : data is sorted only if necessary, and during search. In contrast, BST data is sorted on insertion, which means that all bytes added to the tree produce comparisons. These comparisons must be added to total. And here are the surprising results :

Full Update Test
File       BST.Search BST.Update MMC.Search MMC.Update
Calgary    1.46M      382.M      5.81M      0
Firefox    11.3M      518.M      41.1M      0
Enwik8     47.0M      312.M      167.M      0

Ouch, this hurts. Indeed, many more comparisons are generated while adding positions into the tree than while searching a match. The difference is really huge, and clearly play in favor of MMC.
One can see the pretty large impact of repetitions on Calgary and Firefox, which makes MMC stand very far ahead (even though repetition detection is disabled !).
But even if we just look at Enwik8, where repetitions are nearly non-existent, the update cost is still too large compared with MMC.

Speed results are directly affected : MMC is 2.5x faster for Enwik8, 5x faster for Firefox, and 7x faster for Calgary. Quite a large lead.
Obviously, adding some optimizations for repetitions should help BST. However, as the results with Enwik8 prove, it will not be enough to beat MMC.

Maybe MMC is much more a contender than i initially thought...

Note that these results are using Greedy parsing strategy. The conclusions are likely to be very different if going for Optimal parsing strategy.

No comments:

Post a Comment