As stated when MMC algorithm was first published on encode.ru board, there are alternative methods to search sequences in a dictionary, and one of the most efficient, to date, is BST.
BST stands for Binary Search Tree. There is a nice entry at wikipedia on this concept.
BST key idea is pretty simple : given an element, 2 pointers are associated to it. On the "left" pointers, we'll find others elements which are strictly smaller than current one. On the "right" one, all elements will be larger.
This requires elements to be "comparable", a rule we can easily create in this numeric world.
Now, maybe some of you played at the "guess my number" game as a kid. The idea is simply to find a hidden number, using a minimum number of questions, which answer can only be yes or no.
And the best strategy is always the same : if the "hidden number" is between 0-100, we start at 50, the middle number. Is it larger ? If yes, we continue at 75, if not at 25, and so on, dividing remaining space by 2 at each step.
This method will statistically provide the best result over several games.
The same can be applied here, using BST, over millions of value. In theory, it means we'll get to the result in approximately log2(N) attempts.
Except that, we can't be sure that remaining space is divided into 2 equal parts like in the game. Indeed, the tree could be "degenerated", for example when all elements are inserted in sorted order at root node, resulting in a tree which looks like a chain.
To avoid this situation, some complex mechanisms can be applied, to "balance" the tree, to ensure that all searches get approximately the same (and best) number of attempts before finding the solution. Unfortunately, such mechanisms tend to cost so much overhead as to kill performances for compression algorithms, since the dictionary constantly changes.
So, the question of this post : would BST be any superior to MMC ?
According to Bulat, yes, it should. Because its "compare" relation between parent and child nodes is more powerful than the "is next byte equal" question asked by MMC. In the end, it means that we extract more information from the BST question, and therefore we should converge faster.
But there is another big difference with BST : elements need to be sorted on insertion; while for MMC, they will be sorted during the search. It results in useless comparison operations for BST, since some elements will never be searched during their lifetime.
In a "greedy match" search strategy, this should play a role in favor of MMC, making it a faster alternative.
BST is also vulnerable to long streams of identical characters, since comparison requires more character to decide which one is smaller.
All things considered, BST is a very strong competitor to MMC, and potentially a better one.
So, save any new idea to further improve on MMC, i should spend some time on BST now, to better understand its behavior.
[Edit] I finally made a comparison between MMC and BST, and results are not so favorable to BST :
Maybe MMC is a much better competitor than anticipated...