All compressors that you can download on my site use a pretty simple logic :
At each position, we try to find a match (the best possible one for LZ4 HC, or just a good enough one for all the others). If we find one, we encode a match, then move forward by the match length, and try again.
This is known as "greedy matching". We just use a match as soon as we get an opportunity for it.
However, even when making a full search for the longest possible match, this does not produce the best possible output. It's quite possible that, by using the match at the current position, we miss a better one at the next position.
Need an example ?
Just consider the following sentence :
(...) a roaming attempt (...) really necessary. (...) a really necessary (...)
On reaching the second "a", the match finder will find a 3-character match : "a_r".
Reaching the final "y" will therefore require 2 match sequences : "a_r", then "eally necessary".
However, should we wait for the next position, we would emit the "a" as a literal.
And then, all the rest of the sentence would be described as a single match.
This is cheaper. Considering the LZ4 format (that's just an example), a match sequence costs 3 bytes, and a literal costs 1 byte. This gives us :
Don't wait (greedy) : 3 + 3 = 6 Bytes
Wait one byte : 1 (literal) + 3 (match) = 4 Bytes.
So we would lose 2 bytes with greedy matching.
There are a number of ways to improve on this situation. Let's start with the simplest one : lazy matching.
The idea is very simple : on finding a match, we keep it in memory and try to find another and better match at the next position.
If we don't find a better match, then we use the first match.
If we do find a better match, we use this one instead.
The idea can also be applied recursively; that means that if we find a better match, we continue checking the next position, and so on.
This idea is known as "lazy matching", and has been used for several years in LZ compression programs.
However, it obviously requires more searches, and is therefore slower.
This is a typical trade-off that each compression program must choose.
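To make the trade-off more concrete, here is a minimal sketch of such a lazy-matching loop, written in C. The find_match(), emit_literal() and emit_match() helpers are hypothetical placeholders; a real encoder also has to handle the window limit, the minimum match length, the end of input, and so on :

```c
/* Minimal sketch of greedy vs lazy match selection.
 * find_match(), emit_literal() and emit_match() are hypothetical
 * placeholders, not code from any of the compressors mentioned here. */
typedef struct { int length; int offset; } Match;

extern Match find_match(const unsigned char *data, int pos, int size);
extern void  emit_literal(unsigned char c);
extern void  emit_match(Match m);

static void compress_lazy(const unsigned char *data, int size)
{
    int pos = 0;
    while (pos < size) {
        Match cur = find_match(data, pos, size);
        if (cur.length < 3) {               /* no useful match here */
            emit_literal(data[pos++]);
            continue;
        }
        /* Lazy step : peek at the next position; if it holds a longer
         * match, emit the current byte as a literal and defer the
         * decision (greedy would take 'cur' right away). */
        while (pos + 1 < size) {
            Match next = find_match(data, pos + 1, size);
            if (next.length <= cur.length) break;
            emit_literal(data[pos++]);
            cur = next;
        }
        emit_match(cur);
        pos += cur.length;
    }
}
```

Greedy matching is the same loop with the inner "peek" step removed.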
For maximum compression strategy, advanced parsing is a necessity.
Saturday, December 18, 2010
Thursday, December 16, 2010
MMC : Secondary Promotion : Experimental results
I spent quite some time working on Secondary promotion lately. The algorithm was particularly complex, and ensuring that all corner cases were correctly addressed required many debug iterations.
In the end, I finally completed a new version of SecFull, an aggressive implementation which tries to promote all characters it meets during its travel across the different levels. The changes required the capability to "move back" by one level, since in some cases chains were broken and needed repair.
A much simpler version was produced in the process, called Sec1. It just tries to promote the last seen character, and only when a valid gateway is confirmed. This ensures that the changes are minimal and well located, keeping all the rest of the code unmodified.
The complexity difference between Sec1 and SecFull is quite large, so it is not obvious which one will bring the better benefit.
Anyway, it is now possible to compare. All search algorithms generate exactly the same references, resulting in the same output. Therefore, all changes affect only the speed of searches.
As usual, let's start with a simple count of comparisons :
Nb of comparisons - Window Search 64K :
File | NoSec | Sec1 | SecFull
Calgary | 4.87M | 4.31M | 3.70M
Firefox | 35.2M | 31.2M | 25.7M
Enwik8 | 166M | 139M | 121M

Window Search 512K :
File | NoSec | Sec1 | SecFull
Calgary | 9.31M | 7.94M | 6.32M
Firefox | 78.2M | 67.3M | 52.9M
Enwik8 | 370M | 285M | 238M

Window Search 4M :
File | NoSec | Sec1 | SecFull
Calgary | 12.0M | 10.2M | 7.95M
Firefox | 161M | 141M | 114M
Enwik8 | 619M | 472M | 357M
The trend is pretty clear : the more aggressively we promote, the fewer comparisons are necessary to complete a full search. And the difference becomes more and more pronounced as the size of the search buffer increases.
That's for comparisons. Now, what about speed ?
Surely, the fewer the comparisons, the better the speed ?
Well, yes, approximately; but we also have to take code complexity into consideration. If each search is "more costly", then the total may end up being slower.
As can be guessed, Sec1 is more complex than NoSec, and SecFull is the most complex of all. So, does it pay off ? Here are some results :
Speed - Window Search 64K :
File | NoSec | Sec1 | SecFull
Calgary | 32.8 MB/s | 31.5 MB/s | 32.0 MB/s
Firefox | 40.5 MB/s | 38.9 MB/s | 40.0 MB/s
Enwik8 | 30.8 MB/s | 29.9 MB/s | 30.1 MB/s

Window Search 512K :
File | NoSec | Sec1 | SecFull
Calgary | 16.6 MB/s | 16.9 MB/s | 17.9 MB/s
Firefox | 19.4 MB/s | 18.6 MB/s | 19.9 MB/s
Enwik8 | 13.0 MB/s | 13.2 MB/s | 14.7 MB/s

Window Search 4M :
File | NoSec | Sec1 | SecFull
Calgary | 6.5 MB/s | 6.8 MB/s | 8.1 MB/s
Firefox | 2.3 MB/s | 2.4 MB/s | 2.7 MB/s
Enwik8 | 1.9 MB/s | 2.4 MB/s | 3.0 MB/s
Time for some serious thinking.
Even with fewer comparisons, the new algorithms are not guaranteed to be faster. This is especially true at 64K, where all Secondary algorithms fail to produce any gain, even though the number of comparisons is well reduced.
We need a larger search buffer to make Secondary promotions worthwhile. This threshold is reached as early as 512K, and the benefit then grows with search size.
Actually, SecFull seems to win over Sec1 in each and every circumstance, so it can be preferred.
But to be fair, the results are a bit disappointing. The very large difference in comparison counts would lead us to believe that the benefit could be larger. But in the end, it is still relatively mild.
No ground-breaking progress this time...
Wednesday, December 15, 2010
RZM
Once upon a time...
There was one of those incredibly talented programmers, who could produce in a short time what would take years for amateurs like myself. During his short presence in the compression scene, he produced several excellent programs, featuring compression/speed ratios beyond anything available back then.
Among these little jewels was RZM, an LZ-class compressor. Right from the first release, it became a serious competitor to 7-zip, the best reference LZ implementation so far. This was achieved without using any external filter, nor any strange trick, just plain LZ.
Or was it ? Christian Martelock, the author, described its algorithm as being a "dull ROLZ order-1" compression engine, with 64K positions per context.
This description was close enough to Etincelle to deserve a try.
Etincelle is also an order-1 ROLZ; however, it gets its speed thanks to a quick scan into the dictionary. This is in contrast with RZM, which makes a full search.
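As a side note, here is a conceptual sketch of what an order-1 ROLZ position table looks like; this is only an illustration of the idea, not Etincelle's nor RZM's actual data structures :

```c
/* Conceptual sketch of an order-1 ROLZ position table.
 * Candidate match positions are stored per "previous byte" context,
 * so a match can be coded as a small index within that context
 * instead of a raw offset into the whole window. */
#include <stdint.h>
#include <stdlib.h>

#define ROLZ_POSITIONS 65536   /* "64K positions per context", as in RZM's description */

typedef struct {
    uint32_t (*pos)[ROLZ_POSITIONS];   /* pos[context][slot] : ring buffer of positions */
    uint32_t head[256];                /* next insertion slot for each context */
} RolzTable;

static RolzTable *rolz_create(void)
{
    RolzTable *t = calloc(1, sizeof(RolzTable));
    if (t) t->pos = calloc(256, sizeof(*t->pos));   /* 256 contexts x 64K positions */
    return t;
}

/* Register 'position' under the context of its previous byte (requires position > 0). */
static void rolz_insert(RolzTable *t, const unsigned char *buffer, uint32_t position)
{
    unsigned ctx = buffer[position - 1];
    t->pos[ctx][t->head[ctx] % ROLZ_POSITIONS] = position;
    t->head[ctx]++;
}
```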
It took only a little time to implement full search into Etincelle and get some first results.
RZM is approximately 30% more powerful than Etincelle, so I was wondering whether such a result was readily reachable.
It was not, obviously. With full search, Etincelle features a 15% improvement, same as LZ4.
So where could the other 15% come from ?
Well, for a start, Christian clearly states that he uses Optimal Parsing to get this result. From experience, Optimal Parsing provides a fair boost to compression ratio, but not a huge one. 3-5% is a typical improvement to be expected.
Another, more obvious, modification would be to move away from the Huffman entropy coder and exchange it for a Range coder. With its improved precision, this would help quite a bit, especially on literal occurrences. I suspect an improvement in the range of 2-3%.
Now, that still leaves us with something like 8% to account for. And this part is not so clearly explained.
On reading the thread on RZM at encode.ru, a repeated pattern became apparent : all encoded tokens are context-evaluated.
It's a step above my simple block-based implementations. While I'm simply separating data depending on its type, ensuring homogeneous probabilities, context-evaluation goes a few steps further : it separates probabilities based on one or several selected elements or statistics, which are considered "correlated" to the probability being calculated.
It does not need to be completely deterministic (in which case, we would get a perfect predictor). A loosely coupled correlation is good enough to squeeze the probability table in a way which makes it more accurate, therefore producing better compression.
Since we benefit more from better correlation, the choice of context is pretty important.
As an example, for literals, the previous character is usually a good indicator. Hence it makes sense to use it as a context. That's called Order-1 entropy coding.
Obviously, many more measurable elements can be selected. For example : was the previous position a literal or the end of a match, what was the length of the previous match, and so on. They can also be combined.
But the prediction will only be as good as the correlation. Therefore, mixing random elements together will prove a poor indicator.
The story becomes even harder when considering that correlation elements may change over time. This is quite normal if the data type being compressed changes, even within a single file. Tracking such changes can become a significant burden; a faster alternative is to fix the probability model, hard-wiring the correlation rules. This is necessarily less optimal, but obviously much cheaper to calculate.
Another problem in creating too many contexts is sparsity : the more numerous the contexts, the sparser the data in each of them. It therefore becomes difficult, if not dangerous, to try to extract probabilities from too small a sample set; this is the "escape estimation" problem, well known since the PPM era.
This problem is critical for hard-wired correlation rules, such as PPM. More modern techniques, with dynamic merging & splitting, can somewhat circumvent it.
Exploiting such correlation requires, as a basic minimum, a fully adaptive entropy coder, able to change its probabilities at each step. Although simple in principle, its implementation is not, and can become a serious speed bottleneck. There is, however, a specific sub-case : the binary range coder. With just two output values to take care of (1 or 0), probability updates are very simple. Hence its predominance in most, if not all, the best compression programs today.
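To illustrate the principle, here is a minimal sketch of an adaptive binary model with an order-1 context on the previous byte. This is only an assumption-based illustration, not RZM's code; the range coder consuming the probability is left out :

```c
/* Minimal sketch of an adaptive binary model with an order-1 context :
 * one probability cell per previous-byte value, nudged toward the
 * observed bit after each coding step. */
#include <stdint.h>

#define PROB_BITS  12                  /* probabilities scaled to [0, 4096) */
#define PROB_ONE   (1u << PROB_BITS)
#define ADAPT_RATE 5                   /* larger value = slower adaptation */

typedef struct {
    uint16_t p1[256];                  /* P(bit==1), indexed by previous byte */
} Order1BitModel;

static void model_init(Order1BitModel *m)
{
    for (int i = 0; i < 256; i++)
        m->p1[i] = PROB_ONE / 2;       /* start with no prior knowledge */
}

/* Returns the probability to feed to the binary range coder,
 * then updates the context toward the bit actually coded. */
static unsigned model_code_bit(Order1BitModel *m, unsigned prev_byte, int bit)
{
    unsigned p = m->p1[prev_byte];
    if (bit)
        m->p1[prev_byte] += (uint16_t)((PROB_ONE - p) >> ADAPT_RATE);
    else
        m->p1[prev_byte] -= (uint16_t)(p >> ADAPT_RATE);
    return p;
}
```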
So here we can have a look at the different steps to progress towards an RZM-like compressor :
1) move to range coder
2) work on optimal parsing
3) move to fully adaptive entropy
4) play with hard wired contexts
5) play with context selection, mixing, and creation
Quite a long road ahead...
Monday, December 13, 2010
LZ4 : New version (0.9HC)
Well, just when I thought that there was nothing else to do to improve the decompression speed of LZ4...
Then I found a corner case where LZ4 does not work so well. While running a sanity check on loong, a file created by Eugene Shelwien, initially to torture range coders with limit conditions, I discovered to my surprise that the decoding speed was not good enough, "only" 500 MB/s; unexpected, as I was expecting something much closer to a few gigabytes per second.
Looking back into the source, my understanding was that the code in charge of overlapping sequences was too cautious. With just a little more complexity, it could be made much faster.
Overlapping sequences are not too frequent, but they happen here and there, in each and every file. As a consequence, the fix resulted in a general decoding speed boost on all files.
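For illustration, here is a generic sketch (not LZ4's actual source) of why overlapping matches need special care in an LZ77-style decoder :

```c
/* Generic illustration of match copying in an LZ77-style decoder :
 * when offset < length, the source and destination regions overlap,
 * so a plain memcpy() is not allowed and a naive byte loop is the
 * safe fallback. */
#include <string.h>
#include <stddef.h>

static void copy_match(unsigned char *dst, size_t offset, size_t length)
{
    const unsigned char *src = dst - offset;

    if (offset >= length) {
        /* No overlap : a single bulk copy is fine (and fast). */
        memcpy(dst, src, length);
        return;
    }

    /* Overlap : the match reads bytes we are still writing.
     * The byte-by-byte loop reproduces the intended run expansion
     * (e.g. offset==1 duplicates the previous byte 'length' times),
     * at the cost of speed. Faster variants copy small fixed-size
     * chunks and adjust the source pointer, which is the kind of
     * added complexity hinted at above. */
    while (length--)
        *dst++ = *src++;
}
```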
As a logical follow-up, here comes the release of a new LZ4 version, breaking some new decompression speed records. You can download it on the LZ4 Homepage.
Program | Version | Compression Ratio | Compression Speed | Decoding Speed
LZ4 "Ultra Fast" | 0.6 | 2.062 | 232 MB/s | 805 MB/s
LZ4HC "High Compression" | 0.9 | 2.345 | 38.2 MB/s | 880 MB/s
Talking about bragging around the world...
Sunday, December 12, 2010
BST : First experimental results
Still on my quest for better search algorithms, I tried a simple BST (Binary Search Tree) code, in order to compare it with MMC. As stated in an earlier post, BST should be faster than MMC, at least in its current format.
Reality is a bit more complex.
My current BST implementation uses a hash table acceleration, using the same size and calculation as MMC. For a fair comparison, I've disabled Segment detection from MMC.
The results are somewhat mixed and require some explanations.
First, I tried BST with partial updates. It means that positions inside matches are not taken into consideration. This reduces the compression rate, but also considerably improves speed, since repetitions are unlikely to be found very often in the dictionary.
Here are the results. As usual, I'm counting the number of comparisons necessary to complete the file.
Partial Update Test
File | BST | MMC
Calgary | 0.98M | 1.73M
Firefox | 7.71M | 13.1M
Enwik8 | 33.4M | 62.4M
The results are pretty clear : in partial update mode, BST wins. With far fewer comparisons to do, BST is much faster, between 20% and 50% depending on file type. Speeds in the range of 80 MB/s to 110 MB/s are accessible.
Now, partial update is not really my objective. I want a full search, over the entire search window. This is much more difficult, with a lot more elements to integrate, longer searches, and repetitions putting a heavy toll on any search structure.
Here are the results I started with :
Full Update Test
File | BST | MMC
Calgary | 1.46M | 5.81M
Firefox | 11.3M | 41.1M
Enwik8 | 47.0M | 167M
Looks like a clear win for BST ? Alas, there is more to it.
From a speed perspective, observations do not match these results. Indeed, MMC is much faster than BST, sometimes just 2x, sometimes up to 10x; there is a massive difference, and it is clearly in favor of MMC.
So where does it come from ? Code complexity ?
No, remember one of the fundamental properties of MMC : data is sorted only if necessary, and during search. In contrast, BST data is sorted on insertion, which means that all bytes added to the tree produce comparisons. These comparisons must be added to the total. And here are the surprising results :
Full Update Test
File | BST.Search | BST.Update | MMC.Search | MMC.Update
Calgary | 1.46M | 382M | 5.81M | 0
Firefox | 11.3M | 518M | 41.1M | 0
Enwik8 | 47.0M | 312M | 167M | 0
Ouch, this hurts. Indeed, many more comparisons are generated while adding positions into the tree than while searching for a match. The difference is really huge, and clearly plays in favor of MMC.
One can see the pretty large impact of repetitions on Calgary and Firefox, which makes MMC stand very far ahead (even though repetition detection is disabled !).
But even if we just look at Enwik8, where repetitions are nearly non-existent, the update cost is still too large compared with MMC.
Speed results are directly affected : MMC is 2.5x faster for Enwik8, 5x faster for Firefox, and 7x faster for Calgary. Quite a large lead.
Obviously, adding some optimizations for repetitions should help BST. However, as the results with Enwik8 prove, it will not be enough to beat MMC.
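For reference, here is a minimal sketch (not my actual implementation) showing why BST insertion itself generates comparisons : each inserted position is compared against the data at existing node positions all along its path from the root :

```c
/* Minimal sketch of a match-finder BST keyed by the data at each position.
 * Note that insert_position() itself performs comparisons at every node
 * on its path : this is the "update" cost that MMC does not pay, since
 * MMC only sorts during search. */
#include <string.h>
#include <stdint.h>

#define KEY_LEN 4   /* compare only a few leading bytes (simplification) */

typedef struct Node {
    uint32_t pos;                 /* position in the input buffer */
    struct Node *left, *right;
} Node;

static unsigned long nbComparisons = 0;

static Node *insert_position(Node *root, Node *newNode,
                             const unsigned char *buffer)
{
    if (root == NULL) {
        newNode->left = newNode->right = NULL;
        return newNode;
    }
    nbComparisons++;
    if (memcmp(buffer + newNode->pos, buffer + root->pos, KEY_LEN) < 0)
        root->left  = insert_position(root->left,  newNode, buffer);
    else
        root->right = insert_position(root->right, newNode, buffer);
    return root;
}
```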
Maybe MMC is much more of a contender than I initially thought...
Note that these results use a Greedy parsing strategy. The conclusions are likely to be very different with an Optimal parsing strategy.