One of the main advantages of the Morphing Match Chain (MMC) method (explained here) is that you can add elements into the chain without actually sorting them. They will get sorted later on, and only if necessary.
As a consequence, many elements will never get sorted, because they are never accessed.
In a "greedy match" search algorithm, it happens very often, and the shorter the search window, the more probable elements will reach their end of life before being searched.
This strategy is key in avoiding many useless comparison operations.
Now, this has a drawback: all elements are added into the "baseline" hash chain, and they only get sorted later on, if the hash chain is called. But my current implementation of MMC is limited to a maximum of one promotion per pass. As a consequence, all these unsorted positions tend to get promoted to the next level together, massively. Only elements which do not reach the minimum length are left out. The point is: these "good" elements could have reached a higher level immediately, but they are artificially limited to the current level + 1 by the implementation.
Wanting to measure this effect, I made a simple modification to LZ4 to count the highest potential level of each element and compare it to the current level limitation.
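A rough sketch of what such a counting probe might look like (this is not the actual LZ4/MMC code; the function, names and chain layout are simplified assumptions of mine):

```c
/* Illustrative sketch only: not the real MMC implementation.
 * While scanning a chain, an element matching 'matched' bytes with the
 * current position could in principle join the chain of level 'matched',
 * but the current implementation only promotes it to currentLevel + 1.
 * This counter accumulates the difference. */
static unsigned long long missedLevels = 0;

static void countMissedPromotion(const unsigned char* ip,
                                 const unsigned char* ref,
                                 int currentLevel, int maxLength)
{
    int matched = 0;
    while (matched < maxLength && ip[matched] == ref[matched]) matched++;

    int potentialLevel = matched;          /* highest level this element could reach */
    int grantedLevel   = currentLevel + 1; /* what the implementation actually allows */

    if (potentialLevel > grantedLevel)
        missedLevels += potentialLevel - grantedLevel;
}
```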
I was expecting some missed opportunities, in the range of 8 levels typically.
This was incorrect.
I found missed promotion opportunities in the range of hundreds of levels.
Granted, the highest numbers are typically the result of very repetitive patterns, which can be found in binary files (I witnessed many strange repetition lengths, such as the small primes 3, 5 and 7, or larger ones such as 120). In such cases, missed opportunities can reach several thousand levels.
But outside of these (not so rare) corner cases, I still witnessed missed opportunities in the range of a hundred levels with a file such as enwik8, which does not feature any massive repetition pattern, just some English words and groups of words which are more frequent than others.
This suggests that a tangible gain can be achieved through multi-level promotion.
Sorting elements with a multi-level scheme will not make the current search faster, but it will make the next search using the same hash much more efficient.
Unfortunately, this also requires a complete rewrite of the MMC algorithm. The basics behind the search method remain the same, but the algorithm needs to be completely different in order to track multiple levels in parallel. Quite some work ahead...
Saturday, December 4, 2010
Thursday, December 2, 2010
Fusion searches - segments & sequences fully integrated
After spending quite some time testing a few different directions (more on this in later notes), I finally had enough time and focus to generalize Fusion to streams of any character.
I opted for the easy way, extending the table that lists existing segments within range. I had another scheme in mind, with the advantage of being memory-less, albeit at an unknown performance cost. Maybe for another version.
Nevertheless, I ended up with some usable results, and they are quite promising.
As the usual "raw" efficiency measure, I counted the number of comparisons, reported below:
| Searches     | MMC   | Segment | Fusion | Improvement |
| Calgary 64K  | 20.3M | 6.55M   | 5.54M  | 18%         |
| Firefox 64K  | 70.3M | 47.1M   | 41.9M  | 12%         |
| Enwik8 64K   | 188M  | 187M    | 187M   | 0%          |
| Calgary 512K | 34.2M | 14.6M   | 10.9M  | 34%         |
| Firefox 512K | 153M  | 121M    | 92.1M  | 31%         |
| Enwik8 512K  | 435M  | 435M    | 434M   | 0%          |
| Calgary 4M   | 38.7M | 18.5M   | 14.2M  | 30%         |
| Firefox 4M   | 271M  | 308M    | 182M   | 69%         |
| Enwik8 4M    | 753M  | 758M    | 751M   | 1%          |
Now, this is satisfactory. Note how Firefox greatly benefits from fusion search support for "non-zero" bytes. Calgary, whose streams mostly consist of zeros, achieves only a small gain compared to Fusion0. Enwik8 hardly contains any stream at all, and therefore benefits the least.
But the tendency is what matters here. The larger the search window, the better the benefit. Firefox is quite impressive at 4M, keeping in mind that about 65M comparisons (more than one third of the total) are just collisions.
And finally, results are now always better than with good old MMC alone, whatever the file and search window size. Objective reached.
Thursday, November 25, 2010
The cost of History
Well, well, another idea proved fruitless.
After a quick note on cuckoo hashing, I tried to apply the concept to a compressor based on Etincelle's principles, betting that the superior occupancy rate of a cuckoo hash table would result in a better compression ratio.
There was indeed a gain, but a small one.
In fact, my initial diagnosis was simply wrong: I assumed that Etincelle's tables were incompletely used. This proved incorrect: they are fully used (fully cranked, I should say), and that is necessary for the scheme to be efficient.
Indeed, the real question is: which slot should be evicted when a new entrant gets in? With Etincelle, I use pure randomness to free a slot for the new occupant, which obviously means there are some collisions. With cuckoo hashing, the table gets fuller before the first collision occurs. But in the end, both tables are completely filled, so collisions happen at every update.
The only benefit of the new hash scheme is that I can select which slot will be freed. In my test, betting that the least recently used slot is probably the least worthy did provide a small compression boost.
But not enough to justify its cost.
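For illustration only, here is a minimal sketch of that victim-selection idea, assuming a two-choice table where each slot carries a last-use stamp; the types and names are mine, not Etincelle's:

```c
/* Minimal sketch of LRU-based victim selection between the two candidate
 * slots of a two-choice hash table. Types and names are illustrative. */
typedef struct {
    unsigned pos;       /* position referenced by this slot (0 = empty) */
    unsigned lastUse;   /* last time this slot was useful */
} slot_t;

/* Insert 'newPos' into one of its two candidate slots h1 or h2,
 * evicting the least recently used occupant. Returns the evicted position. */
static unsigned insertSelectLRU(slot_t* table, unsigned h1, unsigned h2,
                                unsigned newPos, unsigned now)
{
    slot_t* a = &table[h1];
    slot_t* b = &table[h2];
    slot_t* victim = (a->lastUse <= b->lastUse) ? a : b;  /* keep the fresher one */

    unsigned evicted = victim->pos;
    victim->pos = newPos;
    victim->lastUse = now;
    return evicted;
}
```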
I now remember a pending question I had when designing Etincelle: why keep the whole history in memory? The larger it is, the more it costs to reference into it. If it were possible to keep only the useful part of history, references would become much smaller, and compression would therefore improve.
Alas, you can't guess which part of history is going to become useful. As a consequence, the usual way for compressors has always been to keep everything in memory. When reference size matters, references are reduced thanks to context filtering.
Trying to improve on this situation, I tried my hand at an "explicit history manager", which would tell the decoder which positions to keep in memory for later use.
My first implementation proved disastrous. Although references did become much smaller, this was not enough to offset the cost of the instructions passed to the decoder.
Later, I tried a new implementation, much more dynamic and (alas) much more complex. This time, there was a net gain. Not a large one, but definitely a gain.
From a speed/ratio perspective, however, this was really not worth it; so, outside of a pure compression exercise, it proved the wrong path.
So my conclusion, at the time, was that I'd better try to find some heuristic which would, on average, correctly guess which part of history to keep in memory.
I never found the correct formula.
It seems this idea would be a remote cousin of another concept used in some vastly different compression algorithms, called SSE (Secondary Symbol Estimation). To summarize the concept: according to additional statistics (such as the latest compression ratio, local guess success, etc.), the current guess probabilities get adjusted. For example, if the algorithm tries to compress some noise, SSE will simply override past statistics and state that the final estimation is completely random, giving a flat probability to all symbols of the alphabet.
This is the kind of thing I would like to mimic: derive, from a bunch of cheap statistics, an estimation of how worthwhile the current position is to store in memory for future reference.
Apparently, this still deserves to be properly implemented.
Wednesday, November 24, 2010
CPU cache
Looking for performance on modern CPU architectures? You can't avoid taking memory cache effects into account.
The logic behind caches is pretty simple: access to main memory is (relatively) slow, therefore the CPU keeps a copy of part of it close by, for faster interaction.
At the lowest level of this strategy, you could consider registers to be a kind of "level 0 cache". Registers live inside the CPU, and some of them can even be used for comparison or mathematical operations, within limits. As an example, the old "Saturn" processor, used in several HP handheld calculators, had 4 main registers and 4 storage registers. Data in these registers is accessed swiftly and can be manipulated easily. But this is not transparent: you have to explicitly load and save data into them, so the term "cache" is quite dubious here.
At the next level, all modern processors nowadays feature an L1 cache. Not only PC processors, but even low-power CPUs for the embedded market, such as ARM, feature one. This time, it is truly a copy of main memory, an (almost) transparent mechanism which simply ensures that recently accessed data stays close to the processor.
L1 caches are typically very fast, with access times in the range of 5 cycles or better. They are, however, relatively small, storing just a few kilobytes (a Core 2 Duo, for example, features 2x32 KB of L1 cache per core). L1 caches are typically considered part of the CPU.
Going up another level, most PC processors also feature an L2 cache. This time, we can only say that data sits "close to" the processor, as L2 caches are rarely "inside" the CPU; they are, however, sometimes part of the same package.
L2 caches are slower (in the range of 20 cycles access time) and larger (typically between 0.25 and 8 MB). So data that does not fit into L1 is likely to be found in this second, larger cache.
A few processors also feature an L3 cache, but I won't elaborate on that. Suffice it to say that it is an even larger and slower cache.
Then you have main memory. Nowadays, this memory tends to be quite large, at several GB per machine. It is therefore several orders of magnitude larger than the caches, but also much slower. Performance figures vary a lot depending on the architecture, but a typical access time is in the range of 100-150 cycles.
So here you have the basic principle: very recently accessed data is re-accessed swiftly thanks to the L1 cache (5 cycles), slightly less recent data is found in the L2 cache (20 cycles), and everything else comes from main memory (150 cycles). As a consequence, making sure that the needed data stays in cache as much as possible makes a terrific difference to performance.
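To observe these steps yourself, a small pointer-chasing test (a sketch added here for illustration; exact figures depend on the machine) times a chain of dependent loads over growing working sets, and should reveal the L1 / L2 / main-memory plateaus:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* xorshift64: just enough randomness to build an unpredictable cycle */
static unsigned long long rngState = 88172645463325252ULL;
static unsigned long long rng(void)
{
    rngState ^= rngState << 13;
    rngState ^= rngState >> 7;
    rngState ^= rngState << 17;
    return rngState;
}

int main(void)
{
    const size_t accesses = 20 * 1000 * 1000;

    for (size_t bytes = 1 << 12; bytes <= ((size_t)1 << 26); bytes <<= 2)  /* 4 KB .. 64 MB */
    {
        size_t count = bytes / sizeof(size_t);
        size_t* next = malloc(count * sizeof(size_t));
        if (!next) return 1;

        /* Sattolo's algorithm: build a random single cycle, so that each
         * load depends on the previous one and defeats the prefetcher */
        for (size_t i = 0; i < count; i++) next[i] = i;
        for (size_t i = count - 1; i > 0; i--) {
            size_t j = (size_t)(rng() % i);
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        size_t idx = 0;
        clock_t start = clock();
        for (size_t i = 0; i < accesses; i++) idx = next[idx];
        double ns = (double)(clock() - start) / CLOCKS_PER_SEC * 1e9 / accesses;

        printf("working set %8zu KB : %6.1f ns per access (check %zu)\n",
               bytes >> 10, ns, idx);
        free(next);
    }
    return 0;
}
```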
The rule seems simple enough as it is, but there is more to it. Understanding exactly how a memory cache works is key to crafting performance. So, in a later note, we'll study its mechanism in depth.
Monday, November 22, 2010
Cuckoo's Hash
After reading a short rant from Charles Bloom, I spent some time reading and studying cuckoo hashing, a relatively recent hashing methodology which gets its name from the bird: as you may know, a cuckoo is genetically programmed to hatch in another bird's nest, and when it gets out of its shell, its first act in life is to push the other eggs and baby birds out of the nest, securing all future food for its own benefit.
Cuckoo hashing shares the same behavior:
Assuming you have 2 hash functions with a relatively low probability of producing the same hash for different sequences, you start by hashing your sequence with the first hash. If the slot is already occupied, you take it anyway. The previous occupant is then re-hashed using the second hash function, and thus finds another place to live. However, if that second place is already taken, the process starts again: the occupant gets evicted, re-hashed using the first hash function, and so on.
In the end, there is a pretty high chance that, with enough shuffling, all elements will find their own place.
Searching a cuckoo hash table is pretty straightforward: you calculate both hashes, using the 2 hash functions, and then check both positions for the sequence. It can't be anywhere else, so you're sure to find it in at most 2 probes if it exists.
This looks interesting, but there is a catch: you can get into a situation with infinite loops, for example two elements whose candidate slots keep pointing to each other.
To counter this risk, you have to respect a "load limit". In the original cuckoo hash implementation, this limit is pretty low, at 50%. Since then, however, a simple improvement has been found, using multi-slot buckets or more hash functions. With many more slots to choose from, the load limit goes through the roof, reaching extreme levels, as high as 99.7% (for 2 hashes and 8 slots per bucket, for example).
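As an illustration of the basic two-table, one-slot-per-bucket variant described above, here is a minimal sketch; the hash functions, table size and give-up threshold are arbitrary choices of mine, not taken from any particular implementation:

```c
/* Minimal sketch of basic cuckoo hashing: two tables, two hash functions,
 * and a bounded eviction loop. All sizes and constants are illustrative. */
#define CUCKOO_SIZE  (1 << 16)
#define MAX_KICKS    64              /* give up (a rehash would be needed) beyond this */

typedef unsigned int u32;

static u32 tab1[CUCKOO_SIZE];
static u32 tab2[CUCKOO_SIZE];        /* 0 is used as the "empty" marker */

static u32 hash1(u32 key) { return (key * 2654435761u) >> 16; }
static u32 hash2(u32 key) { return ((key ^ (key >> 13)) * 2246822519u) >> 16; }

/* returns 1 on success, 0 if the load is too high and a rehash would be needed */
static int cuckooInsert(u32 key)
{
    for (int kicks = 0; kicks < MAX_KICKS; kicks++)
    {
        u32 h1 = hash1(key) & (CUCKOO_SIZE - 1);
        u32 evicted = tab1[h1];
        tab1[h1] = key;                      /* take the slot, kick out the occupant */
        if (evicted == 0) return 1;

        u32 h2 = hash2(evicted) & (CUCKOO_SIZE - 1);
        u32 evicted2 = tab2[h2];
        tab2[h2] = evicted;                  /* the occupant moves to its other home */
        if (evicted2 == 0) return 1;

        key = evicted2;                      /* and the chain of evictions continues */
    }
    return 0;
}

static int cuckooContains(u32 key)
{
    /* at most 2 probes: an element can only live at one of its two positions */
    return tab1[hash1(key) & (CUCKOO_SIZE - 1)] == key
        || tab2[hash2(key) & (CUCKOO_SIZE - 1)] == key;
}
```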
So where does that lead us?
When reading about these properties, I had one target in mind: Etincelle, my current flagship compressor. To describe its characteristics swiftly, Etincelle compresses in the range of the best Zip, but much, much faster (measured at 80 MB/s on enwik8 on my Core 2).
Its speed is mostly due to its very fast match search algorithm, which, however, misses many opportunities because of its very light probing. Etincelle turns this inefficiency to its advantage by referencing data with very small indexes. This works even at fairly large distances (you can instruct Etincelle to use a 1 GB dictionary, for example), but it is only as efficient as the table fill rate. I therefore use over-booked hash tables to keep indexes efficient.
Cuckoo hashes with a high load limit could help make this scheme even more efficient.
Testing this, however, would require some fairly large code changes, to the point of turning Etincelle into something else, obviously a new compressor. Speed is also bound to suffer, as measured in Dr. Askitis's hash comparison thesis.
Another potential use of a cuckoo hash table is a more classical objective within LZ4: as measured in a previous post, the collision rate reaches inconvenient levels with Firefox at a 4M search depth, totaling 60 million comparisons, which is more than 20% of all comparisons. This is obviously a lot, in spite of MMC handling this situation automatically, albeit in a chain-like fashion. Here too, a cuckoo hash could help bring these collisions down to zero, but at a cost. It remains to be tested whether the extra cost and complexity are worthwhile.
Sunday, November 21, 2010
LZ4 : New version
Yesterday's experiments provided some appreciable gains to LZ4, so it seems worth sharing them.
The new version of LZ4HC is, on average, 10% faster.
On top of that, I found an optimization in the decoder which is valid for any LZ4 version.
As a consequence, both the "Fast" and "High Compression" versions get updated, with a nice boost to decompression speed. LZ4 was already the fastest decoder; this makes it even better.
You can download both versions at the new LZ4 homepage.
|                          | Compression Ratio | Speed     | Decoding |
| LZ4 "Ultra Fast"         | 2.062             | 227 MB/s  | 780 MB/s |
| LZ4HC "High Compression" | 2.345             | 31.0 MB/s | 830 MB/s |
Saturday, November 20, 2010
First results of fusion search
As stated in an earlier post, segment search provides a nice boost to compression speed when the files to compress contain long streams of identical characters. However, this benefit does not ramp up with search depth, as there is no gradual elimination of bad candidates, like MMC does for normal sequences.
Therefore, the ideal solution would be to "merge" both algorithms. But this is easier said than done.
Finally, after much time spent correcting many small inefficiencies and inaccuracies, a first version of this merged algorithm is able to provide some figures to verify the claim.
The current version is limited to handling streams of zeros. When a segment is detected, it is fully populated with pointers to the previous best possible match, then the search continues using the normal MMC algorithm.
This approach results in better elimination of bad positions, and therefore fewer comparison checks. On the other hand, it requires much heavier updates to the chain table.
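As a very rough sketch of that idea (the real MMC chain table layout is certainly different; this only illustrates the principle that, inside a run of identical bytes, each position's best predecessor is simply the position just before it):

```c
/* Illustrative sketch only. Inside a detected run of identical bytes,
 * the best previous match for each position is the immediately preceding
 * position, so the chain can be filled directly, with no search. */
static void fillSegmentChain(unsigned* chainTable, unsigned segmentStart,
                             unsigned segmentLength)
{
    for (unsigned i = 1; i < segmentLength; i++)
        chainTable[segmentStart + i] = segmentStart + i - 1;   /* link to previous position */
}
```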
So where does that lead us? Here are some extracted statistics, comparing the new fusion algorithm with the current 0.5 (segment) version of LZ4HC:
MMC Segment Fusion/0 Improvement
Calgary 64K Searches 20.3M 6.55M 5.62M 17%
Firefox 64K Searches 70.3M 47.1M 45.3M 4%
Enwik8 64K Searches 188M 187M 187M 0%
Calgary 512K Searches 34.2M 14.6M 11.1M 32%
Firefox 512K Searches 153M 121M 115M 5%
Enwik8 512K Searches 435M 435M 435M 0%
Calgary 4M Searches 38.7M 18.5M 14.4M 28%
Firefox 4M Searches 271M 308M 295M 4%
Enwik8 4M Searches 753M 758M 758M 0%
There is definitely some improvement, but not that large.
It seems to get even better with a larger window size (64K -> 512K), but the tendency is surprisingly reversed at the largest tested size (4M).
There are a number of potential explanations for that.
To begin with, the new Fusion algorithm only deals with streams of zeros; the Segment algorithm is still used for all other streams. As a consequence, there is still room for improvement, as can be witnessed with Firefox at the 4M range: we still have more comparisons than with MMC alone. So there are probably some tangible benefits to expect from generalizing Fusion to all byte streams.
Then, at 4M, the number of collisions is really large for Firefox, reaching 64M, which is more than 20% of comparisons. It is also more than 1M for Calgary (about 7%). This amount is not reduced by Fusion, so Fusion's relative benefit shrinks as the share of collisions grows.
The easy solution would be to increase the hash table size, to reduce the effect of collisions. However, as I want directly comparable results, the hash table has to remain the same for all search depths and all search algorithms.
Maybe I'll change that policy, to allow the hash size to adapt to the search depth.
Calgary is also a strange beast: its size is less than 4M, so it does not fully benefit from the largest window size. Still, the small difference between 512K and 4M is quite unexpected. This observation holds whatever the search algorithm, so there is definitely something going on. Maybe the fact that calgary.tar contains very different file types, each individually smaller than 512K, could explain this behavior.
The next logical step seems to be the generalization of Fusion, to measure the potential improvements.
Friday, November 19, 2010
Compressing the web
I just had a very pleasant time reading Frederic Filloux's post on his wish list for a future word processor. In a nutshell, what he is looking for is a context-aware spell checker, so that he does not need to constantly teach the dictionary new words and surnames. It goes even further: citing a famous example from Google Wave, the sentence "Icland is an icland" gets automatically corrected into "Iceland is an island", the most probable spelling.
We can easily get a glimpse of this feature by simply misspelling a word in the Google search engine: the search is still performed, but a correction is also automatically proposed.
Now just a word on how this correction is proposed: it is no longer about looking words up in a reference dictionary. Google has enough memory to download (almost) the whole web, study it, and extract meaningful statistics from it. Its engine is not taught English lessons; it learns the language by itself, simply by continuously reading English texts. And it does the same for any other language.
It does not even stop there: it can infer equivalences, correct spellings and grammatical rules from this data. Moreover, it can extract enough information from the current context to propose the proper correction among several possibilities.
So where does that lead us?
Guessing the proper next sequence from the current context and from what has been learned from past history: doesn't that sound like compression?
It must be underlined here that "past history" refers to "the web", at large. Therefore, digging into such a massive amount of data and still generating an answer in a fraction of a second is quite a feat, and should inspire compression algorithms (or the other way round).
As stated in a previous post, searching deeply into a dictionary of size N requires 4xN or 8xN memory. Obviously you can't do that with a dictionary of "Internet" size, so you don't even think of making such a search. Building a statistical structure out of this huge source is likely to result in a much more compact dataset and, as a consequence, a usable and fast search algorithm.
Looking for a compression algorithm working with statistics rather than a dictionary? Look no further: think PPM / CCM.