Tuesday, March 29, 2011
Ross Williams' LZRW
Well, okay, it's not that far, far away, but still, the early 90's seem like the prehistoric era of personal computing by now.
Ross Williams created a fairly interesting family of LZ compressors, featuring excellent speed and memory properties while still providing a good compression ratio (by the standards of the time), which he modestly called LZRW, for Lempel-Ziv-Ross Williams.
It has since become a classic among early compression algorithms, and remains competitive even by today's standards. Indeed, it is also the basis of several modern compressors, such as QuickLZ.
The main idea behind LZRW is this one: instead of providing an offset to the LZ decoder, just provide it with the position of the matching pointer within a table.
It may look complex or crazy at first, but it is quite logical when following Ross Williams' reasoning through his successive versions of LZRW:
Back in those days, computers featuring some 64KB of RAM were still commonplace. Better ones could reach an impressive 256KB. Table sizes were therefore seriously limited, and tracking all positions in the buffer to find the best match was mostly out of reach.
LZRW does not target maximal compression, but a fast and practical (i.e. memory-efficient) compression level. The price is scarcity of references: since only a small subset of past positions is tracked, many offset values can never be produced, yet the offset field must still be wide enough to address all of them. These "holes" in the addressable space translate into compression inefficiency.
Hence the main idea: by using the table position instead of the offset, there is no longer any hole in the addressable space. As long as the table is completely filled, all positions have, roughly, an equivalent probability to appear (well, roughly enough...). Furthermore, all matches, even long-range ones, use exactly the same number of bits to be described, making them more efficient.
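To illustrate the idea, here is a minimal C sketch of such an encoder loop. All concrete choices in it are assumptions made for the example (the 3-byte hash, the 12-bit table, the emit_literal()/emit_match() output stubs); a real LZRW variant makes its own. The key point is the first argument of emit_match(): the slot index h, not an offset.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_BITS  12                      /* hypothetical: a 4096-slot table */
#define TABLE_SIZE (1 << SLOT_BITS)

/* Hash the next 3 bytes to select a table slot (multiplier is illustrative). */
static unsigned hash3(const uint8_t *p)
{
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    return (v * 2654435761u) >> (32 - SLOT_BITS);
}

/* Output stubs: stand-ins for whatever token format a real codec uses. */
static void emit_literal(uint8_t c)                { (void)c; }
static void emit_match(unsigned slot, size_t mlen) { (void)slot; (void)mlen; }

void lzrw_style_scan(const uint8_t *src, size_t len)
{
    static const uint8_t *table[TABLE_SIZE];   /* slot -> last seen position */
    const uint8_t *ip    = src;
    const uint8_t *limit = src + len;

    while (ip + 3 <= limit) {
        unsigned h = hash3(ip);
        const uint8_t *ref = table[h];
        table[h] = ip;                         /* refresh slot with newest position */

        if (ref != NULL && memcmp(ref, ip, 3) == 0) {
            size_t mlen = 3;                   /* extend the confirmed match */
            while (ip + mlen < limit && ref[mlen] == ip[mlen]) mlen++;
            emit_match(h, mlen);               /* transmit the slot index, not an offset */
            ip += mlen;
        } else {
            emit_literal(*ip++);
        }
    }
    while (ip < limit) emit_literal(*ip++);    /* trailing bytes */
}
```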
The downside is that the decoder must maintain and reference the table exactly as the compressor did, making it more complex (and slower) than a more standard LZSS decoder. But still, this trade-off is very acceptable, since the speed penalty is not too large.
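To show what this mirroring costs, here is a decoder-side sketch under the same toy assumptions (the stream/read functions are placeholder stubs, not a real bitstream API). Note the subtlety it has to deal with: a position can only be hashed once three bytes have been decoded there, so updates for literal positions must be deferred, and a matching encoder would have to defer them at exactly the same points (the simple sketch above updates immediately, so the two sketches are not a drop-in pair).

```c
#include <stddef.h>
#include <stdint.h>

#define SLOT_BITS  12
#define TABLE_SIZE (1 << SLOT_BITS)

/* Must be bit-for-bit identical to the encoder's hash. */
static unsigned hash3(const uint8_t *p)
{
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    return (v * 2654435761u) >> (32 - SLOT_BITS);
}

/* Stub bitstream so the sketch compiles; a real decoder reads tokens here. */
static int      stream_has_more(void)     { return 0; }
static int      next_token_is_match(void) { return 0; }
static unsigned read_match_slot(void)     { return 0; }
static size_t   read_match_length(void)   { return 0; }
static uint8_t  read_literal(void)        { return 0; }

void lzrw_style_decode(uint8_t *dst)
{
    static const uint8_t *table[TABLE_SIZE];  /* mirrors the encoder's table */
    uint8_t *op = dst;
    const uint8_t *pending[4];                /* literal starts not yet hashable */
    int nPending = 0;

    while (stream_has_more()) {
        /* Deferred updates: hash a literal position only once 3 bytes exist. */
        while (nPending > 0 && pending[0] + 3 <= op) {
            table[hash3(pending[0])] = pending[0];
            for (int i = 1; i < nPending; i++) pending[i-1] = pending[i];
            nPending--;
        }

        if (next_token_is_match()) {
            unsigned h    = read_match_slot();
            size_t   mlen = read_match_length();
            const uint8_t *ref = table[h];    /* slot index -> match position */
            table[h] = op;                    /* mirror the encoder's slot refresh */
            for (size_t i = 0; i < mlen; i++) /* byte-wise: overlap is legal in LZ */
                op[i] = ref[i];
            op += mlen;
        } else {
            pending[nPending++] = op;         /* hash later, once bytes are decoded */
            *op++ = read_literal();
        }
    }
}
```

This extra table maintenance is exactly what makes such decoders slower than an LZSS decoder, which only needs to copy from a plain offset.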
The LZRW scheme features a "best spot" for optimal compression ratio: the larger the table, the better its "match find capability", hence a better compression ratio. However, the larger the table, the more bits it requires to describe the table slot to the decoder, hence a worse compression ratio.
The "best spot" is found where both forces collide.
Or is it?
One problem is, the optimal number of bits is not the same from one file to another.
For example, with "enwik8", this optimal number is 14 bits.
For "open office", it is 19 bits.
For very small files, this number can go down to 10 or less.
Therefore, the question is: how to find the optimal table size, the one that produces the "optimal compression ratio" for any given input?
Alas, that question has no easy answer, and I have yet to find one myself...
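One obvious, if costly, fallback is multi-pass brute force: compress with several table sizes and keep the best result. In this sketch, the compress callback is a hypothetical LZRW-style compressor whose slot width is a parameter; no real implementation of it is implied.

```c
#include <stddef.h>

/* Hypothetical compressor signature: returns the compressed size.
 * 'dst' must be large enough for the worst case. */
typedef size_t (*lzrw_fn)(const void *src, size_t srcLen,
                          void *dst, int slotBits);

/* Multi-pass search: try every candidate slot width, keep the smallest
 * output. The winning width must be stored in the output header too,
 * so the decoder can size its table identically. */
int find_best_slot_bits(lzrw_fn compress,
                        const void *src, size_t srcLen, void *dst)
{
    int    bestBits = 10;
    size_t bestSize = (size_t)-1;
    for (int bits = 10; bits <= 20; bits++) {
        size_t size = compress(src, srcLen, dst, bits);
        if (size < bestSize) { bestSize = size; bestBits = bits; }
    }
    return bestBits;
}
```

Of course, compressing the input eleven times hardly qualifies as an "easy answer"; a single-pass heuristic remains the interesting open question.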