Sunday, October 2, 2011

LZ4 in commercial application

 Time for great news. Although i've been informed of a few open-source projects or commercial application working to integrate LZ4 into their production, this is actually the first time that a company has completed a product with it. A product that you can buy by the way. Moreover, it's kind enough to tell publicly about it.

LZ4 is in-use within a video game called 1000 Tiny Claws, by Mediatonic. The company is young but certainly not new, and has already received several good mentions for some of its previous productions, namely "Who's that Flying", and "Monsters (probably) stole my princess" (i really love that title :). Therefore their new game stirs a lot of expectation. Let's wish them great success with their new sky-pirates adventures !



Within the game, LZ4 is used mostly to decompress resources. These are many sprites, background tiles, animations and graphics in "cartoon style". Resources are not limited to graphics, and some text files, profiles and models may also join the fray. But typically graphics tend to inflate memory budget quite much faster than the others.



For this use case, LZ4 can be useful thanks to its very fast decoding speed, making the decoding not just painless, but advantageous (i.e. faster !) compared to using resource in plain uncompressed mode. The decoder is also memory-less, which also helps.



The compression part of LZ4 is not used within the game, but obviously, it has been necessary to compress resources before injecting them into the game. For this use, the high compression version LZ4-HC has been put to the task. Not only does it compress 20-30% better, saving space and bandwidth, it also makes the compressed data even easier to decode. So this is all gains for this use case.

and of course, the all-important credit line...
1000TC engine is work from Gabor Dorka

Note that, at the time the game was created, LZ4-HC was GPL, but there is no problem with that, since LZ4-HC is not shipped within the game. Only the decoder is, which is BSD, and incurs no restriction.

The game is scheduled October 4th, on the PSN. Now is time to frag some monsters...

Tuesday, September 20, 2011

Cost of accessing variables

 I finally found a way to make the -c1 compression mode of LZ4 compatible with multi-threading.

This mode comes from the earlier LZ4-HC (v0.9) branch, which was created long ago, back in these days when none of my algorithms was multi-thread ready. The only objective was performance, and single thread execution. Heavy use of global variables were part of the deal to reach better speed.

This lesson proved right once again.

The first stage of enabling multi-threading is to get rid of all global and static variables. They simply cannot be shared by several threads. At least not easily.
To do this, you simply encapsulate all necessary variables in a container (a pointer towards a structure), and then pass the container as a parameter of the function. This way, each thread has its own set of data to handle.

From a code perspective, the changes are very little : just make all prior variables points towards their equivalent in the structure, and there you go. This way, it is surprisingly fast and simple to make a code multi-thread compatible.

But there was a catch : speed instantly dropped by 20%. This is a lot, and can be easily measured with benchmark tools.

In a bid to solve this issue, i went back once again at the "global variable" effect. The code remained identical, so the change from Global Variable to Container was the only reason for the 20% difference drop. Why ? This is the complex part.

In fact, when you call a variable, using for example :
mytable[number] = 1;
the compiler will generate very different ASM codes depending on the declaration of mytable.

If mytable is declared as a global or static variable, the compiler knows exactly where stands the beginning of mytable in the heap. Then it knows the size of the cells. Therefore, it only needs to generate :
mytable-baseAddress + number x mytable-cellSize
to reach the proper memory address.

Now, if mytable is passed as a parameter within a container, it is all different. The first thing to do is to create a pointer named mytable and which points towards the beginning of the table within the container. 
Now there are 2 possible situations :
1) mytable base Address is stored into a function variable (temporary one, into the stack). It needs to be read to collect mytable -baseAddress, and only then can the addition with number x mytable -cellSize happen. table-cellSize can be deduced from the pointer type.
This generates one more read for each call.
2) the compiler remembers the association. Therefore, mytable will not be stored as a variable into the stack, but dynamically translated into container-baseAddress + table-baseDelta.
I doubt this is any better than the previous proposition.

In order to avoid such a situation, one solution is to declare some of these variables as local ones. They will be created into the stack. Then, the access cost is almost similar to global/static variables : a local variable is accessed as base local stack address plus known delta. This is one more addition operation than for global/static variables, but that's ok, it is almost the same performance.
(Note : a post was created discussing this effect earlier this year : http://fastcompression.blogspot.com/2011/03/multi-threaded-entropy-experiment-huff0.html)

The real issue is that you can't create too much data into the stack. Its size is limited.
So, for larger data sets, i see no other way than to allocate the necessary memory (using malloc()) and access it with a pointer. This is bound to cost some performance compared with global/static variable, since there is one more pointer read per call, but i have yet to find a way around it.

Anyway, with a good mixture of local variables and allocated ones, the speed loss of -c1 mode was limited to 6% in single thread scenario. Not yet invisible, but still a noticeable improvement compared to previous situation. And at least, now, the code scales almost linearly with the number of CPU cores available.

You can grab the latest LZ4 version (v1.1) at its homepage.

[Edit] thanks to Richard Sim for correcting the heap/stack confusion.

Monday, September 19, 2011

LZ4 reaches v1.0

 Finally, a long planned update of LZ4 is out today, reaching v1.0. It's time to integrate all the various compression versions into a single binary, with the user able to select the compression level it wants by a simple switch (-c0, -c1 or -c2).

It is also time to integrate newer versions of the algorithm, since there were quite many improvements since latest release. As a consequence, v1.0 is faster and more efficient, as can be seen in this table :


v0.9v1.0
compression ratio2.0612.075
compression speed260 MB/s295 MB/s
decompression speed810 MB/s990 MB/s
memory usage33MB17MB
If you are using the open-source version of LZ4 or LZ4-HC, it is recommended to update to latest source release.

You can grab this latest win32 binary at the LZ4 Homepage.

Saturday, September 3, 2011

LZ4-HC : High Compression LZ4 version is now Open Sourced

 It's been quite a while since my last post in this blog. Well, with Summer ongoing, i was busy having a life. You know, things that happen outside...
Anyway, with September ringing in, it's time to get back to work.

I've used my first few days at complying to a long standing request : wtf, where is LZ4-HC ?

Well, as i replied a few times, LZ4-HC was not provided; that is, up to now.

LZ4-HC is now Open Sourced, and provided as a sample demo for the MMC Search Algorithm, which it makes heavy use of.
That's the main difference with regular LZ4 : the HC version makes a full search, in contrast with the fast scan of the "regular" LZ4. It results in a compression ratio typically improved by 20%. But the speed is also quite impacted : the HC version is between 6x and 10x slower than the Fast one.

Nonetheless, the compression advantage can be made worthwhile, especially in "offline compression" scenario, when the time necessary to compress a resource is not important, since data can only get decompressed during execution.

There is also a second difference : since MMC is GPL, therefore LZ4-HC is GPL too. Note that it does not change anything regarding LZ4 license, which remains BSD.

Is that an issue ? Not necessarily. Since LZ4-HC and LZ4 use the very same format (and indeed, the decoding function is one and the same), so you can provide the output of LZ4-HC to LZ4, and the other way round.

Therefore, the following scenario is valid : you can use LZ4-HC to compress your data, and ship your commercial product with the compressed data streams and the (BSD) LZ4 decoder. There is no problem with that.

You can also create your own private application using LZ4-HC to compress your resources, without ever disclosing your source code. This is valid as long as the binary is not distributed.

Only if you want to distribute your application will you need to comply to the GPL License to integrate the LZ4-HC compression routine. Nothing unusual here : either open-source your code or acquire a commercial license.

Hopefully, although legality is a boring thing, it's not too complex to understand for a change.

You can grab the LZ4-HC Source Code here; it's been compiled and tested successfully under Windows, Linux 32 bits and Linux 64 bits.

[Edit] : LZ4HC is now Hosted at its own web page : http://code.google.com/p/lz4hc/. The latest version also improves compression ratio by a few %.

Monday, June 6, 2011

LZ4 - Improved performance

Since Przemyslaw Skibinski provided a benchmark dedicated to comparing fast compressors, i've been interested in a direct comparison between LZ4 and Google's snappy.

Snappy is a very fast compressor, but its source code is quite more complex than LZ4. At first, it may look like this complexity will cost quite some CPU cycles, but this is not matched by the results : snappy proves to be a very speedy competitor.
Therefore, I've been having a closer look into LZ4, looking for some potential wasted cycles, but couldn't find a way to make the code even leaner than it already is. A few tiny % could be grabbed here and there, but the speed difference remained too large to be clearly explained by the algorithm alone.

It turns out that the difference comes mostly from a simple reason : cache policy.
LZ4 was designed with L2 cache limitation in mind. In order to ensure fast compression, all data should be retrieved from the L2 cache, which is relatively fast, at about 10 cycles per access.
Fair enough, but there is even faster : the L1 cache can provide access within 3 cycles or less. It looks like a small difference, but it does add up with each data access. In the end, the total speed difference can reach 20%, exactly what's missing to LZ4 to reach snappy's speed.

And ... that's all it takes. Modifying the performance parameter "HASH_LOG" to keep most data accesses within the L1 cache proved enough to cover the difference.
Therefore, starting from r10 revision, the default performance parameter is 12, for 16KB, which is just the right amount for Intel L1 data cache. AMD users may try 13 for 32KB, since their L1 data cache is twice bigger.
Addendum : note that the above results are correct only for 32 bits systems. 64 bits ones will spend twice more memory.

There is just a little catch : with less memory to fill its tables, LZ4 miss more opportunities to find matches, translating into worsened compression ratio. I've partly mitigated the effect with a more thorough search, but there is still a measurable compression ratio deficit. In any case, if the compression ratio is more important for you, just increase "HASH_LOG" value to were it was before (17, for 512KB). And if ratio is really very important, move away from fast scan, and start implementing a more intensive search routine.

LZ4 was not created with such memory requirements in mind, but it's refreshing to see this algorithm accommodate these new conditions so easily.

Another interesting side effect of using less memory for compression is faster decompression. At first, it seems counter intuitive, since the decompression algorithm is not affected by the change, and does not need any memory at all beyond input and output buffers. But a reason can be found : with less matches, the decompression algorithm will have less "segments" to copy, and its speed is tied to this number : better have a few large copy operations than many small ones.

Well, if you want to give it a try, just download the latest source code version from Google code. Typical speed gain is about 20% for compression and 5% for decompression, while ratio is typically 2% less, although obviously your mileage will vary depending on the file to be compressed.

Update : well, another update (r11), another performance improvement. This time, it targets GCC compiler optimization and benefits greatly to decompression speed.

Update 2 : a compression benchmark comparison has been completed by m^2 on his Atom-based system.

Saturday, June 4, 2011

LZ4 for Internet communications

 Among the different comments and contributions received for LZ4, an interesting one is about using LZ4 in a communication scenario.
The idea is pretty simple to describe : sent data is compressed on the fly, and decompressed at destination point. Fair enough. LZ4 requires little resource, especially on the decoding side, so even very tiny communicating devices can accommodate such scenario.

However, this use-case comes with new threats of its own : what if, the packet containing compressed data was in fact maliciously crafted in an attempt to create a buffer overflow scenario ?

Indeed. Internet is a dangerous place, and security comes to mind when using it. The easy answer would be "strengthen your encryption and pray that no one will break it", but that's too cheap. The main point here is to make sure that the LZ4 decoder never tries to write beyond the provided destination buffer. After all, not all malformed packets are malicious attacks, it could simply be a transmission error.

This is proposed in a new decoding function, now available at Google Code :


int LZ4_uncompress(char* source, char* dest, int osize);


I've used this opportunity to also correct a long time comment about compress and decode being too different from each other, so now it is compress / uncompress.
LZ4_decode is still available, for compatibility reason, but flagged as "deprecated".

The main difference here is on the last parameter : it is osize, for "output size" (as opposed to isize within LZ4_decode). Yes, the decoder needs to know how many bytes it has to write, to ensure it never passes beyond the border.

The other difference is the result : when everything is right, LZ4_uncompress will provide the read input size. On the other hand, if the compressed stream has tried to write beyond the output buffer, indicating a malformed stream, the result will be negative to trigger an error, and the value will be the byte position of the wrong instruction.

As a side-bonus, LZ4_uncompress never writes beyond the destination size (as opposed to LZ4_decode which could write up to 3 bytes beyond the limit even in normal situations).
But the real good news is this : in spite of the extra checking, the decoder speed is barely affected. At about a couple of %, this is well within tolerance limits, and the improved safety and memory protection is really worth the change. The main reason is that the extra checks have been successfully nested, so they are not triggered in the "main loop".

Since the stream compression format remains unchanged, it is 100% compatible with already compressed data. As a consequence, this is a recommended upgrade to LZ4 users, LZ4_decode being still available for compatibility and performance mode.


------------------------------------------------------------------------------------
Now for the more complex consequences : stop reading here if this was already long enough to handle :)

Obviously, the file compression program needs to evolve.
The main reason is that the decompression routine needs to provide the block output size to the decoding function, which is easy enough for most blocks, except the last one.

Without prior knowledge of the file size, there is no way to tell the decoder how many bytes it needs to write for this last block. One solution would be to change the file format, in order to store this value somewhere. I've never really paid much attention to the file format, which is only one possible use-case of LZ4 compression routines, but it has remained stable since the first version of LZ4, and i wouldn't like to change it for a little reason, especially if there are solutions to solve the problem.

And a solution there is. Remember that we can recover the decoded block size from the input (compressed) size. All it takes is to ensure that we do not write beyond the output buffer size. So an (advanced) function is dedicated to this role :


int LZ4_uncompress_unknownOutputSize(char* source, char* dest, int isize, int maxOutputSize);


This one is a bit longer. It's effectively a secure version of the previous LZ4_decode function. It ensures that it will never write beyond dest+maxOutputSize. Its result is the size of the decoded data (osize) when the stream was successfully decoded. Otherwise, the result is negative, and indicate the byte position in the input stream containing the wrong instruction.
This function however is slower, by up to 10%, mostly because it has to check both the input and output buffer size limits, so it is not recommended for the general use.

Well, in most circumstances, it is expected that the decoder know the size of the object to be decoded, if only to allocate the right amount of memory. Therefore, the first LZ4_uncompress is expected to be the most useful one.

You can grab the latest source version at Google Code.


------------------------------------------------------------------------------------

As a last blog's note for today :
LZ4 has been ported to Go language, thanks to Branimir Karadzic.
He hosts this new project at GitHub :
https://github.com/bkaradzic/go-lz4

Thursday, May 26, 2011

LZ4 explained

 At popular request, this post tries to explain the LZ4 inner workings, in order to allow any programmer to develop its own version, potentially using another language than the one provided on Google Code (which is C).

The most important design principle behind LZ4 has been simplicity. It allows for an easy code, and fast execution.

Let's start with the compressed data format.

The compressed block is composed of sequences.
Each sequence starts with a token.
The token is a one byte value, separated into two 4-bits fields (which therefore range from 0 to 15).
The first field uses the 4 high-bits of the token, and indicates the length of literals. If it is 0, then there is no literal. If it is 15, then we need to add some more bytes to indicate the full length. Each additional byte then represent a value of 0 to 255, which is added to the previous value to produce a total length. When the byte value is 255, another byte is output.
There can be any number of bytes following the token. There is no "size limit". As a sidenote, here is the reason why a not-compressible input data block can be expanded by up to 0.4%.

Following the token and optional literal length bytes, are the literals themselves. Literals are uncompressed bytes, to be copied as-is.
They are exactly as numerous as previously decoded into length of literals. It's possible that there are zero literal.

Following the literals is the offset. This is a 2 bytes value, between 0 and 65535. It represents the position of the match to be copied from. Note that 0 is an invalid value, never used. 1 means "current position - 1 byte". 65536 cannot be coded, so the maximum offset value is really 65535. The value is stored using "little endian" format.

Then we need to extract the matchlength. For this, we use the second token field, a 4-bits value, from 0 to 15. There is an baselength to apply, which is the minimum length of a match, called minmatch. This minimum is 4. As a consequence, a value of 0 means a match length of 4 bytes, and a value of 15 means a match length of 19+ bytes.
Similar to literal length, on reaching the highest possible value (15), we output additional bytes, one at a time, with values ranging from 0 to 255. They are added to total to provide the final matchlength. A 255 value means there is another byte to read and add. There is no limit to the number of optional bytes that can be output this way (This points towards a maximum achievable compression ratio of ~250).

With the offset and the matchlength, the decoder can now proceed to copy the repetitive data from the already decoded buffer. Note that it is necessary to pay attention to overlapped copy, when matchlength > offset (typically when there are numerous consecutive zeroes).

By decoding the matchlength, we reach the end of the sequence, and start another one.

Graphically, the sequence looks like this :

Click for larger display



Note that the last sequence stops right after literals field.

There are specific parsing rules to respect to be compatible with the reference decoder :
1) The last 5 bytes are always literals
2) The last match cannot start within the last 12 bytes
Consequently, a file with less then 13 bytes can only be represented as literals
These rules are in place to benefit speed and ensure buffer limits are never crossed.

Regarding the way LZ4 searches and finds matches, note that there is no restriction on the method used. It could be a full search, using advanced structures such as MMC, BST or standard hash chains, a fast scan, a 2D hash table, well whatever. Advanced parsing can also be achieved while respecting full format compatibility (typically achieved by LZ4-HC).

The "fast" version of LZ4 hosted on Google Code uses a fast scan strategy, which is a single-cell wide hash table. Each position in the input data block gets "hashed", using the first 4 bytes (minmatch). Then the position is stored at the hashed position.
The size of the hash table can be modified while respecting full format compatibility. For restricted memory systems, this is an important feature, since the hash size can be reduced to 12 bits, or even 10 bits (1024 positions, needing only 4K). Obviously, the smaller the table, the more collisions (false positive) we get, reducing compression effectiveness. But it nonetheless still works, and remain fully compatible with more complex and memory-hungry versions. The decoder do not care of the method used to find matches, and requires no additional memory.


Note : the format above describes the content of an LZ4 compressed block. It is the raw compression format, with no additional feature, and is intended to be integrated into a program, which will wrap around its own custom enveloppe information.
If you are looking for a portable and interoperable format, which can be understood by other LZ4-compatible programs, you'll have to look at the LZ4 Framing format. In a nutshell, the framing format allows the compression of large files or data stream of arbitrary size, and will organize data into a flow of smaller compressed blocks with (optionnally) verified checksum.

Saturday, May 21, 2011

ZhuffMax parsing improvements

 As a small attempt to improve parsing, i've made the lazy matcher of ZhuffMax a bit deeper. It resulted in some small gains, but also new problems.

With a deeper sequence to look for potentially better matches, we indeed increase our chances to find them.
However, in some circumstances, we end up with a serie of matches of increasing lengthes. At some point, the amount of bytes skipped by this recursive search may become so large that we could have used the first match to begin with.

In order to counterbalance this side-effect, we have to keep track of an history of matches, and try to re-use them if we have skipped too many bytes in the search process. It's not complex, and just requires a bit of local memory.

With both mechanisms active, we end up with a small but measurable compression gain. Nothing dramatic, but still, since ZhuffMax is about learning advanced parsing techniques, it's the right place to implement this improvement.
Deeper searches means slower speed though, so the gain comes at some cost.


Benchmark platform : Core 2 Duo E8400 (3GHz), Window Seven 32-bits

versionthreadsCompression RatioSpeedDecoding
ZhuffMax0.2-t12.82514.1 MB/s325 MB/s
ZhuffMax0.2-t22.82528.0 MB/s640 MB/s


You can get the new version of ZhuffMax at Zhuff's Homepage.