Tuesday, September 20, 2011

Cost of accessing variables

 I finally found a way to make the -c1 compression mode of LZ4 compatible with multi-threading.

This mode comes from the earlier LZ4-HC (v0.9) branch, which was created long ago, back in these days when none of my algorithms was multi-thread ready. The only objective was performance, and single thread execution. Heavy use of global variables were part of the deal to reach better speed.

This lesson proved right once again.

The first stage of enabling multi-threading is to get rid of all global and static variables. They simply cannot be shared by several threads. At least not easily.
To do this, you simply encapsulate all necessary variables in a container (a pointer towards a structure), and then pass the container as a parameter of the function. This way, each thread has its own set of data to handle.

From a code perspective, the changes are very little : just make all prior variables points towards their equivalent in the structure, and there you go. This way, it is surprisingly fast and simple to make a code multi-thread compatible.

But there was a catch : speed instantly dropped by 20%. This is a lot, and can be easily measured with benchmark tools.

In a bid to solve this issue, i went back once again at the "global variable" effect. The code remained identical, so the change from Global Variable to Container was the only reason for the 20% difference drop. Why ? This is the complex part.

In fact, when you call a variable, using for example :
mytable[number] = 1;
the compiler will generate very different ASM codes depending on the declaration of mytable.

If mytable is declared as a global or static variable, the compiler knows exactly where stands the beginning of mytable in the heap. Then it knows the size of the cells. Therefore, it only needs to generate :
mytable-baseAddress + number x mytable-cellSize
to reach the proper memory address.

Now, if mytable is passed as a parameter within a container, it is all different. The first thing to do is to create a pointer named mytable and which points towards the beginning of the table within the container. 
Now there are 2 possible situations :
1) mytable base Address is stored into a function variable (temporary one, into the stack). It needs to be read to collect mytable -baseAddress, and only then can the addition with number x mytable -cellSize happen. table-cellSize can be deduced from the pointer type.
This generates one more read for each call.
2) the compiler remembers the association. Therefore, mytable will not be stored as a variable into the stack, but dynamically translated into container-baseAddress + table-baseDelta.
I doubt this is any better than the previous proposition.

In order to avoid such a situation, one solution is to declare some of these variables as local ones. They will be created into the stack. Then, the access cost is almost similar to global/static variables : a local variable is accessed as base local stack address plus known delta. This is one more addition operation than for global/static variables, but that's ok, it is almost the same performance.
(Note : a post was created discussing this effect earlier this year : http://fastcompression.blogspot.com/2011/03/multi-threaded-entropy-experiment-huff0.html)

The real issue is that you can't create too much data into the stack. Its size is limited.
So, for larger data sets, i see no other way than to allocate the necessary memory (using malloc()) and access it with a pointer. This is bound to cost some performance compared with global/static variable, since there is one more pointer read per call, but i have yet to find a way around it.

Anyway, with a good mixture of local variables and allocated ones, the speed loss of -c1 mode was limited to 6% in single thread scenario. Not yet invisible, but still a noticeable improvement compared to previous situation. And at least, now, the code scales almost linearly with the number of CPU cores available.

You can grab the latest LZ4 version (v1.1) at its homepage.

[Edit] thanks to Richard Sim for correcting the heap/stack confusion.

Monday, September 19, 2011

LZ4 reaches v1.0

 Finally, a long planned update of LZ4 is out today, reaching v1.0. It's time to integrate all the various compression versions into a single binary, with the user able to select the compression level it wants by a simple switch (-c0, -c1 or -c2).

It is also time to integrate newer versions of the algorithm, since there were quite many improvements since latest release. As a consequence, v1.0 is faster and more efficient, as can be seen in this table :


v0.9v1.0
compression ratio2.0612.075
compression speed260 MB/s295 MB/s
decompression speed810 MB/s990 MB/s
memory usage33MB17MB
If you are using the open-source version of LZ4 or LZ4-HC, it is recommended to update to latest source release.

You can grab this latest win32 binary at the LZ4 Homepage.