Friday, August 9, 2013

Part 3 : Inlining big functions

 After the success of converting LZ4_decompress() to inline function, I was curious to repeat the performance on LZ4_compress().

The attempt is more complex, because LZ4_compress() is a much larger function, typically unfit for inlining. Also, the different variants use different types (table of U16, U32, or const BYTE*, depending on circumstances).

The result is somewhat mixed, although I do now have a working version using inline instead of macro. Here are a few learnings from this experiment :

1) No allocation within inline functions

It's of course possible to allocate a few variables, but it's not the point to allocate big tables on stack inside inline functions. This is because inline functions can be inlined several times, bloating the memory requirement on stack.

It's an issue for LZ4, since its hash table can be allocated on heap, or on stack, depending on "#define HEAPMODE" trigger. This choice will have to be done outside of the main inlined function.

The good news is that if you allocate on stack and then call the inline function, the inline function will keep the properties of accessing stack variables & tables (most importantly speed) even though the code tells it's working on pointers.

2) Inlining big function is generally not a win

Inlining big function is not "natural" for the compiler.
Inline functions were initially created as a safer replacement to macro, for small snippets of code. The main idea is that calling a function requires some push/pop/jmp operations, costing both CPU cycles and stack memory, which could be avoided. In general, function calling is fast enough, and does not require such optimization. But for small code sections inside hot loops, which are going to be called millions of times, this can make a sensible performance difference.

The capability to "remove branches" inside such code thanks to optimizations is merely a consequence, not the main initial objective.
As a consequence, the language does not even take it into consideration, and only consider the "function call delay" as the main variable to be optimized. Hence, when the function is too large, the usual choice for the compiler is to not inline it.

Therefore, to inline big functions, you have to "force" it (using compiler-specific extensions).

The situation differs sharply, depending if we are compiling for 32-bits or 64-bits systems.
- 64-bits programs tend to inline fairly well big functions. A small boost to performance can even be expected, apparently the compiler is able to find even more optimizations.
- 32-bits programs, in contrast, don't like such (forced) inlining. Here also the situation differs depending on the compiler, with Visual making a much better job than GCC.

I tentatively attribute such difference to the number of available registers. In 32-bits (x86) mode, registers are scarce (6, which are supposed to be general nowadays, but were previously specific). In 64-bits mode, they are plentiful (16). It means this is much easier for the compiler to combine all these codes together, with enough space to keep important variables into registers and avoid memory accesses.

If the hypothesis is correct, this means this issue is more related to number of available registers and compiler cleverness, than it is to 32 vs 64 bits. This can be important for ARM systems, which tend to be 32-bits one, but get 16 general-purpose registers at their disposal (like x64). They should suffer less from inlining issues.

Coming back to LZ4, sometimes, the loss of performance on 32-bits x86 is so important that it's better to not inline the function, even if it results in more branching. It's the current choice for LZ4_compressHC().

3) Type selection is okay

I've been hiding the different types inside another inline function, which selects the right pointer type and the appropriate operation depending on a parameter. The compiler makes a fine job at eliminating useless branches, keeping only the path with correct type to be inlined.

4) Clear selection signal is mandatory

In order for the compiler to only keep the right code path, eliminating all branches, it is essential it correctly understand which branch is the right one.
There must be no room for confusion.

As a simple example, suppose we've got this code :

inline int functionA(int parameter)
{ (... bunch of code with one branching depending on parameter...) }

int functionB()
int parameter=1;
return functionA(parameter);

You'll be surprised to notice that the branching is not necessarily removed, even though the parameter has a very clear value.

This seems like a limitation of the compiler, which doesn't want to "bet" on the value of parameter. Therefore, prefer something along this line :

void functionB()
return functionA(1);

This one will ensure that only the path corresponding to parameter==1 will be kept, eliminating branching.

5) Predictable branching don't cost that much

Modern CPU are fairly complex beasts, and one thing they do fairly well is to correctly guess what will be the result of a branching, anticipating the next operations by filling the pipeline.
This is even more true if the result is always the same.
So, even if a branching is not eliminated, but lead to always the same path, the CPU will quickly understand it, and next time will speculative start the right branch even before the result of the test.
As a consequence, the cost of branching is minimal (only the test operation itself and a jmp), with no pipeline stall.
This means it's not always necessary to worry about every branching. Only those with (almost) random pattern will have a big hit to performance. Predictable ones are fairly fast.

So, the result :

The current developer version is a mixed bag. It results in a small performance boost in 64-bits mode (as compared to macro), and a small performance loss in 32-bits modes.

using a Corei5-3340M, on LZ4_compress() (fast variant) :
64-bits GCC : +2% (Ubuntu)
32-bits Visual : -1% (Win7)
32-bits GCC : -5% (Ubuntu)

On the other hand, it eliminates the need for external files (lz4_encoder.h & lz4hc_encoder.h), making the source code both easier to read and to integrate into user projects. And this was the main objective all along.


  1. It looks like the result graph is missing... or is it the one at the beginning of the article?

    1. There is none. I've added a few performance figures in the last paragraph instead.

  2. If the sole goal is to ship only one source file, have you considered making it include itself? Something like:

    #ifndef MODE
    int compress1(void)
    #define MODE 1
    #include "lz4.c"
    #undef MODE
    int compress2(void)
    #define MODE 2
    #include "lz4.c"
    #undef MODE
    return MODE;

    It's filthy and barely-legible, but it does achieve the stated objective.

    1. It's a pretty good trick. And no, I did not thought about it.

      As a small comment, I'm not sure if compilers would accept a file to include itself, since it could lead to infinite loops without appropriate #define. Should it be okay, it's a pretty cool idea.

      But that's not the only one point. A real downside is that the code is more difficult to read, and therefore to maintain, in macro mode, as compared to inline mode. The inline mode is also expected to improve maintenance.

  3. A compiler definitely allows a file to include itself. For instance you can #include __FILE__ but depending on the compiler you may prefer #include "lz4.c" because __FILE__ is not necessarily an absolute path.

    Then, the recursion limit of file inclusion is compiler dependent. That said I've happily wrote code that includes itself up to 32 times without problems.

    1. Good to know. Your idea looks fairly generic, it seems a nice way to reduce the count of source files

  4. Have you tried PGO with gcc or msvc? That should help inlining decisions. It would be interesting to see how well the compiler can do with PGO, without any macros or forced inlines.

    1. No, I haven't. PGO requires some learning curve, and I'm still at the bottom. Moreoever, the risk is that the final code is too optimized for a specific compiler configuration.
      For the current LZ4 version, the forceinline parameter has been brute-force tested, using my own benchmark tool.
      You are welcomed to try PGO on LZ4 if you wish, it may result in improvements the current "manual" process didn't catch.