Monday, June 10, 2013

Fighting Code Bloat - Part 2 - Inline functions

 A while ago, I've written a blog post on a method to decrease source code size, improving maintainability, when the source contains multiple functions which are almost identical, except for a few differences in conditions or types.

The method uses a separate *.h file, which is included as many times as necessary, with a few #define to trigger the relevant sections of the code.

Although it works, I couldn't help but get the feeling that it looks almost like a hack. Also, as a drawback, the method slightly increases source complexity, by increasing the total number of files to be included into an external project (granted, this number remains fairly small for LZ4, but still, it's a step in the wrong direction).

An insightful comment from Bryce Schober kindly reminded that another method was rightly accessible, using inline functions. Yes, inline functions would remove the issue of separated #include files, but it comes with its own set of limitations; namely, inline is merely an "hint" to the compiler, not a guaranteed that the function will effectively get inlined, and it doesn't solve the issue of manipulating different types within the function.

However, I wanted to give this method a try, in case it would result in a cleaner solution. Inline functions are regularly advised as a better alternative to macro (whenever applicable), since inline functions are still compiled, with all the benefits of strong typing and semantic check. It greatly improves debugging and code maintainability.

For this attempt, I selected the set of decompression functions, which has two big advantages :
1) The function is small. As an heuristic, the smallest a function is, the more probable it will get inlined.
2) There are no 'type' manipulations between the different version, only different sets of tests.

The key aspect for this trick to work is to expect the compiler's optimizer to do its job properly. Namely, whenever a test is guaranteed to be 'true' or 'false', we want the associated branch to be eliminated and the relevant piece of code to be instantiated in place. That's why it's key for this function to be inlined : without it, the compiler can't remove the branches, resulting in sensible performance penalty.

Ultimately, the experiment proved successful, and is the basis of LZ4 newest version, r97.

A few key things that were learned in the process :

- Make sure that branches to be removed receive a very clear '1 or 0' signal.
For example, if one branch depends on (variable > 0), don't write it this way. You should know if variable will fulfill the condition, and add another input, such as testtrigger, which will receive the value 1 or 0. Now, the test becomes if (testtrigger) : this will ensure the compiler will understand the test's result, and therefore only instantiate the correct branch.

- Make sure the function will really get inlined. In my tests, GCC happily inlined the generic function, but Visual Studio would not. Hopefully, this can be solve by using __forceinline keyword (forceinline is not part of C99 specification, it is therefore compiler-specific. It's a portability issue).

- Impact on performance is not zero. Here, it is a mixed bag. Inlined functions perform slightly worse than macros for Visual, while performing slightly better for GCC. It's difficult to explain why. The resulting assembler code is fairly close, but definitely not identical. The very small assembler differences could be enough to explain the performance delta. We are hopefully talking about small deltas, in the range of 2%.

- Code readability is definitely improved, as compared to a separate file with macros. This is a big boost to code maintenance. As an added bonus, it even proved easier to find some more optimization.

As a result, this version features a small boost to decoding speed, as measured below :
(fullbench, compiled with GCC v4.6.1 64-bits, running on Core i5 560M with Linux Ubuntu 11.10)
Summary (MB/s): r96 r97


LZ4_decompress_fast 1412 1460
LZ4_decompress_fast_withPrefix64k 1408 1457
LZ4_decompress_safe 1369 1391
LZ4_decompress_safe_withPrefix64k 1404 1416
LZ4_decompress_safe_partial 1327 1387


All in all, this is a step in the right direction.
Can the experiment be extended to the compression function itself ? Well, that's going to be a later story. It certainly looks much more complex.