Monday, November 24, 2014

Portability woes : Endianess and Alignment (Part 1)

 In creating an ultra-portable code, able to be compiled on (almost) every platform, there are some unusual problems to take into consideration. Listing them by order of increasing complexity, this post will review in detail address space, endianess and alignment restrictions (part 2).

Detecting 32 vs 64 bits
This is the easiest part, now practiced by many programmers. It's very common for a program to be designed with a single target environment. But since PC programmers had to deal with the 32->64 bits transitions, there are many solutions available around, just looking throughout Internet. However, they are not all equivalent.
Consider the initial solution adopted by LZ4 :

/* 32 or 64 bits ? */
#if (defined(__x86_64__) || defined(_M_X64) || defined(_WIN64) \
  || defined(__64BIT__)  || defined(__mips64) \
  || defined(__powerpc64__) || defined(__powerpc64le__) \
  || defined(__ppc64__) || defined(__ppc64le__) \
  || defined(__PPC64__) || defined(__PPC64LE__) \
  || defined(__ia64) || defined(__itanium__) || defined(_M_IA64) \
  || defined(__s390x__) )   /* Detects 64 bits mode */
#  define LZ4_ARCH64 1
#  define LZ4_ARCH64 0

It works. But you can easily spot the weakness : this is a long macro, with many special cases, and there is no guarantee there will be no more additional cases in the future. And by the way, this is what happened : the list was completed thanks to contributors adding one by one each target platform as they were discovered.
So it's complex, and not completely future proof.
Let's compare to the new method :

static unsigned LZ4_64bits(void) { return sizeof(void*)==8; }

Yes, a single trivial line, and it is future proof. It could be done as a macro too, but I prefer a static function, since it gives the compiler a chance to do something clever about it.

The initial feeling is that the macro is "runtime free" while the function will cost a small comparison test every time it's called, thus be slower.
But eventually, that's not what a modern compiler is expected to do. Clever compilers will realize this function always return the same value, and therefore replace it by its result. When there are branches depending on the result, the compiler is also expected to automatically solve the test, and remove the useless branch through dead code optimization.
And in practice, it works well.

It doesn't solve everything though.
For example, using Visual, the intrinsic function _BitScanForward64() is only accessible during x64 compilation. Compiling a source code mentioning this function in 32-bits mode will fail the link stage, even if the program will never call the function. That's a situation a runtime test cannot solve.
For this special case, it's still necessary to restrict the compiled source code through macro selection.

Detecting Endianess
While 32/64 bits is (mostly) a question of performance, endianess will impact result correctness. So it's damn important to correctly detect it.
A code able to handle different endianess is less common, but it's still relatively easy to find several solutions over Internet. And once again, they are not all equivalent.

Initially, LZ4 adopted the macro detection approach :

 * Little Endian or Big Endian ?
 * Overwrite the #define below if you know your architecture endianess
#include <stdlib.h>   /* Apparently required to detect endianess */
#if defined (__GLIBC__)
#  include <endian.h>
#  if (__BYTE_ORDER == __BIG_ENDIAN)
#     define LZ4_BIG_ENDIAN 1
#  endif
#elif (defined(__BIG_ENDIAN__) || defined(__BIG_ENDIAN) || defined(_BIG_ENDIAN)) && !(defined(__LITTLE_ENDIAN__) || defined(__LITTLE_ENDIAN) || defined(_LITTLE_ENDIAN))
#  define LZ4_BIG_ENDIAN 1
#elif defined(__sparc) || defined(__sparc__) \
   || defined(__powerpc__) || defined(__ppc__) || defined(__PPC__) \
   || defined(__hpux)  || defined(__hppa) \
   || defined(_MIPSEB) || defined(__s390__)
#  define LZ4_BIG_ENDIAN 1
/* Little Endian assumed. PDP Endian and other very rare endian format are unsupported. */
As you can see, it is a big mess. It depends on so many different parameters, it's hard to maintain, and it's difficult to guarantee it will always work. Indeed, it does not. Quite regularly, I received external contributions regarding specific platforms which would fail the test, in both directions (some little endian declared as big endian, and the other way round).
Add to this already complex situation the case of bi-endian CPU, a growing list of hardware which can select to be little-endian or big-endian, at will. That makes using architecture as an endian hint an unsustainable proposition.

As previously, the intention is to replace this list of macros by a guaranteed runtime test. Here is the new method within LZ4 :

static unsigned LZ4_isLittleEndian(void)

const union { U32 i; BYTE c[4]; } one = { 1 }; /* don't use static : performance detrimental */
return one.c[0]; }

Once again, the runtime method is not just much shorter and readable, it's also guaranteed to always produce the right result, whatever the CPU and its local mode.

However, this time, it's a little bit more difficult for the compiler to make the test "runtime free".

Here, unfortunately, your mileage may vary. For example, in the above example, I only succeeded in making the compiler realize the function always produce the same result after removing the "static" property from variable "one". Other possibilities exist, such as filling a 32-bits value and then accessing it with a char* pointer. They all work. At the end of the day, the question is : which version is most likely to be reduced into a constant value by as many compilers as possible ?

The above version proved successful so far. I can only wish it will remain as successful for other compiler/target combinations. At least, it's no longer the correctness of the test which is at stake, only its performance.

In the next article, we'll review memory access alignment restriction, which is, by far, the most complex issue.
In the meantime, should you wish to review and test the new detection methods, you can grab them in the latest LZ4 feature branch named "AlignEndian" :

It's possible to compare it with latest "official" release r124. On x86/x64 CPU, it was benchmarked and proved to provide similar performance. On other CPU though, especially big-endian ones, it would deserve to be checked.


  1. How is the endianness of the machine important?

    Can you comment on this article?:

    1. The article you link to is correct, and I'm using this methodology in some of my other source code. For example, this is what I do within lz4frame library.

      The problem is, it leaves a lot of performance on the table, because it trades one 32-bits access for four 8-bits accesses plus a reconstruction. Most compilers (all those I've tested so far) are unable to translate that into a more efficient instruction when possible (as is the case for x86/x64 CPU).

      It's not always important. For example, lz4frame only needs to safely write in little endian format for block headers, so it's a very small portion of overall time.
      But within lz4, this method can cost a visible portion of performance, between 20% & 40% on x86/x64 CPU. So it makes a difference.

    2. Thanks, that makes sense, I thought most compilers would be smarter about it.