RealTime Data Compression: Portability woes : Endianess and Alignment (Part 2)

Tuesday, November 25, 2014

Portability woes : Endianess and Alignment (Part 2)

In the previous part, endianess and 32/64 bits detection were detailed. Now we'll have a look at more complex memory alignment troubles.

Memory Access Alignment is a lesser known issue, but its impact can be huge : it will crash your program or result in a disproportionate slow down.

Let's first summarize the problem. When accessing a single byte into memory, there is no restriction. But when trying to access 2-bytes at a time (short), or 4-bytes at a time (int), alignment get into the way.

Aligned property means a 2-bytes field must always be accessed on an even address (multiple of 2), or accessing a 4-bytes field is always done on an address multiple of 4, and so on.

From a performance perspective, accessing many bytes at a time is a win, as it makes better use of the memory bus. But when a multi-bytes field is accessed on a non-aligned memory address, all sort of problems can happen : bus width or addressing limitation, cache line overlap, memory segment border overlap, and so on.

All these issues can be solved, and indeed, on the most widely known programming environment, x86/x64, these problems are solved since a long long time.

But it has a cost, it makes the CPU more complex, and consume some precious transistor space. As a consequence, several CPU vendors selected to be a bit lazy on these issues, deciding to not address them, leaving the problem into the hands of software developers. In such case, if an unaligned memory access is nonetheless performed, the CPU sends an exception, typically resulting in a program crash.

To make the matter more complex, some CPU addressed alignment issues, but in an inefficient manner, resulting in undesirable slow performance. Other ones address it correctly for short (2-bytes) or int (4-bytes) but not long long (8-bytes).

Data alignment issue is well described, with many sources throughout Internet. Unfortunately, finding a proper portable solution for it is not, many "advisers" simply telling to avoid unaligned access altogether. Thanks, really.

But the above condition cannot be met in every circumstances. Consider how the compression algorithm works : it looks for similar pattern into past data. Such pattern can appear anywhere, not just on "aligned" addresses.

For a portable code, this situation is a nightmare. The "safe" approach would be to always access data byte-by-byte, but then, the impact on performance is huge, and for speed-oriented application such as LZ4, this trade-off is unacceptable.

The way it was handled by LZ4 up to now relied on the compiler. The basic idea is : the compiler should be fully aware if its target CPU can, or cannot, handle unaligned memory access.

This is achieved through the "pack" instruction, which, in a nutshell, tell the compiler to "not align these data", and therefore generate proper cautious assembler code when accessing them.

It gives the following result :

#if defined(__GNUC__)  && !defined(LZ4_FORCE_UNALIGNED_ACCESS)

#  define _PACKED __attribute__ ((packed))

#else

#  define _PACKED

#endif

#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)

#  if defined(__IBMC__) || defined(__SUNPRO_C) || defined(__SUNPRO_CC)

#    pragma pack(1)

#  else

#    pragma pack(push, 1)

#  endif

#endif

typedef struct { U16 v; }  _PACKED U16_S;

typedef struct { U32 v; }  _PACKED U32_S;

typedef struct { U64 v; }  _PACKED U64_S;

typedef struct {size_t v;} _PACKED size_t_S;

#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)

#  if defined(__SUNPRO_C) || defined(__SUNPRO_CC)

#    pragma pack(0)

#  else

#    pragma pack(pop)

#  endif

#endif

#define A16(x)   (((U16_S *)(x))->v)

#define A32(x)   (((U32_S *)(x))->v)

#define A64(x)   (((U64_S *)(x))->v)

#define AARCH(x) (((size_t_S *)(x))->v)

If it looks a bit complex, that's because it is.

The first issue we have is that issuing the "pack" instruction must be done in a variety of ways, depending on the compiler and its version. It translates into this monstrous macro, trying to figure out all possible situations reported up to now. As you can guess, I regularly receive notice of new situations the macro cannot cope with.

This is about as bad as the previous stories regarding 32/64 bits and endianess.

But there is more.

Compilers are not that clever.

In many circumstances, the compiler will issue a "safe" slow code to access unaligned data, even though the target CPU is able to efficiently handle this situation, resulting in a large speed drop. This is especially true for late ARM CPU.

To counter-balance this effect, there is a need to manually "turn off" the "pack" instruction, using in the above example the #define LZ4_FORCE_UNALIGNED_ACCESS.

Unfortunately, the manual switch is almost never used. Source code will most of the time be compiled "as is", which is no surprise.

So we have 2 issues : issuing the "pack" instruction is complex, and not future-proof, and compilers don't automatically make the best choice.

To save the day, maybe a new runtime check will help, like for previous issues ?

Alas, there is no portable runtime test available to check for aligned properties.

(Note : of course, one could imagine running an external program/process just for this purpose, but it's outside of the scope of a little single-file library).

So we are stuck, aren't we ?

Well, that's the difficult part. To make some progresses on current situation, I'm going to change the logic : instead of relying on the compiler, take explicit charge to handle unaligned accesses.

The core foundation of the new solution is based on below function, already used within lz4frame :

static U32 LZ4_readLE32 (const BYTE* srcPtr)

{

    U32 value32 = srcPtr[0];

    value32 += (srcPtr[1]<<8);

    value32 += (srcPtr[2]<<16);

    value32 += (srcPtr[3]<<24);

    return value32;

}

What's good with this function ?

It handles both endianess and alignment in a safe way. The code is portable.

What's wrong with it ?

It's the safe approach, and therefore is slower than necessary when CPU can correctly handle unaligned memory accesses.

So, we will now special-cases CPU which do support efficient unaligned access.

Some of them are easily detectable, thanks to widely supported standard macro : __i386__, __x86_64__, __ARM_ARCH_7__ are known architectures with good support for unaligned memory accesses. __ARM_ARCH_6__ is also able to handle it, but in a less efficient manner, so it's unclear if it's really faster than the portable version.

Finding a list of CPU with efficient support of unaligned memory accesses (and their related detection macro) has proven an impossible task so far. One may have in mind that Linux faces a similar challenge, which is supposed to be solved thanks to the macro CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. Unfortunately, I couldn't find a place where this macro is defined. It seems to be a decentralized methodology, where each architecture tells independently if it's compatible or not. For the Linux kernel, it's likely the correct method. But that also means there is no central repository where this property is listed.

So I'm a bit stuck right now.

My expectation is that external contributors interested in LZ4 performance may benchmark the speed of the new version, tentatively enabling/disabling the prominent switch at the top of lz4.c when they see fit :

* CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS :

* You can force the code to use unaligned memory access, should you know your CPU can handle it efficiently.

* If it effectively results in better speed (up to 50% improvement can be expected)

* please report your configuration to upstream (https://groups.google.com/forum/#!forum/lz4c)

* so that an automatic detection macro can be added to mainline.

/* #define CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS 1 */

Each time a CPU is known to efficiently handle unaligned memory access, its standard detection macro can be added to the list, in order to automatically use the faster code path.

Is it any better than previous method ?
Well, sort of.

To begin with, there is no longer a need to fiddle with the different "pack" instructions depending on compilers and versions. This is one less complexity to manage, and dependency to worry.
Getting formally in charge of unaligned access allows introduction of dedicated functions. For example, a more adapted "string comparison" function could be introduced.

More importantly, instead of crashing when the detection fail, the library will now run slower, but still run correctly. But it introduces some new risk : many users may simply not notice the slow down, and just use the library "as is", unaware of the latent performance improvement which could be unleashed. The hope is, as long as a few contributors can detect and report the performance issue, the situation can be solved for everybody with similar setup.

Latest version of LZ4 using these new detection routines is accessible in the feature branch "AlignEndian" : https://github.com/Cyan4973/lz4/tree/AlignEndian

It's possible to compare it with latest "official" release r124. On x86/x64 CPU, it was benchmarked and proved to provide similar performance. On other CPU though, it's still worthwhile to check.

16 comments:

pm215November 26, 2014 at 12:53 AM
Unfortunately if you lie to the compiler about the alignment of your data then you may still encounter alignment faults even on recent ARM CPUs. The support for loading from unaligned addresses applies to simple load and store instructions, not to every memory accessing instruction the CPU has. If the compiler is sufficiently optimising to notice that it could combine multiple memory accesses into a single LDRD/STRD or Neon load/store insn then it will fault if they're not aligned correctly.

In the long run I think it's better to write portable code and get the compilers fixed if they generate particularly poor code in certain cases.

Have you investigated how good the codegen is for doing unaligned accesses via memcpy?

uint16_t r;
memcpy(&r, unaligned_ptr, sizeof(r));
return r;

This isn't as daft as it sounds, because compilers do know how to inline small memcpy and should emit the correct simple unaligned load/store insn.
ReplyDelete
Replies
AnonymousNovember 26, 2014 at 11:49 AM
I presume you saw the Debian bug report on this issue
(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=757037) but I
wonder if you saw this thread also?
https://www.marshut.net/ksqnvn/architectures-where-unaligned-access-is-not-ok.html

It contains the disheartening quotes: "liblzo2 seems to be one of
these codebases that bases its idea of how C works on portability
folklore and the assumption that the compiler and standard C library
are the most naive implementation possible :-(" and "sounds like
someone thought C was an assembler and not a language with quite tight requirements on what you are allowed to do. "

Sadly, I fear they are right. The world has changed. You are still
viewing C as the proverbial "thin layer over assembly", which allows
you to write portable code that can be compiled to run on processors
with different instruction sets and capabilities. In this worldview,
the compiler takes the source code you write and translates the
operations into equivalent assembly appropriate for the target,
possibly with some optimizations that do not change the outward
behaviour of the code.

Unfortunately, in the minds of current the compiler maintainers, this
is but a quaint fairytale. Instead, modern C as defined by the
current specs is a typed language with significant restrictions on
what's legal, and if you ask it to do something the spec defines as
"undefined", rather than generating the code you asked for (that would work just fine on the processor you know you are compiling for) the compiler will probably just pretend that code wasn't there. And when it does this, you are supposed to be grateful that it was kind enough not to format your hard drive.

(continued)
ReplyDelete
Replies
AnonymousNovember 26, 2014 at 11:52 AM
Unbelievable as it may seem to someone who views C as a low-level
systems language that let's you write efficient code that takes
advantage of the capabilities of modern processors, modern C simply
does not provide any legal way to read or write potentially unaligned
multibyte variables except byte-by-byte. Those optimizations which
double your speed on processors with fast unaligned access? Just a
temporary oversight on the part of the compiler, a bug waiting to happen, and eventually your comeuppance will be served. Seems vicious, but see Mike Stump's comments as a GCC maintainer in this (really long but insightful) thread: http://compgroups.net/comp.arch/if-it-were-easy/2993157

The only legal way in C to do what you want is to write something like the byte-oriented code that you have done or to use memcpy() (which is part of the C standard at this point, and thus fair game for compiler optimization/omission). In either case, the 'fast path' macro that has so far given you good performance simply must go. If the compiler gods smile on you, they will optimize your seemingly slower code to run the way you had intended. If not (as has been your experience), presumably you were worshipping the wrong version of the compiler god. Upgrade and try again. Or try a different god. But whatever you do, if you value your hard drive and aren't wearing nose plugs, don't have the hubris to assume you are the one telling the compiler what to do.

Personally, I'm coming to the conclusion that the only reliable way to obtain the level of performance you are seeking is to write in assembly, distribute compiled code, or both. Write legal C that will provide a variable level of reasonably good performance, and then write optimized versions of functions in assembly that can be used on recognized processors. You don't even have to write the assembly yourself --- it's perfectly fine to find a compliant compiler, try out every optimization option possible, verify that it's generating
correct performant code, and lock it down to binary or assembly. If binary, distribute for all the major formats or point people to Agner Fog's converter (http://www.agner.org/optimize/#objconv). If assembly source, NASM is probably your most portable bet. One day there might be a viable intermediate code representation you could use instead (like LLVM IR:
https://idea.popcount.org/2013-07-24-ir-is-better-than-assembly/), but
I don't think it's there yet.
ReplyDelete
Replies
grailsDecember 18, 2014 at 8:41 AM
I read on link below that memory alignment is not significant. Is it just depends on study and testing?

http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/
ReplyDelete
Replies
Willy TarreauApril 29, 2015 at 3:39 PM
Hi Yann,

I just stumbled across your very interesting post. I recently found that I could shrink enums to 8-bits only by appending the packed attributed to them, and that it worked for unions as well, but I never thought about using that for unaligned data! Thus I went a bit further based on your idea and used a union instead :

typedef union {
unsigned char u8;
signed char s8;
unsigned short u16;
signed short s16;
unsigned int u32;
signed int s32;
unsigned long long u64;
signed long long s64;
} __attribute__((packed)) packed;

That way I can dereference pointers with the type I need.
Eg: ((packed *)x)->u32

I verified that the code is correct and identical to a regular load on i386, x86_64, armv7-a, armv7-a in thumb mode, and produces lots of byte loads on armv5 which doesn't support unaligned accesses. I was surprized to discover that mips32 does support unaligned accesses. The memcpy() equivalent code is more or less optimized on mips32 (adds stupid add/sub on sp), and the memcpy() on armv5 is not optimized at all (does a byte copy to a local variable).

What I like with this approach is that it relies on the compiler's knowledge of the target and uses it exactly as it defines the way to access structure members. If it fails on one arch, it will require a bug report. And contrary to another post above, I don't like to start to suspect the compiler of sometimes being wrong and avoiding what it provides "just in case". I'm fine with working around known bugs but not supposed ones.

You were speaking about performance, I found unaligned accesses to be not really fast on atom in 32-bit mode, accessing a 32-bit unaligned word is as slow as accessing the 4 bytes separately there.
ReplyDelete
Replies

Add comment