Saturday, January 29, 2011

ARM Programming (part 2)

Now that the context has been set, we can move on to the hard part: what is special about programming an ARM processor?

ARM processors are used in low-power devices, such as hand-held terminals, where battery life is a key property to protect. Consequently, ARM accepted some sacrifices in order to save a lot of transistors, which translates into lower power usage.

As a general requirement, programs for low-power devices must also keep battery life in mind and be designed in a power-conscious manner. Hence, trade-offs will almost always favor memory savings and speed over raw precision.
It's possible to use high-level languages on ARM, such as Lua or Java. Obviously, these languages try to mask ARM's shortcomings by providing a universal interface to the programmer. But that's also why it's so difficult to optimize them.
So we'll settle on the C programming language, which is much closer to the metal.

Aligned data
The most critical restriction to consider when designing an ARM-compatible program is the strong requirement on aligned data. What does that mean?
The smallest memory element that can be manipulated is a byte, as on most other processors of the last 30 years (Saturn being a very rare exception). Reading and writing a byte can be done anywhere in memory.
However, things change when dealing with short types (16 bits, 2 bytes).
Such data must be read or written at even addresses only (0x60 is okay, 0x61 is not).
The same restriction applies when dealing with long types (32 bits, 4 bytes). Such data must be written to or read from a memory address which is a multiple of 4 (0x60 is okay; 0x61, 0x62 and 0x63 are not).

Failing this condition will almost certainly result in a crash, since most OSes choose not to handle these exceptions, for performance reasons.
As long as you are dealing with structured data, there is no issue: the compiler will take care of this for you.
Now, when dealing with a data stream, this is no longer the compiler's job: you have to make sure that every read or write operation respects this restriction.

Guess what? A file is just such a giant data stream.

As a consequence, making your file format ARM-friendly may require changing it. Without this, PC-ARM format compatibility is not guaranteed.
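
When the file format cannot be changed, one common workaround is to never issue a multi-byte load on the stream at all. The sketch below is only an illustration, not code from the original project: the function names are hypothetical, and it assumes the stream stores 32-bit values in little-endian order.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit little-endian value from any address, aligned or not.
   Only byte accesses are performed, so no alignment fault can occur. */
static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Write a 32-bit value in little-endian order to any address. */
static void write_u32_le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Alternative: copy the bytes into a properly aligned local variable.
   The result is in the byte order of the host processor. */
static uint32_t read_u32_native(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

With these helpers, `read_u32_le(buf + 1)` is safe even though `buf + 1` is an odd address. The byte-by-byte version also makes the file format independent of the processor's endianness, at the cost of a few extra instructions per access.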

Forget about float
Programming with float is indeed supported by ARM compilers. But this is just a trick: on ARM cores without an FPU, which is the case for the devices considered here, the hardware does not really support it, so a small software routine performs the calculation for you.
This has an obvious drawback: performance will suffer greatly.

Therefore, favor your own "fixed point" calculation instead, using a 32-bit long as a container for your format. A 16/16 split (16 bits for the integer part, 16 bits for the fractional part) is quite easy to come up with, but you may need another distribution, such as 22/10. Don't hesitate to select the most suitable format for your program.

To give an example, I made a simple DCT implementation several months ago, using float (or double), as a "reference" starting point. It ran at a speed of one frame per second.
I then simply replaced the "float" type with a fixed-point implementation. This new version needed only 20 ms to do the same job. A 50x speedup is a large enough performance delta to be taken seriously.
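
For readers who have never written one, here is a minimal sketch of what a 16/16 fixed-point type could look like. It is not the code used in the DCT experiment above; the type and function names are hypothetical, and it assumes the compiler provides a 64-bit integer type for the intermediate results.

```c
#include <stdint.h>

/* 16/16 fixed point: the real value is the stored integer divided by 65536. */
typedef int32_t fix16;

#define FIX16_SHIFT 16
#define FIX16_ONE   ((fix16)1 << FIX16_SHIFT)   /* represents 1.0 */

/* Convert an integer to fixed point (valid for -32768 <= i <= 32767). */
static fix16 fix16_from_int(int32_t i)
{
    return (fix16)(i * FIX16_ONE);
}

/* Keep only the integer part of a non-negative fixed-point value. */
static int32_t fix16_to_int(fix16 x)
{
    return x >> FIX16_SHIFT;
}

/* Addition and subtraction are plain integer operations on fix16.
   Multiplication needs a wider intermediate, because the raw product
   carries 32 fractional bits instead of 16. */
static fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> FIX16_SHIFT);
}

/* Division: pre-scale the dividend so the quotient keeps 16 fractional bits.
   The caller must make sure b is not zero. */
static fix16 fix16_div(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * FIX16_ONE) / b);
}
```

For instance, `fix16_mul(fix16_from_int(3), FIX16_ONE / 2)` yields the representation of 1.5. Switching to a 22/10 split would only require changing `FIX16_SHIFT` to 10, trading fractional precision for a larger integer range.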

Cache is your (little) friend
Ensuring that data you need is fetched from cache rather than main memory is key to the performance of your application, and therefore to its impact on battery life.
Compared with a PC, cache is a scarce resource on ARM. While modern x86 CPUs tend to have multi-megabyte Level 2 caches on top of Level 1 caches, with ARM you end up with just a Level 1 cache, and generally a small one (sizes vary depending on the implementation; look at the technical documentation of your model, 8KB or 16KB being very common).
Making sure your frequently accessed data stays in this cache will provide a terrific performance boost.
As a consequence, a real difficulty is that your data set has to remain small enough to fit in the cache. This can dramatically change your algorithm trade-offs compared with a PC version.
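
One way to keep the working set small is to process a large buffer block by block, finishing all the work on one block before touching the next. The sketch below illustrates the idea with two trivial passes standing in for real processing steps. The function names and the 4KB block size are assumptions chosen to sit below an 8KB cache, not measured values.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a block comfortably smaller than an 8KB Level 1 data cache. */
#define BLOCK_BYTES 4096u

/* Two full sweeps: on a large buffer, the first bytes may already have
   been evicted from the cache when the second sweep starts. */
static void process_two_sweeps(uint8_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++) data[i] = (uint8_t)(data[i] + 1);
    for (size_t i = 0; i < n; i++) data[i] = (uint8_t)(data[i] ^ 0x55);
}

/* Same result, but both passes run on one block while it is still cached. */
static void process_blocked(uint8_t *data, size_t n)
{
    for (size_t start = 0; start < n; start += BLOCK_BYTES) {
        size_t end = start + BLOCK_BYTES;
        if (end > n) end = n;
        for (size_t i = start; i < end; i++) data[i] = (uint8_t)(data[i] + 1);
        for (size_t i = start; i < end; i++) data[i] = (uint8_t)(data[i] ^ 0x55);
    }
}
```

Both functions produce identical output; only the memory access pattern differs. Whether the blocked version is actually faster, and which block size works best, depends on the cache of the target model and should be measured on the device.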

Plenty of other performance optimization tips also remain valid, such as "read I/O 32 bits at a time instead of 8 bits", but the ones above are pretty much the most important ARM-specific ones.
I have to thank int13 for providing me the opportunity to have a quick peek into this field.


  1. What was that hardware that didn't have an FPU? An original Palm Pilot? ;-)

    Smartphones have FPUs and even SIMD units. The 25 USD Raspberry Pi uses the cheapest CPU you'd care to support, and it's an ARM11 with a hardware (scalar) FPU. The recommended OS uses the hardfp eabi to pass floats to functions in registers. The CPU also supports unaligned loads and stores of multibyte values.

    Now, it's true that someone might want to decompress LZ4 on an Atmel AVR, but I don't think the mainline codebase should support 8bit microcontrollers.

    1. Indeed, things have changed for the better these last few years.
      When writing this article, I was thinking of embedded ARM cores on board old phones and toys (all of them pre-iPhone). I don't remember the exact model today.

      But it's true that since then, aligned data and FPU performance have become less of a concern with modern versions.
      I guess cache size remains an important factor to consider though.

  2. There are LOTS of M0, M0+, and M3 ARM processors out there. None of these have an fpu.