Saturday, January 29, 2011

ARM Programming (part 2)

Now that the context has been presented, we can move on to the tough part: what's special about programming an ARM processor?

ARM processors are used in low-power devices, such as hand-held terminals, where battery life is a key property to protect. Consequently, ARM accepted some sacrifices in order to save a lot of transistors, which translates into lower power usage.

As a general requirement, programs for low-power devices must also keep the battery-life requirement in mind and be designed in a power-conscious manner. Hence, trade-offs will almost always favor memory savings and speed over raw precision.
It's possible to use high-level languages on ARM, such as Lua or Java. Obviously, these languages try to mask ARM's shortcomings, providing a universal interface to the programmer. But that's also why it's so difficult to optimize them.
So we'll settle on the C programming language, which is much closer to the metal.

Aligned data
The most critical restriction to consider when designing an ARM-compatible program is the strong requirement on aligned data. What does that mean?
The smallest memory element that can be manipulated is a byte, as on most other processors of the last 30 years (Saturn being a very rare exception). Reading and writing a byte can be done anywhere in memory.
However, things change when dealing with short types (16 bits, 2 bytes).
Such data must be read or written at even addresses only (0x60 is okay, 0x61 is not).
The same kind of restriction applies to long types (32 bits, 4 bytes). Such data must be read from or written to a memory address which is a multiple of 4 (0x60 is okay; 0x61, 0x62 and 0x63 are not).

Failing to meet this condition will almost certainly result in a crash, since most OSes don't handle these exceptions, for performance reasons.
As long as you are dealing with structured data, there is no issue: the compiler takes care of this for you.
When dealing with a data stream, however, this is no longer the compiler's job: you have to make sure that every read or write operation respects this restriction.

Guess what? A file is just such a giant data stream.

As a consequence, making your file format ARM-friendly may require changing it. Without this, PC-ARM format compatibility is not guaranteed.
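
To illustrate the data stream case, here is a minimal sketch of two ways to read a 32-bit value from an arbitrary position in a byte buffer without ever issuing a misaligned 32-bit load. The function names read_u32 and read_u32_le are hypothetical, and the little-endian layout in the second one is an assumption about the file format, not something stated above.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value stored at any address, aligned or not.
   The bytes are copied into a properly aligned local variable,
   so the program never dereferences a misaligned uint32_t pointer.
   The value is interpreted in the byte order of the running CPU. */
static uint32_t read_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Read a 32-bit value stored in little-endian order (assumed file format).
   Only byte accesses are used, which are legal at any address,
   and the result does not depend on the byte order of the CPU. */
static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

What must be avoided is the tempting cast *(uint32_t *)(buffer + offset), which is only safe when offset happens to be a multiple of 4. The alternative is to design the file format so that every 16-bit and 32-bit field already sits at a suitably aligned offset.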

Forget about float
Programming with float is indeed supported by ARM compilers. But this is just a trick: the hardware does not really support it, so a small software routine performs the calculations for you.
This has an obvious drawback: performance suffers greatly.

Therefore, favor your own "fixed point" calculation instead, using a 32-bit long as a container for your format. 16/16 (16 bits for the integer part, 16 bits for the fraction) is quite easy to come up with, but you may need another distribution, such as 22/10 for example. Don't hesitate to select the most suitable format for your program.
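
As a rough illustration of the 16/16 layout, the sketch below stores a number as a 32-bit integer scaled by 65536. The type name fix16 and the helper names are hypothetical, chosen for this example only; changing FIX_SHIFT to 10 would give the 22/10 distribution instead.

```c
#include <stdint.h>

typedef int32_t fix16;                    /* 16 integer bits, 16 fractional bits */

#define FIX_SHIFT 16
#define FIX_ONE   ((fix16)1 << FIX_SHIFT) /* the value 1.0 in fixed point */

/* Convert a small integer to fixed point (caller keeps it within 16 bits). */
static fix16 fix_from_int(int n)
{
    return (fix16)(n * FIX_ONE);
}

/* Convert back to an integer, truncating the fraction toward zero. */
static int fix_to_int(fix16 a)
{
    return (int)(a / FIX_ONE);
}

/* Addition and subtraction are plain integer operations. */
static fix16 fix_add(fix16 a, fix16 b)
{
    return a + b;
}

/* Multiplication: the raw product carries 32 fractional bits,
   so it is computed in 64 bits and then scaled back down by FIX_ONE. */
static fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) / FIX_ONE);
}
```

For instance, multiplying fix_from_int(3) by FIX_ONE / 2 (that is, 0.5) gives FIX_ONE + FIX_ONE / 2, the representation of 1.5. Precision is limited to 1/65536 and the integer part to the range of a signed 16-bit value, which is the trade-off to weigh when choosing the distribution.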

To give an example, I wrote a simple DCT implementation several months ago, using float (or double), as a "reference" starting point. It ran at a speed of one frame per second.
I then simply replaced the "float" type with a fixed-point implementation. This new version needed only 20 ms to do the same job. A 50x speedup is a large enough performance delta to be taken seriously.

Cache is your (little) friend
Ensuring that data you need is fetched from cache rather than main memory is key to the performance of your application, and therefore to its impact on battery life.
Compared with a PC, cache is a scarce resource on ARM. While modern x86 CPUs tend to have multi-megabyte Level 2 caches on top of Level 1 caches, with ARM you end up with just a Level 1 cache, and generally a small one (sizes vary depending on the implementation; look at the technical documentation of your model, 8KB or 16KB being very common).
Making sure your frequently accessed data stays in this cache will provide a terrific performance boost.
As a consequence, a real difficulty is that your data set has to remain small enough to fit the cache size. This can dramatically change your algorithm trade-offs compared with a PC version.
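
One common way to keep the working set small is to process a large buffer block by block, running every stage on one block before moving to the next. The sketch below is only an illustration under assumptions: the 4 KB block size supposes an 8 KB data cache, and the two-stage job (subtract an offset, then sum the squares) is a hypothetical stand-in for real stages that could not simply be fused into one loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: an 8 KB L1 data cache, of which one block uses half. */
#define BLOCK_BYTES 4096
#define BLOCK_ELEMS (BLOCK_BYTES / sizeof(int32_t))

/* Run both stages on one cache-sized block at a time, so the second
   stage finds its input still in cache instead of in main memory. */
static int64_t process_blocked(int32_t *data, size_t n, int32_t offset)
{
    int64_t total = 0;
    size_t start, end, i;

    for (start = 0; start < n; start += BLOCK_ELEMS) {
        end = start + BLOCK_ELEMS;
        if (end > n)
            end = n;

        /* Stage 1: modify the current block in place. */
        for (i = start; i < end; i++)
            data[i] -= offset;

        /* Stage 2: reuse the same block right away. */
        for (i = start; i < end; i++)
            total += (int64_t)data[i] * data[i];
    }
    return total;
}
```

Running stage 1 over the whole buffer and then stage 2 over the whole buffer would produce the same result, but on a buffer larger than the cache each element would be fetched from main memory twice.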

A lot of other performance optimization advice is also valid, such as "read I/O 32 bits at a time instead of 8 bits", but the points above are pretty much the most important ARM-specific ones.
I have to thank int13 for giving me the opportunity to have a quick peek into this field.

Monday, January 24, 2011

LZ4 : World's fastest compressor

To my surprise, I learned this morning that a compression benchmark site, Stephan Busch's SqueezeChart, declared LZ4 the world's fastest compressor.
The final result: 6.4GB of data compressed in 39 seconds.
This is total time, and it says a lot about the underlying I/O system, since it means reading at 165 MB/s while simultaneously writing the compressed result at 115 MB/s, which implies either a RAID array or a fast SSD (Solid-State Disk).

To be fair, LZ4 is known to be fast, but I was not expecting such a result. Some other, more established compressors were expected to take the grail, most especially QuickLZ, if not LZO. But apparently it went to LZ4.
Could LZ4 be of interest as more than just a tool for studying compression? I'm wondering now.

Anyway, this is a nice little opportunity to brag a bit :)
You can grab LZ4 at its homepage.

[Edit]: SqueezeChart results have been independently confirmed by the Compression Ratings public benchmark.