Wednesday, November 24, 2010
Looking for performance on modern CPU architectures? Then you can't avoid taking memory cache effects into account.
The logic behind caching is pretty simple: access to main memory is (relatively) slow, so the CPU keeps a copy of part of it close by, for faster access.
At the lowest level of this strategy, you could consider registers as a kind of "level 0 cache". Registers sit inside the CPU, and some of them can even be used directly as operands for comparisons or arithmetic operations, even if the set of such operations is limited. As an example, the old Saturn processor, used in several HP handheld calculators, had 4 main registers and 4 store registers. Data in these registers is accessed swiftly and can be manipulated easily. But this is not transparent: you have to explicitly load data into them and save it back, so the term "cache" is quite dubious here.
At the next level, virtually all modern processors now feature an L1 cache. This is true not only of PC processors: even low-power CPUs for the embedded market, such as ARM, feature one. This time, it truly holds a copy of main memory, through an (almost) transparent mechanism which simply ensures that recently accessed data remains close to the processor.
L1 caches are typically very fast, with access times of around 5 cycles or less. They are, however, relatively small, storing just a few tens of kilobytes (a Core 2 Duo, for example, features 2x32KB of L1 cache per core, one for instructions and one for data). L1 caches are typically considered part of the CPU.
Going up another level, most PC processors also feature an L2 cache. This time, we can really say that data sits "close to" the processor, as L2 caches are rarely "inside" the CPU. They are, however, sometimes part of the same package.
L2 caches are slower (access times in the range of 20 cycles) and larger (typically between 0.25 and 8MB). So data that does not fit into L1 is likely to be found in this second, larger cache.
A few processors also feature an L3 cache, but I won't elaborate on that. Suffice it to say that it is an even larger and slower cache.
Then you have main memory. Nowadays, this memory tends to be quite large, at several GB per machine. It is therefore several orders of magnitude larger than the caches, but also much slower. Performance figures vary a lot depending on the architecture, but we can venture a typical access time of 100-150 cycles.
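If you want to see these levels on your own machine, one possible approach is a pointer-chasing loop over buffers of growing size. The sketch below is only an illustration, not a rigorous benchmark: the names next_random, build_cycle, chase and ns_per_access are hypothetical helpers of mine, the random order is there to make life hard for hardware prefetching, and the result is in nanoseconds of CPU time (as reported by clock()) rather than in cycles.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Small pseudo-random generator (64-bit LCG). Its quality only needs to be
 * good enough to scatter the accesses across the buffer. */
static uint64_t rng_state = 12345;

static uint64_t next_random(void)
{
    rng_state = rng_state * 6364136223846793005ULL + 1442695040888963407ULL;
    return rng_state >> 33;
}

/* Fill next[] with a single random cycle over n slots (Sattolo's algorithm):
 * following next[i] from slot 0 visits every slot once before coming back. */
void build_cycle(size_t *next, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(next_random() % i);   /* j is in [0, i-1] */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the cycle for `steps` hops and return the final index.
 * Each load depends on the previous one, so the loop time is dominated
 * by access latency rather than by bandwidth. */
size_t chase(const size_t *next, size_t steps)
{
    size_t i = 0;
    while (steps--)
        i = next[i];
    return i;
}

/* Average time per access, in nanoseconds, for a working set of `bytes`.
 * Returns -1.0 if the buffer is too small or cannot be allocated. */
double ns_per_access(size_t bytes, size_t steps)
{
    size_t n = bytes / sizeof(size_t);
    if (n == 0 || steps == 0)
        return -1.0;
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL)
        return -1.0;
    build_cycle(next, n);

    clock_t t0 = clock();
    volatile size_t sink = chase(next, steps);    /* keep the result alive */
    clock_t t1 = clock();
    (void)sink;

    free(next);
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps;
}
```

Calling ns_per_access with, say, 16KB, 1MB and 64MB working sets and a few million steps should give three noticeably different figures on a typical PC, one per level described above. The exact values depend heavily on the machine, the compiler flags and the resolution of clock(), so treat them as orders of magnitude only.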
So there you have the basic principles: very recently accessed data is re-accessed swiftly thanks to the L1 cache (5 cycles), slightly less recent data is found in the L2 cache (20 cycles), and everything else has to come from main memory (150 cycles). As a consequence, making sure that the data you need stays in cache as much as possible makes a terrific difference to performance.
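To put a number on that difference, here is a back-of-the-envelope calculation using the round figures above (5, 20 and 150 cycles). It is a deliberately simplified model of my own: the function name average_access_cycles and the hit rates are assumptions chosen for illustration, and each access is charged only the latency of the level that ends up serving it.

```c
#include <assert.h>

/* Illustrative latencies in cycles, taken from the round figures in the text. */
#define L1_CYCLES   5.0
#define L2_CYCLES   20.0
#define RAM_CYCLES  150.0

/* Hypothetical helper: expected cost of one memory access, in cycles.
 *   l1_hit: fraction of all accesses served by L1 (0..1)
 *   l2_hit: fraction of L1 misses served by L2     (0..1)
 * Whatever misses both caches goes to main memory. */
double average_access_cycles(double l1_hit, double l2_hit)
{
    assert(l1_hit >= 0.0 && l1_hit <= 1.0);
    assert(l2_hit >= 0.0 && l2_hit <= 1.0);

    double l1_miss = 1.0 - l1_hit;
    return l1_hit * L1_CYCLES
         + l1_miss * l2_hit * L2_CYCLES
         + l1_miss * (1.0 - l2_hit) * RAM_CYCLES;
}
```

With a 95% L1 hit rate and 80% of the misses caught by L2, average_access_cycles(0.95, 0.80) gives about 7 cycles per access. Drop the L1 hit rate to 80% and the same call gives about 13 cycles, nearly twice as slow, even though most accesses still hit a cache.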
The rule seems simple enough as it is, but there is more to it. Understanding exactly how a memory cache works is key to crafting high-performance code. So in a later note, we'll study its mechanism in depth.