RealTime Data Compression: Counting bytes fast

Saturday, September 27, 2014

Counting bytes fast - little trick from FSE

An apparently trivial and uninteresting task nonetheless received some special optimization care within FSE : counting the bytes (or 2-byte shorts when using the U16 variant).

It seems a trivial task, and could indeed be realized by a single-line function, such as this one (assuming the table is properly allocated and reset to zero) :

while (ptr<end) count[*ptr++]++;

And it works. So what's the performance of such a small loop ?
Well, when counting some random data, the loop performs at 1560 MB/s on the test system. Not bad.
(Edit : Performance numbers are measured on a Core i5-3340M @2.7GHz configuration. Benchmark program is also freely available within the FSE project)
But wait, data is typically not random, otherwise it wouldn't be compressible. Let's use a more typical compression scenario for FSE, with a distribution ratio of 20%. With this distribution, the counting algorithm works at 1470 MB/s. Not bad, but why does it run slower ? We are starting to notice a trend here.
So let's go to the other end of the spectrum, featuring highly compressible data with a distribution ratio of 90%. How fast does the counting algorithm run on such data ? As could be guessed, speed plummets, reaching a miserable 580 MB/s.

This is close to a 3x performance hit (1560 MB/s down to 580 MB/s) and, more importantly, it makes counting a sizable portion of the overall time to compress a block of data (remember that FSE targets speeds of 400 MB/s overall, so if counting alone costs that much, it drags the entire compression process down).

What is happening ? This is where it becomes interesting. This is an example of CPU write commit delay.

Because the algorithm writes into a table, this data can't be kept in registers. Writing to a table cell means the result must necessarily be written to memory.
Of course, this data is probably cached in L1, and a clever CPU will not suffer any delay for this first write.
The situation becomes tricky for the following byte. In the 90% distribution example, there is a high probability of counting the same byte twice in a row. So, when the CPU wants to add +1 to the appropriate table cell, write commit delay gets in the way. +1 means the CPU has to perform both a read and then a write at this memory address. If the previous +1 is still not entirely committed, the cache will make the CPU wait a bit more before delivering the data. And the impact is noticeable, as measured by the benchmark.
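A minimal sketch of what the one-line loop asks of the CPU for each byte, with the increment split into its load, add and store steps. The function name `count_naive_expanded` and the 32-bit counter type are assumptions made for illustration; the post does not show them.

```c
#include <stddef.h>
#include <stdint.h>

/* Same work as `while (ptr<end) count[*ptr++]++;`, with the
 * read-modify-write on the counter spelled out step by step.
 * The caller is expected to have zeroed `count`. */
void count_naive_expanded(const uint8_t *ptr, const uint8_t *end,
                          uint32_t count[256])
{
    while (ptr < end) {
        uint8_t  symbol = *ptr++;         /* load 1: the input byte        */
        uint32_t value  = count[symbol];  /* load 2: the current counter   */
        value += 1;                       /* the actual increment          */
        count[symbol] = value;            /* store: write the counter back */
    }
}
```

When two consecutive bytes are equal, the counter load of the second iteration targets the address that the first iteration has just stored to, so the two increments cannot overlap.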

So, how to escape this side-effect ?
A first idea is, don't read&write to the same memory address twice in a row. A proposed solution can be observed in the FSE_count() function. The core loop is (once cleaned) as follows :

Counting1[*ip++]++;
Counting2[*ip++]++;
Counting3[*ip++]++;
Counting4[*ip++]++;

The burden of counting bytes is now distributed over 4 tables. This way, when counting 2 identical consecutive bytes, they get added into 2 different memory cells, escaping write commit delay. Of course, if we have to deal with 5 or more identical consecutive bytes, write commit delay will still be there, but at least, the latency has been used counting 3 other bytes, instead of wasted.
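The sketch below fills in what the cleaned-up core loop leaves out: zeroing the four tables, handling input sizes that are not a multiple of 4, and summing the tables at the end. It is an illustration under assumptions (the name `count_4tables`, 32-bit counters, tables kept on the stack), not the actual FSE_count() source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: histogram of `size` bytes using 4 independent
 * tables, so that identical consecutive bytes hit different cells.
 * Overwrites `count` with the final totals. */
void count_4tables(const uint8_t *src, size_t size, uint32_t count[256])
{
    uint32_t c1[256] = {0}, c2[256] = {0}, c3[256] = {0}, c4[256] = {0};
    const uint8_t *ip = src;
    const uint8_t *const end  = src + size;
    const uint8_t *const end4 = src + (size & ~(size_t)3); /* multiple of 4 */

    while (ip < end4) {          /* main loop: 4 bytes per iteration */
        c1[*ip++]++;
        c2[*ip++]++;
        c3[*ip++]++;
        c4[*ip++]++;
    }
    while (ip < end)             /* tail: up to 3 remaining bytes */
        c1[*ip++]++;

    for (int s = 0; s < 256; s++)    /* regroup the partial results */
        count[s] = c1[s] + c2[s] + c3[s] + c4[s];
}
```

The four tables cost 4 KB with 32-bit counters, which still fits comfortably in a 32 KB L1 data cache.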

The function is overall more complex : more tables, hence more memory to init, special casing of non-multiple-of-4 input sizes, regrouping of all results at the end, so intuitively there is a bit more work involved in this strategy. How does it compare with the naive implementation ?

When counting random data, FSE_count() gets 1780 MB/s, which is already slightly better than the naive strategy. But obviously, that's not the target. It is when the distribution gets squeezed that it makes the most difference, with the 90% distribution being counted at 1700 MB/s. It's still being hit, but much less, and it proves much more stable overall.

With an average speed > 1700 MB/s, it may seem that counting is a fast enough operation. But it is nonetheless still the second contributor to overall compression time, gobbling up on its own approximately 15% of the budget. That's perceptible, and still a tad too much if you ask me for such a simple task. But barring another great find, it's the fastest solution I could come up with.

Edit :
Should you be willing to test some new ideas for the counting algorithm, you may find it handy to get the benchmark program which produced the speed results mentioned in this article. The program is part of the "test directory" within FSE project, as a single file named fullbench.c :
https://github.com/Cyan4973/FiniteStateEntropy/blob/master/test/fullbench.c

Edit 2 :
Thanks to recent comments, notably from gpd, Nathan, and Henry Wong, a new and better reason has been provided to explain the observed delay. Its name is store-to-load forwarding. I would like to suggest here reading the detailed explanation from Nathan Kurz, backed by his cycle-exact Likwid analysis, and the excellent article from Henry on CPU microarchitecture.
In a nutshell, while write commit delay used to be a problem, it should now be properly handled by the store cache on modern CPUs. However, it introduces some new issues, related to pipeline, serial dependency and prefetching, with remarkably similar consequences, except that the number of lost cycles at stake is much smaller.

Edit 3 :
Nathan Kurz provided an entry which beats the best speed so far, achieving 2010 MB/s on a Core i5-3340M @ 2.7 GHz. His entry is provided within the fullbench program (as algorithm 202), alongside a simplified version which achieves the same speed but is shorter (algorithm 201).
It's more than 10% better than the initial entry suggested in this blog, and so is definitely measurable.
Unfortunately, these functions use SSE 4.1 intrinsic functions, and therefore offer limited portability.

50 comments:

cbloom, September 27, 2014 at 6:06 PM
On load-hit-store architectures, even more count arrays is better. (I use 8)

On some architectures I believe you could store the whole histogram in a few large SIMD registers.
Anonymous, September 28, 2014 at 1:03 AM
Counting? There's an app for that!
jwatte_food, September 28, 2014 at 2:02 AM
Try keeping the counter for the last seen byte in a register and only adding to counter when you see another? Adds a branch, though.
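One way to read this suggestion, as a hedged sketch: keep the length of the current run of identical bytes in a local variable, and only touch the table when the byte changes. The name `count_runs` and the 32-bit counters are assumptions, and whether the extra branch pays off depends on how well it is predicted for the data at hand.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds the histogram of `src` to `count`
 * (the caller zeroes `count` first). A run of identical bytes is
 * accumulated in `run` and flushed to memory once, when it ends. */
void count_runs(const uint8_t *src, size_t size, uint32_t count[256])
{
    if (size == 0) return;

    uint8_t  last = src[0];   /* byte value of the current run   */
    uint32_t run  = 1;        /* length of the current run       */

    for (size_t i = 1; i < size; i++) {
        if (src[i] == last) {
            run++;                    /* same byte: no memory access */
        } else {
            count[last] += run;       /* run ended: flush it         */
            last = src[i];
            run  = 1;
        }
    }
    count[last] += run;               /* flush the final run */
}
```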
Anonymous, September 28, 2014 at 7:04 AM
I don't have full access to your measurement instrumentation, so My Mileage May Vary. But this function performed better than Counting4 using my (admittedly weak) performance instrumentation:

void cb_switch8(void const * const start, void const * const end, uint64_t counts[256])
{
    uint16_t *p8_start = (uint16_t *)start;
    uint16_t *p8_end = (uint16_t *)end;

    while (p8_start < p8_end)
    {
        switch (*p8_start)
        {
            // The switch8.cish file was generated by this python script:
            //     for i in xrange(2**8):
            //         print "case 0x%02x: counts[0x%02x]++; break;" % (i, i)
            // and so follows this pattern:
            //
            //     case 0x00: counts[0x00]++; break;
            //     ...
            //     case 0xff: counts[0xff]++; break;
            #include "switch8.cish"
        }
        p8_start++;
    }
}

Over 1E6 bytes of random data, the results are:

cb_naive: 2741 clicks [0.002741 s] // naive
cb_naive4: 2170 clicks [0.002170 s] // Counting4
cb_switch8: 1035 clicks [0.001035 s]

Please give it a try. I'd like to see if it is still an improvement when used in your code.
Anonymous, September 28, 2014 at 7:07 AM
Oh, wait. My typecasting to uint16_t is a bug. When corrected, the program is 8x slower than Counting4.
Anonymous, September 28, 2014 at 2:19 PM
How about using gpus?
Cyan, September 28, 2014 at 10:19 PM
Suggestion from Sebastian Egner :

Just an idea: Count 16-bit values formed by two bytes from the stream, e.g. consecutive. Then reduce the histogram to the byte values.
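A minimal sketch of this idea, under stated assumptions: pairs of consecutive bytes index a 65536-entry table, which is then folded back into 256 byte counters. The function name `count_via_pairs`, the heap-allocated scratch table and the return convention are all hypothetical. Note that with 32-bit counters the pair table takes 256 KB, which is larger than a 32 KB L1 data cache.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: overwrites `count` with the byte histogram of
 * `src`, computed through a histogram of 16-bit pairs.
 * Returns 0 on success, -1 if the scratch table cannot be allocated. */
int count_via_pairs(const uint8_t *src, size_t size, uint32_t count[256])
{
    uint32_t *pairs = calloc(65536, sizeof *pairs);
    if (pairs == NULL) return -1;

    size_t i = 0;
    for (; i + 2 <= size; i += 2)     /* one table update per 2 bytes */
        pairs[((unsigned)src[i] << 8) | src[i + 1]]++;

    memset(count, 0, 256 * sizeof count[0]);
    for (unsigned hi = 0; hi < 256; hi++) {
        for (unsigned lo = 0; lo < 256; lo++) {
            uint32_t n = pairs[(hi << 8) | lo];
            count[hi] += n;           /* first byte of the pair  */
            count[lo] += n;           /* second byte of the pair */
        }
    }
    if (i < size)                     /* odd size: one byte left over */
        count[src[i]]++;

    free(pairs);
    return 0;
}
```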
Anonymous, September 28, 2014 at 10:47 PM
Isn't store-to-load forwarding supposed to make this problem non-existent? Is that implemented in most modern processors?
gpd, September 29, 2014 at 3:21 PM
Simple idea: do multiple passes.

Divide the input data into L1-cache-sized subsets. Then do 256/K passes over the data. Each pass counts the appearances of a subset of K of the possible values, incrementing a local R_k counter for each; you want to keep K small so you can fit all the counters in registers. At the end of the pass you increment your table with the content of each counter.

You can load and test 16 bytes at a time by using an SSE compare against the Kth value (which is known statically), copy the resulting mask to a general purpose register and then do a popcnt.

Another possible optimization is to set the subset size to half of the L1 cache and prefetch the other half while performing computation on the first half.
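A portable scalar sketch of the multi-pass scheme, with plain comparisons swapped in for the SSE compare and popcnt so that it compiles anywhere. The name `count_multipass`, the choice K = 4 and the 16 KB block size are assumptions made for illustration; prefetching of the next block is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds the histogram of `src` to `count`
 * (the caller zeroes `count` first). Each block is scanned 256/4
 * times; every pass counts 4 byte values in local variables and
 * only touches the table once the pass is over. */
void count_multipass(const uint8_t *src, size_t size, uint32_t count[256])
{
    const size_t chunk = 16 * 1024;   /* assumed to fit in L1 */

    for (size_t off = 0; off < size; off += chunk) {
        const uint8_t *block = src + off;
        size_t len = (size - off < chunk) ? size - off : chunk;

        for (unsigned base = 0; base < 256; base += 4) {
            uint32_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;  /* register counters */
            for (size_t i = 0; i < len; i++) {
                unsigned b = block[i];
                c0 += (b == base);
                c1 += (b == base + 1);
                c2 += (b == base + 2);
                c3 += (b == base + 3);
            }
            count[base]     += c0;    /* one table update per value per pass */
            count[base + 1] += c1;
            count[base + 2] += c2;
            count[base + 3] += c3;
        }
    }
}
```

The inner loop has no dependency through memory at all, at the price of reading every input byte 64 times.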
Nathan Kurz, September 29, 2014 at 10:15 PM
Nice article, and interesting to ponder. I think your conclusions are right, and your solution good, but that the problem isn't actually due to any write commit delay. These used to be an issue, but with current store-to-load forwarding, they aren't a factor any more.

Instead, as 'gpd' suggests, the issue is with speculative loads. The processor tries to get ahead by reading ahead in the input stream and pre-loading the appropriate counter. It even speculatively executes the addition and issues the write. But before the write is retired, the counter preload is invalidated by a previous store. Usually no reload of the counter is necessary, since the updated value is forwarded directly to the appropriate register from the load, but the work done for the increment and store is tossed out (not retired) and the micro-ops are re-executed with the new value.

I think I can show this fairly directly using the CPU's performance counters. But when I tried to post the details here, your blog engine didn't want to accept them. And then it got into a loop where it kept remembering the original version, and would throw away my edits. So I sent a long message to you via your contact form, and am trying to post this shorter version from an anonymous browser window in the hope that I could at least get it to accept something.
Unknown, September 29, 2014 at 11:32 PM
Ok, I have played with this, and this is the best I could come up with using SSE4.1:

http://pastebin.com/CFQmxBPg

Compile it with "gcc -O2 -msse4.1 test.c". For random data the naive solution runs for 615 ms vs the vectorized for 347 ms, for zero filled data it is 1795 vs 352 ms.

I have also tried using 256 bit AVX registers where each bit would be used to count the number of occurrences of a specific byte, and use several AVX registers each of which would store the 0th, 1st, 2nd, etc bit of the count and use AND and XOR to update them all simultaneously. It was not faster.
Henry Wong, September 30, 2014 at 12:18 AM
That's a neat trick!

(Do you know if your clock speed of 2.7 GHz includes the effect of turbo boost? From your 1780MB/s number, I'd guess it was actually running at ~3.2 GHz).

I ran your code on an Ivy Bridge (i5-3570K, 4200 MHz), and got ~1.8 clocks per byte (2366 MB/s, -P0, 64-bit, gcc 4.4.3). Counting only the memory ops (2 loads + 1 store per byte processed) and ignoring the ALU ops, I would expect at best 1.5 clocks per byte, so you're already quite close (Ivy Bridge can do two memory ops per cycle, of which at most one can be a store).

I think some of the confusion with the comments about load-store forwarding is due to having called it a write "commit" delay. Load-store forwarding allows this delay to be ~5 clocks instead of waiting for commit (often multiple tens of cycles). But repeatedly incrementing the same location in memory still requires serializing all of the read-modify-write operations even with store-load forwarding, while accessing independent locations allows the increments to be pipelined.

I mostly disagree with the comments about store dependence speculation. While memory dependence speculation may enable the issue to be visible (by allowing independent accesses to proceed. Note your trivial code gets a load address from memory too.), and misspeculation may play a role, the effect you're seeing will exist even with perfect speculation. Increments to independent memory locations can happen at rate of one every 1.5 clocks, while dependent increments can't happen any faster than once every 6 cycles on Ivy Bridge. (This also explains why keeping groups of 4 independent increments should be enough to get back almost all of the lost performance: 6/1.5 = 4)

My guess is that memory dependence misspeculation is not playing a big role (at least for -P90), because the (second) load is "dependent" so often it will predict "dependent" all the time, leading to no flushes. Note also that the (first) load for address generation is never dependent, and that can also be correctly predicted.

I attempted software pipelining the address generation, and reducing the number of loads (in exchange for more ALU ops by loading two or 4 bytes then computing the addresses for the increments using shifts), and managed to squeeze a ~5% improvement on Ivy Bridge, but it got an even bigger decrease (~8%?) on AMD Bulldozer. :P

I've read Nathan's performance counter results, and I don't quite trust them. I'll comment on that in a separate comment.
Henry Wong, September 30, 2014 at 5:30 AM
For memory dependence misspeculations, there are performance counters specifically measuring those: MACHINE_CLEARS.MEMORY_ORDERING gives a count, and INT_MISC.RECOVERY_CYCLES gives number of cycles caused by machine clears in general. This counter says there are very few memory dependence misspeculations (Using 10k iterations as Nathan did):

For P90, I get ~2k out of 6.6B instructions retired,
For P20, I get 800k out of 6.6B instructions retired.
(I used VTune. It should be the same performance counters that likwid uses)

I did reproduce Nathan's observation that store data counts (port 4) got much higher with P90 on an Ivy Bridge. However, I don't really trust that particular counter (even though I do not know why it seems to be off).

- The number of store-data (Port 4) ops with P90 is too high. Every store-data needs a corresponding load, unless there is value prediction or some form of speculative memory bypassing (rather than store-load forwarding), neither of which are implemented, to my knowledge. You can't know what data to store until you've loaded and incremented the number. Yet for P90, there are more store-data ops executed than loads executed.

- For P0, I've also noticed the number of store-data ops to be slightly *lower* than the number of stores committed (by ~8%). You can squash executed ops, but you can't have a committed store without having executed it.

Ivy bridge doesn't have a separate store AGU, but Haswell does. Maybe the store AGU (Port 7) might show something different than the store-data port (Port 4)? In any case, I think I trust the "memory ordering machine clears" counter more than the increase in executed ops.
Nathan Kurz, October 1, 2014 at 10:21 PM
Hi Henry --

Nice insights. While I'm certainly often wrong, I think my numbers are correct this time. The big discrepancy you note between the number of loads and stores is true, but a result of my poor explanation rather than a problem with the counters. The counter I used for PORTS_23 is the number of cycles that a load happens on _either_ Port 2 or Port 3. I did it this way because there were only 4 PMC counters available, and I wanted to show everything in a single run. If you use separate counters, you get very close to double the number of loads as you would expect. I'm pretty sure that the Port 4 numbers are correct, as it eventually agrees with MEM_UOP_RETIRED_STORES as I increase the number of tables.

I do not think that MACHINE_CLEARS.MEMORY_ORDERING or INT_MISC.RECOVERY_CYCLES are the correct counters to pay attention to here. These are heavier weight exceptions, versus the light weight recovery of simply reissuing a store. You might look at Fabian's explanation of the first here: http://fgiesen.wordpress.com/2013/01/31/cores-dont-like-to-share/

You are right to suspect that there are some strange Port 7 issues happening. Particularly, while Haswell can calculate store addresses there, it can be difficult to make it happen. It turns out that Port 7 only supports fixed offset addressing (offset + base), so if the address is using (base + index * scale) addressing it needs to execute on Port 2 or 3. I was able to determine conclusively that this is a real bottleneck on Haswell: when I changed the code so IACA said it would execute on Port 7, I got a significant boost in speed. The bottleneck then became the 4 uop per cycle limit.

I ended up with two improved versions of Yann's code, one based on spreading things out to Port 7 on Haswell, and one based on reducing the number of loads by using _mm_loadu_si128() and then splitting into bytes. The first approach got me to 2200 MB/s on 3.4 GHz Haswell, and 2050 on 3.6 GHz Sandy Bridge. The second XMM approach got me up to 2450 MB/s on 3.4 GHz Haswell, and 2150 on 3.6 GHz Sandy Bridge.

I sent my code to Yann, and he'll probably post something after he looks at my code and figures out if I've made some silly mistake (certainly possible). There are still a number of details I don't understand, but the PMC numbers for these seem to make sense as well. I'm hoping that when Yann publishes (presuming no terrible logic problems with my code) someone with a deeper understanding (or knowledge of some assembly tricks) might be able to push things a bit faster.

Unknown, October 2, 2014 at 9:04 AM
I have a bunch of functions with a test program in a single file: https://github.com/powturbo/turbohist.
Maybe someone is interested and can test this on Haswell.
Unknown, October 3, 2014 at 3:35 PM
Tested on i7-2600k at 3.4 and 4.5GHz.
Now 1.3 clocks per symbol and >2500 MB/s.
Scalar function "hist_4_32" nearly as fast as the best SSE function.
Look at: https://github.com/powturbo/turbohist
Garen, October 7, 2014 at 1:37 AM
It looks like you're ultimately looking for the maximum value, right? In that case, do you need the other values?

Interestingly, I tried reading a machine word (32-bit or 64-bit integer) at a time, and 'slicing' it into bytes, and it had a slightly negative effect on my Intel i7-920 -- but a 20% performance boost on an AMD Phenom II.
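A hedged sketch of the word-slicing variant: one 32-bit load replaces four byte loads, and shifts and masks recover the table indexes. The name `count_word_sliced` is hypothetical, and pairing the slicing with four separate tables is an assumption borrowed from the article rather than something stated in this comment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: overwrites `count` with the histogram of `src`.
 * Byte order inside the word is irrelevant here, since every byte of
 * the word is counted exactly once whichever table it lands in. */
void count_word_sliced(const uint8_t *src, size_t size, uint32_t count[256])
{
    uint32_t c[4][256] = {{0}};
    size_t i = 0;

    for (; i + 4 <= size; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, sizeof w);   /* unaligned-safe 32-bit load */
        c[0][w & 0xFF]++;
        c[1][(w >> 8) & 0xFF]++;
        c[2][(w >> 16) & 0xFF]++;
        c[3][w >> 24]++;
    }
    for (; i < size; i++)                /* tail: up to 3 bytes */
        c[0][src[i]]++;

    for (int s = 0; s < 256; s++)
        count[s] = c[0][s] + c[1][s] + c[2][s] + c[3][s];
}
```

This trades load-port pressure for extra ALU work, which would be consistent with the gain or loss depending on the CPU, as reported in the comments.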
Unknown, October 13, 2014 at 5:58 PM
Congratulations and thanks to Nathan!
TurboHist improved and tested on Sandy Bridge.
"hist_8_32" w/o SIMD and w/o inline assembly is now nearly as fast as the fastest "count2x64". see: https://github.com/powturbo/TurboHist
Unknown, October 16, 2014 at 6:06 AM
Counting1[*ip]++;
Counting2[*(ip+1)]++;
Counting3[*(ip+2)]++;
Counting4[*(ip+3)]++;
ip += 4;
Mr Z, January 6, 2015 at 3:28 AM
FWIW, I remember encountering precisely this sort of problem several (about 17) years ago when I wrote the histogram kernel that's eventually evolved into the one found in C6x DSP's IMGLIB. I didn't work alone on this kernel—a few of us looked at how to solve the recurrence, and eventually we converged on this clever approach.

On that device, loads have 4 exposed delay slots (total latency 5), and there is no hardware forwarding between loads and stores. But, a load immediately following a store sees the preceding store's results. To write a correct count[idx]++ sequence in assembly, you end up with 7 cycle recurrence: Load, 4 dead cycles, Add, Store.

In the DSP's histogram kernel, we were able to maximize the memory bandwidth utilization (just shy of 1 bin update per cycle--8 updates in 9 cycles, IIRC) with just 4 separate histograms, and one level of "increment forwarding." That is, if there are two consecutive updates to the same bin in the same histogram, add an extra '1' to the second update, so you can move the next update's load above the previous update's store. (That may be possible on an x86; I've never tried.) The C6x makes it easy to forward an extra increment, as its compare instructions deposit 0 or 1 in a general-purpose register.

The resulting sequence to update a single histogram looks something like this in steady state, in pseudo-code:

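{ /* start of loop body */
idx0 = *p++;
cnt0 = histo[ idx0 ];          /* load for idx0, before idx1's store  */
histo[ idx1 ] = cnt1;          /* commit the previous update          */
cnt0 += 1 + (idx0 == idx1);    /* extra 1 if the load missed that store */
idx1 = *p++;
cnt1 = histo[ idx1 ];          /* load for idx1, before idx0's store  */
histo[ idx0 ] = cnt0;
cnt1 += 1 + (idx1 == idx0);
} /* end of loop body */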

Notice the stores for idx0 happen _after_ the loads for idx1, and vice versa. There's obviously some code before/after the loop to get the initial values of everything into a proper state, and to fix things up at the end.
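A runnable C sketch of increment forwarding on a single histogram, including the setup before the loop and the fix-up after it. It is written from the description in this comment and is not the IMGLIB kernel; the name `hist_forward` and the 32-bit counters are assumptions. On a compiler-scheduled x86 target it only demonstrates the logic, not the DSP scheduling.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds the histogram of `p[0..n)` to `histo`.
 * The store for each update is delayed by one step, so the load of
 * the next bin is issued before the previous bin is written back. */
void hist_forward(const uint8_t *p, size_t n, uint32_t histo[256])
{
    if (n == 0) return;

    /* Prologue: first update is computed but not yet stored. */
    unsigned prev_idx = p[0];
    uint32_t prev_cnt = histo[prev_idx] + 1;

    for (size_t i = 1; i < n; i++) {
        unsigned idx = p[i];
        uint32_t cnt = histo[idx];      /* load BEFORE the pending store */
        histo[prev_idx] = prev_cnt;     /* now commit the previous update */
        /* If both updates hit the same bin, the load above missed
         * exactly one count: forward it as an extra increment. */
        cnt += 1 + (idx == prev_idx);
        prev_idx = idx;
        prev_cnt = cnt;
    }

    histo[prev_idx] = prev_cnt;         /* epilogue: commit the last update */
}
```

The forwarded value is always exactly 1, because every store except the single pending one has already reached the table when the load is issued.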

The architecture has a second curveball it throws at you. The C6x CPU can process two memory accesses per cycle (which is why it can average one bin update per cycle), but most family members use single-ported memory. To allow two simultaneous accesses in the same cycle, it divides the memory into multiple banks, selected by the LS-bits of the address. If two memory accesses go to the same bank, you get a bank hit. (1 cycle stall.) On the C64x and later, it divides memory into 8 x 32-bit banks.

So, the four histograms are also interleaved. That is, histogram 0 goes to banks 0 and 4 (assuming a C64x or later device), histogram 1 goes to banks 1 and 5, histogram 2 goes to banks 2 and 6, and histogram 3 goes to banks 3 and 7. This ensures 100% bank-conflict free computation.
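The interleaving can be pictured with a small layout sketch. It assumes 32-bit counters, 8 banks that are each 32 bits wide, and a table that starts on a 32-byte boundary, so that the bank of a counter is its word index modulo 8. The names `bank_of` and `count_interleaved` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bank holding counter `bin` of histogram `k` (0..3) when the four
 * histograms are interleaved as h[bin * 4 + k]. */
unsigned bank_of(unsigned bin, unsigned k)
{
    return (bin * 4 + k) % 8;    /* k for even bins, k + 4 for odd bins */
}

/* Hypothetical helper: overwrites `count` with the histogram of `src`,
 * using four histograms interleaved in a single array. */
void count_interleaved(const uint8_t *src, size_t size, uint32_t count[256])
{
    uint32_t h[256 * 4] = {0};
    size_t i = 0;

    for (; i + 4 <= size; i += 4) {
        h[src[i]     * 4 + 0]++;     /* histogram 0: banks 0 and 4 */
        h[src[i + 1] * 4 + 1]++;     /* histogram 1: banks 1 and 5 */
        h[src[i + 2] * 4 + 2]++;     /* histogram 2: banks 2 and 6 */
        h[src[i + 3] * 4 + 3]++;     /* histogram 3: banks 3 and 7 */
    }
    for (; i < size; i++)            /* tail: up to 3 bytes */
        h[src[i] * 4]++;

    for (unsigned s = 0; s < 256; s++)
        count[s] = h[s * 4] + h[s * 4 + 1] + h[s * 4 + 2] + h[s * 4 + 3];
}
```

With this layout, two updates that belong to different histograms can never land in the same bank, whatever the bin values are.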

Scheduling that assembly was a chore, but we made it work.

Getting back to x86, which most people are concerned with...

As I recall, the original Pentium had an LS-banked L1D also (8 x 16-bit banks if memory serves), to allow the U and V pipe to access memory in parallel. I don't know if bank conflicts are a big factor in the modern L1Ds, but if you have a strongly correlated input set and the L1D is single-port LS-banked, then maybe it does matter.

In any case, histogram is an old, dear frenemy of mine.