RealTime Data Compression: xxHash for small keys: the impressive power of modern compilers

Wednesday, March 14, 2018

xxHash for small keys: the impressive power of modern compilers

Several years ago, xxHash was created as a companion error detector for LZ4 frame format. The initial candidate for this role was CRC32, but it turned out being several times slower than LZ4 decompression, nullifying one of its key strength.

After some research, I found the great murmurhash, by Austin Appleby, alongside its validation tool SMHasher . It nearly fitted the bill, running faster than LZ4, but still a bit too close to my taste. Hence started a game to see how much speed could be extracted from a custom hash formula while preserving good distribution properties.

A lot of things happened since that starting point, but the main take away from this story is : xxHash was created mostly as a checksum companion, digesting long inputs.

Fast forward nowadays, and xxHash is being used in more places than it was originally expected. The design has been expanded to create a second variant, XXH64, which is successful in video content and data bases.
But for several of these uses cases, hashed keys are no longer necessarily "large".

In some cases, the need to run xxHash on small keys resulted in the creation of dedicated variants, that cut drastically through the decision tree to extract just the right suite of operations for the desired key. And it works quite well.

That pushes the hash algorithm into territories it was not explicitly optimized for. Thankfully, one of SMHasher's test module was dedicated for speed on small keys, so it helped to pay attention to the topic during design phase. Hence the performance on small key is correct, but the dedicated function push it to another level.

Let's analyse the 4-byte hash example.
Invoking the regular XXH32() function on 4-bytes samples, and running it on my Mac OS-X 10.13 laptop (with compilation done by llvm9), I measure 233 MH/s (Millions of hashes per second).
Not bad, but running the dedicated 4-bytes function, it jumps to 780 MH/s. That's a stark difference !

Let's investigate further.
xxHash offers an obscure build flag named XXH_PRIVATE_API. The initial intention is to make all XXH_* symbols static, so that they do not get exposed on the public side of a library interface. This is useful when several libraries use xxHash as an embedded source file. In such a case, an application linking to both libraries will encounter multiple XXH_* symbols, resulting in naming collisions.

A side effect of this strategy is that function bodies are now available during compilation, which makes it possible to inline them. Surely, for small keys, inlining the hash function might help compared to invoking a function from another module ?
Well, yes, it does help, but there is nothing magical. Using the same setup as previously, the speed improves to 272 MH/s . That's better, but still far from the dedicated function.

That's where the power of inlining can really kick in. In the specific case that the key has a predictable small length, it's possible to pass as length argument a compile-time constant, like sizeof(key), instead of a variable storing the same value. This, in turn, will allow the compiler to make some drastic simplification during binary generation, through dead code removal optimization, throwing away branches which are known to be useless.
Using this trick on the now inlined XXH32(), speed increases to 780 MH/s, aka the same speed as dedicated function.

I haven't checked but I wouldn't be surprised if both the dedicated function and the inlined one resulted in the same assembly sequence.
But the inlining strategy seems more powerful : no need to create, and then maintain, a dedicated piece of code. Plus, it's possible to generate multiple variants, by changing the "length" argument to some other compile-time constant.

object	XXH32()	XXH32 inlined	XXH32 inlined + length constant	dedicated XXH32 function
4-bytes field	233 MH/s	272 MH/s	780 MH/s	780 MH/s

Another learning is that inlining is quite powerful for small keys, but the XXH_PRIVATE_API build macro makes a poor job at underlining its effectiveness.

As a consequence, next release of xxHash will introduce a new build macro, named XXH_INLINE_ALL. It does exactly the same thing, but its performance impact is better documented, and I suspect the name itself will make it easier for developers to anticipate its implications.

8 comments:

Jonatas EstevesMarch 17, 2018 at 12:09 AM
Hi Cyan, thanks for another great read. Love your blog.
On the 7th paragraph you say you compiled with llvm9, I guess you mean 6? Unless you have already invented a time-machine and are actually bringing technology from the future. Given your track-record as a dev, I wouldn't doubt.
ReplyDelete
Replies
UnknownMarch 21, 2018 at 12:22 AM
Nice post (and awesome blog, I loved the 'Huffman revisited' series!).

Questions:

a) why crc32 turned out to be slower? IIRC there are dedicated crc32 instructions for both Intel and ARMv8-a (the new crypto instructions).

b) or it an issue of better distribution/less collisions?

Cheers

Adenilson
ReplyDelete
Replies
UnknownMarch 21, 2018 at 7:59 PM
Thanks for the reply.

Indeed, portability is a primary concern and things were different way back in 2011.

In a side note, the ARMv8-a instruction is compatible with zlib's crc32 (i.e. uses the same polynomial), we are using it in Chromium's zlib (shipping in the upcoming Chrome M66).

Cheers

Adenilson
ReplyDelete
Replies
RohanMarch 4, 2022 at 12:01 AM
Hi Cyan
Its always a pleasure to read your blogs and follow along. I was doing some experiments with your amazing XXH3 implementation.
I had one small question

In your blog you mention that with inlining and passing the length of the key to hash as a compile time constant, you achieve speeds closer to a dedicated hash function(which is around 3-4x faster) and hence there is no need to maintain a dedicated hash function. But the code of xxhash3 has dedicated hash functions for 1-3 bytes 4-9 bytes and so on.
Does that mean that even if we do not have inlining we can get good performance from xxhash3 for small key sizes ? Or by enabling inling is there supposed to be a 3-4x improvement ?
ReplyDelete
Replies
RohanMarch 4, 2022 at 12:01 AM
Hi Cyan
Its always a pleasure to read your blogs and follow along. I was doing some experiments with your amazing XXH3 implementation.
I had one small question

In your blog you mention that with inlining and passing the length of the key to hash as a compile time constant, you achieve speeds closer to a dedicated hash function(which is around 3-4x faster) and hence there is no need to maintain a dedicated hash function. But the code of xxhash3 has dedicated hash functions for 1-3 bytes 4-9 bytes and so on.
Does that mean that even if we do not have inlining we can get good performance from xxhash3 for small key sizes ? Or by enabling inling is there supposed to be a 3-4x improvement ?
ReplyDelete
Replies

Add comment