Wednesday, March 14, 2018

xxHash for small keys: the impressive power of modern compilers

 Several years ago, xxHash was created as a companion error detector for LZ4 frame format. The initial candidate for this role was CRC32, but it turned out being several times slower than LZ4 decompression, nullifying one of its key strength.

After some research, I found the great murmurhash, by Austin Appleby, alongside its validation tool SMHasher . It nearly fitted the bill, running faster than LZ4, but still a bit too close to my taste. Hence started a game to see how much speed could be extracted from a custom hash formula while preserving good distribution properties.

A lot of things happened since that starting point, but the main take away from this story is : xxHash was created mostly as a checksum companion, digesting long inputs.

Fast forward nowadays, and xxHash is being used in more places than it was originally expected. The design has been expanded to create a second variant, XXH64, which is successful in video content and data bases.
But for several of these uses cases, hashed keys are no longer necessarily "large".

In some cases, the need to run xxHash on small keys resulted in the creation of dedicated variants, that cut drastically through the decision tree to extract just the right suite of operations for the desired key. And it works quite well.

That pushes the hash algorithm into territories it was not explicitly optimized for. Thankfully, one of SMHasher's test module was dedicated for speed on small keys, so it helped to pay attention to the topic during design phase. Hence the performance on small key is correct, but the dedicated function push it to another level.

Let's analyse the 4-byte hash example.
Invoking the regular XXH32() function on 4-bytes samples, and running it on my Mac OS-X 10.13 laptop (with compilation done by llvm9), I measure 233 MH/s (Millions of hashes per second).
Not bad, but running the dedicated 4-bytes function, it jumps to 780 MH/s. That's a stark difference !

Let's investigate further.
xxHash offers an obscure build flag named XXH_PRIVATE_API. The initial intention is to make all XXH_* symbols static, so that they do not get exposed on the public side of a library interface. This is useful when several libraries use xxHash as an embedded source file. In such a case, an application linking to both libraries will encounter multiple XXH_* symbols, resulting in naming collisions.

A side effect of this strategy is that function bodies are now available during compilation, which makes it possible to inline them. Surely, for small keys, inlining the hash function might help compared to invoking a function from another module ?
Well, yes, it does help, but there is nothing magical. Using the same setup as previously, the speed improves to 272 MH/s . That's better, but still far from the dedicated function.

That's where the power of inlining can really kick in. In the specific case that the key has a predictable small length, it's possible to pass as length argument a compile-time constant, like sizeof(key), instead of a variable storing the same value. This, in turn, will allow the compiler to make some drastic simplification during binary generation, through dead code removal optimization, throwing away branches which are known to be useless.
Using this trick on the now inlined XXH32(), speed increases to 780 MH/s, aka the same speed as dedicated function.

I haven't checked but I wouldn't be surprised if both the dedicated function and the inlined one resulted in the same assembly sequence.
But the inlining strategy seems more powerful : no need to create, and then maintain, a dedicated piece of code. Plus, it's possible to generate multiple variants, by changing the "length" argument to some other compile-time constant.

object XXH32() XXH32 inlined XXH32 inlined +
length constant
dedicated XXH32 function
4-bytes field
233 MH/s
272 MH/s
780 MH/s
780 MH/s


Another learning is that inlining is quite powerful for small keys, but the XXH_PRIVATE_API build macro makes a poor job at underlining its effectiveness.

As a consequence, next release of xxHash will introduce a new build macro, named XXH_INLINE_ALL. It does exactly the same thing, but its performance impact is better documented, and I suspect the name itself will make it easier for developers to anticipate its implications.

8 comments:

  1. Hi Cyan, thanks for another great read. Love your blog.
    On the 7th paragraph you say you compiled with llvm9, I guess you mean 6? Unless you have already invented a time-machine and are actually bringing technology from the future. Given your track-record as a dev, I wouldn't doubt.

    ReplyDelete
    Replies
    1. This is indeed displayed as llvm9 :

      ```
      cc -v
      Apple LLVM version 9.0.0 (clang-900.0.39.2)
      Target: x86_64-apple-darwin17.4.0
      Thread model: posix
      ```

      My time machine is an OS-X 10.13 laptop.

      Not to be confused with the official clang/llvm project.
      Apple has its own numbering system unfortunately.

      I shall add this was compiled on Mac OS-X, to reduce confusion.

      Delete
  2. Nice post (and awesome blog, I loved the 'Huffman revisited' series!).

    Questions:

    a) why crc32 turned out to be slower? IIRC there are dedicated crc32 instructions for both Intel and ARMv8-a (the new crypto instructions).

    b) or it an issue of better distribution/less collisions?

    Cheers


    Adenilson

    ReplyDelete
    Replies
    1. This happened in 2011 / 2012.
      Back them the new crc32 functions proposed by Intel were only available on the latest generation cpu (my Core2 Duo did not support them).
      There were concerns for portability : crc32 is only a name, but there are actually multiple (incompatible) crc32 flavors. The one I planned to use was the gzip version, which is more or less the "official" crc32. but the Intel version is called crc32c "castagnoli". Does the same job, but differently, hence generates different results.
      For more details, see :
      https://en.wikipedia.org/wiki/Cyclic_redundancy_check#Polynomial_representations_of_cyclic_redundancy_checks

      For ARM, I had report that they supported a different crc32 version. Not sure if they converged since then.
      Bottom line : it was too soon to bet on the "generalization" of the crc32 variant that Intel selected to implement in hardware. Even today, I would be cautious on such bet. Plus it introduces system-specific code, which ends up being complex to maintain. And any software backup was bound to be as slow as I found crc32.

      xxHash, on the other hand, was designed to be portable, so it can be deployed on any cpu supported by lz4. And its speed does not depend on the presence of a local hardware accelerator.

      Delete
  3. Thanks for the reply.

    Indeed, portability is a primary concern and things were different way back in 2011.

    In a side note, the ARMv8-a instruction is compatible with zlib's crc32 (i.e. uses the same polynomial), we are using it in Chromium's zlib (shipping in the upcoming Chrome M66).

    Cheers

    Adenilson

    ReplyDelete
  4. Hi Cyan
    Its always a pleasure to read your blogs and follow along. I was doing some experiments with your amazing XXH3 implementation.
    I had one small question

    In your blog you mention that with inlining and passing the length of the key to hash as a compile time constant, you achieve speeds closer to a dedicated hash function(which is around 3-4x faster) and hence there is no need to maintain a dedicated hash function. But the code of xxhash3 has dedicated hash functions for 1-3 bytes 4-9 bytes and so on.
    Does that mean that even if we do not have inlining we can get good performance from xxhash3 for small key sizes ? Or by enabling inling is there supposed to be a 3-4x improvement ?

    ReplyDelete
    Replies
    1. Yes, inlining + size as a compile-time constant will always improve speed.
      Even for XXH3.
      The relative speed boost will be a bit less important for XXH3 (than XXH64 or XXH32), because it's better optimized for small data, but it will still be significant.

      Delete
  5. Hi Cyan
    Its always a pleasure to read your blogs and follow along. I was doing some experiments with your amazing XXH3 implementation.
    I had one small question

    In your blog you mention that with inlining and passing the length of the key to hash as a compile time constant, you achieve speeds closer to a dedicated hash function(which is around 3-4x faster) and hence there is no need to maintain a dedicated hash function. But the code of xxhash3 has dedicated hash functions for 1-3 bytes 4-9 bytes and so on.
    Does that mean that even if we do not have inlining we can get good performance from xxhash3 for small key sizes ? Or by enabling inling is there supposed to be a 3-4x improvement ?

    ReplyDelete