Saturday, July 19, 2014

xxHash wider : assessing quality of a 64-bits hash function

The initial version of xxHash was created in a bid to find a companion error detection algorithm for the LZ4 decoder. The objective was set after discovering that usual implementations of CRC32 were so slow that the overall process of decoding + error check would be dominated by the error check.
The bet ultimately paid off, and owes part of its success to MurmurHash, most notably its test tool SMHasher, the best of its kind for measuring the quality of a hash algorithm. xxHash's speed advantage stems from its explicit use of ILP (Instruction Level Parallelism) to keep the multiple ALUs of modern CPU cores busy.

Fast forward to 2014: the computing world has evolved a bit. Laptops, desktops and servers have massively transitioned to 64-bits, while 32-bits remains widely used, mostly within smartphones and tablets. 64-bits computing is now part of the daily experience, and it has become more natural to create algorithms primarily targeting 64-bits systems.

An earlier demo of XXH64 quickly proved that moving to 64-bits achieves much better performance, just by virtue of wider memory accesses. For some time, however, I wondered whether that was a "good enough" objective, and whether XXH64 should also offer some additional security properties. It took the persuasion of Mathias Westerdhal to push me to create XXH64 as a simpler derivative of XXH32, which was, I now believe, the right choice.

XXH64 is therefore a fairly straightforward application of the XXH methodology to 64-bits : an inner loop with 4 interleaved streams, a tail sequence to handle input sizes which are not a multiple of 32, and a final avalanche to ensure all bits are properly randomized. The bulk of the work was done by Mathias, while I mostly provided some localized elements, such as the prime constants, the shift sequences, and some optimizations for short inputs.
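
To make that structure more concrete, here is a minimal C sketch of the shape described above. It only illustrates the 4-way interleaved inner loop and the multiply-rotate-multiply round; the constants, rotation amounts, and the omitted tail/avalanche steps are placeholders for illustration, not the official XXH64 values (see xxhash.c for the real ones).

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustration only : placeholder odd constants, not the official XXH64 primes. */
#define PRIME_A 0x9E3779B97F4A7C15ULL
#define PRIME_B 0xC2B2AE3D27D4EB4FULL

static uint64_t rotl64(uint64_t x, int r) { return (x << r) | (x >> (64 - r)); }

static uint64_t read64(const unsigned char* p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));   /* unaligned read; the real code also normalizes endianness */
    return v;
}

/* One mixing round per 64-bits lane. */
static uint64_t mix_round(uint64_t acc, uint64_t input)
{
    acc += input * PRIME_B;
    acc  = rotl64(acc, 31);
    acc *= PRIME_A;
    return acc;
}

/* Four independent accumulators keep the multiple ALUs busy (the ILP argument above).
   The tail sequence and the final avalanche are omitted for brevity. */
uint64_t hash_body_sketch(const unsigned char* p, size_t len, uint64_t seed)
{
    const unsigned char* const limit = p + (len & ~(size_t)31);
    uint64_t v1 = seed + PRIME_A;
    uint64_t v2 = seed + PRIME_B;
    uint64_t v3 = seed;
    uint64_t v4 = seed - PRIME_A;

    while (p < limit) {
        v1 = mix_round(v1, read64(p)); p += 8;
        v2 = mix_round(v2, read64(p)); p += 8;
        v3 = mix_round(v3, read64(p)); p += 8;
        v4 = mix_round(v4, read64(p)); p += 8;
    }
    return rotl64(v1, 1) + rotl64(v2, 7) + rotl64(v3, 12) + rotl64(v4, 18);
}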

The quality of XXH64 is very good, but that conclusion was difficult to assess. A key problem with 64-bits algorithms is that they require generating and tracking too many results to properly measure collisions (you need about 4 billion hashes for a 50% chance of getting a single collision). So, basically, all tests must be perfect, ending with 0 collisions. Which is the case.
Since that's a bare minimum, and not a good enough objective to measure 64-bits quality, I also looked at the bias metric. The key idea is : any bit within the final hash must have a 50% chance of being 0 or 1. The bias metric finds the worst bit, the one which deviates the most from 50%. Results are good, with a typical worst deviation around 0.1%, equivalent to perfect cryptographic hashes such as SHA1.
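
The bias measurement itself is simple enough to sketch. The snippet below is not the SMHasher code, just a minimal illustration of the idea, assuming a hypothetical hash64() function with an XXH64-like (input, length, seed) prototype :

#include <stdint.h>
#include <stdlib.h>
#include <math.h>

/* hash64() stands for the 64-bits hash under test (hypothetical prototype). */
extern uint64_t hash64(const void* input, size_t length, uint64_t seed);

/* Hash 'trials' random keys and return the worst deviation from 50%
   observed on any of the 64 output bits (0.0 means no measurable bias). */
double worst_bit_bias(size_t keylen, int trials)
{
    long ones[64] = { 0 };
    unsigned char* key = malloc(keylen);

    for (int t = 0; t < trials; t++) {
        for (size_t i = 0; i < keylen; i++) key[i] = (unsigned char)rand();
        uint64_t h = hash64(key, keylen, 0);
        for (int b = 0; b < 64; b++) ones[b] += (long)((h >> b) & 1);
    }
    free(key);

    double worst = 0.0;
    for (int b = 0; b < 64; b++) {
        double deviation = fabs((double)ones[b] / trials - 0.5);   /* distance from 50% */
        if (deviation > worst) worst = deviation;
    }
    return worst;
}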

Since I was still not totally convinced, I also measured each 32-bits part of the 64-bits hash (high and low) as an individual 32-bits hash. The theory is : if the 64-bits hash is perfect, any 32-bits part of it must also be perfect. And the good thing is : with 32-bits, collisions can be properly measured. The results are also excellent, each 32-bits part achieving a perfect score in every possible metric.
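
Splitting the hash is trivial, and counting collisions on a 32-bits slice is then just a matter of sorting the values and scanning for duplicates. Here is a minimal sketch of that principle (not the SMHasher methodology) :

#include <stdint.h>
#include <stdlib.h>

/* The two 32-bits halves of a 64-bits hash, each expected to behave
   as a good 32-bits hash on its own. */
uint32_t high32(uint64_t h) { return (uint32_t)(h >> 32); }
uint32_t low32 (uint64_t h) { return (uint32_t)h; }

static int cmp_u32(const void* a, const void* b)
{
    uint32_t x = *(const uint32_t*)a;
    uint32_t y = *(const uint32_t*)b;
    return (x > y) - (x < y);
}

/* Count colliding values among n 32-bits hashes : sort, then scan neighbors. */
size_t count_collisions(uint32_t* values, size_t n)
{
    size_t collisions = 0;
    qsort(values, n, sizeof(*values), cmp_u32);
    for (size_t i = 1; i < n; i++)
        if (values[i] == values[i - 1]) collisions++;
    return collisions;
}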

But it was still not good enough. We could have 2 perfect 32-bits hashes coalesced together which are merely repetitions of each other, which of course would not make an excellent 64-bits hash. So I also measured the "Bit Independence Criterion", the ability to predict one bit from another one. On this metric too, XXH64 gets a perfect score, meaning no bit can be used as a predictor for another bit.
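
As a rough illustration of the principle (again, this is not the SMHasher BIC test itself, and it reuses the hypothetical hash64() declaration from the earlier snippet) : every pair of output bits should disagree about 50% of the time, so a pair of bits that are mere copies of each other, as in the "coalesced repetition" scenario above, fails immediately.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

extern uint64_t hash64(const void* input, size_t length, uint64_t seed);

/* For each pair of output bits (i, j), count how often they differ.
   Independent, unbiased bits differ 50% of the time; return the worst
   deviation from 50% across all pairs (0.5 would mean a pure copy or inversion). */
double worst_pair_bias(size_t keylen, int trials)
{
    static long differ[64][64];           /* differ[i][j] : count of (bit i != bit j) */
    unsigned char* key = malloc(keylen);
    memset(differ, 0, sizeof(differ));

    for (int t = 0; t < trials; t++) {
        for (size_t n = 0; n < keylen; n++) key[n] = (unsigned char)rand();
        uint64_t h = hash64(key, keylen, 0);
        for (int i = 0; i < 64; i++)
            for (int j = i + 1; j < 64; j++)
                differ[i][j] += (long)(((h >> i) ^ (h >> j)) & 1);
    }
    free(key);

    double worst = 0.0;
    for (int i = 0; i < 64; i++)
        for (int j = i + 1; j < 64; j++) {
            double deviation = fabs((double)differ[i][j] / trials - 0.5);
            if (deviation > worst) worst = deviation;
        }
    return worst;
}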

So I believe we have gone as far as we could to measure the quality of this hash, and it looks good for production usage. The algorithm is delivered with a benchmark program, integrating a coherency checker, to ensure results remain the same across all supported architectures. It's automatically tested using the Travis continuous integration environment, including Valgrind memory access verification.

Note that 64-bits hashes are really meant for 64-bits programs. They get roughly double the speed thanks to the increased number of bits loaded per cycle. But if you try to fit such an algorithm into a 32-bits program, the speed will plummet, because emulating 64-bits arithmetic on 32-bits hardware is quite costly.

SMHasher speed test, compiled using GCC 4.8.2 on Linux Mint 64-bits. The reference system uses a Core i5-3340M @ 2.7 GHz.

Version    Speed on 64-bits    Speed on 32-bits
XXH64      13.8 GB/s           1.9 GB/s
XXH32      6.8 GB/s            6.0 GB/s






26 comments:

  1. Awesome! This is exactly what I've been looking for.

  2. How does it compare (speed-wise) with CityHash128?

    Replies
    1. They are roughly similar on this test platform (13.0 GB/s in 64 bits mode, 1.7 GB/s in 32-bits mode).

      Note that Cityhash can use Intel's hardware CRC32C instruction for optimal speed (see http://en.wikipedia.org/wiki/CityHash). Depending on the presence or absence of this instruction on your test system, your mileage may vary.

  3. Hi Yann,
    Remember that you have practically built a free checksum inside FSE/ANS: you start with some fixed initial state and store the final state. If something went wrong, the decoded initial state will turn out to be a random value instead - there is approximately a 1/L probability of a false positive, which can be decreased by using a few states simultaneously (interleaving).

    A good entropy coder can simultaneously be used for something much stronger: not only to decide whether the data is damaged, but also to repair such damage (error correction).
    For this purpose, add a "forbidden symbol" to the alphabet and rescale the rest. You never use this symbol while encoding, but it can accidentally appear after an error while decoding - in this case the decoder should go back and try a correction - developing a tree of corrections ( http://demonstrations.wolfram.com/CorrectionTrees/ ).
    Best,
    Jarek

    Replies
    1. Hi Jarek

      That's a good point (at least when the initial value is not used for something else, such as cheap last-symbol storage).
      Of course, the detection probability is limited by the size of the table (typically 4K == 12 bits). That's less than a 32-bits checksum (4B), but if the goal is only to catch accidental errors, then it could be considered good enough.
      Also : when using multiple interleaved streams, it creates multiple state values which can be used to check bitstream integrity, adding their bits in the process, hence improving detection power.

      Thanks for the demo. Very interesting. The "forbidden symbol" implies a bit of redundancy, which plays against compression efficiency. That's normal, all error correction algorithms must introduce such redundancy.
      A typical error correction algorithm is usually added on top of the data to protect, so it's quite novel to integrate it directly into the entropy coder. One would need a comparison to see if this method is as efficient (introduces less redundancy for the same correction power) as dedicated algorithms such as, for example, turbo codes.

  4. Hi Yann,
    Indeed redundancy has a cost - tiny for a checksum, and huge for error correction. A 2^-12 false positive probability may not be sufficient for many applications (~2^-13 when the initial state is chosen as the last one).
    rANS can directly operate on 64-bit states, while for tANS/FSE we could "couple" a few states, so that an error disturbs all of them. Unfortunately simple interleaving (e.g. even symbols encoded using the first state, odd using the second) is not sufficient - e.g. a single-bit damage affects only a single state (and bit synchronization, but that's just an order of magnitude). A cheap coupling to prevent that is e.g. (sometimes?) encoding the XOR of the current symbol with the previous one.

    Regarding correction trees, it is basically the so-called sequential decoding used for convolutional codes, but with a more sophisticated coding scheme. I have worked on it further, and for large states and bidirectional correction it can easily outperform turbo codes and most LDPCs - here are the implementation and preprint: https://indect-project.eu/correction-trees/
    However, for ANS we cannot use bidirectional correction, the code is nonsystematic and tANS has a relatively small state - the performance would be slightly worse (~turbo code level). But it is cheap.

    Another nearly free option that can be added to tANS-based compressors is encryption - e.g. just slightly perturb the symbol spread according to a cryptographic key.

    Replies
    1. Among error correction algorithms, there is a large difference between algorithms which can detect and correct a bit flip, and those which also successfully manage to correct a missing or additional bit.

      The second category is very important within transmission protocols, since missing/added bits are consequences of small synchronization errors, and I was taught in school that they are solved using convolutional codes. Such codes tend to have sizable redundancy though.

      Does your method belong to the first (bit flip only) or second category ?

  5. Indeed, due to their fixed block structure, synchronization errors are nearly impossible to deal with for LDPC and turbo codes. However, for sequential decoding such errors are just additional local corrections to consider - types of branches in the correction tree. Then the regularly distributed checksums cut off wrong branches with large probability.
    So if we operate e.g. on 8-bit blocks, there are 256 possible bit-flip patterns to consider, with rapidly dropping probability (we would have to develop the tree a lot before considering the scenario with all bits flipped).
    To add e.g. deletion errors, we just add them to the set of local scenarios to consider, with corresponding probabilities - starting e.g. with taking 7 bits and inserting 1 bit in all possible positions.
    The biggest issue here is that different corrections can lead to the same sequence, increasing the complexity of the decoder.
    I think I could write a practical decoder for reasonable parameters, close to (unknown!) capacity ...

    Replies
    1. I believe that if you can achieve such an objective, it will be a great breakthrough, opening new perspectives for transmission scenarios.

      1/256th is really a very small overhead; you could easily increase it to 1/16th and it would still be reasonable for error correction purposes.

      One thing I'm still wondering about : an error is detected when reaching a forbidden symbol. OK. But what about an error at the end of the bitstream, which, by luck, never reaches a forbidden symbol ? I guess we can still use the "end of bitstream state number" to detect it, but does that affect the way the error is corrected ?

    2. The overhead depends on the noise level - the probability of the forbidden symbol is the probability of detecting that we are on a wrong correction branch, but the cost of increasing this probability is rate reduction.
      This cost can be huge when there is a large space of possible damages - if each length-N sequence can be damaged in 2^k ways, you can transmit at most 2^{N-k} possibilities ("codewords"), giving a rate limit of (N-k)/N.

      Indeed the final state should also be written (protected) for final confirmation. And having it, we could try to simultaneously develop a second tree in the backward direction, and finally try to make both trees meet (bidirectional correction in the linked implementation). However, ANS does not allow decoding in both directions ...

      As correction trees seem the most promising approach e.g. for the deletion channel, maybe I will write a decoder after finishing my current project ...

    3. Done: https://github.com/JarekDuda/DeletionChannelPracticalCorrection
      There is still room for improvement (bidirectional correction), but it already achieves a 2-3 times better rate than some LDPC-based codes I have found.

    4. It looks great Jarek !
      Your coding style is very clean too.

      Could you please define what you mean by "2-3 times better rate than LDPC" ? As far as I know, LDPC can be made arbitrarily close to the Shannon Limit, so it's difficult for me to understand how it can be beaten by such a margin, but maybe you use a different metric.

      Another area of attention is computation time. LDPC has been winning design contests for a decade thanks to its lower computation cost (compared with Turbo Codes). So for ANS-based correction to become an effective competitor, it will also have to fare favorably in this area.

      I'll probably have a deeper look at your code in the next few days, and probably start by building some testing capabilities around it. It will be useful to properly assess its effectiveness and find some potential remaining bugs or limits.

      Best Regards

    5. Hi Yann,
      Thanks for the comment about my coding style, but I know that I should use longer variable names. I have to admit that my direct motivation for this implementation was practicing C++, as a few weeks ago I didn't get a research position at Google because of poor C++ skills ...

      Regarding LDPC, it is not well suited for synchronization errors. Look at page 8 of http://www.eecs.harvard.edu/~chaki/doc/code-long.pdf - their 0.2333-rate code breaks for p ~ 0.07-0.08 ... while my implementation still works above that at rate 0.5 - it can transmit more than twice as much through the same channel (and can still be improved by adding bidirectional correction).
      Regarding speed, the "nodes" column in the table is practically a linear coefficient for decoding time - 1 means just decoding (one table use per byte). So for low noise levels (everyday use) such correction is extremely fast and cheap. Sometimes the damage happens to be serious - in this case you can usually still correct it, but it becomes really costly.

      Regarding ANS - the basic concept was indeed intended for this coding, however I have replaced it with a more appropriate one for correction purposes in the previous paper: one which allows bidirectional correction, which is systematic (uncorrected regions maintain the channel's noise), and which quickly reaches a large state (64 bit).
      If we artificially increase the number of ANS states (... or use rANS), it should get a similar frame rate as this implementation (forward only) - the advantage would be smooth control of the code rate (via the probability of the forbidden symbol).
      Best Regards,
      Jarek

  6. For a 32 bit hash on 64 bit programs, couldn't you then call the 64 bit hash and drop the lower 32 bits? Would that not give you higher speed than the 32 bit algorithm? If so, perhaps it should just do that by default.

    Replies
    1. Yes Sebastian, you are totally right;
      you can extract any part of the 64-bits hash to create an n-bits hash of good quality. You could even use this opportunity to create several n-bits hashes in a single pass (the bloom filter use case comes to mind).

      > perhaps it should just do that by default

      Note that the 64-bits hash is the fastest only if you are sure your program runs as 64-bits. If your program sometimes runs in 32-bits mode, this doesn't work anymore : 64-bits arithmetic will kill performance.

      This is for example the case for LZ4 error detection : the algorithm must generate an identical checksum on a variety of systems, some 32-bits, some 64-bits. All must generate the same checksum, therefore the algorithm must be the same.

      It's a matter of use case, which is under the programmer's complete control. The library can't guess such situations automatically, and therefore offers both 32-bits & 64-bits variants to let the programmer choose.

  7. Any plans on building a 128bit version?

    Replies
    1. No, as there is no demand for it.

      A 128-bits version could be created using the same core engine as XXH64, but with a different final avalanche. The speed would be approximately equivalent.

  8. A umac-based hash is much faster on 32-bit CPUs, has quality guaranteed by its cryptographic (MAC) roots, and can be extended to any size by hashing multiple times with different keys. Its only drawback is much larger code, since it employs AES for key stream generation.

    For every 64 bits of input data it performs the following computation:
    sum += (data0+key0)*(data1+key1)
    where data/key are 32-bit values and the multiplication/sum are 64-bit.

    Even without SIMD, this is only 1-2 CPU ticks consuming 64 bits of input data and producing a 32-bit result (the higher bits of sum). And with SSE2/AVX2 it's 4 CPU ticks consuming 32/64 input bytes and producing a 32-bit result.
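
    A minimal C rendering of that inner step, with the 32-bit/64-bit widths made explicit (the function name is illustrative; this is just the arithmetic described in the comment above) :

    #include <stdint.h>

    /* The two 32-bit additions wrap modulo 2^32; the two 32-bit sums are then
       multiplied together into a 64-bit accumulator. */
    uint64_t nh_step(uint64_t sum, uint32_t data0, uint32_t data1,
                     uint32_t key0, uint32_t key1)
    {
        sum += (uint64_t)(data0 + key0) * (data1 + key1);
        return sum;   /* the higher 32 bits of 'sum' form the hash result */
    }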

  9. Awesome work, and elegantly explained. Do you have any plans to release XXH64 (Java version) from the https://github.com/jpountz/lz4-java git repo? I don't see XXH64 in the latest version (1.2).

    Replies
    1. You'll probably have a better chance of getting an answer by putting your request directly on the lz4-java issue board. This is Adrien's work; he's the only one who can provide you an answer.

  10. You measured collisions taking just the lower 32 bits, and just the upper 32 bits. I suggest also measuring collisions taking the 32 even numbered bits and the 32 odd numbered bits. That would give you a pretty good "matrix".

    Replies
    1. I made a few 32-bits extractions of the 64-bits hash; they all worked well.
      I believe the most important test is in fact the "Bit Independence Criterion", which ensures that no bit pattern can be "guessed" from another bit. On this measure, I got a maximum deviation of < 0.3% from a strict 50%, which is equivalent to perfect noise.

  11. Hi,

    How do I install xxhash on CentOS, and how do I test a file or folder with it?

    Thanks,
    Chellasundar

    Replies
    1. In the "dev" branch, there is an installer :
      sudo make install

      Then,
      man xxhsum
      for detailed instructions.

      If you just want to test, you don't necessarily have to install.
      You can just do :
      make
      in the master branch.

    2. [root@localhost Digest-xxHash-2.03]# make
      make: *** No targets specified and no makefile found. Stop.

      We are unable to run make.

      Do we need to install an RPM package?
