While working on the statistics produced by huff0, some inefficiencies became apparent, and they have been fixed in a newer version, available on the new huff0 homepage.
To properly understand what was changed, here is a performance snapshot comparing v0.5 (old) with v0.6 (new).
                 Huff0 v0.5             Huff0 v0.6
                 R      C      D        R      C      D
enwik8.7z        1.000  740    1400     1.000  810*   1930*
audio1           1.070  285    280      1.070  285    280
enwik8-lz4-lit   1.363  210    200      1.363  210    200
enwik8-lz4-offh  1.188  188    181      1.188  188    181
enwik8-lz4-ml    2.915  200    197      2.915  210*   197
enwik8-lz4-run   4.284  209    218      4.284  235*   218
loong            278.0  450    2930     278.0  785*   2930

(R = compression ratio, C = compression speed, D = decompression speed; values that changed between versions are marked with an asterisk)
To begin with, only speed is affected by the changes. Compression ratio remains strictly identical.
The first change is in the way data is scanned. This is a nifty trick related to cache behavior: when a value in memory is updated, it should not be updated again immediately, otherwise there is a pipeline penalty, since the previous write must be fully committed before the new one is accepted.
To avoid this penalty, data updates are interleaved, so that the CPU gets enough time between repeated changes to the same value. It makes the code slightly more complex and the data structures a bit bigger, but it is definitely worth it: although it slightly hurts situations such as "not compressible" data, the more frequently a given value occurs, the greater the benefit.
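The actual huff0 source is not shown here, but the interleaving idea can be sketched as follows: instead of incrementing a single histogram, the scan maintains four partial tables (the number four is an illustrative assumption), so that back-to-back increments almost never touch the same counter.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of interleaved counting: four independent partial tables
   mean consecutive increments rarely hit the same memory location,
   avoiding the write-commit stall described above.
   The partial tables are merged at the end. */
void count_interleaved(const unsigned char* src, size_t len,
                       unsigned count[256])
{
    unsigned c[4][256];
    memset(c, 0, sizeof(c));

    size_t i = 0;
    for (; i + 4 <= len; i += 4) {   /* 4 independent update streams */
        c[0][src[i]]++;
        c[1][src[i + 1]]++;
        c[2][src[i + 2]]++;
        c[3][src[i + 3]]++;
    }
    for (; i < len; i++)             /* tail bytes */
        c[0][src[i]]++;

    for (int s = 0; s < 256; s++)    /* merge partial tables */
        count[s] = c[0][s] + c[1][s] + c[2][s] + c[3][s];
}
```

On highly repetitive input ("ultra compressible"), a single-table loop would hit the same counter on every iteration, which is exactly the worst case this layout avoids.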
We end up with a grand +70% improvement for the "ultra compressible" corner case. It is not only corner cases that benefit, though: the "ordered" distribution gets a nice +5% boost, while the squeezed one gets a bit more than +10%.
The second change is in the way incompressible data is detected. While the end result is still slower than Range0, there is a pretty nice performance boost from betting earlier on the incompressible nature of the data just scanned. It avoids triggering the later stages of the algorithm, thus saving time and energy. This is a nice +10% boost.
While this gain is only visible on the "not compressible" example, it can still pay off in real situations. It is not uncommon for some parts of large files to be incompressible; for example, an ISO file may contain a few pictures in an already-compressed format. In such cases, the better the detection, the faster the compression, since any attempt to compress this data is likely to fail or end up with miserable gains.
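The post does not detail huff0's actual test, but one cheap heuristic of this kind, sketched here under assumed threshold values, is to look at the histogram just produced: if even the most frequent byte barely exceeds the uniform average (total/256), the distribution is nearly flat and entropy coding cannot gain much, so the block can be emitted as-is without running the full compressor.

```c
#include <stddef.h>

/* Hypothetical early-exit heuristic: bet that a block is
   incompressible when its most frequent symbol is at most 25%
   above the uniform average. The 5/4 threshold is an
   illustrative guess, not huff0's actual constant. */
int looks_incompressible(const unsigned count[256], size_t total)
{
    unsigned max = 0;
    for (int s = 0; s < 256; s++)
        if (count[s] > max) max = count[s];

    /* uniform distribution => every count is near total/256 */
    return (unsigned long long)max * 256
        <= ((unsigned long long)total * 5) / 4;
}
```

Since the histogram is needed anyway to build the Huffman tree, this check costs almost nothing, which is why betting early is nearly free when it loses and saves a full compression pass when it wins.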
The third change is really straightforward: I removed my hand-made "copy" operation and swapped it for the standard memcpy() call of the C library.
Although this only matters when some segments are incompressible, in that case the gain is really noticeable, at about +30%. For just a simple function call, it is really worth it.
I still have to understand why the call to memcpy() is faster than my own simple loop. There are probably some very interesting lessons behind it.
As a side comment, I tried the same trick on the "ultra compressible" file, replacing my own simple loop with memset(). It resulted in no speed difference. I then tried to improve the simple loop with more complex parallel execution, but that resulted in a loss.
My only explanation so far is that the compiler probably already translates my simple loop into a memset(), transparently.
Anyway, please feel free to test.
The binary can be downloaded at the Huff0 homepage.