However, there were some useful learnings during this implementation.
First, on multi-threading cost.
LZ4 has a very simple design, it was only natural to keep the large block structure and adapt it to parallel compression. With Huff0, doing the same directly results in a dramatic performance slowdown.
The reason for it : creating new threads costs some performance overhead.
In order to limit this cost, a simple solution is to limit the number of times a new thread is created; a very simple way to achieve this is to deal with larger blocks of data.
Huff0 therefore features a double-segmentation layer, with larger blocks defined at thread level, while smaller ones are dispatched for entropy adaptation purpose (i.e. compression rate is better if blocks are smaller, because the statistics are more adapted to the block).
Thinking aloud, i'm wondering if performance would be better by changing threading strategy : instead of creating a new thread for each block, what about "sleeping it", filling its tables and resume it ?
I'm not sure if it would change performance conclusion, but it's probably worth a try.
[Edit] and indeed, it is, see follow-up : http://fastcompression.blogspot.com/2011/03/advanced-multithreading-analysis-huff0.html
Second, on memory access behavior.
This part is a bit more tricky.
Making a code "multi-thread compatible" requires little more than getting rid of all form of global variables, keeping only local ones (ie dynamically allocated into the stack) or, when necessary, allocating memory in a kind of "context structure" which will be provided as part of the function parameters. This second point makes the code very much "object-oriented".
Now on the findings : i've been witnessing some performance hit on highly accessed memory slots. When, for example, a statistics or counter structure is declared, and due to its nature is updated very often, performance is better when the structure is declared as global variable, rather than local ones. (Note that declaring "static local" variables has the same effect).
I'm not sure to fully understand that one, but i've found a lead : this comment :
In calculating the memory address, local variables require one more fast addition operation than global ones. It sounds very little, but precise benchmarks do feel a difference.
I still have to find a way around this performance hit though, if that's possible...
You can download the new version of Huff0 here.