Thursday, July 30, 2015

Huffman revisited - Part 3 - Depth limited tree

A non-trivial issue that most real-world Huffman implementations must deal with is tree depth limitation.

Huffman construction doesn't limit the depth. If it would, it would no longer be "optimal". Granted, the maximum depth of an Huffman tree is bounded by the Fibonacci serie, but that leave ample room for larger depth than wanted.

Why limit Huffman tree depth ? Fast huffman decoders use lookup tables. It's possible to use multiple table levels to mitigate the memory cost, but a very fast decoder such as Huff0 goes for a single table, both for simplicity and speed. In which case the table size is a direct product of the tree depth (tablesize = 1 << treeDepth).

For the benefit of speed and memory management, a limit had to be selected : it's 8 KB for the decoding table, which nicely fits into Intel's L1 cache, and leaves some room to combine it with other tables if need be. Since latest decoding table uses 2 bytes per cell, it translates into 4K cells, hence a maximum tree depth of 12 bits.

12 bits for compressing literals is generally too little, at least according to optimal Huffman construction. Creating a depth-limited tree is therefore a practical issue to solve. The question is : how to achieve this objective with minimum impact on compression ratio, and how to do it fast ?

Depth-limited huffman trees have been studied since the 1960's, so there is ample literature available. What's more surprising is how complex the proposed solutions are, and how many decades were
necessary to converge towards an optimal solution.

(Note : in below paragraph, n is the alphabet size, and D is the maximum tree Depth.)
It started with Karp, in 1961 (Minimum-redundancy coding for the discrete noiseless channel), proposing a solution in exponential time. Then Gilbert, in 1971 (Codes based on inaccurate source probabilities), still in exponential time. Hu and Tan, in 1972 (Path length of binary search trees), with a solution in O(n.D.2^D). Finally, a solution in polynomial time was proposed by Garey in 1974 (Optimal binary search trees with restricted maximal depth), but still O(n^2.D) time and using O(n^2.D) space. In 1987, Larmore proposed an improved solution using O(n^3/2.D.log1/2.n) time and space (Height restricted optimal binary trees). The breakthrough happened in 1990 (A fast algorithm for optimal length-limited Huffman codes), when Larmore and Hirschberg propose the Package_Merge algoritm, a completely different kind of solution using only O(n.D) time and O(n) space. It became a classic, and was refined a few times over the next decades, with the notable contribution of Mordecai Golin in 2008 (A Dynamic Programming Approach To Length-Limited Huffman Coding).

Most of these papers are plain difficult to read, and it's usually harder than necessary to develop a working solution by just reading them (at least, I couldn't. Honorable mention for Mordecai Golin, which proposes a graph-traversal formulation relatively straightforward. Alas, it was still too much CPU workload for my taste).

In practice, most fast Huffman implementations don't bother with them. Sure, when optimal compression is required, the PackageMerge algorithm is preferred, but in most circumstances, being optimal is not really the point. After all, Huffman is already a trade-off between optimal and speed. By following this logic, we don't want to sacrifice everything for an optimal solution, we just need a good enough one, fast and light.

That's why you'll find some cheap heuristics in many huffman codes. A simple one : start with a classic Huffman tree, flatten all leaves beyond maximum depth, then flatten enough higher leaves to maxBits to get back the total length to one. It's fast, it's certainly not optimal, but in practice, the difference is small and barely noticeable. Only when the tree depth is very constrained does it make a visible difference (you can read some relevant great comments from Charles Bloom on its blog).

Nonetheless, for huff0, I was willing to find a solution a bit better than cheap heuristic, closer to optimal. 12 bits is not exactly "very constrained", so the pressure is not high, but it's still constrained enough that the depth-limited algorithm is going to be necessary in most circumstances. So better have a good one.

I started by making some simple observations : after completing an huffman tree, all symbols are sorted in decreasing count order. That means that the number of bits required to represent each symbol must follow a strict increasing order. That means the only thing I need to track is the border decision (from 5 to 6 bits, from 6 to 7 bits, etc.).

So now, the algorithm will concentrate on moving the arrows.

The first part is the same as the cheap heuristic : flatten everything that needs more than maxBits. This will create a "debt" : a symbol requiring maxBits+1 bits creates a debt of 1/2=0.5 when pushed to maxBits. A symbol requiring maxBits+2 creates a debt of 3/4=0.75, and so on. What may not be totally obvious is that the sum of these fractional debts is necessarily an integer number. This is a consequence of starting from a solved huffman tree, and can be proven by simple recurrence : if the huffman tree natural length is maxBits+1, then the number of elements at maxBits+1 is necessarily even, otherwise the sum of probabilities can't be equal to one. The debt's sum is therefore necessarily a multiple of 2 * 0.5 = 1, hence an integer number. Rince and repeat for maxBits+2 and further depth.

So now we have a debt to repay. Each time you demote a symbol from maxBits-1 to maxBits, you repay 1 debt. Since the symbols are already sorted in decreasing frequency, it's easy to just grab the smallest maxBits-1 symbols, and demote them to maxBits, up to repaying the debt. This is in essence what the cheap heuristic does.

But one must note that demoting a symbol from maxBits-2 to maxBits-1 repay not 1 but 2 debts. Demoting from maxbits-3 to maxBits-2 repay 4 debts. And so on. So now the question becomes : is it preferable to demote a single maxBits-2 symbol or two maxBits-1 symbols ?

The answer to this question is trivial since we deal with integer number of bits : just compare the sum of occurrences of the two maxBits-1 symbols with the occurrence of the single maxBits-2 one. Whichever is smallest costs less bits to demote. Proceed.

This approach can be scaled. Need to repay 16 debts ? A single symbol at maxBits-5 might be enough, or 2 at maxBits-4. By recurrence, each maxBits-4 symbol might be better replaced by two maxBits-3 ones, and so on. The best solution will show up by a simple recurrence algorithm.

Sometimes, it might be better to overshoot : if you have to repay a debt of 7, which formula is better ? 4+2+1, or 8-1 ? (the -1 can be achieved by promoting the best maxBits symbol to maxBits-1). In theory, you would have to compare both and select the better one. Doing so leads to an optimal algorithm. In practice though, the positive debt repay (4+2+1) is most likely the better one, since distribution must be severely twisted for the overshoot solution to win.

The algorithm becomes a bit more complex when some bits ranks are missing. For example, one needs to repay a debt of 2, but there is no symbol left at maxBits-2. In such case, one can still uses maxBits-1 symbols, but maybe there is no more of these symbols left either. In which case, the only remaining solution is to overshoot (maxBits-3) and promote enough elements to get the debt back to zero.

On average, implementation of this algorithm is pretty fast. Its CPU cost is unnoticeable, compared to the decoding cost itself, and the final compression ratio is barely affected (<0.1%) compared to unconstrained tree depth. So that's mission accomplished.

The fast variant of the depth limit algorithm is available in open source and can be grabbed at github, under the function nameHUF_setMaxHeight().

Wednesday, July 29, 2015

Huffman revisited - Part 2 : the Decoder

The first attempt to decompress the Huffman bitStream created by a version of huff0 modified to use FSE bitStream ended up in brutal disenchanting. While decoding operation itself worked fine, the resulting speed was a mere 180 MB/s.

OK, in absolute, it looks reasonable speed, but keep in mind this is far off the objective of beating FSE (which decodes at 475 MB/s on the same system), and even worse than reference zlib huffman. Some generic attempts at improving speed barely changed this, moving up just above 190 MB/s.

This was a disappointment, and a clear proof that the bitStream alone wasn't enough to explain FSE speed. So what could produce such a large difference ?

Let's look at the code. The critical section of FSE decoding loop looks like this :

    DInfo = table[state];
    nbBits = DInfo.nbBits;
    symbol = DInfo.symbol;
    lowBits = FSE_readBits(bitD, nbBits);
    state = DInfo.newState + lowBits;
    return symbol;

while for Huff0, it would look like this :

    symbol = tableSymbols[state];
    nbBits = tableNbBits[symbol];
    lowBits = FSE_readBits(bitD, nbBits);
    state = ((state << nbBits) & mask) + lowBits;
    return symbol;

There are some similarities, but also some visible differences. First, Huff0 creates 2 decoding tables, one to determine the symbol being decoded, the other one to determine how many bits are read. This is a good design for memory space : the larger table is tableSymbols, as its size primarily depends on 1<<maxNbBits. The second table, tableNbBits, is much smaller : its size only depends on nbSymbols. This construction allows using only 1 byte per cell. It favorably compares to 4 bytes per cell for FSE. This memory advantage can be used either as a net space saver, or as a way to boost accuracy, by increasing maxNbBits.

The cost for it is that there are 2 interdependent operations : first decode the state to get the symbol, then use the symbol to get nbBits.

This interdependence is likely the bottleneck. When trying to design high performance computation loops, there are 3 major rules to keep in mind :

Ensure hot data is already in the cache.
Avoid badly predictable branches (predictable ones are fine)
For modern OoO (Out of Order) CPU : keep their multiple execution units busy by feeding them with independent (parallelizable) operations.

This list is given in priority order. It makes no sense to try optimizing your code for OoO operations if the CPU has to wait for data from main memory, as the latency cost is much higher than any CPU operation. If your code is full of badly predictable branches, resulting in branch flush penalties, this is also a much larger problem than having some idle execution units. So you can only get to the third set of optimization after properly solving the previous ones.

This is exactly the situation where Huff0 is, with a fully branchless bitstream and data tables entirely within L1 cache. So the next performance boost will likely be found into OoO operations.

In order to avoid dependency between symbol first, then nbBits , let's try a different table design, where nbBits is directly stored alongside symbol, in the state table. This double the memory cost, hence reducing the memory advantage enjoyed by Huffman compared to FSE. But let's see where it goes :

    symbol = table[state].symbol;
    nbBits = table[state].nbBits;
    lowBits = FSE_readBits(bitD, nbBits);
    state = ((state << nbBits) & mask) + lowBits;
    return symbol;

This simple change alone is enough to boost the speed to 250 MB/s. Still quite far from the 475 MB/s enjoyed by FSE on the same system, but nonetheless a nice performance boost. More critically, it underlines that the diagnosis was correct : untangling operation dependency free up CPU OoO execution units, they can do more work within each cycle.

So let's ramp up the concept. We have removed one operation dependency. Is there another one ?

Yes. When looking at the main decoding loop from a higher perspective, we can see there are 4 decoding operations per loop. But each decoding operation must wait for the previous one to be completed, because in order to know how to read the bitStream for symbol 2, we need first to know of many bits were consumed by symbol 1.

Compare with how FSE work : since state values are separated from bitStream, it's possible to decode symbol1 and symbol2, and retrieve their respective nbBits, in any order, without any dependency. Only later operations, retrieving lowBits from the bitStream to calculate the next state values, introduce some ordering dependency (and even this one can be partially unordered).

The main idea is this one : to decode faster, it's necessary retrieve several symbols in parallel, without dependency. So let's create a compressed data flow which makes such operation possible.

Re-using FSE principles "as is" to design a faster Huffman decoding is an obvious choice, but it predictably results in about the same speed. As stated previously, it's not interesting to design a new Huffman encoder/decoder if it just ends up being as fast as FSE. If that is the outcome, then let's simply use FSE instead.

Fortunately, we already know that compression can be faster. So let's concentrate on the decoding side. Since it seems impossible to decode the next symbol without first decoding the previous one from the same bitStream, let's design multiple bitStreams.

The new design is a bit more complex. Compression side is affected : in order to create multiple bitStreams, one solution is to scan input data block multiple times. It proved efficient enough to not bother with a different design. On top of that, a jumptable is required at the beginning of the block, to let the decoder know where each bitStream starts.

Within each bitStream, it's still necessary to decode the first symbol to read the second. But each bitStream is independent, so it's possible to decode up to 4 symbols in parallel.

This proved a design win. The new huff0 decompresses at 600 MB/s while preserving the compression speed of 500 MB/s. This compares favorably to FSE or zlib's huffman, as detailed below :

Algorithm	Compression	Decompression
huff0	500 MB/s	600 MB/s
FSE	320 MB/s	475 MB/s
zlib-h	250 MB/s	250 MB/s

With that part solved, it was possible to check that there is no visible compression difference between FSE and Huff0 on Literals data. To be more precise, compression is slightly worse, but header size is slightly better (huffman headers are simpler to describe). On average, both effects compensate.

The resulting code is open sourced and currently available at : https://github.com/Cyan4973/FiniteStateEntropy (dev branch)

The new API mimic its FSE counterparts, and provides only the higher (simpler) prototypes for now :

size_t HUF_compress (void* dst, size_t dstSize, 
               const void* src, size_t srcSize);
size_t HUF_decompress(void* dst,  size_t maxDstSize,
                const void* cSrc, size_t cSrcSize);

For the time being, both FSE and huff0 are available within the same library, and even within the same file. The reasoning is that they share the same bitStream code. Obviously, many design choices will have the opportunity to be challenged and improved in the near future.

Having created a new competitor to FSE, it was only logical to check how it would behave within Zstandard. It's almost a drop-in replacement for literal compression.

Zstandard	previous	Huff0 literals
compression speed	200 MB/s	240 MB/s
decompression speed	540 MB/s	620 MB/s

A nice speed boost with no impact on compression ratio. Overall, a fairly positive outcome.

Tuesday, July 28, 2015

Huffman revisited - part 1

s compression speed, way better than any other huffman compressor I'm aware of, and more critically way better than FSE compression, making it a replacement candidate.

With such great result at hand, I confidently proceeded to implement huffman decompression based on the same design. I was in for a nasty surprise ...