Saturday, September 27, 2014

Counting bytes fast - little trick from FSE

 An apparently trivial and uninteresting task nonetheless received some special optimization care within FSE : counting the bytes (or 2-bytes shorts when using the U16 variant).

It seems a trivial task, and it could indeed be realized by a single-line function, such as this one (assuming the table is properly allocated and reset to zero) :

while (ptr<end) count[*ptr++]++;

And it works. So what's the performance of such a small loop ?
Well, when counting some random data, the loop performs at 1560 MB/s on the test system. Not bad.
(Edit : Performance numbers are measured  on a Core i5-3340M @2.7GHz configuration. Benchmark program is also freely available within the FSE project)
But wait, data is typically not random, otherwise it wouldn't be compressible. Let's use a more typical compression scenario for FSE, with a distribution ratio of 20%. With this distribution, the counting algorithm works at 1470 MB/s. Not bad, but why does it run slower ? We are starting to notice a trend here.
So let's go to the other end of the spectrum, featuring highly compressible data with a distribution ratio of 90%. How fast does the counting algorithm run on such data ? As could be guessed, speed plummets, reaching a miserable 580 MB/s.

This is a 3x performance hit, and more importantly, it makes counting a sizable portion of the overall time to compress a block of data (remember that FSE targets speeds of 400 MB/s overall, so if counting alone costs that much, it drags the entire compression process down).

So what is happening ? This is where it becomes interesting : this is an example of CPU write commit delay.

Because the algorithm writes into a table, this data can't be cached within registers. Writing to a table cell means the result must necessarily be written to memory.
Of course, this data is probably cached into L1, and a clever CPU will not suffer any delay for this first write.
The situation becomes tricky for the following byte. In the 90% distribution example, there is a high probability that the same byte gets counted twice in a row. So, when the CPU wants to add +1 to the appropriate table cell, write commit delay gets in the way : +1 means the CPU has to perform both a read and then a write at this memory address. If the previous +1 is still not entirely committed, the cache will make the CPU wait a bit longer before delivering the data. And the impact is quite measurable, as the benchmark shows.

So, how to escape this side-effect ?
A first idea is : don't read & write to the same memory address twice in a row. A proposed solution can be observed in the FSE_count() function. The core loop is (once cleaned up) as follows :

Counting1[*ip++]++;
Counting2[*ip++]++;
Counting3[*ip++]++;
Counting4[*ip++]++;

The burden of counting bytes is now distributed over 4 tables. This way, when counting 2 identical consecutive bytes, they get added into 2 different memory cells, escaping write commit delay. Of course, if we have to deal with 5 or more identical consecutive bytes, write commit delay will still be there, but at least the latency has been spent counting 3 other bytes, instead of being wasted.

The function is overall more complex : more tables, hence more memory to initialize, special-casing of input sizes which are not multiples of 4, and regrouping of all results at the end. So intuitively there is a bit more work involved in this strategy. How does it compare with the naive implementation ?
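To make the structure explicit, here is an illustrative sketch of such a 4-table counting function. This is not the actual FSE_count() source ; the function name and details are mine :

static void count4tables(unsigned count[256], const unsigned char* src, size_t srcSize)
{
    unsigned c1[256] = {0}, c2[256] = {0}, c3[256] = {0}, c4[256] = {0};
    const unsigned char* ip = src;
    const unsigned char* const end4 = src + (srcSize & ~(size_t)3);
    const unsigned char* const end  = src + srcSize;
    int i;

    while (ip < end4) {            /* main loop : 4 bytes per iteration, 4 distinct tables */
        c1[*ip++]++;
        c2[*ip++]++;
        c3[*ip++]++;
        c4[*ip++]++;
    }
    while (ip < end) c1[*ip++]++;  /* tail : input size not a multiple of 4 */

    for (i = 0; i < 256; i++)      /* regroup all results at the end */
        count[i] = c1[i] + c2[i] + c3[i] + c4[i];
}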

When counting random data, FSE_count() reaches 1780 MB/s, which is already slightly better than the naive strategy. But obviously, that's not the target. It's when the distribution gets squeezed that it makes the most difference, with 90% distribution being counted at 1700 MB/s. It still takes a hit, but a much smaller one, and the speed proves much more stable overall.


With an average speed > 1700 MB/s, it may seem that counting is a fast enough operation. But it nonetheless remains the second largest contributor to overall compression time, gobbling up approximately 15% of the budget on its own. That's perceptible, and still a tad too much for such a simple task, if you ask me. But barring another great find, it's the fastest solution I could come up with.

Edit :
Should you be willing to test some new ideas for the counting algorithm, you may find it handy to get the benchmark program which produced the speed results mentioned in this article. The program is part of the "test" directory within the FSE project, as a single file named fullbench.c :
https://github.com/Cyan4973/FiniteStateEntropy/blob/master/test/fullbench.c

Edit 2 :
Thanks to recent comments, notably from gpdNathan and Henry Wong, a new and better reason has been provided to explain the observed delay. Its name is store-to-load forwarding. I would suggest reading the detailed explanation from Nathan Kurz, backed by his cycle-exact Likwid analysis, and the excellent article from Henry on CPU microarchitecture.
In a nutshell, while write commit delay used to be a problem, it should now be properly handled by the store cache on modern CPUs. However, this introduces some new issues, related to the pipeline, serial dependencies and prefetching, with remarkably similar consequences, except that the number of lost cycles at stake is much reduced.

Edit 3 :
Nathan Kurz provided an entry which beats the best speed so far, achieving 2010 MB/s on a Core i5-3340M @ 2.7 GHz. His entry is provided within the fullbench program (as algorithm 202), alongside a simplified version which achieves the same speed but is shorter (algorithm 201).
It's more than 10% better than the initial entry suggested in this blog, and so is definitely measurable.
Unfortunately, these functions use SSE 4.1 intrinsic functions, and therefore offer limited portability perspectives.

Saturday, July 19, 2014

xxHash wider : assessing quality of a 64-bits hash function

 The initial version of xxHash was created in a bid to find a companion error detection algorithm for the LZ4 decoder. The objective was set after discovering that usual implementations of CRC32 were so slow that the overall process of decoding + error check would be dominated by the error check.
The bet was ultimately successful, and borrowed some of its success from Murmurhash, most notably its test tool smHasher, the best of its kind to measure the quality of a hash algorithm. xxHash's speed advantage stems from its explicit usage of ILP (Instruction Level Parallelism) to keep the multiple ALUs of modern CPU cores busy.

Fast forward to 2014 : the computing world has evolved a bit. Laptops, desktops and servers have massively transitioned to 64-bits, while 32-bits remains widely used, mostly within smartphones and tablets. 64-bits computing is now part of the daily experience, and it becomes more natural to create algorithms targeting primarily 64-bits systems.

An earlier demo of XXH64 quickly proved that moving to 64-bits achieves much better performance, just by virtue of wider memory accesses. For some time however, I wondered whether that was a "good enough" objective, or whether XXH64 should also offer some additional security properties. It took the persuasion of Mathias Westerdhal to push me to create XXH64 as a simpler derivative of XXH32, which was, I believe now, the right choice.

XXH64 is therefore a fairly straightforward application of the XXH methodology to 64-bits : an inner loop with 4 interleaved streams, a tail sequence to handle input sizes which are not a multiple of 32, and a final avalanche to ensure all bits are properly randomized. The bulk of the work was done by Mathias, while I mostly provided some localized elements, such as prime constants, shift sequences, and some optimization for short inputs.
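To give a rough idea of that structure, here is a schematic sketch in C. It only illustrates the shape of the inner loop : the constants, rotation amounts, tail handling and avalanche below are placeholders, not the real XXH64 values (see the official xxHash source for those).

#include <stdint.h>
#include <string.h>

#define PRIME_A 0x9E3779B97F4A7C15ULL   /* placeholder constant, NOT the real XXH64 prime */
#define PRIME_B 0xC6A4A7935BD1E995ULL   /* placeholder constant, NOT the real XXH64 prime */

static uint64_t rotl64(uint64_t x, int r) { return (x << r) | (x >> (64 - r)); }

static uint64_t sketch_hash64(const void* data, size_t len, uint64_t seed)
{
    const uint8_t* p = (const uint8_t*)data;
    uint64_t acc[4] = { seed + PRIME_A, seed + PRIME_B, seed, seed - PRIME_A };
    size_t i, nbBlocks = len / 32;

    for (i = 0; i < nbBlocks; i++) {          /* inner loop : 4 independent lanes keep the ALUs busy */
        int lane;
        for (lane = 0; lane < 4; lane++) {
            uint64_t input;
            memcpy(&input, p, 8); p += 8;     /* wider (64-bits) loads are the main speed win */
            acc[lane] = rotl64(acc[lane] + input * PRIME_B, 31) * PRIME_A;
        }
    }
    /* merge the 4 lanes ; the tail (len % 32 bytes) and the final avalanche are omitted here */
    return rotl64(acc[0], 1) + rotl64(acc[1], 7) + rotl64(acc[2], 12) + rotl64(acc[3], 18);
}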

The quality of XXH64 is very good, but that conclusion was difficult to assess. A key problem with 64-bits algorithms is that they require generating and tracking too many results to properly measure collisions (you need 4 billion hashes for a 50% chance of getting 1 collision). So, basically, all tests must end perfectly, with 0 collision. Which is the case.
Since that's a bare minimum, and not a good enough objective to measure 64-bits quality, I also stared at the bias metric. The key idea is : any bit within the final hash must have a 50% chance of becoming 0 or 1. The bias metric finds the worst bit, the one which deviates most from 50%. Results are good, with a typical worst deviation around 0.1%, equivalent to perfect cryptographic hashes such as SHA1.

Since I was still not totally convinced, I also measured each 32-bits part of the 64-bits hash (high and low) as an individual 32-bits hash. The theory is : if the 64-bits hash is perfect, any 32-bits part of it must also be perfect. And the good thing is : with 32-bits, collisions can be properly measured. The results are also excellent, each 32-bits part achieving perfect scores in all possible metrics.

But it was still not good enough. We could have 2 perfect 32-bits hashes coalesced together which are merely repetitions of each other, which of course would not make an excellent 64-bits hash. So I also measured the "Bit Independence Criterion", the ability to predict one bit thanks to another one. On this metric too, XXH64 got a perfect score, meaning no bit can be used as a possible predictor for another bit.

So I believe we have been as far as we could to measure the quality of this hash, and it looks good for production usage. The algorithm is delivered with a benchmark program, integrating a coherency checker, to ensure results remain the same across any possible architecture. It's automatically tested using travis continuous test environment, including valgrind memory access verification.

Note that 64-bits hashes are really meant for 64-bits programs. They get roughly double speed thanks to increased number of bits loaded per cycle. But if you try to fit such an algorithm on a 32-bits program, the speed will drastically plummet, because emulating 64-bits arithmetic on 32-bits hardware is quite costly.

SMHasher speed test, compiled using GCC 4.8.2 on Linux Mint 64-bits. The reference system uses a Core i5-3340M @2.7GHz
Version   Speed on 64-bits   Speed on 32-bits
XXH64     13.8 GB/s          1.9 GB/s
XXH32     6.8 GB/s           6.0 GB/s






Tuesday, May 20, 2014

Streaming API for LZ4

 The LZ4 Streaming API project was started quite some time ago, and has been repeatedly delayed, as other priorities stepped in the way. To be fair, one important delaying factor was the difficulty of defining a "clean enough" API, something that would be simple to use and understand, while also providing the level of fine-tuning required by advanced programmers (typically within embedded environments).

I feel quite close today to breaking this shell, with an interface definition which matches my idea of a "clean enough" API. So let's share some preliminary results.

Current streaming interface

The current streaming API exposes the following functions :

void* LZ4_create (const char* inputBuffer);
int   LZ4_compress_continue (void* LZ4_Data, const char* source, char* dest, int inputSize);
char* LZ4_slideInputBuffer (void* LZ4_Data);
int   LZ4_free (void* LZ4_Data);

Although it works as intended, this API introduces some critical limitations.

First, there is a need to create a data structure which tracks the behavior of the stream.
This is void* LZ4_data, which will be generated by LZ4_create().

It looks fine, but a first issue is that allocation happens inside LZ4_create(). In some circumstances, this may not be acceptable : sometimes allocation must be fully controlled by the host process. For example, standard functions such as malloc() may be unavailable. For this use case, 2 more functions have been created :

int LZ4_sizeofStreamState(void);
int LZ4_resetStreamState(void* state, const char* inputBuffer);

It looks fine too, but is unfortunately still not usable for static allocation purposes (on the stack, for example).

The second issue is the const char* inputBuffer argument, specified at creation time, which remains fixed during the entire compression process. It's not possible to jump between memory segments. This is a major limitation, preventing "in-place" compression of scattered memory segments.

LZ4_compress_continue() looks fine as a function definition, but what is less so is the documented requirement : up to 64KB of previously processed data must stand *before* the data to be compressed. This is a major constraint, typically requiring data to be prepared into an intermediate buffer (which, in some circumstances, might prove an unaffordable burden).

Then there is LZ4_slideInputBuffer(), which is arguably not at the right responsibility level. Its mission is to copy the last 64KB of previous source data to the beginning of the input buffer. It exists because the content of void* LZ4_data is not accessible.

No function is defined to simply load a dictionary : to achieve that goal, one has to compress a segment using LZ4_compress_continue(), throw away the result, and compress the next segment using the data accumulated so far.

To summarize, yes it can work, but it's cumbersome. One has to accept the idea that data to compress must be prepared into an intermediate buffer where above conditions can be met.
It's not convenient, and may sometimes not be possible, for example due to allocation limitations.
It also has a negative impact on performance, due to memory copy operations overhead.

New streaming interface definition

While defining the streaming API, I constantly struggled to find the right balance between a "fully automated" API and a "user-controlled" one. The "fully automated" API would take charge of all kinds of workloads, generate data respecting the LZ4 Streaming Format with full headers and markers, and take care of allocation for required buffers. On the other hand, the "user-controlled" API would serve the needs of host programs which require full control over resource allocation.

I therefore settled on providing both.
The "user-controlled" version will be part of LZ4 library. It's the simpler one, and only takes care of generating compressed block format, chained or independent. It depends on the host process to pay attention to all memory operations (allocation, position and availability) and to provide its own custom format linking the successive blocks.

The "fully automated" API will be part of a new library, which will be called LZ4S for "LZ4 Streaming". It is basically a user program of the previous API.

In this blog post, we will therefore concentrate on the first API, since it is also an underlying foundation for the 2nd one.

typedef struct { unsigned int table[LZ4_STREAMSIZE_U32]; } LZ4_stream_t;

int LZ4_compress_continue (void* LZ4_stream, const char* source, char* dest, int inputSize);

OK, let's focus on the most important differences with current API.

An LZ4_stream_t structure is exposed. It is provided to give a better control over memory management to the host program. For example, allocation can be static (stack, global) or dynamic, and use any user-defined allocation function.

Obviously, the host program is both in charge of allocating and freeing this memory space. It may also duplicate it "at will", which is a good idea for "static dictionary compression" scenarios.

Cleaning (or resetting) LZ4_stream_t is only a matter of memset()-ing it to zero. Doing so once before using this structure is a requirement.

LZ4_stream_t's size is controlled by the value LZ4_STREAMSIZE_U32, which is checked at compile time thanks to a static assert piece of code, as already used within xxHash. It mostly depends on LZ4's hash table (whose typical size is about 16KB). The hash table size is a parameter controlled by the defined value LZ4_MEMORY_USAGE. Up to now, this define was present in lz4.c. To be coherent with the new interface, it will be moved to lz4.h instead. I don't foresee any issue with that.

LZ4_stream_t is described as a table of unsigned int. This is intentional. The real usage of this memory bank remains hidden inside lz4.c. This way, internal variables within the structure cannot be used as a kind of implicit interface contract, allowing future modifications to happen without breaking compatibility with existing programs.

Once the streaming structure is created, you can start to populate it by loading a static dictionary using :

int LZ4_loadDict (void* LZ4_stream, const char* source, int size);

This part is optional. Loading a dictionary flushes any prior data reference from LZ4_stream_t , if there is any, so you may also use this function to initialize an LZ4_stream_t structure by simply setting a size of 0.

LZ4_compress_continue() looks the same as previously, and is indeed compatible, but a major difference is that it no longer requires the next data block to be compressed "right after" the previous one. Previous and next data can be anywhere in memory. The LZ4_stream_t structure will keep track of it automatically.
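To make the workflow concrete, here is a usage sketch based on the interface described in this post. Keep in mind this API is still preliminary, so names and details may differ from the final release ; buffer names are illustrative and dst1/dst2 are assumed large enough (e.g. sized with LZ4_compressBound()).

#include <string.h>
#include "lz4.h"

/* Compress 2 scattered memory segments as a chained stream,
   with the tracking structure allocated on the stack. */
int compressTwoSegments(const char* seg1, int size1, const char* seg2, int size2,
                        char* dst1, char* dst2, const char* dict, int dictSize)
{
    LZ4_stream_t lz4s;
    int c1, c2;

    memset(&lz4s, 0, sizeof(lz4s));              /* mandatory reset before first use */
    LZ4_loadDict(&lz4s, dict, dictSize);         /* optional : prime the stream with a dictionary */

    c1 = LZ4_compress_continue(&lz4s, seg1, dst1, size1);   /* seg1 and seg2 can live */
    c2 = LZ4_compress_continue(&lz4s, seg2, dst2, size2);   /* anywhere in memory     */

    return (c1 > 0 && c2 > 0) ? c1 + c2 : 0;     /* total compressed size, or 0 on error */
}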

One strong assumption of LZ4_compress_continue() is that previously compressed data is still available. Unfortunately, this might not be the case, and this function cannot guess that.
Should previously compressed data be no longer accessible, for example because you need the space for something else, or you can't guarantee its availability, it's possible to solve that situation :
  • You can "save" the relevant data segment from previous block (typically, the last 64KB, it can also be less than that) into a "safe" place. At which point, you will need to indicate this change of position.
int LZ4_moveDict (void* LZ4_stream, char* safeBuffer, int dictSize);

The memory space available at char* safeBuffer must be at least dictSize, since it is the amount of data LZ4_moveDict() will copy there (Note : the maximum amount of data that will be copied is 64KB, even if dictSize is larger).


Decompression

The current streaming decompression API is defined as follows :

int LZ4_decompress_safe_withPrefix64k (const char* source, char* dest, int compressedSize, int maxOutputSize);
int LZ4_decompress_fast_withPrefix64k (const char* source, char* dest, int originalSize);

Its documented requirement is that previously decompressed data must stand *right before* the memory address where the next data block is going to be decompressed. It works, but implies the creation of a temporary buffer to match the requirement.

The new streaming API gets rid of this limitation, allowing previous data to stand anywhere within accessible memory :

int LZ4_decompress_safe_usingDict (const char* source, char* dest, int compressedSize, int maxOutputSize, const char* dictStart, int dictSize);
int LZ4_decompress_fast_usingDict (const char* source, char* dest, int originalSize, const char* dictStart, int dictSize);

The dictionary definition is part of the argument list (const char* dictStart, int dictSize), therefore there is no need for a tracking structure, in contrast with compression.
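For completeness, here is a small usage sketch of the decompression side, following the prototype given above (the wrapper name is mine) :

#include "lz4.h"

/* Decompress a block whose dictionary (previously decompressed data)
   lives elsewhere in memory.
   Returns the number of decompressed bytes, or a negative value on error. */
int decompressWithExternalDict(const char* compressed, int compressedSize,
                               char* dst, int dstCapacity,
                               const char* dict, int dictSize)
{
    return LZ4_decompress_safe_usingDict(compressed, dst,
                                         compressedSize, dstCapacity,
                                         dict, dictSize);
}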



A first implementation of this API is currently available in the "dev" branch of LZ4 on github. It is still at an early stage, so it's probably the right time to provide your comments, should you have any.

[Edit] Added : LZ4_moveDict() as a potential replacement of LZ4_SetDictPos()
[Edit] Added : Paragraph on decompression
[Edit] Modified : move to LZ4_moveDict()
[Edit] Interface definition converges towards LZ4_compress_continue()
[Edit] "streaming" branch now merged into "dev" branch

Monday, April 7, 2014

Taking advantage of inequalities to provide better compression

 When starting my investigation of ANS properties, in November 2013, I stumbled upon the fact that positions in the table are not equivalent. Basically, the low states are more "worthy" than the higher states. Back then, it wasn't properly explained nor forecast by the theory.

At first look, it may seem like a nuisance : usual arithmetic coders offer clean flat probabilities, and this uneven layout seems like added "noise" on top of an already complicated algorithm. On second thought, it looks like a potential opportunity, even if a complex one, to improve accuracy by taking advantage of such differences.
With many other issues to solve to get a working ANS decoder out, this issue was left for a later investigation. And here we are today.

The same conclusion was later demonstrated by Charles Bloom, within his series of blog posts on ANS. He produced a graphic, which proved much clearer than the stats produced the year before.

Probability of each state value

Note that the "clean graph" is obtained through an average of multiple "synthetic" inputs, not real files, and requires the "precise distribution algorithm" for the layout of symbols. If you try to reproduce this result on your own, you are likely to obtain a graph which roughly produces the same perspective, but not with the same accuracy.

As stated by Charles, the graph follows quite closely a 1/X distribution. It's an important conclusion, because it hints towards a formula able to "guess" the best spot for a singleton probability. It seems that, as long as P is between sqrt(2) and sqrt(2)/2, we can find its optimal position in the table.

It's an interesting result that we'll try to exploit. I ran multiple tests with different P values, looking for the optimal position in the table, and the result was relatively close to the 1/X formula (the best position is usually slightly different, but the overall difference in file size is extremely small).

  P    1/X  Best  Difference
0.707  511  499    4 bytes
0.850  339  351    2 bytes
1.000  212  209    2 bytes
1.150  117  115    1 byte
1.300   44   43    2 bytes
1.414    0    0    0 bytes

So we've got a way here to improve accuracy of singleton probabilities (those for which the normalized frequency is 1). Can we use that ?

Unfortunately, we must also take into consideration the increased header cost needed to specify the position accurately. If a symbol probability is normalized to 1 (a singleton), it means its number of occurrences is relatively small. Therefore, we can't spend 10 bits to provide the exact best position of a rare symbol in the table : it would be counter-productive.

In order to experiment, I targeted the easy gains, and took a look at low-probability symbols, those for which realP < 1. Such cases are unfortunately quite common. They are also the most wasteful ones, since they require the minimum normalized frequency, which is 1 (even if realP = 0.01, we can't round that down to zero, otherwise the compression would be lossy).
One way to minimize their loss would be to put them at the top of the table, where the probability is lowest (approximately sqrt(2)/2 ~= 0.707). The end effect is somewhat equivalent to what Charles Bloom did within his own tANS version, but in a more "directive" manner, since we are cherry picking which symbols must fit at the top of the table.
The impact on header size is going to be very moderate, since now we just have to distinguish between a "normal 1" and a "low 1".
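For illustration, here is a minimal sketch of this cherry-picking idea : a marker (here LOWPROB) flags the "low 1" symbols, and they are parked at the highest positions of the table during symbol spreading. The names, the marker value and the surrounding structure are mine, not the actual FSE code.

#include <stddef.h>

#define LOWPROB (-1)   /* hypothetical marker for "low 1" symbols (realP < 1) */

/* normCount[s] : normalized frequency of symbol s, or LOWPROB for "low 1" symbols.
   table[]     : symbol layout of size tableSize (a power of 2). */
static void parkLowProbSymbols(short* normCount, int nbSymbols,
                               unsigned char* table, size_t tableSize)
{
    size_t highPos = tableSize - 1;
    int s;
    for (s = 0; s < nbSymbols; s++) {
        if (normCount[s] == LOWPROB) {
            table[highPos--] = (unsigned char)s;  /* top of the table : lowest probability (~0.707) */
            normCount[s] = 1;                     /* still occupies a single state */
        }
    }
    /* the remaining symbols are then spread over table[0..highPos] as usual */
}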

Let's check if this method produces some worthy gains :

Table  Low1     HC     Fast3  Gains    Fast1
4K    435286  435426  435398  0.026%  435723
2K    435677  436041  436072  0.091%  436783
1K    436691  437756  437875  0.271%  439181

Indeed, this is much better than anticipated. This new version handily beats the "perfect normalization algorithm", by taking advantage of the unequal probabilities within the ANS table. It is therefore now integrated within the open source version of FSE on Github.

A great property is that this improved table initialization algorithm is achieved with the same speed as before, so it's basically an improvement we get for free.

Another important property is that the gain improves as the table size decreases. It's great : it means we can consider reducing the table size, hence its memory cost, while keeping the compression loss in check. Notice how we get better compression performance today than the earlier version (Fast1) produced using tables twice as big !

It's an important benefit, and it enables some critical RAM saving, essential to improve Zhuff performance.
But let's keep that part for a later blog post...

Wednesday, April 2, 2014

Ultra-fast normalization

 Today's objective is to use the lessons learned when defining the perfect normalization algorithm to design a new, better, fast normalization algorithm. As stated last time, perfect normalization works fine, it's just a bit too slow to be used within FSE.

The main learning, in my opinion, is that the fractional rest doesn't really matter. Or more specifically, it only matters for small p values, where the round down cost elasticity is highest.
Also of interest, the "threshold" level is not a stable 0.5 across the board : it's a p-specific threshold, starting at 0.443 for p==1, and then converging towards 0.5.

Elasticity of the "round down cost"


The list of "round down" selection also shows that only the first few decisions make a (large) difference, targeting almost exclusively small p values with small fractional rest.
After these first "easy gains", all other "round down" basically produce the same cost. The higher the value, the most stable the cost, in accordance with the graph "round down cost elasticity".

smallest :  76 : cost  413.0 bits  -  count    413 : 1.10  (from 2 -> 1)
smallest :  70 : cost  413.0 bits  -  count    413 : 1.10  (from 2 -> 1)
smallest :  89 : cost  416.0 bits  -  count    416 : 1.11  (from 2 -> 1)
smallest :  87 : cost  440.5 bits  -  count    753 : 2.01  (from 3 -> 2)
smallest :  63 : cost  444.0 bits  -  count    759 : 2.02  (from 3 -> 2)
(..)

smallest : 110 : cost  544.1 bits  -  count  40919 : 109.01  (from 109 -> 108)
smallest : 117 : cost  544.2 bits  -  count  16031 : 42.71  (from 43 -> 42)
smallest : 115 : cost  544.4 bits  -  count  36788 : 98.00  (from 98 -> 97)
smallest : 104 : cost  544.6 bits  -  count  37561 : 100.06  (from 100 -> 99)
smallest : 116 : cost  544.7 bits  -  count  50027 : 133.27  (from 133 -> 132)
smallest :  32 : cost  544.7 bits  -  count 125551 : 334.47  (from 333 -> 332)
smallest : 100 : cost  544.8 bits  -  count  26623 : 70.92  (from 71 -> 70)
(...)


All these results point towards a simple idea :
could we just define a threshold under which all p values would be "rounded down", and then, fine-tune the remaining difference by attributing it to the largest value, since it makes almost no cost difference which p is rounded down as long as it is large enough (stable cost) ?

OK, let's try that.
The initial idea is to use as a threshold the converging value (1.443 = 1/ln(2), although its relative position varies depending on p, from 0.443 to 0.5). It works, but we can do better.

One important side-effect regarding weight distribution concerns the very small values, resulting in a realP < 1, which must be rounded up to p=1 (the minimum weight). The number of such small values varies greatly depending on the data type and the size to compress. We could be lucky and have no small realP, or inversely have hundreds of them. Between those extremes, the most likely situation is that we will get a few of them.
Because of these round ups, there is a hidden cost for all other symbol probabilities, which will have to "make room" (hence, round down) for them. To take this collective cost into consideration, we will simply raise the decision bar on the graph.

Round up/down threshold

Note that since the "round down cost elasticity" becomes very small as p becomes big (it ends up being a very thin line), it means that above a certain value of p, all p will simply be rounded down, irrespective of fractional rest.

It makes for a very simple algorithm, which we expect to be both fast and efficient.
Let's put that to the test.
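Here is an illustrative sketch of this threshold idea, written from the description above. It is not the actual FSE code : the names, the constant 0.5 bar and the handling of the leftover are simplifications.

#include <math.h>

/* count[s] : occurrences of symbol s ; total : sum of all counts ;
   tableLog : log2 of the table size. Fills norm[] so that the sum of
   normalized frequencies equals the table size. */
static void normalizeFast3_sketch(short* norm, const unsigned* count,
                                  unsigned maxSymbol, unsigned total, unsigned tableLog)
{
    const unsigned tableSize = 1u << tableLog;
    int remaining = (int)tableSize;
    unsigned s, largest = 0, largestCount = 0;

    for (s = 0; s <= maxSymbol; s++) {
        double realP, rest;
        if (count[s] == 0) { norm[s] = 0; continue; }
        realP = (double)count[s] * tableSize / total;
        if (realP < 1.0) {
            norm[s] = 1;                       /* can't go below 1 : that would be lossy */
        } else {
            rest = realP - floor(realP);
            /* round up only when the fractional rest clears the (raised) decision bar ;
               for large p the elasticity band is so thin that this rarely triggers */
            norm[s] = (short)floor(realP) + (rest > 0.5 ? 1 : 0);
        }
        remaining -= norm[s];
        if (count[s] > largestCount) { largestCount = count[s]; largest = s; }
    }
    /* attribute whatever is left (positive or negative) to the largest symbol :
       at high p, the round down cost is nearly flat, so this is the cheapest place */
    norm[largest] = (short)(norm[largest] + remaining);
}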

First, let's have a look at final compression size.
As usual, we'll use "book1" as a single-block compression test :

Table  HC    Fast3   Fast2
 4K  435426  435398  435459 
 2K  436041  436072  436138
 1K  437756  437875  437952

(Note : Fast3 is the newly designed fast approximation algorithm, Fast2 was designed in this post)

As can be seen, Fast3 consistently beats Fast2, albeit by a small margin, since there is very little gain remaining.

The most important expected benefit is on the speed front : since the algorithm is much simpler, it is expected to run faster.
To measure this effect, I've modified FSE so that it runs its normalization loop several thousand times per block, hence ensuring normalization becomes the dominant operation. The final result should not be interpreted as an absolute speed, but rather a relative measure.

                  Fast3  Fast2
Norm. SpeedTest : 25.6   6.4

This one is pretty clear : the new algorithm is approximately 4x faster than the previous one.

Faster, stronger, simpler, that's a good deal.
The new algorithm is now integrated into the open-sourced version of FSE hosted at github.

------------------------------------------------------------
Epilogue :
Astute readers may have noted that Fast3 compresses better than HC with the 4K table setting.
How could that be ? HC is supposed to be the perfect, hence unbeatable, normalization !

Well, an answer is, the perfect normalization assumes a perfect entropy coder, which FSE is not, it's only a good approximation of it.

On the "book1" example, the threshold bar of Fast3 produces distributions which incidentally benefit from FSE irregularity.
Which introduce a new question : could we intentionally use these irregularities to boost compression ratio even further ?

Well, let's keep that question for a future blog post...

Friday, March 7, 2014

Perfect Normalization

 People keeping an eye on the github repository of FSE may have noticed the apparition of a new function, called FSE_normalizeCountHC(). Let's detail it.

The previous post presented a new technique to improve the accuracy of the normalization step, while remaining fast. Today's post will concentrate on making the normalization step "optimal", keeping an eye on speed as a secondary objective.
(Note : this objective has also been covered on Charles Bloom's blog, using a different explanation and formula, but with essentially the same result. It's a good read.)

Reaching the "optimal" status means no other probability distribution can achieve a better compression. This is obviously measured against a "theoretical optimal entropy coder", which FSE is not (it's a close approximation of it). But still, any gain from a better distribution is likely to translate into better FSE compression.

To achieve the best possible distribution, we need to invert our reasoning : instead of distributing the remaining "bonus points", we will round up everyone, and then consider that a "round down" is a cost. This cost is necessary : we can't "round up" all probabilities, otherwise the sum of probabilities will be bigger than the table. By design, some of these probabilities will have to be rounded down.
We will simply select the ones which cost the least.

This requires defining a "cost function".
The cost function is relatively easy to understand :
Let's suppose we have a symbol S, which appears Qs times in the data source, resulting in a normalized frequency of P (out of tableSize slots). The cost per symbol can be calculated using Shannon's formula :
Cp = -log2(P / tableSize);

Now, we want to evaluate the cost of downgrading the frequency of S from P to P-1.
The new cost per symbol will be higher : C(p-1) = -log2((P-1) / tableSize);
So the total cost of this operation will be : Qs x (C(p-1) - Cp)

We can calculate this cost for each and every symbol present in the data source, then pick the one which costs the least, downgrade its frequency by one, update its cost, select the new symbol with the lowest cost, rinse and repeat, and stop when the sum has reached the required total (= table size).
By construction, this method is guaranteed to provide the best possible normalized distribution.
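To make the procedure concrete, here is a simplified sketch of this greedy selection. This is not the actual FSE_normalizeCountHC() code : the names and the plain O(n) inner scan are mine, chosen for clarity over speed.

#include <math.h>

/* count[s] : occurrences of symbol s ; srcSize : sum of all counts. */
static void normalizeHC_sketch(short* norm, const unsigned* count,
                               unsigned maxSymbol, unsigned srcSize, unsigned tableSize)
{
    unsigned s;
    int allocated = 0;

    /* 1) round every present symbol up (minimum 1) */
    for (s = 0; s <= maxSymbol; s++) {
        if (count[s] == 0) { norm[s] = 0; continue; }
        norm[s] = (short)ceil((double)count[s] * tableSize / srcSize);
        allocated += norm[s];
    }

    /* 2) while the total overshoots, round down the symbol whose downgrade is cheapest :
          cost = Qs * (C(p-1) - Cp) = Qs * (log2(P) - log2(P-1)) */
    while (allocated > (int)tableSize) {
        double bestCost = 0;
        unsigned bestS = 0;
        int found = 0;
        for (s = 0; s <= maxSymbol; s++) {
            double cost;
            if (norm[s] <= 1) continue;      /* can't go below 1 without becoming lossy */
            cost = (double)count[s] * (log2((double)norm[s]) - log2((double)norm[s] - 1));
            if (!found || cost < bestCost) { bestCost = cost; bestS = s; found = 1; }
        }
        norm[bestS]--;
        allocated--;
    }
}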

So, what kind of result does it provide ?
First, a look at "book1" benchmark :

4K table, Fast normalization : 435 459 bytes
4K table, Perfect normalization : 435 426 bytes (gain : 0.008%)

2K table, Fast normalization : 436 138 bytes
2K table, Perfect normalization : 436 041 bytes (gain : 0.022%)

1K table, Fast normalization : 437 952 bytes
1K table, Perfect normalization : 437 756 bytes (gain : 0.045%)

As can be seen, the remaining gain is very small. That's because we had already grabbed most of the benefits with the latest improvements.
Still, it can be interesting for those willing to reach the best possible compression ratio, since the "normalization" cost only applies during compression : it's totally free on the decompression side.

Looking further, the new function also provides a debug mode which details its selection process. It's very interesting, so let's have a look at it, for example compressing "book1" with 2K tables :

smallest :  76 : cost  413.0 bits  -  count    413 : 1.10  (from 2 -> 1)
smallest :  70 : cost  413.0 bits  -  count    413 : 1.10  (from 2 -> 1)
smallest :  89 : cost  416.0 bits  -  count    416 : 1.11  (from 2 -> 1)
smallest :  87 : cost  440.5 bits  -  count    753 : 2.01  (from 3 -> 2)
smallest :  63 : cost  444.0 bits  -  count    759 : 2.02  (from 3 -> 2)
smallest :  69 : cost  444.0 bits  -  count    444 : 1.18  (from 2 -> 1)
smallest :  59 : cost  445.7 bits  -  count    762 : 2.03  (from 3 -> 2)
smallest : 106 : cost  468.0 bits  -  count    468 : 1.25  (from 2 -> 1)
smallest :  33 : cost  486.7 bits  -  count    832 : 2.22  (from 3 -> 2)
smallest :  83 : cost  497.2 bits  -  count    850 : 2.26  (from 3 -> 2)
smallest :  62 : cost  498.0 bits  -  count    498 : 1.33  (from 2 -> 1)
smallest :  60 : cost  498.0 bits  -  count    498 : 1.33  (from 2 -> 1)
smallest :  79 : cost  500.7 bits  -  count    856 : 2.28  (from 3 -> 2)
smallest :  78 : cost  502.0 bits  -  count    502 : 1.34  (from 2 -> 1)
smallest : 120 : cost  503.7 bits  -  count    861 : 2.29  (from 3 -> 2)
smallest :  84 : cost  517.1 bits  -  count   1966 : 5.24  (from 6 -> 5)
smallest : 113 : cost  520.0 bits  -  count    520 : 1.39  (from 2 -> 1)
smallest :  46 : cost  530.6 bits  -  count   7170 : 19.10  (from 20 -> 19)
smallest :  39 : cost  533.5 bits  -  count   6470 : 17.24  (from 18 -> 17)
smallest : 107 : cost  533.9 bits  -  count   4994 : 13.30  (from 14 -> 13)
smallest : 118 : cost  535.7 bits  -  count   5382 : 14.34  (from 15 -> 14)
smallest :  98 : cost  537.8 bits  -  count   9132 : 24.33  (from 25 -> 24)
smallest : 115 : cost  538.8 bits  -  count  36788 : 98.00  (from 99 -> 98)
smallest :  10 : cost  538.9 bits  -  count  16622 : 44.28  (from 45 -> 44)
smallest : 110 : cost  539.1 bits  -  count  40919 : 109.01  (from 110 -> 109)
smallest : 104 : cost  539.2 bits  -  count  37561 : 100.06  (from 101 -> 100)
smallest :  44 : cost  540.2 bits  -  count  10296 : 27.43  (from 28 -> 27)
smallest : 109 : cost  540.3 bits  -  count  14044 : 37.41  (from 38 -> 37)
smallest : 116 : cost  540.6 bits  -  count  50027 : 133.27  (from 134 -> 133)
smallest : 111 : cost  540.8 bits  -  count  44795 : 119.33  (from 120 -> 119)
smallest :  97 : cost  541.3 bits  -  count  47836 : 127.43  (from 128 -> 127)
smallest : 119 : cost  541.4 bits  -  count  14071 : 37.49  (from 38 -> 37)
smallest : 108 : cost  541.4 bits  -  count  23078 : 61.48  (from 62 -> 61)
smallest :  32 : cost  541.5 bits  -  count 125551 : 334.47  (from 335 -> 334)
smallest : 105 : cost  542.0 bits  -  count  37007 : 98.59  (from 99 -> 98)
smallest : 114 : cost  542.3 bits  -  count  32889 : 87.62  (from 88 -> 87)
smallest : 101 : cost  542.8 bits  -  count  72431 : 192.96  (from 193 -> 192)
smallest :  32 : cost  543.1 bits  -  count 125551 : 334.47  (from 334 -> 333)
smallest : 102 : cost  543.2 bits  -  count  12237 : 32.60  (from 33 -> 32)
smallest :  45 : cost  543.8 bits  -  count   3955 : 10.54  (from 11 -> 10)
smallest : 110 : cost  544.1 bits  -  count  40919 : 109.01  (from 109 -> 108)
smallest : 117 : cost  544.2 bits  -  count  16031 : 42.71  (from 43 -> 42)
smallest : 115 : cost  544.4 bits  -  count  36788 : 98.00  (from 98 -> 97)
smallest : 104 : cost  544.6 bits  -  count  37561 : 100.06  (from 100 -> 99)
smallest : 116 : cost  544.7 bits  -  count  50027 : 133.27  (from 133 -> 132)
smallest :  32 : cost  544.7 bits  -  count 125551 : 334.47  (from 333 -> 332)
smallest : 100 : cost  544.8 bits  -  count  26623 : 70.92  (from 71 -> 70)
smallest : 111 : cost  545.4 bits  -  count  44795 : 119.33  (from 119 -> 118)
smallest :  97 : cost  545.6 bits  -  count  47836 : 127.43  (from 127 -> 126)
smallest : 101 : cost  545.7 bits  -  count  72431 : 192.96  (from 192 -> 191)
smallest : 103 : cost  546.2 bits  -  count  12303 : 32.78  (from 33 -> 32)

OK, it may seem a bit daunting to analyze, so let's go for the most interesting conclusions :
1) The first selected symbols make a big difference, since they really cost much less (starting with 413 bits). Then the costs tend to converge, and there is very little difference remaining between the various available choices between 535 and 545 bits.
Basically, it means most of the gains of the "HC" version come from correctly selecting the first few symbols.
2) These first symbols tend to have "low probability" : a lot of 2->1 transitions, some 3->2, and so on. Starting with 530 bits, they completely disappear, and we only have higher probability symbols.
3) Some higher probability symbols appear several times. Note for example symbol 32, which is rounded down several times, reaching 332, while its "real" probability should be 334.47. It means it's better to downgrade it several times, rather than to downgrade a lower probability symbol a single time.

Point 2 was the most surprising to me. Remember that last time we showed a graphic suggesting that the low probability symbols were prone to the largest "round down" losses.


This is because this graphic was only considering the "worst case" situation, with "real probability" being infinitely close to the next probability step (e.g. 1.999, then 2.999, etc.)

The new result implies that low probability symbols can also generate the lowest amount of losses, when they are at the other end of the spectrum (e.g. 1.01).

This led me to try to visualize the "level of freedom" of the "round down cost" (all possible values from 1.001 to 1.999, then from 2.001 to 2.999, and so on). It gave the following graphic :



As could be guessed, this "level of freedom" is much higher at low probability than at high probability. In fact, the "freedom area" quickly becomes a very thin line beyond value 300.
This corresponds to intuition :
From 1.01 to 1.99, the level of freedom is a factor 2.
From 2.01 to 2.99, the level of freedom is now 50%.
and so on.
By the time it reaches probability 1000, the level of freedom is basically 1/1000th, which is almost nothing.



This is the zoomed version of the same graphic, concentrating on the first few values. As said last time, this graphic will remain the same irrespective of the total size of the table.

So that means most of the attention must be paid to the symbols with the lowest probabilities, where the difference between a low and a high "fractional rest" can make a large cost difference, and which should therefore be selected carefully.
High probability symbols don't have such a level of freedom, and therefore their "fractional rest" has very little importance, and almost no impact on total cost.

Another interesting point is that the graphic converges towards 1/ln(2) = 1.443, not 1.5. Which means that the intuitive 50% threshold is not true, at least not for low-probability symbols.

Unfortunately, the new "HC" version is also slow. Not "horribly slow", just slow enough to be noticeable in benchmark, reducing total speed by as much as 5-10%, for a negligible compression gains. That means the "fast" version remains the default one, while the "HC" version is provided in option.

Nonetheless, the above findings are very interesting, and may be useful in the quest for a better fast renorm algorithm.

Sunday, March 2, 2014

Better normalization, for better compression

 A secondary issue to deal with when implementing an FSE entropy coder is the requirement to normalize statistics. Basically, it consists in attributing a probability to each symbol, so that the sum of all probabilities = table size, which is itself a power of 2.
It is in no way specific to FSE, and any other entropy coder with the ability to represent fractional bits, such as arithmetic coders, would face the same issue.

The process seems fairly straightforward : get a ratio R = destSize / origSize, with destSize = table size, and origSize = input size. Then, simply multiply each symbol occurrence by this ratio to get its scaled probability :
P[s] = Occurrence[s] * R;

We just have 2 little details to solve with this process :

1) R is a floating-point number (since in practice the input size is not an exact multiple of the table size). It basically means all calculations are going to be float, which is hardly ideal for speed.
This part is the easiest to solve : just use some fixed-point arithmetic. It's particularly simple with 64-bits arithmetic : scale the table size to a large number, such as 2^60, and R becomes a large integer, with which you can multiply symbol occurrences. It's accurate enough. (A short sketch is given after point 2 below.)
32-bits arithmetic is fine too, but requires paying a bit more attention to limit situations, such as large input sizes.

2) P[s] is unlikely to be a natural number. You will end up with something like P[s] = 10.1, or P[s] = 10.9. So the natural question is : to which value should P[s] be rounded ?
With the above examples, it looks simple : 10.1 should be rounded to 10, 10.9 should be rounded to 11.
Except that, if you apply this simple logic to all symbols, the sum of P[s] is not going to be equal to the table size. It's going to be close, but not exactly this number. Which is unforgiving : this strict condition is mandatory, otherwise the compressed data will be corrupted.
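Coming back to point 1 for a moment, here is the kind of fixed-point scaling it suggests (illustrative only, not the actual FSE code) : the ratio R is pre-scaled to 2^62, so a single 64-bits multiply and shift replaces the float computation.

#include <stdint.h>

/* proba[s] receives floor(count[s] * tableSize / srcSize), computed without floats. */
static void scaleToTable(unsigned* proba, const unsigned* count,
                         unsigned maxSymbol, unsigned srcSize, unsigned tableLog)
{
    const unsigned scale = 62 - tableLog;                   /* keep plenty of precision headroom */
    const uint64_t step  = ((uint64_t)1 << 62) / srcSize;   /* R, as a fixed-point integer */
    unsigned s;
    for (s = 0; s <= maxSymbol; s++)
        proba[s] = (unsigned)(((uint64_t)count[s] * step) >> scale);
}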

There are a few ways to handle this situation.
A "panic" solution would simply consists in adding or removing some points to random P[s] until reaching the desired total. As long as all present symbols get a P[s]>=1, it's going to work. Compression will suffer, but compressed data will be decodable.

This is obviously not very optimal. So let's try to find something better, keeping in mind the requirement for speed.

The initial algorithm shipped within FSE was completely speed oriented : it would attribute a probability to each symbol in a single pass, guaranteeing that all present symbols get a probability >= 1, and that their sum is necessarily equal to the table size.
To reach this result, there were 2 complementary operations :
1) Statistics were altered, to guarantee that all symbols present get a probability >= 1. It simply consists in adding a constant to all symbol statistics, so that even when a symbol is present only once, (Occurrence[s] + base) * R >= 1.
2) The "fractional rest" was tracked. P[s] was rounded down by default, but would be rounded up when the cumulative sum of the current and previous fractional rests exceeded 1. This idea is somewhat equivalent to dithering algorithms, but in 1D space.

It works. It's not optimal, but damn fast.
Note that we are not talking about a large loss here. It would typically be ~0.05% using a 4K table. Almost negligible. However, I also discovered that the error became larger as the table size was reduced (up to 0.20% for a 2K table, for example).
Since one of my current objectives is to use smaller tables, I figured it was a good idea to investigate.

Both mechanisms above introduce some distortion. The most important one is the statistics alteration. The effect is more pronounced when sourceSize / tableSize gets bigger, as one would expect, since the base alteration has to be increased.
The second effect is less harmful, but still not optimal : the "round up" bonus is more likely to be attributed to symbols with a large "fractional rest", such as 10.9, but it's not guaranteed. If the carried sum of previous fractional rests is very small, for example 0.05, the new sum will be 0.95, and therefore will not trigger the round up bonus. On the other hand, the next symbol, even with a small fractional rest such as 10.1, would push the total to 1.05, and therefore trigger the "round up" bonus.

It's desirable to make a more accurate selection of which symbol probabilities must be rounded up or down.
So the algorithm is going to be altered. The new idea is : we round down all symbol probabilities by default, then we provide the "round up bonus" to the symbols which deserve it most.

We just introduce 2 new problems.

First, we can't round down probabilities whose real value is < 1. These symbols have reached the absolute minimum : we must grant them a probability of 1.

That means we now have 2 situations : either the "round down" of all symbols with probability > 1 is prevalent, and we still have some "bonus points" to distribute, or the "round up" of symbols with probability < 1 is prevalent, and we must now remove some more points from other symbols.

Let's suppose we are in situation 1 : we still have a few bonus points to distribute.
So the initial reasoning is still valid.
This leads us to the second problem : which symbol "deserves it most" ? How do we translate that ?

An intuitive idea is that symbols with the highest "fractional rest" deserve it more. Simply said, 10.9 should be rounded up rather than 10.1.
This is the simple case. But more complex choices have to be made too. For example, suppose I've got only 1 bonus point : which probability should be rounded up, 1.9 or 10.9 ?

To investigate this question, I started to look at the impact of a "round down" error. Suppose I give symbol A a probability of 1 instead of 2, and symbol B a probability of 10 instead of 11 : what is the impact ?
First let's look at the impact on the unit cost of each symbol :



This is intuitive : the smaller the probability, the more a "round down" error will have an impact on symbol cost.
But it doesn't tell the whole story. The symbols with smaller probabilities are, by definition, less numerous than symbols with higher probabilities. So it could be that, even though the "unit cost" of more probable symbols is less impacted, multiplied by their number of occurrences, the total error may be larger.

So let's have a look at how much compressed data is generated by these symbols, simply by multiplying their unit cost by their occurrence :



Well, it's interesting. Symbols with probabilities of ~30% do actually generate more compressed data than symbols with probabilities > 80%.
Interesting, yes, but that's still the wrong picture to look at. The above picture gives the total amount of compressed data generated if the probability is correct. I won't get a smaller amount of data if I attribute a probability of 5% to a symbol whose real probability is 30% : its cost will skyrocket instead, following the trajectory of the first picture.

What actually matters is "how much additional compressed data will be generate if I round down this probability".
To visualize this question, I've created the following scenario :
- Let's imagine that all probabilities are accurately represented (meaning, a symbol with a probability of 2 is really present 2 times).
- Let's imagine that I provide to this symbol a probability wrong by one (meaning, a symbol present 2 times is now attributed a probability of 1)
- What is the additional cost of this error ?

It gives the following picture :



OK, now this is really interesting.
This is almost a flat line. The total error is consistently going to be around 1.5 bits.

To be more precise, it's not "totally flat" : it is in fact slightly diminishing, converging towards a value around 1.443 (which seems to be 1/ln(2)).
There is a more remarkable difference at the beginning, so let's zoom in on this portion :


It calls for a few comments. First, it's not an "exploding cost" : it is capped, at 2 bits max, for the specific case of downgrading a probability from 2 to 1.
Second, it converges very fast. We almost reach 1.5 bits at probability 10 ; it then continues its slow convergence towards 1/ln(2).

Another interesting fact is that this convergence is independent of the total size of table. This example is extracted from a table size of 2K, but you could have a table of 16K, or a table of just 32, and you would still get the exact same result !

This seems strange, but math explains it all.
As stated previously, the scenario is :
the symbol has exactly P[s] occurrences,
so its total compressed cost shall be : P[s] * -log2(P[s] / TableSize)
but since we undervalued its probability, its new cost is : P[s] * -log2((P[s]-1) / TableSize)
So the cost difference is : (P[s] * -log2((P[s]-1) / TableSize)) - (P[s] * -log2(P[s] / TableSize))
and since : log2(P[s] / TableSize) = log2(P[s]) - log2(TableSize)
we have : diff = (P[s] * (log2(TableSize) - log2(P[s]-1))) - (P[s] * (log2(TableSize) - log2(P[s])))
which simplifies into : diff = P[s] * (log2(P[s]) - log2(P[s]-1))
Basically, the term TableSize disappears from the equation.

Concretely, what it means for our fast normalization algorithm is that using the "fractional rest" as the key metric to decide which symbol should get the "round up" bonus is correct. The gain is almost flat across the spectrum.
It also gives a second key, to solve ties.
Since the loss curve decreases with probability, it means that, for an identical "fractional rest", we have a bit more to gain when the probability is lower.

And here we have our algorithm to distribute the "round up bonus" : select symbols with highest "fractional rest", solve ties by putting lower probability symbols first.
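Here is an illustrative sketch of this bonus distribution rule. It is not the actual FSE code : a plain qsort() stands in for whatever the real implementation does, and the type and function names are mine.

#include <stdlib.h>

typedef struct { unsigned symbol; double rest; unsigned baseP; } Candidate;

/* Order candidates by highest fractional rest first, lowest probability first on ties. */
static int byRestThenLowProb(const void* a, const void* b)
{
    const Candidate* ca = (const Candidate*)a;
    const Candidate* cb = (const Candidate*)b;
    if (ca->rest != cb->rest) return (ca->rest < cb->rest) ? 1 : -1;
    return (ca->baseP > cb->baseP) - (ca->baseP < cb->baseP);
}

/* Grant one "round up" point to each of the best-ranked candidates until the bonus runs out. */
static void distributeBonus(short* norm, Candidate* cand, unsigned nbCand, int bonus)
{
    unsigned i;
    qsort(cand, nbCand, sizeof(*cand), byRestThenLowProb);
    for (i = 0; i < nbCand && bonus > 0; i++, bonus--)
        norm[cand[i].symbol]++;
}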


We now still have to deal with scenario 2 :
we have no "bonus" points to distribute ;
quite the reverse situation : we have distributed too many points, as a consequence of the requirement to provide a minimum probability of "1" to all symbols present even only once. So now, we must remove some points, somewhere, to reach the desired total.

We'll deal with this situation quite easily. All symbols are already rounded down, so we have no more "fractional rest" to play with : we are going to have to downgrade some symbol by a full probability point.
So we just look at the graphic. It simply shows that, even though the loss is almost equivalent everywhere, it's still lower for higher probability symbols, even if marginally.
So we'll give the entire cost to the symbol with the highest probability. It's a fast approximation, and it works well in practice.


With this new algorithm, FSE now compresses a bit more.

To give an example, let's compress book1 (part of calgary corpus) as a single block.

4K table, Old normalization : 435 723 bytes
4K table, New normalization : 435 459 bytes (gain : 0.06%)

A very small, but still perceptible gain. Let's check with lower table sizes :

2K table, Old normalization : 436 783 bytes
2K table, New normalization : 436 138 bytes (gain : 0.15%)

1K table, Old normalization : 439 181 bytes
1K table, New normalization : 437 952 bytes (gain : 0.28%)

And indeed, with lower table sizes, the increased accuracy becomes more important.

The good thing is, the new normalization approximation algorithm is still very fast, and can therefore be used as a drop-in replacement of the older one, with no perceptible speed impact.

Friday, February 28, 2014

FSE encoding : Mapping sub-ranges in a memory efficient way

 With the last blog post, we were left with a description of the FSE compression algorithm. However, the associated code was a bit cryptic. So let's spend some time understanding it.

Here are the lines of code :

nbBitsOut = (state + symbolTT.deltaNbBits) >> 16;
flushBits(bitStream, state, nbBitsOut);
subrangeID = state >> nbBitsOut;
state = stateTable[subrangeID + symbolTT.deltaFindState];

As suggested in an earlier blog post, the first task is to determine the number of bits to flush. This is basically one of 2 values, n or n+1, depending on whether state crosses a threshold.
symbolTT.deltaNbBits stores a value which, when added to state, makes the result of >> 16 produce either n or n+1, as required. It is technically equivalent to :

nbBitsOut = n;
if (state >= threshold) nbBitsOut += 1;

but as can be guessed, it's much faster, because it avoids a test, hence a branch.
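To convince ourselves, here is a small standalone check of the formula, using the 5/4096 example detailed later in this post : n=9, and the switch to n+1=10 bits happens at state 5120 (= 5 << 10). The value of deltaNbBits below is chosen accordingly ; it is for demonstration only, not taken from the FSE source.

#include <assert.h>

int main(void)
{
    const unsigned n = 9;
    const unsigned threshold   = 5 << 10;                      /* states >= 5120 must flush n+1 bits */
    const unsigned deltaNbBits = ((n+1) << 16) - threshold;    /* makes >> 16 cross from n to n+1 at the threshold */
    unsigned state;

    for (state = 4096; state < 8192; state++) {
        unsigned nbBitsOut = (state + deltaNbBits) >> 16;
        unsigned expected  = (state >= threshold) ? n+1 : n;
        assert(nbBitsOut == expected);                         /* holds for the whole state range */
    }
    return 0;
}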

The 2nd line just flushes the required nb of bits.

So we are left with the last 2 lines, which are more complex to grasp. They realise the conversion from newState to oldState (since we are encoding in the backward direction). Let's describe how it works.

A naive way to do this conversion would be to create conversion tables, one per symbol, providing the destination state for each origin state. It works. It's just memory-wasteful.
Consider for example a 4096-state table, for a 256-symbol alphabet. Each state value uses 2 bytes. It results in 4K * 256 * 2 = 2 MB of memory. This is way too large for the L1 cache, with immediate consequences on performance.

So we need a trick to reduce that amount of memory.
Let's have another look at a sub-range map :


Remember, all origin state values within a given sub-range have the same destination state. So what seems clear here is that we can simply scale down all origin state values by 9 bits (a >> 9), and get a much smaller map, with essentially the same information.


It's simple yet very effective. We now have a much smaller map of just 8 slots for a symbol of probability 5/4096. This trick can be applied to all other symbols, reducing the sum of all sub-range maps to a variable total between the number of states and twice the number of states.

But we can do even better. Notice that the blue sub-ranges occupy 2 slots, providing the same destination state.
Remember that the red area corresponds to n=9 bits, and the blue area corresponds to n+1=10 bits. What we then have to do is to shift the origin state by this number of bits. Looks complex ? Not really : we have already calculated this number of bits. Let's just use it now.

subrangeID = state >> nbBitsOut;

For this trick to work properly, state values must be scaled not from 0 to 4095, but from 4096 to 8191. Doing this results in the following sub-range map :


A few important properties of this transformation :
- There are as many cells as the symbol's probability.
- The first subrangeID is the same as the symbol's probability (in this example, 5).
- The sub-ranges are now stored in order (from 1 to 5). This is desirable, as it will simplify the creation of the map : we will just store the 5 destination states in order.
- Since each sub-range map has the same size as its symbol's probability, and since the sum of probabilities is equal to the size of the state table, the sum of all sub-range maps is the size of the state table ! We can now pack all sub-range maps into a single table, of size number of states.

Using again the previous example of a 4096-state table for an alphabet of 256 symbols : each state value uses 2 bytes, and we now essentially disregard the alphabet size, which no longer has any impact on memory allocation.
It results in 4K * 2 = 8 KB of memory, which is much more manageable, and suitable for the L1 cache.

We now have all sub-range maps stored in a single common table. We just need to find the segment corresponding to the current symbol to be encoded. This is what symbolTT.deltaFindState does : it provides the offset of the correct segment within the table.
Hence :
state = stateTable[subrangeID + symbolTT.deltaFindState];
This trick is very significant. It was a decisive factor in publishing an open source FSE implementation, as the initial naive version was impractical, too memory-hungry and too slow.
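To wrap up, here is a small sketch showing how the deltaFindState offsets can be derived from the properties listed above (first subrangeID equals the symbol's probability, segments packed in symbol order). It follows the post's description, not necessarily the exact FSE source.

/* normP[s] : normalized frequency of symbol s (its probability, in table slots).
   Since subrangeID runs from normP[s] to 2*normP[s]-1, storing the normP[s]
   destination states of symbol s contiguously at offset start[s] means :
       deltaFindState[s] = start[s] - normP[s]
   so that stateTable[subrangeID + deltaFindState[s]] lands in the right segment. */
static void buildDeltaFindState(int* deltaFindState, const short* normP, unsigned maxSymbol)
{
    int total = 0;   /* running start offset of the current symbol's segment */
    unsigned s;
    for (s = 0; s <= maxSymbol; s++) {
        deltaFindState[s] = total - normP[s];
        total += normP[s];
    }
}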