Here are the lines of code :
nbBitsOut = (state + symbolTT.deltaNbBits) >> 16;
flushBits(bitStream, state, nbBitsOut);
subrangeID = state >> nbBitsOut;
state = stateTable[subrangeID + symbolTT.deltaFindState];
As suggested in an earlier blog post, the first task is to determine the number of bits to flush. This is basically one of 2 values, n or n+1, depending on state crossing a threshold.
symbolTT.deltaNbBits stores a value which, when added to state, makes the result of >> 16 produce either n or n+1, as required. It is technically equivalent to :
nbBitsOut = n;
if (state >= threshold) nbBitsOut += 1;
but as can be guessed, it's much faster, because it avoids a test, hence a branch.
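To see that the two forms agree, here is a small sketch for a symbol of probability 5/4096, with state running over [4096, 8192). The way deltaNbBits is built below, (maxBits << 16) - (count << maxBits), is an assumption made for illustration: the text above only says what the constant achieves, not how it is computed. All helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Position of the highest set bit (v must be > 0). */
static unsigned highbit(uint32_t v)
{
    unsigned n = 0;
    while (v >>= 1) n++;
    return n;
}

/* Larger of the two candidate bit counts (n+1); count must be >= 2. */
static unsigned max_bits(uint32_t count, unsigned tableLog)
{
    return tableLog - highbit(count - 1);
}

/* Assumed construction of deltaNbBits (not given in the text). */
static uint32_t make_delta_nb_bits(uint32_t count, unsigned tableLog)
{
    unsigned maxBits = max_bits(count, tableLog);
    return ((uint32_t)maxBits << 16) - (count << maxBits);
}

/* Reference version, with an explicit test against the threshold. */
static unsigned nb_bits_branch(uint32_t state, uint32_t count, unsigned tableLog)
{
    unsigned maxBits = max_bits(count, tableLog);
    uint32_t threshold = count << maxBits;
    unsigned nbBitsOut = maxBits - 1;          /* n */
    if (state >= threshold) nbBitsOut += 1;    /* n+1 */
    return nbBitsOut;
}

/* Branchless version: one addition and one shift. */
static unsigned nb_bits_branchless(uint32_t state, uint32_t deltaNbBits)
{
    return (state + deltaNbBits) >> 16;
}

/* Number of states in [tableSize, 2*tableSize) where both versions disagree. */
static unsigned count_mismatches(uint32_t count, unsigned tableLog)
{
    uint32_t tableSize = 1u << tableLog;
    uint32_t delta = make_delta_nb_bits(count, tableLog);
    unsigned mismatches = 0;
    for (uint32_t state = tableSize; state < 2 * tableSize; state++)
        if (nb_bits_branch(state, count, tableLog) != nb_bits_branchless(state, delta))
            mismatches++;
    return mismatches;
}
```

Under these assumptions, count_mismatches(5, 12) returns 0, and the threshold sits at state 5120 : below it 9 bits are flushed, from it onwards 10 bits.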
The 2nd line just flushes the required number of bits.
So we are left with the last 2 lines, which are more complex to grasp. They perform the conversion from newState to oldState (since we are encoding in backward direction). Let's describe how it works.
A naive way to do this conversion would be to create conversion tables, one per symbol, providing the destination state for each origin state. It works. It's just memory wasteful.
Consider for example a 4096-state table, for a 256-symbol alphabet. Each state value uses 2 bytes. This results in 4K * 256 * 2 = 2 MB of memory. This is way too large for L1 cache, with immediate consequences on performance.
So we need a trick to reduce that amount of memory.
Let's have another look at a sub-range map :
Remember, all origin state values within a given sub-range have the same destination state. So what seems clear here is that we can simply shift all origin state values right by 9 bits, and get a much smaller map, with essentially the same information.
It's simple yet very effective. We now have a much smaller 8-state map for a symbol of probability 5/4096. This trick can be applied to all other symbols, reducing the sum of all sub-range maps to a variable total between the number of states and twice the number of states.
But we can do even better. Notice that the blue sub-ranges occupy 2 slots, providing the same destination state.
Remember that the red area corresponds to n=9 bits, and the blue area corresponds to n+1=10 bits. All we have to do then is shift the origin state by this number of bits. Looks complex ? Not really : we have already calculated this number of bits. Let's just use it now.
subrangeID = state >> nbBitsOut;
A few important properties to this transformation :
- There are as many cells as symbol probability.
- The first subrangeID is the same as symbol probability (in this example, 5).
- The sub-ranges are now stored in order (from 1 to 5). This is desirable, as it will simplify the creation of the map : we will just store the 5 destination states in order.
- Since each sub-range map has the same size as its symbol probability, and since the sum of probabilities is equal to the size of the state table, the sum of all sub-range maps is the size of the state table ! We can now pack all sub-range maps into a single table, of size number of states.
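These properties can be checked by brute force. The sketch below walks every state in [4096, 8192) for a symbol of probability 5/4096 and collects the resulting subrangeID values. Helper names are hypothetical, and deriving n+1 as tableLog minus the position of the highest set bit of (count - 1) is an assumption, chosen because it gives n=9 and n+1=10 for this example.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_IDS (1u << 13)   /* enough for tableLog <= 12 */

/* Position of the highest set bit (v must be > 0). */
static unsigned highbit(uint32_t v)
{
    unsigned n = 0;
    while (v >>= 1) n++;
    return n;
}

/* Bits flushed for `state`, for a symbol of probability count / 2^tableLog (count >= 2). */
static unsigned nb_bits_out(uint32_t state, uint32_t count, unsigned tableLog)
{
    unsigned maxBits = tableLog - highbit(count - 1);   /* n+1 (assumed formula) */
    return (state >= (count << maxBits)) ? maxBits : maxBits - 1;
}

static uint32_t subrange_id(uint32_t state, uint32_t count, unsigned tableLog)
{
    return state >> nb_bits_out(state, count, tableLog);
}

/* Smallest subrangeID reached over all states in [tableSize, 2*tableSize). */
static uint32_t min_subrange_id(uint32_t count, unsigned tableLog)
{
    uint32_t tableSize = 1u << tableLog;
    uint32_t best = UINT32_MAX;
    for (uint32_t state = tableSize; state < 2 * tableSize; state++) {
        uint32_t id = subrange_id(state, count, tableLog);
        if (id < best) best = id;
    }
    return best;
}

/* Number of distinct subrangeID values, i.e. number of cells in the map. */
static unsigned distinct_subrange_ids(uint32_t count, unsigned tableLog)
{
    static unsigned char seen[MAX_IDS];
    uint32_t tableSize = 1u << tableLog;
    unsigned distinct = 0;
    for (uint32_t i = 0; i < MAX_IDS; i++) seen[i] = 0;
    for (uint32_t state = tableSize; state < 2 * tableSize; state++) {
        uint32_t id = subrange_id(state, count, tableLog);
        if (!seen[id]) { seen[id] = 1; distinct++; }
    }
    return distinct;
}
```

For count = 5 and tableLog = 12, the IDs found are 5, 6, 7 (10-bit area) and 8, 9 (9-bit area) : five cells, starting at 5.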
Let's take the previous example again : a 4096-state table, for an alphabet of 256 symbols. Each state value uses 2 bytes. The alphabet size no longer matters, since it has no impact on memory allocation anymore.
This results in 4K * 2 = 8 KB of memory, which is much more manageable, and suitable for an L1 cache.
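The two memory budgets can be written down as a pair of one-liners. This is only the arithmetic from the text, with hypothetical function names, assuming each state value is stored as a 16-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive layout: one full conversion table per symbol. */
static size_t naive_table_bytes(size_t nbStates, size_t alphabetSize)
{
    return nbStates * alphabetSize * sizeof(uint16_t);
}

/* Packed layout: all sub-range maps share a single table of nbStates cells. */
static size_t packed_table_bytes(size_t nbStates)
{
    return nbStates * sizeof(uint16_t);
}
```

With 4096 states and 256 symbols, the first function gives 2 MB and the second gives 8 KB, a factor equal to the alphabet size.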
We now have all sub-range maps stored into a single common table. We just need to find the segment corresponding to the current symbol to be encoded. This is what symbolTT.deltaFindState does : it provides the offset to find the correct segment into the table.
Hence :
state = stateTable[subrangeID + symbolTT.deltaFindState];
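Putting the pieces together, here is a toy version of the packed table, with 16 states and 3 symbols of counts 8, 5 and 3. It is a sketch under stated assumptions, not the actual FSE code : the way slots are spread among symbols, the deltaNbBits formula, and deltaFindState = (start of the symbol's segment) - (symbol count) are all assumptions, and every name other than stateTable, deltaNbBits and deltaFindState is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_LOG   4
#define TABLE_SIZE  (1u << TABLE_LOG)
#define NB_SYMBOLS  3

/* Illustrative counts: they sum to TABLE_SIZE and are all >= 2. */
static const uint32_t counts[NB_SYMBOLS] = { 8, 5, 3 };

typedef struct {
    int      deltaFindState;   /* offset of the symbol's segment, minus its count */
    uint32_t deltaNbBits;
} SymbolTT;

static uint16_t stateTable[TABLE_SIZE];   /* all sub-range maps, packed */
static uint8_t  symbolAt[TABLE_SIZE];     /* symbol owning each state slot */
static SymbolTT symbolTT[NB_SYMBOLS];

static unsigned highbit(uint32_t v)
{
    unsigned n = 0;
    while (v >>= 1) n++;
    return n;
}

static void build_tables(void)
{
    /* Spread symbols over the slots; an odd step visits every slot once.
       Any assignment giving counts[s] slots to symbol s works for this demo. */
    uint32_t pos = 0;
    for (int s = 0; s < NB_SYMBOLS; s++)
        for (uint32_t i = 0; i < counts[s]; i++) {
            symbolAt[pos] = (uint8_t)s;
            pos = (pos + 5) & (TABLE_SIZE - 1);
        }

    uint32_t segmentStart = 0;
    for (int s = 0; s < NB_SYMBOLS; s++) {
        unsigned maxBits = TABLE_LOG - highbit(counts[s] - 1);
        symbolTT[s].deltaNbBits = ((uint32_t)maxBits << 16) - (counts[s] << maxBits);
        /* The first subrangeID equals counts[s], so subtract it from the offset. */
        symbolTT[s].deltaFindState = (int)segmentStart - (int)counts[s];
        /* Store the symbol's destination states in increasing order. */
        uint32_t cell = segmentStart;
        for (uint32_t u = 0; u < TABLE_SIZE; u++)
            if (symbolAt[u] == s) stateTable[cell++] = (uint16_t)(TABLE_SIZE + u);
        segmentStart += counts[s];
    }
}

/* One encoding step; the flushed bits themselves are not written anywhere here. */
static uint32_t encode_step(uint32_t state, int symbol, unsigned *nbBitsOut)
{
    const SymbolTT tt = symbolTT[symbol];
    *nbBitsOut = (state + tt.deltaNbBits) >> 16;
    uint32_t subrangeID = state >> *nbBitsOut;
    return stateTable[(int)subrangeID + tt.deltaFindState];
}

/* Counts (state, symbol) pairs whose destination slot is not owned by `symbol`. */
static unsigned count_bad_transitions(void)
{
    unsigned bad = 0;
    build_tables();
    for (int s = 0; s < NB_SYMBOLS; s++)
        for (uint32_t state = TABLE_SIZE; state < 2 * TABLE_SIZE; state++) {
            unsigned nbBits;
            uint32_t next = encode_step(state, s, &nbBits);
            if (next < TABLE_SIZE || next >= 2 * TABLE_SIZE
                || symbolAt[next - TABLE_SIZE] != s)
                bad++;
        }
    return bad;
}
```

Note that deltaFindState can be negative (here -8 for the first symbol), which is consistent with it being a signed value. The whole table holds 16 cells regardless of the number of symbols.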
This trick is very significant. It was a decisive factor in publishing an open source FSE implementation, as the initial naive version was impractical : too memory hungry and too slow.
Thanks for the blog posts Yann. With which compilers are you testing? Also, in my experience, compilers sometimes do a better job if you give them longer calculation strings. So instead of doing:
a = (xxxxxxx);
a -= (yyyyyyy);
it is better to just do:
a = (xxxxxxx) - (yyyyyyyy);
Also, is the FSE_symbolCompressionTransform struct cache optimized? I suspect that making minBitsOut a U16 and changing the order of attributes in the struct to minBitsOut, maxState, deltaFindState would be better.
typedef struct
{
int deltaFindState;
U16 maxState;
BYTE minBitsOut;
} FSE_symbolCompressionTransform;
FSE_symbolCompressionTransform is an 8-byte structure. The C compiler is required to align it on int size, which means its size gets rounded up to 8.
With your proposal, the size of the structure should not change : it will still be 8 bytes. What will change is where the "dummy" 8th byte stands.
With BYTE minBitsOut first, the second byte will be the dummy one, in order to guarantee that U16 maxState is 16-bit aligned.
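The padding argument can be checked with sizeof and offsetof. The sketch below assumes a typical ABI where int is 4 bytes and 4-byte aligned (other platforms may differ); the two type names are hypothetical, and only the field names come from the struct quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t U16;
typedef uint8_t  BYTE;

/* Field order as quoted in the thread. */
typedef struct {
    int  deltaFindState;   /* offset 0 */
    U16  maxState;         /* offset 4 */
    BYTE minBitsOut;       /* offset 6, followed by 1 byte of tail padding */
} TransformOriginal;

/* Field order with BYTE minBitsOut first. */
typedef struct {
    BYTE minBitsOut;       /* offset 0, followed by 1 padding byte */
    U16  maxState;         /* offset 2, 16-bit aligned */
    int  deltaFindState;   /* offset 4 */
} TransformReordered;
```

On such an ABI both layouts occupy 8 bytes; only the position of the padding byte moves, from offset 7 to offset 1.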
Yes, the structure size (with padding) will remain the same. The change to U16 for minBitsOut will permit the compiler to use U16 instructions, which in my experience are sometimes faster than BYTE ones.
I also suggested reordering the struct to match the way the code accesses it. It isn't an exact science, because the compiler also affects the access patterns, but it is a good start.
For very performance sensitive code, even the order in which you declare the variables of a function matters, because it affects how they are arranged on the stack.
Also (with a quick glance at the code) another possible optimization would be to have state1, state2 as ints (if you are sure that the pointer diff range is capped) and have FSE_encodeByte directly return them (to avoid passing an 8 byte reference pointer into it). On the other hand the compiler might be doing this optimization already.
Due to the way FSE_addBits() works, it's necessary to cast the state value to size_t. The cast from ptrdiff_t to size_t is costless. On the other hand, the cast from int to size_t is costly on 64-bit systems. Hence a small performance hit.
To clarify my suggestion a little more.
Instead of passing state as a reference to "FSE_encodeByte(ptrdiff_t* state...", my suggestion is to do this:
state = FSE_encodeByte(state, ....)
so you avoid dereferencing it. Sometimes, avoiding the dereference is faster.
Have you also looked into using alloca (or Visual Studio's _alloca) to allocate some of the needed arrays on the stack?
Actually, the default FSE implementation already allocates its tables on the stack. But it does not use alloca(), since it is not recommended practice.
Minor correction: in order for the compiler to automatically convert a comparison into a signed shift, it has to know the ranges (i.e. for 32-bit integers it has to know that both values are in the [0..2^31-1] range). I wouldn't be surprised if compilers tend not to implement this optimization, since the ranges are generally unknown, and some architectures have instructions that convert a condition result to 0/1 (e.g. setne).
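A small sketch of why the ranges matter. It computes the comparison state >= threshold from the top bit of a subtraction. To stay within well-defined C, it uses unsigned arithmetic and a logical shift rather than the signed shift mentioned in the comment; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Plain comparison, yields 0 or 1. */
static uint32_t ge_compare(uint32_t state, uint32_t threshold)
{
    return state >= threshold;
}

/* Same result taken from the sign bit of (threshold - 1 - state).
   Valid only when both inputs are in [0, 2^31 - 1]. */
static uint32_t ge_shift(uint32_t state, uint32_t threshold)
{
    return (threshold - 1u - state) >> 31;
}

/* Number of (state, threshold) pairs below `limit` where both versions disagree. */
static unsigned count_mismatches_in_range(uint32_t limit)
{
    unsigned mismatches = 0;
    for (uint32_t state = 0; state < limit; state++)
        for (uint32_t threshold = 0; threshold < limit; threshold++)
            if (ge_compare(state, threshold) != ge_shift(state, threshold))
                mismatches++;
    return mismatches;
}
```

Inside the stated range both functions agree. Outside of it they do not : with state = 0x80000005 and threshold = 0, the comparison yields 1 while the shift version yields 0, which is why a compiler cannot apply the substitution without range knowledge.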
Thank you for publishing these posts! I enjoy reading them.
You're welcome. I added a link to your comment from the article.