Saturday, January 19, 2019

The type system


The type system is a simple yet powerful way to ensure that the code knows enough about the data it is manipulating. It is used to declare conditions at interfaces, which are then enforced at each invocation.
The compiler is very good at checking types. The condition is trivial to enforce, and doesn’t cost much in compilation resources, so it’s a powerful combination.

typedef and weak types

C is sometimes labelled a “weakly typed” language, presumably because of the behavior associated with one of its keywords, typedef.
The name of the keyword implies that typedef DEFines a new TYPE, but it’s unfortunately a misnomer.

As an example, typedef can be used this way :

typedef int meters;
typedef int kilograms;

This defines 2 new “types”, meters and kilograms, which can be used to declare variables.

meters m;
kilograms k;

One could logically expect that, from now on, it’s no longer allowed to mix meters and kilograms, since they represent different types, hence should not be compatible.

Unfortunately, that’s not the case : meters and kilograms are still considered as int from the C type system perspective, and mixing them works without a single warning, even when every possible compiler warning is enabled.
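As an illustration, here is a short snippet which is semantically wrong, yet compiles without a single complaint :

typedef int meters;
typedef int kilograms;

int main(void)
{
    meters    m = 5;
    kilograms k = 70;
    m = m + k;        /* mixes units; accepted, since both are just int */
    return m;
}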

As such, typedef must be considered a mere tagging system. It’s still useful from a code reviewer perspective, since it has documenting value, and that may help notice discrepancies, such as in the example above. But the compiler won’t be able to provide any signal.

Strong types in C

To ensure that two types cannot be accidentally mixed, it’s necessary to strongly separate them. And that’s actually possible.
C has a construct called struct (and its remote relative called union).
Two structs defined independently are considered completely foreign, even if they contain exactly the same members.
They can’t be mixed unintentionally.
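Here is a minimal sketch of the same two units expressed as struct (names and members are purely illustrative) :

typedef struct { int value; } meters;
typedef struct { int value; } kilograms;

meters m = { 5 };
kilograms k = { 70 };
/* m = k;   ==> compilation error : incompatible types */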

This gives us a basic tool to strongly segregate types.

Operations

Using struct comes with severe limitations. To begin with, the set of default operations is much more restricted. It’s possible to allocate a struct on the stack, or to make it part of a larger struct; it’s possible to assign it with = or memcpy(); but that’s pretty much it. No arithmetic operations like + - * /, no comparisons like < <= >= >, etc.

Users may also access members directly, and manipulate them. But that breaks the abstraction.
When structures are used as a kind of “bag of variables”, to simplify transport and enforce naming for clarity, it’s fine to let users access members directly. Compared to a function with a ton of parameters, an equivalent function taking a structure as input will help readability tremendously, just because it enforces naming the parameters.
But in the present case, where structures are used to enforce abstractions, users should be clearly discouraged from accessing members directly. Which means all operations must be achieved at the struct level directly.

To comply with these limitations, it’s now necessary to create every allowed operation one by one, giving a uniquely named symbol to each. So if meters and kilograms can be added, each operation needs its own function, such as add_meters() and add_kilograms(). This feels like a hindrance, and indeed, if there are many types to populate, it can require a lot of glue code.

But on the plus side, only what’s allowed is now possible. For example, multiplying meters with meters shouldn’t produce some meters, but rather a square_meters surface, which is a different concept. Allowing additions, but not multiplications, is an impossible subtlety for basic typedef.
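Here is a sketch of what such dedicated operations could look like (function names are illustrative) :

typedef struct { int value; } meters;
typedef struct { int value; } square_meters;

meters add_meters(meters a, meters b)
{
    meters r = { a.value + b.value };
    return r;
}

square_meters mul_meters(meters a, meters b)   /* meters x meters gives a surface */
{
    square_meters r = { a.value * b.value };
    return r;
}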

Composition

There is no “intermediate” situation, where a type would be “compatible” with another type, yet different. In the mechanisms explained so far, types are either compatible and identical, using typedef, or completely incompatible, using a new definition and a new name.

In contrast, in Object Oriented languages, a cat can also be an animal, thanks to inheritance, so it’s possible to use cat to invoke animal methods, or use functions with animal parameter(s).

struct strongly leans towards composition. A struct cat can include a struct animal, which makes it possible to invoke animal related functions, though it’s not transparent : it’s necessary to explicitly spell the substructure (cat.animal) as a parameter or return value of the animal related function.
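A minimal sketch of this composition pattern (member and function names are made up for the example) :

struct animal { int age; };
struct cat { struct animal animal; int lives_left; };

void animal_grow(struct animal* a) { a->age++; }

void cat_grow(struct cat* c)
{
    animal_grow(&c->animal);   /* the substructure is spelled out explicitly */
}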

Note that even Object Oriented languages generally endorse the composition over inheritance guiding principle. It states that, unless there is a very good reason to employ inheritance, composition must be preferred, because it generally fares better as the code evolves and becomes more complex (multiple inheritance quickly turns into a nightmare).

struct can be made more complex, with tables of virtual function pointers, achieving something similar to inheritance and polymorphism. But this is a whole different level of complexity. I would rather avoid this route for the time being. The current goal is merely to separate types in a way which can be checked by the compiler. Enforcing a unified interface on top of different types is a more complex topic, better left for a future article.

Opaque types

structs are fine as strong types, but publishing their definition implies that their members are public, meaning any user can access and modify them.
When that’s the goal, it’s totally fine.

But sometimes, in order to protect users from unintentional misuse, one could wish to make structure members unreachable. This is called an opaque type. An additional benefit is that whatever is inaccessible cannot be relied upon, hence may be changed in the future without breaking user code.

Object-oriented languages have the private qualifier, which allows exactly that : some members might be published, yet remain unreachable by the user (well, in theory…).

A “poor man’s” equivalent solution in C is to comment the code, clearly indicating which members are public, and which ones are private. No guarantee can be enforced by the compiler, but it’s still a good indication for users.
Another step is to give private members terrible names, such as never_ever_access_me, which provides a pretty serious hint, and is harder to overlook than a code comment.

Yet, sometimes, one wishes to rely on a stronger, compiler-backed guarantee, to ensure that no user will access private structure members. C doesn’t have private, but it can do something equivalent.
It relies on the principle of incomplete types.

My own preference is to declare an incomplete type by pairing it with typedef :

typedef struct house_s house;
typedef struct car_s car;

Notice that we have not published anything about the internals of struct house_s. This is intentional. Since nothing is published, nothing can be accessed, hence nothing can be misused.

Fine, but what can we do with such a thing ? To begin with, we can’t even allocate it, since its size is not known.
That’s right, the only thing that can be declared at this stage is a pointer to the incomplete type, like this :

house* my_house;
car* my_car;

And now ?
Well, only functions with house* or car* as a parameter or return type can actually do something with it.
These functions need access to the internal definitions of struct house_s and struct car_s. These definitions are therefore published in the relevant unit *.c file, rather than in the header *.h. Not being part of the public interface, the structure’s internals remain effectively private.

The first functions required are an allocator and a destructor.
For example, I’m used to the following naming convention :

thing* PREFIX_createThing();
void PREFIX_freeThing(thing* t);
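As an illustration, here is what the corresponding unit file could look like (file name, header name and members are invented for the sketch) :

/* thing.c : the only place where the structure content is known */
#include <stdlib.h>   /* calloc, free */
#include "thing.h"    /* hypothetical header : `typedef struct thing_s thing;` + prototypes */

struct thing_s {
    int    nbElements;
    double lastValue;
};

thing* PREFIX_createThing(void)
{
    return calloc(1, sizeof(thing));   /* zero-initialized; NULL on allocation failure */
}

void PREFIX_freeThing(thing* t)
{
    free(t);   /* free(NULL) is a no-op */
}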

Now it’s possible to allocate space for a thing*, and then do something with it (through additional functions).
A good convention is that functions which accept a thing* as a mutable argument should have it as their first parameter, as in this example :

int PREFIX_pushElement(thing* t, element e);
element PREFIX_pullElement(thing* t);

Notice that we are getting pretty close to object oriented programming with this construction. Functions and data members, while not declared in an encompassing “object”, must nonetheless be defined together: the need to know the structure content to do anything about it forces function definitions to be grouped into the unit that declares the structure content. It’s fairly close.

Compared with a direct struct, a few differences stand out :

  • Members are private
  • Allocation is implemented by a function; it can only be invoked
    • no way to allocate on stack
    • no way to include a thing into another struct
      • but it’s possible to include a pointer thing*
    • Initialization can be enforced directly in the constructor
      • removes risks of garbage content due to lack of initialization.
  • The caller is in charge of invoking the destructor.
    • The pattern is exactly identical to malloc() / free() (see future article on Resource Control)

The responsibility to invoke the destructor after usage is very important.
It’s no different from invoking free() after a malloc(),
but it’s still an additional detail to take care of, with the corresponding risk of forgetting or mismanaging it.
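To make the lifecycle explicit, here is a usage sketch (names follow the conventions above; return values are illustrative) :

int useThing(element e)
{
    thing* const t = PREFIX_createThing();
    if (t == NULL) return -1;              /* allocation failure */

    int const r = PREFIX_pushElement(t, e);

    PREFIX_freeThing(t);                   /* caller's responsibility, like free() after malloc() */
    return r;
}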

Opaque types and static allocation

It can be convenient to avoid heap allocation, and its malloc() / free() shenanigans. Pushing a structure onto the stack, or into thread-local storage, are natural capabilities of a normal struct, and may be desirable at times.

The previously described opaque type is so secret that it has no size, hence it cannot offer this capability.

Fortunately, static opaque types are possible.
The main idea is to create a “shell type”, with a size and an alignment restriction suitable for the target (private) structure.
Such information could be requested dynamically, by invoking a function which returns the minimum size, for example. Problem is, the result could not be used to allocate statically, either as a global object or on the stack (if one excludes VLA capabilities). For that, a size known at compile time is required : the shell type itself must be declarable as a plain static object.

For safer maintenance, the shell type and the target structure must be kept in sync, typically by using a static assert. It will ensure that the shell type is always large enough to host the target structure. This check is important to automatically detect future evolutions of the target structure.
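With C11, such a check can be written directly inside the unit, where both types are fully visible (using the thingBody / struct thing_s names of the example further below; older compilers can emulate it with a negative-array-size trick) :

_Static_assert(sizeof(thingBody) >= sizeof(struct thing_s),
               "shell type too small to host struct thing_s");
_Static_assert(_Alignof(thingBody) >= _Alignof(struct thing_s),
               "shell type alignment too weak for struct thing_s");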

If it wasn’t for the strict aliasing rule, we would have a winner : just use the shell type as the “public” user-facing type, then transform it into the private type inside the unit. It would combine all the properties of struct while remaining opaque.

Unfortunately, the strict aliasing rule gets in the way : we can’t manipulate the same memory region through 2 pointers of different types. It may work, but there is no guarantee that it will continue to work safely in the future. A great way to break everything is to use advanced performance features such as -O3 -flto. With enough inlining, register caching and dead code elimination, we will start to experience strange effects, which can be hard to debug.

One line of defense could be to disable strict-aliasing-based optimizations, using a compilation directive such as -fno-strict-aliasing on gcc.
I wouldn’t recommend it though. On top of impacting performance, it ties code correctness to a specific compiler setting, which may or may not be present in the user’s project. Portability is also impacted, since there is no guarantee that this capability will always be available on a different C compiler.

As a consequence, the shell type must not be confused with the target type. Due to the strict aliasing rule, they are not interchangeable !

The trick is to use a 3rd party initializer, to convert the allocated space and return a pointer of appropriate type.

To ensure strict compliance with the C standard, it’s a multi-step trick, hence more complex to set up. Consider this technique “advanced”, implying limited usage scenarios.

Here is an example :

typedef struct thing_s thing;

typedef union thingBody_u {
    char body[SIZE];
    unsigned align4;   // just to ensure `thingBody` will be aligned on 4-bytes boundaries
} thingBody;

// PREFIX_initStatic_thing() accepts any buffer as input, 
// and turns it into a properly initialized `thing*` opaque pointer.
// The function ensures `buffer` has proper size (`SIZE`) and alignment (4) restrictions
// and will return `NULL` if it does not.
// Use `thingBody` to define a memory area respecting all conditions.
// On success, `thing*` will also be correctly initialized.
thing* PREFIX_initStatic_thing(void* buffer, size_t size);

// Notice there is no corresponding destructor.
// Since the space is reserved externally, its deallocation is controlled externally.


/* ====================================== */
/* Example usage */

int function()
{
    thingBody scratchSpace;   /* on stack */
    thing* const T = PREFIX_initStatic_thing(&scratchSpace, sizeof(scratchSpace));
    assert(T != NULL);  // Should be fine. Only exception is if `struct thing_s` definition changes and there is some version mismatch.

    // Now use `T` as if it were a normal `thing*` pointer
    // (...)

    // do Not `free(T)` at function's end, since thingBody is part of the stack
}

In this example, the static size of thingBody is used to allocate space for thing on the stack. It’s faster, and there is no need to care about deallocation.

But that’s all it does. No data is ever read from, nor written to, thingBody directly. All accesses to the memory region go through thing*, which is safe.

Compared to a usual public struct, the experience is not equivalent.
To begin with, the proposed stack allocation takes multiple lines and creates 2 variables : the shell type, and the target pointer. It’s not too bad, and this model fits any kind of manual allocation scenario well enough, be it on the stack or within a pre-reserved area (typically for embedded environments).

If that matters, stack allocation could have been made a one liner, hidden behind a macro, such as for example :

#define DECLARE_THING(P)        \
 thingBody scratchSpace##P;  \
 thing* const P = PREFIX_initStatic_thing(&scratchSpace##P, sizeof(scratchSpace##P)); \
 assert(P != NULL);   /* optional : skip it to avoid mixing declarations and statements (C90 compatibility) */

But I tend to prefer the first variant. It makes it clearer what’s happening. Since one of C’s strengths is a clear grasp of resource control, it feels better to preserve that level of understanding.

There are more problematic differences though.
It’s not possible to use the shell type as the return type of a function : once again, the shell type and the target incomplete type are 2 different things. Along the same lines, it’s not possible to pass the shell type by value. The memory region can only be passed by reference, and only through the correctly typed pointer.

Embedding the shell type into a larger structure is dangerous and generally not recommended : it requires 2 members (the shell and the pointer), and the pointer is only valid as long as the struct is neither moved around nor copied. That’s too strong a constraint for it to be safely usable.

Due to these limitations, when the convenience of a plain struct matters, it can be better to give up and just use a public struct, limiting misuse through clear code comments and scary member names. It’s not great, because the compiler will no longer be able to help in case of contract violation : the design pattern will now entirely depend on reviewers. Still, it’s a reasonable line of defense.

Summary

This closes this first chapter on the type system. We have seen that it’s possible to create strong types, and we can use this property to ensure users can’t mix up different types accidentally. We have also seen that it’s possible to create opaque types, ensuring users can only invoke allowed operations, and can’t rely on secret internal details, clearing the path for future evolution. These properties are compiler-checked, so they are automatically enforced.

That’s not bad. Just using these properties will seriously improve code resistance to potential mis-usages.

Yet, there is more the compiler can do to detect potential bugs in our code. To be continued…

Special thanks : an early version of this document was proof-read by Nick Terrell

Writing safer C code

Writing safer C code may feel like an overwhelming goal. After all, we are told that C gives programmers plenty of opportunities to shoot themselves in the foot.

But that doesn’t mean there is no possible improvement. Actually, in the last decade, programming practices have already evolved dramatically, and for the better, as a consequence of multiple forces : improved tooling, shared programming, and the rising cost of failures, as the numerous Internet exploits tend to remind us all too often.

I expected to start this series with an introduction on C, its strengths, and guiding principles on safer coding practices. But it doesn’t fit the blog post format, being too long, boring, and at times a potential troll magnet. Suffice to say that “safer” implies writing Reviewer-Oriented source code, aka highly readable code, with as much error detection automated as possible, favoring fast methods (immediate feedback while editing code) over longer ones (long offsite test sessions in dedicated environments).

One thing I can’t escape though is to say a few words about the intended audience. These articles are not meant to teach new things to “experts”, who know a lot more than I do. Neither are they intended to guide first steps in C programming. The intended audience has good enough C programming skills, and can actually ship products. Shipping real products is important, because the whole concept of “safer programming” is better understood under the pressure and experience of a product’s maintenance.

The main driver is to make it more difficult to ship bugs as the code base lives and evolves and new team members come on board, by adding much-needed automated controls at every opportunity. The issues are centered around modifying / fixing an existing code base, and managing the cascading impacts on the rest of the project. This requires preparing the code for this challenge, hence the design patterns proposed are also useful for new code with an expected “long” life expectancy (beyond a few months).

Now let’s shorten this introduction and go directly into the meat of the topic.
I’ll start this series with design patterns that leverage compiler checks, to help make C code more resistant to mis-usages and future refactoring.



As a quick background, the compiler is a fairly central part of the development process for compiled languages. Compiling source code incurs a delay, more or less noticeable. That’s a cost.
Interpreted languages (most scripts, python, ruby, basic, bash, etc.) can evade it, making the initial code writing experience more agreeable, with a quick modification / experimentation feedback loop.
The real cost though comes later, and it is steep : compiled languages have this constraint that the compiler must understand, and therefore sanitize, the code in order to produce the executable binary. This constraint becomes a huge advantage, as it catches many categories of errors before they get a chance to run. This typically includes many flavors of mis-typings. Interpreted languages, in contrast, will have to find a majority of problems at run time (note: a good editor’s parser can definitely help both language types there).

And the compiler can go much further. One of the big lessons from modern languages favoring safety, like Rust, is that using the compiler as a primary tool to guide design patterns towards safer practices improves code quality substantially. It’s a good choice : the compiler is a compulsory part of the development chain, it sits close to the programmer, and its diagnosis is part of the valuable “short” feedback loop (in contrast with complementary techniques such as code analyzers, test suites and sanitizers). Whatever the compiler can flag gets solved more quickly, reducing load and risks at later stages of development.

Hence, this is the first topic to explore : let’s make the compiler work for us, check the validity of our code to the best of its abilities. To reach that goal, we will have to purposely leverage its capabilities, in effect help the compiler help us.

And let’s start with its first weapon, the type system.

Wednesday, March 14, 2018

xxHash for small keys: the impressive power of modern compilers

 Several years ago, xxHash was created as a companion error detector for the LZ4 frame format. The initial candidate for this role was CRC32, but it turned out to be several times slower than LZ4 decompression, nullifying one of its key strengths.

After some research, I found the great MurmurHash, by Austin Appleby, alongside its validation tool SMHasher. It nearly fit the bill, running faster than LZ4, but still a bit too close for my taste. Hence started a game to see how much speed could be extracted from a custom hash formula while preserving good distribution properties.

A lot of things happened since that starting point, but the main takeaway from this story is : xxHash was created mostly as a checksum companion, digesting long inputs.

Fast forward to nowadays, and xxHash is being used in more places than originally expected. The design has been expanded to create a second variant, XXH64, which is successful in video content and databases.
But for several of these use cases, hashed keys are no longer necessarily "large".

In some cases, the need to run xxHash on small keys resulted in the creation of dedicated variants, that cut drastically through the decision tree to extract just the right suite of operations for the desired key. And it works quite well.

That pushes the hash algorithm into territories it was not explicitly optimized for. Thankfully, one of SMHasher's test modules is dedicated to speed on small keys, which helped keep attention on the topic during the design phase. Hence the performance on small keys is decent, but the dedicated function pushes it to another level.

Let's analyse the 4-byte hash example.
Invoking the regular XXH32() function on 4-byte samples, and running it on my Mac OS-X 10.13 laptop (with compilation done by llvm9), I measure 233 MH/s (Millions of hashes per second).
Not bad, but running the dedicated 4-byte function, it jumps to 780 MH/s. That's a stark difference !

Let's investigate further.
xxHash offers an obscure build flag named XXH_PRIVATE_API. The initial intention is to make all XXH_* symbols static, so that they do not get exposed on the public side of a library interface. This is useful when several libraries use xxHash as an embedded source file. In such a case, an application linking to both libraries will encounter multiple XXH_* symbols, resulting in naming collisions.

A side effect of this strategy is that function bodies are now available during compilation, which makes it possible to inline them. Surely, for small keys, inlining the hash function might help compared to invoking a function from another module ?
Well, yes, it does help, but there is nothing magical. Using the same setup as previously, the speed improves to 272 MH/s . That's better, but still far from the dedicated function.

That's where the power of inlining can really kick in. In the specific case that the key has a predictable small length, it's possible to pass as length argument a compile-time constant, like sizeof(key), instead of a variable storing the same value. This, in turn, will allow the compiler to make some drastic simplification during binary generation, through dead code removal optimization, throwing away branches which are known to be useless.
Using this trick on the now inlined XXH32(), speed increases to 780 MH/s, aka the same speed as the dedicated function.
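As a sketch of that trick (the 4-byte key type below is made up for the example) :

#define XXH_PRIVATE_API    /* exposes an inlinable XXH32() ; XXH_INLINE_ALL in later releases */
#include "xxhash.h"

typedef struct { char bytes[4]; } key4;

static unsigned hash_key4(key4 k)
{
    /* sizeof(k) is a compile-time constant :
     * the compiler can discard every XXH32() branch dealing with other lengths */
    return XXH32(&k, sizeof(k), 0 /* seed */);
}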

I haven't checked but I wouldn't be surprised if both the dedicated function and the inlined one resulted in the same assembly sequence.
But the inlining strategy seems more powerful : no need to create, and then maintain, a dedicated piece of code. Plus, it's possible to generate multiple variants, by changing the "length" argument to some other compile-time constant.

object        | XXH32()   | XXH32 inlined | XXH32 inlined + length constant | dedicated XXH32 function
4-bytes field | 233 MH/s  | 272 MH/s      | 780 MH/s                        | 780 MH/s


Another learning is that inlining is quite powerful for small keys, but the XXH_PRIVATE_API build macro does a poor job at underlining its effectiveness.

As a consequence, next release of xxHash will introduce a new build macro, named XXH_INLINE_ALL. It does exactly the same thing, but its performance impact is better documented, and I suspect the name itself will make it easier for developers to anticipate its implications.

Friday, February 16, 2018

When to use Dictionary Compression


 On the Zstandard website, there is a small chapter dedicated to Dictionary compression. In a nutshell, it explains that it can dramatically improve compression ratio for small files. Which is correct, but doesn’t come close to capturing the full impact of this feature. In this blog post, I want to address this topic, and present a few scenarios in which dictionaries offer more than just “improved compression”.

Database example

Let’s start with a simple case. Since dictionary compression is good for small data, we need an application which handles small data. For example a log record storage, which can be simplified as an append-only database.
Writes are append-only, but reads can be random. A single query may require retrieving multiple individual records scattered throughout the database, in no particular order.
As usual, the solution needs to be storage efficient, so compression plays an important role. And records tend to be repetitive, so compression ratio will be good.
However, this setup offers a classic example of the “block-size trade-off” : in order to get any compression done, it’s necessary to group records together into “blocks”, then compress each block as a single entity. The larger the block, the better the compression ratio. But now, in order to retrieve a single record, it’s necessary to decompress the whole block, which translates into extra-work during read operations.

[Figure : Impact of Block Size on Compression ratio (level 1). Last data point compresses records individually ([250-550] bytes).]

The “optimal block size” is application dependent. It highly depends on data (how repetitive it is), and on usage scenarios (how many random reads). Typical sizes may vary between 4 KB and 128 KB.

Enter dictionary compression. It is now possible to train the algorithm to become specifically good at compressing a selected type of data.
(For simplicity, in this example, we’ll assume that the log storage server stores a limited variety of record types, small enough to fit into a single dictionary. If not, it is necessary to separate record types into “categories”, generate a dictionary per category, then route each record to a separate pool based on its category. More complex, but it doesn’t change the principle.)
As a first step, we can now just use the dictionary to grab additional compression (see graph), and it's an instant win. But that wouldn’t capture all the value from this operation.
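For illustration, here is a minimal sketch with the reference library (error handling trimmed; the dictionary itself is typically produced offline, either with the zstd --train command or with ZDICT_trainFromBuffer()) :

#include <zstd.h>

/* compress one record with a shared dictionary */
size_t compressRecord(void* dst, size_t dstCapacity,
                      const void* record, size_t recordSize,
                      const void* dict, size_t dictSize)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t const csize = ZSTD_compress_usingDict(cctx, dst, dstCapacity,
                                                 record, recordSize,
                                                 dict, dictSize,
                                                 1 /* compression level */);
    ZSTD_freeCCtx(cctx);
    return csize;   /* check with ZSTD_isError() */
}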

The other benefit is that the optimal block size can now be adjusted to the new conditions, since dictionary compression preserves a better compression ratio (even a 1 KB block size becomes practicable !). What we gain in exchange is improved performance for random read extraction.

[Figure : Impact of Block Size on single record extraction (Random Reads)]

At its extreme, it might even be possible to compress each record individually, so that no more energy is lost to decompress additional records which aren’t used by the query. But that's not even required.
This change produces huge wins for random reads (collecting “single records” scattered throughout the database). In this example, the log storage can now extract up to 2.5M random records / second, while without dictionary, it was limited to 400K random records / second, and at the cost of a pitiful compression ratio.

In fact, the new situation unlocks new possibilities for the database, which can now accept queries that used to be considered too prohibitive, and might previously have required another dedicated database to serve them.

[Figure : Normal vs Dictionary Compression (level 1), comparing Compression Ratio and Random Reads per second]

That’s when the benefits of dictionary compression really kick in : beyond “better compression”, it unlocks new capabilities that were previously rightly considered “out of reach”.

Massive connections

Another interesting scenario is when a server maintains a large number of connections (multiple thousands) with many clients. Let’s assume the connection is used to send / receive requests respecting a certain protocol, so there tends to be a lot of inter-message repetition. But within a single message, compression prospects are low, because each message is rather small.

The solution to this problem is known : use streaming compression. Streaming will keep in memory the last N KB (configurable) of messages and use that knowledge to better compress future messages. It reaches excellent performance, as the same tags and sequences repeated across messages get squashed in subsequent messages.

The problem is, preserving such a “context” costs memory. And a lot of memory at that. To provide some perspective, a Zstandard streaming context with a history of 32 KB requires 263 KB of memory (about the same as zlib). Of course, increasing history depth improves compression ratio, but also increases the memory budget.
That doesn’t look like a large budget in isolation, or even for a few connections, but when applied to 10 thousand client connections, we are talking about > 2.5 GB of RAM. The situation worsens with more connections, or when trying to improve ratio through larger history and higher compression levels.

In such a situation, dictionary compression can help. Training a dictionary to be good at compressing a set of protocols, then priming each new connection with this dictionary, will produce instant benefits during the early stage of the connection’s lifetime. Fine, but then streaming history takes over, so gains are limited, especially when the connection lifetime is long.

A bigger benefit can be realised by understanding that the dictionary can be used to completely eliminate streaming. The dictionary will then partially offset the compression ratio reduction, so mileage can vary, but in many cases, the ratio only takes a small hit, as generic redundant information is already included in the dictionary anyway.
What we get in exchange are 2 things :
  • huge RAM savings : it’s no longer necessary to preserve a context per client; a single context per active thread is enough, which is typically several orders of magnitude less than the number of clients.
  • a vastly simplified communication framework, as each request is now technically “context-less”, eliminating an important logic layer dedicated to keeping contexts synchronised between sender and receiver (with all the funny parts related to timeouts, missed or out-of-order packets, etc.).
[Figure : Memory budget comparison. The context-less strategy requires much less memory (and is barely visible on the chart).]
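As a sketch of this context-less strategy (error handling trimmed) : the dictionary is digested once into a ZSTD_CDict shared by all connections, and each message is then compressed independently.

#include <zstd.h>

/* done once at startup, shared (read-only) by every connection */
ZSTD_CDict* createSharedDict(const void* dictBuffer, size_t dictSize)
{
    return ZSTD_createCDict(dictBuffer, dictSize, 1 /* compression level */);
}

/* per message : no per-connection state, just one ZSTD_CCtx per active thread */
size_t compressMessage(ZSTD_CCtx* threadCCtx,
                       void* dst, size_t dstCapacity,
                       const void* msg, size_t msgSize,
                       const ZSTD_CDict* cdict)
{
    return ZSTD_compress_usingCDict(threadCCtx, dst, dstCapacity, msg, msgSize, cdict);
}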


Eliminating the RAM requirement, which goes from dominant to negligible, and reducing complexity, opens up new possibilities in turn, such as hosting even more connections per server, or improving performance for other local applications. Or simply reducing the number of servers required for the same task.

This kind of scenario is where Dictionary Compression gives its best : beyond “better compression for small data”, it makes it possible to build the target application differently, with second-order effects being as important as, if not more important than, the compression savings alone.

These are just 2 examples. And I’m sure there are more. But that’s enough to illustrate the topic, and probably enough words for a simple blog post.

Friday, January 12, 2018

Zstandard Overview

 I recently realised that, while there is a specification for Zstandard, which describes in great detail what is encoded where, there is no “overview” of the format, neither too detailed nor too vague, that would let programmers with a casual interest in data compression understand its inner workings. This blog post is an attempt to correct that.

Introduction

Zstandard is an LZ77-class compressor, which primarily achieves compression by referencing, from past data, a segment of bytes identical to the bytes that follow. Zstandard features a few other additional capabilities, but they don’t change the core formula. This construction offers several advantages, primarily speed related, especially on the decoder side, since a memory copy operation is all it takes to regenerate a bunch of bytes. Moreover, simple pointer arithmetic is enough to locate the reference to copy, which is as frugal as it gets, both CPU- and memory-wise.

Blocks

Zstandard format is block-oriented. It can only start decoding data when a first full block arrives (with the minor exception of uncompressed blocks). But it’s nonetheless stream-capable : any large input is automatically cut into smaller blocks, and decoding starts as soon as the first block arrives.
A block can have any size, up to a maximum of 128 KB. There are multiple reasons for such a limit to exist.
It’s not a concern related to initial latency for the first block, since the format allows this block to have any size up to maximum, so it can be made explicitly small whenever necessary.
The maximum block size puts an upper limit to the amount of data a decoder must handle in a single operation. The limit makes it possible to allocate a number of resources which are guaranteed to be enough for whatever data will follow.
There are also other concerns into the mix, such as the relative weight of headers and descriptors, time spent to build tables, local adaptation to source entropy, etc. 128 KB felt like a good middle ground providing a reasonable answer to all these topics.
It follows that a small source can be compressed into a single block, while larger ones will need multiple blocks.
The organisation of all these blocks into a single content is called a frame.

Frame

A frame adds a number of properties shared by all blocks in the current frame.
To begin with, it can restrict the maximum block size even further : the largest maximum is 128 KB, but for a given frame, it can be set to a value as small as 1 KB.
The frame can tell upfront how much data will be regenerated from its content, which can be useful to pre-allocate a destination buffer.
Most importantly, it can tell how much “past data” must be preserved in memory to decode the next block. This is the “window size”, which has direct consequences on buffer sizes.
There are other properties stored there, but it’s not in the scope of this article to describe all of them. Should you wish to know more, feel free to consult the specification.
Once these properties are extracted, the decoding process is fairly straightforward : decompress data block after block.
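For readers who prefer code, here is a minimal decoding loop sketch using the reference library’s streaming API, which handles the frame and block mechanics internally (error handling trimmed) :

#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

void decompressFile(FILE* fin, FILE* fout)
{
    ZSTD_DStream* const dstream = ZSTD_createDStream();
    size_t const inSize  = ZSTD_DStreamInSize();    /* recommended input buffer size */
    size_t const outSize = ZSTD_DStreamOutSize();   /* large enough to flush at least one full block */
    char* const inBuff  = malloc(inSize);
    char* const outBuff = malloc(outSize);
    ZSTD_initDStream(dstream);

    size_t readSize;
    while ((readSize = fread(inBuff, 1, inSize, fin)) != 0) {
        ZSTD_inBuffer input = { inBuff, readSize, 0 };
        while (input.pos < input.size) {
            ZSTD_outBuffer output = { outBuff, outSize, 0 };
            size_t const ret = ZSTD_decompressStream(dstream, &output, &input);
            if (ZSTD_isError(ret)) break;           /* real code : report the error */
            fwrite(outBuff, 1, output.pos, fout);
        }
    }

    free(inBuff); free(outBuff);
    ZSTD_freeDStream(dstream);
}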

Literals

A compressed block consists of 2 sections : literals and sequences.
Literals are the “left over” from the LZ77 mechanism : remember, LZ77 compresses the next bytes by referencing an identical run of bytes in the past. Sometimes, there is simply no such thing. Trivially, this is necessarily the case at the beginning of each source.
In such a case, the algorithm outputs some “raw bytes”. These bytes are not compressed by LZ77, but they generally can be compressed using another technique : Huffman compression.
The principle behind Huffman is quite different : it transforms bytes into prefix codes using a variable number of bits, assigning a low number of bits to frequent characters, while sacrificing infrequent characters with more bits.
The Huffman algorithm makes it possible to select the most efficient repartition of prefix codes.
When all bytes are equally present, it’s not possible to compress anything. But it’s generally not the case.
Consider a standard text file using ASCII character set, a whole set of byte values will not be present (>128), and some characters (like ‘e’) are expected to be more common than others (like ‘q’).
This is the kind of irregularity that Huffman can exploit to provide some compression for these left-over bytes. Typical gains range between 20% and 40%.
The literal section can be uncompressed (mostly when it’s very small, since describing a Huffman table costs multiple bytes), or compressed as a single stream of bits, or using multiple (4) streams of bits.
The multi streams strategy has been explained in another post, and is primarily designed for improved decoding speed.
All literals are decompressed into their own buffer. The buffer size is primarily limited by the block size, since in worst case circumstances, LZ77 will fail completely, leaving only Huffman to do the job.

Sequences

Obviously, we expect LZ77 to be useful. Its outcome is described in the second section, called “sequences”.
A block is rebuilt by a succession of sequences.
A “sequence” describes a number of bytes to copy from the literals buffer, followed by a number of bytes to copy from past data, with an associated offset to locate the reference.
These values are of different natures, so they are encoded using 3 separate alphabets.
Each alphabet must be described, and there is a small header for each of them at the beginning of the section.
The compression technique used here is Finite State Entropy, a tANS variant, which offers better compression for dominant symbols.
Dominant symbols lose a lot of precision with Huffman, resulting in a loss of compression. They are unlikely to be present in “left over” literals, but for sequence symbols, the situation is less favourable.
FSE solves this issue, by being able to encode symbols using a fractional number of bits.
If you are interested in how FSE works, there is a series of articles which tries to describe it, but be aware that it’s fairly complex.
All sequence symbols are interleaved in a single bitstream, which must be read backward, due to ANS property of inverting directions for encoding vs decoding.
On 64-bits CPU, a single read operation is generally enough to grab all bits necessary to decode the 3 symbols forming the sequence. All it takes now is to apply the sequence : copy some bytes from the literals buffer, then copy some bytes from the past.
Decode the next sequence. Rinse, repeat. Stop when there are no more sequences left in the bitstream.
At this stage, whatever remains in the literals buffer is simply copied to complete the block.
And the decoder can move on to the next block.
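In pseudo-C, the inner loop of a decoder applying sequences boils down to something like this (a simplified sketch, not the actual Zstandard source) :

#include <stddef.h>
#include <string.h>

typedef struct { size_t litLength; size_t matchLength; size_t offset; } Sequence;

/* `dst` already holds previously regenerated data in dst[0..pos-1] */
static size_t applySequence(char* dst, size_t pos, const char** litPtr, Sequence seq)
{
    memcpy(dst + pos, *litPtr, seq.litLength);    /* copy some bytes from the literals buffer */
    *litPtr += seq.litLength;
    pos += seq.litLength;
    for (size_t i = 0; i < seq.matchLength; i++)  /* byte-by-byte copy handles overlapping matches */
        dst[pos + i] = dst[pos + i - seq.offset];
    return pos + seq.matchLength;
}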

Window

While decoding literals and sequences is a block-oriented job, which could be achieved in parallel across multiple blocks (expect a multi-threaded version in the future), the LZ copy operation is not.
It depends on previous blocks being already decoded, and is therefore serial in nature.
That’s where the frame header comes into play : it specifies how much past data the decoder must keep around to be able to decode the next block.
The specification recommends keeping this value <= 8 MB, though it’s only a recommendation. --ultra levels for example go beyond this limit.
In most cases though, the decoder will not need that much. All levels <= 10 for example, which tend to be preferred due to their speed, require a memory budget <= 2 MB.
As could be guessed, using less memory is also good for speed.

Wrap up

That’s basically it. All these operations form the core of Zstandard compression format. There are a few more little details involved, such as the repeat offset symbols, shortened header with repeat tables and so on, which are described in the specification, but this description should be enough to grab the essence of the decoder.
The encoder is a bit more complex, not least because there are, in fact, multiple encoders.
The format doesn’t impose a single way to find or select references into the past. At every position into the file, there are always multiple possibilities to encode what’s next.
That’s why different strategies exist, providing different speed / compression trade-offs. Lower levels are mapped onto LZ4, being very fast. Upper levels can be very complex, on top of being very slow, offering improved compression ratios.
But the decoder remains always the same, preserving its speed properties.

Thursday, July 13, 2017

Dealing with library version mismatch

 Note : this article was initially written as an answer to David Jud's comment, but it became long enough to be worth converting into a full blog entry.
In the previous article, I attempted to introduce a few challenges related to designing an extensible API.

In this one, I'll cover an associated but more specific topic, on how to handle a library version mismatch.

Version mismatch is a more acute problem in a DLL scenario. In a static linking scenario, the programmer has several advantages :
  • Compiler will catch errors (if a type, or a prototype, has changed for example). This gives time to fix these errors. Of course, the application maintainer will prefer that a library update doesn't introduce any change in existing code, but worst case is, most errors should be trapped before shipping the product.
  • Compiler will automatically adapt to ABI changes : if an updated type is larger/smaller than the previous version, it will be automatically converted throughout the code base. The same thing happens in case of enum changes : adaptation to new enum values is automatically applied by the compiler.
  • Library is available during compilation, which means the programmer has a local copy that they can update (or not) according to their requirements.

Well, this last property is not always true : in larger organisations, library might belong to a "validated" pool, which cannot be easily adapted for a specific project. In which case, the user program will either have to host its own local copy, or adapt to the one selected by its organisation.

But you get the idea : problematic version mismatches are likely to be trapped or automatically fixed by the compiler, and therefore should be corrected before shipping a binary. Of course, the less changes, the better. Program maintainers will appreciate a library update as transparent as possible.

For a dynamic library though, the topic is a lot harder.
To begin with, user program typically does not have direct control over the library version deployed on target system. So it could be anything. The library could be more recent, or older, than expected during program development.

Now these two types of mismatches are quite different, and trigger different solutions :

Case 1 - library version is higher than expected

This one can, and should, be solved by the library itself.

It's relatively "easy" : never stop supporting older entry points.
This is of course easier said than done, but to be fair, it's first and foremost a question of respecting a policy, and therefore is not out of reach.
Zstandard tries to achieve that by guaranteeing that any prototype reaching "stable" status will be there "forever". For example, ZSTD_getDecompressedSize(), which has recently been superseded by ZSTD_getFrameContentSize(), will nonetheless remain an accessible entry point in future releases, because it's labelled "stable".

A more subtle problem is ABI preservation, in particular structure size and alignment.
Suppose, for example, that version v1.1 defines a structure of size 40 bytes.
But v1.2 adds some new capabilities, and now the structure has a size of 64 bytes.
All previous fields from v1.1 are still there, at their expected place, but there are now more fields.

The user program, expecting v1.1, would allocate the 40-bytes version, and pass that as an argument to a function expecting a 64-bytes object. You can guess what will follow.

This could be "manually" worked around by inserting a "version" field and dynamically interpreting the object with the appropriate parser. But such manipulation is a recipe for complexity and errors.
That's why structures are pretty dangerous. For best safety, structure definition must remain identical "forever", like the approved "stable" prototypes.

In order to avoid such issues, it's recommended to use incomplete types. This forces the creation of the underlying structure through a process entirely controlled by the current library, whatever its version, thus avoiding any kind of size/alignment mismatch.
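This is, roughly, the pattern zstd.h itself uses for its contexts :

/* public header : only an incomplete type and functions are exposed */
typedef struct ZSTD_CCtx_s ZSTD_CCtx;
ZSTD_CCtx* ZSTD_createCCtx(void);
size_t     ZSTD_freeCCtx(ZSTD_CCtx* cctx);

/* the content of struct ZSTD_CCtx_s lives in the library's own source files,
 * so its size and layout can change freely between versions */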

When above conditions are correctly met, the library is "safe" to use by applications expecting an older version : all entry points are still there, behaving as expected.

Whenever this condition cannot be respected anymore, an accepted work-around is to increase the Major digit of the version, indicating a breaking change.


Case 2 - library version is lower than expected

This one is more problematic.
Basically, responsibility is now on the application side. It's up to the application to detect the mismatch and act accordingly.

In David Jud's comment, he describes a pretty simple solution : if the library is not at the expected version, the application just stops there.
Indeed, that's one way to safely handle the matter.

But it's not always desirable. An application can have multiple library dependencies, and not all of them might be critical.
For example, maybe the user program accesses several libraries offering similar services (encryption for example). If one of them is not at the expected version, and cannot be made to work, that's not always a good reason to terminate the program : maybe there are already plenty of capabilities available without this specific library, and the program can run, just with fewer options.

Even inside a critical library dependency, some new functionality might be optional, or there might be several ways to get one job done.
Dealing with this case requires writing some "version dependent" code.
This is not an uncommon situation by the way. Gracefully handling potential version mismatches is one thing highly deployed programs tend to do well.

Here is how it can be made to work : presuming the user application wants access to a prototype which is only available in version v1.5+, it first tests the version number. If condition matches, then program can invoke target prototype as expected. If not, a backup scenario is triggered, be it an error, or a different way to get the same job done.

Problem is, this test must be done statically.
For example, in Zstandard, it's possible to ask for library version at runtime, using ZSTD_versionNumber(). But unfortunately, it's already too late.
Any invocation of a new function, such as ZSTD_getFrameContentSize() which only exists since v1.3.0, will trigger an error at link time, even if the invocation itself is protected by a prior check with ZSTD_versionNumber() .

What's required is to selectively remove any reference to such a prototype from the compilation and linking stages, which means this code cannot exist. It can be excluded through the preprocessor.
So the correct method is to use a macro definition, in this case ZSTD_VERSION_NUMBER.

Example :
#if ZSTD_VERSION_NUMBER >= 10300
size = ZSTD_getFrameContentSize(src, srcSize);
#else
size = ZSTD_getDecompressedSize(src, srcSize);
/* here, 0-size answer can be mistaken for "error", so add some code to mitigate the risk */
#endif

That works, but it requires compiling the binary with the correct version of the zstd.h header file.
When the program is compiled on the target system, that's a reasonable expectation : if libzstd is present, zstd.h is also supposed to be accessible. And it's reasonable to expect them to be synchronised. There can be some corner case scenarios where this doesn't hold, but let's say that in general, it's acceptable.

The detection can also be done through a ./configure script, in order to avoid an #include error during compilation, should the relevant header not even be present on the target system, as sometimes the library is effectively optional to the program.

But compilation on the server side is a different topic. While it is highly perilous to pre-compile a binary using dynamic libraries and then deploy it, this is nonetheless the method selected by many repositories, such as aptitude, in order to save deployment time. In that case, the binary is compiled for "system-provided libraries", whose minimum version is known, and the repository can solve dependencies. Hence, by construction, the case "library has a lower version than expected" is not supposed to happen. Case closed.

So, as we can see, the situation is solved either by local compilation and clever usage of preprocessing statements, or by dependency solving through repository rules.

Another possibility exists, and is, basically, the one proposed in ZSTD_CCtx_setParameter() API : the parameter to set is selected through an enum value, and if it doesn't exist, because the local library version is too old to support it, the return value signals an error.

Using this API safely feels a lot like the previous example, except that now it becomes possible to check the library version at runtime :

if (ZSTD_versionNumber() >= 10500) {
   return ZSTD_CCtx_setParameter(cctx, ZSTD_p_someNewParam, value);
} else {
   return ERROR(capability_not_present);
}

This time, there is no need to be in sync with the correct header.h version. As the version number comes directly at runtime from the library itself, it's necessarily correct.

Note however that ZSTD_CCtx_setParameter() only covers the topic of "new parameters". It cannot cover the topic of "new prototypes", which still requires using exclusion through macro detection.

So, which approach is best ?

Well, that's the good question to ask. There's a reason the new advanced API is currently in "experimental" mode : it needs to be field tested, to experience its strengths and weaknesses. There are pros and cons to both methods.
And now, the matter is to select the better one...

Thursday, July 6, 2017

The art of designing advanced API

 A library API (Application Programming Interface) is even more important than its implementation.

There are many reasons for this statement :
- An API exposes a suitable abstraction. Should it prove broken, unclear or just too complex, the library will be misused, which will ultimately be paid for in users' time (multiplied by the number of users).
- An API is a contract. Break it, and existing applications can no longer work with your library. The adaptation cost is once again paid in users' time (if it ever happens).
- Because of this, APIs tend to stick around for a long time, much longer than the underlying implementation.

If an implementation is modified to provide, say, a 5% speed improvement, it's all free : every user can immediately benefit from it without further hassle. But if one has to add a single parameter, it's havoc time.

Because it's so stressful to modify an API, one can be tempted to look very hard once, in order to design and expose a perfect API, one that will stand the test of time and will never need to be changed. This search is (mostly) a delusion.
- A perfect API, embedding such a strong restriction to never change in the future, can take forever to build, all under intense stress, as there is always a question mark hanging around : "is there a use case that it does not cover ?". Eventually, it's only a matter of time before you (or your users) find one.
- A perfect API leans towards a "complex API", as the requirement to support everything makes it add more parameters and controls, becoming more and more difficult for users to manage.
- "Complex" quickly leads to "misleading", as some of the "future scenarios" it supports, for which there is no current implementation and maybe no current need, will be categorised as bad ideas after all, but the side-effects of this useless abstraction will remain in the API.

So, the next great idea is to plan for API changes.
The way the Zstandard library tries to achieve this is by quickly converging towards some very simple prototypes, which offer "core" functionalities at a low complexity level.
Then, more complex use cases, not covered by the simple API, show up, and the need to serve them introduces the creation of an "experimental section", a place where it's still possible to play with the API, trying to find an optimal abstraction for the intended use case, before moving it into "stable" status (aka: this method will no longer change).

A consequence of this strategy is the creation of more and more prototypes, dedicated to serving their own use case.
Need to compress with dictionary ? Sure, here comes ZSTD_compress_usingDict() .
Need to process data in a streaming fashion ? Please find  ZSTD_compressStream() .
In a streaming fashion with a dictionary ? ZSTD_compressStream_usingDict() .
Need control over specific parameters ? Go to _advanced() variants.
Preprocess dictionary for faster loading time ? Here are _usingCDict() variants.
Some multithreading maybe ? ZSTDMT_*()
Combine all this ? Of course. Here is a list of a gazillion methods.

As one can see, this doesn't scale too well. It used to be "good enough" for a dozen or so methods, but as combinatorial complexity explodes, it's no longer suitable.

In the latest release of Zstandard, we try to take a fresh look at this situation, and provide an API that is simpler to manage. The result is the new advanced API candidate, which actually stands a chance of becoming stable one day.

It features 2 core components : 
ZSTD_compress_generic() is the new main compression method. It's designed to support all other compression methods. It can do block, streaming, dictionary, multithreading, and any combination of those. We have plan for even more extensions, and they all seem to fit in.

This is possible because now sending parameters is a separate operation, which can be completed in as many steps as necessary.
The main vehicle to set these parameters is ZSTD_CCtx_setParameter() .
It uses an enum based policy, where the parameter is selected in an enum list, and new value is provided as an unsigned type.
This design has been favoured over the previous one, which used a struct to pass all parameters in a single step. The struct was inconvenient, as it forced the user to select a value for each and every parameter, even out-of-scope ones, in order to change just one of them. Perhaps even more importantly, the struct is problematic for future changes : adding any new parameter would change the struct size, which is an ABI break. It can quickly get ugly when the program and the library work on common memory areas using different sizes.
The enum policy allows us to add any new parameter while preserving API and ABI, so it looks very flexible.
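As a sketch of the resulting flow (using the experimental names mentioned in this article; exact prototypes may still evolve before reaching "stable") :

size_t compressAll(ZSTD_CCtx* cctx,
                   void* dst, size_t dstCapacity,
                   const void* src, size_t srcSize)
{
    /* parameters stick to the context until changed or reset */
    ZSTD_CCtx_setParameter(cctx, ZSTD_p_compressionLevel, 19);

    ZSTD_inBuffer  input  = { src, srcSize, 0 };
    ZSTD_outBuffer output = { dst, dstCapacity, 0 };
    /* ZSTD_e_end : compress and finish the frame in one go (assuming dst is large enough) */
    size_t const remaining = ZSTD_compress_generic(cctx, &output, &input, ZSTD_e_end);
    if (ZSTD_isError(remaining)) return remaining;
    return output.pos;   /* compressed size */
}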

However, it comes with its own set of troubles.
To begin with, enum values can very easily change : just add a new enum in the list, and all enum values after that one slide by one.
It can be a problem if, in a version of the library, ZSTD_p_compressionLevel is attributed a 2, but in a future version, it becomes a 3. In a dynamic library scenario, where version mismatch can easily happen, it means the caller is changing some other random parameter.
To counter that, it will be necessary to pin down all enum values to manually assigned ones. This guarantees their attribution.
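Concretely, the pinned-down enum could look like this (values are illustrative) :

typedef enum {
    ZSTD_p_compressionLevel = 100,   /* explicitly assigned : inserting new entries cannot shift it */
    ZSTD_p_windowLog        = 101,
    ZSTD_p_checksumFlag     = 201
} ZSTD_cParameter;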

Another issue is that the value of the parameter is provided as an unsigned type, so the parameter must fit this type. That's not always possible.
For example, there is a dedicated method to set pledgedSrcSize, which is essentially a promise about how much data is going to be processed. This amount can be very large, so an unsigned type is not enough. Instead, we need an unsigned long long, hence a dedicated method.
Another even more obvious one happens when referencing a prepared dictionary in read-only mode : this parameter is a const ZSTD_CDict* type, so it is  set through a dedicated method, ZSTD_CCtx_refCDict().
And we have a few other exceptions using their own method, as the argument cannot fit into an unsigned.

But the large majority of them uses ZSTD_CCtx_setParameter().
In some cases, the adaptation works though it's not "best".
For example, a few parameters are selected from a list of enums, for example ZSTD_strategy. The enum is simply cast to an unsigned and passed as an argument. It works. But it would have been even nicer to keep the argument type as the intended enum, giving the compiler a chance to catch potential type mismatches (example).

So this design could be in competition with another one : define one method per parameter. The most important benefit would be that each parameter can have its own type.
But this alternative design also has its own flaws :
adding any new parameter means adding a method. Therefore, if a program uses a "recent" method, but links against an older library version, this is a link error.
In contrast, the enum policy would just generate an error in the return code, which can be identified and gracefully dealt with.


Creating future-proof API is hard. There is always a new unexpected use case which shows up and would require another entry point or another parameter. The best we can do is plan for those changes.
The new Zstandard's advanced API tries to do that. But since it is a first attempt, it likely is perfectible. 
This is design time, and it will cost a few revisions before reaching "stable" status. As always, user feedback will be key to decide if the new design fits the bill, and how to amend it.

Follow up : Dealing with library version mismatch

Edit :
Arseny Kapoulkine made an interesting comment, arguing that specialized entry points make it possible for compiler's DCE (Dead Code Elimination) optimisation to kick in, removing useless code from the final binary :
https://twitter.com/zeuxcg/status/882816066296172550

In general this is true. Calling
specialized_function1(...)
is clear for the linker,
then it's possible to remove any potentially unused specialized_function2() from binary generation.

In contrast calling
generic_function(mode=1, ...)
with
void generic_function(int mode, ...)
{
   switch(mode) {
      case 1 : return specialized_function1(...);
      case 2 : return specialized_function2(...);
   }
}

makes it much more difficult. In general, for the second case, specialized_function2() will remain in the binary.
(an exception being usage of inline, associated with -flto, and non-ambiguous selection parameter, but that's no longer a "simple" setup).

For Zstandard though, it doesn't make a lot of difference.
The reason is, whatever "specialized" entry point is invoked (ZSTD_compress(), or ZSTD_compress_usingDict() for example), it's just an entry point. The compression code is not "immediately behind" : it's reached after several levels of indirection. This design makes it possible for a single compression codebase to address multiple usage scenarios with vastly different sets of parameters, which is vital for maintenance. But disentangling that for DCE is a lot more difficult.
If required, -flto makes it possible to optimize size better, and some differences become visible, but they remain relatively small.