RealTime Data Compression: The art of designing advanced API

Thursday, July 6, 2017

The art of designing advanced API

A library API (Application Programming Interface) is even more important than its implementation.

There are many reasons for this statement :
- An API exposes a suitable abstraction. Should it prove broken, unclear or just too complex, the library will be misused, which will ultimately be paid by users' time (multiplied by nb of users).
- An API is a contract. Break it, and existing applications can no longer work with your library. Adaptation cost is once again paid by users' time (if it ever happens).
- Because of this, API tend to stick around for a long time, much longer than underlying implementation.

If an implementation is modified to provide, say, a 5% speed improvement, it's all free, every user can immediately benefit about it without further hassle. But if one has to add a single parameter, it's havoc time.

Because it's so stressful to modify an API, one can be tempted to look very hard once, in order to design and expose a perfect API, one that will stand the test of time, and will never need to be changed. This search is (mostly) a delusion.
- perfect API, embedding such a strong restriction to never change in the future, can take forever to build, all under intense stress, as there is always a question mark hanging around : "is there a use case that it does not cover ?". Eventually, it's only a matter of time before you (or your users) find one.
- perfect API lean towards "complex API", as the requirement to support everything makes it add more parameters and control, becoming more and more difficult to manage by users.
- "complex" quickly leads to "misleading", as supporting some "future scenarios" for which there is no current implementation, and maybe no current need, will be categorised bad ideas after all, but side-effects of this useless abstraction will remain in the API.

So, the next great idea is to plan for API changes.
The way Zstandard library tries to achieve this is by quickly converging towards some very simple prototypes, which offer "core" functionalities at a low complexity level.
Then, more complex use cases, not covered by simple API, do show up, and the need to serve them introduce the creation of an "experimental section", a place where it's still possible to play with API, trying to find an optimal abstraction for intended use case, before moving into "stable" status (aka: this method will no longer change).

A consequence of this strategy is the creation of more and more prototypes, dedicated to serving their own use case.
Need to compress with dictionary ? Sure, here comes ZSTD_compress_usingDict() .
Need to process data in a streaming fashion ? Please find ZSTD_compressStream() .
In a streaming fashion with a dictionary ? ZSTD_compressStream_usingDict() .
Need control over specific parameters ? Go to _advanced() variants.
Preprocess dictionary for faster loading time ? Here are _usingCDict() variants.
Some multithreading maybe ? ZSTDMT_*()

Combine all this ? Of course. Here is a list of a gazillion methods.

As one can see, this doesn't scale too well. It used to be "good enough" for a dozen or so methods, but as combinatorial complexity explodes, it's no longer suitable.

In latest release of Zstandard, we try to get a fresh look to this situation, and provide an API simpler to manage. The result of which is the new advanced API candidate, which actually stands a chance to become stable one day.

It features 2 core components :

ZSTD_compress_generic() is the new main compression method. It's designed to support all other compression methods. It can do block, streaming, dictionary, multithreading, and any combination of those. We have plan for even more extensions, and they all seem to fit in.

This is possible because now sending parameters is a separate operation, which can be completed in as many steps as necessary.

The main vehicle to set these parameters is ZSTD_CCtx_setParameter() .

It uses an enum based policy, where the parameter is selected in an enum list, and new value is provided as an unsigned type.

This design has been favoured over previous one, which was using a struct to pass all parameters in a single step. The struct was inconvenient as it forced user to select a value for each and every parameter, even out-of-scope ones, in order to change just one of them. Perhaps even more importantly, the struct is problematic for future changes : adding any new parameter would change the struct size, which is an ABI break. It can quickly get ugly when the program and library work on common memory areas using different sizes.

The enum policy allow us to add any new parameter while preserving API and ABI, so it looks very flexible.

However, it comes with its own set of troubles.

To begin with, enum values can very easily change : just add a new enum in the list, and see all enum values after that one slide by one.
It can be a problem if, in a version of the library, ZSTD_p_compressionLevel is attributed a 2, but in a future version, it becomes a 3. In a dynamic library scenario, where version mismatch can easily happen, it means the caller is changing some other random parameter.

To counter that, it will be necessary to pin down all enum values to a manually assigned one. This will guarantee the attribution.

Another issue is that the value of the parameter is provided as an unsigned type, so the parameter must fit this type. That's not always possible.

For example, there is a dedicated method to set pledgedSrcSize, which is essentially a promise about how much data is going to be processed. This amount can be very large, so an unsigned type is not enough. Instead, we need an unsigned long long, hence a dedicated method.

Another even more obvious one happens when referencing a prepared dictionary in read-only mode : this parameter is a const ZSTD_CDict* type, so it is set through a dedicated method, ZSTD_CCtx_refCDict().

And we have a few other exceptions using their own method, as the argument cannot fit into an unsigned.

But the large majority of them uses ZSTD_CCtx_setParameter().

In some cases, the adaptation works though it's not "best".

For example, a few parameters are selected among a list of enums, for example ZSTD_strategy . The enum is simply casted to an unsigned and passed as argument. It works. But it would have been even nicer to keep the argument type as the intended enum, giving the compiler a chance to catch potential type mismatch (example).

So this design could be in competition with another one : define one method per parameter. The most important benefit would be that each parameter can have its own type.

But this alternate design has also its own flaws :

adding any new parameter means adding a method. Therefore, if a program uses a "recent" method, but links against an older library version, this is a link error.
In contrast, the enum policy would just generate an error in the return code, which can be identified and gracefully dealt with.

Creating future-proof API is hard. There is always a new unexpected use case which shows up and would require another entry point or another parameter. The best we can do is plan for those changes.

The new Zstandard's advanced API tries to do that. But since it is a first attempt, it likely is perfectible.

This is design time, and it will cost a few revisions before reaching "stable" status. As always, user feedback will be key to decide if the new design fits the bill, and how to amend it.

Follow up : Dealing with library version mismatch

Edit :
Arseny Kapoulkine made an interesting comment, arguing that specialized entry points make it possible for compiler's DCE (Dead Code Elimination) optimisation to kick in, removing useless code from the final binary :
https://twitter.com/zeuxcg/status/882816066296172550

In general this is true. Calling
specialized_function1(...)
is clear for the linker,
then it's possible to remove any potentially unused specialized_function2() from binary generation.

In contrast calling
generic_function(mode=1, ...)
with
void generic_function(int mode, ...)
{
switch(mode) {
case 1 : return specialized_function1(...);
case 2 : return specialized_function2(...);
}

makes it much more difficult. In general, for the second case, specialized_function2() will remain in the binary.
(an exception being usage of inline, associated with -flto, and non-ambiguous selection parameter, but that's no longer a "simple" setup).

For Zstandard though, it doesn't make a lot difference.
The reason is, whatever "specialized" entry point is invoked, (ZSTD_compress(), or ZSTD_compress_usingDict() for example), it's just an entry point. The compression code is not "immediately behind", it's reached after several indirection levels. This design make it possible for a single compression code to address multiple usage scenarios with vastly different set of parameters, which is vital for maintenance. But disentagling that for DCE is a lot more difficult.
If required, -flto makes it possible to optimize size better, and some difference become visible, but remain relatively small.

8 comments:

ZedJuly 8, 2017 at 11:25 PM
So, the topic is "extensible APIs" and i think it should be a part of "System Design" CS course. Being self-taught programmer, I never took such course - may be it's a good time to finally pass some MOOC on it. I also seen "Microservices" book on Amazon - probably it's just about it.

Did you look into other compression libraries, in particular LZMA? And generic compression APIs, such as Squash, CLS/CELS and 7-zip API - they also deal with extensible APIs, so you can learn their tricks:
https://encode.ru/threads/2718-Standard-compression-library-API?p=51904&viewfull=1#post51904
https://github.com/Bulat-Ziganshin/FA/tree/master/CELS

A few months ago, I published design of CELS - my third API solving this exact problem. Now it's time to describe WHY i made what i made. So, let's start from beginning.

---

When you develop usual library API, the main goal is to make it convenient to use. When you add new features, you just add more parameters to functions. Refactoring all existing calls to the new API is just a matter of few clicks in modern IDE.

OTOH, when you make extensible API, your main concern is compatibility - new clients with old servers, old clients with new servers, and generally - client having X,Y,Z extensions with server having X,Z,T extensions.

The classic way to do it is to use optional named parameters, available in some languages, i.e. function compress(inbuf=...) can be extended to support compress(read_func=...) where the compress() function checks which named parameters it got and processes appropriately. Later, support for compress(infile=...) may be added. This allows to combine any server with any client as far as client is able to fallback to older APIs.

---

The only problem is that DLLs provide only C ABI which doesn't support named arguments. Again, we have developed generic solutions to this problem - packaging data with XML, JSON and later - libraries such as Thrift and Protobuf with extensions similar to calling functions with named arguments. But these libraries are pretty large, so I developed my own micro-library TABI, that allows to call functions like f(TABI("inbuf",buf)("insize",size)) and check in f which parameters/types we received. The second version of my CLS design employed TABI to make all APIs extensible.

But in CELS i abandoned TABI by reevaluating the main question. We are looking for C API that is both:
1) convenient for application usage
2) extensible

Indeed, TABI is the best way to fit these goals SIMULTANEOUSLY. But do we really need that? Users rarely program in plain C, so the main goal of C API is to lay out foundation for other bindings - C++, Python... But we can do the same with C API - let's DLL export ugly but extensible API, and then library build convenient API on top of it! When we need to upgrade fancy API, we just declare next, incompatible version and provide it to new clients (via #if, namespaces or multiple header files) as well as continue to provide old convenient API to old clients. But both fancy APIs will work on top of the same low-level DLL API, just requesting different set of features from the library.

You just need to develop once the flexible API that can be used to provide any functionality in modular way (i.e. each server can support its own set of extensions and client can use any set of extensions as far as it can fallback to use less extensions). And then provide any number of fancy APIs you wish. Or allow 3rd-parties to provide better APIs than your one (f.e. they may provide named parameter API using TABI/Thrift).
ReplyDelete
Replies
ZedJuly 8, 2017 at 11:26 PM
Hope now it makes sense. Don't try to make your API convenient, only extensible. Now, when we halved our designed requirements, fitting them would be much easier.

Let's start with the following design:

ctx = createOperationContext(int OPERATION_CLASS)
ctx->set(int PARAMETER_ID, void* parameter_value)
ctx->doit(int OPERATION_ID)
ctx->destroy()

It's
1) universal - interpretation of the parameter_value depends on the PARAMETER_ID, and it can point to structure containing arbitrary C types
2) modular - each PARAMETER_ID is set independent of other ones, so we can combine servers with X,Y,Z extensions and clients with X,Z,T ones.

We also need to establish some rules for mandatory/optional parameters and so on, but basically we just done that - developed extensible API that can survive next 30 years. Yes, it's ugly for direct usage by application C developers, but its simplicity and uniformity can make bindings to other languages even simpler than the existing zstd API. F.e. Python universal compress() operation with keyword arguments just needs a few codepaths to pass various parameter types plus parameter name->id translation dictionary.

---

Now when we developed some extensible API, we can think how to make it better. First, we will greatly simplify usage by adding 3 specialized set* functions that cover all C types:

setInt (long long) serving almost all integer types
setReal(double)
setFunction( (void*)() ) serving any function type

This means that from now on we will need to pass structures only for parameters requiring multiple data items. Moreover, i will prefer to use multiple separate parameters even in such case, i.e. prefer to write

ctx->set(DICT_BUF,buf)
ctx->set(DICT_SIZE,size)

rather than

dict d{buf,size)
ctx->set(DICT,&d)

Last code should be a bit faster, but this API is harder to use in dynamic languages and it will make entire API less uniform, i.e. with the first API Python binding just need to implement 4 standard set* functions and then set_dict can be implemented in high-level fashion without going into low-level details.

---

The last part of CELS is controversial - I don't advocate for going the same way in zstd, just will describe what and why was done in CELS.

FreeArc internal API (one that is nice, but not DLL-extensible) provides 20-30 standard operations for each compression method, so I just gone through them and made conclusion that there is "wide template" that covers them all. This template is a call to library with the following parameters:

void* compressor_instance
int operation
4 int64 parameters
4 void* parameters
callback

Overall, for 99% of possible operations 8 int/ptr parameters should be enough to pass all data. A few remaining cases can be fulfilled by requesting extra parameters via callback. The callback also allows to request back any services from client - reading input data, displaying progress indicator and so on.

Note however that I was operated with compressor_instance concept, so I added the callback to request any extra parameters that operation may require. OTOH, you are operating with operation_context concept, so you can just set up any extra parameters prior to calling operation - the idea that I haven't found.

In the current CELS version, each DLL exports only one "wide" function that implements entire functionality provided by the library - create contexts, set up parameters and perform any operations. This reduces low-level/unportable code, simplifies metaprogramming, and simplifies dealing with multiple compression methods. These points are less important for you, though.
ReplyDelete
Replies
UnknownJuly 10, 2017 at 1:29 PM
Thinking about it for a few minutes, my pretty naive conclusion is that as a developer I prefer the 'one method per parameter' approach. It is type safe (allowing the compiler to detect when I screw up) and more discoverable in an IDE with autocomplete.

I will argue that the fact that an old version will cause a linker error is a feature, not a bug: The dependencies of the program have not been met correctly, so the program will not work. I guess that many API users will not handle the error code from the generic API correctly, causing silent failures. And even if they do, in many cases the program will not be able to do much other than saying 'sorry this program cannot work', so I would argue to just let the linker handle that as it is known to be reliable.

Or if you really want to go the extra mile, provide both, one being a thin wrapper for the other one? Not sure if that will always hold up, even when the compiler starts to optimize...
ReplyDelete
Replies
JohnJuly 18, 2017 at 9:48 AM
Given you're already using lots of abstraction I'd like to suggest just one more: expand on your generic_function and use inline to lock down the main points in the client binary at compile time:

void ZSTD_msgSend(const char *functionName, unsigned libraryVersion, void *returnValue, ...);

ZSTDLIB_API size_t ZSTD_CCtx_setCompressionLevel(ZSTD_CCtx* cctx, unsigned value);

inline size_t ZSTD_CCtx_setCompressionLevel(ZSTD_CCtx* cctx, unsigned value)
{
size_t returnValue;
ZSTD_msgSend(__FUNCTION__, ZSTD_VERSION_NUMBER, &returnValue, cctx, value);
return returnValue;
}

This has the following features:
- maintains type safety for the user while giving you all the information you need to handle any call that comes in
- allows you to generate all the interfaces in your favourite scripting language at library release time to maintain type safety on your end
- allows easy bridging for any other languages
- the API only actually export one function, so there can't be a mismatch at that level
ReplyDelete
Replies

Add comment