Saturday, January 19, 2019

The type system


The type system is a simple yet powerful way to ensure that the code knows enough about the data it is manipulating. It is used to declare conditions at interfaces, which are then enforced at each invocation.
The compiler is very good at checking types. The condition is trivial to enforce and costs little compilation time, so it’s a powerful combination.

typedef and weak types

C is sometimes labelled a “weakly typed” language, presumably because of the behavior of one of its keywords, typedef.
The keyword itself implies that typedef DEFines a new TYPE, but it’s unfortunately a misnomer.

As an example, typedef can be used this way :

typedef int meters;
typedef int kilograms;

This defines 2 new “types”, meters and kilograms, which can be used to declare variables.

meters m;
kilograms k;

One could logically expect that, from now on, it’s no longer allowed to mix meters and kilograms, since they represent different types, hence should not be compatible.

Unfortunately, that’s not the case : meters and kilograms are still considered as int from the C type system perspective, and mixing them works without a single warning, even when every possible compiler warning is enabled.
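To make the problem concrete, here is a minimal sketch of such a mix-up. The helper names add_meters() and mixed_up() are invented for this illustration; the point is only that nothing here triggers a diagnostic.

```c
/* Both names are plain aliases of int: the compiler sees no difference. */
typedef int meters;
typedef int kilograms;

/* Hypothetical helper, named for illustration only. */
static meters add_meters(meters a, meters b)
{
    return a + b;
}

/* A weight is passed where a length is expected, and it compiles cleanly. */
static meters mixed_up(void)
{
    meters    m = 5;
    kilograms k = 70;
    return add_meters(m, k);   /* no diagnostic: k is just an int */
}
```

A reviewer may spot the kilograms variable going into add_meters(), but the compiler has no reason to object.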

As such, typedef must be considered a mere tagging system. It’s still useful from a code reviewer’s perspective, since it has documenting value, and that may help a reader notice discrepancies, such as a kilograms value passed where meters are expected. But the compiler won’t provide any signal.

Strong types in C

To ensure that two types cannot be accidentally mixed, it’s necessary to strongly separate them. And that’s actually possible.
C has one thing called a struct (and its remote relative called union).
Two structs defined independently are considered completely foreign, even if they contain exactly the same members.
They can’t be mixed unintentionally.

This gives us a basic tool to strongly segregate types.
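As a rough sketch of the idea, the two units can be wrapped in single-member structures. The member name value and the helpers make_meters(), make_kilograms() and length_as_int() are assumptions made for this example, not an established convention.

```c
/* Two independent struct definitions: same content, unrelated types. */
typedef struct { int value; } meters;
typedef struct { int value; } kilograms;

/* Hypothetical constructors, named for illustration only. */
static meters make_meters(int v)
{
    meters m = { v };
    return m;
}

static kilograms make_kilograms(int v)
{
    kilograms k = { v };
    return k;
}

/* Only accepts a length. */
static int length_as_int(meters m)
{
    return m.value;
}

/* length_as_int(make_kilograms(70));   <- rejected: kilograms is not meters */
```

Uncommenting the last line turns the mix-up into a compilation error, which is exactly the signal typedef alone could not provide.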

Operations

Using struct comes with severe limitations. To begin with, the set of default operations is much more restricted. It’s possible to allocate a struct on the stack, to make it part of a larger struct, and to assign it with = or memcpy(), but that’s pretty much it. No arithmetic operations like + - * /, no comparisons like < <= >= >, etc.

Users may also access members directly, and manipulate them. But it breaks the abstraction.
When structures are used as a kind of “bag of variables”, to simplify transport, and enforce naming for clarity, it’s fine to let users access members directly. Compared to a function with a ton of parameters, an equivalent function with a structure as input will help readability tremendously, just because it enforces naming parameters.
But in the present case, when structures are used to enforce abstractions, users should be clearly discouraged from accessing members directly. This means that all operations must be performed at struct level.

To comply with these limitations, it’s now necessary to create all allowed operations one by one, giving a uniquely named symbol to each one. So if meters and kilograms can be added, both operations need their own function signature, such as add_meters() and add_kilograms(). This feels like a hindrance, and indeed, if there are many types to populate, it can require a lot of glue code.

But on the plus side, only what’s allowed is now possible. For example, multiplying meters with meters shouldn’t produce some meters, but rather a square_meters surface, which is a different concept. Allowing additions, but not multiplications, is an impossible subtlety for basic typedef.
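Here is one possible way to spell this out, assuming single-member structures with a member named value. make_meters() and multiply_meters() are names chosen for this sketch; add_meters() follows the naming suggested above.

```c
typedef struct { int value; } meters;
typedef struct { int value; } square_meters;

/* Hypothetical constructor, named for illustration only. */
static meters make_meters(int v)
{
    meters m = { v };
    return m;
}

/* Adding two lengths gives a length. */
static meters add_meters(meters a, meters b)
{
    meters r = { a.value + b.value };
    return r;
}

/* Multiplying two lengths gives a surface, a different type. */
static square_meters multiply_meters(meters a, meters b)
{
    square_meters r = { a.value * b.value };
    return r;
}
```

Since there is no function returning meters from a multiplication, the result of multiply_meters() cannot be stored into a meters variable by accident.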

Composition

There is no “intermediate” situation, where a type would be “compatible” with another type, yet different. In the mechanisms explained so far, types are either compatible and identical, using typedef, or completely incompatible, using a new definition and a new name.

In contrast, in Object Oriented languages, a cat can also be an animal, thanks to inheritance, so it’s possible to use cat to invoke animal methods, or use functions with animal parameter(s).

struct strongly leans towards composition. A struct cat can include a struct animal, which makes it possible to invoke animal related functions, though it’s not transparent : it’s necessary to explicitly spell the substructure (cat.animal) as a parameter or return value of the animal related function.
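A small sketch of this composition follows. The members (legs, energy, lives) and the functions animal_feed() and feed_the_cat() are invented for the example.

```c
struct animal {
    int legs;
    int energy;
};

struct cat {
    struct animal animal;   /* a cat contains an animal */
    int lives;
};

/* Hypothetical animal-related function. */
static void animal_feed(struct animal* a, int food)
{
    a->energy += food;
}

static int feed_the_cat(void)
{
    struct cat c = { { 4, 10 }, 9 };
    /* animal_feed(&c, 5);   <- diagnosed: struct cat* is not struct animal* */
    animal_feed(&c.animal, 5);   /* the substructure is spelled explicitly */
    return c.animal.energy;
}
```

The explicit &c.animal is the price of composition: slightly more verbose, but there is never any doubt about which type a function operates on.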

Note that even Object Oriented languages generally endorse the composition over inheritance principle. It states that, unless there is a very good reason to employ inheritance, composition should be preferred, because it generally fares better as the code evolves and becomes more complex (multiple inheritances quickly translate into a nightmare).

struct can be made more complex, with tables of virtual function pointers, achieving something similar to inheritance and polymorphism. But this is a whole different level of complexity. I will rather avoid this route for the time being. The current goal is merely to separate types in a way which can be checked by the compiler. Enforcing a unified interface on top of different types is a more complex topic, better left for a future article.

Opaque types

struct are fine as strong types, but publishing their definition implies that their members are public, meaning any user can access and modify them.
When it’s the goal, it’s totally fine.

But sometimes, in order to protect users from unintentional misuse, it is better to make structure members unreachable. This is called an opaque type. An additional benefit is that whatever is inaccessible cannot be relied upon, hence may be changed in the future without breaking user code.

Object-oriented languages have the private keyword, which allows exactly that : some members might be published, but they are nonetheless unreachable from the user (well, in theory…).

A “poor man” equivalent solution in C is to comment the code, clearly indicating which members are public, and which ones are private. No guarantee can be enforced by the compiler, but it’s still a good indication for users.
Another step is to give private members terrible names, such as never_ever_access_me, which provides a pretty serious hint, and is less easy to forget than a code comment.

Yet, sometimes, one wishes to rely on a stronger, compiler-backed guarantee, to ensure that no user will access private structure members. C doesn’t have private, but can do something equivalent.
It relies on the principle of incomplete types.

My own preference is to declare an incomplete type by pairing it with typedef :

typedef struct house_s house;
typedef struct car_s car;

Notice that we have not published anything about the internals of struct house_s. This is intentional. Since nothing is published, nothing can be accessed, hence nothing can be misused.

Fine, but what can we do with such a thing ? To begin with, we can’t even allocate it, since its size is not known.
That’s right, the only thing that can be declared at this stage is a pointer to the incomplete type, like this :

house* my_house;
car* my_car;
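The sketch below summarizes what the compiler accepts and rejects at this stage. The rejected lines are kept as comments so that the snippet still compiles; the variable is initialized to NULL purely for the sake of the example.

```c
#include <stddef.h>

typedef struct house_s house;   /* incomplete: members and size unknown */

/* house h;                   <- rejected: the size of h is not known */
/* size_t n = sizeof(house);  <- rejected: sizeof needs a complete type */

static house* my_house = NULL;  /* fine: a pointer always has a known size */
```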

And now ?
Well, only functions with house* or car* as parameter or return type can actually do something with it.
These functions must access the internal definitions of struct house_s and struct car_s. These definitions are therefore placed in the relevant *.c unit, rather than in the *.h header. Not being part of the public interface, the structures’ internals remain effectively private.

The first functions required are allocators and destructors.
For example, I’m used to the following naming convention :

thing* PREFIX_createThing(void);
void PREFIX_freeThing(thing* t);

Now, it’s possible to allocate a thing, and then do something with it (with additional functions).
A good convention is that functions which accept thing* as mutable argument should have thing* as first parameter, like in this example :

int PREFIX_pushElement(thing* t, element e);
element PREFIX_pullElement(thing* t);

Notice that we are getting pretty close to object oriented programming with this construction. Functions and data members, while not declared in an encompassing “object”, must nonetheless be defined together: the need to know the structure content to do anything about it forces function definitions to be grouped into the unit that declares the structure content. It’s fairly close.
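To show how the pieces fit together, here is a minimal single-file sketch of such a unit. Everything about the internals is an assumption made for illustration: element is taken to be a plain int, the thing is implemented as a small fixed-capacity stack, THING_CAPACITY is a made-up constant, and the return conventions (0 on success, 0 when empty) are arbitrary choices.

```c
#include <stdlib.h>

/* ---- public interface: what would live in the header (*.h) ---- */
typedef struct thing_s thing;   /* incomplete type: internals hidden */
typedef int element;            /* assumption: element is a plain int */

thing*  PREFIX_createThing(void);
void    PREFIX_freeThing(thing* t);
int     PREFIX_pushElement(thing* t, element e);  /* 0 on success, 1 if full */
element PREFIX_pullElement(thing* t);             /* last pushed, 0 if empty */

/* ---- private part: what would live in the unit (*.c) ---- */
#define THING_CAPACITY 8        /* hypothetical fixed capacity */

struct thing_s {
    element slots[THING_CAPACITY];
    size_t  count;
};

thing* PREFIX_createThing(void)
{
    /* calloc returns zeroed memory, so count starts at 0 */
    return calloc(1, sizeof(thing));
}

void PREFIX_freeThing(thing* t)
{
    free(t);   /* free(NULL) is a no-op */
}

int PREFIX_pushElement(thing* t, element e)
{
    if (t->count == THING_CAPACITY) return 1;
    t->slots[t->count++] = e;
    return 0;
}

element PREFIX_pullElement(thing* t)
{
    if (t->count == 0) return 0;
    return t->slots[--t->count];
}
```

In a real project the two halves would sit in separate files, and user code including only the header could never reach slots or count.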

Compared with a direct struct, a few differences stand out :

  • Members are private
  • Allocation is implemented by a function, it can only be invoked
    • no way to allocate on stack
    • no way to include a thing into another struct
      • but it’s possible to include a pointer thing*
    • Initialization can be enforced directly in the constructor
      • removes risks of garbage content due to lack of initialization.
  • The caller is in charge of invoking the destructor.
    • The pattern is exactly identical to malloc() / free() (see future article on Resource Control)

The responsibility to invoke the destructor after usage is very important.
It’s no different than invoking free() after a malloc(),
but that’s still an additional detail to take care of, with the corresponding risk of forgetting or mismanaging it.

To bypass this responsibility, and take control of the allocation process, it can be preferable to consider opaque types with static allocation. That’s the topic of the next article.

Summary

This closes this first chapter on the type system. We have seen that it’s possible to create strong types, and we can use this property to ensure users can’t mix up different types accidentally. We have seen that it’s possible to create opaque types, and ensure users can only invoke allowed operations, or can’t rely on secret internal details, clearing the path of future evolution. These properties are compiler-checked, so they are always automatically enforced.

That’s not bad. Just using these properties will seriously improve code resistance to potential mis-usages.

Yet, there is more the compiler can do to detect potential bugs in our code. To be continued…

Writing safer C code

Writing safer C code may feel like an overwhelming goal. After all, we are told that C gives programmers plenty of opportunities to shoot their own foot.

But that doesn’t mean there is no possible improvement. Actually, in the last decade, programming practices have already evolved dramatically, and for the better, as a consequence of multiple forces, such as improved tooling, shared programming and the rising cost of failures, as the numerous Internet exploits tend to remind us all too often.

I expected to start this series with an introduction on C, its strengths, and guiding principles on safer coding practices. But it doesn’t fit the blog post format, being too long, boring, and at times a potential troll magnet. Suffice it to say that “safer” implies writing Reviewer-Oriented source code, aka highly readable, and automating as much error detection as possible, favoring fast methods (immediate feedback while editing code) over longer ones (long offsite test sessions in dedicated environments).

One thing I can’t escape though is saying a few words about the intended audience. These articles are not meant to teach new things to “experts”, who know a lot more than I do. Neither are they intended to guide first steps in C programming. The intended audience has good enough C programming skills, and can actually ship products. Shipping real products is important, because the whole concept of “safer programming” is better understood under the pressure and experience of a product’s maintenance.

The main driver is to make it more difficult to ship bugs, as the code base lives and evolves and as new team members come on board, adding much-needed automated controls at every opportunity. Issues are centered around modifying / fixing an existing code base, and managing the cascading impacts on the rest of the project. This requires preparing the code for this challenge, hence the design patterns proposed are also useful for new code with an expected “long” life expectancy (beyond a few months).

Now let’s shorten this introduction and go directly into the meat of the topic.
I’ll start this series with design patterns that leverage compiler checks, to help make C code more resistant to mis-usages and future refactoring.



As a quick background, the compiler is a fairly central part of the development process for compiled languages. Compiling source code incurs a delay, more or less noticeable. That’s a cost.
Interpreted languages (most scripting languages: Python, Ruby, BASIC, bash, etc.) can evade it, making the initial code writing experience more agreeable, with a quick modification / experiment feedback loop.
The real cost of that convenience comes later, though, and it is steep : compiled languages have this constraint that the compiler must understand and therefore sanitize the code in order to produce the executable binary. This constraint becomes a huge advantage as it catches many categories of errors before they get a chance to run. This typically includes many flavors of mis-typings. Interpreted languages, in contrast, will have to find a majority of problems at run time (note: a good editor’s parser can definitely help both language types there).

And the compiler can go much further. One of the big lessons from modern safety-oriented languages like Rust is that using the compiler as a primary tool to guide design patterns towards safer practices improves code quality substantially. It’s a good choice : the compiler is a compulsory part of the development chain, it sits close to the programmer, and its diagnostics are part of the valuable “short” feedback loop (in contrast with complementary techniques such as code analyzers, test suites and sanitizers). Whatever the compiler can flag gets solved more quickly, reducing load and risks at later stages of the development.

Hence, this is the first topic to explore : let’s make the compiler work for us, check the validity of our code to the best of its abilities. To reach that goal, we will have to purposely leverage its capabilities, in effect help the compiler help us.

And let’s start with its first weapon, the type system.