Featured

Tuesday, May 20, 2014

Streaming API for LZ4

 For quite some time, the LZ4 Streaming API project has been started and delayed, as other priorities stepped in the way. To be fair, one important delaying factor was the difficulty to define a "clean enough" API, something that would be simple to use and understand, while also providing the level of fine-tuning required by advanced programmers (typically within embedded environments).

I feel quite close today from breaking this shell, with an interface definition which matches my definition of "clean enough" API. So let's share some preliminary results.

Current streaming interface

The current streaming API exposes the following functions :

void* LZ4_create (const char* inputBuffer);
int   LZ4_compress_continue (void* LZ4_Data, const char* source, char* dest, int inputSize);
char* LZ4_slideInputBuffer (void* LZ4_Data);
int   LZ4_free (void* LZ4_Data);

Although it works as intended, this API introduces some critical limitations.

First, there is a need to create a data structure which tracks the behavior of the stream.
This is void* LZ4_data, which will be generated by LZ4_create().

It looks fine, but a first issue is that allocation happens inside LZ4_create(). In some circumstances, this may not be acceptable : sometimes allocation must be fully controlled by the host process. For example, standard functions such as malloc() may be unavailable. For this use case have been created 2 more functions :

int LZ4_sizeofStreamState(void);
int LZ4_resetStreamState(void* state, const char* inputBuffer);

It looks fine too, but is unfortunately still not usable for static allocation purpose (on stack, for example).

The second issue is the const char* inputBuffer argument, specified at creation stage, because it will be fixed during the entire compression process. It's not possible to jump between memory segments. This is a major limitation, preventing "in-place" compression of scattered memory segments.

LZ4_compress_continue() looks fine as a function definition, but what is less so is the documented requirement : up to 64KB of previously processed data must stand *before* the data to be compressed. This is a major constraint, typically requiring to prepare data into an intermediate buffer (which in some circumstances, might prove an unaffordable burden).

Then there is LZ4_slideInputBuffer(), which is arguably not at its right responsibility level. Its mission is to copy the last previous 64KB of source data to the beginning of the input buffer. It exists because the content of void* LZ4_data is not accessible.

No function is defined to simply load a dictionary : to achieve that goal, one has to compress a segment using LZ4_compress_continue(), throw the result, and compress the next segment using data accumulated so far.

To summarize, yes it can work, but it's cumbersome. One has to accept the idea that data to compress must be prepared into an intermediate buffer where above conditions can be met.
It's not convenient, and may sometimes not be possible, for example due to allocation limitations.
It also has a negative impact on performance, due to memory copy operations overhead.

New streaming interface definition

While defining the streaming API, I constantly struggled to find the right balance between a "fully automated" API, and a "user-controlled" one. The "fully automated API" would take in charge all kind of workload, generate data respecting the LZ4 Streaming Format with full headers and markers, and take care of allocation for required buffers. On the other hand, the "user-controlled" API would serve the needs for host programs which require full control over resource allocation.

I therefore settled with providing both.
The "user-controlled" version will be part of LZ4 library. It's the simpler one, and only takes care of generating compressed block format, chained or independent. It depends on the host process to pay attention to all memory operations (allocation, position and availability) and to provide its own custom format linking the successive blocks.

The "fully automated" API will be part of a new library, which will be called LZ4S for "LZ4 Streaming". It is basically a user program of the previous API.

In this blog post, we will therefore concentrate on the first API, since it is also an underlying foundation for the 2nd one.

typedef struct { unsigned int table[LZ4_STREAMSIZE_U32]; } LZ4_stream_t;

int LZ4_compress_continue (void* LZ4_stream, const char* source, char* dest, int inputSize);

OK, let's focus on the most important differences with current API.

An LZ4_stream_t structure is exposed. It is provided to give a better control over memory management to the host program. For example, allocation can be static (stack, global) or dynamic, and use any user-defined allocation function.

Obviously, the host program is both in charge of allocating and freeing this memory space. It may also duplicate it "at will", which is a good idea for "static dictionary compression" scenarios.

Cleaning (or resetting) LZ4_stream_t is only a matter of memset() it to zero. It's a requirement to do it once before using this structure.

LZ4_stream_t 's size is controlled by the value LZ4_STREAMSIZE_U32, which is checked at compile time thanks to a static assert piece of code, as already used within xxHash. It mostly depends on LZ4's hash table (which typical size is about 16KB). Hash table size is a parameter controlled by the defined value LZ4_MEMORY_USAGE. Up to now, this define was present into lz4.c. To be coherent with the new interface, it will be moved to lz4.h instead. I don't foresee any issue with that.

LZ4_stream_t is described as a table of unsigned int. This is intentional. The real usage of this memory bank remains hidden inside lz4.c. This way, internal variables within the structure cannot be used as a kind of implicit interface contract, allowing future modifications to happen without breaking compatibility with existing programs.

Once the streaming structure is created, you can start to populate it by loading a static dictionary using :

int LZ4_loadDict (void* LZ4_stream, const char* source, int size);

This part is optional. Loading a dictionary flushes any prior data reference from LZ4_stream_t , if there is any, so you may also use this function to initialize an LZ4_stream_t structure by simply setting a size of 0.

LZ4_compress_continue() looks the same as previously, and is indeed compatible,  but a major difference is that it no longer requires to compress next data block "right after" previous data. Previous and Next data can be anywhere in memory. The LZ4_stream_t structure will keep track of it automatically.

One strong assumption of LZ4_compress_continue() is that previously compressed data is still available. Unfortunately, this might not be the case, and this function cannot guess that.
Should previously compressed data be no longer accessible, for example because you need the space for something else, or you can't guarantee its availability, it's possible to solve that situation :
  • You can "save" the relevant data segment from previous block (typically, the last 64KB, it can also be less than that) into a "safe" place. At which point, you will need to indicate this change of position.
int LZ4_moveDict (void* LZ4_stream, char* safeBuffer, int dictSize);

Memory space available at char* safeBuffer must be at least dictSize,  since it is the amount of data  LZ4_moveDict() will copy there (Note : the maximum amount of data that will be copied is 64KB, even if dictSize is larger).


Decompression

The current streaming decompression API is defined as follows :

int LZ4_decompress_safe_withPrefix64k (const char* source, char* dest, int compressedSize, int maxOutputSize);
int LZ4_decompress_fast_withPrefix64k (const char* source, char* dest, int originalSize);

Its documented requirement is that previously decompressed data must stand *right before* the memory address where the next data block is going to be decompressed. It works, but implies the creation of a temporary buffer to match the requirement.

The new streaming API get rid of this limitation, allowing previous data to stand anywhere within accessible memory :

int LZ4_decompress_safe_usingDict (const char* source, char* dest, int compressedSize, int maxOutputSize, const char* dictStart, int dictSize);
int LZ4_decompress_fast_usingDict (const char* source, char* dest, int originalSize, const char* dictStart, int dictSize);

The dictionary definition is part of the argument list (const char* dictStart, int dictSize), therefore there is no need for a tracking structure, in contrast with compression.



A first code implementing this API is currently available in the "dev" branch of LZ4 on github. It is still early stage, and probably the right time to be provide your comments should you have any.

[Edit] Added : LZ4_moveDict() as a potential replacement of LZ4_SetDictPos()
[Edit] Added : Paragraph on decompression
[Edit] Modified : move to LZ4_moveDict()
[Edit] Interface definition converges towards LZ4_compress_continue()
[Edit] "streaming" branch now merged into "dev" branch

7 comments:

  1. The usage of LZ4_setDictPos(...) is not clear for me from the comments. Do I assume correctly that you should move the last size bytes to the buffer? E.g. if I compressed an 8 MB block in my previous call, I copy the last 64kiB to a safe buffer and then call LZ4_setDictPos(pDict, pBuffer, 65536), right? What happens if I use a smaller size (e.g. 32kiB), will the call 'truncate' the dictionary structure to never use an offset larger than that? Also I chose 64 kiB here because the LZ4 format description says that the max possible offset is 65535 - I would prefer to have an API function to get this magic number though.

    Also as a C++ guy I wonder why you are consistently using int in places where I would use size_t, but since you are playing C in a different league than me I guess there are good reasons for that :).

    ReplyDelete
    Replies
    1. Yes, your understanding is correct.

      Actually, the choice is between a function which just tells where the last bytes previously compressed are (LZ4_setDictPos()), and another function which would copy the relevant bytes to a safe buffer (LZ4_moveDict()). In a nutshell, is the copy operation handled within the function, or kept as a host-controlled operation.

      LZ4_moveDict(const char* safePosition, int dictSize);
      // if dictSize>64KB, copy operation will nonetheless be limited to 64KB

      There is no problem in only keeping "just" 32KB, or just 4KB, or any other size. The API will make sure to never reference data before that limit.

      Delete
    2. int is preferred to size_t because it's more standard :
      size_t is only guaranteed to be present on C99+ compilers, and requires one more #include.

      This is a minor detail.

      Delete
    3. size_t has been in since C89. Use of int for memory buffer sizes should be considered terrible practice because of the number of bugs which have resulted from truncations and negative values. At the very least "unsigned" should be used, but really there is no reason not to use size_t.

      Delete
    4. I'm pretty sure a few complaints were expressed about size_t in the past, notoriously from embedded software territory. This is not your typical PC application environment, and many libraries or headers assumed to be present are either not available, or deliberately avoided to ensure full debugging capabilities.

      Comment regarding unsigned int is valid. I'm tempted to use it, although modifying the (non-streaming) API at this stage can be problematic due to installed user base. So I'll have to check first if it produces any side-effect within user programs.
      Note that "int" sizes are checked within LZ4, so I'm not expecting any problem. But I'm fine with improving practices, as long as they don't generate side effects.

      Delete
    5. Latest release of LZ4 (streaming branch) finally selects to only expose LZ4_moveDict(), which behavior seems easier to understand and explain.

      Delete
  2. Waiting for the version that has a password parameter like 7z, and also, it's much better to have a dll so it can be imported or used as an external file to compressed, decompressed filestream or memorystream.

    Glad to know this tools!

    ReplyDelete