Monday, June 10, 2013

Fighting Code Bloat - Part 2 - Inline functions

 A while ago, I wrote a blog post on a method to decrease source code size, improving maintainability, when the source contains multiple functions which are almost identical, except for a few differences in conditions or types.

The method uses a separate *.h file, which is included as many times as necessary, with a few #define to trigger the relevant sections of the code.

Although it works, I couldn't help but get the feeling that it looks almost like a hack. Also, as a drawback, the method slightly increases source complexity, by increasing the total number of files to be included into an external project (granted, this number remains fairly small for LZ4, but still, it's a step in the wrong direction).

An insightful comment from Bryce Schober kindly reminded me that another method was readily accessible, using inline functions. Yes, inline functions would remove the issue of separate #include files, but they come with their own set of limitations; namely, inline is merely a "hint" to the compiler, not a guarantee that the function will effectively get inlined, and it doesn't solve the issue of manipulating different types within the function.

However, I wanted to give this method a try, in case it would result in a cleaner solution. Inline functions are regularly advised as a better alternative to macros (whenever applicable), since inline functions are still compiled, with all the benefits of strong typing and semantic checks. This greatly improves debugging and code maintainability.

For this attempt, I selected the set of decompression functions, which has two big advantages :
1) The functions are small. As a heuristic, the smaller a function is, the more likely it is to get inlined.
2) There are no 'type' manipulations between the different versions, only different sets of tests.

The key aspect for this trick to work is to expect the compiler's optimizer to do its job properly. Namely, whenever a test is guaranteed to be 'true' or 'false', we want the associated branch to be eliminated and the relevant piece of code to be instantiated in place. That's why it's key for this function to be inlined : without inlining, the compiler can't remove the branches, resulting in a noticeable performance penalty.

Ultimately, the experiment proved successful, and is the basis of LZ4 newest version, r97.

A few key things that were learned in the process :

- Make sure that branches to be removed receive a very clear '1 or 0' signal.
For example, if one branch depends on (variable > 0), don't write it this way. You should know in advance whether the variable will fulfill the condition, and add another input, such as testtrigger, which receives the value 1 or 0. The test then becomes if (testtrigger) : this ensures the compiler understands the test's result, and therefore instantiates only the correct branch.
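To make the idea concrete, here is a minimal sketch of the pattern. All names (copy_generic, testtrigger and the variants) are illustrative, not LZ4's actual code; the shim assumes a compiler that honors forced inlining:

```c
#include <string.h>

/* Portability shim for forced inlining (not part of C99). */
#if defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#else
#  define FORCE_INLINE static inline __attribute__((always_inline))
#endif

/* Generic worker : 'testtrigger' always receives a compile-time
   constant (0 or 1), so once inlined, the optimizer keeps only
   one side of the branch in each caller. */
FORCE_INLINE int copy_generic(char* dst, int dstCapacity,
                              const char* src, int srcSize,
                              int testtrigger)
{
    if (testtrigger) {                 /* branch eliminated when constant 0 */
        if (srcSize > dstCapacity) return -1;
    }
    memcpy(dst, src, (size_t)srcSize);
    return srcSize;
}

/* Public variants, each instantiating one specialization. */
int copy_fast(char* dst, const char* src, int srcSize)
{ return copy_generic(dst, 0, src, srcSize, 0); }

int copy_safe(char* dst, int dstCapacity, const char* src, int srcSize)
{ return copy_generic(dst, dstCapacity, src, srcSize, 1); }
```

With this shape, each public variant compiles down to the same code as a hand-written specialized function, provided the inlining actually happens.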

- Make sure the function really gets inlined. In my tests, GCC happily inlined the generic function, but Visual Studio would not. Fortunately, this can be solved by using the __forceinline keyword (__forceinline is not part of the C99 specification, it is therefore compiler-specific ; it's a portability issue).

- Impact on performance is not zero. Here, it is a mixed bag. Inlined functions perform slightly worse than macros for Visual, while performing slightly better for GCC. It's difficult to explain why. The resulting assembler code is fairly close, but definitely not identical, and the very small differences could be enough to explain the performance delta. We are thankfully talking about small deltas, in the range of 2%.

- Code readability is definitely improved, as compared to a separate file with macros. This is a big boost to code maintenance. As an added bonus, it even proved easier to find some more optimization.

As a result, this version features a small boost to decoding speed, as measured below :
(fullbench, compiled with GCC v4.6.1 64-bits, running on Core i5 560M with Linux Ubuntu 11.10)
Summary (MB/s):                       r96    r97
LZ4_decompress_fast                  1412   1460
LZ4_decompress_fast_withPrefix64k    1408   1457
LZ4_decompress_safe                  1369   1391
LZ4_decompress_safe_withPrefix64k    1404   1416
LZ4_decompress_safe_partial          1327   1387


All in all, this is a step in the right direction.
Can the experiment be extended to the compression function itself ? Well, that's a story for later. It certainly looks much more complex.

Friday, April 26, 2013

Fighting code bloat (C template-style)

 A little detail has always annoyed me within the LZ4 source code : the presence of 2 compression functions, and 2 decompression functions.

In both cases, the 2 variants are very close to each other. But the differences are large enough to justify 2 separate functions (such as different underlying types). While it's a minor annoyance, the situation could not last.

Creating the second function was relatively easy : just copy/paste the first function, and modify whatever is necessary. Problems start with code maintenance. Whenever modifying or correcting one function, it's necessary to not forget to also modify the second one, whenever applicable. Doing this multiple times, it's likely that a few minor discrepancies will creep in, especially when the code is large (which, thankfully, it is not for LZ4 yet).
But that's only the beginning of the problems.
What if other variants are needed ?
Then, it will be necessary to create a new function almost identical to the previous ones, or to multiply the current number of variants by 2 if it is an additional on/off parameter. What if I need another on/off parameter ? That's 8 similar functions. Obviously, this doesn't scale.

C++ template
Dealing with a similar issue in C++ has its solution : it's called Template. Getting a list of parameters with predictable values in front of the function template will instruct the compiler to create as many versions of the function as required. It's a very handy way to write the function once, and have it instantiated many times with combination of modifications.

Alas, C coders are not blessed with such a comprehensive tool. C99, the latest language version it is reasonable to target, doesn't define them. (The more recent C11 is still too young for widespread deployment, and anyway, even the new type-generic macros defined in C11 are a long shot from templates).

Googling around shows this is a recurrent issue, with multiple existing attempts at mimicking template behavior using C macros. Within current language limitations, it is the sole strategy available. Most of these attempts are fairly complex, which doesn't help debugging, and come with various limitations, depending on the objectives pursued (mostly focused on parameter types).

LZ4 is going to require new functions very soon, so the problem of duplicated lines of code is going to become acute. It is urgent to solve it.

The case for inline functions
One potential way to do it is to use inline functions. Such functions will be instantiated in-place, where they are called. In many ways, inline functions behave the same as macros with parameters, but with the big advantage of being typed and compiled, resulting in much cleaner code to debug. By defining some parameters whose sole objective is to enable or disable some parts of the code (typically some checks), the compiler will automatically create the right optimal code, removing useless branches. This is a good start.

However, inline functions also come with limitations, especially in C.
First, inline is merely a compilation hint. The compiler is free to decide if the function will be instantiated or referenced; the programmer has no direct control over this decision. A basic rule of thumb is : if the function is very small, it will most probably be instantiated in-place, while if it is large, it most likely won't. However, there is a large gap between these two extreme situations, where it's more difficult to guarantee anything.

This limitation pushes towards small inline functions, which by the way is the intention of the standard. So, instead of creating a very large all-encompassing function, it could be a good idea to cut it into smaller pieces. Alas, there is another problem : in contrast with macros, inline functions do not modify their parameters (just like normal functions). But most of the time, these snippets of code are responsible for modifying several variables; a single return value is simply not enough. Since it's not possible to pass variables by reference (that's C++ only), the workaround is to pass parameters as pointers to variables. It works (there is such an example in lz4hc). It just makes it even harder for the compiler to understand the intent and optimize such a mess, and therefore less likely to produce the best, fastest code.
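A minimal sketch of this pointer-passing workaround (names are illustrative, not lz4hc's actual code) : the helper must advance both the read and write cursors of its caller, so their addresses are passed instead of their values :

```c
#include <stddef.h>

/* Sketch : an inline snippet must update both the input and output
   cursors of its caller. C has no references, so their addresses
   are passed instead (names illustrative, not actual LZ4 code). */
static inline void copy_literals(const unsigned char** ip,
                                 unsigned char** op, size_t length)
{
    size_t i;
    for (i = 0; i < length; i++) (*op)[i] = (*ip)[i];
    *ip += length;    /* the caller's cursors move forward */
    *op += length;
}
```

Each extra level of indirection like this is exactly what makes the optimizer's job harder, as noted above.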

The last limitation is even less forgiving : what can be done when the difference between two versions concerns underlying types ?
A good example is the HTYPE of LZ4, which can be either an U16 (for the 64K version), a BYTE* pointer (for 32-bits), or an U32 (for 64-bits). The code is almost the same, just this particular type changes depending on the version. How to do that with an inline function ? The simple answer is : you can't.

LZ4 solution
So here is my attempt. In order to reduce the number of functions written into LZ4 source code, preparing the creation of newer ones sharing almost the same code, each function is written into a dedicated file, included several times into the main .c source file, preceded by a set of #define which trigger specific behaviors.

Hence were created lz4_encoder.h and lz4_decoder.h.
Both files are unusual : they are neither real "headers", nor compilable piece of code. They are meant to be included into another source file (*.c), preceded by a list of #define, some of them being mandatory (such as FUNCTION_NAME), and others being optional (such as LIMITED_OUTPUT).
Each time the *.h file is included, it creates a new function.
It becomes possible to write the function once, and create multiple variations of it, greatly simplifying code maintenance.
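As a single-file approximation of the same idea, a generator macro can stand in for the re-included file. In LZ4 the body below lives in a separate *.h, re-included with different #define sets, but the effect is the same; all names here are hypothetical :

```c
/* Single-file approximation of the technique : the body below plays
   the role of lz4_encoder.h / lz4_decoder.h. A constant 0 or 1 for
   LIMITED_OUTPUT lets the compiler prune the unused branch in each
   generated variant (all names are illustrative). */
#define GENERATE_COPY(FUNCTION_NAME, LIMITED_OUTPUT)                 \
static int FUNCTION_NAME(char* dst, int maxOut,                      \
                         const char* src, int srcSize)               \
{                                                                    \
    int i;                                                           \
    for (i = 0; i < srcSize; i++) {                                  \
        if (LIMITED_OUTPUT && i >= maxOut) return -1;                \
        dst[i] = src[i];                                             \
    }                                                                \
    return i;                                                        \
}

GENERATE_COPY(copy_unbounded, 0)   /* variant without output check */
GENERATE_COPY(copy_limited,   1)   /* variant with output check    */
```

The separate-file form trades this macro's backslash continuations for a normally-written function body, which is precisely what makes it easier to read and debug.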

The situation is not all rosy. First, the main function becomes more complex to write, since it encompasses all possible variations. It's sometimes necessary to refactor the code just to make it more readable, since a collection of :
#ifdef CONDITION
    specific_code();
#endif
can quickly impact readability, resulting in a negative impact on code maintenance.

Another drawback is the necessity for any user program to import several files (lz4.c + lz4_encoder.h + lz4_decoder.h), while a single lz4.c was enough up to now.
This increased file management complexity was the main reason I've avoided this strategy up to now. But, with the streaming interface in the near future, it was a necessity to ensure that new functions can easily be created while keeping the source code size under control.

In the end, the code refactoring effort also created some immediate win. Performance of several functions improved, notably for the fast variant of the decompression function, the compression of small packets, and the High Compression variant. This is a consequence of "sanitizing" the code, removing useless tests from variants which don't need them, and finding minor differences between 2 versions, keeping only the better one.

Next goal in list is inter-dependent block compression and decompression, an essential piece of the puzzle towards Streaming Interface.

Tuesday, April 9, 2013

LZ4 Frame format : Final specifications

[Edit] : the specification linked from this blog post is quite old by now. Prefer consulting the up-to-date version, stored directly into the project's repository, at https://github.com/lz4/lz4/tree/master/doc .


The LZ4 Framing Format specification has progressed quite a bit since last post, taking into consideration most issues raised by commenters. It has now reached version 1.5 (see edit below), which looks stable enough.


As a consequence, barring any last-minute important item raised by contributors, the currently published specification will be used in upcoming LZ4 releases.

[Edit] : and last-minute change there is. Following a suggestion by Takayuki Matsuoka, the header checksum is now slightly different, in an effort to become more friendly with read-only media, hopefully improving clarity in the process. Specification version is now raised to v1.3.

[Edit 2] : A first version of LZ4c, implementing the above specification, is available at Google Code.

[Edit 3] : Following recommendations from Mark Adler, version v1.4 re-introduces the frame content checksum. It's not correct to assume that block checksums make the frame content checksum redundant : block checksums only validate that each block has no error, while the frame content checksum verifies that all blocks are present and in correct order. Finally, the frame content checksum also validates the encoding & decoding stages.
v1.4 also introduces the definition of "skippable frames", which can be used to encapsulate any kind of user-defined data into a flow of appended LZ4 frames.

[Edit 4] : Changed naming convention in v1.4.1 to "frame".

[Edit 5] : v1.5 removed Dict_ID from specification

Thursday, March 21, 2013

A Streaming format for LZ4



For a long time, I have been willing to produce a suitable streaming format for LZ4. As stated earlier, the format defined in lz4demo was not meant to become widespread, and is too limited. It's also completely unsuitable for network-oriented protocols.

As a consequence, you'll find in the following link a nearly-final specification of the LZ4 streaming format, in OpenDocument format.

It's only "nearly" final, since there are still a few questions left, summarized at the end of each chapter.

However, the main properties seem settled. The format accommodates a large selection of buffer sizes, authorizes sequential and independent blocks, embeds a few checksum options, accepts arbitrary flushes at any point, and even defines provisions for a preset dictionary mode.

At this stage, the very last questions seem ready to be settled in the next few weeks. Inputs and comments are welcome.




[Edit] progresses :
Settled thanks to your comments (here and direct email):

  • Endian convention : more votes for Little Endian.
  • Stream Size : 8 bytes seems okay
  • Stream checksum : removed (v0.7), block-level checksum seems enough
  • High compression flag : removed (v0.8), seems not useful enough
  • Block Maximum size : reduced table (v0.9), from 1KB to 4MB
  • Block size : simplified compressed/uncompressed flags (v1.0)


[Edit] answering spec-related questions directly into the post

Jim> suggestion is to allow different checksums, with a 16-bit word identifying which hash

Actually, it was my initial intention.
But I eventually backed off. Why ?

One of LZ4's strong points is its simple specification, which makes it possible for any programmer to produce an LZ4-compatible algorithm of their own, in any language. To reach this goal, complexity is always fought and reduced to a minimum.

If multiple hash algorithms are possible, then the decoder will have to support them all to be compatible with the specification. It's a significant burden, which will deter or slow down people willing to produce, test and maintain their own decoding routine.

Obviously, the streaming format proposed here is not supposed to be "the most appropriate for any usage". I intend it to become "mainstream", cross-platform, and to replace the current "static format" of "lz4demo". But there will always be some specific circumstances in which another format will do a better job.


Matt> deflate format supports preset dictionaries, but nobody uses them. I would drop it.

Actually, I had a few requests for such a feature. The idea is that preset dictionaries can make a great difference when it comes to sending a lot of small independent blocks.


Matt> Do you need checksums at all?

I think so. The format is intended both for short-lived transmission data and for storage. A checksum is almost mandatory for the second use case. Anyway, in the proposal, the checksum is still an "option", so it can be disabled by the sender if that seems better for its own use case.


Matt> Do you need both block and stream checksums? Probably not.
Mark> Stream Checksum: I don't see the point if data block checksum gives the appropriate protection 

That's the good question. It's kind of 50/50 when it comes to evaluating comments.
The simplicity argument looks good though.
[Edit 2] Stream checksum is removed (v0.7+), keeping only block-level checksum


Mark> Do you really think your block maximum size values make sense (...) All in all, I tend to think it is a too wide choice anyway

Good point. In v0.9, the choice of values for block maximum size has been reduced.


Matt> Do you need variable sized fields just to save a few bytes in the header? Probably not. 
Mark> I would make that "compressed flag" explicit, and thus keep only one "data size" field disregarding if the data was compressed or not. (...) . I'm not even sure you would need two possible sizes just to save one byte per block. 

Good points. Apparently, this part looks too complex, and would deserve some re-thinking.
[Edit 2] : the specification regarding block size is simplified in v1.0.


Adrien> Would it be better to let users select their own dictionary identifiers, rather than requiring Dict-ID to be the result of passing the dictionary through xxHash-32 ?

Good point. This behavior mimics the spec of RFC1950 (zlib). The only advantage I can think of is that it allows the decoder (and encoder) to check if the dictionary is the right one. I'm unsure if this is really useful though...

Takayuki> An LZ4 stream may be (pre)loaded in Read-Only Memory. In this case, temporarily masking BC.Checkbits to 0 is difficult.

Correct. The current proposal is to load the header into RAM, in order to perform the masking and checksum there. Is that an issue ?
Another proposal could be to not check the checkbits for ROM pre-loaded streams, since potential transmission errors are nullified in such a scenario.
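The first proposal can be sketched as follows. The header is copied into a small stack buffer, the checkbits are masked there, and the checksum is computed on the copy, so a ROM-resident stream is never written to. Offsets, sizes and names are illustrative, not the actual specification :

```c
#include <string.h>

/* Sketch of the RAM-copy workaround : mask the checkbits in a local
   copy only, leaving the (possibly read-only) source untouched.
   'checkbitsOffset' and the 16-byte cap are illustrative values. */
unsigned verify_header(const unsigned char* header, size_t headerSize,
                       size_t checkbitsOffset,
                       unsigned (*checksum)(const void*, size_t))
{
    unsigned char tmp[16];                 /* headers are small */
    if (headerSize > sizeof(tmp)) return 0;
    memcpy(tmp, header, headerSize);
    tmp[checkbitsOffset] = 0;              /* temporary masking, in RAM only */
    return checksum(tmp, headerSize);
}
```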

Wednesday, December 12, 2012

xxHash : new version

 It's been a few months since the initial release of xxHash. The algorithm has almost fulfilled its initial objective, which is to provide a hash function for error detection, fast enough to use within the LZ4 streaming interface.

Although the "fundamentals" were fine, a few details were still rough around the edges. The most important "missing feature" was the ability to provide input data in several consecutive blocks. When hashing a large file for example, the allocated buffer might not be large enough to store the whole input within a single block.

In order to accommodate this need, a new version of xxHash has been created, which is BSD licensed.

The way it works is by dividing the job into 3 parts :
XXH32_init() creates the context structure in which intermediate results will be stored.
This structure must be passed as an argument to the function XXH32_feed(), which is used to provide input in several consecutive blocks. Any number of blocks is possible; there is no limit.
When all data has been provided, it's time to retrieve the result, using XXH32_result(). This function also takes care of de-allocating the context structure.
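As a toy illustration of this 3-part pattern, here is the same shape with trivial placeholder arithmetic standing in for xxHash's real mixing. The names only mimic the XXH32_* API; nothing below is the actual library :

```c
#include <stdlib.h>

/* Toy illustration of the init / feed / result pattern described
   above. The arithmetic is a simple placeholder, not xxHash's. */
typedef struct { unsigned acc; } hash_ctx;

void* hash_init(void)                      /* creates the context */
{
    hash_ctx* ctx = (hash_ctx*)malloc(sizeof(*ctx));
    if (ctx) ctx->acc = 2166136261u;       /* arbitrary seed */
    return ctx;
}

int hash_feed(void* state, const void* input, size_t len)
{
    hash_ctx* ctx = (hash_ctx*)state;
    const unsigned char* p = (const unsigned char*)input;
    size_t i;
    for (i = 0; i < len; i++) ctx->acc = (ctx->acc ^ p[i]) * 16777619u;
    return 0;
}

unsigned hash_result(void* state)          /* also frees the context */
{
    hash_ctx* ctx = (hash_ctx*)state;
    unsigned h = ctx->acc;
    free(ctx);
    return h;
}
```

The essential property of the pattern : feeding the data in one block or in several consecutive blocks produces the same result.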

A "single pass" function is also provided, both for simplicity (a simple function call is enough) and for performance. The latter is important if the amount of data to hash is small (<100 bytes, also called "small keys"). In this case, since there is no intermediate structure to allocate & maintain, the savings are significant.

To simplify usage and interoperability, there is now a single xxHash version, which is "strong" (meaning it successfully passes all tests from the SMHasher test suite). This is possible because the new version is also faster (5.4 GB/s on my Core 2 Duo, to be compared with 4.2 GB/s for the older one). The speed difference no longer justifies a "fast" version with a lessened distribution guarantee.

The framework is also more extensible, meaning that versions for 64-bits, 128-bits and 256-bits can appear in the future. But for the time being, the focus is really on the 32-bits version. It's designed to be very fast on all kind of 32-bits CPU, including embedded ones (such as ARM), with still the objective to become a companion error checker for LZ4 streaming.

Tuesday, July 3, 2012

Log file compression

 Although I'm currently on holiday, with limited access to the world wide web,
I would like to link here an interesting contribution from Vitor Oliveira, an LZ4 user, who seems to have found some pretty impressive results for log file compression by using a simple technique :
multi-pass compression.

More specifically, his method, which involves several passes of the LZ4 algorithm, seems to produce compressed files several times smaller than zlib's, while requiring only a fraction of the computation cost.

Some preliminary results :
zlib (one pass)  : 54 MB, 265 ms
LZ4 (one pass)   : 56 MB, 6 ms
LZ4 (multi-pass) :  4 MB, 16 ms

Since log file compression is a relatively common scenario, I figured this was interesting to share :
https://groups.google.com/d/msg/lz4c/DcN5SgFywwk/AVMOPri0O3gJ


Wednesday, May 30, 2012

Compressed data transmission

 If there is a situation where data is inherently short-lived, it is communication. Data starts its life on the sender side, and ends it on the receiving side, a few milliseconds later.

Or does it ? Sometimes, data comes from a file into local storage, or can be stored at the receiving side. In such cases, data is merely "traveling", but is not "short-lived".

Does it make any difference ? In fact, yes, it does.

When it comes to sending a file's content, the data can be "prepared" in advance, which means it can be compressed ahead of sending. Very strong (asymmetric) algorithms can be used for compression, as long as decoding remains "fast enough" to cope with data speed. This leads to major bandwidth reduction, and therefore improves cost and perceived transmission speed.

When it comes to sending "short-lived" data, it means this data did not exist before being produced, and the sole purpose of this data existence is to be sent, and (generally) consumed on receiving end. There is no way to "prepare" such data in advance, it must be compressed "on the fly", which means "fast".

But there is another terrible side effect : compression performance primarily comes from its capacity to "learn patterns", and re-apply them in an optimal way. Which means, for compression to be effective, a minimum of "historic data" must have already been processed for the next data to be properly compressed. With a file content, the history can be the entire file itself, which could mean a lot of megabytes, and therefore excellent perspectives for compression.
The situation is much more severe when data is generated and compressed "on the fly" : maybe the message to be sent is only a few tens of bytes long. How to compress such a thing ?

Let's study this use case.
A first "naive" implementation would simply create a message, make a packet out of it, compress it and then send it.
This implementation is unlikely to bring tangible benefits, since IP packets are typically small, trying to match MTU in order to avoid fragmentation side-effects.

A second, more compression-friendly, implementation, could try to amass enough information before starting to compress it, and then send the compressed data using as many packets as necessary.
This will certainly bring better compression performance, but introduces another problem, latency. Waiting for "enough data to be compressed" can lead to unacceptable delays.
For example, in real-time games, player commands must be sent basically a.s.a.p.
As another use case, some systems may generate little data (a temperature probe for example), separated by long cycle duration.
Therefore, waiting for "enough data" is not a valid strategy in such circumstances.

A third, more complex, strategy, would use all past transmitted data as a kind of "dictionary", to help compress the next packet to come.
This basically requires the dictionary to remain "synchronized" at both ends, sender and receiver. This is achievable in an "ideal" environment (no loss, no error), which is quite common in fact when using TCP transmission.

So, to sum up, we have some freshly generated data to send, of any size but typically small (a few hundred bytes), and we want to use all previously transmitted data as a dictionary to improve compression, which requires some kind of "memory" at both the sender and receiver ends.
This looks possible.
In fact, this is a direct usage of the "variable block sizes" concept which I expressly ruled out as "not useful" in an earlier blog note :). Now seems a good time to consider it again...

Such an implementation would however require some new functions, able to re-use and save some "history data", instead of starting from freshly cleaned tables. This will require quite some work to achieve.

As a side effect of such a methodology, it also means that such compressed packets are not compatible with stateless protocols : since they depend on previously sent data, they are inherently stateful. But so are TCP sessions anyway...

Monday, May 28, 2012

Members properties


 After spending some time on expected properties at streaming level, let's now get to the core of the objective, regarding the compressed data parameters.

As stated previously, a compressed stream consists of several members, the most important ones being compressed data sets. Each member starts with a header, in order to identify its content. And each header starts with a magic number, a kind of 'ID tag'.

We'll focus here on "LZ4 compressed data set". The stream design above allows adding any future compression algorithm at a later stage.

And let's take as an example the old legacy framing format, defined into lz4demo.

1) There is a magic number, which is 0x184C2102, in little endian format.
2) There are no explicit parameters. In fact, all parameters are implicit.
They are :
- The compressed data set is cut into blocks of 8MB
- Each block starts with a field giving its size (therefore, the compressed size)
- Blocks are independent
- The original data size is not stored. It will be known on decoding completion
- There is no checksum
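A sketch of a parser walking this legacy framing follows. Only the framing is handled; actually decompressing each block would call into LZ4 (omitted here), and no attempt is made to validate truncated inputs :

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch : walking the legacy lz4demo framing described above. */
#define LEGACY_MAGIC 0x184C2102u

static uint32_t read_le32(const uint8_t* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the number of blocks found, or -1 on bad magic. */
int count_legacy_blocks(const uint8_t* src, size_t srcSize)
{
    size_t pos = 4;
    int nbBlocks = 0;
    if (srcSize < 4 || read_le32(src) != LEGACY_MAGIC) return -1;
    while (pos + 4 <= srcSize) {
        uint32_t blockSize = read_le32(src + pos);
        pos += 4 + blockSize;   /* skip size field + compressed block */
        nbBlocks++;
    }
    return nbBlocks;
}
```

Note how the total size of the original data only becomes known once every block has been walked and decoded, exactly as stated above.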

Well, even with such limitations, the format nonetheless works perfectly fine. It's just a little too restricted to become a "generic format", and therefore, the objective of the specification is to provide more room for parameters selections.

We have already established in previous blog posts that allowing checksum for Error detection is an important selectable feature.
Another important one is the ability to select the block size, since it directly controls the amount of memory buffers necessary at the decoding side.

Let's now study and establish potential needs for a few other properties :
  • Source data size
    The original size of source data is not an absolute necessity : it's always possible to decode without it, as long as buffer sizes are properly described.

    But it is nonetheless useful. For example, thanks to this information, the number of blocks within the current member can be calculated beforehand. Moreover the amount of data to decode from the last block is known.
    Or, if there is a single block, the exact amount of memory can be allocated, instead of the block maximum size.
    It is also useful to display the processing position (yep, we decoded 80MB, but does that represent 10% or 90% of the stream to decode ?)

    However, there are also circumstances in which this data is not known. For example, if the input was piped to the compressing process, then the size will be known only on hitting its end. This might be too late to "retrofit" the output.
    Another situation is when several compressed data sets are appended into a single stream : then the "source data size" field only applies to the current data set, but the total size is not known.

    Therefore, since it is useful but not compulsory, this information shall be present, but as an option only.

  • Uncompressed blocks
    A simple but important feature, since it avoids the bandwidth overhead and CPU consumption of the compression format when it's useless.
    This could be done very cheaply, by accepting that, if the size of the compressed block is the same as the defined one, then it's necessarily uncompressed.

    This suggestion looks simple enough for most blocks, except for the last one, whose size is unknown (but capped).
    Therefore, it would be necessary to know the size of the last block to compare it to the compressed one, and therefore determine if the block is compressed or not.

    Another idea would be : let's give up this complexity; the last block is always compressed, even if compression is either useless or detrimental.
    Actually, it's not a good idea to "not handle the last block", since there is a disastrous corner case : suppose that the compressed size of the last block is exactly the size of an uncompressed full block : then the decoder will assume it's uncompressed, leading to data corruption.

    This corner case can be avoided by enforcing a simple rule : a compressed block is necessarily smaller than its original size. Therefore, as the last block has a size <= block size, its compressed size is necessarily < block size. Hence, if the size of this last block is the maximum size, then we are in the specific but valid corner case where the last block's size is exactly the maximum size of a block, and is not compressible.
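The rule can be summarized in two tiny helpers (names hypothetical). This resolves full-size blocks only; the partial last block still needs the extra indicator discussed next in the post :

```c
/* Sketch of the convention discussed above : a block is stored
   compressed only if strictly smaller than its original, so on the
   decoding side, a stored size equal to the maximum block size can
   only mean "full block, stored uncompressed". */
int encoder_stores_compressed(unsigned compressedSize, unsigned originalSize)
{
    return compressedSize < originalSize;   /* otherwise store raw */
}

int decoder_sees_compressed(unsigned storedSize, unsigned maxBlockSize)
{
    return storedSize < maxBlockSize;       /* == maxBlockSize => raw full block */
}
```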

    OK, enough of corner cases, let's now be in the normal situation where the last block size is a fraction of the maximum block size. How could we know it is uncompressed ?

    This problem could be mitigated by inserting an information to know that we are dealing with the last block. For example, knowing the original size of the source data is enough for this need.

    But it's not always available. As said previously, this is just an option, since in some streaming mode, this information is unknown. Therefore we need another indicator.

    It could be something as simple as a bit, which simply tells that there is another block to follow, and as a consequence, the current block is full sized. As a bonus, this mechanism also protects against "silent block truncation" (when the compressed stream is cut exactly at the border between 2 blocks).
    On reaching the last block, we need another piece of information, either the uncompressed size of the block, or if the block is compressed. The latter seems more compact.

  • Zero-filled blocks
    This idea was actually proposed by Charles Bloom : it's not rare, for a section of input data, to be filled with zeros.
    The idea would be to "mark" such blocks with a specific prefix, such as "0".
    For such situation to have reasonable chances to happen, the block size must be small enough. For example, this will probably almost never happen with lz4demo block size (8MB), while this is going to be much more frequent with very small blocks, such as 4KB ones.
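A detector for such blocks is straightforward (sketch only; how the special marker is then emitted depends on the framing, which the idea above leaves open) :

```c
#include <stddef.h>

/* Sketch : detecting a zero-filled block, so the encoder can emit a
   special marker for it instead of compressed data. */
int block_is_zero_filled(const unsigned char* block, size_t size)
{
    size_t i;
    for (i = 0; i < size; i++)
        if (block[i] != 0) return 0;
    return 1;
}
```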

  • Error correction
    While error detection has been much talked about, nothing has been said up to now about error correction.
    That's both because this feature is much more complex to provide and of questionable usefulness.

    Error correction is mostly useful in situations when there is no way to "resend" erroneous data. This applies to real-time codec (such as voice or video) and stored archive.
    The difficulty in both cases is that erroneous data tends to be "bursty". For example, when a storage sector fails, we don't lose just a few bytes, but an entire sector size, typically 4KB. Same for transmission, where the most common error is a missing packet.
    Dealing with large bursts of errors requires some specific techniques, which unfortunately cost a lot of processing power and memory. As a consequence, the CPU and memory budget for error correction is way beyond LZ4's, which makes the association a questionable choice.

    Therefore, it seems this feature is not "generic enough" to deserve a place in the generic framing format specification. Obviously, forking is always possible, and even recommended, to support specific features.

  • Allow multi-threaded compression and decompression
    Multi-threaded compression is easily achievable thanks to the division of input data into "blocks".

    Multi-threaded decoding is also possible if those blocks are "independent".
    Both modes shall be possible, and selectable.

  • Variable block sizes
    This one is tricky : up to now, we have been talking about "fixed size" blocks only, with only the last block of a compressed data set having an unknown (but capped) size.
    The idea here would be to authorize blocks of arbitrary size, instead of fixed ones.

    The benefits are two-fold :
    • Separate data on "natural boundaries", in order to improve compression ratio and speed
    • Allow data insertion of any size

      The first point is easy to counter : such a benefit only materializes with very-high-ratio (and slow) compression algorithms, such as CM, which "learn" the data behavior through statistics. There is no tangible benefit in trying to do the same for LZ4.

      The second benefit is more interesting, since it authorizes some flexibility in archive management.
      Actually, this is mitigated by the possibility to concatenate compressed data sets (or "members") together in a single stream or file.
      Inserting data could therefore be realized by cutting the initial member into 2 parts, inserting the new member, and concatenating the 3 members together.
      As a consequence, it seems the format already supports such scenario, without needing variable block sizes.

  • Partial extraction and Quick Jump Table
    Another potential idea is that, within a member, one could need to only extract a specific portion.
    It's always possible to decode the full data set and get to the required location, but sometimes this might be overkill. For example, one may need a few MB of data which happen to be a few GB away from the starting point.

    However, the idea to decode just the necessary part introduces a few restrictions :
    • First, the input media should be seekable. It makes little sense to partially decode a piped stream, since the decoding process is likely faster than the pipe itself.
    • Second, the compressed data shall be cut into independent blocks. Otherwise, it would be necessary to decode, and therefore read, all previous blocks.
    • Third, to avoid decoding "too much data", the blocks shall be small enough, with a corresponding impact on compression ratio (the smaller the block, the lower the compression ratio).
    • Fourth, since the i/o process is likely slower than LZ4 decoding, there is a benefit only if it is possible to quick-jump to the right location immediately.
      This can be achieved thanks to a table at the beginning of the compressed file. Such a table can only be filled after compression, and therefore is incompatible with non-seekable output.
    • Fifth, such a "table" mechanism at member level would be useless in member-appending scenarios.

      That's quite a few restrictions, for the benefit of a rarely requested feature.
      So probably this capability shall be left to a dedicated framing format.
      Moreover, should the input stream be seekable, it's still possible to "hop" over blocks without reading/decoding them. This is slower than a direct jump, but still a noticeable speed improvement over decoding everything.

  • Error detection algorithm
    As a quick follow-up to selecting-checksum-algorithm, one could note that I had not specified a preferred checksum algorithm, only a preferred checksum size (32-bits).
    Although at this stage I'm somewhat inclined to use xxhash-strong, due to its speed and very good distribution properties, there is still a chance that the algorithm might be found unsuitable at a later stage. Therefore, some provision should be left to allow another algorithm to take over later on if need be.

    Pushing the idea a bit further, one could think "let the user select their own checksum algorithm". While the idea may sound nice, it goes against the principle of interoperability, which is exactly what this framing format tries to achieve. Therefore, only clearly defined checksum algorithms shall be allowed.

I believe this post went through most foreseeable requirements for the LZ4 framing format.
So now seems a reasonable time to start a header specification.