RealTime Data Compression

Tuesday, April 9, 2013

LZ4 Frame format : Final specifications

[Edit] : the specification linked from this blog post is quite old by now. Prefer consulting the up-to-date version, stored directly into the project's repository, at https://github.com/lz4/lz4/tree/master/doc .

The LZ4 Framing Format specification has progressed quite a bit since the last post, taking into consideration most issues raised by commenters. It has now reached version 1.5 (see edit below), which looks stable enough.

LZ4 Frame format : Specifications v1.5

As a consequence, barring any important last-minute item raised by contributors, the currently published specification will be used in upcoming LZ4 releases.

[Edit] : and last-minute change there is. Following a suggestion by Takayuki Matsuoka, the header checksum is now slightly different, in an effort to become more friendly with read-only media, hopefully improving clarity in the process. Specification version is now raised to v1.3.

[Edit 2] : A first version of LZ4c, implementing the above specification, is available at Google Code.

[Edit 3] : Following recommendations from Mark Adler, version v1.4 re-introduces the frame content checksum. It's not correct to assume that the block checksum makes the frame content checksum redundant : the block checksum only validates that each block has no error, while the frame content checksum verifies that all blocks are present and in the correct order. Finally, the frame content checksum also validates the encoding & decoding stages.
v1.4 also introduces the definition of "skippable frames", which can be used to encapsulate any kind of user-defined data into a flow of appended LZ4 frames.
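
To make the difference between the two checksums concrete, here is a minimal sketch. It uses CRC32 from Python's zlib module as a stand-in for the real hash (an assumption made only to keep the example self-contained), and the helper names are hypothetical. Blocks that are swapped or lost still pass their individual checks, while a checksum computed over the whole content does not match.

```python
import zlib

def block_checksums_ok(blocks, checksums):
    """Check each block against its own checksum, independently of the others."""
    return all(zlib.crc32(b) == c for b, c in zip(blocks, checksums))

def content_checksum(blocks):
    """Checksum of the whole content, fed block after block, in order."""
    value = 0
    for b in blocks:
        value = zlib.crc32(b, value)
    return value

blocks = [b"first block ", b"second block ", b"third block"]
pairs = [(b, zlib.crc32(b)) for b in blocks]      # (block, block checksum)
expected_content = content_checksum(blocks)

# Simulated transport faults: two blocks swapped, and the last block lost.
swapped = [pairs[1], pairs[0], pairs[2]]
truncated = pairs[:2]

results = {}
for name, received in (("swapped", swapped), ("truncated", truncated)):
    data = [b for b, _ in received]
    sums = [c for _, c in received]
    results[name] = (block_checksums_ok(data, sums),
                     content_checksum(data) == expected_content)
    print(name, results[name])
```

In both cases the per-block check succeeds and the content check fails, which is the situation the frame content checksum is meant to catch.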

[Edit 4] : Changed naming convention in v1.4.1 to "frame".

[Edit 5] : v1.5 removed Dict_ID from specification

Thursday, March 21, 2013

A Streaming format for LZ4

I have wanted to produce a suitable streaming format for LZ4 for a long time. As stated earlier, the format defined in lz4demo was not meant to become widespread, and is too limited. It's also completely unsuitable for network-oriented protocols.

As a consequence, you'll find in the following link a nearly-final specification of the LZ4 streaming format, in OpenDocument format.

It's only "nearly" final, since there are still a few questions left, summarized at the end of each chapter.

However, the main properties seem settled. The format accommodates a large selection of buffer sizes, authorizes sequential and independent blocks, embeds a few checksum options, accepts arbitrary flushes at any point, and even defines provisions for a preset dictionary mode.

At this stage, the very last questions seem ready to be settled in the next few weeks. Input and comments are welcome.

LZ4 streaming format specification

[Edit] Progress :
Settled thanks to your comments (here and direct email):

Endian convention : more votes for Little Endian.
Stream Size : 8 bytes seems okay
Stream checksum : removed (v0.7), block-level checksum seems enough
High compression flag : removed (v0.8), seems not useful enough
Block Maximum size : reduced table (v0.9), from 1KB to 4MB
Block size : simplified compressed/uncompressed flags (v1.0)

[Edit] answering spec-related questions directly into the post

Jim> suggestion is to allow different checksums, with a 16-bit word identifying which hash

Actually, it was my initial intention.
But I eventually backed off. Why ?

One of LZ4's strong points is its simple specification, which makes it possible for any programmer to produce an LZ4-compatible algorithm of their own, in any language. To reach this goal, complexity is always fought and reduced to a minimum.

If multiple hash algorithms are possible, then the decoder will have to support them all to be compatible with the specification. It's a significant burden, which will deter or slow down people willing to produce, test and maintain their own decoding routine.

Obviously, the streaming format proposed here is not supposed to be "the most appropriate for any usage". I intend it to become "mainstream", cross-platform, and to replace the current "static format" of "lz4demo". But there will always be some specific circumstances in which another format will do a better job.

Matt> deflate format supports preset dictionaries, but nobody uses them. I would drop it.

Actually, I had a few requests for such a feature. The idea is that pre-set dictionaries can make a great difference when it comes to sending a lot of small independent blocks.

Matt> Do you need checksums at all?

I think so. The format is meant both for short-lived transmission data and for storage. A checksum is almost mandatory for the second use case. Anyway, in the proposal, the checksum is still an "option", so it can be disabled by the sender if that seems better for their own use case.

Matt> Do you need both block and stream checksums? Probably not.
Mark> Stream Checksum: I don't see the point if data block checksum gives the appropriate protection

That's a good question. It's kind of 50/50 when it comes to evaluating comments.
The simplicity argument looks good though.
[Edit 2] Stream checksum is removed (v0.7+), keeping only block-level checksum

Mark> Do you really think your block maximum size values make sense (...) All in all, I tend to think it is a too wide choice anyway

Good point. In v0.9, the choice of values for block maximum size has been reduced.

Matt> Do you need variable sized fields just to save a few bytes in the header? Probably not.
Mark> I would make that "compressed flag" explicit, and thus keep only one "data size" field disregarding if the data was compressed or not. (...) . I'm not even sure you would need two possible sizes just to save one byte per block.

Good points. Apparently, this part looks too complex, and would deserve some re-thinking.
[Edit 2] : the specification regarding block size is simplified in v1.0.

Adrien> Would it be better to let users select their own dictionary identifiers, rather than requiring Dict-ID to be the result of passing the dictionary through xxHash-32 ?

Good point. This behavior mimics the spec of RFC1950 (zlib). The only advantage I can think of is that it allows the decoder (and encoder) to check if the dictionary is the right one. I'm unsure if this is really useful though....

Takayuki> LZ4 stream may be (pre) loaded on Read Only Memory. In this case, temporal masking 0 for BC.Checkbits is difficult.

Correct. The current proposal is to load the header into RAM in order to perform the masking and checksum there. Is that an issue ?
Another proposal could be to not check the checkbits for ROM pre-loaded streams, since potential transmission error is nullified for such scenario.

Wednesday, December 12, 2012

xxHash : new version

It has been a few months since the initial release of xxHash. The algorithm has almost fulfilled its initial objective, which is to provide a hash function for error detection fast enough to use within the LZ4 streaming interface.

Although the "fundamentals" were fine, a few details were still rough around the edges. The most important "missing feature" was the ability to provide input data in several consecutive blocks. When hashing a large file for example, the allocated buffer might not be large enough to store the whole input within a single block.

In order to accommodate this need, a new version of xxHash has been created, which is BSD licensed.

The way it works is by dividing the job into 3 parts :
XXH32_init() creates the context structure in which intermediate results will be stored.
This structure must be passed as an argument of function XXH32_feed(), which is used to provide input in several consecutive blocks. Any number of blocks is possible, there is no limit.
When all data has been provided, it's time to retrieve the result, using XXH32_result(). This function also takes care of de-allocating the context structure.
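
The three-part usage pattern can be sketched as follows. This is not xxHash itself : CRC32 from Python's zlib module stands in for the hash, and the names h32_init, h32_feed, h32_result and h32_single_pass are hypothetical, chosen only to mirror the functions described above. The point is merely that feeding consecutive blocks gives the same result as hashing everything in one call.

```python
import zlib

class Hash32State:
    """Hypothetical context holding the intermediate result (CRC32 stand-in)."""
    def __init__(self, seed=0):
        self.value = seed

def h32_init(seed=0):
    """Create the context in which intermediate results are stored."""
    return Hash32State(seed)

def h32_feed(state, chunk):
    """Provide the next consecutive block of input; any number of calls is fine."""
    state.value = zlib.crc32(chunk, state.value)

def h32_result(state):
    """Retrieve the final 32-bit value once all data has been provided."""
    return state.value & 0xFFFFFFFF

def h32_single_pass(data, seed=0):
    """One-call variant, with no intermediate structure to maintain."""
    return zlib.crc32(data, seed) & 0xFFFFFFFF

data = bytes(range(256)) * 1000            # 256,000 bytes of sample input
state = h32_init()
for start in range(0, len(data), 4096):    # pretend the buffer holds only 4 KB
    h32_feed(state, data[start:start + 4096])
streamed = h32_result(state)
print(streamed == h32_single_pass(data))
```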

A "single pass" function is also provided, both for simplicity (a simple function call is enough) and for performance. The latter is important if the amount of data to hash is small (<100 bytes, also called "small keys"). In this case, since there is no intermediate structure to allocate & maintain, the savings are significant.

To simplify usage and interoperability, there is now a single xxHash version, which is "strong" (meaning it successfully passes all tests from the SMHasher test suite). This is possible because the new version is also faster (5.4GB/s on my Core 2 Duo, to be compared with 4.2GB/s for the older one). The speed difference no longer justifies a "fast" version with lessened distribution guarantee.

The framework is also more extensible, meaning that versions for 64-bit, 128-bit and 256-bit can appear in the future. But for the time being, the focus is really on the 32-bit version. It's designed to be very fast on all kinds of 32-bit CPUs, including embedded ones (such as ARM), still with the objective to become a companion error checker for LZ4 streaming.

Tuesday, July 3, 2012

Log file compression

Although I'm currently on holiday, with limited access to the world wide web, I would like to link here an interesting contribution from Vitor Oliveira, an LZ4 user, who seems to have found some pretty impressive results for log file compression by using a simple technique : multi-pass compression.

More specifically, his method, which involves several passes of the LZ4 algorithm, seems to produce compressed files which are several times smaller than zlib's, while requiring only a fraction of the computation cost.

Some preliminary results :
zlib (one pass) : 54 MB, 265 ms
LZ4 (one pass) : 56 MB, 6 ms
LZ4 (multi-pass) : 4 MB, 16 ms

Since log file compression is a relatively common scenario, I figured this was interesting to share :
https://groups.google.com/d/msg/lz4c/DcN5SgFywwk/AVMOPri0O3gJ

Wednesday, May 30, 2012

Compressed data transmission

If there is a situation where data is inherently short-lived, it is communication. Data starts its life on the sender side, and ends it on the receiving side, a few milliseconds later.

Or does it ? Sometimes, data comes from a file in local storage, or can be stored at the receiving side. In such a case, data is merely "traveling", but is not "short-lived".

Does it make any difference ? In fact, yes, it does.

When it comes to sending a file's content, this data can be "prepared" in advance, which means it can be compressed ahead of sending it. Very strong (asymmetric) algorithms can be used for compression, as long as decoding remains "fast enough" to cope with data speed. This leads to a major bandwidth reduction, and therefore improves cost and perceived transmission speed.

When it comes to sending "short-lived" data, it means this data did not exist before being produced, and the sole purpose of this data existence is to be sent, and (generally) consumed on receiving end. There is no way to "prepare" such data in advance, it must be compressed "on the fly", which means "fast".

But there is another terrible side effect : compression performance primarily comes from its capacity to "learn patterns", and re-apply them in an optimal way. Which means, for compression to be effective, a minimum of "historic data" must have already been processed for the next data to be properly compressed. With a file content, the history can be the entire file itself, which could mean a lot of megabytes, and therefore excellent perspectives for compression.
The situation is much more severe when data is generated and compressed "on the fly" : maybe the message to be sent is only a few tens of bytes long. How to compress such a thing ?

Let's study this use case.
A first "naive" implementation would simply create a message, make a packet out of it, compress it and then send it.
This implementation is unlikely to bring tangible benefits, since IP packets are typically small, trying to match MTU in order to avoid fragmentation side-effects.

A second, more compression-friendly, implementation, could try to amass enough information before starting to compress it, and then send the compressed data using as many packets as necessary.
This will certainly bring better compression performance, but introduces another problem, latency. Waiting for "enough data to be compressed" can lead to unacceptable delays.
For example, in real-time games, player commands must be sent basically a.s.a.p.
As another use case, some systems may generate little data (a temperature probe for example), separated by long cycle duration.
Therefore, waiting for "enough data" is not a valid strategy in such circumstances.

A third, more complex, strategy, would use all past transmitted data as a kind of "dictionary", to help compress the next packet to come.
This basically requires the dictionary to remain "synchronized" at both ends, sender and receiver. This is achievable in an "ideal" environment (no loss, no error), which is quite common in fact when using TCP transmission.

So, to sum up, we have some freshly generated data to send, of any size but typically small (a few hundred bytes), and we want to use all previously transmitted data as a dictionary to improve compression, which requires some kind of "memory" at both sender and receiver ends.
This looks possible.
In fact, this is a direct usage of the "variable block sizes" concept which I expressly ruled out as "not useful" in an earlier blog note :). Now seems a good time to consider it again...

Such implementation would however require some new functions, able to re-use and save some "history data", instead of starting from freshly clean tables. This will require quite some work to achieve.

As a side effect of such methodology, it also means that such compressed packets are not compatible with stateless protocols : since they depend on previously sent data, they are inherently stateful. But so are TCP sessions anyway...
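
The third strategy can be tried out today with an existing stateful codec. The sketch below uses deflate through Python's zlib module as a stand-in (LZ4 had no such functions at the time of writing, so this is only an illustration of the principle), and the message contents are made up. One compressor and one decompressor are kept alive for the whole session, and each message is flushed immediately so that no latency is added.

```python
import zlib

# Made-up small messages, similar to each other, as in a game or a probe feed.
messages = [
    b'{"player": 7, "x": %d, "y": %d, "action": "move"}' % (i, 2 * i)
    for i in range(20)
]

# Stateless : each message is compressed on its own, with no history.
stateless = [zlib.compress(m) for m in messages]

# Stateful : all previously sent data serves as a dictionary for the next packet.
sender = zlib.compressobj()
receiver = zlib.decompressobj()
stateful, decoded = [], []
for m in messages:
    # Z_SYNC_FLUSH emits everything needed to decode this message right away.
    packet = sender.compress(m) + sender.flush(zlib.Z_SYNC_FLUSH)
    stateful.append(packet)
    decoded.append(receiver.decompress(packet))

total_stateless = sum(len(p) for p in stateless)
total_stateful = sum(len(p) for p in stateful)
print(total_stateless, total_stateful)
```

The receiver must process packets in the exact order they were produced, with none missing, which is why this approach relies on a reliable transport such as TCP.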

Monday, May 28, 2012

Members properties

After spending some time on expected properties at streaming level, let's now get to the core of the objective, regarding the compressed data parameters.

As stated previously, a compressed stream consists of several members, the most important ones being compressed data sets. Each member starts with a header, in order to identify its content. And each header starts with a magic number, a kind of 'ID tag'.

We'll focus here on "LZ4 compressed data set". The stream design above allows adding any future compression algorithm at a later stage.

And let's take as an example the old legacy framing format, defined in lz4demo.

1) There is a magic number, which is 0x184C2102, in little endian format.
2) There are no explicit parameters. In fact, all parameters are implicit.
They are :
- The compressed data set is cut into blocks of 8MB
- Each block starts with a field giving its size (therefore, the compressed size)
- Blocks are independent
- The original data size is not stored. It will be known on decoding completion
- There is no checksum
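
As an illustration, here is a minimal sketch of a reader which walks through the blocks of such a stream. It assumes the block size field is a 4-byte little endian integer (the list above only says "a field"), and the function name is hypothetical. It does not decompress anything : it merely hands over each compressed block, and the sample payloads are placeholder bytes, not real LZ4 data.

```python
import struct

LEGACY_MAGIC = 0x184C2102

def iter_legacy_blocks(stream):
    """Yield the compressed payload of each block of a legacy stream (bytes)."""
    (magic,) = struct.unpack_from("<I", stream, 0)
    if magic != LEGACY_MAGIC:
        raise ValueError("not a legacy lz4demo stream")
    pos = 4
    while pos < len(stream):
        # Assumption : the size field is 4 bytes, little endian.
        (compressed_size,) = struct.unpack_from("<I", stream, pos)
        pos += 4
        if pos + compressed_size > len(stream):
            raise ValueError("truncated block")
        yield stream[pos:pos + compressed_size]
        pos += compressed_size

# Build a fake stream with two blocks holding placeholder payloads.
fake = (struct.pack("<I", LEGACY_MAGIC)
        + struct.pack("<I", 3) + b"abc"
        + struct.pack("<I", 5) + b"defgh")
found = list(iter_legacy_blocks(fake))
print(found)
```

Since there is no stored original size and no end marker, the loop simply stops when the input is exhausted.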

Well, even with such limitations, the format nonetheless works perfectly fine. It's just a little too restricted to become a "generic format", and therefore, the objective of the specification is to provide more room for parameter selection.

We have already established in previous blog posts that allowing a checksum for error detection is an important selectable feature.
Another important one is the ability to select the block size, since it directly controls the amount of memory needed for buffers on the decoding side.

Let's now study and establish potential needs for a few other properties :

Source data size
The original size of source data is not an absolute necessity : it's always possible to decode without it, as long as buffer sizes are properly described.

But it is nonetheless useful. For example, thanks to this information, the number of blocks within the current member can be calculated beforehand. Moreover the amount of data to decode from the last block is known.
Or, if there is a single block, the exact amount of memory can be allocated, instead of the block maximum size.
It is also useful to display the processing position (yep, we decoded 80MB, but does that represent 10% or 90% of the stream to decode ?)
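
The arithmetic is straightforward. A small sketch, with a hypothetical helper name, assuming fixed-size blocks where only the last one may be shorter :

```python
def block_layout(source_size, block_size):
    """Return (number of blocks, size of the last block) for a data set."""
    if source_size == 0:
        return 0, 0
    full_blocks, remainder = divmod(source_size, block_size)
    if remainder == 0:
        return full_blocks, block_size      # the last block is a full one
    return full_blocks + 1, remainder       # the last block is partial

MB = 1 << 20
print(block_layout(20 * MB, 8 * MB))   # (3, 4194304)
print(block_layout(16 * MB, 8 * MB))   # (2, 8388608)

# Processing position : 80MB decoded out of a known 100MB source.
print(100 * (80 * MB) // (100 * MB), "%")
```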

However, there are also circumstances in which this data is not known. For example, if the input was piped to the compressing process, then the size will be known only on hitting its end. This might be too late to "retrofit" the output.
Another situation is when several compressed data sets are appended into a single stream : then the "source data size" field only applies to the current data set, but the total size is not known.

Therefore, since it is useful but not compulsory, this information shall be present, but as an option only.

Uncompressed blocks
A simple but important feature, since it avoids the bandwidth overhead and CPU consumption of the compression format when it's useless.
This could be done very cheaply, by accepting that, if the size of the compressed block is the same as the defined one, then it's necessarily uncompressed.

This suggestion looks simple enough for most blocks, except for the last one, whose size is unknown (but capped).
Therefore, it would be necessary to know the size of the last block to compare it to the compressed one, and therefore determine if the block is compressed or not.

Another idea would be : let's give up this complexity, the last block is always compressed, even if compression is either useless or detrimental.
Actually, it's not a good idea to "not handle the last block", since there is a disastrous corner case : suppose that the compressed size of the last block is exactly the size of an uncompressed full block : then the decoder will assume it's uncompressed, leading to data corruption.

This corner case can be avoided by enforcing a simple rule : a compressed block is necessarily smaller than its original size. Therefore, as the last block has a size <= block size, its compressed size is necessarily < block size. Hence, if the size of this last block is the maximum size, then we are in the specific but valid corner case where the last block size is exactly the maximum size of a block, and is not compressible.

OK, enough of corner cases, let's now be in the normal situation where the last block size is a fraction of the maximum block size. How could we know it is uncompressed ?

This problem could be mitigated by inserting a piece of information telling that we are dealing with the last block. For example, knowing the original size of the source data is enough for this need.

But it's not always available. As said previously, this is just an option, since in some streaming mode, this information is unknown. Therefore we need another indicator.

It could be something as simple as a bit, which simply tells that there is another block to follow, and as a consequence, the current block is full sized. As a bonus, this mechanism also protects against "silent block truncation" (when the compressed stream is cut exactly at the border between 2 blocks).
On reaching the last block, we need another piece of information, either the uncompressed size of the block, or whether the block is compressed. The latter seems more compact.

Zero-filled blocks
This idea was actually proposed by Charles Bloom : it's not rare, for a section of input data, to be filled with zeros.
The idea would be to "mark" such blocks with a specific prefix, such as "0".
For such situation to have reasonable chances to happen, the block size must be small enough. For example, this will probably almost never happen with lz4demo block size (8MB), while this is going to be much more frequent with very small blocks, such as 4KB ones.

Error correction
While error detection has been much talked about, nothing has been said up to now about error correction.
That's both because this feature is much more complex to provide and of questionable usefulness.

Error correction is mostly useful in situations when there is no way to "resend" erroneous data. This applies to real-time codecs (such as voice or video) and stored archives.
The difficulty in both cases is that erroneous data tends to be "bursty". For example, when a storage sector fails, we don't lose just a few bytes, but an entire sector size, typically 4KB. Same for transmission, where the most common error is a missing packet.
Dealing with large bursts of errors requires some specific techniques, which unfortunately cost much processing power and memory. As a consequence, the CPU and memory budget for error correction is way beyond that of LZ4, which makes the association a questionable choice.

Therefore, it seems this feature is not expected to be "generic enough" to reserve a place into the generic framing format specification. Obviously, forking is always possible, and even recommended, to support specific features.

Allow multi-threaded compression and decompression
Multi-threaded compression is easily achievable thanks to the division of input data into "blocks".
Multi-threaded decoding is also possible if those blocks are "independent".
Both modes shall be possible, and selectable.

Variable block sizes
This one is tricky : up to now, we have been talking about "fixed size" blocks only, with only the last block of a compressed data set having an unknown (but capped) size.
The idea here would be to authorize blocks of arbitrary size, instead of fixed ones.

The benefits are two-fold :

Separate data on "natural boundaries", in order to improve compression ratio and speed
Allow data insertion of any size

The first point is simple to argue with : such benefit only occurs with very-high ratio (and slow) compression algorithms, such as CM, which "learn" the data behavior through statistics. There is no tangible benefit in trying to do the same for LZ4.

The second benefit is more interesting, since it authorizes some flexibility in archive management.
Actually, this is mitigated by the possibility to concatenate compressed data sets (or "members") together in a single stream or file.
Inserting data could therefore be realized by cutting the initial member into 2 parts, inserting the new member, and concatenating the 3 members together.
As a consequence, it seems the format already supports such scenario, without needing variable block sizes.

Partial extraction and Quick Jump Table
Another potential idea is that, within a member, one could need to only extract a specific portion.
It's always possible to decode the full data set and get to the required location, but sometimes this might be overkill. For example, one may need a few MB of data which happen to be a few GB away from the starting point.

However, the idea to decode just the necessary part introduces a few restrictions :

First, the input media should be seekable. It makes little sense to partially decode a piped stream, since the decoding process is likely faster than the pipe itself.
Second, the compressed data shall be cut into independent blocks. Otherwise, it would be necessary to decode, and therefore read, all previous blocks.
Third, to avoid decoding "too much data", the blocks shall be small enough, with a corresponding impact on compression ratio (the smaller the block, the lower the compression ratio).
Fourth, since the i/o process is likely slower than LZ4 decoding, there is a benefit only if it is possible to quick-jump to the right location immediately.
This can be achieved thanks to a table at the beginning of the compressed file. Such a table can only be filled after compression, and therefore is incompatible with non-seekable output.
Fifth, such "table" mechanism at member level would be useless in members appending scenarios.

These are quite many restrictions, for the benefit of a hardly-requested feature.
So probably this capability shall be left to a dedicated framing format.
Moreover, should the input stream be seekable, it's still possible to "hop" over blocks without reading/decoding them. This is still slower than a direct jump, but still a sensible potential speed improvement.

Error detection algorithm
As a quick follow up of selecting-checksum-algorithm, one could note that I had not specified a preferred checksum algorithm, only a preferred checksum size (32 bits).
Although at this stage I'm somewhat inclined to use xxhash-strong, due to its speed and very good distribution property, there is still a chance that the algorithm might be found unsuitable at a later stage. Therefore, some provision should be left to allow another algorithm to take over later on if need be.

Pushing the idea a bit further, one could think "let the users select their own checksum algorithm". While the idea may sound nice, it goes against the principle of interoperability, which is exactly what this framing format tries to achieve. Therefore, only clearly defined checksum algorithms shall be allowed.

I believe this post went through most foreseeable requirements for the LZ4 framing format.
So now seems a reasonable time to start a header specification.

Friday, May 25, 2012

Useful compressed streaming properties

The previous articles were primarily targeted at error detection and memory buffer management, which are, arguably, very important features. We'll continue digging into the properties which seem necessary to build a suitably universal compressed streaming format.

Compressed data appending
There are some use cases in which newly created compressed data is simply appended to an existing file or stream. Some time later, the file will be decoded and/or filtered, searching for data of potential interest.
The decoder must not choke with such input. It shall simply continue the decoding, dealing with each compressed data one after another.

This feature is relatively simple to support : it is merely necessary to not assume that, after the current compressed stream, an EOF marker will necessarily happen.

The change is simple, but it also means that a single stream may host several compressed data sets. This situation is reminiscent of the well specified and documented gzip RFC 1952 :
"A gzip file consists of a series of "members" (compressed data sets)."
This is a good fit for this situation, so we'll use the same naming convention.

Authorized members
A typical compressed stream will have a single member.
The member will, necessarily, start with a header, which allows its screening as a "valid" or "invalid" input.
A simple way to achieve this selection is to start with a "magic number", which then is compared to a list. If it's not in the list, input is rejected as "invalid".
The "magic number" is enough to tell what kind of decoder is necessary, and which header parameters follow.
To ensure proper screening from invalid (noisy) input, the magic number shall be large enough for most of its values to be considered "invalid", thus reducing the perspective of "false positive" to negligible levels.

There is however a specific category that could be defined before-hand : skippable members.

Skippable members do not contain compressed data. Instead, they may, for example, contain user comments, a description of the compressed content (for example, file properties), an electronic signature, or even a program suitable for a certain environment. Well, whatever.
The point is that in some circumstances, these members are not strictly necessary, and may be skipped entirely to retrieve just the compressed content.

Skipping such member requires, obviously, to know its size.
Therefore, the specification shall allow :

A range of authorized "magic number values", known as skippable members
A mandatory "member size" field
Beyond that, any data in the member is "user-specified"

Any magic number which falls outside of the list of authorized members is considered "invalid", and stops the decoding process, flagging the input as "corrupted".
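
The screening logic described above can be sketched as follows. All numeric values are assumptions made for illustration, since this post does not fix any of them : the skippable range, the 4-byte little endian "member size" field, and the list of known compressed members are hypothetical, as is the function name.

```python
import struct

# Hypothetical values, for illustration only.
COMPRESSED_MAGICS = {0x184C2102}
SKIPPABLE_FIRST, SKIPPABLE_LAST = 0x184D2A50, 0x184D2A5F

def classify_member(stream, pos):
    """Return (kind, position of next member) for the member starting at pos.

    For a compressed member the next position is None : finding it is the
    job of the decoder selected by the magic number.
    """
    (magic,) = struct.unpack_from("<I", stream, pos)
    if SKIPPABLE_FIRST <= magic <= SKIPPABLE_LAST:
        # Mandatory "member size" field : everything after it is user data.
        (size,) = struct.unpack_from("<I", stream, pos + 4)
        return "skippable", pos + 8 + size
    if magic in COMPRESSED_MAGICS:
        return "compressed", None
    raise ValueError("corrupted input : unknown magic number 0x%08X" % magic)

stream = (struct.pack("<II", SKIPPABLE_FIRST, 5) + b"hello"   # user comment
          + struct.pack("<I", 0x184C2102))                    # compressed member
first = classify_member(stream, 0)
second = classify_member(stream, first[1])
print(first, second)
```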

Clear endian convention
Some CPUs read data using the Little Endian convention, others use the Big Endian convention.
Other conventions exist too (such as PDP endian), but it's fair to say they are rare.

Anyway, the point is that, to ensure all these platforms can exchange data between themselves, a single convention shall be selected and enforced.

In my view, Little Endian is the most common convention these days, with x86/x64 on one side and ARM on the other side. Together, they almost "own" the CPU market, TV game consoles being the main noticeable market outside of their reach.
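
As a small illustration of what "enforcing a convention" means in practice, here is how a 32-bit value (the lz4demo magic number is used as an example) is laid out under each convention. A reader which always decodes with the agreed convention gets the same value whatever the CPU it runs on.

```python
import struct

magic = 0x184C2102
little = struct.pack("<I", magic)   # Little Endian : least significant byte first
big = struct.pack(">I", magic)      # Big Endian : most significant byte first
print(little.hex())   # 02214c18
print(big.hex())      # 184c2102

# Decoding explicitly as Little Endian is independent of the host CPU.
decoded = struct.unpack("<I", little)[0]
print(decoded == magic)
```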

Compatibility with non-seekable streams
Being compatible with "streaming mode" explicitly excludes dependency on "seekable" property.

This use case targets pipes, in either input or output mode, or both.
Piped input data is provided "on the fly" by a generating process, such as tar for example. There is no way to "know its size". It will be discovered on hitting the end of the stream.
Piped output data is immediately consumed by a user process. There is no way to "come back" to an earlier position and correct or add a piece of information there : every new piece of information must necessarily be appended.

When both input and output are piped, it creates a set of restrictions.

Fortunately, it is not so difficult to overcome.
First, a stream doesn't work "byte by byte". It is in fact a succession of memory buffers. One of them is the compression buffer.
Knowing the size of input source data is useful in one case : when it's smaller than the default block size. This is fine, since in this case, all input can be "loaded" into the compression buffer, and then compressed. Since we then write the result to the output buffer, it's possible to write there the size of input. During decoding, this size will allow allocation of just the necessary amount of memory, instead of the full block-size.
This is a useful but small benefit : we could have allocated the buffer at its maximum "block" size instead.

When input is larger than a single block (=buffer size), we have a problem, and we can't "write" the size of the input, since we don't know it yet.
Well, it's not that much of a problem. LZ4 doesn't "need" this size to decode. And the buffer is already being allocated to its maximum block size anyway.

In other words, knowing the uncompressed source size is not mandatory. We can live without.

Enforce data alignment
LZ4 does not require any data alignment, so it does not look necessary.

The next article will look more closely at "members" properties.

Tuesday, May 15, 2012

Dealing with blocks side-effects

Continuing on the previous post analysis of the lz4demo's current framing format, another side-effect created by this simple format is "latency".

Since the fixed block size is 8MB, the codec must wait for the first block to be completely filled before starting any decoding or compressing operation. It effectively defers processing by the time needed to read 8MB, which is several milliseconds even on fast I/O.

This issue may not seem large, especially if underlying I/O is fast. Nonetheless, not all I/O are fast, and even in such cases, an 8MB "starting time" is bound to be measurably worse than a 256KB one for example.
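
To get an order of magnitude, the time needed to fill the first block is simply the block size divided by the I/O rate. The rates below are arbitrary assumptions, picked only to cover slow and fast sources.

```python
def fill_time_ms(block_size_bytes, io_rate_bytes_per_s):
    """Time, in milliseconds, needed to receive one full block."""
    return 1000.0 * block_size_bytes / io_rate_bytes_per_s

KB, MB = 1 << 10, 1 << 20
for rate_mb in (10, 100, 1000):            # assumed I/O rates, in MB/s
    rate = rate_mb * MB
    print("%4d MB/s : 8MB block = %7.2f ms, 256KB block = %6.2f ms"
          % (rate_mb, fill_time_ms(8 * MB, rate), fill_time_ms(256 * KB, rate)))
```

Whatever the rate, the 8MB block takes 32 times longer to fill than the 256KB one, since the delay is proportional to the block size.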

As a consequence, a framing format with a smaller block size would offer better and smoother processing throughput.

Which leads us to a last and important issue : independent blocks.
While this strategy is good for simplicity and multi-threading, it's bad for compression : it translates into a worsened compression ratio on the first 64KB of each block.

With block sizes of 8MB, this effect is negligible (affecting compression ratio by less than 0.1%). However, the smaller the block size, the worse the side-effect. With small block sizes, this effect can no longer be neglected.

Therefore, should the blocks remain independent ?

Not necessarily. Making each block depend on the previous one nullifies the problem of worsened compression ratio. But it also makes it impossible to decode a compressed block independently, with negative consequences on multi-threading and partial decoding capabilities.
Is that really an issue ? Well, it really depends on the use case.

In many circumstances, such as simple file compression or stream compression, it does not matter, since data is encoded and decoded sequentially anyway. Throwing away the capability to multi-thread decompression seems bad, but in fact, most of the time, I/O is unable to cope with LZ4 decoding speed (around 1GB/s). So a single decoding thread is enough to handle almost any I/O load.
Since there is little need for partial decoding, nor for multithreaded decoding, the compression ratio gain looks more useful.

There is just a little remaining problem :
While the decoding function will need little adaptation to handle this new use case, most of the complexity being located in the buffer manager, the compression function on the other hand has to be adapted.

As long as blocks were independent, compression could start with a pristine clean reference table.
But with sequentially dependent blocks, the initialization becomes more complex : the previous 64K needs to be copied in front of the next block, and then loaded/hashed into the reference table, before starting compression. It obviously costs CPU and time.
A variant is to just "translate" the references already loaded into the table as a result of compressing the previous block, but it's limited to "single thread" scenario.

OK, but now that we can reference data from previous blocks, how far should we go ? The natural maximum distance is the "copy window size". This size is, by default, 64KB for LZ4. But it could happen that the receiving end of the compressed stream does not have enough memory to store that much data. In such a case, the compressor must be careful not to use references beyond the memory capacity of the receiver. In other words, it must deliberately discard long-distance copy operations.

Should such a use case be part of the generic framing format or not ?
My answer would be : it's okay, as long as an easy solution can be found.

How could that work ? Once again, let's focus on the decoder side.
I'll imagine a controller with only 4K memory available as buffer.
A simple way to handle such case is by dividing this space into 2 parts : 2K for the "previous block", and 2K for the "block to decode". So we end up with :
- Block size = 2K = memory limit / 2
- Maximum copy reference = lower limit of previous block

Obviously, there are other possibilities, such as cutting data into even smaller parts (for example 1K blocks) and having a back reference of up to 3 previous blocks. But as a first approximation, it seems these variants will provide almost equivalent results while being more complex.

This situation can be summarized with a simple rule : never reference data beyond one block distance.
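
One possible reading of this rule, sketched below with hypothetical helper names : the decoder keeps the previous block right in front of the block being decoded, so a copy operation may reach back as far as the start of the previous block, and never further than the 64KB copy window.

```python
LZ4_WINDOW = 64 * 1024   # default copy window size

def block_size_for(memory_limit):
    """Half the buffer for the previous block, half for the block to decode."""
    return memory_limit // 2

def max_back_reference(pos_in_block, block_size, window=LZ4_WINDOW):
    """Largest allowed copy distance at a given position of the current block.

    Assumption : the limit is the start of the previous block, which sits
    block_size bytes before the start of the current one.
    """
    if not 0 <= pos_in_block < block_size:
        raise ValueError("position outside of the current block")
    return min(pos_in_block + block_size, window)

block_size = block_size_for(4 * 1024)           # controller with a 4K buffer
print(block_size)                                # 2048
print(max_back_reference(0, block_size))         # 2048
print(max_back_reference(100, block_size))       # 2148
```

An encoder applying this limit would discard any candidate match whose distance exceeds the returned value.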

With only this simple rule in place, it seems the default LZ4 framing format could be made compatible even with environments with severely limited RAM, provided the encoder selects a suitable block size.