Comments on RealTime Data Compression: "Finalizing a compression format"

Roger Pack (2016-07-07):

I have run into content errors before, but it's typically on very large files downloaded by Chrome... hmm... I almost hate to waste the space if it isn't required. Maybe make the first 8 bits a "zero or one" telling you whether the rest is a checksum or not?
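For what it's worth, the flag-byte idea above could look something like the minimal C sketch below. The layout (one flag byte, then an optional 4-byte little-endian checksum) is purely hypothetical, chosen only to illustrate the suggestion; it is not the actual zstd frame format.

```c
/* Hypothetical epilogue layout: [flag: 1 byte][checksum: 4 bytes, optional].
 * For illustration only; not the actual zstd frame format. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      has_checksum;   /* taken from the flag byte                */
    uint32_t checksum;       /* valid only when has_checksum is nonzero */
} FrameEpilogue;

/* Parse the epilogue at p; returns the number of bytes consumed (1 or 5). */
static size_t parse_epilogue(const uint8_t *p, FrameEpilogue *out)
{
    out->has_checksum = (p[0] != 0);
    out->checksum = 0;
    if (!out->has_checksum) return 1;
    out->checksum = (uint32_t)p[1]
                  | ((uint32_t)p[2] << 8)
                  | ((uint32_t)p[3] << 16)
                  | ((uint32_t)p[4] << 24);   /* little-endian read */
    return 5;
}
```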
Cyan (2016-07-06):

Good points Daniel.
An issue with CRC-32C is that its excellent performance is tied to Intel hardware. Outside of it, the picture is less clear. What about other SoCs, such as ARM, MIPS, PowerPC, and so on?

Considering the objective of portability, it's mandatory to provide a fast software fallback to cover these systems. Alas, last time I checked, there was no fast software implementation competitive with xxHash. That could have changed, though.

CRCs have the advantage of guaranteeing detection of errors below a certain threshold, but only if the full 32-bit CRC result is stored. There is no such guarantee for a subset of it (which is a pity here, as the plan is to store only 22 bits).
Another drawback is that above the error threshold, the collision rate is in fact *higher*. Nothing dramatic nor unexpected.

So a question is: in how many cases are the errors below the threshold? That's a hard one.
My feeling is that nowadays hardware is already full of error correction codes, be it on the storage side or the transmission side. That means data is either clean or completely scrambled. This goes against the need for a detector specialized in "small" errors.

I do see some software scenarios where it could happen, though. For example, a write operation which reaches beyond its assigned buffer and pollutes the next one could result in a limited error of just a few bytes. But it's just one use case, and likely not the most important one.
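As an illustration of the 22-bit point, here is what truncating an xxHash result could look like, using the public XXH64() function from the xxHash library. The choice of XXH64 and of keeping only the low 22 bits simply mirrors the plan mentioned in this comment; it is not a statement about the final frame definition.

```c
/* Sketch of a 22-bit content checksum derived from xxHash
 * (https://github.com/Cyan4973/xxHash). Illustrative only. */
#include <stddef.h>
#include <stdint.h>
#include "xxhash.h"

static uint32_t content_checksum_22(const void *data, size_t size)
{
    uint64_t h = XXH64(data, size, 0);   /* 64-bit hash, seed 0   */
    return (uint32_t)(h & 0x3FFFFF);     /* keep the low 22 bits  */
}
```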
Daniel L (2016-07-05):

On the content checksum:

Have you considered using CRC-32C, the Castagnoli polynomial, used by iSCSI et al. and implemented as a hardware instruction on modern x86? It's faster, and also better at catching random errors. Using the hardware instruction, I believe the speed is about 20 GB/s.

Hashes catch any error with a frequency of (2^n - 1) / 2^n. CRC-32C would also catch:

1. All burst errors up to 32 bits.
2. All one-bit errors.
3. All multiple-bit errors up to the Hamming distance. The message size for this application is 1 Mbit. I haven't seen a Hamming distance calculation for CRC-32C for messages that large. For 128 kbit, the Hamming distance is 4 [ref 1]. I'd guess for 1 Mbit it'd be 2 (catch all 2-bit errors), but that's just a guess.

If your data parser works on 3-byte words, I suppose you can serialize the CRC-32C in two words.

Since with modern hardware it's so cheap to calculate, and the common case of using the codec is with large blocks, I think the default should be to enable content checksumming, and the library user only turns it off for the rarer cases of working with small blocks.

[ref 1] http://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koopman.pdf

Cyan (2016-06-02):

I believe I have one proposition that can match this recommendation.

Currently, the WindowLog value is a 4-bit field (0-15), ranging from 4 KB to 128 MB in x2 increments.

The proposition would be to extend that field to 8 bits, making a single byte. Of this total, 5 bits give the power of 2, while 3 bits provide the "fractional" part: exponent and mantissa, if you wish. So it would be possible to specify 10 KB as 8 KB + 2/8, for example. It's more precise, though not completely precise (it doesn't allow every value).

The 5 bits of exponent allow very large values, almost ridiculously high. Keeping a baseline of 4 KB, the highest value (31) translates into 2^(12+31) == ~8 TB. So it might be preferable to use this extra room to lower the minimum value, for example to 1 KB.

As a complement, it would be possible to say "use the frame content size instead", allowing precise selection of any buffer size that can be expressed within 8 bytes (~16 EB). In that case, the "window" byte would simply not be necessary (I see this feature as most interesting when the content size is extremely small, such as ~100 bytes).

That also means the decoder must defend itself against ridiculous memory requirements, so any decoder must have an implementation-specific limit and refuse to decode any frame with a higher requirement. Currently, that limit is 32 MB for the 32-bit and 128 MB for the 64-bit version of zstd. But since it is implementation-dependent, it could be extended in a future version without jeopardizing the specification.
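A minimal sketch of the exponent/mantissa byte described above. The exact bit placement (exponent in the high 5 bits) and the 1 KB baseline are assumptions made for illustration, since the comment leaves both points open.

```c
/* Decode a proposed "window byte": 5 bits of exponent, 3 bits of mantissa.
 * Bit layout and 1 KB baseline are assumptions for illustration only. */
#include <stdint.h>

#define MIN_WINDOW_LOG 10   /* baseline of 1 KB, as suggested above */

static uint64_t window_size_from_byte(uint8_t window_byte)
{
    uint32_t exponent = window_byte >> 3;          /* high 5 bits */
    uint32_t mantissa = window_byte & 7;           /* low 3 bits  */
    uint64_t base = 1ULL << (MIN_WINDOW_LOG + exponent);
    return base + (base / 8) * mantissa;           /* base plus n/8ths of base */
}
/* Example: exponent 3 (8 KB) with mantissa 2 gives 8 KB + 2/8 * 8 KB = 10 KB. */
```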
Cyan (2016-05-26):

That's indeed the plan: by default, a dictionary ID will be a 4-byte "random" number, but that "random" number will in fact be a hash of the dictionary content.

That looks fine for 4 bytes. The problem is for people who consider that 4 bytes is too much for an ID. It's indeed very small, but when the goal is to compress a lot of very small packets, ending up in the ~50 byte range, 4 bytes is almost 10%...

2 bytes can still be fine with the "random" methodology for a handful of dictionaries, but not much more. 1 byte is not really compatible with a random ID; the accidental collision probability is too large.

So, as an option, it will also be possible to manually specify which ID a dictionary should have. This will be an advanced option, for people who know what they are doing.

TheSFReader (2016-05-26):

Dictionary ID: couldn't a checksum computed on the dictionary being used serve as the dictionary ID? As far as I understand, this would give it all the necessary properties, and probably get rid of any collision possibilities without any burdensome ID housekeeping.

Cyan (2016-05-18):

Thanks for your insightful suggestions Bulat, I'll certainly make use of them.

Zed (2016-05-17):

About dictionary size: consider a format lifetime of 10-20 years and ensure that it will still work. Since a larger dictionary breaks compatibility anyway, it may be enough to have an "extended header" feature.

Your thoughts somewhat resemble my own work on the FreeArc 2.0 format, so I can give a few tips:

The format starts with bit fields specifying the presence of existing features. It may be further extended by extra bytes carrying fields for new features that aren't known in the 1.0 timeframe. The last bit of each extension byte means "extend me to the next byte". So you have N bytes with the mandatory fields existing at 1.0, plus a standard extension mechanism for newer versions: some bit, usually the last, of format X means "extend me with format X+1". If the decoder doesn't support format X+1, it should fail when it sees this bit enabled.

Optional fields should be treated in another way. One possibility is to attach ID+size to each field, e.g. using 7+1 encoding: 0x12 0x81 0x23 means a field with ID 0x12 and size 0x01 * 128 + 0x23. This way you can extend the header with arbitrary fields.

Another way is to use the same scheme as above, only with a separate byte sequence, so the decoder just stops reading properties when it doesn't know the X+1 format.

This way you can drop support for large dictionaries in 1.0, but I suggest reserving 3-4 bits for a "mantissa" part of the dictionary size. It may greatly improve memory usage in pretty common scenarios.
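A small parsing sketch of the "7+1" optional-field encoding from the last comment: a 1-byte field ID followed by a variable-length size, where a set high bit means another size byte follows. The big-endian accumulation order is my reading of the 0x12 0x81 0x23 example, not a fixed specification.

```c
/* Parse one optional header field: [ID: 1 byte][size: 7+1 encoded bytes].
 * Sketch of the scheme suggested above; byte order is an assumption. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  id;        /* field identifier                         */
    uint32_t size;      /* decoded field size                       */
    size_t   consumed;  /* bytes read, or 0 when the input is short */
} OptField;

static OptField read_optional_field(const uint8_t *p, size_t avail)
{
    OptField f = { 0, 0, 0 };
    if (avail < 2) return f;                /* need at least ID + 1 size byte */
    f.id = p[0];
    size_t i = 1;
    uint32_t size = 0;
    for (;;) {
        size = (size << 7) | (uint32_t)(p[i] & 0x7F);
        if ((p[i] & 0x80) == 0) break;      /* high bit clear: last size byte */
        if (++i >= avail) return f;         /* truncated input: consumed == 0 */
    }
    f.size = size;
    f.consumed = i + 1;                     /* ID byte + size bytes */
    return f;
}
```

Feeding it the bytes 0x12 0x81 0x23 yields ID 0x12, size 163 (0x01 * 128 + 0x23), and 3 bytes consumed, matching the worked example in the comment.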