Comments on RealTime Data Compression: When to use Dictionary Compression

Cyan (2020-07-28 01:17):

> I imagine each compressed message (block) would have to have a dictionary uuid stamped into it

That's exactly how it works today.

> the current implementation assumes there is only one dictionary loaded/used for the compression/decompression of a given message.

Indeed, for a given "frame", only one dictionary can be used.

Anonymous (2020-07-28 00:06):

Took a look at the command line interface to the dictionary feature, and I infer the following:

The dictionary seems to contain some "probable" string literals, so it is for the modeling phase of compression. As it is now envisioned, it looks oriented towards file (stateful) compression. I imagine an implementation would need some dictionary repository, and at least a file naming convention, to make the common dictionaries available. An implementation relying on custom-generated dictionaries would have to transmit these along with the compressed message(s) if the recipient is to have any hope of decompressing.

Managing the dictionaries for stateless compression would be a bit of a challenge: I imagine each compressed message (block) would have to have a dictionary uuid stamped into it, so the decompress code can select the correct (probably cached) dictionary at decompress time. Making custom dictionaries useful between independent entities would also be a challenge; there would have to be an agreed-upon protocol to distribute them along with the compressed data. (Is there a standard method envisioned for dictionary identification/distribution? I will go read the RFC to see if such a thing is already proposed.)

Also: the shifting of some of the unique substring literals from the message to the dictionary has to be a "best effort". In other words, the dictionary procedure is inherently opportunistic. Unique substring literals would still have to be stored in the compressed message, but some of these substrings could be replaced by dictionary references if they are already present in the dictionary in use. I imagine the current implementation assumes there is only one dictionary loaded/used for the compression/decompression of a given message.

Please correct me if I got this wrong.

Cyan (2020-07-27 23:40):

1) I'm not sure what is meant by separating modeling and encoding.
2) Yes, dictionaries contain anticipated strings.
3-4) The whole mechanism of shipping, sending, and synchronizing dictionaries is implementation dependent. There are a lot of possibilities here, with distinct trade-offs.
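To make the mechanics above concrete, here is a minimal C sketch of the full round trip against libzstd's public dictionary API (zstd.h and zdict.h). The JSON-like sample records, buffer sizes, sample count, and compression level are all invented for illustration, and real dictionary training wants far more input material than this:

    #include <stdio.h>
    #include <string.h>
    #include <zstd.h>   /* ZSTD_compress_usingDict, ZSTD_getDictID_fromFrame */
    #include <zdict.h>  /* ZDICT_trainFromBuffer */

    int main(void)
    {
        /* Training samples: small, similar records concatenated back-to-back,
           with their individual sizes listed separately. Real training wants
           far more material than this. */
        char samples[16 * 1024];
        size_t sampleSizes[300];
        size_t pos = 0;
        const unsigned nbSamples = 300;
        for (unsigned i = 0; i < nbSamples; i++) {
            int n = snprintf(samples + pos, sizeof samples - pos,
                             "{\"user\":%u,\"event\":\"click\",\"page\":\"/home\"}", i);
            sampleSizes[i] = (size_t)n;
            pos += (size_t)n;
        }

        /* Train a dictionary of anticipated strings from the samples. */
        char dict[2048];
        size_t dictSize = ZDICT_trainFromBuffer(dict, sizeof dict,
                                                samples, sampleSizes, nbSamples);
        if (ZDICT_isError(dictSize)) {
            fprintf(stderr, "training failed: %s\n", ZDICT_getErrorName(dictSize));
            return 1;
        }

        /* Compress one small message against the dictionary. */
        const char msg[] = "{\"user\":42,\"event\":\"click\",\"page\":\"/home\"}";
        char frame[256];
        ZSTD_CCtx* cctx = ZSTD_createCCtx();
        size_t csize = ZSTD_compress_usingDict(cctx, frame, sizeof frame,
                                               msg, sizeof msg,
                                               dict, dictSize, 1 /* level */);
        if (ZSTD_isError(csize)) { fprintf(stderr, "compression failed\n"); return 1; }

        /* The frame header records the dictionary ID: this is the "uuid
           stamped into the message" that lets a decoder select the right
           (typically cached) dictionary. */
        printf("frame wants dict ID %u, dictionary has ID %u\n",
               ZSTD_getDictID_fromFrame(frame, csize),
               ZSTD_getDictID_fromDict(dict, dictSize));

        /* Decompression must supply that same, single dictionary. */
        char out[256];
        ZSTD_DCtx* dctx = ZSTD_createDCtx();
        size_t dsize = ZSTD_decompress_usingDict(dctx, out, sizeof out,
                                                 frame, csize, dict, dictSize);
        printf("round-trip: %s\n",
               (dsize == sizeof msg && memcmp(out, msg, dsize) == 0) ? "ok" : "FAILED");

        ZSTD_freeCCtx(cctx);
        ZSTD_freeDCtx(dctx);
        return 0;
    }

Build with something like: cc sketch.c -lzstd. Note that only the dictionary ID travels in the frame header (the Dictionary_ID field of the zstd format, RFC 8878); the dictionary content itself is never attached to the message, and its distribution is left to the application.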
For a more detailed presentation of one possible solution, I recommend the chapter "Managed Compression" in the following white paper: https://engineering.fb.com/core-data/zstandard/
5) Dictionaries make a lot of sense when trying to remove stateful compression, since they replace point-to-point dedicated states with generic states.
6) Dictionaries are especially useful for small blocks.

Anonymous (2020-07-27 23:30):

I would like to understand a few basic things about dictionary-based zstd compression:
1) Are these dictionaries used for modeling or encoding?
2) If they help with modeling, do they contain actual/anticipated unique sub-strings? (Did we move the unique string literals from the compressed message into a database?)
3) Where are these dictionaries stored? Especially custom-generated dictionaries?
4) How are the dictionaries made available for decompression? Example: we compressed a message using a custom dictionary on one system, and are trying to decompress said message on a different system. Where is the dictionary coming from? Is it attached to the message?
5) Do these dictionaries make much sense for stateless compression?
6) Do dictionaries make sense for small-block (e.g. 4 or 8 KB) stateless compression?

Cyan (2018-02-21 15:43):

> Could that be a sign that choosing different compression settings (worse compression ratio but less cpu usage) could further improve performance in this scenario?

All tests were performed at compression level 1, the fastest compression setting available.

Using higher compression levels would improve the compression ratio and, generally as a consequence, slightly improve decompression speed. So all graphs would improve.

In general, if your workload is I/O-read-limited and you have margin on the write side, it's better to increase the compression level, to improve compression ratio _and_ read speed.

Cyan (2018-02-21 15:07):

> Plot: Block Size vs Random Reads / s

Thanks for pointing that out, David.
After verification, the issue is that I over-estimated the average size of individual records (which span the [250-550] byte range). It does not matter when comparing speed at different block sizes, but it does matter when comparing blocks with single-record compression.

After fixing this value, I'm getting an almost flat ending: it's barely faster to read single records than to read a (small) block of multiple records and throw away the useless ones. But it is nonetheless *slightly* faster, which feels more logical.

I'll fix the graphs with the new values.
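The single-record-versus-block trade-off discussed above can be sketched with plain ZSTD_compress calls. The record count, record size, and synthetic contents below are made up, but they show how per-frame overhead and the loss of cross-record redundancy penalize very small inputs (error checks elided for brevity):

    #include <stdio.h>
    #include <string.h>
    #include <zstd.h>

    #define NB_RECORDS 16
    #define RECORD_SIZE 400   /* within the [250-550] byte range mentioned above */

    int main(void)
    {
        /* Synthetic records; real ones would be database rows or log lines. */
        char records[NB_RECORDS][RECORD_SIZE];
        for (int i = 0; i < NB_RECORDS; i++)
            memset(records[i], 'a' + (i % 4), RECORD_SIZE);

        char dst[ZSTD_COMPRESSBOUND(sizeof records)];

        /* One frame per record: each frame pays its own header overhead
           and starts with an empty history window. */
        size_t perRecord = 0;
        for (int i = 0; i < NB_RECORDS; i++)
            perRecord += ZSTD_compress(dst, sizeof dst,
                                       records[i], RECORD_SIZE, 1);

        /* One frame for the whole block: overhead is amortized, but a
           random read must fetch and decompress the whole block. */
        size_t asBlock = ZSTD_compress(dst, sizeof dst,
                                       records, sizeof records, 1);

        printf("per-record: %zu bytes, as one %zu-byte block: %zu bytes\n",
               perRecord, sizeof records, asBlock);
        return 0;
    }

This is the tension the plots explore: small frames read fast but compress poorly, larger blocks compress well but waste read bandwidth on discarded records, and dictionaries aim to give small frames back some of the lost ratio.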
Anonymous (2018-02-21 08:52):

> Plot: Block Size vs Random Reads / s

Do I understand correctly that the dip on the right side of the blue (non-dictionary) line is because I/O is becoming the bottleneck? In other words, the compression ratio gets so bad that reading and decompressing individually compressed records becomes more expensive than reading a compressed block of multiple records, (partially) decompressing it, and throwing away most of the data. For me that's a very remarkable result.

> Plot: Compression Ratio vs Random Reads / s

I feel this would be easier to understand if you had switched the axes. Doing that in my head, the blue line has a 'peak' where both lower and higher compression ratios provide worse results. The orange line does not have such a peak; its optimal point is the lowest compression (individual records). Could that be a sign that choosing different compression settings (worse compression ratio but less CPU usage) could further improve performance in this scenario?
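As for that last question, the level trade-off is easy to probe directly. The payload below is synthetic, so the exact ratios mean nothing, but on text-like data a high level typically produces noticeably smaller output than level 1 at much higher CPU cost:

    #include <stdio.h>
    #include <zstd.h>

    int main(void)
    {
        /* Fill a buffer with repetitive, text-like content (hypothetical). */
        char src[64 * 1024];
        size_t pos = 0;
        while (pos + 64 < sizeof src)
            pos += (size_t)snprintf(src + pos, 64,
                                    "record %zu: status=OK latency=%zums; ",
                                    pos, pos % 97);

        char dst[ZSTD_COMPRESSBOUND(sizeof src)];

        /* Same input, fastest level vs. a high level. */
        size_t fast = ZSTD_compress(dst, sizeof dst, src, pos, 1);
        size_t high = ZSTD_compress(dst, sizeof dst, src, pos, 19);

        printf("level 1 : %zu bytes (ratio %.2f)\n", fast, (double)pos / fast);
        printf("level 19: %zu bytes (ratio %.2f)\n", high, (double)pos / high);
        return 0;
    }

On real data the gap between levels varies, but the direction matches the advice given earlier in the thread: if reads are I/O-bound and writes have headroom, a higher level buys both ratio and read speed.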
Cyan (2018-02-18 19:39):

Thanks for the link, it's a great read!

Anand Jain (2018-02-18 15:40):

Thanks for writing about dictionary compression. We recently used it to store data efficiently in RAM, and wrote about it here: https://clevertap.com/blog/clevertap-engineering-behavioral-messaging-at-scale/