Friday, February 16, 2018

When to use Dictionary Compression

 On the Zstandard website, there is a small chapter dedicated to Dictionary compression. In a nutshell, it explains that it can dramatically improve compression ratio for small files. Which is correct, but doesn’t nearly capture the impact of this feature. In this blog post, I want to address this topic, and present a few scenarios in which dictionary offers more than just “improved compression”.

Database example

Let’s start by a simple case. Since dictionary compression is good for small data, we need an application which handles small data. For example a log record storage, which can be simplified as an append-only database.
Writes are append-only, but reads can be random. A single query may require retrieving multiple individual records scattered throughout the database, in no particular order.
As usual, the solution needs to be storage efficient, so compression plays an important role. And records tend to be repetitive, so compression ratio will be good.
However, this setup offers a classic example of the “block-size trade-off” : in order to get any compression done, it’s necessary to group records together into “blocks”, then compress each block as a single entity. The larger the block, the better the compression ratio. But now, in order to retrieve a single record, it’s necessary to decompress the whole block, which translates into extra-work during read operations.

Impact of Block Size on Compression ratio (level 1)
Last data point compresses records individually ([250-550] bytes)

The “optimal block size” is application dependent. It highly depends on data (how repetitive it is), and on usage scenarios (how many random reads). Typical sizes may vary between 4 KB and 128 KB.

Enter dictionary compression. It is now possible to train the algorithm to become specifically good at compressing a selected type of data.
(For simplicity purpose, in this example, we’ll assume that the log storage server stores a limited variety of record types, small enough to fit into a single dictionary. If not, it is necessary to separate record types into “categories”, generate a dictionary per category, then route each record into a separate pool based on category. More complex, but it doesn’t change the principles.)
As a first step, we can now just use the dictionary to grab additional compression (see graph), and it's an instant win. But that wouldn’t capture all the value from this operation.

The other benefit is that now, the optimal block size can be adjusted to the new conditions, since dictionary compression preserves better compression ratio (even 1 KB block size becomes practicable!). What we gain in exchange is improved performance for random read extraction.

Impact of Block Size on single record extraction (Random Reads)

At its extreme, it might even be possible to compress each record individually, so that no more energy is lost to decompress additional records which aren’t used by the query. But that's not even required.
This change produces huge wins for random reads (collecting “single records” scattered throughout the database). In this example, the log storage can now extract up to 2.5M random records / second, while without dictionary, it was limited to 400K random records / second, and at the cost of a pitiful compression ratio.

In fact, the new situation unlocks new possibilities for the database, which can now accept queries that used to be considered too prohibitive, and might previously have required another dedicated database to serve them.

Normal Vs Dictionary Compression (level 1)
Comparing Compression Ratio and Random Reads per second

That’s when the benefits of dictionary compression really kick in : beyond “better compression”, it unlocks new capabilities that were previously rightly considered “out of reach”.

Massive connections

Another interesting scenario is when a server maintains a large number of connections (multiple thousands) with many clients. Let’s assume the connection is used to send / receive requests respecting a certain protocol, so there tend to be a lot of inter-messages repetition. But within a single message, compression perspective is low, because each message is rather small.

The solution to this topic is known : use streaming compression. Streaming will keep in memory the last N KB (configurable) of messages and use that knowledge to better compress future messages. It reaches excellent performance, as same tags and sequences repeated across messages get squashed in subsequent messages.

The problem is, preserving such a “context” costs memory. And a lot of memory that is. To provide some perspective, a Zstandard streaming context with an history of 32 KB requires 263 KB of memory (about the same as zlib). Of course, increasing history depth improves compression ratio, but also increases memory budget.
That doesn’t look like a large budget in isolation, or even for a few connections, but when applied to 10 thousands client connections, we are talking about > 2.5 GB of RAM. Situation worsen with more connections, or when trying to improve ratio through larger history and higher compression levels.

In such a situation, dictionary compression can help. Training a dictionary to be good at compressing a set of protocols, and then ignite each new connection with this dictionary, will produce instant benefits during the early stage of the connection lifetime. Fine, but then, streaming history takes over, so gains are limited, especially when connection lifetime is long.

A bigger benefit can be realised when understanding that the dictionary can be used to completely eliminate streaming. Dictionary will then partially offset compression ratio reduction, so mileage can vary, but in many cases, ratio just takes a small hit, as generic redundant information is already included in the dictionary anyway.
What we get in exchange are 2 things :
  • huge RAM savings : it’s no longer necessary to preserve a context per client, a single context per active thread is now enough, which is typically several order of magnitude less than the nb of clients.
  • vastly simplified communication framework, as each request is now technically “context-less”, eliminating an important logic framework dedicated to keeping contexts synchronised between sender and receiver (with all the funny parts related to time out, missed or disordered packets, etc.).
Memory budget comparison
Context-less strategy requires much less memory (and is barely visible)

Eliminating the RAM requirement, which evolves from dominant to negligible, and reducing complexity open in turn new possibilities, such as hosting even more connections per server, or improved performance for other local applications. Or simply reduce the number of servers required for the same task.

This kind of scenarios is where Dictionary Compression can give its best : beyond “better compression for small data”, it makes it possible to build target application differently, with second-order effects being as important if not more than compression savings alone.

These are just 2 examples. And I’m sure there are more. But that’s enough to illustrate the topic, and probably enough words for a simple blog post.


  1. Thanks for writing about Dictionary compression. We recently used it to store data efficiently in RAM, and wrote about it here -

  2. > Plot: Block Size vs Random Reads / s

    Do I understand correctly that the dip on the right side of the blue (non-dictionary) line is because I/O is becoming the bottleneck? In other words compression ratio is getting so bad that reading and decompressing individually compressed records is more expensive than reading a compressed block of multiple records, (partially) decompressing it, and throwing away most of the data. For me that's a very remarkable result.

    > Plot: Compression Ratio vs Random Reads / s

    I feel like this would be easier to understand if you had switched the axis. Doing that in my head, the blue line has a 'peak' where both lower and higher compression ratio provides worse results. The orange line does not have such a peak, the optimal point is the lowest compression (individual records). Could that be a sign that choosing different compression settings (worse compression ratio but less cpu usage) could further improve performance in this scenario?

    1. > Plot: Block Size vs Random Reads / s

      Thanks for pointing that out David.
      After verification, the issue is, I over-estimated the average size of individual records (which span in the [250-550] range). It does not matter when comparing speed at different block sizes, but it does matter when comparing blocks with single-record compression.

      After fixing this value, I'm getting an almost flat ending : it's barely faster to read single records than to read a (small) block of multiple records and throw away the useless ones. But it's nonetheless *slightly* faster.
      Which feels more logical.

      I'll fix the graphs with the new values.

    2. > Could that be a sign that choosing different compression settings (worse compression ratio but less cpu usage) could further improve performance in this scenario?

      All tests have been performed at compression level 1.
      It's the fastest compression setting available.

      Using higher compression levels would improve compression ratio, and generally as a consequence, slightly improve decompression speed. So all graphs would improve.

      In general, if your workload is io-read-limited, and if you have margin on the write side, it's better to increase compression level, to improve compression ratio _and_ read speed.

  3. I would like to understand a few basic things about dictionary based zstd compression:
    1) Are these dictionaries used for modeling or encoding?
    2) If they help with modeling, do they contain actual/anticipated unique sub-strings? (Did
    we move the unique string literals from the compressed message into a database?)
    3) Where are these dictionaries stored? Esp. custom generated dictionaries?
    4) How are the dictionaries made available to decompression? Example:
    We compressed a message using a custom dictionary on one system. Trying to decompress said
    message on a different system. Where is the dictionary coming from? Is it attached to the
    5) Do these dictionaries make much sense for stateless compression?
    6) Do dictionaries makes sense for small_block e.g.: 4 or 8k stateless compression?

    1. 1) I'm not sure about what is meant by separating modeling and encoding
      2) Yes, dictionaries contain anticipated strings
      3-4) The whole mechanism of shipping, sending, synchronizing dictionaries is implementation dependent. There are a lot of possibilities here, with distinct trade-offs. For a more detailed presentation of one possible solution, I recommend reading the following white paper, chapter "Managed Compression" :
      5) Dictionaries make a lot of sense when trying to remove stateful compression, since they replace point-to-point dedicated states with generic states
      6) Dictionaries are especially useful for small blocks

  4. Took a look at the command line interface to the dictionary feature, and I infer the following:

    Seems to me that the dictionary contains some "probable" string literals, so it is for the modeling phase of compression. It looks like it is oriented towards file (stateful) compression as it is now envisioned. I imagine that an implementation would have to have some dictionary repository and at least a file naming convention, to make available the common dictionaries. An implementation relying on custom generated dictionaries would have to transmit these along with the compressed message(s) if the recipient is to have any hope of decompressing. Managing the dictionaries for stateless compression would be a bit of a challenge, I imagine each compressed message (block) would have to have a dictionary uuid stamped into it, so the decompress code can select the correct (probably cached) dictionary at decompress time. To make custom dictionaries useful between independent entities would be a bit of a challenge. there would have to be an agreed upon protocol to distribute these along with the compressed data. (I there a standard method envisioned for dictionary identification/distribution? I will go read the rfc see if there is such thing already proposed).
    Also: The shifting of the storage of some of the unique substring literals from the message to the dictionary has to be a "best effort". In other words, the dictionary procedure is inherently opportunistic. Unique substring literals would still have to be stored in the compressed message, but some of these substring could be replaced by dictionary references, if they are already present in the dictionary that is in use. I imagine the current implementation assumes there is only one dictionary loaded/used for the compression/decompression of a given message.

    Please correct me if I got this wrong.

    1. > I imagine each compressed message (block) would have to have a dictionary uuid stamped into it,

      That's exactly how it works today

      > the current implementation assumes there is only one dictionary loaded/used for the compression/decompression of a given message.

      Indeed, for a given "frame", only one dictionary can be used.