tag:blogger.com,1999:blog-8341348527880854922024-03-12T05:52:26.889+01:00RealTime Data CompressionDevelopment blog on compression algorithmsCyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.comBlogger139125tag:blogger.com,1999:blog-834134852788085492.post-20814533357918759352019-03-15T18:21:00.001+01:002020-10-14T16:37:46.216+02:00Presenting XXH3<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>xxh3</title>
<link rel="stylesheet" href="https://stackedit.io/style.css" />
</head>
<body class="stackedit">
<div class="stackedit__html"><h1 id="xxh3---a-new-speed-optimized-hash-algorithm">XXH3 - a new speed-optimized hash algorithm</h1>
<p>The <a href="http://www.xxhash.com">xxHash</a> family of hash functions has proven more successful than anticipated. Initially designed as a checksum companion for LZ4, it has found its way into many more projects, requiring vastly different workloads.</p>
<p>I was recently summoned to investigate performance for a bloom filter implementation, which needed to quickly generate 64 pseudo-random bits from small inputs of variable length. <code>XXH64</code> could fit the bill, but performance on small inputs was never its priority. It’s not completely wasteful either: it pays some attention to short inputs, thanks to a small speed module in SMHasher. However, the module itself does the bare minimum, and it was not clear to me what exactly is being measured.</p>
<p>So I decided to create my own benchmark program, as a way to ensure that I understand and control what’s being measured. This was a very interesting journey, leading to surprising discoveries.</p>
<p>The end result of this investigation is <code>XXH3</code>, a cross-over inspired by many other great hash algorithms, which proves substantially faster than existing variants of xxHash, across basically all dimensions.<br>
Let’s detail those dimensions, and give some credit where inspiration is due.</p>
<h3 id="checksumming-long-inputs">Checksumming long inputs</h3>
<p><code>xxHash</code> started as a fast checksum for <a href="https://www.lz4.org">LZ4</a>, and I believe it can still be useful for this purpose. It has proven popular among movie makers for file transfer verification, saving a lot of time thanks to its great speed. The main downside is that <code>XXH64()</code> is limited to 64 bits, which is insufficient when comparing a really large number of files (and by large I mean many millions of them). For this reason, a 128-bit variant has often been requested.</p>
<p><code>XXH3</code> features a wide internal state of 512 bits, which makes it suitable for generating a hash of up to 256 bits. For the time being, only 64-bit and 128-bit variants are exposed, but a similar recipe can be used for a 256-bit variant if there is ever a need for one. All variants feature the same speed, since only the finalization stage differs.</p>
<p><img src="https://user-images.githubusercontent.com/750081/61976096-b3a35f00-af9f-11e9-8229-e0afc506c6ec.png" alt="XXH3 bandwidth"></p>
<p>I’m using this opportunity to compare with a few other well known hash algorithms, either because their notoriety makes them frequently named in discussions related to hash algorithms (<code>FNV</code>, <code>CRC</code>), or because they are very good in at least one dimension.</p>
<p><code>XXH3</code> proves very fast on large inputs, thanks to a vector-friendly inner loop, inspired by <a href="https://github.com/Bulat-Ziganshin/FARSH">Bulat Ziganshin’s Farsh</a>, itself based on the <a href="https://en.wikipedia.org/wiki/UMAC">UMAC</a> paper.</p>
<p>Unfortunately, UMAC features a critical flaw for checksumming purposes: it <em>ignores</em> 4 bytes of input, on average once every 16 GB. This might not seem like much, and it might even be acceptable when the goal is to generate a 32-bit checksum, as in the original paper. But for checksumming large files with 64-bit or 128-bit fingerprints, this is a big no-no.<br>
So the version embedded into <code>XXH3</code> is modified to guarantee that all input bytes are necessarily present in the final mix. This makes it a bit slower, but as can be seen in the graphs, it remains plenty fast.</p>
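To illustrate the idea, here is a simplified scalar sketch (hypothetical names, not the actual XXH3 kernel): a UMAC-style step multiplies the two 32-bit halves of the keyed input word, and the modification consists in also adding the raw input word into a neighboring accumulator, so that no input byte can ever vanish from the state.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified scalar sketch of one accumulator step (hypothetical, not
 * the actual XXH3 kernel). The 32x32->64 multiply mixes the keyed input;
 * adding the raw input word into the other lane guarantees every input
 * byte remains present in the state, fixing the UMAC "ignored bytes" flaw. */
static void accumulate_lane(uint64_t acc[2], uint64_t input, uint64_t key)
{
    uint64_t const keyed = input ^ key;
    acc[1] += input;  /* raw input preserved : bytes can no longer cancel out */
    acc[0] += (uint64_t)(uint32_t)keyed * (keyed >> 32);  /* 32x32->64 multiply */
}
```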
<p>Vectorization must be done manually, using intrinsics, as the compiler seems unable to properly auto-vectorize the scalar code path.<br>
For this reason, the code offers 4 paths : scalar (universal), <code>SSE2</code>, <code>AVX2</code>, and also <code>NEON</code>, contributed by <a href="https://github.com/easyaspi314">Devin Hussey</a>. It may be possible to vectorize additional platforms, though this requires dedicated efforts.</p>
<p><code>SSE2</code> is enough to reach substantial speed, which is great because all <code>x64</code> cpus necessarily support this instruction set. <code>SSE2</code> is also free of dynamic throttling issues, and is automatically enabled on all x64 compilers. Hence I expect it to be the most common target.</p>
<p>On a given code path, compilers can make a difference. For example, <code>AVX2</code> vectorization is significantly more effective with <code>clang</code>. Actually, this variant is so fast that I wondered whether it could outrun my main memory. So I graphed the speed over a variety of input sizes.</p>
<p><img src="https://user-images.githubusercontent.com/750081/62815356-53bcc480-bb18-11e9-8bb5-699d4972fcb6.png" alt="XXH3 Bandwidth, per size"></p>
<p>As one can see, the <code>AVX2</code> build <em>is</em> much faster than main memory, and the impact of cache size is now clearly observable, at 32 KB (L1), 256 KB (L2) and 8 MB (L3). As a rule, “top speed” is only achievable when data is already in cache.</p>
<p>So is it worth being so fast ?<br>
If data is very large (say, a movie), it can’t fit in the cache, so the bottleneck will be at best the main memory, if not the I/O system itself. In that case, a faster hash may save cpu time, but it will not make the checksumming operation any faster.</p>
<p>On the other hand, there are many use cases where data is neither large nor small, say in the KB range. This includes many types of records, typical of database workloads. In these use cases, hashing is not the main operation : it’s just one of many operations, sandwiched between other pieces of code. Input data is already in the cache, because it was needed anyway by these other operations. In such a scenario, hashing faster contributes to a faster overall run time, as the cpu savings are employed by subsequent operations.</p>
<h3 id="bit-friendliness">32-bit friendliness</h3>
<p>The computing world is massively transitioning to 64-bit, even on mobile. The remaining space for 32-bit seems ever shrinking. Yet, it’s still present, in more places than can be listed. For example, many virtual environments generate bytecode designed to produce 32-bit applications.</p>
<p>Thing is, most modern hash algorithms take advantage of 64-bit instructions, which can ingest data twice as fast, making them key to great speed. Once translated for the 32-bit world, these 64-bit instructions can still be emulated, but at a cost. In most cases, the translation results in a massive speed loss. That’s why <code>XXH32</code> remains popular for 32-bit applications: it’s a great performer in this category.</p>
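To see why emulation is costly, here is a sketch of what a 32-bit target must do to reproduce a single 64x64->64 multiply: three 32x32->64 multiplies plus shifts and adds, instead of one instruction (illustrative code, not taken from xxHash):

```c
#include <assert.h>
#include <stdint.h>

/* Illustration : a 64x64->64 multiply decomposed into 32-bit multiplies,
 * roughly what a 32-bit target must emulate for each 64-bit multiply.
 * Computes (a_hi*2^32 + a_lo) * (b_hi*2^32 + b_lo), truncated mod 2^64. */
static uint64_t mul64_via_32(uint64_t a, uint64_t b)
{
    uint32_t const a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t const b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint64_t const low   = (uint64_t)a_lo * b_lo;  /* full 32x32->64 product */
    uint64_t const cross = (uint64_t)a_lo * b_hi + (uint64_t)a_hi * b_lo;
    return low + (cross << 32);  /* bits above 2^64 fall off */
}
```

This is why a hot loop built mostly on 32x32->64 multiplies, which 32-bit targets execute natively, retains most of its speed when compiled for 32-bit.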
<p>A nice property of <code>XXH3</code> is that it doesn’t lose so much speed when translated into 32-bit instructions. This is due to some careful choices in instructions used in the main loop. The result is actually pretty good :</p>
<p><img src="https://user-images.githubusercontent.com/750081/62815374-85ce2680-bb18-11e9-8fc4-4cd06a32935d.png" alt="XXH3, bandwidth in 32-bit mode"></p>
<p><code>XXH3</code> can overtake <code>XXH32</code>, even without vector instructions ! Enabling <code>SSE2</code> puts it in another league.</p>
<p>A similar property can be observed on 32-bit ARM. The base speed is very competitive, and the <code>NEON</code> vector code path designed by Devin works wonders, pushing speed to new heights.</p>
<h3 id="hashing-small-inputs">Hashing small inputs</h3>
<p>The high speed achieved on large input wasn’t actually the center of my investigation.<br>
The main focus is about short keys of random lengths, with a distribution of length roughly in the 20-30 bytes area, featuring occasional outliers, both tiny and large.</p>
<p>This scenario is very different. Actually, with such small inputs, the vectorized inner loop is <em>never</em> triggered. Delivering a good quality hash result must be achieved using a small number of operations.</p>
<p>This investigation quickly converged onto <a href="https://github.com/google/cityhash">Google’s CityHash</a>, by Geoff Pike and <a href="https://github.com/jyrkialakuijala">Jyrki Alakuijala</a>. This algorithm features an excellent access pattern for small data, later replicated into FarmHash, giving them an edge. This proved another major source of inspiration for <code>XXH3</code>.</p>
<p>A small concern is that CityHash comes in 2 variants, with or without seed. One could logically expect both variants to be “equivalent”, with one merely setting a default seed value.<br>
That’s not the case. The variant without seed foregoes the final avalanche stage, making it faster. Unfortunately, this also makes it <a href="https://pastebin.com/JT1KDJc0">fail SMHasher’s avalanche test</a>, showing very large bias. For this reason, I will distinguish both variants in the graphs, as the speed difference on small inputs is quite noticeable.</p>
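For context, a final avalanche stage is typically a short xorshift-multiply sequence; the sketch below borrows the constants of XXH64's finalization (a standalone illustration, not CityHash's own code). Skipping such a stage saves a few cycles, at the cost of the bias measured above.

```c
#include <assert.h>
#include <stdint.h>

/* A classic xorshift-multiply avalanche finalizer (constants borrowed
 * from XXH64's finalization stage). Its job : ensure that flipping any
 * single input bit flips, on average, half of the output bits. */
static uint64_t avalanche(uint64_t h)
{
    h ^= h >> 33;
    h *= 0xC2B2AE3D27D4EB4FULL;
    h ^= h >> 29;
    h *= 0x165667B19E3779F9ULL;
    h ^= h >> 32;
    return h;
}
```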
<p>The benchmark test looks simple enough : just loop over some small input of known size, and count the number of hashes produced. Size is only known at run time, so there’s no way for the compiler to “specialize” the code for a given size. There are some finicky details in ensuring proper timing, but once solved, it gives an interesting ranking.</p>
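The core of such a throughput loop can be sketched as follows (hypothetical helper names; <code>trivial_hash</code> is a stand-in for any candidate with this signature; a real harness times much longer runs):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Stand-in hash, for demonstration only */
static uint64_t trivial_hash(const void* data, size_t len)
{
    const unsigned char* p = (const unsigned char*)data;
    uint64_t h = 0x9E3779B185EBCA87ULL;
    for (size_t i = 0; i < len; i++) h = (h ^ p[i]) * 0x165667B19E3779F9ULL;
    return h;
}

typedef uint64_t (*hash_fn)(const void* data, size_t len);

/* Throughput sketch : `len` is a runtime value, so the compiler
 * cannot specialize the hash for a known size. */
static double hashes_per_sec(hash_fn h, const void* data, size_t len, unsigned nbRounds)
{
    uint64_t sink = 0;  /* consume results, preventing dead-code elimination */
    clock_t const start = clock();
    for (unsigned i = 0; i < nbRounds; i++)
        sink ^= h(data, len);
    clock_t const end = clock();
    (void)sink;
    /* +1 guards against a zero interval on coarse clocks */
    return (double)nbRounds * CLOCKS_PER_SEC / (double)(end - start + 1);
}
```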
<p><img src="https://user-images.githubusercontent.com/750081/62815468-8a470f00-bb19-11e9-9da5-f6c9db31a984.png" alt="XXH3, throughput, small fixed size"></p>
<p>Top algorithms are based on the same “access pattern”, and there are visible “steps” upon reaching length 33+, and then again at 65+. That’s because, in order to generate fewer branches, the algorithm does exactly the same work for all lengths from 33 to 64 bytes. So the number of instructions to run is comparatively large for 33 bytes.<br>
In spite of this, <code>XXH3</code> maintains a comfortable lead even at “bad” length values (17, 33, 65).</p>
<p>These first results look good, but they are not yet satisfying.<br>
Remember the “variable size” requirement ?<br>
It is not met by this scenario.</p>
<h3 id="impact-of-variable-input-sizes">Impact of variable input sizes</h3>
<p>Always providing the same input size is simply too easy on branches. The branch predictor can do a good job at guessing the outcome every time.</p>
<p>This is just not representative of most real-life scenarios, where there’s no such predictability. Mix inputs of different sizes, and it wreaks havoc on all these branches, adding a considerable cost at each hash. This impact is often overlooked, because measuring it is a bit more difficult. But it’s important enough to deserve some focus.</p>
<p>In the following scenario, input sizes are presumed randomly distributed between 1 and N. The distribution of lengths is pre-generated, and the same distribution is used for all hashes for a same N. This leans towards a worst-case scenario: in practice, input sizes generally feature some kind of locality (as in the target scenario, mostly between 20 and 30 bytes). But it gives us a good idea of how algorithms handle varying sizes.</p>
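One way to set this up (a sketch; names are hypothetical): pre-generate a table of lengths in [1, N] once, then replay exactly the same table for every algorithm, so all candidates face an identical, unpredictable sequence of sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NB_LENGTHS 1024

/* Pre-generate a reusable distribution of lengths in [1, maxLen],
 * using a simple 64-bit LCG so the sequence is reproducible. */
static void fill_lengths(size_t lengths[NB_LENGTHS], size_t maxLen, uint64_t seed)
{
    for (int i = 0; i < NB_LENGTHS; i++) {
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
        lengths[i] = 1 + (size_t)((seed >> 33) % maxLen);
    }
}
```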
<p><img src="https://user-images.githubusercontent.com/750081/62815500-ce3a1400-bb19-11e9-8a8a-6365e54960b3.png" alt="XXH3, throughput on small inputs of random length"></p>
<p>This is a more significant victory for algorithms with an optimized access pattern. When input sizes become unpredictable, branch mispredictions become a much larger contributor to performance. The optimized access pattern makes the workload more predictable, and reduces the number of branches which can be mispredicted. This is key to preserving a good level of performance in these conditions.</p>
<h3 id="throughput-versus-latency">Throughput versus Latency</h3>
<p>Throughput is relatively simple to measure : just loop over a bunch of inputs, hash them, then count the number of hashes completed in a given time.<br>
But one could wonder if throughput is an appropriate metric. It represents a “batch” workload, where a ton of hashes are feverishly completed one after another. It may happen sometimes.</p>
<p>But in many cases, hashing is just one operation sandwiched between other, very different tasks. This is a completely different situation.<br>
In this new setup, hashing must wait for the prior operation to complete in order to receive its input, and the subsequent operation is blocked as long as the hash result is not produced. Hence latency seems a much better metric.</p>
<p>However, measuring latency is a lot more complex. I had many false starts in this experiment.<br>
I initially thought that it would be enough to provide the result of the previous hash as the <code>seed</code> of the next hash. It doesn’t work : not only do some algorithms not take a <code>seed</code> as argument, a few others use the <code>seed</code> only at the very end of the calculation, letting them start hash calculations before the end of the previous hash.<br>
In reality, in a latency scenario, the hash is waiting for its input to become available, so it’s the input which must be based on the previous hash result. After a lot of pain, a better solution was finally suggested by <a href="https://github.com/felixhandte">Felix Handte</a> : use a pre-generated buffer of random bytes, and start hashing from a variable position derived from the previous hash result. This enforces that the next hash has to wait for the previous hash result before starting.</p>
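A sketch of this serialization trick (hypothetical names; <code>trivial_hash</code> stands in for the hash under test):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in hash, for demonstration only */
static uint64_t trivial_hash(const void* data, size_t len)
{
    const unsigned char* p = (const unsigned char*)data;
    uint64_t h = 0x9E3779B185EBCA87ULL;
    for (size_t i = 0; i < len; i++) h = (h ^ p[i]) * 0x165667B19E3779F9ULL;
    return h;
}

/* Latency sketch : the start position of each input is derived from the
 * previous hash result, so hash N+1 cannot start before hash N completes. */
static uint64_t chained_hashes(const unsigned char* buffer, size_t bufSize,
                               size_t len, unsigned nbRounds)
{
    uint64_t h = 0;
    for (unsigned i = 0; i < nbRounds; i++) {
        size_t const pos = (size_t)(h % (bufSize - len));  /* serial dependency */
        h = trivial_hash(buffer + pos, len);
    }
    return h;
}
```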
<p>This new setup creates a surprisingly different ranking :</p>
<p><img src="https://user-images.githubusercontent.com/750081/62815421-02610500-bb19-11e9-8bb7-42d1dd0fdb02.png" alt="XXH3, latency, fixed size"></p>
<p>Measurements are a bit noisy, but trends look visible.</p>
<p>The latency-oriented test favors algorithms like <a href="https://github.com/vnmakarov/mum-hash">Vladimir Makarov’s <code>mumv2</code></a> and <a href="https://github.com/leo-yuriev/t1ha">Leo Yuriev’s <code>t1ha2</code></a>, using the 64x64=>128-bits multiplication. This proved another source of inspiration for <code>XXH3</code>.</p>
<p>CityHash suffers in this benchmark. CityHash is based on simpler instructions, and completing a hash requires many more of them. In a throughput scenario, where there is no serialization constraint, CityHash can start the next hash before finishing the previous one. Its simple instructions can be spread more effectively over multiple execution units, achieving a high level of IPC (Instructions Per Clock). This makes CityHash throughput-friendly.</p>
<p>In contrast, the 64x64=>128-bits multiplication has access to a very restricted set of ports, but is more powerful at mixing bits, allowing usage of <em>fewer</em> instructions to create a hash with good avalanche properties. Fewer instructions translate into a shorter pipeline.</p>
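The primitive in question can be sketched as follows (an illustration of the technique, not any hash's exact code; it requires a compiler exposing a 128-bit type, such as gcc or clang):

```c
#include <assert.h>
#include <stdint.h>

/* A 64x64->128 multiply "folded" back to 64 bits by xoring the two
 * halves : a single multiplication mixes all 64 input bits thoroughly.
 * Requires __uint128_t support (gcc, clang). */
static uint64_t mul128_fold64(uint64_t a, uint64_t b)
{
    __uint128_t const product = (__uint128_t)a * b;
    return (uint64_t)product ^ (uint64_t)(product >> 64);  /* low ^ high */
}
```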
<p>In the latency scenario, <code>mumv2</code> fares very well, fighting for first place up to the 32-byte mark, after which <code>XXH3</code> starts to take the lead.</p>
<p>However, this scenario involves fixed input size. It’s simple to code and explain, but as we’ve seen before, fixed size is actually an uncommon scenario : for most real-world use cases, input has an unpredictable size.</p>
<p>Hence, let’s combine the benchmark techniques seen previously, and look at the impact of random input lengths on latency.</p>
<p><img src="https://user-images.githubusercontent.com/750081/61976089-aedeab00-af9f-11e9-9239-e5375d6c080f.png" alt="XXH3, latency, random length"></p>
<p>This is an important graph, as it matches the target use case of <code>XXH3</code>, and incidentally many real-world database/server use cases I’m aware of.</p>
<p>The variable size scenario favors algorithms using an optimized access pattern to reduce branch mispredictions. <code>mumv2</code>, which was performing very well when input size was stable, loses a lot in this scenario. <code>t1ha2</code> makes a better effort, and while not as well optimized as CityHash for this purpose, loses nonetheless much less performance to variable input sizes, taking second place (if one does not count the “seed-less” variants in the ranking, due to the afore-mentioned avalanche problems).</p>
<p>As could be expected, <code>XXH3</code> is well tuned for this scenario. That’s no surprise, since this was its design target. So, is it basically mission accomplished ?</p>
<h3 id="hash-quality">Hash Quality</h3>
<p>It wouldn’t be a complete presentation without a note on hash quality. A good hash should make collisions as rare as possible, bounded by the birthday paradox, and offer great avalanche properties : two different inputs shall produce vastly different outputs, even if they only differ by a single bit.</p>
<p>As expected, <code>XXH3</code> <a href="https://pastebin.com/ryLN24Qy">completes all tests</a> from the <a href="https://github.com/aappleby/smhasher"><code>SMHasher</code></a> test suite. Both 64-bit and 128-bit variants were validated, as well as each of their 32-bit constituents.</p>
<p>But it went a bit further.<br>
<code>SMHasher</code> was designed many years ago, at a time when hashing was mostly a single main loop iterating over input. But as hash algorithms have become more powerful, this model no longer holds : modern hashes tend to feature a large inner loop, which is only triggered above a certain length. That means the algorithm being tested on a few input bytes is actually different from the one run on large inputs.</p>
<p>Because search space tends to explode with input size, and because computing capacity used to be weaker when SMHasher was created, most tests are concentrated on small inputs. As a consequence, tests for larger input sizes are very limited.</p>
<p>In order to stress the algorithm, it was necessary to push the tests beyond their usual limits. So I created a <a href="https://github.com/Cyan4973/smhasher">fork</a> of <a href="https://github.com/rurban/smhasher">rurban’s excellent SMHasher fork</a>, methodically increasing limits to new boundaries. It’s still the same set of tests, but exploring a larger space, hence longer to run.<br>
This proved useful during the design stage, eliminating risks of long-distance “echo” for example (when bits cancel each other by virtue of being at some exact relative position).<br>
It also proved interesting to run these extended tests on existing algorithms, uncovering some “surprises” that were masked by the lower threshold of original tests.<br>
To this end, these changes will be offered back to rurban’s fork, in the hope that they will prove useful for future testers and implementers.</p>
<h3 id="release">Release</h3>
<p><code>XXH3</code> is now released as part of <a href="https://github.com/Cyan4973/xxHash/releases/tag/v0.7.0">xxHash v0.7.0</a>. It’s still labelled “experimental”, and must be unlocked using the macro <code>XXH_STATIC_LINKING_ONLY</code>. It’s suitable for ephemeral data and tests, but avoid storing long-term hash values for now. This period will be used to gather users’ feedback, after which the algorithm will be transferred to stable status in a future release.</p>
<p><b>Update</b>: Since the release of <a href="https://github.com/Cyan4973/xxHash/releases/latest">xxHash v0.8.0</a>, <code>XXH3</code> is labelled "stable", meaning produced hash values can be stored on disk or exchanged over a network, as any future version is now guaranteed to produce the same hash value. Compared with the initial release, <code>v0.8.0</code> comes with streaming capabilities, 128-bit variant support, and better inlining.</p>
</div>
</body>
</html>
Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com36tag:blogger.com,1999:blog-834134852788085492.post-66658582392887929292019-01-30T08:13:00.000+01:002019-01-30T08:13:47.180+01:00Compiler-checked contracts<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>btrc: compile6 : compiler-validated contracts</title>
<link rel="stylesheet" href="https://stackedit.io/style.css" />
</head>
<body class="stackedit">
<div class="stackedit__html">
<p><a href="https://en.wikipedia.org/wiki/Design_by_contract">Contract programming</a> is not a new concept. It’s a clean way to design narrow contracts, by spelling out their conditions explicitly, and checking them actively at the contract’s interface. In essence, it translates into a bunch of <code>assert()</code> at the entrance and exit of a function. It’s a fairly good formal design, although one associated with a runtime penalty.</p>
<p>We left the previous episode with an ability to express function preconditions and make them checked by the compiler, but no good way to transport the outcome of these checks into the function body. We’ll pick up from there.</p>
<p>The proposed solution is to re-express these invariants in the function as <code>assert()</code>, as they should have been anyway if <code>EXPECT()</code> was absent. It works, but it also means that downstream <code>EXPECT()</code> can only be validated when <code>assert()</code> are enabled, aka. in debug builds.</p>
<p>Let’s try to improve this situation, and keep <code>EXPECT()</code> active while compiling in release mode, aka with <code>assert()</code> disabled.<br>
What is needed is an <code>assert()</code> that still achieves its outcome on Value Range Analysis while disabled. Such a thing exists, and is generally called an <code>assume()</code>.</p>
<h3 id="assume"><code>assume()</code></h3>
<p><code>assume()</code> is not part of the language, but most compilers offer some kind of hook to build one. Unfortunately, they all differ.</p>
<p>On <code>gcc</code>, <code>assume()</code> can be created using <code>__builtin_unreachable()</code> :</p>
<pre class=" language-c"><code class="prism language-c"><span class="token macro property">#<span class="token directive keyword">define</span> assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)</span>
</code></pre>
<p><code>clang</code> provides <code>__builtin_assume()</code>. <code>icc</code> and Visual provide <code>__assume()</code>. Etc.<br>
You get the idea.</p>
<p>An important point here is that, in contrast with all techniques seen so far, <code>assume()</code> actually reduces the compiler’s effectiveness at catching bugs. It’s an explicit “trust me, I’ll tell you what you need to know” situation, and it’s easy to get it wrong.</p>
<p>One way to mitigate this negative impact is to make sure <code>assume()</code> are converted into <code>assert()</code> within debug builds, so that there is at least a chance that wrong assumptions get caught during tests.</p>
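A possible portable wrapper implementing this mitigation (the macro name is mine; the per-compiler hooks are the ones listed above):

```c
#include <assert.h>

/* In debug builds, my_assume() degrades into a checked assert(), so a
 * wrong assumption has a chance to be caught during tests. In release
 * builds, it becomes the best compiler hint available. */
#if !defined(NDEBUG)
#  define my_assume(cond) assert(cond)
#elif defined(__clang__)
#  define my_assume(cond) __builtin_assume(cond)
#elif defined(__GNUC__)
#  define my_assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#elif defined(_MSC_VER)
#  define my_assume(cond) __assume(cond)
#else
#  define my_assume(cond) do {} while (0)  /* no hint available */
#endif
```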
<p><code>assume()</code> however has additional restrictions compared to <code>assert()</code>. <code>assert()</code> merely needs to produce no side-effect, and to offer a tractable runtime cost, though even this last point is negotiable. But <code>assume()</code> must be transformed into a pure compiler hint, leaving no trace in the generated binary (beyond the assumption’s impact). In particular, the test itself should not be present in the generated binary.<br>
This restricts eligible tests to simple conditions only, such as <code>(i>=0)</code> or <code>(ptr != NULL)</code>. A counter-example would be <code>(is_this_graph_acyclic(*graph))</code>. “Complex” conditions will not provide any useful hint to the compiler, and on top of that, may also leave a runtime trace in the generated binary, resulting in a <em>reduction</em> of performance.</p>
<p><em>Note</em> : I haven’t found a way to ensure this property on <code>gcc</code> : if <code>assume()</code> is served a complex condition, it will <a href="https://godbolt.org/z/lKNMs3">happily generate additional asm code</a>, without issuing any warning.<br>
Fortunately, <code>clang</code> is <a href="https://godbolt.org/z/aLDyrP">way better at this game</a>, and will correctly flag bad <code>assume()</code> conditions, which would generate additional code in <code>gcc</code>.<br>
As a consequence, it’s preferable to have <code>clang</code> available and check the code with it from time to time to ensure all <code>assume()</code> conditions are correct.</p>
<p>In our case, <code>assume()</code>’s main objective is not really performance : it is to forward conditions already checked by <code>EXPECT()</code> into the function’s body, so that they can be re-used to automatically comply with downstream <code>EXPECT()</code> conditions, even when <code>assert()</code> is disabled, aka during release compilation.</p>
<p>So here we are : every time a condition is required by <code>EXPECT()</code> and cannot be deduced from the local code, express it using <code>assume()</code> rather than <code>assert()</code>. This will make it possible to keep <code>EXPECT()</code> active irrespective of the debug nature of the build.<br>
Note that, if any condition is too complex for <code>assume()</code>, we are back to square one, and need to rely on <code>assert()</code> only (hence debug builds only).</p>
<p>Being able to keep <code>EXPECT()</code> active in release builds is nice, but not terrific. At this stage, we still need to write all these <code>assume()</code> in the code, and we cannot take advantage of pre-conditions already expressed at the entrance of the function.</p>
<p>Worse, since pre-conditions are expressed on one side, in the <code>*.h</code> header where the function prototype is published, while the corresponding <code>assume()</code> are expressed in the function body, within the <code>*.c</code> unit file, these are 2 separate places, and it’s easy for them to fall out of sync when one side changes the conditions.</p>
<h3 id="expressing-conditions-in-one-place">Expressing conditions in one place</h3>
<p>What we need is to express preconditions in a <a href="https://en.wikipedia.org/wiki/Single_source_of_truth">single source of truth</a>. This place should preferably be close to the prototype declaration, since it can also serve as function documentation. Then the same conditions will be used in the function body, becoming assumptions.</p>
<p>The solution is simple : define a macro to transport the conditions in multiple places.<br>
Here is <a href="https://godbolt.org/z/oWoT3I">an example</a>.<br>
The conditions are transferred from the header, close to prototype declaration, into the function body, using a uniquely named macro. It guarantees that conditions are kept in sync.<br>
In the example, note how the knowledge of <code>minus2()</code> preconditions, now considered satisfied within function body, makes it possible to automatically comply with the preconditions of invoked <code>minus1()</code>, without adding any <code>assert()</code> or <code>assume()</code>.</p>
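Spelled out, the pattern looks like this (mirroring the <code>minus1()</code>/<code>minus2()</code> example from the godbolt link; <code>MY_ASSUME</code> is reduced to a plain <code>assert()</code> here so the sketch stays portable):

```c
#include <assert.h>

#define MY_ASSUME(cond) assert(cond)  /* stand-in for the compiler-specific assume() */

/* --- header side : conditions published next to the prototypes --- */
#define minus1_preconditions(i) ((i)>=1)
int minus1(int i);
#define minus2_preconditions(i) ((i)>=2)
int minus2(int i);

/* --- unit side : the same macros become assumptions in the body --- */
int minus1(int i) { MY_ASSUME(minus1_preconditions(i)); return i-1; }

int minus2(int i)
{
    MY_ASSUME(minus2_preconditions(i));  /* i>=2 is now known ... */
    return minus1(i-1);                  /* ... hence i-1>=1 : minus1's precondition holds */
}
```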
<p>In this example, the condition is trivial, <code>(i>=2)</code>, using a single argument. Using a macro to synchronize such a trivial condition may seem overkill. However, synchronization is important in its own right. Besides, more complex functions, featuring multiple conditions on multiple arguments, will be served by a design pattern which can just be reproduced mindlessly, whatever the complexity of the preconditions : <code>assume(function_preconditions());</code>.</p>
<p>There is still a variable element, related to the number of arguments and their order.<br>
To deal with that variance, argument names could be baked directly into the preconditions macro. Unfortunately, this would only work within a function. But since the macro transporting preconditions is itself invoked within a macro, it wouldn’t expand correctly.</p>
<p>Another downside is that we just lost a bit of clarity in the warning message : conditions themselves used to be part of the warning message, now only the macro name is, which transmits less information.<br>
Unfortunately, I haven’t found a way around this issue.</p>
<p>To preserve the clarity of the warning message, it may be tempting to keep the previous format, with conditions expressed directly in the masking function macro, whenever such conditions are not required afterwards in the body. However, this creates a special case, splitting functions between those which replicate conditions in their body and those that don’t.</p>
<p>Transmitting precondition compliance into the function body makes it easier to comply with a chain of preconditions. As a consequence, it becomes more tractable to use compile-time pre-conditions across a larger scope of the code base.</p>
<h3 id="post-conditions">Post conditions</h3>
<p>Yet we are not completely done, because the need to check preconditions implies that all contributors of any parameter are part of the game. Function’s return values themselves are contributors.</p>
<p>For example, one may invoke a function <code>f1()</code> requiring an argument <code>i>0</code>.<br>
The said argument may be provided as a return value of a previous function <code>f2()</code>.<br>
<code>f2()</code> might guarantee in its documentation that its return value is necessarily <code>>0</code>, hence is compliant,<br>
but the compiler doesn’t read the documentation. As far as it’s concerned, the return value could be any value the type allows.</p>
<p>The only way to express this situation is to save the return value into an intermediate variable,<br>
and then <code>assert()</code> or <code>assume()</code> it with the expected guarantee,<br>
then pass it to the second function.<br>
This is a bit more verbose than necessary, especially as <code>f2()</code> was already fulfilling the required preconditions. Besides, if <code>f2()</code>’s guarantees change, the local assumption will no longer be correct.</p>
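Concretely, the verbose workaround looks like this (hypothetical <code>f1()</code>/<code>f2()</code>, with <code>assert()</code> standing in for <code>assume()</code>):

```c
#include <assert.h>

static int f2(void) { return 7; }                      /* documented : always returns >0 */
static int f1(int i) { assert(i > 0); return i + 1; }  /* requires i>0 */

static int caller(void)
{
    int const r = f2();  /* intermediate variable, just to host the guarantee */
    assert(r > 0);       /* re-state f2()'s documented post-condition locally */
    return f1(r);        /* f1()'s precondition now visibly holds */
}
```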
<p>Guarantees on function’s outcome are also called post-conditions. The whole game is to pass this information to the compiler.</p>
<p>This could be done by bundling the post-conditions into the macro invoking the function.<br>
Unfortunately, that’s a bit hard to achieve with a portable macro, usual woes get in the way : single-evaluation, variable declarations and returning a value are hard to achieve together.</p>
<p>For this particular job, we are better off using an <code>inline</code> function.<br>
See <a href="https://godbolt.org/z/jNxO6b">this example</a> on godbolt.<br>
It works almost fine : the guarantees from first function are used to satisfy preconditions of second function. This works without the need to locally re-assess first function’s guarantees.<br>
As an exercise, removing the post-conditions from encapsulating <code>inline</code> function immediately triggers a warning on second invocation, proving it’s effective.</p>
<p>However, we just lost a big property by switching to an <code>inline</code> function : warnings now locate precondition violations <em>inside</em> the <code>inline</code> function, instead of the place where the function is invoked with incorrect arguments. Without this information, we just know there is a contract violation, but we don’t know <em>where</em>. This makes fixing it significantly more difficult.</p>
<p>To circumvent this issue, let’s use a macro again. This time we will combine a macro to express preconditions with an inlined function to express outcome guarantees. Here is <a href="https://godbolt.org/z/i4pdFH">an example</a>.<br>
This one gets it right on almost everything : it’s portable, conditions and guarantees are transferred to the compiler, which triggers a warning whenever a condition is not met, indicating the correct position of the problem.</p>
<p>There is just one last little problem : notice how the input parameter <code>v</code> gets evaluated twice in the macros. This is fine if <code>v</code> is a variable, but not if it’s a function. Something like <code>f1( f2(v) )</code> will evaluate <code>f2()</code> twice, which is bad, both for runtime and potentially for correctness, should <code>f2(v)</code>’s return value differ on second invocation.</p>
<p>It’s a pity because this problem was solved in the first proposal, using only an <code>inline</code> function. It just could not forward the position where a condition was broken. Now we are left with two incomplete proposals.</p>
<p>Let’s try it again, using a special kind of macro.<br>
<code>gcc</code> and by extension <code>clang</code> support a special kind of <a href="https://gcc.gnu.org/onlinedocs/gcc-5.2.0/gcc/Statement-Exprs.html">statement expression</a>, which makes it possible to create a compound statement able to return a value (its last expression). This construction is not portable. In general, I wouldn’t advocate it due to portability restrictions. But in this case, <code>EXPECT()</code> only works on <code>gcc</code> to begin with, so it doesn’t feel too bad to use a <code>gcc</code> specific construction. It simply must be disabled on non-gcc targets.</p>
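<p>As a minimal standalone illustration (a hypothetical <code>ABS_INT()</code> macro; <code>gcc</code>/<code>clang</code> only) :</p>

```c
/* A statement expression : the compound statement's last expression
 * becomes the value of the whole ({ ... }) construct.
 * The argument x is copied into _v, so it is evaluated only once. */
#define ABS_INT(x) ({ int _v = (x); _v < 0 ? -_v : _v; })
```

<p>The same shape is used below to wrap a function call between its precondition check and its postcondition assumption.</p>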
<p>The new formulation, reproduced below, works perfectly, and now <a href="https://godbolt.org/z/dukzpt">enforces the contract while avoiding the double-evaluation problem, and correctly indicates the position at which a condition is violated</a>, significantly improving diagnosis.</p>
<pre class=" language-c"><code class="prism language-c"><span class="token keyword">int</span> <span class="token function">positive_plus1</span><span class="token punctuation">(</span><span class="token keyword">int</span> v<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token macro property">#<span class="token directive keyword">define</span> positive_plus1_preconditions(v) ((v)>=0) </span><span class="token comment">// Let's first define the conditions. Name is long, because conditions must be unique to the function.</span>
<span class="token macro property">#<span class="token directive keyword">define</span> positive_plus1_postconditions(r) ((r)>0) </span><span class="token comment">// Convention : r is the return value. Only used once, but published close to the prototype, for documentation.</span>
<span class="token comment">// Encapsulating macro</span>
<span class="token comment">// The macro itself can be published in another place of the header,</span>
<span class="token comment">// to leave complete visibility to the prototype and its conditions.</span>
<span class="token comment">// This specific type of macro is called a statement-expression,</span>
<span class="token comment">// a non-portable construction supported by `gcc` (and `clang`)</span>
<span class="token comment">// It's okay in this case, because `EXPECT()` only works with `gcc` anyway.</span>
<span class="token comment">// But it will have to be disabled for non-gcc compilers.</span>
<span class="token macro property">#<span class="token directive keyword">define</span> positive_plus1(iv) ({ \
int const _v = iv; </span><span class="token comment">/* avoid double-evaluation of iv */</span> \
<span class="token keyword">int</span> _r<span class="token punctuation">;</span> \
<span class="token function">EXPECT</span><span class="token punctuation">(</span><span class="token function">positive_plus1_preconditions</span><span class="token punctuation">(</span>_v<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">/* also used within function body */</span> \
_r <span class="token operator">=</span> <span class="token function">positive_plus1</span><span class="token punctuation">(</span>_v<span class="token punctuation">)</span><span class="token punctuation">;</span> \
<span class="token function">assume</span><span class="token punctuation">(</span><span class="token function">positive_plus1_postconditions</span><span class="token punctuation">(</span>_r<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">/* only used here */</span> \
_r<span class="token punctuation">;</span> <span class="token comment">/* last expression is the return value of compound statement */</span> \
<span class="token punctuation">}</span><span class="token punctuation">)</span>
</code></pre>
<h3 id="summary">Summary</h3>
<p>That’s it. This construction gives all the tools necessary to use compiler-checked contracts in a C code base. Such strong checks increase the reliability of the code base, especially during refactoring exercises, by catching <em>at compile time</em> all potential contract breaches, and by <em>requiring</em> the programmer to deal with them, either through branches or at least through explicit <code>assert()</code> statements. This is a big step up from a situation where breaking conditions was completely silent at compilation, and <em>may</em> break during tests <em>if</em> <code>assert()</code> are not forgotten <em>and</em> the test case is able to break the condition.</p>
<p>It can be argued that applying this design pattern makes declaring functions more verbose, and it’s true. But this effort was supposed to be spent anyway, in a different form : as part of code documentation, and as part of runtime checks (the list of <code>assert()</code> within the function body). The difference is that the conditions are now expressed upfront, and are known to the compiler, which is more powerful.</p>
<p>Nonetheless, it would be even better if conditions could become part of the function signature, making the notation clearer, better supported, and by extension possibly compatible with automatic documentation or IDE’s context info, simplifying their presentation.<br>
There is currently a C++20 proposal, called <a href="https://en.cppreference.com/w/cpp/language/attributes/contract">attribute contract</a>, which plans to offer something close. Granted, it’s not C, and quite importantly it differs in subtle ways : it’s more focused on runtime checks. There is a specific <code>[[expects axiom: (...)]]</code> notation which seems closer to what is proposed in this article, because it doesn’t silently insert automatic runtime checks. However, as far as I know, it also doesn’t guarantee any compile-time check, reducing the contract to a simple <code>assume()</code>. This leaves the topic to each compiler’s discretion : a compiler may or may not pick it up, most likely resulting in significant behavior differences across compilers.</p>
<p>Meanwhile, the trick presented in this article is available right now, and doesn’t need to wait for any committee : it can be used immediately on existing code bases.</p>
<p>I hope this article will raise awareness of what compilers already know as part of their complex machinery, primarily oriented towards better runtime performance, and make a case for re-purposing a small part of it to improve correctness too.</p>
</div>
</body>
</html>Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com2tag:blogger.com,1999:blog-834134852788085492.post-71470528680018518062019-01-28T08:04:00.000+01:002019-08-20T16:54:09.398+02:00Compile-time tests<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>btrc: compile5 : compile-time tests</title>
<link rel="stylesheet" href="https://stackedit.io/style.css" />
</head>
<body class="stackedit">
<div class="stackedit__html"><h1 id="compile-time-tests">Compile-time tests</h1>
<p>A function generally operates on states and parameters. The function’s result is deemed valid if its inputs respect a number of (hopefully documented) conditions. It can be as simple as saying that a size should be positive, and a state should be already allocated.</p>
<p>The usual way to check if conditions are met is to <code>assert()</code> them, right at the beginning of the function. The <code>assert()</code> adds a runtime check, which is typically active during tests. The hope is that, if tests are thorough enough, any scenario which can violate the conditions will be found during tests, and fixed.</p>
<p>As one can already guess, this method is imperfect. Don’t get me wrong: adding <code>assert()</code> is <em>way way better</em> than not adding them, but the whole premise is to hope that tests will be good enough to find the bad paths leading to a condition violation, and one can never be sure that all bad paths were discovered.</p>
<p>In some cases, it’s possible to transfer a check at compile time instead.<br>
It only works for a subset of what can be checked. But whatever is validated at compilation stage carries much stronger guarantees : it’s like a mini-proof that always holds, for whatever state the program is in.</p>
<p>As a consequence, it eliminates the need for a runtime check, which saves CPU time and binary size.<br>
More importantly, it removes the need for a “failure code path”, which would require the caller to test and consider carefully what must be done when an incorrect condition happens. This leads to a corresponding simplification of the code, with massive maintenance benefits.<br>
On top of that, since the condition can be checked immediately during compilation or parsing, it’s right in the short feedback loop of the programmer, allowing failures to be identified and fixed quickly.</p>
<p>This set of benefits is too strong to miss. As a general rule, whatever can be checked at compile time should be.</p>
<h2 id="static-assert">static assert</h2>
<p>Invariant guarantees can be checked at compile time with a <a href="https://en.wikichip.org/wiki/c/assert.h/static_assert"><code>static_assert()</code></a>. Compilation will stop, with an error, if the invariant condition is not satisfied. A successful compilation necessarily means that the condition is always respected (for the compilation target).</p>
<p>A typical usage is to ensure that the <code>int</code> type of target system is wide enough. Or that some constants respect a pre-defined order. Or, as suggested in an earlier article, that a shell type is necessarily large enough to host its target type.</p>
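<p>For instance (a sketch with hypothetical constants; in <code>C11</code>, <code>static_assert</code> is provided by <code>&lt;assert.h&gt;</code>) :</p>

```c
#include <assert.h>   /* provides static_assert in C11 */
#include <limits.h>   /* CHAR_BIT */

/* int must be at least 16 bits wide on the compilation target */
static_assert(sizeof(int) * CHAR_BIT >= 16, "int type is too narrow");

/* hypothetical constants which must keep their relative order */
enum { STATE_INIT = 0, STATE_RUNNING = 1, STATE_DONE = 2 };
static_assert(STATE_INIT < STATE_RUNNING && STATE_RUNNING < STATE_DONE,
              "state constants are out of order");
```

<p>If any condition is false, compilation stops with the associated message, at the offending line.</p>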
<p>It has all the big advantages mentioned previously : no trace in the generated binary, no runtime nor space cost, no reliance on runtime tests to ensure that the condition is respected.</p>
<h3 id="c90-compatibility">C90 compatibility</h3>
<p><a href="https://en.wikichip.org/wiki/c/assert.h/static_assert"><code>static_assert()</code></a> is a special macro added in the <code>C11</code> standard. While most modern compilers are compatible with this version of the standard, if you plan on making your code portable on a wider set of compilers, it’s a good thing to consider an alternative which is compatible with older variants, such as <code>C90</code>.</p>
<p>Fortunately, it’s not that hard. <code>static_assert()</code> started its life as a “compiler trick”, and many variants can be found over the Internet. The basic idea is to transform a condition into an invalid construction, so that the compiler <em>must</em> issue an error at the position of the <code>static_assert()</code>. Typical tricks include :</p>
<ul>
<li>defining an <code>enum</code> value as a constant divided by <code>0</code></li>
<li>defining a table whose size is negative</li>
</ul>
<p>For example :</p>
<pre class=" language-c"><code class="prism language-c"><span class="token macro property">#<span class="token directive keyword">define</span> STATIC_ASSERT(COND,MSG) typedef char static_assert_##MSG[(COND)?1:-1]</span>
</code></pre>
<p>One can find multiple versions, with different limitations. The macro above has the following ones :</p>
<ul>
<li>cannot be expressed in a block after the first statement, for <code>C90</code> compatibility (declarations before statements)</li>
<li>requires different error messages to distinguish multiple assertions</li>
<li>requires the error message to be a single uninterrupted word, without double quotes, differing from the <code>C11</code> version</li>
</ul>
<p>The 1st restriction can be circumvented by putting brackets around <code>{ static_assert(); }</code> whenever needed.<br>
The 2nd one can be improved by adding a <code>__LINE__</code> macro as part of the name, thus making it less probable for two definitions to use exactly the same name. The macro definition becomes more complex though.<br>
The last restriction is more concerning: it’s a strong limitation, directly incompatible with <code>C11</code>.</p>
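<p>For reference, the <code>__LINE__</code> mitigation can be sketched as follows (one hypothetical variant among the many circulating online) :</p>

```c
/* Two levels of pasting, so that __LINE__ expands before concatenation */
#define SA_PASTE_(a,b) a##b
#define SA_PASTE(a,b)  SA_PASTE_(a,b)
/* The typedef'd array type has a negative size when COND is false,
 * forcing a compilation error at that line */
#define STATIC_ASSERT_C90(COND) \
    typedef char SA_PASTE(static_assertion_line_, __LINE__)[(COND) ? 1 : -1]

STATIC_ASSERT_C90(sizeof(char) == 1);   /* always true : compiles silently */
```

<p>Two assertions may still collide if they end up on the same line number in different files included together, but in practice this is rare.</p>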
<p>That’s why I rather recommend <a href="https://godbolt.org/z/K9RvWS">this more complete version by Joseph Quinsey</a>, which makes it possible to invoke the macro the same way as the <code>C11</code> version, allowing an easy switch from one to the other. The declaration / statement limitation for <code>C90</code> is still present, but as mentioned, easily mitigated.</p>
<h3 id="limitations">Limitations</h3>
<p>A huge limitation is that static asserts can only reason about constants, whose values are known at compile time.</p>
<p>Constants, in the <code>C</code> dialect, regroup a very restricted set :</p>
<ul>
<li>Literal values, e.g. <code>222</code>.</li>
<li>Macros which resolve to literal values, e.g. <code>#define value 18</code></li>
<li><code>enum</code> values</li>
<li><code>sizeof()</code> results</li>
<li>Mathematical operations over constants which can be solved at compile time, e.g. <code>4+1</code> or even <code>((4+3) << 2) / 18</code>.</li>
</ul>
<p>As a counter-example, one might believe that <code>const int i = 1;</code> is a constant, as implied by the qualifier <code>const</code>. But it’s a misnomer : it does not define a constant, merely a “read-only” (immutable) value.</p>
<p>Therefore it’s <a href="https://godbolt.org/z/SO8FLZ">not possible to <code>static_assert()</code> conditions on variables</a>, not even <code>const</code> ones. It’s also not possible to <a href="https://godbolt.org/z/Q0K5wV">express conditions using functions</a>, not even pure ones (only macro replacements are valid).</p>
<p>This obviously strongly restrains the set of conditions that can be expressed with a <code>static_assert()</code>.</p>
<p>Nonetheless, every time <code>static_assert()</code> is a valid option, it’s recommended to use it. It’s an efficient zero-cost abstraction which guarantees an invariant, contributing to safer code.</p>
<h2 id="arbitrary-conditions-validated-at-compile-time">Arbitrary conditions validated at compile time</h2>
<p>Checking an arbitrary condition at compile time? like a runtime <code>assert()</code> ? That sounds preposterous.<br>
Yet, that’s exactly what we are going to see in this paragraph.</p>
<p>The question asked changes in a subtle way : it’s no longer “prove that the condition holds given current value(s) in memory”, but rather “prove that the condition can never be false”, which is a much stronger statement.</p>
<p>The benefits are similar to <code>static_assert()</code> : as the condition is guaranteed to be met, no need to check it at run time, hence no runtime cost, no need for a failure path, no reliance on tests to detect bad cases, etc.</p>
<p>Enforcing such a strong property may seem a bit overwhelming. However, that’s exactly what is <em>already required</em> by the standard, for any operation featuring <a href="https://en.wikipedia.org/wiki/Undefined_behavior">undefined behavior</a> as a consequence of violation of their <a href="https://alexpolt.github.io/contract.html">narrow contract</a>.<br>
The real problem is that the full responsibility of knowing and respecting the contract is transferred onto the programmer, who receives, by default, no compile-time signal to warn when these conditions are broken.</p>
<p>Compile-time condition validation reverses this logic, and ensures that a condition is <em>always met</em> if it passes compilation. This is a big change, with corresponding safety benefits.</p>
<p>This method is not suitable for situations determined by some unpredictable runtime event. For example, it’s not possible to guarantee that a certain file <em>will</em> exist at runtime, so trying to open a file always requires a runtime check.</p>
<p>But there are a ton of conditions that the programmer expects to be always true, and whose violation necessarily constitutes a programming error. These are our targets.</p>
<h4 id="example">Example</h4>
<p>Let’s give a simple example :<br>
dereferencing a pointer requires that, as a bare minimum, the pointer is not <code>NULL</code>. It’s not a loose statement, like “this pointer is probably not <code>NULL</code> in general”, it must be 100% true, otherwise, <a href="https://en.wikipedia.org/wiki/Undefined_behavior">undefined behavior</a> is invoked.<br>
How to ensure this property then ?</p>
<p>Simple : test if the pointer is <code>NULL</code>, and if it is, do not dereference it, and branch elsewhere.<br>
<strong>Passing the branch test guarantees the pointer is now non-<code>NULL</code></strong>.</p>
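<p>In code form (with <code>safe_strlen()</code> as a hypothetical helper) :</p>

```c
#include <stddef.h>
#include <string.h>

/* The branch is the proof : past it, s can no longer be NULL */
static size_t safe_strlen(const char* s)
{
    if (s == NULL) return 0;   /* explicit failure path */
    return strlen(s);          /* s is guaranteed non-NULL here */
}
```

<p>After the <code>if</code>, the compiler’s value analysis knows <code>s != NULL</code>, so the dereference inside <code>strlen()</code> is safe on every path.</p>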
<p>This example is trivial, yet very applicable.<br>
It’s extremely common to forget such a test, since there’s no warning for the programmer. A <code>NULL</code> pointer can happen due to exceptional conditions which are difficult to trigger during tests, such as a rare <code>malloc()</code> failure.</p>
<p>And that’s just a beginning : most functions and operations feature a set of conditions to be respected for their behavior and result to be correct. Want to divide ? better be by non-zero. Want to add signed values ? Well, be sure they don’t overflow. Let’s call <code>memcpy()</code> ? First, ensure memory segments are allocated and don’t overlap.<br>
And on, and on, and on.</p>
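<p>For instance, the signed-addition contract above can be enforced with an explicit pre-check (<code>safe_add()</code> is a hypothetical helper) :</p>

```c
#include <assert.h>
#include <limits.h>

/* Refuses to add when the mathematical result would not fit in an int */
static int safe_add(int a, int b)
{
    assert(!(b > 0 && a > INT_MAX - b));   /* would overflow upward */
    assert(!(b < 0 && a < INT_MIN - b));   /* would overflow downward */
    return a + b;
}
```

<p>Note that the checks are written to avoid overflowing inside the check itself : <code>INT_MAX - b</code> is always representable when <code>b &gt; 0</code>.</p>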
<p>While it’s sometimes possible to <code>assert()</code> some of these conditions, it’s not great, because in the absence of compilation warnings, contract violations can still happen at runtime. And while the <code>assert()</code>, if enabled, will prevent the situation from degenerating into <a href="https://en.wikipedia.org/wiki/Undefined_behavior">undefined behavior</a>, it still translates into an abrupt <code>abort()</code>, which is another form of vulnerability.</p>
<p>A better solution is to ensure that the condition always holds. This is where a compile-time guarantee comes in.</p>
<h4 id="solution">Solution</h4>
<p>We want the compiler to emit a warning whenever a condition cannot be guaranteed to be true. Technically, this is almost like an <code>assert()</code>, though without a trace in the generated binary.</p>
<p>This outcome is already common : whenever an <code>assert()</code> can be proven to be always true, the compiler will remove it, through a fairly common optimization stage called Dead Code Elimination (DCE).</p>
<p>Therefore, the idea is to design an <code>assert()</code> that must be removed from the final binary through DCE, and which emits a warning if it is not.</p>
<p>Since no such instruction exists in the base language, we’ll have to rely on some compiler-specific extensions. <code>gcc</code> for example offers a <a href="https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html">function attribute</a> which does exactly that :</p>
<blockquote>
<p><code>warning ("message")</code><br>
If the <code>warning</code> attribute is used on a function declaration and a call to such a function is not eliminated through dead code elimination or other optimizations, a warning that includes <code>"message"</code> is diagnosed. This is useful for compile-time checking.</p>
</blockquote>
<p>This makes it possible to create this macro :</p>
<pre class=" language-c"><code class="prism language-c"><span class="token function">__attribute__</span><span class="token punctuation">(</span><span class="token punctuation">(</span>noinline<span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token function">__attribute__</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token function">warning</span><span class="token punctuation">(</span><span class="token string">"condition not guaranteed"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token keyword">static</span> <span class="token keyword">void</span> <span class="token function">never_reach</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token function">abort</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// must define a side effect, to not be optimized away</span>
<span class="token comment">// EXPECT() : will trigger a warning if the condition is not guaranteed to be true</span>
<span class="token macro property">#<span class="token directive keyword">define</span> EXPECT(c) (void)((c) ? (void)0 : never_reach())</span>
</code></pre>
<p>The resulting macro is called <code>EXPECT()</code>, for consistency with a recent C++20 proposal, called <a href="https://en.cppreference.com/w/cpp/language/attributes/contract">attribute contract</a>, which suggests the notation <code>[[expects: expression]]</code> to achieve something similar (though not strictly identical, but that’s a later topic).</p>
<p><code>EXPECT()</code> is designed to be used the same way as <code>assert()</code>, the difference being that it will trigger a warning at compile time whenever it cannot be optimized away, underlining that the condition cannot be proven to be always true.</p>
<h3 id="limitations-1">Limitations</h3>
<p>It would be too easy if one could just start writing <code>EXPECT()</code> everywhere as an <code>assert()</code> replacement. Beyond the fact that it can only be used to test programming invariants, there are additional limitations.</p>
<p>First, this version of the <code>EXPECT()</code> macro only works well on <code>gcc</code>. I have not found a good enough equivalent for other compilers, though it can be emulated using other tricks, such as an <a href="https://godbolt.org/z/OyoUjL">incorrect assembler statement</a>, or linking to some non-existing function, both of which feature significant limitations : they do not display the line at which the condition is broken, or do not work when it’s not a program with a <code>main()</code> function.</p>
<p>Second, checking the condition is tied to compiler’s capability to combine <a href="https://en.wikipedia.org/wiki/Value_range_analysis">Value Range Analysis</a> with <a href="https://en.wikipedia.org/wiki/Dead_code_elimination">Dead Code Elimination</a>. That means the compiler must use at least a bit of optimization. These optimizations are not too intense, so <code>-O1</code> is generally enough. Higher levels can make a difference if they increase the amount of inlining (see below).</p>
<p>However, <code>-O0</code> definitely does not cut it, and all <code>EXPECT()</code> will fail. Therefore, <code>EXPECT()</code> must be disabled when compiling with <code>-O0</code>. <code>-O0</code> can be used for fast debug builds for example, so it cannot be ruled out. This issue makes it impossible to keep <code>EXPECT()</code> always active by default, so its activation must be tied to some explicit build macro.</p>
<p>Third, Value Range Analysis is limited, and can only track function-local changes. It cannot cross function boundaries.</p>
<p>There is a substantial exception to this last rule for <code>inline</code> functions : in these cases, since the function body will be included into the caller’s body, <code>EXPECT()</code> conditions will be applied to both sides of the interface, doing a great job at checking conditions and inheriting VRA outcomes for optimization.</p>
<p><strong><code>inline</code> functions are likely the best place to start introducing <code>EXPECT()</code></strong> into an existing code base.</p>
<h2 id="function-pre-conditions">Function pre-conditions</h2>
<p>When a function is not <code>inline</code>, the situation becomes more complex, and <code>EXPECT()</code> must be used differently compared to <code>assert()</code>.</p>
<p>For example, a typical way to check that input conditions are respected is to <code>assert()</code> them at the beginning of the function. This wouldn’t work with <code>EXPECT()</code>.</p>
<p>Since VRA does not cross function boundaries, <code>EXPECT()</code> will not know that the function is called with bad parameters. Actually, it will also not know that the function is called with good parameters. With no ability to make any assumption on function parameters, <code>EXPECT()</code> will just always fail.</p>
<pre class=" language-c"><code class="prism language-c"><span class="token comment">// Never call with `v==0`</span>
<span class="token keyword">int</span> <span class="token function">division</span><span class="token punctuation">(</span><span class="token keyword">int</span> v<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
<span class="token function">EXPECT</span><span class="token punctuation">(</span>v<span class="token operator">!=</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// This condition will always fail :</span>
<span class="token comment">// the compiler cannot make any assumption about `v` value.</span>
<span class="token keyword">return</span> <span class="token number">1</span><span class="token operator">/</span>v<span class="token punctuation">;</span>
<span class="token punctuation">}</span>
<span class="token keyword">int</span> <span class="token function">lets_call_division_zero</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
<span class="token keyword">return</span> <span class="token function">division</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// No warning here, though condition is violated</span>
<span class="token punctuation">}</span>
</code></pre>
<p>To be useful, <code>EXPECT()</code> must be declared <em>on the caller side</em>, where it can properly check input conditions.<br>
Yet, having to spell input conditions on the caller side at every invocation is cumbersome. Worse, it’s too difficult to maintain: if conditions change, all invocations must be updated !</p>
<p>A better solution is to spell all conditions in a single place, and encapsulate them as part of the invocation.</p>
<pre class=" language-c"><code class="prism language-c"><span class="token comment">// Never call with `v==0`</span>
<span class="token keyword">int</span> <span class="token function">division</span><span class="token punctuation">(</span><span class="token keyword">int</span> v<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
<span class="token keyword">return</span> <span class="token number">1</span><span class="token operator">/</span>v<span class="token punctuation">;</span>
<span class="token punctuation">}</span>
<span class="token comment">// The macro has same name as the function, so it masks it.</span>
<span class="token comment">// It encapsulates all preconditions, and delivers the same result as the function.</span>
<span class="token macro property">#<span class="token directive keyword">define</span> division(v) ( EXPECT(v!=0), division(v) )</span>
<span class="token keyword">int</span> <span class="token function">lets_call_division_zero</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
<span class="token keyword">return</span> <span class="token function">division</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// Now, this one gets flagged right here</span>
<span class="token punctuation">}</span>
<span class="token keyword">int</span> <span class="token function">lets_call_division_by_something</span><span class="token punctuation">(</span><span class="token keyword">int</span> divisor<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
<span class="token keyword">return</span> <span class="token function">division</span><span class="token punctuation">(</span>divisor<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// This one gets flagged too : there is no guarantee that it is not 0 !</span>
<span class="token punctuation">}</span>
<span class="token keyword">int</span> <span class="token function">lets_divide_and_pay_attention_now</span><span class="token punctuation">(</span><span class="token keyword">int</span> divisor<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
<span class="token keyword">if</span> <span class="token punctuation">(</span>divisor <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
<span class="token keyword">return</span> <span class="token function">division</span><span class="token punctuation">(</span>divisor<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// This one is okay : no warning</span>
<span class="token punctuation">}</span>
</code></pre>
<p>Here are some more <a href="https://godbolt.org/z/E2QgA0">example usages</a>. Note how <code>EXPECT()</code> are combined with a function signature into a macro, so that compile time checks get triggered every time the function is called.</p>
<h3 id="limitations-2">Limitations</h3>
<p>This construction solves the issue on the caller side, which is the most important one.</p>
<p>You may note that the macro features a typical flaw : its argument <code>v</code> is present twice. It means that, if <code>v</code> is actually a function, it’s going to be invoked twice. In some cases, like <code>rand()</code>, both invocations may even produce different results.</p>
<p>However, at this stage, it’s impossible to successfully invoke the macro using a function as argument to begin with.<br>
That’s because the function’s return value has no guarantee attached beyond its type.<br>
So, if the function is <code>int f()</code>, its return value could be any value, from <code>INT_MIN</code> to <code>INT_MAX</code>.<br>
As a consequence, no function’s return value can ever comply with any condition. It will necessarily generate a warning.</p>
<p>The encapsulating macro can only check conditions on variables, and it will only accept variables which are <em>guaranteed</em> to respect the conditions. If a single one <em>may</em> break any condition, a warning is issued.</p>
<p>However, pre-conditions remain unknown to the function body itself. This is an issue, because without that knowledge, it is necessary to re-express the conditions within the function body, which is an unwelcome burden.</p>
<p>A quick work-around is to express these guarantees inside the function body using <code>assert()</code>. This is, by the way, what should have been done anyway.</p>
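<p>Continuing the <code>division()</code> example, the body re-states its precondition (a sketch) :</p>

```c
#include <assert.h>

/* The precondition is re-expressed inside the body : it documents the
 * contract, and guides Value Range Analysis within the function */
static int division(int v)
{
    assert(v != 0);   /* mirrors the EXPECT(v!=0) of the caller-side macro */
    return 1 / v;
}
```

<p>The <code>assert()</code> tells the compiler, locally, that <code>v</code> is non-zero, so any <code>EXPECT()</code> depending on that fact inside the body can be optimized away.</p>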
<p>An associated downside is that ensuring that <code>EXPECT()</code> conditions are respected using <code>assert()</code> presumes that <code>assert()</code> are present and active in source code, to guide the Value Range Analysis. If <code>assert()</code> are disabled, their corresponding <code>EXPECT()</code> will fail.<br>
This suggests that <code>EXPECT()</code> can only be checked in debug builds, and with optimization enabled (<code>-O1</code>).</p>
<p>With all these <code>assert()</code> back, it seems like these compile-time checks are purely redundant, hence almost useless.</p>
<p>Not quite. It’s true that so far, it has not reduced the amount of <code>assert()</code> present in the code, but the compiler now actively checks expressed pre-conditions, and mandates the presence of <code>assert()</code> for every condition that the local code does not explicitly rule out. This is a step up : risks of contract violation are underlined early, and it’s no longer possible to “forget” an <code>assert()</code>. As a side effect, tests will also catch condition violations sooner, leading to more focused and shorter debug sessions. That’s a notable improvement.</p>
<p>It nonetheless feels kind of incomplete. One missing aspect is an ability to transfer pre-conditions from the calling site to the function body, so that they can be re-used to satisfy a chain of pre-conditions.<br>
This capability requires another complementary tool. We’ll see that in the next blog post.</p>
</div>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>btrc: compile4 : compiler warnings</title>
<link rel="stylesheet" href="https://stackedit.io/style.css" />
</head>
<body class="stackedit">
<div class="stackedit__html">
<p>One way to improve C code quality is to reduce the number of strange constructions that the standard does not explicitly forbid. This greatly helps code reviewers, who want fewer surprises while trying to understand what a segment of source code achieves and impacts.</p>
<p>A straightforward way to create such a “constrained” C variant is to add compiler-specific warning flags. They will trigger warnings on detecting certain constructions considered dubious, if not downright dangerous.</p>
<p>A simple example is the condition <code>if (i=1) {</code>. This test seems to check whether <code>i</code> equals <code>1</code>, but that’s not what it does : it <em>assigns</em> the value <code>1</code> to <code>i</code>. As a consequence, the condition is always true. This is <em>most likely</em> a typo : the programmer probably wanted to express the equality test <code>if (i==1) {</code>. Yet, it’s not invalid, strictly speaking. So a compiler is <a href="https://godbolt.org/z/AFzE-u">allowed to accept it at face value</a> and generate the corresponding assembly without any warning. That may take a while to debug …</p>
<p>The <code>if (i=1) {</code> typo statement is well known, and nowadays it triggers a warning in most compilers <a href="https://godbolt.org/z/1c5LhW">with the help of warning flags</a>.<br>
At the very least, the warning is an invitation to spell the intention more clearly.<br>
Sometimes, it was a genuine error, and the compiler just helped us catch this issue before it ever reaches production, saving some considerable debug time.</p>
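<p>To make the difference concrete, here is a small sketch (hypothetical function names) contrasting the typo with the intended test; with <code>gcc -Wall</code>, the first form triggers a warning suggesting parentheses around the assignment:</p>

```c
/* The typo: `=` assigns 1 to i, so the branch is taken unconditionally.
 * gcc/clang report it with -Wall (-Wparentheses). */
static int is_one_typo(int i)
{
    if (i = 1) {    /* always true, and i is clobbered */
        return 1;
    }
    return 0;
}

/* The intended equality test. */
static int is_one_fixed(int i)
{
    if (i == 1) {
        return 1;
    }
    return 0;
}
```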
<p>Multiplying the number of flags will increase the number of warnings. But sifting through a large list of warnings to find which ones are interesting and which ones are merely informational can be daunting. Moreover, collaborative programming requires simple rules that everyone can abide by.</p>
<p>Using warnings should be coupled with a strict “zero-warning” policy. Every warning must be considered an error to be dealt with immediately. This is a clean signal that everyone understands, and that any CI environment can act upon. If a warning message is considered not fixable, or not desirable to fix, it’s preferable to remove the associated flag from the build chain.</p>
<p>On <code>gcc</code>, ensuring that no warning can be ignored can be enforced by the <code>-Werror</code> flag, which makes any warning a fatal error. Visual Studio has <a href="https://i2.wp.com/dailydotnettips.com/wp-content/uploads/2016/03/image11.png">“treat warnings as errors”</a>.<br>
More complex policies are possible, such as activating more warnings and making only some of them fatal (for example <code>-Werror=vla</code>), but they make the setup more complex, and logs more difficult to follow.</p>
<p>As a consequence, it’s not a good idea to just “enable everything”. Each additional flag increases the number of false positives to deal with. When too many warnings are generated, fixing them feels like a discouraging, low-value task, leading to its abandonment. Only warnings which bring some value deserve to be tracked, fixed, and continuously maintained. Therefore, it is preferable to only add a flag when its benefit is clearly understood.</p>
<p>That being said, the best moment to crank up the warning level is at the beginning of a project. What tends to be difficult is to <em>add</em> new flags to an existing project, because new flags will reveal tons of programming patterns that were silently allowed and must now be avoided, anywhere within the repository. On the other hand, keeping an existing code base clean is much simpler, as issues appear only in new commits, and can therefore be located and fixed quickly.</p>
<h2 id="ms-visual">MS Visual</h2>
<p>My programming habits have largely switched from Windows to Unix these last few years, so I’m no longer up to date on this topic.<br>
By default, Visual organizes its list of optional warnings into “levels”. The higher the level, the more warnings it generates. It’s also possible to opt in for a single specific warning, but I don’t have enough experience to comment on that usage.</p>
<p>By default, Visual compiler uses <a href="https://docs.microsoft.com/en-us/cpp/build/reference/compiler-option-warning-level?view=vs-2017">level 1 on command line , and level 3 on IDE</a>.<br>
Level 3 is already pretty good, but I recommend aiming for level <strong>4</strong> if possible. That level will catch additional tricky corner cases, making the code cleaner and more portable.<br>
Obviously, on an existing project, move up progressively towards higher levels, as each of them will generate more warnings to clean up.</p>
<p>The exact way to change the warning level may depend on the IDE version. On command line, it’s always <code>/W4</code>, so that one is pretty clear. On IDE, it’s generally accessible in the <code>properties->C</code> tab, which is one of the first displayed, <a href="https://www.learncpp.com/cpp-tutorial/configuring-your-compiler-warning-and-error-levels/">as shown here</a>.</p>
<p>Do not use <code>/Wall</code> as part of your regular build process. It contains too many warnings of “informational” value, which are not meant to be suppressed, hence will continuously drown the signal and make “zero warning policy” impossible.</p>
<h2 id="gcc-and-clang"><code>gcc</code> and <code>clang</code></h2>
<p><code>gcc</code> and by imitation <code>clang</code> offer a command line experience with a large list of compatible flags for warnings.<br>
Over time, I’ve developed my own selection, which has become pretty long. I would recommend it to any code base. I’m going to detail it below. It is by no means a “final” or “ultimate” version. The list can always evolve, integrating more flags, either because I missed them, because they end up being more useful than I initially anticipated, or because they become more broadly supported.</p>
<p>For simplicity, I tend to concentrate on flags that are well supported by <code>gcc</code> and <code>clang</code>, and have been present for a few revisions. Flags which only work on “the latest version of X” are not considered in this list, because they can cause trouble for compilation on targets without version X. This issue can be solved by adding yet another machinery to maintain version-specific flags, complete with its own set of problems, but I would not recommend starting with such complexity.</p>
<p>If your project does not include those flags yet, I suggest enabling them one at a time. A project developed without a specific flag is likely to have used the flagged pattern in many places. It’s important to clean one flag completely before moving to the next one; otherwise, the list of warnings to fix becomes so large that it will seem insurmountable. Whenever that happens, just drop the flag for the time being; you’ll come back to it later.</p>
<h3 id="basics">Basics</h3>
<ul>
<li>
<p><code>-Wall</code> : This is the “base” warning level for <code>gcc</code>/<code>clang</code>. In contrast to what its name implies, it does not enable “all” warnings, far from it, but a fairly large set of flags that the compiler team believes is generally safe to follow. For a detailed list of what it includes, you can <a href="https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html">consult this page</a>, which is likely applicable to the latest compiler version. The exact list of flags evolves with the specific version of the compiler. It even differs between <code>gcc</code> and <code>clang</code>. That’s okay, because the flag itself doesn’t change.<br>
I recommend starting with this flag, and getting to the bottom of it before moving on to more flags. Should the generated list of warnings be overwhelming, you can break it down into a more narrow set of flags, or selectively disable a few annoying warnings with <code>-Wno-###</code>, then plan to re-enable them progressively later.</p>
</li>
<li>
<p><code>-Wextra</code> : This is the second level for <code>gcc</code> and <code>clang</code>. It includes an additional set of flags, which constrains the code style further, improving maintainability. For example, this level will <a href="https://godbolt.org/z/9Axq8E">raise a warning</a> whenever a <a href="https://godbolt.org/z/pGfevo"><code>switch() { case: }</code> uses a fall-through implicitly</a>, which is generally (but not always) a mistake.<br>
This flag used to be called <code>-W</code>, but I recommend the <code>-Wextra</code> form, which is more explicit.</p>
</li>
</ul>
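<p>As an illustration of the fall-through case (a sketch with hypothetical names), the missing <code>break</code> below is exactly what <code>-Wextra</code> (via <code>-Wimplicit-fallthrough</code> on recent <code>gcc</code>) reports:</p>

```c
/* The missing `break` makes case 1 fall through into case 2,
 * which -Wextra flags on gcc/clang. */
static int price_buggy(int category)
{
    int price = 0;
    switch (category) {
    case 1:
        price += 10;      /* falls through: category 1 also adds 20 */
    case 2:
        price += 20;
        break;
    default:
        break;
    }
    return price;
}

static int price_fixed(int category)
{
    int price = 0;
    switch (category) {
    case 1:
        price += 10;
        break;            /* explicit break silences the warning */
    case 2:
        price += 20;
        break;
    default:
        break;
    }
    return price;
}
```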
<h3 id="correctness">Correctness</h3>
<ul>
<li>
<p><code>-Wcast-qual</code> : This flag ensures that a QUALifier is respected during a cast operation. This is especially important for the <code>const</code> “read-only” qualifier: it ensures that a pointer to a read-only area <a href="https://godbolt.org/z/09ejGE">cannot</a> be <a href="https://godbolt.org/z/T_6Ga9">silently transformed into another pointer with write capability</a>, which is quite an essential guarantee. I don’t quite understand how this ended up as an optional warning, instead of a compulsory property of the language.</p>
</li>
<li>
<p><code>-Wcast-align</code> : the C standard requires that a type must be stored at an address suitable for its alignment restriction. For example, on 32-bit systems, an <code>int</code> must be stored at an address which is a multiple of 4. This restriction tends to be forgotten nowadays, because x86 cpus have always been good at dealing with unaligned memory accesses, and ARM ones have become better at this game (they used to be terrible). But it’s still important to respect this property, for portability, for performance (avoiding inter-page accesses), and for compatibility with deep transformations such as auto-vectorization. Casting can unintentionally violate this condition. A typical example is <a href="https://godbolt.org/z/LIu8pK">casting a memory area previously reserved as a table of <code>char</code></a>, hence without any alignment restriction, in order to store <code>int</code> values, which require an alignment of 4. <code>-Wcast-align</code> <a href="https://godbolt.org/z/pcPHRw">will detect the violation</a>, and fixing it will make sure the code respects alignment restrictions, making it more portable.</p>
</li>
<li>
<p><code>-Wstrict-aliasing</code> : <a href="https://www.approxion.com/pointers-c-part-iii-strict-aliasing-rule/">Strict aliasing</a> is a complex and poorly understood rule. It states that, in order to achieve better performance, compilers are allowed to consider that 2 pointers of different types never reference the same address space, so their contents cannot “collide”. If they nonetheless do, it’s <a href="https://en.wikipedia.org/wiki/Undefined_behavior">undefined behavior</a>, hence anything can happen unpredictably.<br>
To ensure this rule is not violated, compilers may optionally offer some code analysis capabilities, that will flag suspicious constructions. <code>gcc</code> offers <code>-Wstrict-aliasing</code>, with various levels of caution, <code>1</code> being the most paranoid.<br>
Issues related to strict aliasing violations only show up in optimized code, and are among the most difficult to debug. It’s best to avoid them. I recommend using this flag at its maximum setting. If it generates too much noise, try more permissive levels. <code>-Wstrict-aliasing=3</code> is already included as part of <code>-Wall</code>, so if <code>-Wall</code> is already clean, the next logical step is level <code>2</code>, then <code>1</code>.<br>
One beneficial side-effect of this flag is that it reinforces the separation of types, which is a safer practice. Cross-casting a memory region with pointers of different types is no longer an easy option, as it gets immediately flagged by the compiler. There are still ways to achieve this, primarily through the use of <code>void*</code> memory segments, which act as wildcards. But the extra care required is in itself protective, and should remind the developer of the risks involved.</p>
</li>
<li>
<p><code>-Wpointer-arith</code> forbids pointer arithmetic on <code>void*</code> or function pointers. C unfortunately lacks the concept of “memory unit”, so a <code>void*</code> is not a pointer to an address: it’s a pointer to an object “we don’t know anything about”. Pointer arithmetic is closely related to the concept of table, and adding <code>+1</code> is always relative to the size of the table element (which must be a constant). With <code>void*</code>, we have no idea what this element size could be, so it’s not possible to <code>+1</code> it, nor to do more complex pointer arithmetic.<br>
To perform operation on bytes, it’s necessary to use a pointer to a byte type, be it <code>char*</code>, <code>unsigned char*</code> or <code>int8_t*</code>.<br>
This is a strict interpretation of the standard, and helps make the resulting code more portable.</p>
</li>
</ul>
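<p>The alignment and aliasing flags above converge on the same practical advice: do not reinterpret a byte buffer by casting its pointer. A minimal sketch (hypothetical function) of the portable alternative, which keeps <code>-Wcast-align</code> quiet:</p>

```c
#include <stdint.h>
#include <string.h>

/* Read an int32_t from an arbitrary byte position.
 * Casting `buffer` to `const int32_t*` would be flagged by -Wcast-align
 * (and is a strict-aliasing hazard); memcpy expresses the same intent
 * portably, and compilers turn it into a single load where possible. */
static int32_t read32(const char *buffer)
{
    int32_t value;
    memcpy(&value, buffer, sizeof(value));
    return value;
}
```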
<h3 id="variable-declaration">Variable declaration</h3>
<ul>
<li>
<p><code>-Winit-self</code> : prevents a fairly silly corner case, where a variable is initialized with itself, such as <code>int i = i+1;</code>, which cannot be right. <code>clang</code> and <code>g++</code> make it part of <code>-Wall</code>, but not <code>gcc</code>.</p>
</li>
<li>
<p><code>-Wshadow</code> : A variable <code>v</code> declared at a deep nesting level shadows any other variable with the same name <code>v</code> declared at an upper level. This means that invoking <code>v</code> at the deeper level will target the deeper <code>v</code>. This is legal from a C standard perspective, but it’s considered bad practice, because it’s confusing for the reviewer: 2 different variables with different roles and lifetimes now carry the same name. It’s better to differentiate them, by using different names.<br>
<em>Sidenote</em> : name shadowing can be annoying when using a library which unfortunately defines very common symbol names as part of its interface. Don’t forget that the C namespace is global. For this reason, whenever publishing an API, always ensure that no public symbol is too “common” (such as <code>i</code>, <code>min</code>, <code>max</code>, etc.). At a minimum, add a <code>PREFIX_</code> to the public symbol name, so that opportunities of collision get drastically reduced.</p>
</li>
<li>
<p><code>-Wswitch-enum</code> : This flag ensures that, in a <code>switch(enum) { case: }</code> construction, all declared values of the <code>enum</code> have a <code>case:</code> branch. This can be useful to ensure that no <code>enum</code> value has been forgotten (even if there is a <code>default:</code> branch down the list to deal with them). Forgetting an <code>enum</code> value is a fairly common scenario when the <code>enum</code> list changes, typically by adding an element to it. The flag will issue a warning on all relevant <code>switch() { case: }</code>, simplifying code traversal to ensure that no case has been missed.</p>
</li>
</ul>
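<p>A short sketch of shadowing (hypothetical functions): the inner <code>v</code> silently hijacks every access, which <code>-Wshadow</code> reports; renaming the variables makes each role explicit:</p>

```c
/* The inner `v` shadows the outer one: every write inside the loop
 * lands on the inner variable, and the function always returns 0.
 * -Wshadow flags the inner declaration. */
static int sum_shadowed(int n)
{
    int v = 0;                 /* outer accumulator */
    for (int i = 0; i < n; i++) {
        int v = i * 2;         /* warning: shadows the outer `v` */
        (void)v;
    }
    return v;                  /* still 0 */
}

/* Distinct names, distinct roles: the intent is now unambiguous. */
static int sum_fixed(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        int doubled = i * 2;
        total += doubled;
    }
    return total;
}
```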
<h3 id="functions">Functions</h3>
<ul>
<li>
<p><code>-Wstrict-prototypes</code> : historically, C functions used to be declared with just their name, without even detailing their parameters. This is considered bad practice nowadays, and this flag will ensure that a function is declared with a fully formed prototype, including all parameter types.<br>
A common side effect happens for functions without any parameter. Declaring them as <code>int function()</code> seems to mean “this function takes no argument”, but that’s not correct. Due to this historical background, it actually means “this function may have any number of arguments of any type, it’s just not documented”. Such a definition limits the effectiveness of the compiler in controlling the validity of an invocation, so it’s bad, and this flag will issue a warning. The correct way to state that a function has no (zero) argument is <code>int function(void)</code>.</p>
</li>
<li>
<p><code>-Wmissing-prototypes</code> : this flag enforces that any public function (non-<code>static</code>) has a declaration somewhere. It’s easy to game that condition by just writing a prototype declaration right in front of the function definition itself, but it misses the point : this flag will help find functions which are (likely) no longer useful.<br>
The problem with public functions is that the compiler has no way to ensure they are not used anymore. So it will generate them, and wait for the linking stage to know more. In a library, such “ghost” functions will be present, occupying valuable space, and more importantly still offering a public symbol to be reached, remaining within the library’s attack surface, and providing a potential backdoor for would-be attackers. Being no longer used, these functions may also not be correctly tested anymore, and might allow unintended state manipulations. So it’s better to get rid of them.<br>
If a kind of “private function just for internal tests” is needed, and should not be exposed in the official <code>*.h</code> header, create a secondary header, like <code>*-debug.h</code> for example, where the function is declared. And obviously <code>#include</code> it in the <code>*.c</code> unit. This will be cleaner and compatible with this flag.</p>
</li>
<li>
<p><code>-Wredundant-decls</code> : A prototype should be declared only once, and this single declaration should be <code>#include</code>d everywhere it’s needed. This policy avoids multiple sources of truth, with their associated synchronization problems.<br>
This flag will trigger a warning if it detects that a function prototype is declared twice (or more).</p>
</li>
</ul>
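<p>A sketch of the prototype issue (hypothetical functions): with <code>-Wstrict-prototypes</code>, the first declaration below is flagged, because an empty parameter list means “unspecified”, not “none”:</p>

```c
/* Flagged by -Wstrict-prototypes: () means "unspecified arguments". */
int counter_next();

/* Correct: (void) states explicitly that there is no argument. */
int counter_reset(void);

static int counter = 0;

int counter_next()      { return ++counter; }
int counter_reset(void) { counter = 0; return counter; }
```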
<h3 id="floating-point">Floating point</h3>
<ul>
<li><code>-Wfloat-equal</code> : this flag prevents usage of the <code>==</code> equality operator between <code>float</code> values. This is because floating point values are lossy representations of real numbers, and any operation with them will incur an inaccuracy, whose exact details depend on the target platform, hence are not portable. Two floating-point values should not be compared with equality: it’s not supposed to make sense, given the lossy nature of the representation. Instead, ensure that the distance between 2 floats is below a certain threshold to consider them “equivalent enough”.</li>
</ul>
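<p>A minimal sketch of the tolerance-based comparison (the <code>1e-6</code> threshold is an arbitrary choice for this sketch, to be adapted to the application’s precision needs):</p>

```c
/* Compare two doubles within a tolerance, instead of using `==`
 * (which -Wfloat-equal would flag). */
static int nearly_equal(double a, double b)
{
    double diff = a - b;
    if (diff < 0) diff = -diff;
    return diff < 1e-6;   /* arbitrary threshold for this sketch */
}
```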
<h3 id="preprocessor">Preprocessor</h3>
<ul>
<li><code>-Wundef</code> : forbids evaluation of a macro symbol that’s not defined. Without it, <code>#if SYMBOL_NOT_EXIST</code> is silently translated into <code>#if 0</code>, which may or may not generate the intended outcome. This is useful when the list of macro symbols evolves : whenever a macro symbol disappears, all related preprocessor tests get flagged with this warning, which makes it possible to review and adapt them.</li>
</ul>
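<p>A tiny sketch (hypothetical symbol names) of the scenario <code>-Wundef</code> protects against:</p>

```c
#define FEATURE_FAST_PATH 1   /* hypothetical configuration macro */

#if FEATURE_FAST_PATH
static const int fast_path_enabled = 1;
#else
static const int fast_path_enabled = 0;
#endif

/* If the macro is later renamed or removed, the `#if` above would
 * silently evaluate the undefined symbol as 0 ... unless -Wundef
 * is enabled, in which case the undefined symbol is reported. */
```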
<h3 id="standard-library">Standard Library</h3>
<ul>
<li><code>-Wformat=2</code> : this will track potential <code>printf()</code> issues which can be abused to create security hazard scenarios.<br>
An example is when the formatting string itself can be under the control of an external source, such as <code>printf(message)</code>, with <code>char* message</code> being externally manipulated. This can be used to read <em>and write</em> out of bounds, and take remote control of the system. Yep, it’s that dangerous.<br>
The solution to this specific issue is to write <code>printf("%s", message)</code>. It may look equivalent, but this second version is safer, as it interprets <code>message</code> only as a pure <code>char*</code> string to display, instead of a formatting string which can trigger read/write orders from inside <code>printf()</code>.<br>
<code>-Wformat=2</code> will flag this issue, and many more, such as ensuring proper correspondence between the argument type and control string statement, leading to a safer program.<br>
These issues go beyond the C language proper, and more into <code>stdio</code> library territory, but it’s good to enable more options to be protected from this side too.</li>
</ul>
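<p>A sketch of the difference (hypothetical helper functions, using <code>snprintf</code> so the output can be examined): when the message itself contains format directives, the “unsafe” version interprets them, while the safe version copies the message verbatim:</p>

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Flagged by -Wformat=2: the format string is not a literal,
 * so externally supplied directives get interpreted. */
static void render_unsafe(char *dst, size_t n, const char *message)
{
    snprintf(dst, n, message);
}

/* Safe: message is treated as plain data, never as a format string. */
static void render_safe(char *dst, size_t n, const char *message)
{
    snprintf(dst, n, "%s", message);
}
```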
<h3 id="extended-compatibility">Extended compatibility</h3>
<ul>
<li>
<p><code>-Wvla</code> : prevents usage of Variable Length Array.<br>
VLAs were introduced in <code>C99</code>, but have been optional since <code>C11</code> (support can be tested using the <code>__STDC_NO_VLA__</code> macro). They allow nice things, such as allocating on the stack a table of variable size, depending on a function parameter. However, VLA have a pretty poor reputation. I suspect a reason is that they were served by sub-par implementations, leading to all sorts of hard issues, such as undetected stack overflows, with unpredictable consequences.<br>
Note that even “good” implementations, able to dynamically expand the stack to make room for larger tables, and to correctly detect overflow issues and properly <code>abort()</code>, cannot provide any way for the program to be informed of such an issue and react accordingly. It makes it impossible to create a program that is guaranteed not to <code>abort()</code>.<br>
For better portability, it’s enough to know that some implementations of VLA are (or were) poor, and that VLA is no longer mandatory in <code>C11</code>, to justify avoiding it. VLA is also not available in <code>C90</code>.</p>
</li>
<li>
<p><code>-Wdeclaration-after-statement</code> : this flag is useful for <code>C90</code> compatibility. Having all declarations at the top of the block, before any statement, is a strict rule that was dropped with <code>C99</code>, and it’s now possible to declare new variables anywhere in a block. This flag is mostly useful if the goal is to be compatible with <code>C90</code> compilers, such as MS Visual Studio C before 2015 as an example.</p>
</li>
<li>
<p><code>-Wc++-compat</code> : this flag ensures that the source can be compiled unmodified as both valid C and C++ code. This requires a few additional restrictions, such as explicitly casting from <code>void*</code>, which is unnecessary in C, but required in C++.<br>
This is handy for highly portable code, because it’s not uncommon for some users to just import the source file in their project and compile it as a C++ file, even though it’s clearly labelled as C. Moreover, when targeting <code>C90</code> compatibility, <code>C++</code> compatibility is not too far away, so the remaining effort is moderate.</p>
</li>
</ul>
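<p>A small sketch of the dual-compatibility style (hypothetical helper): the explicit cast on <code>malloc()</code>’s result is unnecessary in C, but mandatory in C++, so <code>-Wc++-compat</code> pushes towards writing it:</p>

```c
#include <stdlib.h>
#include <string.h>

/* Duplicate a string fragment. The (char*) cast keeps this function
 * valid C++ as well; plain C would accept the void* implicitly. */
static char *duplicate(const char *src, size_t len)
{
    char *copy = (char *)malloc(len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, src, len);
    copy[len] = '\0';
    return copy;
}
```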
<h3 id="other-interesting-flags">Other interesting flags</h3>
<ul>
<li>
<p><code>-Wconversion</code> : The C language allows most conversions to be performed silently. Transforming an <code>int</code> value into a <code>short</code> one ? No problem, just assign it. This design choice dates from the 70’s, when reducing the number of keystrokes was important, due to concerns we can’t even start to imagine today (slow printers, limited display space, hard key presses, etc.). Thing is, many type conversions are actually dangerous. That <code>int</code> to <code>short</code> ? What if the original value is larger than <code>SHRT_MAX</code> ? The result is implementation-defined, and may even raise a signal. <code>short</code> to <code>int</code> conversion, on the other hand, is risk free.<br>
<code>-Wconversion</code> will flag any silent type conversion which is <em>not</em> risk free. In an existing code base developed without this flag, this will lead to a <em>very large</em> number of warnings, likely within intractable territory.<br>
The situation is even worse for <code>gcc</code>, because it flags type conversions resulting from implicit operation conversions. <a href="https://godbolt.org/z/o6r63G">In this short example</a>, all variables are <code>short</code> types. There is no other type anywhere. Yet, <code>gcc</code>'s <code>-Wconversion</code> flag will trigger multiple warnings, because a basic operation such as <code>+</code> is allowed to be performed into <code>int</code> space, hence storing the final result into a <code>short</code> is now considered a “risky” conversion. Some constructions, such as <code>+=</code> can’t even be fixed !<br>
Bottom line : starting a new code base with <code>-Wconversion</code> is doable, but adding this flag to an existing project is likely too large a burden.<br>
Special mention for the combination <code>clang</code> + <code>-Wconversion -Wno-sign-conversion</code>, which I use regularly, but only on <code>clang</code>.</p>
</li>
<li>
<p><code>-Weverything</code> (<code>clang</code> only) : While it’s not recommended to use too many warnings in the production build chain, it can be sometimes interesting to look at more options. Special mention can be given to <code>-Weverything</code> on <code>clang</code>, which will activate every possible warning flag.<br>
Now, <a href="https://quuxplusone.github.io/blog/2018/12/06/dont-use-weverything/"><code>-Weverything</code> is not meant to be used in production</a>. It’s mostly a convenient “discovery” feature for <code>clang</code> developers, which can track and understand new warnings as they are added to “trunk”.<br>
But for the purpose of testing if the compiler can help find new issues, it can be an interesting temporary digression. One or two of these warnings might uncover real issues, inviting to re-assess the list of flags used in production.</p>
</li>
</ul>
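<p>A sketch of handling a narrowing conversion explicitly (hypothetical function; saturation is an arbitrary choice of out-of-range policy): <code>-Wconversion</code> flags the silent <code>int</code> to <code>short</code> assignment, and an explicit range check plus cast documents the intent:</p>

```c
#include <limits.h>

/* Narrow an int to short, saturating out-of-range values instead of
 * relying on implementation-defined wrap-around. The explicit (short)
 * cast is warning-free, now that the range is proven. */
static short narrow_to_short(int value)
{
    if (value > SHRT_MAX) return SHRT_MAX;
    if (value < SHRT_MIN) return SHRT_MIN;
    return (short)value;
}
```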
<h3 id="summary">Summary</h3>
<p>All the flags presented so far can be combined into the following list, provided below for copy-pasting purposes :<br>
<code>-Wall -Wextra -Wcast-qual -Wcast-align -Wstrict-aliasing -Wpointer-arith -Winit-self -Wshadow -Wswitch-enum -Wstrict-prototypes -Wmissing-prototypes -Wredundant-decls -Wformat=2 -Wfloat-equal -Wundef -Wvla -Wdeclaration-after-statement -Wc++-compat</code></p>
<p>Quite a mouthful. Adopting this list as-is into an existing project might result in an abundance of warnings if these flags were not already part of the build. Don’t be afraid, your code is not completely broken, but consider having a look: it might be fragile in subtle ways that these flags will help find. Enable additional warnings one by one, selectively, picking those which add value to your project. In the long run, these flags will help keep the code better maintained.</p>
<p>Compiler warning flags can be seen as a giant list of patterns that the compiler is pre-trained to detect. That’s great. But beyond these pre-defined capabilities, one might be interested in adding one’s own set of conditions for the compiler to check and enforce. That’s the purpose of the next blog post.</p>
<h4 id="special-thanks">Special Thanks</h4>
<p>An early version of this article was commented by Nick Terrell and Evan Nemerson.</p>
</div>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>btrc: compile3 : opaque type and static allocation</title>
<link rel="stylesheet" href="https://stackedit.io/style.css" />
</head>
<body class="stackedit">
<div class="stackedit__html">
<p>In a <a href="https://fastcompression.blogspot.com/2019/01/the-type-system_19.html">previous episode</a>, we’ve seen that it is possible to create opaque types. However, creation and destruction of such a type must be delegated to dedicated functions, which themselves rely on dynamic allocation mechanisms.</p>
<p>Sometimes, it can be convenient to bypass the heap, and all its <code>malloc()</code> / <code>free()</code> shenanigans. Pushing a structure onto the stack, or within thread-local storage, are natural capabilities offered by a normal <code>struct</code>. It can be desirable at times.</p>
<p>The previously described opaque type is so secret that it has no size, hence it is not suitable for such scenarios.</p>
<p>Fortunately, static opaque types are possible.<br>
The main idea is to create a “shell type”, with a known size and an alignment, able to host the target (private) structure.</p>
<p>For safer maintenance, the shell type and the target structure must be kept in sync, typically by using a <a href="https://en.cppreference.com/w/c/language/_Static_assert">static assert</a>. It will ensure that the shell type is always large enough to host the target structure. This check is important to automatically detect future evolutions of the target structure.</p>
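<p>Here is a minimal sketch of the construction (all names hypothetical): the public shell is a <code>union</code> with a known size and a strong alignment, the private structure stays inside the implementation unit, and a static assert keeps the two in sync:</p>

```c
/* Public header side: a shell with known size and alignment. */
typedef union {
    char      opaque[64];   /* size budget for the private structure */
    long long ll_align;     /* force a conservative alignment */
    double    d_align;
} thing_shell;

/* Implementation unit side: the real (private) structure. */
typedef struct {
    int    state;
    double accumulator;
} thing_private;

/* Keep shell and target in sync: compilation fails if the private
 * structure ever outgrows its shell (C11 _Static_assert). */
_Static_assert(sizeof(thing_private) <= sizeof(thing_shell),
               "thing_shell is too small to host thing_private");
```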
<p>If it wasn’t for the <a href="http://dbp-consulting.com/tutorials/StrictAliasing.html">strict aliasing rule</a>, we would have a winner : just use the shell type as the “public” user-facing type, proceed with transforming it into the private type inside the unit. It would combine properties of <code>struct</code> while remaining opaque.</p>
<h2 id="strict-aliasing">Strict aliasing</h2>
<p>Unfortunately, the <a href="http://dbp-consulting.com/tutorials/StrictAliasing.html">strict aliasing rule</a> gets in the way : we can’t manipulate the same memory region through two pointers of different types (edit Christer Ericson : for the lifespan of the stored value). That’s because the compiler is allowed to make assumptions about pointer value provenance for the benefit of performance.</p>
<p>To visualize the issue, I like <a href="https://godbolt.org/z/6cSQvx">this simple example</a>, powered by Godbolt. Notice how the two <code>+1</code> get combined into a single <code>+2</code>, saving one save+load round trip, and allowing computation over <code>i</code> and <code>f</code> in parallel, so it’s real saving.<br>
But unfortunately, if <code>f</code> and <code>i</code> have same addresses, the result is wrong : the first <code>i+1</code> influences the operation on <code>f</code> which influences the final value of <code>i</code>.<br>
Of course, this example feels silly : it’s pretty hard to find a use case which justifies operations on <code>int</code> and <code>float</code> simultaneously <em>and</em> pointing at the same memory address. It shows that the rule is quite logical : if these pointers have different type, they most likely do not reference the same memory area. And since benefits are substantial, it’s tempting to use that assumption.</p>
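<p>A reconstruction of that kind of example (hypothetical function): under strict aliasing, the compiler assumes <code>*i</code> and <code>*f</code> are distinct objects, so it may merge the two <code>+1</code> on <code>*i</code> into a single <code>+2</code>. With genuinely distinct pointers, as below, the result is correct either way; only an aliasing violation would expose the difference:</p>

```c
/* At -O2, the two increments of *i are typically fused into a +2,
 * because *f is assumed never to overlap *i. */
static int bump(int *i, float *f)
{
    *i += 1;
    *f += 1.0f;   /* assumed not to touch *i */
    *i += 1;
    return *i;
}
```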
<p>Interpreting differently the same memory area using different types of pointers is called “type punning”. It may work, as long as the compiler serializes operations as expected in the code, but there is no guarantee that it will continue to work safely in the future. A known way to break older programs employing type punning is to recompile them with modern compilers using advanced performance optimizations such as <code>-O3 -lto</code>. With enough inlining, register caching and dead code elimination, one will start to experience strange effects, which can be very hard to debug.</p>
<p>This is explained in greater detail in this <a href="https://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html">excellent article from Mike Acton</a>. For an even deeper understanding of what can happen under the hood, you can <a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2263.htm">read this document</a> suggested by Josh Simmons. It demonstrates that there is a lot more to a pointer than just its binary representation.</p>
<p>One line of defense is to disable strict-aliasing-based optimizations altogether, with a compilation directive such as <a href="https://godbolt.org/z/kYaAAb"><code>-fno-strict-aliasing</code></a> on <code>gcc</code>.<br>
I wouldn’t recommend it though. On top of impacting performance, it ties code correctness to a specific compiler setting, which may or may not be present in the user’s project. Portability is also impacted, since there is no guarantee that this capability will always be available on a different C compiler.</p>
<p>Another line of defense consists of using <code>char*</code> pointers, which are the exception to the rule and can alias anything. When a memory area is passed as a <code>char*</code>, the compiler <a href="https://godbolt.org/z/2G0dOZ">will pay attention to serialize <code>char*</code> reads and writes properly</a>. It works well in practice, at least in my tests. What is worrying though is that, in theory, the compiler <a href="https://stackoverflow.com/a/28240251/646947">is only obliged to guarantee reads in the correct order</a>. That it pays attention to serialize writes too seems to be “extra care”, presumably so that existing programs continue to work as intended. It’s not clear whether this can be relied upon in the long term.</p>
<p>Another issue is that our proposed shell type is not a <code>char</code> array. It’s a <code>union</code> containing a <code>char</code> array. That’s not the same, and in this case, <a href="https://godbolt.org/z/4Qly0X">the exception does not hold</a>.</p>
<p>As a consequence, the shell type must not be confused with the target type. The <a href="http://dbp-consulting.com/tutorials/StrictAliasing.html">strict aliasing rule</a> makes them non-interchangeable!</p>
<h2 id="safe-static-allocation-for-opaque-types">Safe static allocation for opaque types</h2>
<p>The trick is to use a third-party initializer, which converts the allocated space and returns a pointer of the appropriate type.<br>
To ensure strict compliance with the C standard, it’s a multi-step trick, hence a more complex setup. Consider this technique “advanced”, reserved for limited usage scenarios.</p>
<p>Here is an example :</p>
<pre class=" language-c"><code class="prism language-c"><span class="token keyword">typedef</span> <span class="token keyword">struct</span> thing_s thing<span class="token punctuation">;</span> <span class="token comment">// incomplete (opaque) type</span>
<span class="token keyword">typedef</span> <span class="token keyword">union</span> <span class="token punctuation">{</span>
<span class="token keyword">char</span> body<span class="token punctuation">[</span>SIZE<span class="token punctuation">]</span><span class="token punctuation">;</span>
<span class="token keyword">unsigned</span> alignment_enforcer<span class="token punctuation">;</span> <span class="token comment">// ensures `thingBody` respects the alignment of the largest member of `thing`</span>
<span class="token punctuation">}</span> thingBody<span class="token punctuation">;</span>
<span class="token comment">// PREFIX_initStatic_thing() accepts any buffer as input, </span>
<span class="token comment">// and returns a properly initialized `thing*` opaque pointer.</span>
<span class="token comment">// It ensures `buffer` has proper size (`SIZE`) and alignment (4) restrictions</span>
<span class="token comment">// and will return `NULL` if it does not.</span>
<span class="token comment">// Resulting `thing*` uses the provided buffer only, it will not allocate further memory on its own.</span>
<span class="token comment">// Use `thingBody` to define a memory area respecting all conditions.</span>
<span class="token comment">// On success, `thing*` will also be correctly initialized.</span>
thing<span class="token operator">*</span> <span class="token function">PREFIX_initStatic_thing</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token operator">*</span> buffer<span class="token punctuation">,</span> size_t size<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token comment">// Notice there is no corresponding destructor.</span>
<span class="token comment">// Since the space is reserved externally, its deallocation is controlled externally.</span>
<span class="token comment">// This presumes that `initStatic` does not dynamically allocate further space.</span>
<span class="token comment">// Note that it doesn't make sense for `initStatic` to invoke dynamic allocation.</span>
<span class="token comment">/* ====================================== */</span>
<span class="token comment">/* Example usage */</span>
<span class="token keyword">int</span> <span class="token function">function</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
thingBody scratchSpace<span class="token punctuation">;</span> <span class="token comment">/* on stack */</span>
thing<span class="token operator">*</span> <span class="token keyword">const</span> T <span class="token operator">=</span> <span class="token function">PREFIX_initStatic_thing</span><span class="token punctuation">(</span><span class="token operator">&</span>scratchSpace<span class="token punctuation">,</span> <span class="token keyword">sizeof</span><span class="token punctuation">(</span>scratchSpace<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token function">assert</span><span class="token punctuation">(</span>T <span class="token operator">!=</span> <span class="token constant">NULL</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// Should be fine. Only exception is if `struct thing_s` definition changes and there is some version mismatch.</span>
<span class="token comment">// Now use `T` as a normal `thing*` pointer</span>
<span class="token comment">// (...)</span>
<span class="token comment">// do Not `free(T)` at function's end, since thingBody is part of the stack</span>
<span class="token punctuation">}</span>
</code></pre>
<p>In this example, the static size of <code>thingBody</code> is used to allocate space for <code>thing</code> on the stack. It’s faster, and there is no need to care about deallocation.</p>
<p>But that’s all it does. No data is ever read from or written to <code>thingBody</code>. All usages of the memory region go through <code>thing*</code>, which is safe.</p>
<p>Compared to a usual public <code>struct</code>, the experience is not equivalent.<br>
To begin with, the proposed stack allocation takes multiple lines and creates 2 variables: the shell type and the target pointer. It’s not too bad, and this model fits any kind of manual allocation scenario well enough, be it on the stack or within a pre-reserved area (typically for embedded environments).</p>
<p>If that matters, stack allocation could have been made a one-liner, hidden behind a macro.<br>
But I tend to prefer the variant in the above example. It makes it clear what’s happening. Since one of C’s strengths is a clear grasp of resource control, it is better to preserve that level of understanding.</p>
<p>There are more problematic differences though.<br>
It’s not possible to use the shell type as the return type of a function: once again, the shell type and the target incomplete type are different things. Along the same lines, it’s not possible to pass the shell type by value. The memory region can only be passed by reference, and only through the correctly typed pointer.</p>
<p>Embedding the shell type into a larger structure is dangerous and generally not recommended: it requires 2 members (the shell and the pointer), but the pointer is only valid as long as the <code>struct</code> is neither moved around nor copied. That’s too strong a constraint to make it safely usable.</p>
<h3 id="removing-the-pointer">Removing the pointer</h3>
<p>As suggested by Sebastian Aaltonen, it is generally possible to bypass the target pointer, and just use the address of the shell type instead. Since the shell type is never accessed directly, there is no aliasing to be afraid of.</p>
<p>The only issue is that <a href="https://godbolt.org/z/ldGoF2">some compilers might not like the pointer cast from <code>shellType*</code> to the target <code>opaque*</code></a>, irrespective of the fact that the <code>shellType</code> is never accessed directly. This is an annoying false positive. That being said, newer compilers are <a href="https://godbolt.org/z/LYduX7">better at detecting this pattern</a>, and won’t complain.<br>
Note that the explicit cast <a href="https://godbolt.org/z/_My0Tz">is not optional,</a> so the notation cannot be shortened, hence this method will not save many keystrokes.</p>
<p>The real goal is to guarantee that the address transmitted is necessarily the address of <code>shell</code>. This makes sense when the intention is to move <code>shell</code> around or copy it: there is no risk of losing sync with a separate pointer variable.</p>
<p>To be complete, note that, in the above proposal, <code>initStatic()</code> does more than cast a pointer :</p>
<ul>
<li>It ensures that the memory area has correct size & alignment properties
<ul>
<li><code>shellType</code> provides these guarantees too.
<ul>
<li>The only corner case is when the program invokes <code>initStatic()</code> from a dynamic library. If the runtime library version differs from the one used when compiling the program, it can lead to a discrepancy in size or alignment requirements.</li>
<li>No such risk when using static linking.</li>
</ul>
</li>
</ul>
</li>
<li>It ensures that the resulting pointer references a properly initialized memory area.</li>
</ul>
<p>The second bullet point, in particular, still needs to be done one way or another, so <code>initStatic()</code> is still useful, at least as an initializer.</p>
<h3 id="using-the-shell-type-directly">Using the shell type directly</h3>
<p>Removing the pointer is nice, but the real game changer is to be able to employ the opaque type as if it was a normal <code>struct</code>, in particular :</p>
<ul>
<li>assign with <code>=</code></li>
<li>can be passed by value as function parameter</li>
<li>can be received as return type from a function</li>
</ul>
<p>These properties can influence the API design, making the opaque type “feel” more natural to use. For example :</p>
<pre class=" language-c"><code class="prism language-c"><span class="token comment">// declaration</span>
<span class="token macro property">#<span class="token directive keyword">define</span> SIZE 8</span>
<span class="token keyword">typedef</span> <span class="token keyword">union</span> <span class="token punctuation">{</span>
<span class="token keyword">char</span> body<span class="token punctuation">[</span>SIZE<span class="token punctuation">]</span><span class="token punctuation">;</span>
<span class="token keyword">unsigned</span> align4<span class="token punctuation">;</span> <span class="token comment">// ensures `thing` is aligned on 4-byte boundaries</span>
<span class="token punctuation">}</span> thing<span class="token punctuation">;</span>
<span class="token comment">// No need for a "separate" incomplete type.</span>
<span class="token comment">// The shell IS the public-facing type for API.</span>
thing <span class="token function">thing_init</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
thing <span class="token function">thing_set_byValue</span><span class="token punctuation">(</span><span class="token keyword">int</span> v<span class="token punctuation">)</span><span class="token punctuation">;</span>
thing <span class="token function">thing_combine</span><span class="token punctuation">(</span>thing a<span class="token punctuation">,</span> thing b<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token comment">// usage</span>
thing <span class="token function">doubled_value</span><span class="token punctuation">(</span><span class="token keyword">int</span> v<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
thing <span class="token keyword">const</span> ta <span class="token operator">=</span> <span class="token function">thing_set_byValue</span><span class="token punctuation">(</span>v<span class="token punctuation">)</span><span class="token punctuation">;</span>
thing <span class="token keyword">const</span> tb <span class="token operator">=</span> ta<span class="token punctuation">;</span>
<span class="token keyword">return</span> <span class="token function">thing_combine</span><span class="token punctuation">(</span>ta<span class="token punctuation">,</span> tb<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre>
<p>This can be handy for <strong>small</strong> <a href="https://en.wikipedia.org/wiki/Passive_data_structure">POD types</a> (typically less than a few dozen bytes), giving them a behavior similar to basic types.<br>
Since passing arguments and results by value <a href="https://godbolt.org/z/-Zqqcy">implies some memory copy,</a> the cost of this approach increases with the type’s size. Therefore, whenever the type becomes uncomfortably large, prefer switching to a pointer reference.</p>
<p>The compiler may completely eliminate the memory copy operation if it can somehow inline the invoked functions. That’s, by definition, hard to do when these functions are in a separate unit, due to the need to access a private type declaration.<br>
However, <code>-flto</code> (Link Time Optimization) can break the unit barrier. As a consequence, functions which were behaving correctly while not inlined might end up being inlined, triggering weird optimization effects.</p>
<p>For example, statements acting directly on <code>shell*</code>, such as a <code>memset()</code> initialization, or any kind of value assignment, might be reordered for parallel processing with other statements within inlined functions acting on <code>internal_type*</code>, on the assumption that <code>shell*</code> and <code>internal_type*</code> cannot alias each other.<br>
To be fair, I would expect a modern compiler to be clever enough to detect that <code>shell*</code> and <code>internal_type*</code> effectively reference the same address, and to avoid re-ordering or eliding memory read/write operations. Nevertheless, this is a risk that might be triggered by complex cases or less clever compilers (typically older ones).</p>
<p><a href="https://godbolt.org/z/7_Zw07">The solution is to use <code>memcpy()</code></a> to transfer data back and forth between the internal type and the shell type. <code>memcpy()</code> acts as a synchronization point for memory accesses: it guarantees that reads and writes will be serialized, in the order written in the source code. The compiler will not be able to “outsmart” the code by re-ordering statements on the assumption that side effects through 2 pointers of different types cannot alias each other: a <code>memcpy()</code> can alias anything, so it has to be performed in the requested order.</p>
<h3 id="back-to-struct-">Back to <code>struct</code> ?</h3>
<p>Adding <code>memcpy()</code> everywhere is a small inconvenience. Also, there is always a risk that the compiler will not be smart enough to elide the copy operation.</p>
<p>Due to these limitations and risks, it can be better to give up this complexity and just use a public <code>struct</code>. As long as the <code>struct</code> is a POD type, all conveniences are available. And without the need to add some private declaration, it’s now possible to define implementations directly in header, as explicit <code>inline</code> functions, sharply reducing the cost of passing parameters.</p>
<p>To avoid direct accesses to structure members, one can still mention it clearly in code comments, and use scary member names as a deterrent. A more involved way to protect <code>struct</code> members is to give them scary <em>and useless</em> names, such as <code>dont_access_me_1</code>, <code>dont_access_me_2</code>, etc., and rename them with macros in the code section which can actually interpret them. This is a bit more involved, especially if the number of member names is large, potentially leading to confusion. More importantly, the compiler will no longer be able to help in case of contract violation, and protecting the design pattern will now depend entirely on reviewers. Still, it’s a very reasonable choice, notably for “internal” types, which are not exposed in the user-side API, hence should only be manipulated by a small number of skillful contributors subject to a review process.</p>
<p>For user-facing types though, opacity is more valuable. And if the type size is large enough to begin with, it seems a no-brainer: prefer the opaque type, and only use references.</p>
</div>
</body>
</html>Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com2tag:blogger.com,1999:blog-834134852788085492.post-46702248732010823222019-01-19T20:30:00.002+01:002019-01-22T09:11:36.718+01:00The type system<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>btrc: compile2 : the type system</title>
<link rel="stylesheet" href="https://stackedit.io/style.css" />
</head>
<body class="stackedit">
<div class="stackedit__html"><p>The type system is a simple yet powerful way to ensure that the code knows enough about the data it is manipulating. It is used to declare conditions at interfaces, which are then enforced at each invocation.<br>
The compiler is very good at checking types. The condition is trivial to enforce, and doesn’t cost much compilation resource, so it’s a powerful combination.</p>
<h2 id="typedef-and-weak-types"><code>typedef</code> and weak types</h2>
<p>C is sometimes labelled a “weakly typed” language, presumably because of the behavior associated with one of its keywords, <code>typedef</code>.<br>
The keyword itself implies that <code>typedef</code> DEFines a new TYPE, but that’s unfortunately a misnomer.</p>
<p>As an example, <code>typedef</code> can be used this way :</p>
<pre class=" language-c"><code class="prism language-c"><span class="token keyword">typedef</span> <span class="token keyword">int</span> meters<span class="token punctuation">;</span>
<span class="token keyword">typedef</span> <span class="token keyword">int</span> kilograms<span class="token punctuation">;</span>
</code></pre>
<p>This defines 2 new “types”, <code>meters</code> and <code>kilograms</code>, which can be used to declare variables.</p>
<pre class=" language-c"><code class="prism language-c">meters m<span class="token punctuation">;</span>
kilograms k<span class="token punctuation">;</span>
</code></pre>
<p>One could logically expect that, from now on, it’s no longer allowed to mix <code>meters</code> and <code>kilograms</code>, since they represent different types, hence should not be compatible.</p>
<p>Unfortunately, that’s not the case : <code>meters</code> and <code>kilograms</code> are still considered as <code>int</code> from the C type system perspective, and <a href="https://godbolt.org/z/UOFA0g">mixing them works</a> without a single warning, even when every possible compiler warning is enabled.</p>
<p>As such, <code>typedef</code> must be considered a mere tagging system. It’s still useful from a code reviewer’s perspective, since it has documenting value, and that may help notice discrepancies, such as in the example above. But the compiler won’t be able to provide any signal.</p>
<h2 id="strong-types-in-c">Strong types in C</h2>
<p>To ensure that two types cannot be accidentally mixed, it’s necessary to strongly separate them. And that’s actually possible.<br>
C has a construct called <code>struct</code> (and its remote relative, <code>union</code>).<br>
Two <code>struct</code> defined independently are considered completely foreign, even if they contain <em>exactly the same members</em>.<br>
<a href="https://godbolt.org/z/UVdyLi">They can’t be mixed unintentionally</a>.</p>
<p>This gives us a basic tool to strongly segregate types.</p>
<h3 id="operations">Operations</h3>
<p>Using <code>struct</code> comes with severe limitations. To begin with, the set of default operations is much more restricted. It’s possible to allocate a <code>struct</code> on the stack, and make it part of a larger <code>struct</code>; it’s possible to assign with <code>=</code> or <code>memcpy()</code>; but that’s pretty much it. No simple operations like <code>+ - * /</code>, no comparisons like <code>&lt; &lt;= &gt;= &gt;</code>, etc.</p>
<p>Users may also access members directly, and manipulate them. But doing so breaks the abstraction.<br>
When structures are used as a kind of “bag of variables”, to simplify transport and enforce naming for clarity, it’s fine to let users access members directly. Compared to a function with a ton of parameters, an equivalent function with a structure as input helps readability tremendously, just because it enforces naming the parameters.<br>
But in the present case, when structures are used to enforce abstractions, users should be clearly discouraged from accessing members directly. This means all operations must be achieved at the <code>struct</code> level directly.</p>
<p>To comply with these limitations, it’s now necessary to create all allowed operations one by one, giving a uniquely named symbol to each. So if <code>meters</code> and <code>kilograms</code> can be added, both operations need their own function signatures, such as <code>add_meters()</code> and <code>add_kilograms()</code>. This feels like a hindrance, and indeed, if there are many types to populate, it can require a lot of glue code.</p>
<p>But on the plus side, only what’s allowed is now possible. For example, multiplying <code>meters</code> with <code>meters</code> shouldn’t produce some <code>meters</code>, but rather a <code>square_meters</code> surface, which is a different concept. Allowing additions, but not multiplications, is an impossible subtlety for basic <code>typedef</code>.</p>
<h3 id="composition">Composition</h3>
<p>There is no “intermediate” situation, where a type would be “compatible” with another type, yet different. In the mechanisms explained so far, types are either compatible and identical, using <code>typedef</code>, or completely incompatible, using a new definition and a new name.</p>
<p>In contrast, in Object Oriented languages, a <code>cat</code> can also be an <code>animal</code>, thanks to inheritance, so it’s possible to use <code>cat</code> to invoke <code>animal</code> methods, or use functions with <code>animal</code> parameter(s).</p>
<p><code>struct</code> strongly leans towards composition. A <code>struct cat</code> can include a <code>struct animal</code>, which makes it possible to invoke <code>animal</code>-related functions, though it’s not transparent: it’s necessary to explicitly spell out the sub-structure (<code>cat.animal</code>) as a parameter or return value of the <code>animal</code>-related function.</p>
<p>Note that even Object Oriented languages generally endorse the <a href="https://en.wikipedia.org/wiki/Composition_over_inheritance">composition over inheritance</a> guiding principle. It states that, unless there is a very good reason to employ inheritance, composition must always be preferred, because it generally fares better as the code evolves and becomes more complex (multiple inheritance quickly translates into a nightmare).</p>
<p><code>struct</code> can be made more complex, with tables of virtual function pointers, achieving something similar to inheritance and polymorphism. But this is a whole different level of complexity. I would rather avoid this route for the time being. The current goal is merely to separate types in a way which can be checked by the compiler. Enforcing a unified interface on top of different types is a more complex topic, better left for a future article.</p>
<h2 id="opaque-types">Opaque types</h2>
<p><code>struct</code>s are fine as strong types, but publishing their definition implies that their members are public, meaning any user can access and modify them.<br>
When that’s the goal, it’s totally fine.</p>
<p>But sometimes, in order to protect users from unintentional misuse, it would be better to make structure members unreachable. This is called an <a href="https://en.wikipedia.org/wiki/Opaque_data_type">opaque type</a>. An additional benefit is that whatever is inaccessible cannot be relied upon, hence may be changed in the future without breaking user code.</p>
<p>Object oriented languages have the <code>private</code> keyword, which allows exactly that: some members might be published, yet they are nonetheless unreachable by the user (well, <a href="http://bloglitb.blogspot.com/2011/12/access-to-private-members-safer.html">in theory</a>…).</p>
<p>A “poor man’s” equivalent in C is to comment the code, clearly indicating which members are public and which ones are private. No guarantee can be enforced by the compiler, but it’s still a good indication for users.<br>
Another step is to give private members terrible names, such as <code>never_ever_access_me</code>, which provides a pretty serious hint, and is harder to overlook than a code comment.</p>
<p>Yet, sometimes, one wishes to rely on a stronger compiler-backed guarantee, to ensure that no user will access private structure members. C doesn’t have <code>private</code>, but it can do something equivalent.<br>
It relies on the principle of <a href="https://docs.oracle.com/cd/E19205-01/819-5265/bjals/index.html">incomplete types</a>.</p>
<p>My own preference is to declare an incomplete type by pairing it with <code>typedef</code> :</p>
<pre class=" language-c"><code class="prism language-c"><span class="token keyword">typedef</span> <span class="token keyword">struct</span> house_s house<span class="token punctuation">;</span>
<span class="token keyword">typedef</span> <span class="token keyword">struct</span> car_s car<span class="token punctuation">;</span>
</code></pre>
<p>Notice that we have not published anything about the internals of <code>struct house_s</code>. This is intentional. Since nothing is published, nothing can be accessed, hence nothing can be misused.</p>
<p>Fine, but what can we do with such a thing? To begin with, we can’t even allocate it, since its size is not known.<br>
That’s right, the only thing that can be declared at this stage is a pointer to the incomplete type, like this :</p>
<pre class=" language-c"><code class="prism language-c">house<span class="token operator">*</span> my_house<span class="token punctuation">;</span>
car<span class="token operator">*</span> my_car<span class="token punctuation">;</span>
</code></pre>
<p>And now?<br>
Well, only functions with <code>house*</code> or <code>car*</code> as a parameter or return type can actually do something with it.<br>
These functions must access the internal definitions of <code>struct house_s</code> and <code>struct car_s</code>. These definitions are therefore published in the relevant unit <code>*.c</code> file, rather than the header <code>*.h</code>. Not being part of the public interface, the structure’s internals remain effectively private.</p>
<p>The first functions required are an allocator and a destructor.<br>
For example, I’m used to the following naming convention :</p>
<pre class=" language-c"><code class="prism language-c">thing<span class="token operator">*</span> <span class="token function">PREFIX_createThing</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token keyword">void</span> <span class="token function">PREFIX_freeThing</span><span class="token punctuation">(</span>thing<span class="token operator">*</span> t<span class="token punctuation">)</span><span class="token punctuation">;</span>
</code></pre>
<p>Now, it’s possible to allocate space for a <code>thing*</code>, and eventually do something with it (with additional functions).<br>
A good convention is that functions which accept a <code>thing*</code> as mutable argument should have <code>thing*</code> as their first parameter, like in this example :</p>
<pre class=" language-c"><code class="prism language-c"><span class="token keyword">int</span> <span class="token function">PREFIX_pushElement</span><span class="token punctuation">(</span>thing<span class="token operator">*</span> t<span class="token punctuation">,</span> element e<span class="token punctuation">)</span><span class="token punctuation">;</span>
element <span class="token function">PREFIX_pullElement</span><span class="token punctuation">(</span>thing<span class="token operator">*</span> t<span class="token punctuation">)</span><span class="token punctuation">;</span>
</code></pre>
<p>Notice that we are getting pretty close to object-oriented programming with this construction. Functions and data members, while not declared inside an encompassing “object”, must nonetheless be defined together: since doing anything requires knowing the structure’s content, function definitions are forced into the unit that declares the structure content. It’s fairly close.</p>
<p>Compared with a direct <code>struct</code>, a few differences stand out :</p>
<ul>
<li>Members are private</li>
<li>Allocation is implemented by a function, and can only be performed by invoking it
<ul>
<li>no way to allocate on stack</li>
<li>no way to include a <code>thing</code> into another <code>struct</code>
<ul>
<li>but it’s possible to include a pointer <code>thing*</code></li>
</ul>
</li>
<li>Initialization can be enforced directly in the constructor
<ul>
<li>removes risks of garbage content due to lack of initialization.</li>
</ul>
</li>
</ul>
</li>
<li>The caller <em>is in charge of invoking the destructor</em>.
<ul>
<li>The pattern is exactly identical to <code>malloc()</code> / <code>free()</code> (see future article on <a href="">Resource Control</a>)</li>
</ul>
</li>
</ul>
<p>The responsibility to invoke the destructor after usage is very important.<br>
It’s no different from invoking <code>free()</code> after a <code>malloc()</code>,<br>
but it’s still an additional detail to take care of, with the corresponding risk of forgetting or mismanaging it.</p>
<p>To bypass this responsibility, and take control of the allocation process, it can be preferable to consider <a href="https://fastcompression.blogspot.com/2019/01/opaque-types-and-static-allocation.html">opaque types with static allocation</a>. That’s the topic of the <a href="https://fastcompression.blogspot.com/2019/01/opaque-types-and-static-allocation.html">next article</a>.</p>
<h2 id="summary">Summary</h2>
<p>This closes this first chapter on the type system. We have seen that it’s possible to create <a href="https://en.wikipedia.org/wiki/Strong_and_weak_typing">strong types</a>, and use this property to ensure users can’t mix up different types accidentally. We have also seen that it’s possible to create <a href="https://en.wikipedia.org/wiki/Opaque_data_type">opaque types</a>, ensuring users can only invoke allowed operations, and can’t rely on secret internal details, clearing the path for future evolution. These properties are compiler-checked, so they are always automatically enforced.</p>
<p>That’s not bad. Just using these properties will seriously improve the code’s resistance to potential misuse.</p>
<p>Yet, there is more the compiler can do to detect potential bugs in our code. To be continued…</p>
</div>
</body>
</html>
Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com5tag:blogger.com,1999:blog-834134852788085492.post-8521149767099483152019-01-19T17:58:00.002+01:002019-01-20T03:45:46.708+01:00Writing safer C code<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU6xETqGRkfiCpq1auAqphrvucXrqBWjy0jaPb_F0BcifhIpuwyjZW_k_ZQje13SEUkOkV4DKfKSAn50w3B4nTnVfqSVD7-JIgnrxxmB4LDf4bwmoNexJBgeP8TdYLH7qOHuU4tLxU0y0/s1600/1200px-The_C_Programming_Language_logo.svg.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="1276" data-original-width="1200" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU6xETqGRkfiCpq1auAqphrvucXrqBWjy0jaPb_F0BcifhIpuwyjZW_k_ZQje13SEUkOkV4DKfKSAn50w3B4nTnVfqSVD7-JIgnrxxmB4LDf4bwmoNexJBgeP8TdYLH7qOHuU4tLxU0y0/s200/1200px-The_C_Programming_Language_logo.svg.png" width="187" /></a></div>
Writing safer C code may feel like an overwhelming goal. After all, we are told that C gives programmers plenty of opportunities to <a href="http://www.stroustrup.com/bs_faq.html#really-say-that">shoot their own foot</a>.<br />
<br />
But that doesn’t mean there is no room for improvement. Actually, in the last decade, programming practices have already evolved dramatically, and for the better, as a consequence of multiple forces, such as improved tooling, shared programming practices, and the rising cost of failures, as the numerous Internet exploits tend to remind us all too often.<br />
<br />
I expected to start this series with an introduction on C, its strengths, and guiding principles on safer coding practices. But it doesn’t fit the blog post format, being too long, boring, and at times a potential troll magnet. Suffice to say that “safer” implies writing Reviewer-Oriented source code, aka highly readable, and automating as much error detection as possible, favoring fast methods (immediate feedback while editing code) over longer ones (long offsite test sessions in dedicated environments).<br />
<br />
One thing I can’t escape though is to mention a few words on the intended audience. These articles are not meant to teach new things to “experts”, who know a lot more than I do. Neither are they intended to guide the first steps towards C programming. The intended audience has good enough C programming skills, and can actually ship products. Shipping real products is important, because the whole concept of “safer programming” is better understood under the pressure and experience of a product’s maintenance.<br />
<br />
The main driver is to make it more difficult to ship bugs, as the code base lives and evolves, and new team members come on board, adding much-needed automated controls at every opportunity. Issues are centered around modifying / fixing an existing code base, and managing the cascading impacts on the rest of the project. This requires preparing the code for this challenge, hence the design patterns proposed are also useful for new code bases with an expected “long” life expectancy (beyond a few months).<br />
<br />
Now let’s shorten this introduction and go directly into the meat of the topic.<br />
I’ll start this series with design patterns that leverage compiler checks, to help make C code more resistant to mis-usages and future refactoring.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJkemU1rr-Qx19u8LMO__KsiwO1unvpTPTcOpFD7OiXrf3dI8cUMhPepgO4wqvp1TDvAd2Dx3ohB8mdVgoFlV9QSs4l2NPxJf-iWfC1HkLlrI98LyZAlEB5GEay6wQZtKbMdUDTlmXyWg/s1600/c_compiler_svg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="260" data-original-width="1083" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJkemU1rr-Qx19u8LMO__KsiwO1unvpTPTcOpFD7OiXrf3dI8cUMhPepgO4wqvp1TDvAd2Dx3ohB8mdVgoFlV9QSs4l2NPxJf-iWfC1HkLlrI98LyZAlEB5GEay6wQZtKbMdUDTlmXyWg/s400/c_compiler_svg.png" width="400" /></a></div>
<br />
<br />
As a quick background, the compiler is a fairly central part of the development process for compiled languages. Compiling a source code incurs a delay, more or less noticeable. That’s a cost.<br />
Interpreted languages (most scripts, <code>python</code>, <code>ruby</code>, <code>basic</code>, <code>bash</code>, etc.) can evade it, making the initial code writing experience more agreeable, with quick modification / experience feedback loop.<br />
The real cost though comes later, and it is steep : compiled languages have this constraint that the compiler must understand and therefore sanitize the code in order to produce the executable binary. This constraint becomes a huge advantage as it catches many categories of errors before they get a chance to run. This typically includes many flavors of mis-typings. Interpreted languages, in contrast, will have to find a majority of problems at run time (<i>note:</i> a good editor’s parser can definitely help both language types there).<br />
<br />
And the compiler can go much further. One of the big lessons from modern languages favoring safety like <code>rust</code> is that using the compiler as a primary tool to guide design patterns towards safer practices improves code quality substantially. It’s a good choice : the compiler is a compulsory part of the development chain, it sits close to the programmer, and its diagnosis is part of the valuable “short” feedback loop (in contrast with complementary techniques such as code analyzers, test suites and sanitizers). Whatever the compiler can flag gets solved more quickly, reducing load and risks at later stages of the development.<br />
<br />
Hence, this is the first topic to explore : let’s make the compiler work for us, check the validity of our code to the best of its abilities. To reach that goal, we will have to purposely leverage its capabilities, in effect help the compiler help us.<br />
<br />
And let’s start with its first weapon, <a href="https://fastcompression.blogspot.com/2019/01/the-type-system_19.html">the type system</a>.Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com3tag:blogger.com,1999:blog-834134852788085492.post-23437289862979455312018-03-14T19:29:00.002+01:002018-03-17T02:28:52.098+01:00xxHash for small keys: the impressive power of modern compilers<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZBXLNstt8IN8_W-PdqrdHTtDTDVA6u-cri6mrzFSLlVRYa_wQZewxWzfh81q-GKFBba9sp0mEcRer9gag3Pb0Hnyep9bVcJLVMN1eu5Lo73IDEujjf5HLgLr1ZhgDRJQJ2CMTNUe42_4/s1600/checksumming.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="282" data-original-width="385" height="146" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZBXLNstt8IN8_W-PdqrdHTtDTDVA6u-cri6mrzFSLlVRYa_wQZewxWzfh81q-GKFBba9sp0mEcRer9gag3Pb0Hnyep9bVcJLVMN1eu5Lo73IDEujjf5HLgLr1ZhgDRJQJ2CMTNUe42_4/s200/checksumming.png" width="200" /></a></div>
<div style="text-align: left;">
Several years ago, <a href="http://www.xxhash.com/">xxHash</a> was created as a companion error detector for <a href="http://www.lz4.org/">LZ4</a> <a href="https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md">frame format</a>. The initial candidate for this role was <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check#CRC-32_algorithm">CRC32</a>, but it turned out to be several times slower than LZ4 decompression, nullifying one of its key strengths.</div>
<br />
After some research, I found the great <a href="https://en.wikipedia.org/wiki/MurmurHash">murmurhash</a>, by <a href="https://github.com/aappleby">Austin Appleby</a>, alongside its validation tool <a href="https://github.com/aappleby/smhasher">SMHasher</a>. It nearly fitted the bill, running faster than LZ4, but still a bit too close for my taste. Hence started a game to see how much speed could be extracted from a custom hash formula while preserving good distribution properties.<br />
<br />
A lot of things happened since that starting point, but the main takeaway from this story is : xxHash was created mostly as a checksum companion, digesting long inputs.<br />
<br />
Fast forward to today, and xxHash is being used in more places than originally expected. The design has been expanded to create a second variant, XXH64, which is successful in video content and databases. <br />
But for several of these uses cases, hashed keys are no longer necessarily "large".<br />
<br />
In some cases, the need to run xxHash on small keys resulted in the <a href="https://bitbucket.org/runevision/random-numbers-testing/src/16491c9dfa/Assets/Implementations/HashFunctions/XXHash.cs?fileviewer=file-view-default#XXHash.cs-193">creation of dedicated variants</a>, that cut drastically through the decision tree to extract just the right suite of operations for the desired key. And it works quite well.<br />
<br />
That pushes the hash algorithm into territories it was not explicitly optimized for. Thankfully, one of SMHasher's test modules is dedicated to speed on small keys, so it helped to pay attention to the topic during the design phase. Hence performance on small keys is correct, but the dedicated functions push it to another level.<br />
<br />
Let's analyse the 4-byte hash example.<br />
Invoking the regular <span style="font-family: "courier new" , "courier" , monospace;">XXH32()</span> function on 4-bytes samples, and running it on my Mac OS-X 10.13 laptop (with compilation done by llvm9), I measure <b>233 MH/s</b> (Millions of hashes per second).<br />
Not bad, but running the dedicated 4-bytes function, it jumps to <b>780 MH/s</b>. That's a stark difference !<br />
<br />
Let's investigate further.<br />
xxHash offers an <a href="https://github.com/Cyan4973/xxHash/tree/3b589804fcd0379d652c405dabce5d049a10c918#build-modifiers">obscure build flag named XXH_PRIVATE_API</a>. The initial intention is to make all <span style="font-family: "courier new" , "courier" , monospace;">XXH_*</span> symbols <span style="font-family: "courier new" , "courier" , monospace;">static</span>, so that they do not get exposed on the public side of a library interface. This is useful when several libraries use xxHash as an embedded source file. In such a case, an application linking to both libraries will encounter multiple <span style="font-family: "courier new" , "courier" , monospace;">XXH_*</span> symbols, resulting in naming collisions.<br />
<br />
A side effect of this strategy is that function bodies are now available during compilation, which makes it possible to inline them. Surely, for small keys, inlining the hash function might help compared to invoking a function from another module ?<br />
Well, yes, it does help, but there is nothing magical. Using the same setup as previously, the speed improves to <b>272 MH/s</b>. That's better, but still far from the dedicated function.<br />
<br />
That's where the power of inlining can really kick in. <i><u>In the specific case that the key has a predictable small length</u></i>, it's possible to pass as length argument a <u>compile-time constant</u>, like <span style="font-family: "courier new" , "courier" , monospace;">sizeof(key)</span>, instead of a variable storing the same value. This, in turn, will allow the compiler to make some drastic simplification during binary generation, through dead code removal optimization, throwing away branches which are known to be useless.<br />
Using this trick on the now inlined <span style="font-family: "courier new" , "courier" , monospace;">XXH32()</span>, speed increases to <b>780 MH/s</b>, aka the same speed as dedicated function.<br />
<br />
I haven't checked but I wouldn't be surprised if both the dedicated function and the inlined one resulted in the same assembly sequence.<br />
But the inlining strategy seems more powerful : no need to create, and then maintain, a dedicated piece of code. Plus, it's possible to generate multiple variants, by changing the "length" argument to some other compile-time constant.<br />
<br />
<table>
<thead>
<tr>
<th>object</th>
<th>XXH32()</th>
<th>XXH32 inlined</th>
<th>XXH32 inlined + <br />
length constant</th>
<th>dedicated XXH32 function</th>
</tr>
</thead>
<tbody>
<tr>
<td>4-bytes field</td>
<td><div style="text-align: center;">
233 MH/s</div>
</td>
<td><div style="text-align: center;">
272 MH/s</div>
</td>
<td><div style="text-align: center;">
780 MH/s</div>
</td>
<td><div style="text-align: center;">
780 MH/s</div>
</td>
</tr>
</tbody>
</table>
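The mechanism at play here is not specific to xxHash. The following toy hash (a deliberately simplified sketch, <i>not</i> xxHash's actual formula) shows how an inlined function with a compile-time-constant <code>len</code> lets the compiler delete the branches that can never trigger:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy hash, NOT xxHash's actual formula : just here to show the
 * dead-code-removal effect. When `len` is a compile-time constant
 * (e.g. sizeof(key) == 4), the compiler unrolls the first loop once
 * and deletes the tail loop entirely. */
static inline uint32_t toyhash(const uint8_t* p, size_t len, uint32_t seed)
{
    uint32_t h = seed ^ (uint32_t)len;
    while (len >= 4) {   /* 4-byte rounds */
        uint32_t const w = (uint32_t)p[0] | ((uint32_t)p[1] << 8)
                         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        h = (h ^ w) * 2654435761u;
        p += 4; len -= 4;
    }
    while (len) {        /* tail bytes : dead code when len is a constant 4 */
        h = (h ^ *p++) * 2246822519u;
        len--;
    }
    return h ^ (h >> 15);
}
```

The same effect applies to the inlined <code>XXH32()</code> when invoked with <code>sizeof(key)</code> as length argument.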
<br />
<br />
Another learning is that inlining is quite powerful for small keys, but the <span style="font-family: "courier new" , "courier" , monospace;">XXH_PRIVATE_API</span> build macro does a poor job of underlining its effectiveness.<br />
<br />
As a consequence, next release of xxHash will introduce a <a href="https://github.com/Cyan4973/xxHash#build-modifiers">new build macro, named <span style="font-family: "courier new" , "courier" , monospace;">XXH_INLINE_ALL</span></a>. It does exactly the same thing, but its performance impact is better documented, and I suspect the name itself will make it easier for developers to anticipate its implications.Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com8tag:blogger.com,1999:blog-834134852788085492.post-52507841551253837222018-02-16T20:19:00.001+01:002022-04-05T17:54:40.014+02:00When to use Dictionary Compression<br />
<div class="separator" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: center;">
<img border="0" data-original-height="408" data-original-width="1024" height="79" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1uwJ8gnp2rP60ZNaPaxks87e5jdO7Nqn-WoKVjzsp_zgwWI5sydN4DYJEZw86CywoSSJXEh85e2tSNZ-5hXQXZsi54LTGAGnJyXLZDi9vbwYKjkzBs_nUstlspPXAzYOIPnANzCap91I/s200/dictionary_pic.jpg" width="200" /></div>
On the <a href="https://facebook.github.io/zstd/">Zstandard website</a>, there is a small chapter dedicated to <a href="https://github.com/facebook/zstd#the-case-for-small-data-compression">Dictionary compression</a>. In a nutshell, it explains that it can dramatically improve compression ratio for small files. Which is correct, but doesn’t nearly capture the impact of this feature. In this blog post, I want to address this topic, and present a few scenarios in which dictionary compression offers more than just “improved compression”.<br />
<br />
<h3 id="database-example">
Database example</h3>
Let’s start with a simple case. Since dictionary compression is good for small data, we need an application which handles small data. For example a log record storage, which can be simplified as an append-only database.<br />
Writes are append-only, but reads can be random. A single query may require retrieving multiple individual records scattered throughout the database, in no particular order.<br />
As usual, the solution needs to be storage efficient, so compression plays an important role. And records tend to be repetitive, so compression ratio will be good.<br />
However, this setup offers a classic example of the “block-size trade-off” : in order to get any compression done, it’s necessary to group records together into “blocks”, then compress each block as a single entity. The larger the block, the better the compression ratio. But now, in order to retrieve a single record, it’s necessary to decompress the whole block, which translates into extra-work during read operations.<br />
<br />
<div style="text-align: center;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjF-XxT6vNS0TjDcu1YZU_bfj0hlWFmuF71lbJ5vD4Jg-1tjC4_N5ZsPsPqGtiHakIgDVBBSUa4L7VBvidmF9MZ-33u56iRdTZ3MGPpw-CQBcidkdneusMTHNvfWZlAu5UOOnojJb-SyKE/s1600/RatioVsBlock.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="635" data-original-width="1279" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjF-XxT6vNS0TjDcu1YZU_bfj0hlWFmuF71lbJ5vD4Jg-1tjC4_N5ZsPsPqGtiHakIgDVBBSUa4L7VBvidmF9MZ-33u56iRdTZ3MGPpw-CQBcidkdneusMTHNvfWZlAu5UOOnojJb-SyKE/s640/RatioVsBlock.png" width="640" /></a></div>
<i>Impact of Block Size on Compression ratio (level 1)</i></div>
<div style="text-align: center;">
<i>Last data point compresses records individually ([250-550] bytes)</i>
</div>
<div style="text-align: center;">
<i><br /></i></div>
The “optimal block size” is application dependent. It highly depends on data (how repetitive it is), and on usage scenarios (how many random reads). Typical sizes may vary between 4 KB and 128 KB.<br />
<br />
Enter dictionary compression. It is now possible to train the algorithm to become specifically good at compressing a selected type of data. <br />
(For simplicity purpose, in this example, we’ll assume that the log storage server stores a limited variety of record types, small enough to fit into a single dictionary. If not, it is necessary to separate record types into “categories”, generate a dictionary per category, then route each record into a separate pool based on category. More complex, but it doesn’t change the principles.)<br />
As a first step, we can now just use the dictionary to grab additional compression (see graph), and it's an instant win. But that wouldn’t capture all the value from this operation.<br />
<br />
The other benefit is that now, the optimal block size can be adjusted to the new conditions, since dictionary compression preserves a better compression ratio (even a 1 KB block size becomes practicable!). What we gain in exchange is improved performance for random read extraction. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXvIreMq2cYtqKmIcWLj_axHBnAmMGj7zt6D0k17ILCHraTh2-MTgzGjQQrlQZsMjOSObvWC9uc_KJXQQxFhCD2UC3esSmjoFydblfn1QjInPGEXYCvtyr0BKXEJWXAabASW8s0jdqnh8/s1600/BlockVsRead.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="635" data-original-width="1279" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXvIreMq2cYtqKmIcWLj_axHBnAmMGj7zt6D0k17ILCHraTh2-MTgzGjQQrlQZsMjOSObvWC9uc_KJXQQxFhCD2UC3esSmjoFydblfn1QjInPGEXYCvtyr0BKXEJWXAabASW8s0jdqnh8/s640/BlockVsRead.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i>Impact of Block Size on single record extraction (Random Reads)</i></div>
<br />
At its extreme, it might even be possible to compress each record individually, so that no more energy is lost decompressing additional records which aren’t used by the query. But that's not even required.<br />
This change produces huge wins for random reads (collecting “single records” scattered throughout the database). In this example, the log storage can now extract up to 2.5M random records / second, whereas without a dictionary it was limited to 400K random records / second, at the cost of a pitiful compression ratio. <br />
<br />
In fact, the new situation <em>unlocks</em> new possibilities for the database, which can now accept queries that used to be considered too prohibitive, and might previously have required <em>another</em> dedicated database to serve them.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYTxZZyzLf8biB0Qvb-wYh15TGn82fqOQOfbD1zS-wSBmiUGswNgSjq09tKCgbVe1PcwBasHSRunwCCfuNaP9kJ7jNdRO0hu-a3t78ReU5EyZcpMIcdp1THLBB3Q38NW70kCnYcUHHiyo/s1600/RatioVsReads.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="635" data-original-width="1279" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYTxZZyzLf8biB0Qvb-wYh15TGn82fqOQOfbD1zS-wSBmiUGswNgSjq09tKCgbVe1PcwBasHSRunwCCfuNaP9kJ7jNdRO0hu-a3t78ReU5EyZcpMIcdp1THLBB3Q38NW70kCnYcUHHiyo/s640/RatioVsReads.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i>Normal Vs Dictionary Compression (level 1)<br />Comparing Compression Ratio and Random Reads per second</i></div>
<br />
That’s when the benefits of dictionary compression really kick in : beyond “better compression”, it unlocks new capabilities that were previously rightly considered “out of reach”.<br />
<br />
<h3 id="massive-connections">
Massive connections</h3>
Another interesting scenario is when a server maintains a large number of connections (multiple thousands) with many clients. Let’s assume the connection is used to send / receive requests respecting a certain protocol, so there tends to be a lot of inter-message repetition. But within a single message, compression potential is low, because each message is rather small.<br />
<br />
The solution to this topic is known : use streaming compression. Streaming will keep in memory the last N KB (configurable) of messages and use that knowledge to better compress future messages. It reaches excellent performance, as the same tags and sequences repeated across messages get squashed in subsequent messages.<br />
<br />
The problem is, preserving such a “context” costs memory. And a lot of memory that is. To provide some perspective, a <a href="http://www.zstd.net/">Zstandard</a> streaming context with a history of 32 KB requires 263 KB of memory (about the same as zlib). Of course, increasing history depth improves compression ratio, but also increases the memory budget.<br />
That doesn’t look like a large budget in isolation, or even for a few connections, but when applied to 10 thousand client connections, we are talking about > 2.5 GB of RAM. The situation worsens with more connections, or when trying to improve ratio through larger history and higher compression levels.<br />
<br />
In such a situation, dictionary compression can help. Training a dictionary to be good at compressing a set of protocols, then initializing each new connection with this dictionary, will produce instant benefits during the early stage of the connection's lifetime. Fine, but then, streaming history takes over, so gains are limited, especially when the connection's lifetime is long.<br />
<br />
A bigger benefit can be realised by observing that the dictionary can be used to completely eliminate streaming. The dictionary will then partially offset the compression ratio reduction, so mileage can vary, but in many cases, the ratio only takes a small hit, as generic redundant information is already included in the dictionary anyway.<br />
What we get in exchange are 2 things :<br />
<ul>
<li>huge RAM savings : it’s no longer necessary to preserve a context <em>per client</em>, a single context <em>per active thread</em> is now enough, which is typically several orders of magnitude less than the number of clients.</li>
<li>a vastly simplified communication framework, as each request is now technically “context-less”, eliminating an important logic layer dedicated to keeping contexts synchronised between sender and receiver (with all the funny parts related to time-outs, missed or out-of-order packets, etc.).</li>
</ul>
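A quick back-of-envelope check of the figures above (the 16-thread count is an arbitrary assumption for illustration):

```c
#include <assert.h>
#include <stdint.h>

enum {
    CONTEXT_BYTES = 263 * 1024,   /* Zstandard streaming context, 32 KB history */
    NB_CLIENTS    = 10000,
    NB_THREADS    = 16            /* hypothetical nb of active worker threads */
};

/* total memory spent on compression contexts */
static uint64_t total_memory(uint64_t perContext, uint64_t nbContexts)
{
    return perContext * nbContexts;
}
```

One context per client costs 263 KB × 10 000 ≈ 2.7 GB, consistent with the “> 2.5 GB” figure, while one context per active thread costs about 4.3 MB : several orders of magnitude less.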
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5LZzgTBQrCBgakcv4a3mie6MgXp-jlYDOWKTAqT8RplzpJvBI1MDY04COSKZc3VEG8AxJ8PAGiWRoDrqDnYeZIpqDIlOEp20SVgUkD0fR4rpmyueWhGF42Z52vwe3Mw8oS0OTzfj3UZs/s1600/memBudget.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="722" data-original-width="1408" height="328" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5LZzgTBQrCBgakcv4a3mie6MgXp-jlYDOWKTAqT8RplzpJvBI1MDY04COSKZc3VEG8AxJ8PAGiWRoDrqDnYeZIpqDIlOEp20SVgUkD0fR4rpmyueWhGF42Z52vwe3Mw8oS0OTzfj3UZs/s640/memBudget.png" width="640" /></a></div>
<div style="text-align: center;">
<i>Memory budget comparison</i></div>
<div style="text-align: center;">
<i><i>Context-less strategy requires much less memory (and is barely visible)</i></i></div>
<i>
</i>
<br />
<div style="text-align: center;">
<br /></div>
Eliminating the RAM requirement, which evolves from dominant to negligible, and reducing complexity open up new possibilities in turn, such as hosting even more connections per server, or improved performance for other local applications. Or simply reducing the number of servers required for the same task.<br />
<br />
These kinds of scenarios are where Dictionary Compression gives its best : beyond “better compression for small data”, it makes it possible to build the target application differently, with second-order effects as important as, if not more important than, the compression savings alone.<br />
<br />
These are just 2 examples. And I’m sure there are more. But that’s enough to illustrate the topic, and probably enough words for a simple blog post.Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com9tag:blogger.com,1999:blog-834134852788085492.post-67592498641395915072018-01-12T01:57:00.000+01:002018-01-12T17:51:52.343+01:00Zstandard Overview<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/facebook/zstd/dev/doc/images/zstd_logo86.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="79" data-original-width="83" src="https://raw.githubusercontent.com/facebook/zstd/dev/doc/images/zstd_logo86.png" /></a></div>
I recently realised that, while there is a <a href="https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md">specification for Zstandard</a>, which describes in great detail what is encoded where, there is no “overview” of the format, neither too detailed nor too vague, that would let programmers with a casual interest in data compression understand its inner workings. This blog post is an attempt to correct that.<br />
<br />
<h3 id="introduction">
Introduction</h3>
<a href="http://www.zstd.net/">Zstandard</a> is an <a href="https://en.wikipedia.org/wiki/LZ77_and_LZ78">LZ77-class</a> compressor, which primarily achieves compression by referencing, in past data, a segment of bytes identical to the bytes that follow. Zstandard features a few other additional capabilities, but they don’t change the core formula. This construction offers several advantages, primarily speed related, especially on the decoder side, since a memory copy operation is all it takes to regenerate a bunch of bytes. Moreover, simple pointer arithmetic is enough to locate the reference to copy, which is as frugal as it gets, both cpu and memory wise.<br />
<br />
<h3 id="blocks">
Blocks</h3>
<a href="http://www.zstd.net/">Zstandard</a> format is block-oriented. It can only start decoding data when a first full block arrives (with the minor exception of uncompressed blocks). But it’s nonetheless stream-capable : any large input is automatically cut into smaller blocks, and decoding starts as soon as the first block arrives. <br />
A block can have any size, up to a maximum of 128 KB. There are multiple reasons for such a limit to exist. <br />
It’s not a concern related to initial latency for the first block, since the format allows this block to have <em>any</em> size up to maximum, so it can be made explicitly small whenever necessary. <br />
The maximum block size puts an upper limit to the amount of data a decoder must handle in a single operation. The limit makes it possible to allocate a number of resources which are guaranteed to be enough for whatever data will follow. <br />
There are also other concerns into the mix, such as the relative weight of headers and descriptors, time spent to build tables, local adaptation to source entropy, etc. 128 KB felt like a good middle ground providing a reasonable answer to all these topics.<br />
It follows that a small source can be compressed into a single block, while larger ones will need multiple blocks. <br />
The organisation of all these blocks into a single content is called a frame.<br />
<br />
<h3 id="frame">
Frame</h3>
A frame will add a number of properties shared by all blocks in the current frame. <br />
To begin with, it can restrict the maximum block size even further : the largest maximum is 128 KB, but for a given frame, it can be set to a value as small as 1 KB. <br />
The frame can tell upfront how much data will be regenerated from its content, which can be useful to pre-allocate a destination buffer. <br />
Most importantly, it can tell how much “past data” must be preserved in memory to decode the next block. This is the “window size”, which has direct consequences on buffer sizes. <br />
There are other properties stored there, but it’s not in the scope of this article to describe all of them. Should you wish to know more, feel free to consult the <a href="https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#frame_header">specification</a>.<br />
Once these properties are extracted, the decoding process is fairly straightforward : decompress data block after block.<br />
<br />
<h3 id="literals">
Literals</h3>
A compressed block consists of 2 sections : literals and sequences.<br />
Literals are the “left over” from the LZ77 mechanism : remember, LZ77 compresses the next bytes by referencing an identical suite of bytes in the past. Sometimes, there is simply no such thing. Trivially, this is necessarily the case at the beginning of each source. <br />
In such a case, the algorithm outputs some “raw bytes”. These bytes are not compressed by LZ77, but they generally can be compressed using another technique : <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman compression</a>.<br />
The principle behind Huffman is quite different : it transforms bytes into prefix codes using a variable number of bits, assigning a low number of bits to frequent characters, while sacrificing infrequent characters with more bits. <br />
The Huffman algorithm makes it possible to select the most efficient repartition of prefix codes. <br />
When all bytes are equally present, it’s not possible to compress anything. But that’s generally not the case. <br />
Consider a standard text file using the ASCII character set : a whole range of byte values (those >128) will not be present, and some characters (like ‘e’) are expected to be more common than others (like ‘q’). <br />
This is the kind of irregularity that Huffman can exploit to provide some compression for these left-over bytes. Typical gains range between 20% and 40%.<br />
The literal section can be left uncompressed (mostly when it’s very small, since describing a Huffman table costs multiple bytes), or compressed as a single stream of bits, or using multiple (4) streams of bits. <br />
The multi streams strategy has been explained in <a href="https://fastcompression.blogspot.com/2015/10/huffman-revisited-part-5-combining.html">another post</a>, and is primarily designed for improved decoding speed.<br />
All literals are decompressed into their own buffer. The buffer size is primarily limited by the block size, since in worst case circumstances, LZ77 will fail completely, leaving only Huffman to do the job.<br />
<br />
<h3 id="sequences">
Sequences</h3>
Obviously, we expect LZ77 to be useful. Its outcome is described in the second section, called “sequences”. <br />
A block is rebuilt by a succession of sequences. <br />
A “sequence” describes a number of bytes to copy from the literals buffer, followed by a number of bytes to copy from past data, with an associated offset to locate the reference. <br />
These values are of a different nature, so they are encoded using 3 separate alphabets. <br />
Each alphabet must be described, and there is a small header for each of them at the beginning of the section. <br />
The compression technique used here is <a href="https://github.com/Cyan4973/FiniteStateEntropy/">Finite State Entropy</a>, a <a href="https://arxiv.org/abs/1311.2540">tANS</a> variant, which offers better compression for dominant symbols. <br />
Dominant symbols lose a lot of precision with Huffman, resulting in a loss of compression. They are unlikely to be present in “left over” literals, but for sequence symbols, the situation is less favourable. <br />
FSE solves this issue, by being able to encode symbols using a fractional number of bits. <br />
If you are interested in how FSE works, there is a <a href="https://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html">series of articles</a> which tries to describe it, but be aware that it’s fairly complex.<br />
All sequence symbols are interleaved in a single bitstream, which must be read backward, due to the ANS property of inverting direction between encoding and decoding. <br />
On 64-bit CPUs, a single read operation is generally enough to grab all the bits necessary to decode the 3 symbols forming a sequence. All it takes now is to apply the sequence : copy some bytes from the literals buffer, then copy some bytes from the past. <br />
Decode the next sequence. Rinse, repeat. Stop when there is no sequence left in the bitstream. <br />
At this stage, whatever remains in the literals buffer is simply copied to complete the block.<br />
And the decoder can move on to the next block.<br />
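The copy step above can be sketched in a few lines of C. This is a minimal illustrative version, not zstd's actual implementation : each sequence first consumes literals, then copies the match from the already-decoded output. The match copy must be overlap-safe, since the offset can be smaller than the match length.

```c
#include <stddef.h>

/* Minimal sketch of sequence execution (illustrative, not zstd's code).
 * A sequence copies `litLen` bytes from the literals buffer, then
 * `matchLen` bytes from `offset` positions back in the output itself.
 * Returns the new output position. The match is copied byte by byte
 * because offset may be smaller than matchLen (overlapping copy). */
typedef struct { size_t litLen; size_t matchLen; size_t offset; } Seq;

static size_t apply_sequence(unsigned char* out, size_t outPos,
                             const unsigned char* lits, size_t* litPos,
                             Seq s)
{
    for (size_t i = 0; i < s.litLen; i++)
        out[outPos++] = lits[(*litPos)++];
    for (size_t i = 0; i < s.matchLen; i++) {   /* overlap-safe copy */
        out[outPos] = out[outPos - s.offset];
        outPos++;
    }
    return outPos;
}
```

Note how an offset of 2 with a match length of 4 legitimately re-reads bytes written by the same sequence, which is how LZ77 encodes short repeating patterns.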
<br />
<h3 id="window">
Window</h3>
While decoding literals and sequences is a block-oriented job, which could be achieved in parallel across multiple blocks (expect a multi-threaded version in the future), the LZ copy operation is not. <br />
It depends on previous blocks being already decoded, and is therefore serial in nature. <br />
That’s where the frame header comes into play : it specifies how much past data the decoder must keep around to be able to decode the next block. <br />
The specification recommends keeping this value <= 8 MB, though it’s only a recommendation. <code>--ultra</code> levels, for example, go beyond this limit. <br />
In most cases though, the decoder will not need that much. All levels <= 10 for example, which tend to be preferred due to their speed, require a memory budget <= 2 MB. <br />
As could be guessed, using less memory is also good for speed.<br />
<br />
<h3 id="wrap-up">
Wrap up</h3>
That’s basically it. All these operations form the core of the Zstandard compression format. There are a few more small details involved, such as repeat offset symbols, shortened headers with repeated tables, and so on, which are described in the <a href="https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md">specification</a>, but this description should be enough to grasp the essence of the decoder.<br />
The encoder is a bit more complex, not least because there are, in fact, multiple encoders. <br />
The format doesn’t impose a single way to find or select references into the past. At every position in the file, there are always multiple ways to encode what’s next. <br />
That’s why different strategies exist, providing different speed / compression trade-offs. Lower levels are mapped onto LZ4-like parsing, and are very fast. Upper levels can be very complex, on top of very slow, offering improved compression ratios.<br />
But the decoder always remains the same, preserving its speed properties.Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com3tag:blogger.com,1999:blog-834134852788085492.post-84055493869538590772017-07-13T12:32:00.000+02:002017-07-13T19:35:28.421+02:00Dealing with library version mismatch<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMttWcqXRKsDHUw_PXcRBbULB8y-N723Ep88VgBaz7vhuOOhYuMTFzkKkO3Z2e5xXkjJwfbroD2fngWTBI8shn99cy8z6XKjKmrin_Qp5gDW89vB-wYDEIkwa2NR8WIs7Kq2dO3259_Q0/s1600/mysqldump_version_mismatch.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="213" data-original-width="320" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMttWcqXRKsDHUw_PXcRBbULB8y-N723Ep88VgBaz7vhuOOhYuMTFzkKkO3Z2e5xXkjJwfbroD2fngWTBI8shn99cy8z6XKjKmrin_Qp5gDW89vB-wYDEIkwa2NR8WIs7Kq2dO3259_Q0/s200/mysqldump_version_mismatch.png" width="200" /></a></div>
<i> Note : this article was initially written as an answer to <a href="https://fastcompression.blogspot.com/2017/07/the-art-of-designing-advanced-api.html?showComment=1499686140209#c1828877728309860456">David Jud's comment</a>, but it became long enough to be worth converting into a full blog entry.</i><br />
In the previous article, I attempted to introduce a few challenges related to designing an extensible API.<br />
<br />
In this one, I'll cover an associated but more specific topic, on how to handle a library version mismatch.<br />
<br />
Version mismatch is a more acute problem in a DLL scenario. In a static linking scenario, the programmer has several advantages :<br />
<ul>
<li>Compiler will catch errors (if a type, or a prototype, has changed for example). This gives time to fix these errors. Of course, the application maintainer will prefer that a library update <i>doesn't</i> introduce any change in existing code, but worst case is, most errors should be trapped before shipping the product.</li>
<li>Compiler will automatically adapt ABI changes : if an updated type is larger/smaller than previous version, it will be automatically converted throughout the code base. Same thing happens in case of <span style="font-family: "courier new" , "courier" , monospace;">enum</span> changes : adaptation to new <span style="font-family: "courier new" , "courier" , monospace;">enum</span> values is automatically applied by compiler.</li>
<li>The library is available during compilation, which means the programmer has a local copy that can be updated (or not) according to project requirements.</li>
</ul>
<br />
Well, this last property is not always true : in larger organisations, the library might belong to a "validated" pool, which cannot be easily adapted for a specific project. In that case, the user program will either have to host its own local copy, or adapt to the one selected by its organisation.<br />
<br />
But you get the idea : problematic version mismatches are likely to be trapped or automatically fixed by the compiler, and should therefore be corrected before shipping a binary. Of course, the fewer changes, the better. Program maintainers will appreciate a library update that is as transparent as possible.<br />
<div>
<br /></div>
<div>
For a dynamic library though, the topic is a lot harder.</div>
<div>
To begin with, the user program typically does not have direct control over the library version deployed on the target system. So it could be anything. The library could be more recent, or older, than expected during program development.</div>
<div>
<br /></div>
Now these two types of mismatches are quite different, and trigger different solutions :<br />
<br />
<b><u>Case 1 - library version is higher than expected</u></b><br />
<div>
<br /></div>
This one can, and should, be solved by the library itself.<br />
<br />
It's relatively "easy" : never stop supporting older entry points.<br />
This is of course easier said than done, but to be fair, it's first and foremost a question of respecting a policy, and therefore is not out of reach.<br />
Zstandard tries to achieve that by guaranteeing that any prototype reaching "stable" status will be there "forever". For example, <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_getDecompressedSize()</span>, which has been recently superseded by <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_getFrameContentSize()</span>, will nonetheless remain an accessible entry point in future releases, because it's labelled "stable".<br />
<br />
A more subtle problem is ABI preservation, in particular structure size and alignment.<br />
Suppose, for example, that version v1.1 defines a structure of size 40 bytes.<br />
But v1.2 adds some new capabilities, and now the structure has a size of 64 bytes.<br />
All previous fields from v1.1 are still there, at their expected place, but there are now more fields.<br />
<br />
The user program, expecting v1.1, would allocate the 40-bytes version, and pass that as an argument to a function expecting a 64-bytes object. You can guess what will follow.<br />
<br />
This could be "manually" worked around by inserting a "version" field and dynamically interpreting the object with the appropriate parser. But such manipulation is a recipe for complexity and errors.<br />
That's why structures are pretty dangerous. For best safety, structure definition must remain identical "forever", like the approved "stable" prototypes.<br />
<br />
In order to avoid such issues, it's recommended to use <a href="http://www.embedded.com/electronics-blogs/programming-pointers/4024893/Incomplete-types-as-abstractions">incomplete types</a>. This forces the creation of the underlying structure through a process entirely controlled by the current library, whatever its version, thus avoiding any kind of size/alignment mismatch.<br />
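A minimal sketch of this incomplete-type pattern, with hypothetical names (the <span style="font-family: "courier new" , "courier" , monospace;">MyLib_*</span> identifiers are invented for illustration) : the public header only declares the type, while the definition, allocation and release stay inside the library, so the structure can grow between versions without any ABI break.

```c
#include <stdlib.h>

/* --- what the public header exposes (hypothetical names) --- */
typedef struct MyLib_State_s MyLib_State;  /* incomplete type : size unknown to callers */
MyLib_State* MyLib_create(void);
void         MyLib_free(MyLib_State* s);

/* --- what stays inside the library --- */
struct MyLib_State_s {
    int level;
    /* v1.2 can append fields here : applications built against the
     * v1.1 header are unaffected, since they never allocate this
     * structure themselves */
};

MyLib_State* MyLib_create(void) { return calloc(1, sizeof(MyLib_State)); }
void MyLib_free(MyLib_State* s) { free(s); }
```

Since the application only ever holds a pointer, the size and alignment of the object are always decided by whichever library version is actually loaded.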
<br />
When above conditions are correctly met, the library is "safe" to use by applications expecting an older version : all entry points are still there, behaving as expected.<br />
<br />
Whenever this condition cannot be respected anymore, an accepted work-around is to increase the Major digit of the version, indicating a breaking change.<br />
<div>
<br /></div>
<br />
<b><u>Case 2 - library version is lower than expected</u></b><br />
<div>
<b><u><br /></u></b></div>
This one is more problematic.<br />
Basically, responsibility is now on the application side. It's up to the application to detect the mismatch and act accordingly.<br />
<br />
In <a href="https://fastcompression.blogspot.com/2017/07/the-art-of-designing-advanced-api.html?showComment=1499686140209#c1828877728309860456">David Jud's comment</a>, he describes a pretty simple solution : if the library is not at the expected version, the application just stops there.<br />
Indeed, that's one way to safely handle the matter.<br />
<br />
But it's not always desirable. An application can have multiple library dependencies, and not all of them might be critical.<br />
For example, maybe the user program accesses several libraries offering similar services (encryption for example). If one of them is not at the expected version, and cannot be made to work, it's not always a good reason to terminate the program : maybe there are already plenty of capabilities available without this specific library, and the program can run, just with fewer options.<br />
<br />
Even inside a critical library dependency, some new functionality might be optional, or there might be several ways to get one job done.<br />
Dealing with this case requires writing some "version dependent" code.<br />
This is not an uncommon situation by the way. Gracefully handling potential version mismatches is one thing highly deployed programs tend to do well.<br />
<br />
Here is how it can be made to work : presuming the user application wants access to a prototype which is only available in version v1.5+, it first tests the version number. If condition matches, then program can invoke target prototype as expected. If not, a backup scenario is triggered, be it an error, or a different way to get the same job done.<br />
<div>
<br /></div>
<div>
Problem is, this test must be done statically.<br />
For example, in <a href="https://facebook.github.io/zstd/">Zstandard</a>, it's possible to ask for library version at runtime, using <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://github.com/facebook/zstd/blob/master/lib/zstd.h#L65">ZSTD_versionNumber()</a></span>. But unfortunately, it's already too late.<br />
Any invocation of a new function, such as <a href="https://github.com/facebook/zstd/blob/master/lib/zstd.h#L95"><span style="font-family: "courier new" , "courier" , monospace;">ZSTD_getFrameContentSize()</span></a> which only exists since v1.3.0, will trigger an error at link time, even if the invocation itself is protected by a prior check with <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_versionNumber()</span> .<br />
<br />
What's required is to selectively remove any reference to such a prototype from the compilation and linking stages, which means this code cannot exist. It can be excluded through the pre-processor.<br />
So the correct method is to use a macro definition, in this case, <a href="https://github.com/facebook/zstd/blob/master/lib/zstd.h#L64"><span style="font-family: "courier new" , "courier" , monospace;">ZSTD_VERSION_NUMBER</span></a> . </div>
<div>
<br /></div>
<div>
Example :</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">#if ZSTD_VERSION_NUMBER >= 10300</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">size = ZSTD_getFrameContentSize(src, srcSize);</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">#else</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">size = ZSTD_getDecompressedSize(src, srcSize);</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">/* here, </span><span style="font-family: "courier new" , "courier" , monospace;">0-size answer can be mistaken for "error", so add some code to mitigate the risk */</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">#endif</span></div>
<div>
<br /></div>
<div>
That works, but requires compiling the binary with the correct version of the <span style="font-family: "courier new" , "courier" , monospace;">zstd.h</span> header file.<br />
When the program is compiled on target system, it's a reasonable expectation : if <span style="font-family: "courier new" , "courier" , monospace;">libzstd</span> is present, <span style="font-family: "courier new" , "courier" , monospace;">zstd.h</span> is also supposed to be accessible. And it's reasonable to expect them to be synchronised. There can be some corner case scenarios where this does not work, but let's say that in general, it's acceptable.<br />
<br />
The detection can also be done through a <span style="font-family: "courier new" , "courier" , monospace;">./configure</span> script, in order to avoid an <span style="font-family: "courier new" , "courier" , monospace;">#include</span> error during compilation, should the relevant <span style="font-family: "courier new" , "courier" , monospace;">header.h</span> not even be present on the target system, as sometimes the library is effectively optional to the program.<br />
<br />
But compilation from the server side is a different topic. While it is highly perilous to pre-compile a binary using dynamic libraries and then deploy it, this is nonetheless the method selected by many repositories, such as <span style="font-family: "courier new" , "courier" , monospace;">aptitude</span>, in order to save deployment time. In that case, the binary is compiled for "system-provided libraries", whose minimum version is known, and the repository can solve dependencies. Hence, by construction, the case <i>"library has a lower version than expected"</i> is not supposed to happen. Case closed.</div>
<div>
<br />
So, as we can see, the situation is solved either by local compilation and clever usage of preprocessing statements, or by dependency solving through repository rules.<br />
<br />
Another possibility exists, and is, basically, the one proposed in <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_CCtx_setParameter()</span> API : the parameter to set is selected through an <span style="font-family: "courier new" , "courier" , monospace;">enum</span> value, and if it doesn't exist, because the local library version is too old to support it, the return value signals an error.</div>
<div>
<br /></div>
<div>
Using this API safely feels a lot like the previous example, except that now, it becomes possible to check the library version at runtime :</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">if (ZSTD_versionNumber() >= 10500) {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> return ZSTD_CCtx_setParameter(cctx, ZSTD_p_someNewParam, value);</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">} else {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> return ERROR(capability_not_present);</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
</div>
<div>
<br /></div>
<div>
This time, there is no need to be in sync with the correct <span style="font-family: "courier new" , "courier" , monospace;">header.h</span> version. As the version number comes directly at runtime from the library itself, it's necessarily correct.<br />
<br />
Note however that <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_CCtx_setParameter()</span> only covers the topic of "new parameters". It cannot cover the topic of "new prototypes", which still requires using exclusion through macro detection.</div>
<div>
<br /></div>
<div>
So, which approach is best ?</div>
<div>
<br /></div>
<div>
Well, that's the good question to ask. There's a reason the new advanced API is currently in "experimental" mode : it needs to be field tested, to experience its strengths and weaknesses. There are pros and cons to both methods.<br />
And now, the matter is to select the better one...</div>
Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com5tag:blogger.com,1999:blog-834134852788085492.post-70553101863137587672017-07-06T01:06:00.000+02:002017-07-13T19:36:25.917+02:00The art of designing advanced API<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.bls.gov/bls/api_image.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="220" data-original-width="300" height="146" src="https://www.bls.gov/bls/api_image.png" width="200" /></a></div>
A library <a href="https://en.wikipedia.org/wiki/Application_programming_interface">API (Application Programming Interface)</a> is even more important than its implementation.<br />
<br />
There are many reasons for this statement :<br />
- An API exposes a <a href="https://en.wikipedia.org/wiki/Abstraction_(software_engineering)">suitable abstraction</a>. Should it prove broken, unclear or just too complex, the library will be misused, which will ultimately be paid for in users' time (multiplied by the number of users).<br />
- An API is a contract. Break it, and existing applications can no longer work with your library. Adaptation cost is once again paid by users' time (if it ever happens).<br />
- Because of this, APIs tend to stick around for a <i>long</i> time, much longer than the underlying implementation.<br />
<div>
<br />
If an implementation is modified to provide, say, a 5% speed improvement, it's all free : every user can immediately benefit from it without further hassle. But if one has to add a single parameter, it's havoc time.<br />
<br />
Because it's so stressful to modify an API, one can be tempted to think very hard once, in order to design and expose a <i>perfect API</i>, one that will stand the test of time and will never need to be changed. This search is (mostly) a delusion.<br />
- A perfect API, embedding such a strong restriction to never change in the future, can take forever to build, all under intense stress, as there is always a question mark hanging around : "is there a use case that it does not cover ?". Eventually, it's only a matter of time before you (or your users) find one.<br />
- A perfect API leans towards a "complex API", as the requirement to support everything makes it accumulate parameters and controls, becoming more and more difficult for users to manage.<br />
- "Complex" quickly leads to "misleading" : some "future scenarios", for which there is no current implementation, and maybe no current need, will turn out to be bad ideas after all, but the side effects of this useless abstraction will remain in the API.</div>
<div>
<br /></div>
<div>
So, the next great idea is to plan for API changes.<br />
The way <a href="https://github.com/facebook/zstd">Zstandard library</a> tries to achieve this is by quickly converging towards some very simple prototypes, which offer "core" functionalities at a low complexity level.<br />
Then, more complex use cases, not covered by the simple API, do show up, and the need to serve them introduces the creation of an "experimental section", a place where it's still possible to play with the API, trying to find an optimal abstraction for the intended use case, before moving into "stable" status (aka : this method will no longer change).</div>
<div>
<br /></div>
<div>
A consequence of this strategy is the creation of more and more prototypes, dedicated to serving their own use case.<br />
Need to compress with dictionary ? Sure, here comes <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_compress_usingDict()</span> .<br />
Need to process data in a streaming fashion ? Please find <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_compressStream()</span> .<br />
In a streaming fashion with a dictionary ? <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_compressStream_usingDict()</span> .<br />
Need control over specific parameters ? Go to <span style="font-family: "courier new" , "courier" , monospace;">_advanced()</span> variants.<br />
Preprocess dictionary for faster loading time ? Here are <span style="font-family: "courier new" , "courier" , monospace;">_usingCDict()</span> variants.<br />
Some multithreading maybe ? <span style="font-family: "courier new" , "courier" , monospace;">ZSTDMT_*()</span></div>
<div>
Combine all this ? Of course. Here is a list of a gazillion methods.<br />
<br />
As one can see, this doesn't scale too well. It used to be "good enough" for a dozen or so methods, but as combinatorial complexity explodes, it's no longer suitable.</div>
<div>
<br /></div>
<div>
In the <a href="https://github.com/facebook/zstd/releases/tag/v1.3.0">latest release of Zstandard</a>, we try to take a fresh look at this situation, and provide an API that is simpler to manage. The result is the <a href="https://github.com/facebook/zstd/blob/master/lib/zstd.h#L890">new advanced API candidate</a>, which actually stands a chance of becoming stable one day.<br />
<br />
It features 2 core components : </div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">ZSTD_compress_generic()</span> is the new main compression method. It's designed to support all other compression methods. It can do block, streaming, dictionary, multithreading, and any combination of those. We have plans for even more extensions, and they all seem to fit in.<br />
<br />
This is possible because now sending parameters is a separate operation, which can be completed in as many steps as necessary.</div>
<div>
The main vehicle to set these parameters is <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_CCtx_setParameter()</span> .</div>
<div>
It uses an <span style="font-family: "courier new" , "courier" , monospace;">enum</span> based policy, where the parameter is selected in an <span style="font-family: "courier new" , "courier" , monospace;">enum</span> list, and new value is provided as an <span style="font-family: "courier new" , "courier" , monospace;">unsigned</span> type.</div>
<div>
This design has been favoured over the previous one, which used a <span style="font-family: "courier new" , "courier" , monospace;">struct</span> to pass all parameters in a single step. The <span style="font-family: "courier new" , "courier" , monospace;">struct</span> was inconvenient, as it forced the user to select a value for each and every parameter, even out-of-scope ones, in order to change just one of them. Perhaps even more importantly, the <span style="font-family: "courier new" , "courier" , monospace;">struct</span> is problematic for future changes : adding any new parameter would change the <span style="font-family: "courier new" , "courier" , monospace;">struct</span> size, which is an ABI break. It can quickly get ugly when the program and library work on common memory areas using different sizes.</div>
<div>
The <span style="font-family: "courier new" , "courier" , monospace;">enum</span> policy allows us to add any new parameter while preserving API and ABI, so it looks very flexible.</div>
<div>
<br /></div>
<div>
However, it comes with its own set of troubles.</div>
<div>
To begin with, <span style="font-family: "courier new" , "courier" , monospace;">enum</span> values can very easily change : just add a new <span style="font-family: "courier new" , "courier" , monospace;">enum</span> in the list, and all enum values after that one slide by one.<br />
It can be a problem if, in one version of the library, <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_p_compressionLevel</span> is assigned the value 2, but in a future version, it becomes 3. In a dynamic library scenario, where version mismatches can easily happen, it means the caller would be changing some other random parameter.</div>
<div>
To counter that, it will be necessary to pin down all <span style="font-family: "courier new" , "courier" , monospace;">enum</span> values to manually assigned ones. This guarantees the attribution never changes.<br />
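Pinning looks like this. A hypothetical parameter list (the <span style="font-family: "courier new" , "courier" , monospace;">EX_p_*</span> names are invented for illustration, not zstd's actual enum) : each member carries an explicit value, so inserting a new parameter later cannot silently renumber the others across a DLL boundary.

```c
/* Hypothetical parameter list (not zstd's actual enum).
 * Each member is pinned to an explicit value, so adding a
 * new parameter in a future version cannot shift the numbers
 * that an older binary passes across the DLL boundary. */
typedef enum {
    EX_p_compressionLevel = 100,
    EX_p_windowLog        = 101,
    EX_p_hashLog          = 102
    /* new parameters receive fresh, never-reused numbers */
} EX_parameter;
```

The values themselves become part of the ABI contract : once published, a number is never reassigned to a different parameter.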
<br />
Another issue is that the value of the parameter is provided as an <span style="font-family: "courier new" , "courier" , monospace;">unsigned</span> type, so the parameter must fit this type. That's not always possible.</div>
<div>
For example, there is a dedicated method to set <span style="font-family: "courier new" , "courier" , monospace;">pledgedSrcSize</span>, which is essentially a promise about how much data is going to be processed. This amount can be very large, so an <span style="font-family: "courier new" , "courier" , monospace;">unsigned</span> type is not enough. Instead, we need an <span style="font-family: "courier new" , "courier" , monospace;">unsigned long long</span>, hence a dedicated method.</div>
<div>
Another even more obvious one happens when referencing a prepared dictionary in read-only mode : this parameter is a <span style="font-family: "courier new" , "courier" , monospace;">const ZSTD_CDict* </span>type, so it is set through a dedicated method, <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_CCtx_refCDict().</span></div>
<div>
And we have a few other exceptions using their own method, as the argument cannot fit into an <span style="font-family: "courier new" , "courier" , monospace;">unsigned</span>.</div>
<div>
<br /></div>
<div>
But the large majority of them uses <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_CCtx_setParameter()</span>.</div>
<div>
In some cases, the adaptation works though it's not "best".</div>
<div>
For example, a few parameters are selected from a list of enums, for example <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_strategy</span>. The enum is simply cast to an <span style="font-family: "courier new" , "courier" , monospace;">unsigned</span> and passed as an argument. It works. But it would have been even nicer to keep the argument type as the intended enum, giving the compiler a chance to catch potential type mismatches (<a href="https://godbolt.org/g/dixEzG">example</a>).</div>
<div>
<br /></div>
<div>
So this design could be in competition with another one : define one method per parameter. The most important benefit would be that each parameter can have its own type.</div>
<div>
But this alternate design has also its own flaws :</div>
<div>
adding any new parameter means adding a method. Therefore, if a program uses a "recent" method but links against an older library version, it results in a link error.<br />
In contrast, the <span style="font-family: "courier new" , "courier" , monospace;">enum</span> policy would just generate an error in the return code, which can be identified and gracefully dealt with.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Creating future-proof API is hard. There is always a new unexpected use case which shows up and would require another entry point or another parameter. The best we can do is plan for those changes.</div>
<div>
The new Zstandard's advanced API tries to do that. But since it is a first attempt, it likely is perfectible. </div>
<div>
This is design time, and it will cost a few revisions before reaching "stable" status. As always, user feedback will be key to decide if the new design fits the bill, and how to amend it.<br /><br /><u>Follow up</u> : <a href="https://fastcompression.blogspot.com/2017/07/dealing-with-library-version-mismatch.html">Dealing with library version mismatch</a><br />
<br />
<i><u>Edit :</u></i><br />
Arseny Kapoulkine made an interesting comment, arguing that specialized entry points make it possible for compiler's DCE (Dead Code Elimination) optimisation to kick in, removing useless code from the final binary :<br />
<a href="https://twitter.com/zeuxcg/status/882816066296172550">https://twitter.com/zeuxcg/status/882816066296172550</a><br />
<br />
In general this is true. Calling<br />
<span style="font-family: "courier new" , "courier" , monospace;">specialized_function1(...)</span><br />
is clear for the linker,<br />
then it's possible to remove any potentially unused <span style="font-family: "courier new" , "courier" , monospace;">specialized_function2()</span> from binary generation.<br />
<br />
In contrast calling<br />
<span style="font-family: "courier new" , "courier" , monospace;">generic_function(mode=1, ...)</span><br />
with<br />
<span style="font-family: "courier new" , "courier" , monospace;">void generic_function(int mode, ...)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> switch(mode) {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> case 1 : return specialized_function1(...);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> case 2 : return specialized_function2(...);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
makes it much more difficult. In general, for the second case, <span style="font-family: "courier new" , "courier" , monospace;">specialized_function2()</span> will remain in the binary.<br />
(an exception being usage of <span style="font-family: "courier new" , "courier" , monospace;">inline</span>, associated with <span style="font-family: "courier new" , "courier" , monospace;">-flto</span>, and non-ambiguous selection parameter, but that's no longer a "simple" setup).<br />
<br />
For Zstandard though, it doesn't make a lot of difference.<br />
The reason is, whatever "specialized" entry point is invoked (<span style="font-family: "courier new" , "courier" , monospace;">ZSTD_compress()</span>, or <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_compress_usingDict()</span> for example), it's just an entry point. The compression code is not "immediately behind" : it's reached after several levels of indirection. This design makes it possible for a single compression codebase to address multiple usage scenarios with vastly different sets of parameters, which is vital for maintenance. But disentangling that for DCE is a lot more difficult.<br />
If required, <span style="font-family: "courier new" , "courier" , monospace;">-flto</span> makes it possible to optimize the size further, and some differences become visible, but they remain relatively small.</div>
Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com8tag:blogger.com,1999:blog-834134852788085492.post-19820110334598143982016-08-31T21:17:00.006+02:002017-03-20T09:52:23.221+01:00Zstandard v1.0 is out<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjypsisXzrNSKNxh8ZbWywQNbMMcyxpfBCkgKos9IueJki4YVOdRSMSORnwb_mzHkhT69GZPuih85RVRpEu1J0RyH8_cXdXb93LOPiu_0Wo4jO4gQpKTGhEanF3rMQfLSBDRBis_rJTy8c/s1600/zstd_logo_only.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="193" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjypsisXzrNSKNxh8ZbWywQNbMMcyxpfBCkgKos9IueJki4YVOdRSMSORnwb_mzHkhT69GZPuih85RVRpEu1J0RyH8_cXdXb93LOPiu_0Wo4jO4gQpKTGhEanF3rMQfLSBDRBis_rJTy8c/s200/zstd_logo_only.png" width="200" /></a></div>
and is now officially a Facebook Open Source project :<br />
<br />
<a href="https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard">https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard</a><br />
<br />
<br />
<br />
<br />
<br />
<i><u>Edit</u></i> : grab latest release at : <a href="https://github.com/facebook/zstd/releases">https://github.com/facebook/zstd/releases</a><br />
<br />Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com6tag:blogger.com,1999:blog-834134852788085492.post-90083604665995228202016-07-05T19:51:00.000+02:002016-08-01T14:42:26.628+02:00Specification of Zstandard compression format<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJm32K2mmcS4TiPNZlUABf9NX4abPTlejnvdOW5yH6NmXFulkRyL4VxDdTuY_IMQxyZJJ1csE4JbmQrkMj-tnAF-h3ljsRQa2SfuqIf3v1Q2Su-8m6Vk-zpa1-DIljMWEc4ZkrQg9Fr2k/s1600/zstd_logo_only.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="192" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJm32K2mmcS4TiPNZlUABf9NX4abPTlejnvdOW5yH6NmXFulkRyL4VxDdTuY_IMQxyZJJ1csE4JbmQrkMj-tnAF-h3ljsRQa2SfuqIf3v1Q2Su-8m6Vk-zpa1-DIljMWEc4ZkrQg9Fr2k/s200/zstd_logo_only.png" width="200" /></a></div>
<div style="color: #1d2129; font-family: helvetica, arial, sans-serif; font-size: 14px; line-height: 19.32px; margin-bottom: 6px; margin-top: 6px;">
With the compression format stabilized in the v0.7.x series, <a href="http://www.zstd.net/">Zstandard</a> now gets a first version of its formal specification :
<a href="https://l.facebook.com/?u=https%3A%2F%2Fgithub.com%2FCyan4973%2Fzstd%2Fwiki%2FZstandard-Compression-Format&e=ATNRRtOMGLqsGlJ7wunWSmTdKfga8TatL2N5I4byK7aBfgWJM2OAkeUnh19nbJWaRjV__JNxoi17" style="color: #365899; cursor: pointer; font-family: inherit; text-decoration: none;" target="_blank"></a><a href="https://github.com/Cyan4973/zstd/wiki/Zstandard-Compression-Format" style="color: #365899; cursor: pointer; font-family: inherit; text-decoration: none;" target="_blank">https://github.com/Cyan4973/zstd/wiki/Zstandard-Compression-Format</a></div>
<div style="color: #1d2129; font-family: helvetica, arial, sans-serif; font-size: 14px; line-height: 19.32px; margin-bottom: 6px; margin-top: 6px;">
If you ever wanted to know how the algorithm works, and / or wanted to create your own version in any language of your choice, this is the place to start.</div>
<span class="text_exposed_show" style="color: #1d2129; display: inline; font-family: "helvetica" , "arial" , sans-serif; font-size: 14px; line-height: 19.32px;"></span><br />
<div class="text_exposed_show" style="color: #1d2129; display: inline; font-family: helvetica, arial, sans-serif; font-size: 14px; line-height: 19.32px;">
<div style="font-family: inherit; margin-bottom: 6px;">
It is a first version though, with the usual caveats : expect it to be perfectible, and to require a few rounds of feedback and modification, before reaching a stage of being unambiguous and clear.</div>
<div style="font-family: inherit; margin-bottom: 6px; margin-top: 6px;">
This is an open public consultation phase, and every feedback is welcome.<br />
It's also the very last chance to review the different choices that made it into the format, introducing questions and possibly suggesting improvements or simplifications.</div>
<div style="font-family: inherit; margin-bottom: 6px; margin-top: 6px;">
I don't expect "big changes", but maybe a collection of very minor things, which could, collectively, be worth considering a last polishing touch before pushing to v1.0.<br />
<br />
<b><i><u>Edit </u></i></b>: Indeed, there will be a polishing stage...<br />
Writing the specification made it possible to get a complete view of the multiple choices which made it into the format. Retrospectively, some of these choices are similar yet slightly different. For example, encoding types exist for all symbols, but are not numbered in the same way. Most fields are little-endian, but some are big-endian. Some corner-case optimizations are so rare that they are not worth their complexity. Etc.<br />
Therefore, in an effort to properly unify every minor detail of the specification and bring a few simplifications, a last modification round will be performed. It will be released as 0.8. No major change to expect, only a collection of minor ones. But a change is a change, so it's nonetheless a new format.<br />
As usual, 0.8 will be released with a "legacy mode", allowing reading data already compressed with 0.7.x series and before.<br />
Unusually though, we also plan to release a "v0.7 transition" version, able to read data created with v0.8, in order to smooth the transition in live systems which depend on listeners / producers, and need to ensure all listeners are able to read data sent to them before upgrading to 0.8.<br />
<br />
<b><u><i>Edit 2 :</i></u></b><br />
<a href="https://github.com/Cyan4973/zstd/releases/tag/v0.8.0">v0.8.0</a> and "<a href="https://github.com/Cyan4973/zstd/releases/tag/v0.7.5">transition v0.7.5</a>" have been released</div>
</div>
Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com4tag:blogger.com,1999:blog-834134852788085492.post-55395575828586472542016-06-17T13:45:00.001+02:002016-08-17T00:27:41.883+02:00Zstandard reaches Candidate Compression Format<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvP66T9O2WVk7JB6UIJ64kk8LcxtzKxKxWIqlm63-PJcWBz8tCW9q2nWyw9tyB1xT726VK_VMh5av4NbE7gS1AyN83Qyv3KXB68UIXQDG6DSnHl1E7acRFzjBCePD9PtCHrZO3lXv7V-4/s1600/zstd_logo_only.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="193" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvP66T9O2WVk7JB6UIJ64kk8LcxtzKxKxWIqlm63-PJcWBz8tCW9q2nWyw9tyB1xT726VK_VMh5av4NbE7gS1AyN83Qyv3KXB68UIXQDG6DSnHl1E7acRFzjBCePD9PtCHrZO3lXv7V-4/s200/zstd_logo_only.png" width="200" /></a></div>
Finally. That was a pretty long journey.<br />
With the release of v0.7, <a href="http://www.zstd.net/">Zstandard </a>has reached an important milestone where the compression format is stable and complete enough to become v1.0.<br />
We don't call it v1.0 yet, because it's safer to spend some time on a "confirmation period", during which the final compression format is field-tested. It shall confirm its ability to meet its objectives, dealing with all situations it is planned for.<br />
<i>Then</i> it will be rebranded v1.0.<br />
<br />
With the <i>source code</i> out, it's also time to think about other supporting actions, such as documentation. The next priority task is to write a specification of the compression format, so that it can be better exposed, understood and implemented by third parties. The goal is that any third party should be able to create its own version. However, describing an algorithm in a way which is clear and concise is not trivial. It's expected that some paragraphs will need a re-write in an effort to become clearer and more complete, reducing sources of confusion. This effort will benefit from user exposure and feedback.<br />
<br />
It's also time to have some more involved discussions around the API.<br />
The current "stable API" is expected to remain, but its scope is also limited, providing mostly the "basics", such as compressing a buffer into another buffer. More complex usages are possible (streaming in memory-constrained environment using a custom allocator for example), but need to access advanced prototypes, exposed in the "experimental" section. <br />
Now will be a good time to seriously consider extending the scope of "stable API".<br />
<br />
Just "promoting" a prototype from "experimental" to "stable" is not necessarily the better way to go. It's important that the extended API remain simple enough to understand and use (which is not the main priority of "experimental" section).<br />
After being immersed in the code for so long, some technical complexities can become invisible, while becoming real obstacles to newcomers. Therefore, it's important to "think" extended API properly, to create interfaces easy to use.<br />
For this objective, the key is to listen to 3rd parties, in order to better fit natural expectations.Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com1tag:blogger.com,1999:blog-834134852788085492.post-59501799895639663732016-05-13T13:08:00.001+02:002016-05-17T15:57:24.617+02:00Finalizing a compression formatWith Zstandard v1.0 looming ahead, the last major item for zstd to settle is an extended set of features for its frame encapsulation layer.<br />
<br />
Quick overview of the design : data compressed by zstd is cut into blocks. A compressed block has a maximum content size (128 KB), so obviously if input data is larger than this, it will have to occupy multiple blocks.<br />
The frame layer organizes these blocks into a single content. It also provides the decoder with a set of properties that the encoder pledges to respect. These properties allow a decoder to prepare the required resources, such as allocating enough memory.<br />
<br />
The current frame layer only stores 1 identifier and 2 parameters :<br />
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">frame Id :</span> It simply tells which frame and compression formats to expect next. This is currently used to automatically detect legacy formats (v0.5.x, v0.4.x, etc.) and select the right decoder for them. It occupies the first 4 bytes of a frame.</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">windowLog </span>: This is the maximum search distance that will be used by the encoder. It is also the maximum block size, when<span style="font-family: "courier new" , "courier" , monospace;"> (1<<windowLog) < MaxBlockSize (== 128 KB)</span>. This is enough for a decoder to guarantee successful decoding operation using a limited buffer budget, whatever the real content size is (endless streaming included).</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">contentSize </span>: This is the amount of data to decode within this frame. This information is optional. It can be used to allocate the exact amount of memory for the object to decode.</li>
</ul>
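The three properties above can be sketched as a small C struct. This is only an illustration of the concepts; the field names and layout are hypothetical, not zstd's actual definitions.

```c
#include <stdint.h>

/* Hypothetical sketch of the three properties carried by the frame layer.
 * Field names are illustrative -- they are not zstd's actual definitions. */
typedef struct {
    uint32_t frameId;      /* first 4 bytes of the frame : format identifier */
    uint32_t windowLog;    /* max search distance = 1 << windowLog */
    uint64_t contentSize;  /* total decoded size, optional (0 = unknown) */
} FrameParams;

/* The maximum block size is the smaller of 128 KB and the window size,
 * per the windowLog description above. */
static uint32_t maxBlockSize(const FrameParams* fp)
{
    uint32_t const windowSize = (uint32_t)1 << fp->windowLog;
    uint32_t const MAX_BLOCK_SIZE = 128u * 1024;
    return windowSize < MAX_BLOCK_SIZE ? windowSize : MAX_BLOCK_SIZE;
}
```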
<br />
This information may seem redundant.<br />
Indeed, for a few situations, it is : when <span style="font-family: "courier new" , "courier" , monospace;">contentSize < (1<<windowLog)</span>. In that case, it's enough to allocate <span style="font-family: "courier new" , "courier" , monospace;">contentSize </span>bytes for decoding, and <span style="font-family: "courier new" , "courier" , monospace;">windowLog </span>is just redundant.<br />
But for all other situations, <span style="font-family: "courier new" , "courier" , monospace;">windowLog</span> is useful : either <span style="font-family: "courier new" , "courier" , monospace;">contentSize </span>is unknown (it wasn't known at the beginning of the frame and was only discovered on frame termination), or <span style="font-family: "courier new" , "courier" , monospace;">windowLog </span>defines a smaller memory budget than <span style="font-family: "courier new" , "courier" , monospace;">contentSize</span>, in which case it can be used to limit the memory budget.<br />
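That allocation rule can be sketched in a few lines of C. The helper name and signature are hypothetical, not a zstd API:

```c
#include <stdint.h>

/* Sketch of the allocation rule described above : when contentSize is known
 * and smaller than the window, it bounds the decoding buffer; otherwise
 * windowLog does. The helper is illustrative, not a zstd API. */
static uint64_t decodingBufferSize(uint64_t contentSize, uint32_t windowLog,
                                   int contentSizeKnown)
{
    uint64_t const windowSize = (uint64_t)1 << windowLog;
    if (contentSizeKnown && contentSize < windowSize) return contentSize;
    return windowSize;   /* contentSize unknown, or larger than the window */
}
```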
<br />
That's all there is for v0.6.x. Arguably, that's a pretty small list.<br />
<br />
The intention is to create a more feature complete frame format for v1.0.<br />
Here is a list of features considered, in priority order :<br />
<ul>
<li><b>Content Checksum</b> : objective is to validate that decoded content is correct.</li>
<li><b>Dictionary ID</b> : objective is to confirm or detect dictionary mismatch, for files which require a dictionary for correct decompression. Without it, a wrong dictionary could be picked, resulting in silent corruption (or an error).</li>
<li><b>Custom content</b>, aka skippable frames : the objective is to allow users to embed custom elements (comments, indexes, etc.) within a file consisting of multiple concatenated frames.</li>
<li><b>Custom window sizes</b>, including non power of 2 : extend current windowLog scheme, to allow more precise choices.</li>
<li><b>Header checksum</b> : validate that header information is not accidentally distorted.</li>
</ul>
Each of these bullet points introduces its own set of questions, detailed below :<br />
<br />
<b><u>Content checksum</u></b><br />
The goal of this field is obvious : validate that decoded content is correct. But there are many little details to select.<br />
<br />
Content checksum only protects against accidental errors (transmission, storage, bugs, etc). It's not an electronic "signature".<br />
<br />
<i>1) Should it be enabled or disabled by default (field == 0) ?</i><br />
<br />
<div>
Suggestion : disabled by default<br />
Reasoning : There are already a lot of checksums around, in storage, in transmission, etc. Consequently, errors are now pretty rare, and when they happen, they tend to be "large" rather than sparse. Also, zstd is likely to detect errors just by parsing the compressed input anyway.<br />
<br />
2) <i>Which algorithm ? Should it be selectable ?</i><br />
<br />
Suggestion : xxh64, additional header bit reserved in case of additional checksum, but just a single one defined in v1.<br />
Reasoning : we have transitioned to a 64-bits world. 64-bits checksums are faster to generate than 32-bits ones on such systems. So let's use the faster ones. <br />
xxh64 also has excellent distribution properties, and is highly portable (no dependency on hardware capability). It can be run in 32-bits mode if need be.<br />
<br />
3) <i>How many bits for the checksum ?</i><br />
<br /></div>
Current format defines the "frame end mark" as a 3-bytes field, the same size as a block header, which is no accident : it makes parsing easier. This field has a 2-bits header, hence 22 bits free, which can be used for a content checksum. This wouldn't increase the frame size.<br />
<br />
22-bits means there is a 1 in 4 million chance of collision in case of error. Or, said differently, there are 4194303 chances out of 4194304 to detect a decoding error (on top of all the syntax verifications which are inherent to the format itself). That's > 99.9999 %. Good enough in my view.<br />
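As a sketch, the 22-bit field could simply be the low bits of the 64-bit content hash (e.g. xxh64 output). Which 22 bits the final format actually retains is an assumption here, not something settled by this post:

```c
#include <stdint.h>

/* Sketch : fold a 64-bit content hash into the 22 bits left free by the
 * 2-bit header of the 3-bytes frame end mark. Keeping the low 22 bits is
 * an assumption for illustration; 2^22 - 1 == 0x3FFFFF == 4194303. */
static uint32_t checksum22(uint64_t hash64)
{
    return (uint32_t)(hash64 & 0x3FFFFFu);  /* keep the low 22 bits */
}
```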
<br />
<b><u><br /></u></b>
<b><u>Dictionary ID</u></b><br />
<br />
Data compressed using a dictionary needs the exact same one to be regenerated. But no control is done on the dictionary itself. In case of wrong dictionary selection, it can result in a data corruption scenario.<br />
<br />
The corruption is likely to be detected by parsing the compressed format (or thanks to the previously described optional content checksum field).<br />
But an even better outcome would be to detect such a mismatch immediately, before starting decompression, and with a clearer error message/id than "corruption", which is too generic.<br />
<br />
For that, it would be enough to embed a "Dictionary ID" into the frame.<br />
The Dictionary ID would simply be a random value stored inside the dictionary (or an assigned one, provided the user has a way to ensure he doesn't re-use the same value multiple times). A comparison between the ID in the frame and the ID in the dictionary will be enough to detect the mismatch.<br />
<br />
A simple question is : how long should this ID be ? 1, 2, 4 bytes ?<br />
In my view, 4 bytes is enough for a random-based ID, since it makes the probability of collision very low. But that's still 4 more bytes to fit into the frame header. In some ways it can be considered an efficiency issue.<br />
Maybe some people will prefer 2 bytes ? or maybe even 1 byte (notably for manually assigned ID values) ? or maybe even 0 bytes ?<br />
<br />
It's unclear, and I guess multiple scenarios will have different answers.<br />
So maybe a good solution would be to support all 4 possibilities in the format, and default to 4-bytes ID when using dictionary compression.<br />
<br />
Note that if saving headers is important for your scenario, it's also possible to use the frame-less block format ( <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_compressBlock()</span>, <span style="font-family: "courier new" , "courier" , monospace;">ZSTD_decompressBlock()</span> ), which will remove any frame header, saving 12+ bytes in the process. It looks like a small saving, but when the corpus consists of a lot of small messages of ~50 bytes each, it makes quite a difference. The application will have to save metadata on its own (which dictionary is the correct one, compressed size, decompressed size, etc.).<br />
<br />
<br />
<b><u>Custom content</u></b><br />
<br />
Embedding custom content can be useful for a lot of unforeseen applications.<br />
For example, it could contain a custom index into compressed content, or a file descriptor, or just some user comment.<br />
<br />
The only thing that a standard decoder can do is skip this section. Dealing with its content is within application-specific realm.<br />
<br />
The <a href="https://github.com/Cyan4973/lz4/blob/master/lz4_Frame_format.md#skippable-frames">lz4 frame format</a> already defines such a container, as skippable frames. It looks good enough, so let's re-use the same definition.<br />
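A minimal sketch of how a decoder could recognize and skip such a frame, following the lz4 definition (a magic number in the 0x184D2A50..0x184D2A5F range, then a 4-byte little-endian size, then that many bytes of user content); the function names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value, byte by byte (endian-independent). */
static uint32_t readLE32(const uint8_t* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Per the lz4 skippable-frame definition the post proposes to re-use :
 * 4-byte magic in 0x184D2A50..0x184D2A5F, then 4-byte LE size, then
 * `size` bytes of user content. Returns total bytes to skip, or 0 if
 * this is not a skippable frame. */
static size_t skippableFrameSize(const uint8_t* src, size_t srcSize)
{
    if (srcSize < 8) return 0;                             /* header incomplete */
    uint32_t const magic = readLE32(src);
    if ((magic & 0xFFFFFFF0u) != 0x184D2A50u) return 0;    /* not skippable */
    return 8 + (size_t)readLE32(src + 4);                  /* header + content */
}
```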
<br />
<br />
<b><u>Custom window sizes</u></b><br />
<br />
The current frame format allows defining window sizes from 4 KB to 128 MB, all intermediate sizes being strict powers of 2 (8 KB, 16 KB, etc.). It works fine, but maybe some users would find its granularity or limits insufficient.<br />
There are 2 parts to consider :<br />
<br />
- Allowing larger sizes : the current implementation will have trouble handling window sizes > 256 MB. That being said, it's an implementation issue, not a format issue. An improved version could likely work with larger sizes (at the cost of some complexity).<br />
From a frame format perspective, allowing larger sizes can be as easy as keeping a reserved bit for later.<br />
<br />
- Non-power of 2 sizes : Good news is, the internals within zstd are not tied to a specific power of 2, so the problem is limited to sending more precise window sizes. This requires more header bits.<br />
Maybe an unsigned 32-bits value would be good enough for such use.<br />
Note that it doesn't make sense to specify a larger window size than content size. Such case should be automatically avoided by the encoder. As to the decoder, it's unclear how it should react : stop and issue an error ? proceed with allocating the larger window size ? or use the smaller content size, and issue an error if the content ends up larger than that ?<br />
Anyway, in many cases, what the user is likely to want is simply enough size for the frame content. In which case, a simple "refer to frame content size" is probably the better solution, with no additional field needed.<br />
<br />
<br />
<b><u>Header Checksum</u></b><br />
<br />
The intention is to catch errors in the frame header before they translate into larger problems for the decoder. Note that only errors can be caught this way : intentional data tampering can simply rebuild the checksum, hence remain undetected.<br />
<br />
Suggestion : this is not necessary.<br />
<br />
While transmission errors used to be more common a few decades ago, they are much less of a threat today, and when they do occur, they tend to garble large sections (not just a few bits).<br />
An erroneous header can nonetheless be detected just by parsing it, considering the number of reserved bits and forbidden values. They must all be validated.<br />
The nail in the coffin is that we do no longer trust headers, as they can be abused by remote attackers to deliver an exploit. And that's an area where the header checksum is simply useless. Every field must be validated, and all accepted values must have controllable effects (for example, if the attacker intentionally requests a lot of memory, the decoder shall put a high limit to the accepted amount, and check the allocation result).<br />
So we already are highly protected against errors, by design, because we must be protected against intentional attacks.<br />
<br />
<br />
<b><u>Future features : forward and backward compatibility</u></b><br />
<br />
It's also important to design from day 1 a header format able to safely accommodate future features, with regards to version discrepancy.<br />
<br />
The basic idea is to keep a number of reserved bits for these features, set to <span style="font-family: "courier new" , "courier" , monospace;">0</span> while waiting for some future definition.<br />
<br />
It seems also interesting to split these reserved bits into 2 categories :<br />
- Optional and skippable features : these are features which a decoder can safely ignore, without jeopardizing decompression result. For example, a purely informational signal with no impact on decompression.<br />
- Future features, disabled by default (<span style="font-family: "courier new" , "courier" , monospace;">0</span>) : these features can have an unpredictable impact on the compression format, such as adding a new field costing a few more bytes. A non-compatible decoder cannot take the risk of proceeding with decompression. It will stop on detecting such a reserved bit set to <span style="font-family: "courier new" , "courier" , monospace;">1</span> and give an error message.<br />
<br />
While it's great to keep room for the future, it should not take too much of a toll in the present. So only a few bits will be reserved. If more are needed, it simply means another frame format is necessary. It's enough in such a case to use a different frame identifier (first 4 bytes of a frame).Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com8tag:blogger.com,1999:blog-834134852788085492.post-84003617342785405002016-04-03T03:00:00.000+02:002016-04-05T12:36:38.187+02:00Working with streaming<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDx7k3r9glpYlPzGQGJ-eG6YI7kGG0kFCvPDSLJFoedhpMjWdUIhy_sXEVHwJSImmvaoaXz5NlxxhF0za-mJLAX1Dc38l3QQNf7GTGsr1EF6OBvcUJ3hLIutxVNcJEA9E5B_BlWSgRxBA/s1600/linked_list_1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="51" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDx7k3r9glpYlPzGQGJ-eG6YI7kGG0kFCvPDSLJFoedhpMjWdUIhy_sXEVHwJSImmvaoaXz5NlxxhF0za-mJLAX1Dc38l3QQNf7GTGsr1EF6OBvcUJ3hLIutxVNcJEA9E5B_BlWSgRxBA/s1600/linked_list_1.png" width="200" /></a></div>
Streaming is an advanced and very convenient processing mode that a few codecs offer to deal with small data segments. This is great in communication scenarios. For lossless data compression, it makes it possible to send tiny packets, creating a low-latency interaction, while preserving strong compression capabilities by using previously sent data to compress the following packets.<br />
<br />
Ideally, on the encoding side, the user should be able to send any amount of data, from the smallest possible (1 byte) to much larger ones (~~MB). It's up to the encoder to decide how to deal with this. It may group several small fields into a single packet, or conversely break larger ones into multiple packets. In order to avoid any unwanted delay, a "flush" command shall be available, so that the user can decide it's time to send buffered data.<br />
<br />
On the other side, a compatible decoder shall be able to cope with whatever data was sent by the encoder. This obviously requires a bit of coordination, a set of shared rules.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.w3.org/TR/PNG/png-figures/fig410.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="128" src="https://www.w3.org/TR/PNG/png-figures/fig410.png" width="200" /></a></div>
<br />
<a href="https://www.ietf.org/rfc/rfc1951.txt">The zip format</a> defines a maximum copy distance (32 KB). Data is sent as a set of blocks, but there is no maximum block size (except non-compressed blocks, which must be <= 64 KB).<br />
A compatible zip decoder must be able to cope with these conditions. It must keep up to 32 KB of previously received data, and be able to break decoding operation in the middle of a block, should it receive a block way too large to fit into its memory buffer.<br />
Thankfully, once this capability is achieved, it's possible to decode with a buffer size of <span style="font-family: "courier new" , "courier" , monospace;">32 KB + maximum chunk size</span>, with "chunk size" being the maximum size the decoder can decode from a single block. In general, it's a bit more than that, in order to ease a few side-effects, but we won't go into details.<br />
<br />
The main take-away is : buffer size is <i>a consequence of </i>maximum copy distance, plus a reasonable amount of data to be decoded in a single pass.<br />
<br />
<a href="http://www.zstd.net/">zstd</a>'s proposal is to reverse the logic : the size of the decoder buffer is set and announced in the frame header. The decoder can safely allocate the requested amount of memory. It's up to the encoder to respect this condition (otherwise, the compressed data is considered corrupted).<br />
<div>
<br /></div>
In the current version of the format, this buffer size can vary from 4 KB to 128 MB. It's a pretty wide range, and crucially, it includes possibilities for a small memory footprint. A decoder which can only handle small buffer sizes can immediately detect and discard frames which ask for more than its capabilities.<br />
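That early rejection can be sketched as follows; the 4 KB and 128 MB bounds come from the format as described above, while the function name and signature are hypothetical:

```c
#include <stdint.h>

/* Sketch : a memory-constrained decoder can reject a frame up-front when
 * the announced buffer size exceeds what it is willing to allocate.
 * Returns 1 if the frame is acceptable, 0 otherwise. */
static int acceptFrame(uint64_t requestedBufferSize, uint64_t decoderLimit)
{
    if (requestedBufferSize < 4u * 1024) return 0;           /* below format minimum (4 KB) */
    if (requestedBufferSize > 128u * 1024 * 1024) return 0;  /* above format maximum (128 MB) */
    return requestedBufferSize <= decoderLimit;              /* within this decoder's budget */
}
```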
<br />
Once the buffer size is settled, data is sent as "blocks". Each block has a maximum size of 128 KB. So, in theory, a block could be larger than the agreed decoder buffer. What would happen in such case ?<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://accu.org/var/uploads/journals/resources/goodliffe%20circular%20buffer.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://accu.org/var/uploads/journals/resources/goodliffe%20circular%20buffer.png" height="120" width="320" /></a></div>
<br />
<br />
Following the zip example, one solution would be for the decoder to be able to stop (and then resume) the decoding operation in the middle of a block. This obviously increases decoder complexity. But the benefit is that the only condition the compressor has to respect is a max copy distance <= buffer size.<br />
<br />
On the decoder side though, that's only one part of the problem. There's no point in having a very small decoding buffer if some other memory budget dwarfs it.<br />
<br />
The decoding tables are not especially large : they use 5 KB by default, and could be reduced to half, or possibly a quarter of that (but with impact on compression ratio). Not a big budget.<br />
<br />
The real issue is the size of the incoming compressed block. A compressed block must be smaller than its original size, otherwise it will be transmitted in uncompressed format. That still makes it possible to have a (128 KB - 1) block size. This is extremely large compared to a 4 KB buffer.<br />
<br />
Zip's solution is that it's not necessary to receive the entire compressed block in memory in order to start decompressing it. This is possible because all symbols are entangled in a single bitstream, which is read in forward direction. So input buffer can be a fraction of a block. It simply stops when there is no more information available.<br />
<br />
This will be difficult to imitate for zstd : it has multiple independent bitstreams (between 2 and 5) read in <i>backwards</i> direction.<br />
<br />
The backward direction is unusual, and a direct consequence of using <a href="https://github.com/Cyan4973/FiniteStateEntropy">ANS entropy</a> : encoding and decoding must be done in reverse direction. <a href="https://github.com/Cyan4973/FiniteStateEntropy">FSE </a>solution is to write forward and read backward.<br />
It could have been a different choice : write backward, read forward, as suggested by <a href="https://fgiesen.wordpress.com/2015/12/21/rans-in-practice/">Fabian Giesen</a>. But it makes the encoder's API more complex : the destination buffer would be filled from the end, instead of the beginning. From a user perspective, it breaks a few common assumptions, and becomes a good recipe for confusion.<br />
Alternatively, the end result could be <span style="font-family: "courier new" , "courier" , monospace;">memmove()</span> to the beginning of the buffer, with a small but noticeable speed cost.<br />
<br />
But even that wouldn't solve the multiple bitstreams design, which is key to <a href="http://fastcompression.blogspot.fr/2015/10/huffman-revisited-part-5-combining.html">zstd's speed advantage</a>. zstd is fast because it manages to keep multiple cpu execution units busy. This is achieved by reducing or eliminating dependencies between operations. At some point, it implies bitstream independence.<br />
<br />
In a zstd block, literals are encoded first, followed by LZ symbols. Bitstreams are not entangled : each one occupies its own memory segment.<br />
Considering this setup, it's necessary to access the full block content to start decoding it (well, more precisely, a few little things could be started in parallel, but it's quite complex and not worth detailing here).<br />
<br />
Barring any last-minute breakthrough on this topic, this direction is a dead-end : any compressed block must be received entirely before starting its decompression.<br />
As a consequence, since a small decoding buffer is a consequence of a constrained memory budget, it looks logical that the size of incoming compressed blocks should be limited too, to preserve memory.<br />
<br />
The limit size of a compressed block could be a dedicated parameter, but it would add complexity. A fairly natural assumption would be that a compressed block should be no larger than the decoding buffer. So let's use that.<br />
(PS : another potential candidate would be <span style="font-family: "courier new" , "courier" , monospace;">cBlockSize <= bufferSize/2</span> , but even such a simple division by 2 looks like a recipe for future confusion).<br />
<br />
So now, the encoder side enforces a maximum block size no larger than the decoding buffer. Fair enough. Multiple smaller blocks also means multiple headers, so it could impact compression efficiency. Thankfully, <a href="http://www.zstd.net/">zstd </a>includes both a "default statistics" and an experimental "repeat statistics" modes, which can be used to reduce header size to zero, and provide some answer to this issue.<br />
<br />
But there is more to it.<br />
The problem is that the amount of data previously sent can be of any size. The encoder may arbitrarily receive a "flush" order at any time. So each received block can be of any size (up to the maximum), and will not necessarily fill the buffer.<br />
Hence, what happens when we get closer to the buffer's end ?<br />
<br />
Presuming the decoder doesn't have the capability to stop decompression in the middle of a block, the next block shall not cross the limit of the decoder buffer. Hence, if there are 2.5 KB left in decoder buffer before reaching its end, the next block maximum size must be 2.5 KB.<br />
<br />
It becomes a new condition for the encoder to respect : keep track of the decoder buffer's fill level, ensure the limit is never crossed, stop at the exact end of the buffer, and then restart from zero.<br />
It looks complex, but the compressor knows the size of the decoder buffer : it was specified at the beginning of the frame. So it is manageable.<br />
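The bookkeeping described above can be sketched in a few lines. This is an illustrative helper, not zstd's actual implementation; the names `bufferSize`, `fillLevel` and `maxBlockSize` are assumptions for the example.

```c
#include <stddef.h>

/* Illustrative sketch : compute the maximum size of the next block,
 * so that it never crosses the end of the decoder's buffer.
 * bufferSize   : decoder buffer size, announced in the frame header
 * fillLevel    : bytes written into the buffer since the last wrap
 * maxBlockSize : the frame's general maximum block size */
static size_t next_block_max(size_t bufferSize, size_t fillLevel,
                             size_t maxBlockSize)
{
    size_t const remaining = bufferSize - fillLevel; /* room left before wrap */
    return (remaining < maxBlockSize) ? remaining : maxBlockSize;
}

/* After emitting a block of blockSize bytes, the encoder would update :
 *   fillLevel += blockSize;
 *   if (fillLevel == bufferSize) fillLevel = 0;   // restart from zero
 */
```

For instance, with 2.5 KB left before the end of a 10 KB decoder buffer, the helper caps the next block at 2.5 KB, exactly as described above.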
<br />
But is that desirable ?<br />
From an encoder perspective, it seems better to be free of such a restriction : just accept the block size and copy distance limits, and let the decoder deal with it, even if it requires a complex "stop and resume" capability in the middle of a block.<br />
From a decoder perspective, it looks better to only handle full blocks, and require the encoder to pay attention to never break this assumption.<br />
<br />
Classical transfer of complexity.<br />
It makes for an interesting design choice. And as v1.0 gets nearer, one will have to be selected.<br />
<br />
-------------------------------------------<br />
<i><b><u>Edit </u></b>: And the final choice is :</i><br />
<br />
Well, a decision was necessary, so here it is :<br /><br />The selected design only imposes a distance limit and a maximum block size on the encoder, both values being equal, and provided in the frame header.<br />
The encoder doesn't need to track the "fill level" of the decoder buffer.<br /><br />As stated above, a compliant decoder using the exact buffer size should have the capability to interrupt the decompression operation in the middle of a block, in order to reach the exact end of the buffer, and restart from the beginning.<br />
<br />
However, there is a trick ...<br />
Should the decoder not have this capability, it's enough to extend the size of the buffer by the size of a single block (so it's basically 2x bigger for "small" buffer values (<= 128 KB) ). In which case, the decoder can safely decode every block in a single step, without interrupting the decoding operation in the middle.<br /><br />
Requiring more memory to safely decompress is an "implementation detail", and doesn't impact the spec, which is the real point here.<br />Thanks to this trick, it's possible to immediately target final spec, and update the decoder implementation later on, as a memory optimization. Therefore, it won't delay v1.0.<br />
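The trick above boils down to a one-line sizing rule. A minimal sketch, with illustrative names (`windowSize` standing for the distance limit announced in the frame header):

```c
#include <stddef.h>

/* Sketch of the buffer-extension trick : a decoder unable to stop in
 * the middle of a block simply allocates one extra maximum-block-size
 * worth of memory, so any full block fits before wrapping around.
 * In the selected design, maxBlockSize == windowSize, so small
 * buffers (<= 128 KB) effectively double in size. */
static size_t safe_decoder_buffer_size(size_t windowSize, size_t maxBlockSize)
{
    return windowSize + maxBlockSize;
}
```

The extra memory is purely an implementation choice of that decoder; the bitstream produced by the encoder is identical either way, which is why the spec is unaffected.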
<br />
<br />Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com2tag:blogger.com,1999:blog-834134852788085492.post-39591525870739541602016-02-05T17:52:00.000+01:002016-02-18T15:45:17.679+01:00Compressing small data<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDx7k3r9glpYlPzGQGJ-eG6YI7kGG0kFCvPDSLJFoedhpMjWdUIhy_sXEVHwJSImmvaoaXz5NlxxhF0za-mJLAX1Dc38l3QQNf7GTGsr1EF6OBvcUJ3hLIutxVNcJEA9E5B_BlWSgRxBA/s1600/linked_list_1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="51" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDx7k3r9glpYlPzGQGJ-eG6YI7kGG0kFCvPDSLJFoedhpMjWdUIhy_sXEVHwJSImmvaoaXz5NlxxhF0za-mJLAX1Dc38l3QQNf7GTGsr1EF6OBvcUJ3hLIutxVNcJEA9E5B_BlWSgRxBA/s1600/linked_list_1.png" width="200" /></a></div>
Data compression is primarily seen as a file compression algorithm. After all, the main objective is to save storage space, isn't it ?<br />
With this background in mind, it's also logical to focus on bigger files. Good compression achieved on a single large archive is worth the savings for countless smaller ones.<br />
<br />
However, this is no longer where the bulk of compression happens. Today, compression is everywhere, embedded <i>within </i>systems, achieving its space and transmission savings without user intervention, nor awareness. The key to these invisible gains is to remain below the end-user perception threshold. To achieve this objective, it's not possible to wait for some large amount of data to process. Instead, data is processed in small amounts.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCcJheKfok-YKw1SH1_eDP0csjS8MBXknYrNmRiHImJXs8m6aLP14zr_hdb6xaFGapAREbLvKEqtm0KBBlEbESuljzi9336SEBd6UxL373tlPNSS-ZlHFCyWGb-Zj0pSDd4kVRuDh4OXw/s1600/mtpacking2%255B1%255D.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="89" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCcJheKfok-YKw1SH1_eDP0csjS8MBXknYrNmRiHImJXs8m6aLP14zr_hdb6xaFGapAREbLvKEqtm0KBBlEbESuljzi9336SEBd6UxL373tlPNSS-ZlHFCyWGb-Zj0pSDd4kVRuDh4OXw/s320/mtpacking2%255B1%255D.jpg" width="320" /></a></div>
<br />
<br />
This would be all well and good if it wasn't for a simple observation : the smaller the amount to compress, the worse the compression ratio.<br />
The reason is pretty simple : data compression works by finding redundancy within the processed source. When a new source starts, there is not yet any redundancy to build upon. And it takes time for any algorithm to achieve a meaningful outcome.<br />
<br />
Therefore, as the issue comes from starting from a blank history, what about starting from an already populated history ?<br />
<br />
<b><u>Streaming to the rescue</u></b><br />
<br />
A first solution is streaming : data is cut into smaller blocks, but each block can make reference to previously sent ones. And it works quite well. In spite of some minor losses at block borders, most of the compression opportunities of a single large data source are preserved, but now with the advantage of processing, sending, and receiving tiny blocks on the fly, making the experience smooth.<br />
<br />
However, this scenario only works with serial data, a communication channel for example, where order is known and preserved.<br />
<br />
For a large category of applications, such as database and storage, this cannot work : data must remain accessible in a random fashion, with no "a priori" order known. Reaching a specific block sector should not require decoding all preceding ones just to rebuild the dynamic context.<br />
<br />
For such use cases, a common work-around is to create some "not too small blocks". Say there are many records of a few hundred bytes each. Group them in packs of at least 16 KB. Now this achieves some nice middle-ground between a not-too-poor compression ratio and good enough random access capability.<br />
This is still not ideal though, since it's required to decompress a full block just to get a single random record out of it. Therefore, each application will settle for its own middle ground, using block sizes of 4 KB, 16 KB or even 128 KB, depending on usage pattern.<br />
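The grouping strategy above is straightforward to express in code. This is a hypothetical sketch (the helper name and the 16 KB target are illustrative, matching the example in the text):

```c
#include <stddef.h>

#define TARGET_BLOCK_SIZE (16 * 1024)  /* illustrative middle-ground */

/* Count how many compression blocks result from packing nbRecords
 * variable-size records, closing the current block as soon as it
 * reaches TARGET_BLOCK_SIZE. Purely illustrative of the trade-off :
 * fewer, larger blocks compress better, but fetching one record
 * requires decompressing its whole block. */
static size_t count_packed_blocks(const size_t* recordSizes, size_t nbRecords)
{
    size_t blocks = 0, current = 0;
    for (size_t i = 0; i < nbRecords; i++) {
        current += recordSizes[i];
        if (current >= TARGET_BLOCK_SIZE) { blocks++; current = 0; }
    }
    if (current > 0) blocks++;   /* last, possibly partial, block */
    return blocks;
}
```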
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLHk7cf-ypY9T_EjARVMAJ6DRBflQcqNLG0gS5nrsHMo1JPu2-b_Phcn4juMcFLEQfiNTQ3H-qmbRh8UdxGFAEJvPeXGNd1ozl0kc55fq-AzeKckoLonh6qGqkz2YPcsimlG7-QnmvZ5Q/s1600/5510506796_dff8c07b64_z%255B1%255D.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLHk7cf-ypY9T_EjARVMAJ6DRBflQcqNLG0gS5nrsHMo1JPu2-b_Phcn4juMcFLEQfiNTQ3H-qmbRh8UdxGFAEJvPeXGNd1ozl0kc55fq-AzeKckoLonh6qGqkz2YPcsimlG7-QnmvZ5Q/s200/5510506796_dff8c07b64_z%255B1%255D.jpg" width="200" /></a></div>
<br />
<b><u>Dictionary compression</u></b><br />
<br />
Preserving random access at record level <u><i>and</i></u> good compression ratio, is hard. But it's achievable too, using a <i>dictionary</i>. To summarize, it's a kind of common prefix, shared by all compressed objects. It makes every compression and decompression operation start from the same populated history.<br />
<br />
Dictionary compression has the great property to be compatible with random access. Even for communication scenarios, it can prove easier to manage at scale than "per-connection streaming", since instead of storing one different context per connection, there is always the same context to start from when compressing or decompressing any new data block.<br />
<br />
A good dictionary can compress small records into tiny compressed blobs. Sometimes, the current record can be found "as is" entirely within the dictionary, reducing it to a single reference. More likely, some critical redundant elements will be detected (header, footer, keywords) leaving only variable ones to be described (ID fields, date, etc.).<br />
<br />
For this situation to work properly, the dictionary needs to be tuned for the underlying structure of objects to compress. There is no such thing as a "universal dictionary". One must be created and used for a target data type.<br />
<br />
Fortunately, this condition can be met quite often.<br />
Just created some new protocol for a transaction engine or an online game ? It's likely based on a few common important messages and keywords (even binary ones). Have some event or log records ? There is likely a grammar for them (json, xml maybe). The same can be said of digital resources, be it html files, css stylesheets, javascript programs, etc.<br />
If you know what you are going to compress, you can create a dictionary for it.<br />
<br />
The key is, since it's not possible to create a meaningful "universal dictionary", one must create one dictionary per resource type.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPJQkLTK-ya44cuGOf7iI-j0uuTZsLiCirNcukSuC1PeP7UQN8SOogmXrf1nCuq-fkdmEa0Pt_F19nTETwyGRza2KGSTVb1Eza11aNU0oSIveindeaRUL9jZN28HUyLxUVxp5L-XeCK6Y/s1600/message+structure.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="303" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPJQkLTK-ya44cuGOf7iI-j0uuTZsLiCirNcukSuC1PeP7UQN8SOogmXrf1nCuq-fkdmEa0Pt_F19nTETwyGRza2KGSTVb1Eza11aNU0oSIveindeaRUL9jZN28HUyLxUVxp5L-XeCK6Y/s320/message+structure.png" width="320" /></a></div>
<i>Example of a structured JSON message</i><br />
<br />
How to create a dictionary from a training set ? Well, even though one could be tempted to manually create one, by compacting all keywords and repeatable sequences into a file, this can be a tedious task. Moreover, chances are the dictionary will have to be updated regularly as conditions change.<br />
This is why, <a href="https://github.com/Cyan4973/zstd/releases">starting from v0.5</a>, zstd offers a<a href="https://github.com/Cyan4973/zstd#dictionary-compression-how-to-"> dictionary builder capability</a>.<br />
<br />
Using the builder, it's possible to quickly create a dictionary from a list of samples. The process is relatively fast (a matter of seconds), which makes it possible to generate and update multiple dictionaries for multiple targets.<br />
<br />
But what gains can dictionary compression achieve ?<br />
To answer this question, a few tests were run on some typical samples. A flow of JSON records from a probe, some Mercurial log events, and a collection of large JSON documents, provided by @KryzFr.<br />
<br />
<table style="border-collapse: collapse; border-spacing: 0px; box-sizing: border-box; color: #333333; display: block; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px; line-height: 23.2727px; margin-bottom: 16px; margin-top: 0px; overflow: auto; width: 888.182px; word-break: keep-all;"><thead style="box-sizing: border-box;">
<tr style="background-color: white; border-top-color: rgb(204, 204, 204); border-top-style: solid; border-top-width: 1px; box-sizing: border-box;"><th style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">Collection Name</th><th style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">direct <br />
compression</th><th style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">Dictionary <br />
compression</th><th style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">Gains</th><th align="right" style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">Average <br />
<div style="text-align: center;">
<span style="line-height: 23.2727px;">unit</span></div>
</th><th style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">Range</th></tr>
</thead><tbody style="box-sizing: border-box;">
<tr style="background-color: white; border-top-color: rgb(204, 204, 204); border-top-style: solid; border-top-width: 1px; box-sizing: border-box;"><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">Small JSON records</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">x1.331 - x1.366</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">x5.860 - x6.830</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;"><b>~ x4.7</b></td><td align="right" style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">300</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">200 - 400</td></tr>
<tr style="background-color: #f8f8f8; border-top-color: rgb(204, 204, 204); border-top-style: solid; border-top-width: 1px; box-sizing: border-box;"><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">Mercurial events</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">x2.322 - x2.538</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">x3.377 - x4.462</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;"><b>~ x1.5</b></td><td align="right" style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">1.5 KB</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">20 - 200 KB</td></tr>
<tr style="background-color: white; border-top-color: rgb(204, 204, 204); border-top-style: solid; border-top-width: 1px; box-sizing: border-box;"><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">Large JSON docs</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">x3.813 - x4.043</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">x8.935 - x13.366</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;"><b>~ x2.8</b></td><td align="right" style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">6 KB</td><td style="border: 1px solid rgb(221, 221, 221); box-sizing: border-box; padding: 6px 13px;">800 - 20 KB</td></tr>
</tbody></table>
<br />
These compression gains are achieved without any speed loss, and even feature faster decompression processing. As one can see, it's no "small improvement". This method can achieve transformative gains, especially for very small records.<br />
<br />
Large documents will benefit proportionally less, since dictionary gains are mostly effective in the first few KB. Then there is enough history to build upon, and the compression algorithm can rely on it to compress the rest of the file.<br />
<br />
Dictionary compression will work if there is some correlation in a family of small data (common keywords and structure). Hence, deploying one dictionary per type of data will provide the greatest benefits.<br />
<br />
Anyway, if you are in a situation where compressing small data can be useful for your use case (databases and contextless communication scenarios come to mind, but there are likely other ones), you are welcome to <a href="https://github.com/Cyan4973/zstd/releases">have a look at this new open source tool and compression methodology</a> and report your experience or feature requests.<br />
<br />
Zstd is now getting closer to its v1.0 release; it's a good time to provide feedback, so it can be integrated into the final specification.<br />
<br />Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com12tag:blogger.com,1999:blog-834134852788085492.post-16413089626593457022015-10-14T13:58:00.003+02:002018-02-26T10:43:06.601+01:00Huffman revisited part 5 : combining multi-streams with multi-symbols<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: left;">
<img border="0" data-original-height="96" data-original-width="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSa3WlE_8pOh5jfWEFkve3gcu21z1tNpHIHaOz6SdTqvpqn2uHbe-MubS2zmnT4Ert_MbMPAb3A3BxtRjXyxbdmBRgvvcwatWNrtMZ6aOmYkp2mggx-RWQF4s5yvVqlgJzzl77vLfOk6Y/s1600/huffSample150.png" /> <span style="text-align: left;"> In </span><a href="http://fastcompression.blogspot.com/2015/10/huffman-revisited-part-4-multi-bytes.html" style="text-align: left;">previous article</a><span style="text-align: left;">, a method to create a fast multi-symbols Huffman decoder has been described. The experiment used a single bitstream, for simplicity. However, earlier investigation proved that </span><a href="http://fastcompression.blogspot.com/2015/07/huffman-revisited-part-2-decoder.html" style="text-align: left;">using multiple bitstreams </a><span style="text-align: left;">was a good choice for speed on modern OoO (Out of Order) cpus, such as Intel's Core. So it seems only logical to combine both ideas and see where they lead.</span></div>
<br />
<br />
<div>
<br /></div>
<div>
The previous multi-streams format produced an entangled output, where each stream contributes regularly to 1-in-4 symbols, as shown below :</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgl2YvbDO0rR0lbsQcTO5W2V3naR7Qyvri4M7vG0sV8q9BeqTntR3d0FsmyeJc5nNzn5IMqVgw5UvHbx4I8jkXdWU5QB2hx2WCeMkMOcukeG3ZBw55kQXfpprKGXcoX4pMvDoan_0YTWNg/s1600/multiStreams_singleSymbol_outputScheme.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="13" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgl2YvbDO0rR0lbsQcTO5W2V3naR7Qyvri4M7vG0sV8q9BeqTntR3d0FsmyeJc5nNzn5IMqVgw5UvHbx4I8jkXdWU5QB2hx2WCeMkMOcukeG3ZBw55kQXfpprKGXcoX4pMvDoan_0YTWNg/s400/multiStreams_singleSymbol_outputScheme.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i>Multi-Streams single-symbol entangled output pattern</i></div>
<div>
<br /></div>
<div>
This pattern is very predictable, therefore decoding operations can be done in no particular order, as each stream knows at which position to write its next symbol.</div>
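The regularity described above can be made concrete with a toy loop. This is an illustrative sketch, not huff0's actual code : `Stream` and `decode_one()` are stand-ins for a real bitstream and its table-lookup decode step.

```c
#include <stddef.h>

typedef struct { const unsigned char* ptr; } Stream;  /* toy stand-in */

/* toy stand-in for a real table-lookup decode step */
static unsigned char decode_one(Stream* s) { return *s->ptr++; }

/* Illustrative 4-streams entangled loop : each stream regularly fills
 * 1-in-4 output positions, so no synchronization between streams is
 * needed, and the 4 decode steps are independent from the CPU's
 * perspective. (A real decoder also handles a small tail when
 * nbSymbols is not a multiple of 4.) */
static void decode_entangled(unsigned char* out, size_t nbSymbols,
                             Stream* s0, Stream* s1, Stream* s2, Stream* s3)
{
    for (size_t n = 0; n < nbSymbols / 4; n++) {
        out[4*n + 0] = decode_one(s0);
        out[4*n + 1] = decode_one(s1);
        out[4*n + 2] = decode_one(s2);
        out[4*n + 3] = decode_one(s3);
    }
}
```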
<div>
This critical property is lost with multi-symbols decoding operations :</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy0Rxc4S6q9qWtR7YhZfj4__OMyxPxwS8MMeKRJhj6Na2nV9_MXUooUJoP8v7X8IWzJEWgg0GHxJy2KXqCspnWIHuzlU5wnnEtPlQuUI8FCUuvVgWUZcSF9nsc5EONpL6G5FRFN02oJUI/s1600/multiStreams_multiSymbols_outputScheme.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="13" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgy0Rxc4S6q9qWtR7YhZfj4__OMyxPxwS8MMeKRJhj6Na2nV9_MXUooUJoP8v7X8IWzJEWgg0GHxJy2KXqCspnWIHuzlU5wnnEtPlQuUI8FCUuvVgWUZcSF9nsc5EONpL6G5FRFN02oJUI/s400/multiStreams_multiSymbols_outputScheme.png" width="400" /></a></div>
<div style="text-align: center;">
<i>Multi-Streams multi-symbols entangled output pattern (example)</i></div>
<div>
<br /></div>
<div>
It's no longer clear where next symbols must be written. Hence, parallel-streams decoding becomes synchronization-dependent, nullifying multi-streams speed advantage.</div>
<div>
<br /></div>
<div>
There are several solutions to this problem :</div>
<div>
- On the decoder side, reproduce the regular output pattern, by breaking multi-symbols sequences into several single-symbol write operations. It works, but costs performance, since a single decode now produces multiple writes (or worse, introduces an unpredictable branch), and each stream requires its own tracking pointer.</div>
<div>
- On the encoder side, take into consideration the decoder's natural pattern, by grouping symbols exactly the same way they will be regenerated. This works too, and is the fastest method from a decoder perspective, but it introduces some non-negligible complexity on the encoder side.</div>
<div>
<br /></div>
<div>
Ultimately, none of these solutions looked particularly attractive. I was especially worried about introducing a "rigid format", specifically built for a single efficient way to decode. For example, taking into consideration the way symbols will be grouped during decoding ties the format to a specific table construction.</div>
<div>
An algorithm created for a large number of platforms cannot tolerate such rigidity. Maybe some implementations will prefer single-symbol decoding, maybe other ones will select a custom amount of memory for decoding tables. Such flexibility must be possible.</div>
<div>
<br /></div>
<div>
The final choice was to remove entanglement. The new output pattern becomes :</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip_Mnvq4FgAU4vy4JBNdt8AhmUm0A7ydw5cl9i3wiWojaPQMadhTa_L_00Y5Ej_lZ6yKeBMyWJzqBUNTKwls2NpWGXRHKU9ODkC5Fm98AWe1k9qNoeI7dDCob7j4vPuC5Kq9RhqSHGbz0/s1600/multiStreams_multiSymbol_segmentOutputPattern.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="11" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip_Mnvq4FgAU4vy4JBNdt8AhmUm0A7ydw5cl9i3wiWojaPQMadhTa_L_00Y5Ej_lZ6yKeBMyWJzqBUNTKwls2NpWGXRHKU9ODkC5Fm98AWe1k9qNoeI7dDCob7j4vPuC5Kq9RhqSHGbz0/s320/multiStreams_multiSymbol_segmentOutputPattern.png" width="320" /></a></div>
<div style="text-align: center;">
<i>Multi-Streams multi-symbols segment output pattern (example)</i></div>
<div>
<br /></div>
<div>
With 4 separate segments being decoded in parallel, the design looks a lot like classical multi-threading, but at micro-op level. And that's a fair enough description.</div>
<div>
<br /></div>
<div>
The picture looks simpler, but from a coding perspective, it's not.<br />
The first issue is that each segment has its own tracking pointer during decoding operation. It increases the number of required registers from 1 to 4. Not a huge deal when registers are plentiful, but that's not always the case (x86 32-bits mode notably).</div>
<div>
The second, more important issue is that each segment gets decoded at its own speed, meaning some of them will be finished before the others. Previously, entanglement ensured that all streams would finish together, with just a small tail to take care of. This is now more complex : we don't know which segment will finish first, and the "tail" sequence is now spread over multiple streams, of unpredictable length.<br />
<br />
These complexities will cost a bit of performance, but we get serious benefits in exchange :</div>
<div>
- Multi-streams decoding is an option : platforms may decide to decode segments serially, one after another, or 2 by 2, depending on their optimal capabilities.</div>
<div>
- Single-symbol and multi-symbols decoding strategies are compatible.</div>
<div>
- Decoding table depth can be any size, including "frugal" ones trading cpu operations for memory space.</div>
<div>
In essence, it's opened to a lot more trade-offs.</div>
<div>
<br /></div>
<div>
These new properties introduce a new API requirement : the regenerated size must be known, exactly, before starting the decoding operation (previously, an upper bound on the regenerated size was enough). This is required to know where each segment starts, before even finishing the previous ones.</div>
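Knowing the exact regenerated size makes the segment layout computable up-front. A minimal sketch, assuming a simple round-up split into 4 equal segments (the actual split rule used by huff0 may differ; this only illustrates why the exact size is needed):

```c
#include <stddef.h>

/* Sketch : with the exact regenerated size known before decoding,
 * the start position of each of the 4 output segments can be computed
 * immediately, letting the 4 streams write independently.
 * The round-up split rule below is an assumption of this example. */
static void segment_starts(size_t regeneratedSize, size_t starts[4])
{
    size_t const segSize = (regeneratedSize + 3) / 4;   /* round up */
    for (int s = 0; s < 4; s++)
        starts[s] = (size_t)s * segSize;
    /* the last segment is possibly shorter, ending at regeneratedSize */
}
```

With only an upper bound, these start positions could not be derived, which is exactly why the stricter API requirement appears.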
<div>
<br /></div>
<div>
So, what kind of performance does this new design deliver ? Here is an example, based on generic samples :</div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeaIP37x791LRnojOO6ZOi6MgDVpLiAiLjqsqfOSbkSgWOGgsx_WqlKcoPVVD6vg_VtjG65LgkGLapJ9HhM4Yc0PBb7m8WLSG275PF2c7vuI1pWTQO7FfukTiqWhva8vtSs_elnDe977g/s1600/decodingSpeedMultiStream32KB.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeaIP37x791LRnojOO6ZOi6MgDVpLiAiLjqsqfOSbkSgWOGgsx_WqlKcoPVVD6vg_VtjG65LgkGLapJ9HhM4Yc0PBb7m8WLSG275PF2c7vuI1pWTQO7FfukTiqWhva8vtSs_elnDe977g/s400/decodingSpeedMultiStream32KB.png" width="400" /></a></div>
<div style="text-align: center;">
<i>Decoding speed, multi-streams, 32 KB blocks</i></div>
</div>
<div>
<br /></div>
<div>
The picture looks similar to previous "single-stream" measurements, but features much higher speeds. The single-symbol variant wins when compression ratio is very poor. Quite quickly though, the double-symbols variant dominates the region where Huffman compression makes most sense (underlined in red boxes). Quad-symbols performance catches up when the distribution becomes more favorable, and clearly dominates later on, but that's a region where Huffman is no longer an optimal choice for entropy compression.<br />
<br />
Still, by providing speed in the range of <b>800-900 MB/s</b>, the new multi-symbol decoder delivers sensible improvements over previous version. So, job done ?<br />
<br />
Let's dig a little deeper. You may have noticed that previous measurements were produced on block sizes of 32 KB, which is a nice "average" situation. However, in many compressors such as <a href="http://www.zstd.net/">zstd</a>, blocks of symbols are the product of (LZ) transformation, and their size can vary, by a lot. Therefore, is above conclusion still valid when block size changes ?<br />
<br />
Let's test this hypothesis in both directions, by measuring large (128 KB) and small (8 KB) block sizes. Results become :<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh12XqbqJMrVIRm_enISC7kfEOEdu3IpQHpZmeZ1loBvkRtjp-QgEsKxi3_U_HOy2XjcuTKelibllOm0jIcl51NljCCz3-lnElXWR_lNiZi55QOd3cJDpMYi1vY7pWAuA0z6rgg9YiSTOU/s1600/decodingSpeedMultiStream128KB.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="207" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh12XqbqJMrVIRm_enISC7kfEOEdu3IpQHpZmeZ1loBvkRtjp-QgEsKxi3_U_HOy2XjcuTKelibllOm0jIcl51NljCCz3-lnElXWR_lNiZi55QOd3cJDpMYi1vY7pWAuA0z6rgg9YiSTOU/s400/decodingSpeedMultiStream128KB.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i>Decoding speed, multi-streams, 128 KB blocks</i></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzhdyb8q6UFRzglpQrJ-a5FHv15-A-62_FvrtBLOMXv718W4wRNB7hP1qabjyM8fx2s8C_3_B_S9UBKbjUMQN6NWyNojA1DLFIkTn3OXh2e9zvv_LRaeqKr0Nlq4OFnX4NdNN29GmL2Sc/s1600/decodingSpeedMultiStream8KB.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="207" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzhdyb8q6UFRzglpQrJ-a5FHv15-A-62_FvrtBLOMXv718W4wRNB7hP1qabjyM8fx2s8C_3_B_S9UBKbjUMQN6NWyNojA1DLFIkTn3OXh2e9zvv_LRaeqKr0Nlq4OFnX4NdNN29GmL2Sc/s400/decodingSpeedMultiStream8KB.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i>Decoding speed, multi-streams, 8 KB blocks</i></div>
<br />
While the general picture may look similar, some differences are indeed spotted.<br />
<br />
First, 128 KB blocks are remarkably faster than 8 KB ones. This is a natural consequence of table construction time, which is a fixed cost whatever the block size. Hence, its relative impact is inversely proportional to block size.<br />
At 128 KB, symbol decoding dominates. It makes the quad-symbols version slightly better compared to double-symbols. Not necessarily enough, but still an alternative to consider when the right conditions are met.<br />
At 8 KB, the reverse situation happens : quad-symbols is definitely out of the equation, due to its larger table construction time. Single-symbol relative performance is now better, taking the top spot when compression ratio is low enough.<br />
<br /></div>
<div>
With so many parameters, it may seem difficult to guess which version will perform best on a given compressed block, since it depends on the content to decode. Fortunately, such a guess can be performed automatically by the library itself.<br />
<a href="https://github.com/Cyan4973/FiniteStateEntropy">huff0</a>'s solution is to propose a single decoder (<span style="font-family: "courier new" , "courier" , monospace;">HUF_decompress()</span>) which makes such selection transparently. Given a set of heuristic values (table construction time, raw decoding speed, quantized compression ratio), it will automatically select which decoding algorithm it believes is a better fit for the job.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSHnNbs2_agzddZ7ELVVMuxQnWENWYwM-M3LQ_Paa36mow88Am2Pe7ZFk0Z7ti3I-KVB_4vAudRRYvL6Mf4u-4ztjir8_UEqACE9wfiK4q88xPuvaaaG3Nk9BJdZtp6DVOaofHVuQaG6M/s1600/decodingSpeedMultiStream32KBwithAuto.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSHnNbs2_agzddZ7ELVVMuxQnWENWYwM-M3LQ_Paa36mow88Am2Pe7ZFk0Z7ti3I-KVB_4vAudRRYvL6Mf4u-4ztjir8_UEqACE9wfiK4q88xPuvaaaG3Nk9BJdZtp6DVOaofHVuQaG6M/s400/decodingSpeedMultiStream32KBwithAuto.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i>Decoding speed, auto-mode, 32 KB blocks</i></div>
<br />
Ultimately, it only impacts faster speeds, since all versions are compatible and produce valid results. And if a user doesn't like automatic choices, it's still possible to manually override which decoder version is preferred.</div>
<div>
<div>
<br /></div>
<div>
As usual, the result of this investigation is made available as <a href="https://github.com/Cyan4973/FiniteStateEntropy">open source software, at github,</a> under a BSD license. If you are used to previous versions of <a href="https://github.com/Cyan4973/FiniteStateEntropy">fse</a>, pay attention that the directory and file structures have been changed quite a bit. In order to clarify interfaces, <a href="https://github.com/Cyan4973/FiniteStateEntropy/blob/master/lib/huff0.h">huff0</a> now gets its own files and header.<br />
<br /></div>
</div>
Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com1tag:blogger.com,1999:blog-834134852788085492.post-51659111764432409252015-10-14T12:11:00.001+02:002018-02-26T10:28:25.941+01:00Huffman revisited, Part 4 : Multi-bytes decoding<div class="separator" style="clear: both; text-align: left;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSa3WlE_8pOh5jfWEFkve3gcu21z1tNpHIHaOz6SdTqvpqn2uHbe-MubS2zmnT4Ert_MbMPAb3A3BxtRjXyxbdmBRgvvcwatWNrtMZ6aOmYkp2mggx-RWQF4s5yvVqlgJzzl77vLfOk6Y/s1600/huffSample150.png" imageanchor="1" style="clear: left; display: inline !important; float: left; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="96" data-original-width="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSa3WlE_8pOh5jfWEFkve3gcu21z1tNpHIHaOz6SdTqvpqn2uHbe-MubS2zmnT4Ert_MbMPAb3A3BxtRjXyxbdmBRgvvcwatWNrtMZ6aOmYkp2mggx-RWQF4s5yvVqlgJzzl77vLfOk6Y/s1600/huffSample150.png" /></a> In most Huffman implementations I'm aware of, decoding symbols is achieved in a serial fashion, one-symbol-after-another.</div>
<br />
Fast decoding is not trivial, but it has already been well studied. Ultimately, one symbol per decoding operation becomes its upper limit.<br />
<br />
Consider how a fast Huffman decoder works : all possible bit combinations are pre-calculated into a table of predefined maximum depth. For each bit combination, a single table lookup provides the decoded symbol and the number of bits to consume.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiayCaM5KoX3pZNBP00vFLx0n1y-QzkARo5Bcc1VU6ef2Q-4tGxt3_F88FuTHsszM2CBazYj1h6yE_6tuqeSVPhAMXF65hFS-ETEl1kE5yuyCvcloxja8B507fXnzScVqJH45_6pR4wvXM/s1600/HuffmanSingleSymbolLookup.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiayCaM5KoX3pZNBP00vFLx0n1y-QzkARo5Bcc1VU6ef2Q-4tGxt3_F88FuTHsszM2CBazYj1h6yE_6tuqeSVPhAMXF65hFS-ETEl1kE5yuyCvcloxja8B507fXnzScVqJH45_6pR4wvXM/s200/HuffmanSingleSymbolLookup.png" width="199" /></a></div>
<div style="text-align: center;">
<i>Huffman Table lookup (example)</i></div>
<br />
More complex schemes may break the decoding into 2 steps, most notably in an attempt to reduce lookup table sizes while still managing to decode symbols which exceed table depth. But it doesn't change the whole picture : it still takes a full set of operations to decode a single symbol.<br />
<br />
In an attempt to extract more speed from decoding operation, I was curious to investigate if it would be possible to decode <i>more than one symbol</i> per lookup.<br />
<br />
Intuitively, that sounds plausible. Consider some large Huffman decoding table : there is ample room for some bit sequences to represent 2 or more unambiguous symbols. For example, if one symbol is dominant, it only needs 1 bit. So, with only 2 bits, we have a 25% chance of getting a sequence which means "decode 2 dominant symbols in a row", in a single decode operation.<br />
<br />
This can be visualized with the example below :<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYYYKgdwACvekEv4ybqolhOA4maY812s3AjrAvV_w-Ximxnc7PJnq5RsMzEZoqV4PimWYaiu-VlFr3FQeiqPXmu8W3H2Ane5J0LYn6MA_mOWa7dQuMWhno6w6YubYa9vJXlbs5sx1zuPc/s1600/SingleSymbolDecodingTable.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="92" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYYYKgdwACvekEv4ybqolhOA4maY812s3AjrAvV_w-Ximxnc7PJnq5RsMzEZoqV4PimWYaiu-VlFr3FQeiqPXmu8W3H2Ane5J0LYn6MA_mOWa7dQuMWhno6w6YubYa9vJXlbs5sx1zuPc/s400/SingleSymbolDecodingTable.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: center;">
<i>Example of small single-symbol decoding table</i></div>
<br />
which can be transformed into :<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7UKUPKdbvs4DNFURSpaMgNxWtd1mAhs_6k8hRLuPc3HnRXbpC0-_AGprd3PPy4CEWS0oaVpv6uy5ZEZBAJt6UWZqjgVdRL-v8Ut4CV_NXMIaL0vR3HCp5DCWpJsRBhRJKpAF1rN_7qpM/s1600/HuffmanMultiSymbolsDecoding.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="105" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7UKUPKdbvs4DNFURSpaMgNxWtd1mAhs_6k8hRLuPc3HnRXbpC0-_AGprd3PPy4CEWS0oaVpv6uy5ZEZBAJt6UWZqjgVdRL-v8Ut4CV_NXMIaL0vR3HCp5DCWpJsRBhRJKpAF1rN_7qpM/s400/HuffmanMultiSymbolsDecoding.png" width="400" /></a></div>
<div style="text-align: center;">
<i>Example of multi-symbols decoding table</i></div>
<br />
In some ways, it can look reminiscent of <a href="https://en.wikipedia.org/wiki/Tunstall_coding">Tunstall codes</a>, since we basically try to fit as many symbols as possible into a given depth. But it's not : we don't guarantee reading the entire depth each time ; the number of bits read is still variable, just more regular. And there is no "order 1 correlation" : probabilities remain the same per symbol, independent of the preceding prefix.<br />
<br />
Even with the above table available, there is still the question of using it efficiently. It does no good if a single decoding step becomes a lot more complex in order to potentially decode multiple symbols. As an example of what <i>not </i>to do, a straightforward approach would be to start decoding the first symbol, then figure out if there is some room left for another one, proceed with the second symbol, then test for a 3rd one, etc. Each of these tests becomes an unpredictable branch, destroying performance in the process.<br />
<br />
The breakthrough came from observing an LZ decompression process such as <a href="http://www.lz4.org/">lz4</a>'s : it's insanely fast because it decodes <i>matches</i>, i.e. runs of symbols, as a <i>single copy </i>operation.<br />
This is in essence what we should do here : copy a sequence of multiple symbols, and <i>then </i>decide how many symbols there really are. It avoids branches.<br />
On current-generation CPUs, copying 2 or 4 bytes is not much slower than copying a single byte, so the strategy is effective. Overwriting the same position is also not an issue, thanks to modern cache structures.<br />
<br />
With this principle settled, it now requires an adapted lookup table structure to work with. I settled on these :<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: center;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-MT848g0Pym6sMxU66sVb_LLh2D2V2CpY6bWB0KEmXUCIcezvvV-3SCtA_qcwnHBRycxnea3WT36m9Q1iRcI4d_VSaFRUhyphenhyphenTszVJiNAstOlj_98OkCFI77xrnVzt9RKLbEsyEXwp8CKk/s1600/multi-symbols+cell+structure.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="156" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-MT848g0Pym6sMxU66sVb_LLh2D2V2CpY6bWB0KEmXUCIcezvvV-3SCtA_qcwnHBRycxnea3WT36m9Q1iRcI4d_VSaFRUhyphenhyphenTszVJiNAstOlj_98OkCFI77xrnVzt9RKLbEsyEXwp8CKk/s400/multi-symbols+cell+structure.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<i>Huffman lookup cell structure</i></div>
<br />
The double-symbols structure might seem unambitious : after all, it can only store up to 2 symbols in the `<span style="font-family: "courier new" , "courier" , monospace;">sequence</span>` field. But in fact, tests will show it's a good trade-off, since most of the time, 2 symbols is what can reasonably be stored within a table lookup depth.<br />
<br />
Some quick maths : the depth of a lookup table is necessarily limited, in order to fit into the memory cache where access times are best. An Intel CPU's L1 data cache is typically 32 KB (potentially shared due to hyper-threading). Since no reasonable OS is single-threaded anymore, let's not use the entire cache : half seems good enough, that's 16 KB. Since a single double-symbols cell is 4 bytes (incidentally, the same size as an <a href="http://fastcompression.blogspot.fr/2014/01/fse-decoding-how-it-works.html">FSE decoder</a> cell), that means 4K cells, hence a maximum depth of 12 bits. Within 12 bits, it's unlikely to get more than 2 symbols at a time. But this outcome entirely depends on the alphabet distribution.<br />
<br />
This limitation must be balanced against the increased complexity of lookup table construction. The quad-symbols one is significantly slower, due to more fine-tuned decisions and the recursive nature of the algorithm, defeating inlining optimizations. The graph below shows the relative speed of each construction algorithm (the right side, in grey, is provided for information only : if the target distribution falls into this category, Huffman entropy coding is no longer a recommended choice).<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipr3UUxSI7y_HZwacuS8WZWRYZWL-87aBuxV1pXZXCTULaQz308bww-xCOgmMO_3UiD1VPxbeWhP8IwrTtx8DyU9RttzkMOkKEdCroWgEIFQkqL8YGL5udMz2ZGoR_J3evcuVfhG7gZjQ/s1600/lut+construction+speed.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipr3UUxSI7y_HZwacuS8WZWRYZWL-87aBuxV1pXZXCTULaQz308bww-xCOgmMO_3UiD1VPxbeWhP8IwrTtx8DyU9RttzkMOkKEdCroWgEIFQkqL8YGL5udMz2ZGoR_J3evcuVfhG7gZjQ/s400/lut+construction+speed.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: center;">
<i>Lookup table construction speed</i></div>
<br />
The important part is roughly underlined in red boxes, showing the areas relevant for some typical LZ symbols. The single-symbol lut construction is always significantly faster. To make sense, the slower table construction must be compensated by improved symbol decoding speed. Which, fortunately, is the case.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYlUB6XNbfc3LTjASA271dufvFoMMTikM0VItzmes0GbKHKgRO4jf5uFBQEN8XV8ipfdU60YfN84w0_Db6G4j2s5p3j7thetXX6DjV4fozjxF8z3pXrwbNGWkCrsyq7xt00BHpORdJtJs/s1600/decodingSpeedSinglestream32KB.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYlUB6XNbfc3LTjASA271dufvFoMMTikM0VItzmes0GbKHKgRO4jf5uFBQEN8XV8ipfdU60YfN84w0_Db6G4j2s5p3j7thetXX6DjV4fozjxF8z3pXrwbNGWkCrsyq7xt00BHpORdJtJs/s400/decodingSpeedSinglestream32KB.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: center;">
<i>Decoding speed, at 32 KB block</i></div>
<br />
As suspected, the "potentially faster" quad-symbols variant is hampered by its slower construction time. It manages to become competitive in the "length & offset" area, but since it costs 50% more memory, it needs to be unquestionably better to justify that cost. Which it is, as the alphabet distribution becomes more squeezed. By that point though, it becomes questionable whether Huffman is still a reasonable choice for the selected alphabet, since its compression power starts to wane significantly against more precise methods such as <a href="http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html">FSE</a>.<br />
The "double-symbols" variant, on the other hand, takes off relatively fast and dominates the distribution region where Huffman makes the most sense, making it a prime contender for an upgrade.<br />
<br />
By moving from a 260 MB/s baseline to a faster 350-450 MB/s region, the new decoding algorithm delivers fairly substantial gains, but we still have not reached the level of the <a href="http://fastcompression.blogspot.com/2015/07/huffman-revisited-part-2-decoder.html">previous multi-stream variant</a>, which gets closer to 600 MB/s. The logical next step is to combine both ideas, creating a multi-stream multi-symbols variant. A challenge which proved more involved than it sounds. But that's for another post ...<br />
<br />Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com8tag:blogger.com,1999:blog-834134852788085492.post-38526708265229550822015-08-25T14:25:00.000+02:002015-11-17T12:18:46.614+01:00Fuzz testing Zstandard<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWyW76NbWjsxqz4Wz8jdgveulknHDAO-DtvsY20dK8g8jz50TCQ6SJFl3P-9d5z9Z3m32HBVjeAuS_euhJupLD75AsI88n8pSJ1tXBCnjP-iWs4Vz7zUlvSYczSsVfBK2rLl1O2TldPSg/s1600/testing-code-is-for-wimps-real-men-test-in-production%255B1%255D.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWyW76NbWjsxqz4Wz8jdgveulknHDAO-DtvsY20dK8g8jz50TCQ6SJFl3P-9d5z9Z3m32HBVjeAuS_euhJupLD75AsI88n8pSJ1tXBCnjP-iWs4Vz7zUlvSYczSsVfBK2rLl1O2TldPSg/s200/testing-code-is-for-wimps-real-men-test-in-production%255B1%255D.jpg" width="172" /></a></div>
An advanced issue that any production-grade codec must face is the ability to deal with erroneous data.<br />
<div>
<br /></div>
<div>
Such requirement tends to come at a second development stage, since it's already difficult enough to make an algorithm work under "normal conditions". Before reaching erroneous data, there is already a large number of valid edge cases to properly deal with.</div>
<div>
<br /></div>
<div>
Erroneous input is nonetheless important, not least because it can degenerate into a full program crash if not properly taken care of. At a more advanced level, it can even serve as an attack vector, trying to push executable code into unauthorized memory segments. Even without reaching that point, just the prospect of making a system crash with a predictable pattern is nuisance enough.</div>
<div>
<br /></div>
<div>
Dealing with such problems can be partially mitigated using stringent <a href="https://en.wikipedia.org/wiki/Unit_testing">unit tests</a>. But that's more easily said than done. Not only is it painful to build <i>and maintain</i> a thorough and hopefully complete list of unit tests for each function, it's also of little use in predicting some unexpected behavior resulting from an improbable chain of events at different stages in the program.</div>
<div>
<br />
<div>
Hence the idea of finding such bugs at "system level". The system is fed a set of input data, and the results are observed. If you create the test set manually, you will likely test some important, visible and expected use cases, which is still a pretty good start. But some less obvious interaction patterns will be missed.</div>
<div>
<br /></div>
<div>
That's where the realm of <a href="https://en.wikipedia.org/wiki/Fuzz_testing">Fuzz Testing</a> begins. The main idea is that randomness will do a better job at finding stupid forgotten edge cases, which are good candidates to crash a program. And it works pretty well. But how to set up "random" ?</div>
</div>
<div>
<br /></div>
<div>
In fact, even "random" must be defined within some limits. For example, if you only feed a lossless compression algorithm with some random input, it will simply not be able to compress it, meaning you will always test the same code path. </div>
<div>
<br /></div>
<div>
The way I've dealt with this issue for <a href="http://www.lz4.org/">lz4 </a>or <a href="http://www.zstd.net/">zstd </a>is to create programs able to generate "random compressible data", with some programmable characteristics (compressibility, symbol variation, reproducibility by seed). And it helped a lot to test the valid code paths.</div>
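A minimal sketch of such a generator (simplified, with invented names and constants; not the actual tool shipped with lz4/zstd) : it mixes fresh literals with copies of already-produced data, so redundancy, hence compressibility, follows the requested probability, and the same seed always reproduces the same buffer.

```c
#include <stddef.h>

static unsigned g_rand;
static unsigned rand31(void)   /* tiny deterministic PRNG, seeded below */
{
    g_rand = g_rand * 2654435761u + 2246822519u;
    return (g_rand >> 3) & 0x7FFFFFFFu;
}

void gen_compressible(char* dst, size_t size, int matchPct, unsigned seed)
{
    size_t pos = 0;
    g_rand = seed;
    while (pos < size) {
        if (pos > 16 && (int)(rand31() % 100) < matchPct) {
            /* emit a "match" : replay a past segment, creating redundancy */
            size_t len  = 4 + rand31() % 12;
            size_t dist = 1 + rand31() % pos;
            size_t i;
            if (len > size - pos) len = size - pos;
            for (i = 0; i < len; i++) dst[pos + i] = dst[pos + i - dist];
            pos += len;
        } else {
            dst[pos++] = (char)('A' + rand31() % 20);  /* fresh literal */
        }
    }
}
```

Raising `matchPct` makes the output more repetitive, hence more compressible; a fixed seed makes any failure reproducible.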
<div>
<br /></div>
<div>
The decompression side is more concerned with resistance to invalid input. But even with random parameters, one needs to target interesting properties to test. Typically, a valid decompression stage is run first, to serve as a model. Then some "credible" failure scenarios are built from it. Zstd's fuzzer tool typically tests : truncated input, a too-small destination buffer, and a noisy source created from a valid one with some random changes, in order to bypass overly simple screening stages.</div>
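Such corruption scenarios are simple to produce from a valid frame. An illustrative sketch (not zstd's actual fuzzer code; names are invented) :

```c
#include <stddef.h>
#include <stdlib.h>

/* truncated input : simply report a shorter size to the decoder */
size_t truncated_size(size_t srcSize, unsigned seed)
{
    srand(seed);
    return (size_t)rand() % srcSize;           /* anywhere in [0, srcSize) */
}

/* noisy source : flip a few bytes of a valid frame, so that it still
   passes simple screening stages (magic number, headers) most of the time */
void make_noisy(char* frame, size_t srcSize, int nbFlips, unsigned seed)
{
    srand(seed);
    while (nbFlips-- > 0)
        frame[(size_t)rand() % srcSize] ^= (char)(1 + rand() % 255);
}
```

In each case, the decoder under test is then expected to return a clean error, never to crash or read out of bounds.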
<div>
<br /></div>
<div>
All these tests were extremely useful to strengthen the reliability of the code. But the realization that "random" was in fact defined within some limits makes it clear that some other code paths, outside the limits of "random", may still fail if properly triggered.</div>
<div>
<br /></div>
<div>
But how to find them ? As stated earlier, brute force is not a good approach. There are too many similar cases which would trivially reduce to a single code path. For example, the compressed format of <a href="http://www.zstd.net/">zstd </a>includes an initial 4-bytes identifier. A dumb random input would therefore have a 1 in 4 billion chance to pass such early screening, leaving little energy to test the rest of the code.</div>
<div>
<br /></div>
<div>
For a long time, I believed it was necessary to know one's code in detail to create a useful fuzzer tool. Thanks to a kind notification from Vitaly Magerya, it turns out this is no longer the only solution. I discovered earlier today the<a href="https://en.wikipedia.org/wiki/American_Fuzzy_Lop"> American Fuzzy Lop</a>. No, not the rabbit; <b><a href="http://lcamtuf.coredump.cx/afl/">this </a></b>test tool, by <a href="http://lcamtuf.coredump.cx/">Michał Zalewski</a>.</div>
<div>
<br /></div>
<div>
It's <i>relatively </i>easy to set up (for Unix programmers). Build, install and usage follow clean conventions, and the Readme is a fairly good, easy-to-follow read. With just a few initial test cases to provide, a special compilation stage and a command line, the tool is ready to go.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://dl.dropboxusercontent.com/u/59565338/Images/afl.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="252" src="https://dl.dropboxusercontent.com/u/59565338/Images/afl.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">American Fuzzy Lop, testing zstd decoder</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
It displays a simple live board in text mode, which successfully captures the mind. One can see, or rather guess, how the genetic algorithm tries to create new use cases. It basically starts from the initially provided set of tests, and creates new ones by modifying them using simple transformations. It analyzes the results, which are relatively precise thanks to special instrumentation installed in the target binary during the compilation stage. It deduces from them the code paths triggered, and whether a new one has been found. Then it generates new test cases built on top of "promising" previous ones, restarts, ad infinitum.</div>
<div>
<br /></div>
<div>
This is simple and brilliant. Most importantly, it is <i>generic</i>, meaning no special knowledge of zstd was required for it to test thoroughly the algorithm and its associated source code.</div>
<div>
<br /></div>
<div>
There are obviously limits. For example, the amount of memory that can be spent on each test. Therefore, successfully resisting for hours the tricky tests created by this fuzzer tool is not the same as "bug free", but it's a damn good step in that direction, and at least deserves the term "robust".</div>
<div>
<br /></div>
<div>
Anyway, the result of all these tests, using internal and external fuzzer tools, is <a href="https://github.com/Cyan4973/zstd/releases">a first release of Zstandard</a>. It's not yet "format stable", meaning specifically that the current format is not guaranteed to remain unmodified in the future (such stage is planned to be reached early 2016). But it's already quite robust. So if you wanted to test the algorithm in your application, now seems a good time, <a href="https://emeryblogger.files.wordpress.com/2012/05/testing-code-is-for-wimps-real-men-test-in-production.jpg">even in production environment</a>.</div>
<div>
<br />
<b><i>[Edit]</i></b> : If you're interested in fuzz testing, I recommend reading an <a href="https://extrememoderate.wordpress.com/2015/11/16/fuzz-testing-compressors/">excellent follow up by Maciej Adamczyk</a>, which get into great details on how to do your own fuzz testing for your project.</div>
Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com6tag:blogger.com,1999:blog-834134852788085492.post-28367487076969723052015-08-19T17:40:00.000+02:002015-10-14T13:48:07.350+02:00Accessing unaligned memory<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKlg0W3IgkolpGKAhpd4yG4nzriHDVuiAKGKfLkfK3OAbM2TrWrc_OPjzVAcGYyZ8QVDOQV59Irir8SFukXorA9zFyhuiBHQsx_DZjoavBRv15Z7muekc7_ml1FDfVK_zg8oEH57LucrA/s1600/er_photo_213457%255B1%255D.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKlg0W3IgkolpGKAhpd4yG4nzriHDVuiAKGKfLkfK3OAbM2TrWrc_OPjzVAcGYyZ8QVDOQV59Irir8SFukXorA9zFyhuiBHQsx_DZjoavBRv15Z7muekc7_ml1FDfVK_zg8oEH57LucrA/s200/er_photo_213457%255B1%255D.jpg" width="200" /></a></div>
Thanks to <a href="http://www.first-world.info/">Herman Brule</a>, I recently received an access to real ARM hardware systems, in order to test C code and tune them for performance. It proved a great experience, with lots of learnings.<br />
<br />
It started with the finding that <a href="https://github.com/Cyan4973/xxHash">xxhash </a>speed was rubbish on ARM systems. To investigate, 2 systems were benchmarked : first an ARMv6-J, then an ARMv7-A.<br />
<br />
This was an unwelcome surprise, and among the multiple potential reasons, accessing unaligned data turned out to be the most critical one.<br />
<br />
Since my <a href="http://fastcompression.blogspot.fr/2014/11/portability-woes-endianess-and.html">latest blog entry on this issue</a>, I converted unaligned-access code to the QEMU-promoted solution using `memcpy()`. Compared with the earlier method (`pack` statement), the `memcpy()` version has a big advantage : it's highly portable. It's also supposed to be correctly optimized by the compiler, ending up as a trivial `unaligned load` instruction on CPU architectures which support this feature.<br />
<br />
Well, <i>supposed to</i> is really the right phrase. It turns out <a href="http://stackoverflow.com/a/32095106/646947">this is not true in a number of cases</a>. While direct benchmark tests were initially my main investigation tool, I was pointed towards the <a href="https://gcc.godbolt.org/#">godbolt online assembly generator</a>, which became an invaluable asset to <a href="https://goo.gl/7FWDB8">properly understand what was going on at assembly level</a>.<br />
<br />
Thanks to these new tools, the issue could be summarized as a selection among 3 ways to access unaligned memory :<br />
<br />
1. Using `memcpy()` : the most portable and safe method.<br />
It's also efficient in a large number of situations. For example, on all tested targets, clang translates `memcpy()` into a single `load` instruction when the hardware supports it. gcc is also good on most targets tested (x86, x64, arm64, ppc), with only 32-bit arm standing out.<br />
The issue here is that your mileage will vary depending on the specific compiler / target combination. And it's difficult, if not impossible, to test and check all possible combinations. But at least, `memcpy()` is a good generic fallback, a safe harbour to compare against.<br />
<br />
2. `pack` statement : the problem is that it's a compiler-specific extension. It tends to be present in most compilers, but with multiple different, incompatible semantics. Therefore, it's a pain for portability and maintenance.<br />
<br />
That being said, in a number of cases where `memcpy()` doesn't produce optimal code, `pack` <i>tends </i>to do a better job. So it's possible to special-case these situations, and leave the rest to `memcpy`.<br />
<br />
The most important use case was <u>gcc with <b>ARMv7</b></u>, basically the most important 32-bit ARM version nowadays (included in the current crop of smartphones and tablets).<br />
Here, using `pack` for unaligned memory improved performance <b>from 120 MB/s to 765 MB/s </b>compared to `memcpy()`. That's definitely too large a difference to be missed.<br />
<br />
Unfortunately, on gcc with <i>ARMv6</i>, this solution was still as bad as `memcpy()`.<br />
<br />
3. Direct `u32` access : the only solution I could find for gcc on ARMv6.<br />
This solution is not recommended, as it basically "lies" to the compiler by pretending the data is properly aligned, thus generating a fast `load` instruction. It works when the target cpu is hardware-compatible with unaligned memory access, <i>and </i>the compiler does not risk generating an opcode which is <i>only </i>compatible with strictly-aligned memory accesses.<br />
This is exactly the situation of ARMv6.<br />
Don't use it for ARMv7 though : although this cpu is compatible with unaligned loads, the compiler can also issue a <i>multiple load</i> instruction, which is a <i>strict-align</i>-only opcode. The resulting binary would then crash.<br />
<br />
In this case too, the performance gain is too large to be neglected : on unaligned memory access, read speed went up <b><i>from 75 MB/s to 390 MB/s</i></b> compared to `memcpy()` or `pack`. That's more than 5 times faster.<br />
<br />
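The three access methods above can be sketched in C as follows. The `packed` variant uses the gcc/clang spelling (other compilers use different, incompatible extensions), and the direct cast is undefined behavior in ISO C, shown only for completeness; function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* 1. memcpy : portable and safe ; good compilers lower it to a single load */
static uint32_t read32_memcpy(const void* p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* 2. packed struct : compiler extension (gcc/clang syntax shown here) */
typedef struct { uint32_t v; } __attribute__((packed)) unalign32;
static uint32_t read32_packed(const void* p)
{
    return ((const unalign32*)p)->v;
}

/* 3. direct cast : "lies" about alignment ; undefined behavior in ISO C,
      only tolerable on targets known to accept unaligned loads */
static uint32_t read32_direct(const void* p)
{
    return *(const uint32_t*)p;
}
```

All three return the same value on hardware that tolerates unaligned loads; they differ only in portability and in the code the compiler emits.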
So there you have it, a complex setup, which tries to select the best possible method depending on compiler and target. Current findings can be summarized as below :<br />
<br />
<pre style="background-color: #eeeeee; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace; font-size: 13px; margin-bottom: 1em; max-height: 600px; overflow: auto; padding: 5px;">Better unaligned read method :
------------------------------
| compiler  | x86/x64 | ARMv7  | ARMv6  | ARM64  |  PPC   |
|-----------|---------|--------|--------|--------|--------|
| GCC 4.8   | memcpy  | packed | direct | memcpy | memcpy |
| clang 3.6 | memcpy  | memcpy | memcpy | memcpy |   ?    |
| icc 13    | packed  |  N/A   |  N/A   |  N/A   |  N/A   |</pre>
The good news is that there is a safe default method, which tends to work well in a majority of situations. It then becomes a matter of special-casing the specific combinations that benefit from an alternate method.<br />
<br />
Of course, a better solution would be for all compilers, and gcc specifically, to properly translate `memcpy()` into efficient assembly for all targets. But that's wishful thinking, clearly outside of our control. Even if it does improve some day, we nonetheless need an efficient solution now, for the current crop of compilers.<br />
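The safe default method mentioned above is the `memcpy()`-based read. A minimal sketch of such a portable unaligned access (illustrative, not the exact xxHash code):

```c
#include <string.h>
#include <stdint.h>

/* Portable unaligned 32-bit read.
 * The memcpy() is a hint to the compiler : on targets supporting
 * unaligned access, a good compiler turns it into a single load
 * instruction, with no function call.
 * (Illustrative sketch, not the exact xxHash implementation.) */
static uint32_t XXH_read32(const void* ptr)
{
    uint32_t value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}
```

Since no pointer is ever dereferenced at a misaligned address, this stays correct even on strict-alignment targets; the table above is about which compilers optimize it well.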
<br />
The new unaligned memory access design is currently available within <a href="https://github.com/Cyan4973/xxHash/tree/dev">xxHash source code on github, dev branch</a>.<br />
<br />
<u>Summary of gains on tested platforms :</u><br />
compiled with gcc v4.7.4<br />
<pre class="lang-c prettyprint prettyprinted" style="background-color: #eeeeee; border: 0px; color: #393318; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, sans-serif; font-size: 13px; margin-bottom: 1em; max-height: 600px; overflow: auto; padding: 5px; width: auto; word-wrap: normal;"><code style="border: 0px; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, sans-serif; margin: 0px; padding: 0px; white-space: inherit;"><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">|</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> program </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">|</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> platform</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">|</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> before</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">|</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> after </span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">|</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> </span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">|--------------------|---------|----------|----------|</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">
</span><span class="pun" style="border: 0px; color: black; margin: 0px; padding: 0px;">|</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;"> xxhash32 unaligned | ARMv6 | 75 MB/s | 390 MB/s |
</span><span class="pln" style="border: 0px; color: black; margin: 0px; padding: 0px;">| <span style="white-space: inherit;">xxhash32 unaligned | ARMv7 | 122 MB/s | 765 MB/s |</span>
| lz4 compression | ARMv6 | 13 MB/s | 18 MB/s |
| lz4 compression | ARMv7 | 33 MB/s | 49 MB/s |
</span></code></pre>
<i>[Edit]</i> : apparently, this issue will help<a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366"> improve GCC for the better</a>Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com9tag:blogger.com,1999:blog-834134852788085492.post-86864012362364799442015-07-30T16:35:00.000+02:002018-02-26T10:16:19.725+01:00Huffman revisited - Part 3 - Depth limited tree<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSa3WlE_8pOh5jfWEFkve3gcu21z1tNpHIHaOz6SdTqvpqn2uHbe-MubS2zmnT4Ert_MbMPAb3A3BxtRjXyxbdmBRgvvcwatWNrtMZ6aOmYkp2mggx-RWQF4s5yvVqlgJzzl77vLfOk6Y/s1600/huffSample150.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="96" data-original-width="150" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSa3WlE_8pOh5jfWEFkve3gcu21z1tNpHIHaOz6SdTqvpqn2uHbe-MubS2zmnT4Ert_MbMPAb3A3BxtRjXyxbdmBRgvvcwatWNrtMZ6aOmYkp2mggx-RWQF4s5yvVqlgJzzl77vLfOk6Y/s200/huffSample150.png" width="200" /></a></div>
A non-trivial issue that most real-world Huffman implementations must deal with is tree depth limitation.<br />
<br />
Huffman construction doesn't limit the depth. If it did, it would no longer be "optimal". Granted, the maximum depth of a Huffman tree is bounded by the <a href="https://en.wikipedia.org/wiki/Fibonacci_number" style="color: #4183c4; text-decoration: none;">Fibonacci series</a>, but that still leaves ample room for larger depths than wanted.</div>
<div style="color: #333333; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
<span style="font-family: helvetica neue, helvetica, arial, sans-serif;">Why limit Huffman tree depth ? Fast Huffman decoders use lookup tables. It's possible to use multiple table levels to mitigate the memory cost, but a very fast decoder such as Huff0 goes for a single table, both for simplicity and speed. In that case, the table size grows exponentially with the tree depth (</span><span style="font-family: Courier New, Courier, monospace;">tablesize = 1 << treeDepth</span><span style="font-family: helvetica neue, helvetica, arial, sans-serif;">).</span></div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
For the benefit of speed and memory management, a limit had to be selected : it's 8 KB for the decoding table, which nicely fits into Intel's L1 cache, and leaves some room to combine it with other tables if need be. Since the <a href="http://fastcompression.blogspot.fr/2015/07/huffman-revisited-part-2-decoder.html" style="color: #4183c4; text-decoration: none;">latest decoding table uses 2 bytes per cell</a>, it translates into 4K cells, hence a maximum tree depth of 12 bits.</div>
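The budget arithmetic above can be written down directly. The constant names below are illustrative, not the actual huff0 identifiers:

```c
/* Decoding table budget, using the figures from the post.
 * Note : names are illustrative, not actual huff0 identifiers. */
enum {
    maxTreeDepth = 12,                 /* maximum code length, in bits */
    cellSize     = 2,                  /* bytes per decoding cell      */
    nbCells      = 1 << maxTreeDepth,  /* 4096 cells                   */
    tableBytes   = nbCells * cellSize  /* 8 KB : fits into L1 cache    */
};
```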
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
12 bits for compressing literals is generally too little, at least according to optimal Huffman construction. Creating a depth-limited tree is therefore a practical issue to solve. The question is : how to achieve this objective with minimum impact on compression ratio, and how to do it <i>fast</i> ?</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
Depth-limited Huffman trees have been studied since the 1960's, so there is ample literature available. What's more surprising is how complex the proposed solutions are, and how many decades were necessary to converge towards an optimal solution.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
<div style="-webkit-text-stroke-width: 0px; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 1.3em; margin-bottom: 1.2em; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">
</div>
<i style="line-height: 1.3em;">(Note </i><span style="line-height: 1.3em;">: in the paragraph below, </span><b style="line-height: 1.3em;">n</b><span style="line-height: 1.3em;"> is the alphabet size, and </span><b style="line-height: 1.3em;">D</b><span style="line-height: 1.3em;"> is the maximum tree Depth.)</span><br />
It started with Karp, in 1961 (<em>Minimum-redundancy coding for the discrete noiseless channel</em>), proposing a solution in exponential time. Then Gilbert, in 1971 (<em>Codes based on inaccurate source probabilities</em>), still in exponential time. Hu and Tan, in 1972 (<em>Path length of binary search trees</em>), with a solution in O(n.D.2^D). Finally, a solution in polynomial time was proposed by Garey in 1974 (<em>Optimal binary search trees with restricted maximal depth</em>), but still O(n^2.D) time and using O(n^2.D) space. In 1987, Larmore proposed an improved solution using O(n^(3/2).D.log^(1/2)(n)) time and space (<em>Height restricted optimal binary trees</em>). The breakthrough happened in 1990 (<em>A fast algorithm for optimal length-limited Huffman codes</em>), when Larmore and Hirschberg proposed the <strong>Package_Merge</strong> algorithm, a completely different kind of solution using <em>only</em> O(n.D) time and O(n) space. It became a classic, and was refined a few times over the next decades, with the notable contribution of Mordecai Golin in 2008 (<em>A Dynamic Programming Approach To Length-Limited Huffman Coding</em>).<br />
<br />
<span style="line-height: 1.3em;">Most of these papers are plain difficult to read, and it's usually harder than necessary to develop a working solution just from reading them (at least, I couldn't). Honorable mention for Mordecai Golin, who proposes a relatively straightforward graph-traversal formulation. Alas, it still required too much CPU workload for my taste.</span></div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
In practice, most fast Huffman implementations don't bother with them. Sure, when <em>optimal compression</em> is required, the PackageMerge algorithm is preferred, but in most circumstances, being optimal is not really the point. After all, Huffman coding is already a trade-off between optimality and speed. Following this logic, we don't want to sacrifice everything for an optimal solution; we just need a good enough one, fast and light.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
That's why you'll find some cheap heuristics in many Huffman codes. A simple one : start with a classic Huffman tree, flatten all leaves beyond the maximum depth, then demote enough shallower leaves to maxBits to bring the total (Kraft) sum back to one. It's fast, it's certainly not optimal, but in practice, the difference is small and barely noticeable. Only when the tree depth is very constrained does it make a visible difference (you can read some relevant <a href="http://cbloomrants.blogspot.fr/2010/07/07-02-10-length-limitted-huffman-codes.html?showComment=1278186165626#c7476201953397888895" style="color: #4183c4; text-decoration: none;">great comments from Charles Bloom on his blog</a>).</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
Nonetheless, for huff0, I was willing to find a solution a bit better than the cheap heuristic, closer to optimal. 12 bits is not exactly "very constrained", so the pressure is not high, but it's still constrained enough that the depth-limiting algorithm is going to be necessary in most circumstances. So better have a good one.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
I started by making some simple observations : after completing a Huffman tree, all symbols are sorted in decreasing count order. That means the number of bits required to represent each symbol must follow a non-decreasing order. Consequently, the only thing I need to track is the border decisions (from 5 to 6 bits, from 6 to 7 bits, etc.).</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7cxmyhYOOSzyiwkrwtkmBvgb7CICb96bCVquobi8U0Vgz0a9p85d-iwBD0_oSr0CGWm-sHJcdM3dDKOllAqeHFxRfaVnDGD0cUiFN02uPresECOCTppUw8a6tPRNbHH-ZFeVBK2k-ilI/s1600/HuffmanBitDistrib.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="88" data-original-width="626" height="55" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7cxmyhYOOSzyiwkrwtkmBvgb7CICb96bCVquobi8U0Vgz0a9p85d-iwBD0_oSr0CGWm-sHJcdM3dDKOllAqeHFxRfaVnDGD0cUiFN02uPresECOCTppUw8a6tPRNbHH-ZFeVBK2k-ilI/s400/HuffmanBitDistrib.png" width="400" /></a></div>
</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
So now, the algorithm will concentrate on moving the arrows.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
The first part is the same as the cheap heuristic : flatten everything that needs more than <em>maxBits</em>. This will create a "debt" : a symbol requiring <em>maxBits+1</em> bits creates a debt of 1/2=0.5 when pushed to <em>maxBits</em>. A symbol requiring <em>maxBits+2</em> creates a debt of 3/4=0.75, and so on. What may not be totally obvious is that the sum of these fractional debts is necessarily an integer number. This is a consequence of starting from a solved Huffman tree, and can be proven by simple recurrence : if the Huffman tree's natural depth is <em>maxBits+1</em>, then the number of elements at <em>maxBits+1</em> is necessarily even, otherwise the sum of probabilities can't be equal to one. The debt's sum is therefore necessarily a multiple of 2 * 0.5 = 1, hence an integer number. Rinse and repeat for <em>maxBits+2</em> and further depths.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
So now we have a <em>debt</em> to repay. Each time you demote a symbol from <em>maxBits-1</em> to <em>maxBits</em>, you repay 1 debt. Since the symbols are already sorted in decreasing frequency, it's easy to just grab the smallest <em>maxBits-1</em> symbols, and demote them to <em>maxBits,</em> up to repaying the debt. This is in essence what the cheap heuristic does.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
But one must note that demoting a symbol from <em>maxBits-2</em> to <em>maxBits-1</em> repays not 1 but 2 debts. Demoting from <em>maxBits-3</em> to <em>maxBits-2</em> repays 4 debts. And so on. So now the question becomes : is it preferable to demote a single <em>maxBits-2</em> symbol or two <em>maxBits-1</em> symbols ?</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
The answer to this question is trivial since we deal with integer numbers of bits : just compare the sum of occurrences of the two <em>maxBits-1</em> symbols with the occurrence count of the single <em>maxBits-2</em> one. Whichever is smallest costs fewer bits to demote. Proceed.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
This approach can be scaled. Need to repay 16 debts ? A single symbol at <em>maxBits-5</em> might be enough, or 2 at <em>maxBits-4</em>. By recurrence, each <em>maxBits-4</em> symbol might be better replaced by two <em>maxBits-3</em> ones, and so on. A simple recursive algorithm will surface the best solution.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
Sometimes, it might be better to overshoot : if you have to repay a debt of 7, which formula is better ? 4+2+1, or 8-1 ? (the <em>-1</em> can be achieved by promoting the best <em>maxBits</em> symbol to <em>maxBits-1</em>). In theory, you would have to compare both and select the better one. Doing so leads to an <em>optimal</em> algorithm. In practice though, the positive repayment (4+2+1) is most likely the better one, since the distribution must be severely skewed for the overshoot solution to win.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
The algorithm becomes a bit more complex when some bit ranks are missing. For example, one needs to repay a debt of 2, but there is no symbol left at <em>maxBits-2</em>. In such a case, one can still use <em>maxBits-1</em> symbols, but maybe there are none of these left either. In which case, the only remaining solution is to overshoot (using <em>maxBits-3</em>) and promote enough elements to bring the debt back to zero.</div>
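The flatten-and-repay pass described above can be sketched as follows. This is a minimal illustration of the debt bookkeeping, using only the cheap greedy repayment; the real implementation is <code>HUF_setMaxHeight()</code>, which additionally weighs demotion against promotion choices as explained above. All names are illustrative:

```c
#include <stdint.h>

/* Sketch of the depth-limiting pass. lengths[] holds the code length of
 * each symbol, sorted by decreasing frequency (hence non-decreasing
 * lengths). Illustrative only : the real code is HUF_setMaxHeight(). */
static void limitDepth(uint8_t* lengths, int nbSymbols, int maxBits)
{
    int origMax = 0, n;
    long long debt = 0;

    for (n = 0; n < nbSymbols; n++)
        if (lengths[n] > origMax) origMax = lengths[n];
    if (origMax <= maxBits) return;   /* already within limit */

    /* Step 1 : flatten beyond maxBits, accumulating the debt
     * in units of 2^-origMax to stay in integer arithmetic */
    for (n = 0; n < nbSymbols; n++) {
        if (lengths[n] > maxBits) {
            debt += (1LL << (origMax - maxBits))
                  - (1LL << (origMax - lengths[n]));
            lengths[n] = (uint8_t)maxBits;
        }
    }
    /* the fractional debts sum to a whole number of 2^-maxBits slots
     * (see the recurrence argument above), so this shift is exact */
    debt >>= (origMax - maxBits);

    /* Step 2 : repay by demoting the least frequent demotable symbols.
     * Demoting a symbol by one bit repays 2^(maxBits - 1 - length) debts.
     * A fancier version would compare demotion candidates across ranks,
     * and repair any overshoot by promoting a maxBits symbol back up. */
    while (debt > 0) {
        for (n = nbSymbols - 1; n >= 0; n--) {
            if (lengths[n] < maxBits) {
                debt -= 1LL << (maxBits - 1 - lengths[n]);
                lengths[n]++;
                break;
            }
        }
        if (n < 0) break;   /* no demotable symbol left */
    }
}
```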
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
On average, implementation of this algorithm is pretty fast. Its CPU cost is unnoticeable, compared to the decoding cost itself, and the final compression ratio is barely affected (<0.1%) compared to unconstrained tree depth. So that's mission accomplished.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
The fast variant of the depth limit algorithm is available in open source and can be grabbed at <a href="https://github.com/Cyan4973/FiniteStateEntropy/blob/dev/fse.c#L1737" style="color: #4183c4; text-decoration: none;">github</a>, under the function name <code style="background-color: #f7f7f9; border-radius: 3px; border: 1px solid rgb(225, 225, 232); color: #dd1144; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 2px 4px;">HUF_setMaxHeight()</code>.</div>
Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com2tag:blogger.com,1999:blog-834134852788085492.post-51295023791499569932015-07-29T19:18:00.000+02:002018-02-26T09:57:15.487+01:00Huffman revisited - Part 2 : the Decoder<div style="color: #333333; font-size: 14px; line-height: 1.3em; margin-bottom: 1.2em;">
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSa3WlE_8pOh5jfWEFkve3gcu21z1tNpHIHaOz6SdTqvpqn2uHbe-MubS2zmnT4Ert_MbMPAb3A3BxtRjXyxbdmBRgvvcwatWNrtMZ6aOmYkp2mggx-RWQF4s5yvVqlgJzzl77vLfOk6Y/s1600/huffSample150.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="96" data-original-width="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSa3WlE_8pOh5jfWEFkve3gcu21z1tNpHIHaOz6SdTqvpqn2uHbe-MubS2zmnT4Ert_MbMPAb3A3BxtRjXyxbdmBRgvvcwatWNrtMZ6aOmYkp2mggx-RWQF4s5yvVqlgJzzl77vLfOk6Y/s1600/huffSample150.png" /></a></div>
The first attempt to decompress the Huffman bitStream created by a <a href="http://fastcompression.blogspot.fr/2015/07/huffman-revisited-part-1.html" style="color: #4183c4; text-decoration-line: none;">version of huff0 modified to use FSE bitStream</a> ended in brutal disenchantment. While the decoding operation itself worked fine, the resulting speed was a mere <strong>180 MB/s</strong>.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
OK, in absolute terms, it looks like reasonable speed, but keep in mind this is far off the objective of beating FSE (which decodes at 475 MB/s on the same system), and even worse than the <a href="https://github.com/Cyan4973/FiniteStateEntropy/tree/master/benchmarkResults" style="color: #4183c4; text-decoration: none;">reference zlib huffman</a>. Some generic attempts at improving speed barely changed this, moving up just above 190 MB/s.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
This was a disappointment, and a clear proof that the bitStream alone wasn't enough to explain FSE speed. So what could produce such a large difference ?</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
Let's look at the code. The critical section of FSE decoding loop looks like this :</div>
<pre style="background-color: whitesmoke; border-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.15); font-family: menlo, monaco, "courier new", monospace; font-size: 12.025px; line-height: 18px; margin-bottom: 9px; padding: 8.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><code style="background-color: transparent; border-radius: 3px; border: 0px; color: inherit; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 0px;"> DInfo = table[state];
nbBits = DInfo.nbBits;
symbol = DInfo.symbol;
lowBits = FSE_readBits(bitD, nbBits);
state = DInfo.newState + lowBits;
return symbol;
</code></pre>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
while for Huff0, it would look like this :</div>
<pre style="background-color: whitesmoke; border-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.15); font-family: menlo, monaco, "courier new", monospace; font-size: 12.025px; line-height: 18px; margin-bottom: 9px; padding: 8.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><code style="background-color: transparent; border-radius: 3px; border: 0px; color: inherit; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 0px;"> symbol = tableSymbols[state];
nbBits = tableNbBits[symbol];
lowBits = FSE_readBits(bitD, nbBits);
state = ((state << nbBits) & mask) + lowBits;
return symbol;
</code></pre>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
There are some similarities, but also some visible differences. First, Huff0 creates 2 decoding tables, one to determine the symbol being decoded, the other one to determine how many bits are read. This is a good design for memory space : the larger table is <code style="background-color: #f7f7f9; border-radius: 3px; border: 1px solid rgb(225, 225, 232); color: #dd1144; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 2px 4px;">tableSymbols</code>, as its size primarily depends on <code style="background-color: #f7f7f9; border-radius: 3px; border: 1px solid rgb(225, 225, 232); color: #dd1144; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 2px 4px;">1<<maxNbBits</code>. The second table, <code style="background-color: #f7f7f9; border-radius: 3px; border: 1px solid rgb(225, 225, 232); color: #dd1144; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 2px 4px;">tableNbBits</code>, is much smaller : its size only depends on <code style="background-color: #f7f7f9; border-radius: 3px; border: 1px solid rgb(225, 225, 232); color: #dd1144; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 2px 4px;">nbSymbols</code>. This construction allows using only 1 byte per cell. It favorably compares to 4 bytes per cell for FSE. This memory advantage can be used either as a net space saver, or as a way to boost accuracy, by increasing <code style="background-color: #f7f7f9; border-radius: 3px; border: 1px solid rgb(225, 225, 232); color: #dd1144; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 2px 4px;">maxNbBits</code>.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
The cost for it is that there are 2 interdependent operations : first decode the state to get the symbol, <em>then</em> use the symbol to get <code style="background-color: #f7f7f9; border-radius: 3px; border: 1px solid rgb(225, 225, 232); color: #dd1144; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 2px 4px;">nbBits</code>.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
This interdependence is likely the bottleneck. When trying to design high performance computation loops, there are 3 major rules to keep in mind :</div>
<ul style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 18px; margin: 0px 0px 9px 25px; padding: 0px;">
<li>Ensure hot data is already in the cache.</li>
<li>Avoid badly predictable branches (predictable ones are fine)</li>
<li>For modern OoO (Out of Order) CPU : keep their multiple execution units busy by feeding them with independent (parallelizable) operations.</li>
</ul>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
This list is given in priority order. It makes no sense to try optimizing your code for OoO operations if the CPU has to wait for data from main memory, as the latency cost is much higher than any CPU operation. If your code is full of badly predictable branches, resulting in branch flush penalties, this is also a much larger problem than having some idle execution units. So you can only get to the third set of optimizations after properly solving the previous ones.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
This is exactly the situation Huff0 is in, with a fully branchless bitStream and data tables entirely within L1 cache. So the next performance boost will likely be found in OoO operations.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
In order to avoid the dependency <em>symbol first, then nbBits</em>, let's try a different table design, where nbBits is stored directly alongside symbol, in the state table. This doubles the memory cost, hence reducing the memory advantage enjoyed by Huffman compared to FSE. But let's see where it goes :</div>
<pre style="background-color: whitesmoke; border-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.15); font-family: menlo, monaco, "courier new", monospace; font-size: 12.025px; line-height: 18px; margin-bottom: 9px; padding: 8.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><code style="background-color: transparent; border-radius: 3px; border: 0px; color: inherit; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 0px;"> symbol = table[state].symbol;
nbBits = table[state].nbBits;
lowBits = FSE_readBits(bitD, nbBits);
state = ((state << nbBits) & mask) + lowBits;
return symbol;
</code></pre>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
This simple change alone is enough to boost the speed to <strong>250 MB/s</strong>. Still quite far from the 475 MB/s enjoyed by FSE on the same system, but nonetheless a nice performance boost. More critically, it confirms that the diagnosis was correct : untangling operation dependencies frees up the CPU's OoO execution units, which can then do more work within each cycle.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
So let's ramp up the concept. We have removed one operation dependency. Is there another one ?</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
Yes. When looking at the main decoding loop from a higher perspective, we can see there are 4 decoding operations per loop. But each decoding operation must wait for the previous one to be completed, because in order to know how to read the bitStream for symbol 2, we first need to know how many bits were consumed by symbol 1.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
Compare with how FSE works : since state values are separated from the bitStream, it's possible to decode symbol1 and symbol2, <em>and</em> retrieve their respective nbBits, in any order, without any dependency. Only later operations, retrieving <code style="background-color: #f7f7f9; border-radius: 3px; border: 1px solid rgb(225, 225, 232); color: #dd1144; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 2px 4px;">lowBits</code> from the bitStream to calculate the next state values, introduce some ordering dependency (and even this one can be partially unordered).</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
The main idea is this one : to decode faster, it's necessary to retrieve several symbols in parallel, without dependency. So let's create a compressed data flow which makes such an operation possible.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
Re-using FSE principles "as is" to design a faster Huffman decoding is an obvious choice, but it predictably results in about the same speed. As stated previously, it's not interesting to design a new Huffman encoder/decoder if it just ends up being as fast as FSE. If that is the outcome, then let's simply use FSE instead.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
Fortunately, we already know that compression can be faster. So let's concentrate on the decoding side. Since it seems impossible to decode the next symbol without first decoding the previous one from the same bitStream, let's design multiple bitStreams.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
The new design is a bit more complex. The compression side is affected : in order to create multiple bitStreams, one solution is to scan the input block multiple times. This proved efficient enough that a different design wasn't worth the trouble. On top of that, a jump table is required at the beginning of the block, to let the decoder know where each bitStream starts.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyVZp3t_JCItPyqY_PTummqchaBfQ9qwdigfqcFOcQaisGwGYwjf8UV-LmJUrVAfvVPkg5j_XPAnu2L9XkJO8Ts4OSA6YHwyi7eSOtKBqHYTcvKsyLKUg9CxuhzDLTk7SKRxdaowmPAyg/s1600/huff0BitStreamJumpTable.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="30" data-original-width="673" height="17" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyVZp3t_JCItPyqY_PTummqchaBfQ9qwdigfqcFOcQaisGwGYwjf8UV-LmJUrVAfvVPkg5j_XPAnu2L9XkJO8Ts4OSA6YHwyi7eSOtKBqHYTcvKsyLKUg9CxuhzDLTk7SKRxdaowmPAyg/s400/huff0BitStreamJumpTable.png" width="400" /></a></div>
</div>
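<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
One possible way to encode such a jump table (a hypothetical layout shown for illustration; the actual huff0 format may differ) is three little-endian 16-bit sizes for the first three bitStreams, the fourth stream's size being deduced from the total block size :</div>

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical jump table layout : 3 little-endian 16-bit compressed sizes
   for bitStreams 1-3 ; bitStream 4's size is deduced from the block size. */
typedef struct { size_t start[4]; size_t size[4]; } StreamMap;

static uint16_t readLE16(const uint8_t* p) { return (uint16_t)(p[0] | (p[1] << 8)); }

/* Returns 0 on success, -1 if the header is inconsistent with blockSize. */
static int parseJumpTable(const uint8_t* block, size_t blockSize, StreamMap* map)
{
    if (blockSize < 6) return -1;           /* header too small */
    size_t pos = 6;                          /* streams start right after the table */
    size_t total = 0;
    for (int i = 0; i < 3; i++) {
        map->size[i]  = readLE16(block + 2*i);
        map->start[i] = pos;
        pos   += map->size[i];
        total += map->size[i];
    }
    if (6 + total > blockSize) return -1;   /* corrupted header */
    map->start[3] = pos;
    map->size[3]  = blockSize - 6 - total;  /* last stream : the remainder */
    return 0;
}
```

Deducing the last size from the block size keeps the header down to 6 bytes, which matters on small blocks.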
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
Within each bitStream, it's still necessary to decode the first symbol to read the second. But each bitStream is independent, so it's possible to decode up to 4 symbols in parallel.</div>
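<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
The multi-stream decoding loop can be sketched as follows (a toy illustration with a tiny 3-symbol code, not the library's actual code) : the four decodeOne() calls inside the loop body carry no data dependency on each other, so an out-of-order CPU is free to overlap them.</div>

```c
#include <stdint.h>

/* Toy prefix code : 'a' -> 0 (1 bit), 'b' -> 10, 'c' -> 11 (2 bits) */
typedef struct { char symbol; int nbBits; } DecodeEntry;
static const DecodeEntry table[4] = { {'a',1}, {'a',1}, {'b',2}, {'c',2} };

typedef struct { uint32_t bits; int bitPos; } BitStream;  /* MSB-first packing */

static char decodeOne(BitStream* bs)
{
    uint32_t peek = (bs->bits << bs->bitPos) >> 30;
    bs->bitPos += table[peek].nbBits;
    return table[peek].symbol;
}

/* One symbol per stream per iteration : within an iteration, the four calls
   are independent, so they can proceed in parallel. */
static void decode4Streams(BitStream bs[4], int symbolsPerStream, char* out[4])
{
    for (int i = 0; i < symbolsPerStream; i++) {
        out[0][i] = decodeOne(&bs[0]);
        out[1][i] = decodeOne(&bs[1]);
        out[2][i] = decodeOne(&bs[2]);
        out[3][i] = decodeOne(&bs[3]);
    }
}
```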
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
This proved a design win. The new huff0 decompresses at <strong>600 MB/s</strong> while preserving a compression speed of <strong>500 MB/s</strong>. This compares favorably to FSE and zlib's Huffman, as detailed below :</div>
<table style="border-collapse: collapse; border-spacing: 0px; color: black; font-family: "helvetica neue", helvetica, arial, sans-serif; font-size: 14px; line-height: 18px; margin-bottom: 1em; max-width: 100%; vertical-align: middle;"><thead>
<tr><th align="left" style="background-color: #dddddd; border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">Algorithm</th><th align="right" style="background-color: #dddddd; border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">Compression</th><th align="right" style="background-color: #dddddd; border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">Decompression</th></tr>
</thead><tbody>
<tr><td align="left" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">huff0</td><td align="right" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">500 MB/s</td><td align="right" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">600 MB/s</td></tr>
<tr><td align="left" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">FSE</td><td align="right" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">320 MB/s</td><td align="right" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">475 MB/s</td></tr>
<tr><td align="left" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">zlib-h</td><td align="right" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">250 MB/s</td><td align="right" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">250 MB/s</td></tr>
</tbody></table>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
With that part solved, it was possible to verify that there is no visible difference in compression ratio between FSE and huff0 on literals data. To be more precise, compression is slightly worse, but header size is slightly better (Huffman headers are simpler to describe). On average, both effects compensate.</div>
<div style="line-height: 1.3em; margin-bottom: 1.2em;">
<span style="font-family: helvetica neue, helvetica, arial, sans-serif;">The resulting code is open sourced and currently available at : </span><a href="https://github.com/Cyan4973/FiniteStateEntropy/tree/dev" style="color: #4183c4; font-family: "helvetica neue", helvetica, arial, sans-serif; text-decoration: none;">https://github.com/Cyan4973/FiniteStateEntropy</a><span style="font-family: helvetica neue, helvetica, arial, sans-serif;"> (</span><span style="font-family: Courier New, Courier, monospace;">dev</span><span style="font-family: helvetica neue, helvetica, arial, sans-serif;"> branch)</span></div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
The new API mimics its FSE counterparts, and provides only the higher-level (simpler) prototypes for now :</div>
<pre style="background-color: whitesmoke; border-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.15); font-family: menlo, monaco, "courier new", monospace; font-size: 12.025px; line-height: 18px; margin-bottom: 9px; padding: 8.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><code style="background-color: transparent; border-radius: 3px; border: 0px; color: inherit; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; padding: 0px;">size_t HUF_compress (void* dst, size_t dstSize,
                     const void* src, size_t srcSize);
size_t HUF_decompress(void* dst, size_t maxDstSize,
                      const void* cSrc, size_t cSrcSize);
</code></pre>
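<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
A minimal round-trip sketch of the intended calling convention. To keep the snippet self-contained, the two library calls are replaced here by trivial memcpy-based stubs (a labeled stand-in, not the real functions) ; a real program would link against the library and call HUF_compress / HUF_decompress directly, with the same buffer discipline, and check the returned size for errors.</div>

```c
#include <string.h>
#include <stddef.h>

/* STUBS : stand-ins for the real library calls, so this sketch compiles alone.
   They only mimic the signatures and the "returned size" convention. */
static size_t HUF_compress_stub(void* dst, size_t dstSize,
                                const void* src, size_t srcSize)
{
    if (dstSize < srcSize) return 0;        /* not enough room */
    memcpy(dst, src, srcSize);              /* pretend-compress */
    return srcSize;                         /* compressed size */
}
static size_t HUF_decompress_stub(void* dst, size_t maxDstSize,
                                  const void* cSrc, size_t cSrcSize)
{
    if (maxDstSize < cSrcSize) return 0;
    memcpy(dst, cSrc, cSrcSize);
    return cSrcSize;                        /* regenerated size */
}

/* Compress then decompress a small buffer ; returns 0 if the data survives. */
static int roundTrip(const char* src, size_t srcSize)
{
    char compressed[256], regenerated[256];
    size_t cSize = HUF_compress_stub(compressed, sizeof(compressed), src, srcSize);
    if (cSize == 0) return -1;              /* failed or not compressible */
    size_t dSize = HUF_decompress_stub(regenerated, sizeof(regenerated),
                                       compressed, cSize);
    if (dSize != srcSize) return -1;
    return memcmp(src, regenerated, srcSize) == 0 ? 0 : -1;
}
```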
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
For the time being, both FSE and huff0 are available within the same library, and even within the same file. The reasoning is that they share the same bitStream code. Obviously, many design choices will have the opportunity to be challenged and improved in the near future.</div>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
Having created a new competitor to FSE, it was only logical to check how it would behave within <a href="http://www.zstandard.net/" style="color: #4183c4; text-decoration: none;">Zstandard</a>. It's almost a drop-in replacement for literal compression.</div>
<table style="border-collapse: collapse; border-spacing: 0px; color: black; font-family: "helvetica neue", helvetica, arial, sans-serif; font-size: 14px; line-height: 18px; margin-bottom: 1em; max-width: 100%; vertical-align: middle;"><thead>
<tr><th align="left" style="background-color: #dddddd; border: 1px solid rgb(238, 238, 238); padding: 5px 10px;"><a href="http://www.zstandard.net/" style="color: #4183c4; text-decoration: none;">Zstandard</a></th><th align="right" style="background-color: #dddddd; border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">previous</th><th align="right" style="background-color: #dddddd; border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">Huff0 literals</th></tr>
</thead><tbody>
<tr><td align="left" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">compression speed</td><td align="right" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">200 MB/s</td><td align="right" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">240 MB/s</td></tr>
<tr><td align="left" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">decompression speed</td><td align="right" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">540 MB/s</td><td align="right" style="border: 1px solid rgb(238, 238, 238); padding: 5px 10px;">620 MB/s</td></tr>
</tbody></table>
<div style="font-family: "helvetica neue", helvetica, arial, sans-serif; line-height: 1.3em; margin-bottom: 1.2em;">
A nice speed boost with no impact on compression ratio. Overall, a fairly positive outcome.</div>
</div>
Cyanhttp://www.blogger.com/profile/02905407922640810117noreply@blogger.com8tag:blogger.com,1999:blog-834134852788085492.post-44728345285758995072015-07-28T14:34:00.000+02:002018-02-26T09:44:58.720+01:00Huffman revisited - part 1<h1>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbayrc-InxtxCZaZDbkouRNiRGUYevh1cnVLe0FPc0aXiLY-W01Wzv0dFCXLoFhiLmD50XPiZq6I7OclumgeOmCToDkpEEWYVedrgem1te_A-cfwuzdGfVeQFJZc9NtmBHWiBC9tTeLLA/s1600/huffSample150.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="96" data-original-width="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbayrc-InxtxCZaZDbkouRNiRGUYevh1cnVLe0FPc0aXiLY-W01Wzv0dFCXLoFhiLmD50XPiZq6I7OclumgeOmCToDkpEEWYVedrgem1te_A-cfwuzdGfVeQFJZc9NtmBHWiBC9tTeLLA/s1600/huffSample150.png" /></a></div>
<a href="https://en.wikipedia.org/wiki/Huffman_coding" style="color: #4183c4; text-decoration: none;">Huffman compression</a> is a well known entropic compression technique since the 1950's. It's <em>optimal</em>, in the sense there is no better construction if one accept the limitation of using an integer number of bits per symbol, a constraint that can severely limit its compression capability in presence of <em>high probability</em> symbols.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
Huffman compression is very popular, and quite rightly so, thanks to its simplicity and clarity. (Being patent-free helps too.) For a long time, it remained the entropy compressor of choice due to its excellent speed / efficiency trade-off.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
Today, we can use more powerful entropy coders such as <a href="https://en.wikipedia.org/wiki/Arithmetic_coding" style="color: #4183c4; text-decoration: none;">Arithmetic Coding</a> or the newer ANS-based <a href="http://fastcompression.blogspot.fr/2013/12/finite-state-entropy-new-breed-of.html" style="color: #4183c4; text-decoration: none;">Finite State Entropy</a>, which are able to grab fractional bits, hence ensuring a better compression ratio, closer to the <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)" style="color: #4183c4; text-decoration: none;">Shannon limit</a>.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
The <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)" style="color: #4183c4; text-decoration: none;">Shannon limit</a> should be considered like the speed of light : a hard wall that cannot be crossed. Anytime someone claims the contrary, they are either hiding some costs (such as headers, or the decoder itself), or solving a different problem, entangling modeling and entropy coding. As long as entropy alone is considered, there is simply no way to beat the <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)" style="color: #4183c4; text-decoration: none;">Shannon limit</a>. One can only get closer to it.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
This leads us to a simple question : are there situations where <a href="https://en.wikipedia.org/wiki/Huffman_coding" style="color: #4183c4; text-decoration: none;">Huffman compression</a> is good enough, meaning that it is so close to the <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)" style="color: #4183c4; text-decoration: none;">Shannon limit</a> that there is very little gain remaining, if any ?</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
The answer to this question is <strong>yes</strong>.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
Let's set aside some curious corner cases where symbol frequencies are clean powers of 2. In such cases, <a href="https://en.wikipedia.org/wiki/Huffman_coding" style="color: #4183c4; text-decoration: none;">Huffman compression</a> is actually optimal, but this situation is way too specific to consider.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
Let's therefore imagine a more "natural" situation where all symbol frequencies are randomly scattered along the probability axis, with the sole condition that the sum of all probabilities must be equal to 1.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
A simple observation : the more numerous the symbols, the more likely each symbol's probability is to be small (since their total sum must be equal to 1).</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
This is an important observation. When the probability of a symbol is small, its deviation from the nearest power of 2 is also small. At some point, this deviation becomes negligible.<br />
<i>(Edit : strictly speaking, it's a bit more complex than that. The power of low probability symbols also comes from their combinatorial effects : they help the huffman tree to be more balanced. But that part is more complex to analyze, so just take my word for it.)</i></div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzygdcmzyuUcay4ryC589hQqHjfVvGWiVfZbPz7GzOO5bX0l3JurgjnRiQwtUYFLRVvYcmHbh9U5dxfQ6pBnKgLnIC7CNRzAUL7kP0VCnZ2a0shHA7Ex7hduZkNRR3O9eGHtoAX7QsHsc/s1600/HuffmanDeviation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="477" data-original-width="836" height="365" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzygdcmzyuUcay4ryC589hQqHjfVvGWiVfZbPz7GzOO5bX0l3JurgjnRiQwtUYFLRVvYcmHbh9U5dxfQ6pBnKgLnIC7CNRzAUL7kP0VCnZ2a0shHA7Ex7hduZkNRR3O9eGHtoAX7QsHsc/s640/HuffmanDeviation.png" width="640" /></a></div>
<br /></div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
Therefore, if we are in a situation where no symbol gets a large probability (<10%), <a href="https://en.wikipedia.org/wiki/Huffman_coding" style="color: #4183c4; text-decoration: none;">Huffman compression</a> is likely to provide a "good enough" compression result, meaning close enough to the hard "Shannon limit" that it doesn't matter to get even closer to it.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
In a compression algorithm such as <a href="http://fastcompression.blogspot.com/2015/01/zstd-stronger-compression-algorithm.html" style="color: #4183c4; text-decoration: none;">Zstandard</a>, the literals are symbols which belong to this category. They are basically the "rest" from LZ compression, the bytes which couldn't be identified as part of repeated sequences. They can be any byte value from 0 to 255, which means each symbol gets an average probability of 0.4%. Of course, there are some large differences between the most common and least common ones, especially on text files. But in practice, most probabilities remain small, so the Huffman deviation should be negligible.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
In <a href="http://fastcompression.blogspot.com/2015/01/zstd-stronger-compression-algorithm.html" style="color: #4183c4; text-decoration: none;">Zstandard</a>, all symbols are compressed using <a href="http://fastcompression.blogspot.fr/2013/12/finite-state-entropy-new-breed-of.html" style="color: #4183c4; text-decoration: none;">Finite State Entropy</a>, which is very fast and performs fractional-bit compression. We just argued that, for literals, fractional bits make little difference, so Huffman can be "good enough". So could we use Huffman instead of FSE for such symbols ?</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
This would only make sense if Huffman compression could bring some kind of advantage to the table, for example speed and/or memory usage. Alas, currently known versions of Huffman perform <em>worse</em> than <a href="http://fastcompression.blogspot.fr/2013/12/finite-state-entropy-new-breed-of.html" style="color: #4183c4; text-decoration: none;">Finite State Entropy</a>. The <a href="https://github.com/Cyan4973/FiniteStateEntropy/tree/master/benchmarkResults" style="color: #4183c4; text-decoration: none;">zlib reference version</a>, which is pretty good, maxes out at 250-300 MB/s, nowhere close to FSE results. My own, older version of Huffman, <a href="http://fastcompression.blogspot.fr/p/huff0-range0-entropy-coders.html" style="color: #4183c4; text-decoration: none;">huff0</a>, is not even as good as the zlib one.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
But it's not game over. After all, analysing the FSE algorithm in detail, there is no reason for it to be faster than Huffman, since their complexities are similar. A fast, modern Huffman compressor should reach an equivalent speed, if not a better one on the compression side (due to an additional operation required by FSE to provide fractional bits).</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
Part of the reason why FSE is fast is that it uses some clever bitStream techniques, combining multiple symbols into branchless writes, a trick which is not strictly tied to FSE and can be used in different contexts. So the idea was to re-use the bitStream interface, and combine it with a Huffman compressor.</div>
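<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
The trick can be sketched like this (a simplified writer in the spirit of FSE's bitStream, assuming a little-endian host ; not the library's actual code) : bits from several symbols accumulate in a 64-bit container without any branch, and a single flush writes them out in bulk.</div>

```c
#include <stdint.h>
#include <string.h>

/* Simplified branchless bit-writer : caller must keep nbBits <= 57 between
   flushes, and the output buffer needs 8 bytes of slack past ptr. */
typedef struct { uint64_t container; unsigned nbBits; uint8_t* ptr; } BitWriter;

static void addBits(BitWriter* w, uint32_t value, unsigned nbBits)
{
    w->container |= (uint64_t)value << w->nbBits;  /* no branch, no memory access */
    w->nbBits += nbBits;
}

static void flushBits(BitWriter* w)
{
    unsigned nbBytes = w->nbBits >> 3;
    memcpy(w->ptr, &w->container, sizeof(w->container));  /* one unaligned write */
    w->ptr += nbBytes;                  /* advance by whole bytes only */
    w->container >>= nbBytes * 8;       /* keep the 0-7 leftover bits */
    w->nbBits &= 7;
}
```

Several addBits() calls can be chained back-to-back before one flushBits(), which is what amortizes the cost across multiple symbols.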
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
<strong><em>huff0</em></strong> was refurbished and improved to employ the FSE bitStream. In order to preserve code compatibility, I kept the FSE design of compressing and decompressing in reverse directions, which is not strictly necessary for Huffman. I could verify, though, that it does not make any noticeable difference for Huffman compression, making this feature a non-event as long as it remains hidden within the block API level.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
Moving huff0 to this new bitStream proved extremely easy, and the result was very rewarding. With little effort, I could make it reach <strong>500 MB/s</strong> compression speed, way better than any other Huffman compressor I'm aware of, and, more critically, way better than FSE compression, making it a replacement candidate.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.3em; margin-bottom: 1.2em;">
With such great result at hand, I confidently proceeded to implement huffman decompression based on the same design. I was in for a nasty surprise ...</div>
</h1>