Monday, December 19, 2011

LZ4 into Hadoop-MapReduce

 After a very quick evaluation, LZ4 has recently been integrated into the Apache Hadoop - MapReduce project.

This is important news, since, in my humble opinion, Hadoop is among the most advanced and ambitious projects to date (an opinion shared by some). It also serves as an excellent illustration of LZ4 usage, as an in-memory compression algorithm for big server applications.

But first, a few words on Hadoop.
By 2005, Google shook the IT world by presenting BigTable, its home-grown distributed database with eventual consistency, able to store virtually the entire web and query it. BigTable was built on top of Google FS, a virtual file system covering the entire planet: tens of thousands of computers distributed across hundreds of datacenters all over the world, presented as if they were a single massive one. This nearly limitless amount of stored data could then be processed in parallel, typically for query preparation, thanks to the MapReduce framework, which makes it possible to process petabytes of data in a short amount of time (if you can afford the number of servers necessary for that).

At that very moment, Google stole the crown of programming champion from Microsoft. It was now clear that they were setting the future. Although most of these technologies had already been studied, it was the first time they were executed together, and so well, at such a huge scale for commercially available products. This gave Google literally years of lead over the competition, since most of its Web products were built on these capabilities.

Since then, all the other "big names" of IT (namely Yahoo, Facebook, Amazon, IBM, Microsoft, Apple, etc.) have sought to duplicate this architecture. The result of all these efforts finally converged into the open-source project Hadoop.
Hadoop now has most of the capabilities presented in 2005 by Google, including a distributed file storage system (HDFS), a distributed database (HBase), and the same distributed-computing framework as Google, MapReduce.

So, what place does that leave for LZ4?
Well, in such an architecture, compression is used as a performance enabler.

As can be guessed, massive amounts of data travel between the different nodes. Moreover, each node is also processing a fair amount of data, more or less permanently.
In such situations, compression offers some advantages: less data to transfer means less time and cost to send and receive it. It also means that more data can be held in RAM, and that more data can remain in the CPU cache. All this translates into better system speed.

Or does it? For this claim to hold, the compression algorithm must be "unobtrusive", which means it should consume very few CPU cycles. Otherwise, the cost of compression can cancel or even reverse the speed advantage. This basically means that only fast compressors qualify for the job.
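
To put rough numbers on it, here is a back-of-the-envelope sketch in Java; every figure in it is an assumption chosen for illustration, not a measurement.

public class CompressionBreakEven {
    public static void main(String[] args) {
        // Assumed figures, for illustration only.
        double linkMBps = 100.0;      // network throughput, in MB/s
        double ratio = 2.0;           // compression ratio (original size / compressed size)
        double compressMBps = 400.0;  // compressor speed on original data, in MB/s

        // Sending raw data: limited by the link alone.
        double rawMBps = linkMBps;

        // Sending compressed data, with compression and transfer done serially:
        // pay the compression cost, but push 'ratio' times more original data
        // through the same link.
        double compressedMBps = 1.0 / (1.0 / compressMBps + 1.0 / (ratio * linkMBps));

        System.out.printf("raw: %.0f MB/s, with compression: %.0f MB/s%n",
                          rawMBps, compressedMBps);
        // With these assumptions: 100 MB/s raw versus ~133 MB/s with compression.
        // A compressor 10x slower (40 MB/s) would drop the compressed path to
        // ~33 MB/s, i.e. slower than sending the data uncompressed.
    }
}

The exact break-even point depends on the real link speed, the compression ratio, and how much compression overlaps with transfer; the point is simply that the compressor's own speed decides whether compression helps or hurts.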

In the beginning, LZO was such a champion. It offered great speed, but with some important usage limitations due to its GPL license.
Then, in early 2011, Google released Snappy (formerly known internally as Zippy), the very same algorithm used by Google in its own BigTable implementation. It quickly became a great alternative, thanks to its more permissive license and better performance.

LZ4 was also released this year, just after Snappy. Google's fame meant there was little attention left for competing algorithms, but it also raised awareness that fast compression algorithms have a role to play in IT architectures. LZ4 gradually improved over time, to the point that it now offers better performance than Google's creation. One of Hadoop's contributors, Binglin Chang, made the effort to implement LZ4 as a JNI patch and to compare it directly to Snappy. LZ4 performance was found to be better than Snappy's, even on Snappy's own set of calibration tests.
In a relatively quick decision process, the LZ4 patch was then integrated into the main Hadoop - MapReduce source trunk.

/* Update : Google's Snappy developer kindly reminds us that benchmark figures depend on the tested configuration, and states that on their own test platform, Snappy keeps an edge with regard to compression speed. See comment :  http://fastcompression.blogspot.com/2011/12/lz4-into-hadoop-mapreduce.html?showComment=1326071743955#c7649536680897239608 */
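
For readers who want to try it in their own jobs, here is a minimal sketch of how the codec can be enabled, using the old-style mapred API. Exact class and method names may vary with the Hadoop version, so treat it as an illustration rather than a recipe.

import org.apache.hadoop.io.compress.Lz4Codec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class Lz4JobExample {
    // Returns a job configuration with LZ4 enabled.
    public static JobConf withLz4(JobConf job) {
        // Compress the intermediate (map output) data, i.e. the shuffle traffic.
        job.setCompressMapOutput(true);
        job.setMapOutputCompressorClass(Lz4Codec.class);

        // Optionally, compress the final job output as well.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, Lz4Codec.class);
        return job;
    }
}

Compressing the intermediate data is usually where a fast codec pays off most, since that data is written, shuffled across the network, and read back within the same job.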

The advantage of using fast compression algorithms, as Hadoop does, can be replicated in many other server-side applications, for example databases. Recently, column-oriented databases have been attracting attention, since they make heavy use of compression to achieve impressive performance gains. The idea remains the same: compress data to keep more of it in RAM and in the CPU cache, which directly translates into better performance.

14 comments:

  1. You're too humble, this is very very big news indeed. Congratulations!

    Now I REALLY need to write my 16-bit x86 asm decompressor :-)

  2. Glad to see this. LZ4 is really a lot better than snappy, which is way over-complex for what it is.

  3. That's really great news! Congratulations!

    There's one big BUT! Snappy is, imho, not over-complex at all! It's bigger since it has comments and a TEST SUITE!!!

    A test suite is really needed! Is there one for LZ4? Hadoop usage will push LZ4 into every corner. Does every single release (commit) of LZ4 _always_ produce and read backward-compatible bit-streams? Are there _no_ edge cases that fail? Are there no performance regressions?

    Upgrading a Hadoop cluster and losing a couple of computer-years of computation is not an option. Compression has to be 100% reliable.

    BTW: The numbers look very impressive given that Snappy and LZ4 are more or less the same algorithm... I'm going to give it a try ;-)

  4. Thanks for the kind words.

    @Jenda : The LZ4 format has been rock solid and has not changed since its introduction. Therefore, any LZ4 stream produced by any version is decodable by current and future LZ4 versions. This is a prime requirement for all versions of LZ4, present and future.

    The test suite is well over 10GB, and therefore unsuitable for download. Each and every release of LZ4 is tested against it, on Windows and Linux, on 32-bit and 64-bit systems, with GCC and Visual compilers. All known edge cases are present in the test suite, which, as a result, tends to grow over time. Not a single regression is authorised when increasing the version number.

    You are very welcome to contribute to the project.

    Regards

  5. Hi!

    First of all, congratulations on getting LZ4 into Hadoop. It's certainly a nice format, and the implementation has been making good progress.

    However, I have to express a slight reservation about the claim that LZ4 is faster than Snappy on its own test suite. It certainly is in some tests, but in general it depends on the machine. Snappy is, as the README says, primarily optimized for 64-bit machines, but the Hadoop tests seem to have been run on 32-bit. (For instance, the HTML test compresses over 77% faster on my local machine than in the benchmark results in that bug report, and even though the Core i7 is a fast CPU, I doubt the clock frequency is what does it :-) )

    On 64-bit, I generally still find that Snappy is faster than LZ4 (comparing Snappy r57 against LZ4 r46 on compress+decompress cycles), although the difference is pretty small now. The general trend seems to be that LZ4 compresses a bit denser but also somewhat slower, and then wins back most but not all of that in decompression speed, probably helped partly by the fact that it is generally faster to decompress less data. (LZ4HC decompression is really fast!) On 32-bit, LZ4 has an edge.

    Of course, both libraries will continue to improve at some rate, and eventually LZ4 might really be unconditionally faster than Snappy; it's always hard to know where the true upper limit lies, and changing hardware tends to make previous assumptions invalid. (I won't see that as a Snappy failure when it happens; rather a testament to the beauty of open source.) But until that point, please remember that benchmarking is hard, and making broad statements can lead to inaccuracies. :-)

    /* Steinar */
    - responsible for the Snappy open-sourcing process, and many of the original optimizations

  6. Hi Steinar

    Good comment. Indeed, benchmark numbers depend on hardware configuration.

    The simplified assumption that faster hardware leads to faster numbers while keeping the ranking intact is incorrect. Indeed, some implementations may be better optimized for certain hardware, and as a consequence, the ranking can change.

    Therefore, it is fair to say that the favorable LZ4 vs Snappy comparison was produced on the Hadoop developer's specific configuration. I will update the article to highlight your comment.

    As a side note, I also agree with your statement on 32-bit/64-bit configurations. Initially, LZ4 was developed on 32-bit systems, hence the better optimization for that platform.
    Hopefully the situation will improve in the next few releases, as more time will be spent on 64-bit performance.

    Best Regards

  7. Hi,

    Thanks for your response. Note that Snappy was also initially developed on 32-bit (it's much older than the open-source release), so the situation is more or less the same here.

    By the way, are the two directly interchangeable? As I read the LZ4 source, the decompressor is not safe from corrupted/malicious input; the comments say that it is “immune to buffer overflows” since it doesn't write anything out-of-bounds, but if it can _read_ out-of-bounds, corrupted input can still crash the program, or leak potentially sensitive parts of memory. I guess this is something one should consider when making apples-to-apples comparisons. I haven't looked into this in detail, though, it just struck me as a bit odd.

    It shouldn't be hard to add such checks to LZ4, though -- Snappy has them, and they do not seem to cost a lot of performance, since the branches are easy for the CPU to predict and you only need one check per copy.
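
    (For illustration only, such a check in a generic LZ77-style decoder's match copy could look like the sketch below, written in Java for readability; this is not Snappy's or LZ4's actual code.)

    class BoundedCopy {
        // Copy a back-reference of 'length' bytes from 'offset' bytes behind
        // the current output position, rejecting corrupted references up front.
        static void copyMatch(byte[] dst, int dstPos, int offset, int length) {
            // One check per copy: the reference must stay inside the bytes
            // already produced, and the copy must not run past the output end.
            if (offset <= 0 || offset > dstPos || length < 0 || length > dst.length - dstPos) {
                throw new IllegalArgumentException("corrupted compressed input");
            }
            for (int i = 0; i < length; i++) {   // byte-by-byte copy handles overlapping matches
                dst[dstPos + i] = dst[dstPos + i - offset];
            }
        }
    }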

    Thanks!

    /* Steinar */

  8. Good point.

    Indeed, this notice has been mentioned in the source since the early days of LZ4, and it typically does not raise any eyebrows.
    I guess it may be because most real-world applications use LZ4 in a closed environment, e.g. with trusted sources producing the LZ4 streams. Malicious inputs are then non-existent.

    Nonetheless, this is a good occasion to tackle this issue. It is now corrected in release 47.

    Best Regards

  9. Hi Steinar,

    The test results were posted by me, and I am pretty sure they are 64-bit test results. After noticing this blog and reading your comments, I think it may be a compiler or environment problem which leads to Snappy's bad performance.

    Here is my test environment:
    MacBook Pro Mac OS X 10.6.8
    Processor Name: Intel Core i5
    Processor Speed: 2.3 GHz
    Number of Processors: 1
    Total Number of Cores: 2
    L2 Cache (per Core): 256 KB
    L3 Cache: 3 MB
    Memory: 4 GB
    gcc version 4.2.1 (Apple Inc. build 5659)

    I noticed that there are lots of assert() calls in the Snappy source code. If I use the default build options:
    ./configure
    sudo make install
    here are the test results:
    12/01/15 14:31:43 INFO Block size: 64K
    12/01/15 14:31:53 INFO Compress: 219M/s Decompress: 397M/s( 880M/s) ratio: 45.2% - Total

    But if I use:
    ./configure CPPFLAGS=-DNDEBUG
    sudo make install
    result:
    12/01/15 14:30:44 INFO Compress: 363M/s Decompress: 401M/s( 887M/s) ratio: 45.2% - Total
    This is very close to LZ4 in compression speed.

    I don't know whether -DNDEBUG is in the default CPPFLAGS in other environments, but on my computer it is not, and this leads to inaccurate results. Perhaps there are other compile options I need to take care of; any suggestions?

    Many people just use ./configure && make install and don't add NDEBUG;
    perhaps it should be set in the Makefile by default?

    I posted some detailed test results along with source code at https://github.com/decster/jnicompressions

  10. Hi Chang

    As a small side comment, if you are using a 64-bit configuration, please note that the latest LZ4 version has improved 64-bit performance, and r50 is a recommended upgrade.

    Best Regards

  11. Hi, Yann
    It seems LZ4 releases from r47 onward all crash with my test case, so I just use r46. It seems that the bug is in LZ4_uncompress: it writes some extra bytes beyond dest+osize, and because those extra bytes fall in another page, it causes a segmentation fault.
    If I add an extra 8 bytes to the dest buffer, the test passes.
    Maybe LZ4_WILDCOPY is not safe at buffer boundaries, just a guess.

  12. Thanks for the notification Chang. I'll send you a candidate correction.

  13. Just a little addition to this post: Hadoop and related M/R jobs can now leverage the power of LZ4 even more.
    By making use of https://github.com/carlomedas/4mc it is now possible to use and tune LZ4 compression at any stage of your jobs, and also to choose the compression level as a natural trade-off against the required speed.

  14. Great tutorial mentioned on this blog about the subject, and it was like attending hadoop online training. Thanks for the step by step instructions.
