RealTime Data Compression: Advanced Multithreading algorithm (Huff0 v0.9)

Not being entirely satisfied by latest results from Huff0 v0.8, i've been investigating the threading performance issue mentioned in the last post. It seems that the suspect was guilty : creating threads is a really costly operation.

This was not too sensible with LZ4, thanks to its naturally large chunk size, but this is not an all-around solution. Large chunks also means granularity is impacted. For example, if a test file is divided into 3 approximately equivalent chunks, on double core systems, one core will get twice more work to do, and will become the bottleneck of the system.

In order to solve such issue, jobs must be divided into smaller tasks, so that the potentially less efficient "tail job" remain small in comparison with the rest of the job, when all cores are working full steam.

In order to work on small thread tasks, the right thing to do : keep your threads alive, and feed them regularly.

There are obviously several ways to achieve this, including "do it yourself" ones. Right now, i've been experimenting 2 well documented ones for Windows : Thread Pools, and I/O completion port.

On first look, Thread Pools seems the obvious way to go. This windows API was meant for such task : just pile up jobs, and they will get dealt with as fast as system can complete them, dispatching them over available cores.
There is just a little problem, this API is relatively recent, and unavailable to Windows XP and prior systems. Quite a deal breaker.
Or is it ? Its little cousin, QueueUserWorkItem() was created in 2000, and does exactly the same.
And its usage is damn simple. The structure is the same as CreateThread(). Just use it as a kind of "fire and forget" job queue.
It's extremely efficient. Using this API, I was able to decrease chunk size by a factor 100 with no performance degradation. Quite an achievement.

There are just 2 little problems :

1) There is no API to check when all jobs are completed. Quite a dumb (or faulty?) design. This was apparently later corrected for other languages (C#, C++, even VB), but never for C. In order to know when all segments are compressed, it is necessary to create one's own method. And things are becoming quite nasty at this stage. Since this problem is pretty frequent, many methods can be found on searching Internet, several being questionably stable hackish-styled.
The most interesting one, in my opinion, is Interlocked counter, in conjunction with event wait. Well, now the whole system is a lot more complex than it was supposed to be.

2) You get no control on system resource. No way to tell for example "use just 2 cores please". This simple API uses whatever CPU resource is available.

OK, this starts to make a lot of issues. So i investigated the second method, I/O completion port.
It sounds scary, and ... it looks scary too. On using this old API's method (available since Windows NT4!), it's obvious it was never meant to become a general-purpose job scheduler. But that's one thing it does, and very well.
Initialization is significantly more complex than Thread Pools, but once this is correctly understood, you get all the advantage without the problems. Speed is excellent, number of concurrent thread is finely tuned, and job completion check is much more natural and secure.

So here it is. This new version of Huff0 (v0.9) is using I/O completion port to divide data into much smaller chunks, resulting in more stable and faster results. A little scanning improvement was also added, for the benefit of faster compression.

Detailed performance assessment :

		Huff0 v0.8	-t2		Huff0 v0.9	-t2
	Ratio	Compress	Decoding	Ratio	Compress	Decoding
Not compressible
enwik8.7z	1.000	1.42 GB/s	1.91 GB/s	1.000	1.50 GB/s	1.94 GB/s
Hardly compressible
win98-lz4hc-lit	1.024	850 MB/s	995 MB/s	1.024	905 MB/s	995 MB/s
audio1	1.070	540 MB/s	525 MB/s	1.070	560 MB/s	530 MB/s
Distributed
enwik8-lz4hc-lit	1.290	395 MB/s	375 MB/s	1.290	400 MB/s	380 MB/s
Lightly Ordered
enwik8-lz4hc-offh	1.131	375 MB/s	350 MB/s	1.131	380 MB/s	360 MB/s
Ordered
enwik8-lz4hc-ml	2.309	375 MB/s	380 MB/s	2.309	375 MB/s	385 MB/s
Squeezed
office-lz4hc-run	3.152	330 MB/s	310 MB/s	3.152	420 MB/s	400 MB/s
enwik8-lz4hc-run	4.959	475 MB/s	430 MB/s	4.959	480 MB/s	430 MB/s
Ultra compressible
loong	278	1.49 GB/s	3.22 GB/s	278	1.98 GB/s	3.50 GB/s

The interesting bit is here : this new threading method is generic, and can be applied to any other program. So this is good news for further improvements.

RealTime Data Compression

Saturday, March 12, 2011

Advanced Multithreading algorithm (Huff0 v0.9)

No comments:

Post a Comment