BZip2 zipping to network stream VERY slow

Aug 10, 2011 at 2:50 AM

I am liking the new BZip2 capability - it has reduced some of my files by up to 1 GB. However, I have run into a problem: if I'm zipping to a NetworkStream, it goes much slower than with the deflate method. When I zip to a NetworkStream (which in my case goes to an FTP server) with BZip2, it transfers at around 50 KB/s, as opposed to around 150 KB/s without BZip2. If I zip with BZip2 directly to a FileStream, it goes at normal speed.

For my application I've created a "MultiWriteStream", which takes the same data and writes it to a FileStream and a NetworkStream simultaneously. The way I have it configured, it continues writing to the FileStream if the NetworkStream fails at any point. If I manually "fail" the NetworkStream, the application performs a seek on the stream even though it is not a seekable stream. 1) Is this seeking normal behavior? 2) Could the non-seekability of the stream be causing the delay?
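Roughly, the MultiWriteStream looks like the sketch below. This is heavily simplified and the names are just illustrative, but the important parts are that CanSeek always returns false and that writes keep going to the FileStream after the NetworkStream fails.

// (assumes: using System; using System.IO;)
// Simplified sketch of my MultiWriteStream: every buffer goes to a FileStream
// and a NetworkStream; if the network write fails, the network target is
// dropped and writing continues to the FileStream alone.
public class MultiWriteStream : Stream
{
    private readonly Stream _file;   // FileStream target
    private Stream _network;         // NetworkStream target; null after a failure

    public MultiWriteStream(Stream file, Stream network)
    {
        _file = file;
        _network = network;
    }

    public override bool CanRead  { get { return false; } }
    public override bool CanSeek  { get { return false; } }   // never seekable
    public override bool CanWrite { get { return true; } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        _file.Write(buffer, offset, count);
        if (_network != null)
        {
            try { _network.Write(buffer, offset, count); }
            catch (IOException) { _network = null; }   // "fail" the network target, keep the file
        }
    }

    public override void Flush()
    {
        _file.Flush();
        if (_network != null) _network.Flush();
    }

    // none of these are supported on a write-only, forward-only stream:
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
}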

I also noticed that while the delay is happening it seems to be writing to the stream, but it does not fire any events, so my application becomes unresponsive until it starts "reading" again from the source.

Sorry if I seem to have a lack of understanding of the zipping process; this whole situation is a little confusing.

Thanks for any help

Coordinator
Aug 10, 2011 at 1:04 PM

Hi Brian...

The BZip2 algorithm is known to be slower than the deflate algorithm, independent of this particular pair of implementations.  But I haven't measured or quantified the difference, so I don't know whether your numbers make sense or not. It would be good for me to measure.

I'd love to look at the stack trace for when it seeks. It's possible that the seek is causing the problem, because BZip2OutputStream.Seek() will throw an exception, and that takes time. I'd like to see who or what is seeking. It could be that MultiWriteStream seeks only after you fail the BZip2OutputStream. It could be that the Seek() is a symptom of failure detection within MultiWriteStream and is completely unrelated to BZip2OutputStream or ZipFile. So I think it's best to check for Seek() calls in the simplest possible case, when the app is using BZip2OutputStream alone, without any decorators.

The ZipFile class *does* seek in its output, by design. It checks CanSeek on the stream it writes to, and if seeking is possible it calls Seek() to set metadata, twice for every ZipEntry (seek back and seek forward again). In the ZIP spec there is metadata which normally needs to appear before the compressed data in the output, but is knowable only after the compressed data is produced - the compressed data size, for example. So in normal operation, ZipFile writes out the header leaving zeroes in the "compressed data size" field, compresses (counting the compressed bytes), and then seeks backward to overwrite the zeroes with the actual value. There are 3 fields like this in each entry header. When saving to seekable streams, ZipFile will seek back.
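In sketch form the pattern looks something like the following. This is not the actual ZipFile code, and WriteEntryHeader / WriteCompressedData are made-up helper names; it's just the shape of it.

// Illustration of the "write placeholder, compress, seek back, patch" pattern.
// Not the actual ZipFile code; WriteEntryHeader and WriteCompressedData are
// made-up helpers, and 'output' is the ZipFile's output stream.
long headerPosition = output.Position;
WriteEntryHeader(output, compressedSize: 0);         // zeroes in the not-yet-known fields
long compressedSize = WriteCompressedData(output);   // compress, counting the bytes produced

if (output.CanSeek)
{
    long endPosition = output.Position;
    output.Seek(headerPosition, SeekOrigin.Begin);   // seek back...
    WriteEntryHeader(output, compressedSize);        // ...overwrite the zeroes with the real values
    output.Seek(endPosition, SeekOrigin.Begin);      // ...and seek forward again
}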

This will not result in a Seek() on the BZip2OutputStream, though. Here's why: ZipFile maintains a stream for the output target you specify - a FileStream if you save to a filename, or the actual stream if you save to a NetworkStream, MemoryStream, or some other kind of stream. A zipfile consists of metadata and actual compressed data. The ZipFile class writes the metadata directly to the output stream, and then temporarily wraps a BZip2OutputStream or DeflateStream around the output stream in order to write the compressed data. Only when the compressed data has been written completely does ZipFile (maybe) call Seek on the original output stream, as I described above. So ZipFile should never call Seek on a BZip2OutputStream.
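Here's that layering in sketch form - again illustrative, not the real implementation; GetOutputStream, WriteEntryHeader, and CopyEntryData are made-up names.

// Illustrative layering only, not the actual ZipFile code.
Stream output = GetOutputStream();    // FileStream, NetworkStream, MemoryStream, ...
WriteEntryHeader(output);             // metadata goes straight to the output stream

// a temporary compressing wrapper around the same output stream:
using (var compressor = new Ionic.BZip2.BZip2OutputStream(output))
{
    CopyEntryData(sourceEntry, compressor);   // compressed data flows through the wrapper
}
// (in the real code the inner stream is kept open so ZipFile can keep writing to it)

// any Seek() to patch the entry metadata happens here, on 'output', never on the wrapper
if (output.CanSeek) { /* seek back, overwrite the placeholders, seek forward */ }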

NetworkStream is non-seekable, right?  If ZipFile is calling Seek() on a non-seekable output stream, that's a problem that could lead to perf issues. The intent of the design is to NOT do this, so if you see a Seek, I'd like to examine that stacktrace. Here again the best way to document/diagnose is to simplify as much as possible - use no MultiWriteStream; instead just save to a non-seekable NetworkStream. If ZipFile calls Seek() on that, I need to check/fix it.
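Something like this would be the simplest possible case - no decorators at all, just ZipFile writing BZip2-compressed entries straight to the raw NetworkStream (the file name here is only illustrative):

// Minimal test: BZip2 via ZipFile, saved directly to the non-seekable stream.
// If a Seek() shows up on this path, I want to see that stacktrace.
using (var zip = new Ionic.Zip.ZipFile())
{
    zip.CompressionMethod = Ionic.Zip.CompressionMethod.BZip2;
    zip.AddFile("bigfile.dat");     // illustrative
    zip.Save(networkStream);        // the raw NetworkStream; CanSeek == false
}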

I said ZipFile seeks back in "normal operation", which suggests there is an alternative code path that does not seek back. PKZIP extended the ZIP specification some time ago to allow streaming writers (like ZipFile) to not seek back, and instead to write the 3 items of metadata after the compressed data. This is sometimes known as the "bit 3" feature. Easy, right? You may wonder: if it's possible to produce a zipfile without seeking backward, why not do that all the time? That was my original design, but it turns out that zipfiles produced this way, while compliant with the spec, are not readable by many zip tools. The Mac OS archive reader, for example, cannot read zipfiles with bit 3 set. So ZipFile avoids that behavior unless the output stream is non-seekable.

Also, regarding events and delays - BZip2 works by breaking the input into chunks. The chunk size is settable via the BlockSize property, or the blockSize parameter that can be passed to some of the BZip2OutputStream() constructors. If you are using BZip2 via ZipFile, then the ZipFile class uses the default BlockSize, which is 900k. BZip2 then performs 8 different steps on that block of data, some of which take a loooooong time. So what you see in use, as you write into the BZip2 output stream, is that it accepts calls to Write() as fast as a MemoryStream - because it is simply accumulating 900k of data. After 900k has been written, the Write() call on BZip2 will consume "a significant amount" of time, while the 8 steps are performed and the uncompressed block is transformed and then written out to the captive stream. Memory buffers get cleared, and then finally the Write() returns. Because of this design of the BZip2 stream, if you write in 256k blocks, you may see:

Write(256k) -- 0.01s
Write(256k) -- 0.01s
Write(256k) -- 0.01s
Write(256k) -- 1.2s
Write(256k) -- 0.01s

...etc. As I said, I have not closely measured it, so the times above are only for illustration - that's not actual data. The ZipFile events are fired only after the write succeeds, which is why you may see a stutter-step effect with the events in your application. I'm not sure if that's what you're describing or not. This behavior is independent of the use of Seek() by the ZipFile.
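If you want to observe this directly, you can time individual Write() calls into a bare BZip2OutputStream. A rough sketch, writing random data to a throwaway MemoryStream (the sizes and the loop count are arbitrary):

// (assumes: using System; using System.IO;)
// Most writes just accumulate data in the 900k work buffer; roughly every
// third or fourth 256k write pays the cost of compressing a full block.
byte[] buffer = new byte[256 * 1024];
new Random().NextBytes(buffer);   // incompressible-ish input, just for illustration

var sw = new System.Diagnostics.Stopwatch();
using (var output = new MemoryStream())
using (var bz = new Ionic.BZip2.BZip2OutputStream(output))
{
    for (int i = 0; i < 20; i++)
    {
        sw.Reset();
        sw.Start();
        bz.Write(buffer, 0, buffer.Length);
        Console.WriteLine("Write({0}k) -- {1} ms", buffer.Length / 1024, sw.ElapsedMilliseconds);
    }
}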

Aug 10, 2011 at 3:38 PM

I tested it with a NetworkStream only and it never called Seek. It only calls Seek when I am using the MultiWriteStream and the NetworkStream fails. The CanSeek override always returns false, so I'm not sure why it's trying to Seek at that point.

Coordinator
Aug 11, 2011 at 4:46 AM

So are you saying it appears to be anomalous behavior in MultiWriteStream?

Aug 11, 2011 at 2:23 PM

Yes, it seems that is what's happening. As for the speed issue, it does seem to be related to the non-seekability: I created an override stream that writes to a FileStream but does not allow seek operations, and it was as slow as the NetworkStream. I'm guessing there's no way to correct this behavior, since you said the Seek exception takes time?

Coordinator
Aug 11, 2011 at 2:51 PM

Well I don't know if there is a way to correct the behavior. It sounds to me like a problem in MultiWriteStream.  I suppose you could implement your own multi-write stream and implement your desired failure behavior.  If I were confronting this problem that's what I would consider.  The problem seems to be completely independent of DotNetZip.  Or am I misunderstanding?

 

Aug 11, 2011 at 3:44 PM

Well, the MultiWriteStream is my own implementation. I guess the question is: is there any way to speed up BZip2 compression for a non-seekable stream?

Coordinator
Aug 11, 2011 at 3:53 PM

oh! I see.  So your stream is calling Seek() on a stream that reports CanSeek = false?  That seems like a bug in your code.

As for speeding up bzip for a nonseekable stream.... as I said before, I don't believe BZip2OutputStream calls Seek() on its output, and ZipFile does not call Seek() on BZip2OutputStream, ever. ZipFile calls Seek() on its output stream, but only if CanSeek=true. It should perform reasonably.

Before, when I talked about the performance cost of the Seek(), it was in relation to the exception that gets thrown when Seek() is called on a non-seekable stream. If a stream reports CanSeek=false, and you then call Seek() on the stream, you will get an exception, and THAT is what takes time. I'm not sure if this is what you are seeing or not. But handling exceptions is certainly slow compared to a simple conditional.

What I mean is that a try...catch like this:

try
{ 
   stream.Seek(0, SeekOrigin.Begin);
}
catch (NotSupportedException e)
{
  // seek is not supported.
  // take mitigating action
}

...is much, much slower than a conditional like this:

if (stream.CanSeek)
{ 
   stream.Seek(0, SeekOrigin.Begin);
}
else 
{
  // seek is not supported.
  // take mitigating action
}

I don't know if that is what is affecting you in your MultiWriteStream. 

BZip2 is slow.  It could be that you are experiencing that problem.  And there's no way I know of to speed it up.  You could try the ParallelBZip2OutputStream - that may be faster for you.  To use it set ParallelDeflateThreshold to a non-zero number.
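For example, with ZipFile it would be something like this (the threshold value is just a guess to experiment with):

// BZip2 via ZipFile, with parallel compression enabled for large entries.
using (var zip = new Ionic.Zip.ZipFile())
{
    zip.CompressionMethod = Ionic.Zip.CompressionMethod.BZip2;
    zip.ParallelDeflateThreshold = 2 * 1024 * 1024;   // non-zero; well above the work buffer size
    zip.AddFile("bigfile.dat");                       // illustrative
    zip.Save(outputStream);
}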

 

Aug 11, 2011 at 7:45 PM

Isn't the default value for ParallelDeflateThreshold 512k, though? I'm a little confused by the documentation for this property. It says to set it to zero for "Always use parallel deflate", but you're saying to set it to a non-zero value.

Coordinator
Aug 11, 2011 at 9:05 PM

Because you probably DON'T want to always use parallel deflate. Zero is sort of a dumb value to use there. The size of the BZip2 work buffer is from 100k to 900k, so I'd guess you'd want to use a threshold that's some multiple of the work buffer size. If you use a work buffer that is 900k and your file is 512k, then you will never completely fill a work buffer. After you've written all 512k into the BZip2OutputStream, the work buffer will still be only partially full. When you call Close() on the BZip2OutputStream, it will begin to compress the partially filled buffer. That implies that there will only ever be one thread doing compression, given those parameters. But the goal of parallel compression is to deliver performance benefits by using multiple threads to compress multiple independent buffers in parallel. So, you see, it doesn't make sense to use zero as a threshold value. Parallel compression really makes sense only if you have files larger than the work buffer size. (Actually it's even larger than that, but I won't get into why.) There is some overhead to using a parallel deflation approach - setting up the structures and buffers for multi-threaded work. If you use only one thread but are set up for multiple threads, you're being inefficient.

But all of this is implementation specific - you don't need to know any of it in order to observe and measure the performance of your application. Try it and see. Try it with multiple different values of the threshold; that way you can find the value that gives you the best performance.
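A rough way to do that comparison (the candidate values and file names are only examples):

// (assumes: using System; using Ionic.Zip;)
// Save the same content with several threshold values and compare elapsed time.
long[] thresholds = { 512 * 1024, 1024 * 1024, 2 * 1024 * 1024, 4 * 1024 * 1024 };
foreach (long t in thresholds)
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    using (var zip = new ZipFile())
    {
        zip.CompressionMethod = CompressionMethod.BZip2;
        zip.ParallelDeflateThreshold = t;
        zip.AddFile("bigfile.dat");            // illustrative
        zip.Save("test-" + t + ".zip");
    }
    Console.WriteLine("threshold {0}: {1} ms", t, sw.ElapsedMilliseconds);
}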

 

Aug 11, 2011 at 9:47 PM

OK, thank you for the suggestions. I'll do some testing with the different values and let you know if I have any more issues with it.