OutOfMemoryException on extract

Jun 11, 2009 at 12:26 PM

Hi,

I read the thread 28768 discussion (Out Of Memory with Large set of large files), which seemed to indicate that this had been fixed, but it doesn't seem so for me...

With v1.7, extracting a 9,485,922-byte file (about 1MB compressed) from within a zip containing multiple other files, to a stream (MemoryStream), on a Win2K3 server with 2GB of RAM, I still got this error.  This is well below any old-school 2GB-per-entry zip limit.

I looked at the 1.7.2.21 source; it looks like my extract-to-stream call uses
private Int32 _ExtractOne(System.IO.Stream output)
which always extracts the whole zip entry to the stream before returning.  If so, then this is always going to be a potential memory problem, isn't it?

However, there is a better (more memory-efficient) way / workaround, which I missed to start with, and which may benefit others encountering this error.

Basically, calling the exposed ZipEntry.Extract(stream) is always a bad / memory-inefficient way of doing things, whether the stream is memory or file (and potentially slower too, if you're just after some data near the start of the zip entry sub-file).  I don't know if this is documented anywhere else - I didn't come across it.

Instead, use something like the code below, which decompresses the zip entry in chunks (you could of course use m_DatStrm.Read(byte[], ...) instead), avoiding using up a swedge of memory (IDisposable/using left out for clarity):

 

m_zip = new ZipFile(sZipName);
m_zip.Password = null;                        // the default anyway?
m_zip.Encryption = EncryptionAlgorithm.None;  // the default anyway?

m_ZipEntry = m_zip["MySubFile.txt"];
if (m_ZipEntry == null) throw new FileNotFoundException("MySubFile.txt");

m_ZipStrm = m_ZipEntry.OpenReader();  // access to the underlying ArchiveStream; includes a seek to the desired entry
m_DatStrm = new StreamReader(m_ZipStrm);

string sLine;
while ((sLine = m_DatStrm.ReadLine()) != null)
   DoSomething(sLine);

Cheers,
Alan

Coordinator
Jun 11, 2009 at 11:39 PM
Edited Jun 11, 2009 at 11:44 PM

When you Extract a ZipEntry to a stream, ....I know this is going to sound obvious.... yes, it will extract the zip entry to the stream.  This is what you asked it to do, and that is just what you get.  The problem is, your stream is a MemoryStream, which (a) retains all of its contents in memory, and (b) re-allocates itself and copies large chunks of bytes as it grows.

I don't know the default starting size of a MemoryStream, but it's probably less than 1mb.  As the extraction of the entry proceeds, the internal buffer in the MemoryStream will fill, and the MemoryStream will allocate a new contiguous buffer, copy the existing data into that new buffer, and then continue accepting writes.  Again, I don't know the implementation, but let's suppose it doubles each time it needs to resize: 1mb, then 2mb, then ... 32mb, then 64mb, then ... 512mb, 1gb, 2gb, and so on.  When this expansion occurs, the MemoryStream needs to have 2 buffers available at once - e.g., when the 512mb buffer is too small, it first allocates a 1gb buffer before copying, and then de-allocates the 512mb buffer.  If your zip entry (~1mb compressed) expands significantly, then it is not surprising at all that you get an out-of-memory condition.
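You can observe this growth directly.  Here's a small stand-alone sketch, using only the BCL; the exact growth policy is an implementation detail of MemoryStream, so treat the specific numbers as illustrative:

```csharp
using System;
using System.IO;

class MemoryStreamGrowth
{
    // Writes totalBytes into a default-sized MemoryStream in 64 KB chunks,
    // counting how many times the internal buffer had to be re-allocated
    // (each re-allocation also copies all data written so far).
    public static int CountReallocations(int totalBytes)
    {
        var ms = new MemoryStream();          // default capacity is 0
        var chunk = new byte[64 * 1024];
        int reallocations = 0;
        long lastCapacity = ms.Capacity;
        while (ms.Length < totalBytes)
        {
            ms.Write(chunk, 0, chunk.Length);
            if (ms.Capacity != lastCapacity)  // the buffer grew: allocate + copy
            {
                reallocations++;
                lastCapacity = ms.Capacity;
            }
        }
        return reallocations;
    }

    static void Main()
    {
        // Capacity roughly doubles at each growth step, so 10 MB of writes
        // triggers a whole series of allocate-and-copy cycles.
        Console.WriteLine("re-allocations for 10 MB: "
            + CountReallocations(10 * 1024 * 1024));

        // Pre-sizing the stream avoids the cycle entirely:
        var sized = new MemoryStream(10 * 1024 * 1024);
        Console.WriteLine("pre-sized capacity: " + sized.Capacity);
    }
}
```

If you know (or can bound) the uncompressed size of the entry in advance - ZipEntry exposes it - constructing the MemoryStream with that capacity sidesteps both the repeated copies and the two-buffers-alive-at-once spike.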

I disagree with your statement that calling the exposed ZipEntry.Extract(stream) is always bad or inefficient.  If you pass in a FileStream, the ZipEntry is extracted to the FileStream, which writes the extracted bytes to a disk file.  This is not "bad" or memory inefficient at all.   I think your understanding is off. 

When you say "the method extracts the entire entry to the stream before returning" - that is correct.  But "extract to the stream" does not imply "store in memory".  It means: call stream.Write() with the extracted data.  If the stream implementation keeps all the data in memory, as a MemoryStream will, then the data will accumulate in memory.  If the stream writes the data to a file, as a FileStream does, then the bytes get written to disk, and there is no accumulation of data in memory.  Regardless of whether the data went to memory, or to a file, or to Aunt Martha's house, the Extract() call returns when it completes.  As advertised.
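To make that concrete, here is a minimal sketch of a custom write-only Stream that accepts every Write() and retains nothing.  (This class is purely illustrative - it is not part of DotNetZip - but passing something like it to Extract(stream) would consume the whole entry with no memory accumulation at all.)

```csharp
using System;
using System.IO;

// A write-only Stream that counts bytes but keeps none of them,
// showing that "extract to a stream" just means repeated calls to
// stream.Write() -- whether data accumulates is up to the Stream.
class CountingStream : Stream
{
    public long BytesWritten { get; private set; }

    public override bool CanRead  => false;
    public override bool CanSeek  => false;
    public override bool CanWrite => true;
    public override long Length   => BytesWritten;
    public override long Position
    {
        get => BytesWritten;
        set => throw new NotSupportedException();
    }

    public override void Flush() { }
    public override int  Read(byte[] b, int o, int c) => throw new NotSupportedException();
    public override long Seek(long o, SeekOrigin so)  => throw new NotSupportedException();
    public override void SetLength(long v)            => throw new NotSupportedException();

    // The only interesting method: tally the bytes, then discard them.
    public override void Write(byte[] buffer, int offset, int count)
        => BytesWritten += count;
}
```

A FileStream behaves analogously from a memory standpoint: each Write() is pushed to disk rather than held, which is why extracting a large entry to a FileStream does not exhaust memory the way a MemoryStream can.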

What do you really want to do? If you want to read the data from a decompressed entry in chunks, why not use OpenReader(), so you can pull data from the stream at your own rate?  It's your option.

Jun 15, 2009 at 6:12 PM

Hi Cheeso,

My intent was simply to provide a page for people searching for something like "Extract Stream OutOfMemoryException DotNetZip Ionic.Zip" to find, and to give them a solution...

If I have issue with the library/documentation (minor), then it is:

1) There are Extract methods specifically for unzipping to a file.  So, as expected, ZipEntry.Extract(stream) is documented as "For example, the caller could specify Console.Out, or a MemoryStream.", but as we agree, passing a MemoryStream here is likely to be a bad idea, as you will unzip the entire ZipEntry to memory before you can touch any of it, which is inherently dangerous (OutOfMemory, likely exacerbated by fragmentation, as you said). So at best, the documentation is misleading.

2) All Extract methods are prefixed with "Extract", yet if you want to extract to memory, you should actually use OpenReader(), as we both suggest.  I'd rather see something like an ExtractReader() (i.e., "Extract"-prefixed for auto-complete intuitiveness).

By the way #1: I would have thought that most uses of a .NET zip library, when extracting, would be rather unlikely to want to unzip to anything but memory - if you want to unzip to a file, it's probably more efficient to spawn/exec a natively compiled zlib/pkzip/unzip exe or its ilk :-)  Personally I use my own C++ wrapper of zlib in my C++ apps that access zip archives, and now your (I assume you're Dino Chiesa?) rather nice C# wrapper for my user-interface C# apps.  As a matter of curiosity, how do the timing metrics compare for large zips, C# c/w native zlib?  All that managed memory, immutable strings, Unicode, array bounds checking and System.Text.Encoding has got to hurt, surely?  Curious though.

By the way #2: As a total aside, the multiple ZipEntry.ExtractWithPassword() overloads seem a little odd, when a password belongs to (is a property of) an entire ZipFile, rather than an individual ZipEntry.  There aren't any OpenReader "WithPassword" suffixes, after all ;-)

Thanks anyway, one way or another, hopefully there won't be many others spending a few hours fighting an OutOfMemory extract error :-)

Alan

Coordinator
Jun 15, 2009 at 7:54 PM
Edited Jun 16, 2009 at 1:27 PM

Hey Alan, thanks for the feedback.

On your item 1 - in some cases it is very much OK to pass a MemoryStream when extracting. For example, suppose the entry is an image, and the application in question wants to just display the image. There's no need to extract it to a file, or anything durable. In fact the MemoryStream is the best destination in this case.  It is highly unlikely that extracting a small .jpg image into a MemoryStream would result in an OutOfMemory error.  There are other similar mainstream scenarios where MemoryStream makes sense.  I'd say you have to be careful when extracting to a MemoryStream, but no more so than in dealing with any other situation where large blocks of data may be implicitly allocated.

In other words, this pitfall is not specific to using DotNetZip or ZIP files with MemoryStream. It is common to any use of the MemoryStream class, or any case where you allocate large blocks of memory. Ya gotta take care.  A similar situation exists with the File.ReadAllLines() method in the BCL. If you call it on a 3gb text file, your app will try to allocate enough strings to store all 3gb. In many, many cases, though, ReadAllLines is just what the Doctor ordered.
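The ReadAllLines analogy can be sketched in a few lines: the eager call materializes every line at once, while a StreamReader loop (the same pattern as in the OpenReader workaround above) holds only one line in memory at a time.  A small, self-contained comparison using only the BCL (the temp file here is just a stand-in for a large input):

```csharp
using System;
using System.IO;

class LineStreaming
{
    // Eager: allocates a string array holding every line in the file at once.
    // Fine for small files; on a 3gb file this is the OutOfMemory trap.
    public static int CountLinesEager(string path)
    {
        return File.ReadAllLines(path).Length;
    }

    // Streaming: holds only one line in memory at a time,
    // so memory use is independent of file size.
    public static int CountLinesStreaming(string path)
    {
        int n = 0;
        using (var reader = new StreamReader(path))
        {
            while (reader.ReadLine() != null)
                n++;
        }
        return n;
    }

    static void Main()
    {
        string path = Path.GetTempFileName(); // stand-in for a large input file
        File.WriteAllText(path, "one\ntwo\nthree\n");
        Console.WriteLine(CountLinesEager(path));     // 3
        Console.WriteLine(CountLinesStreaming(path)); // 3
        File.Delete(path);
    }
}
```

The two methods return the same answer; the difference is only in peak memory, which is exactly the MemoryStream-vs-OpenReader distinction discussed above.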

So I would disagree with your statement that extracting into a MemoryStream is inherently dangerous or otherwise almost always wrong. It is not dangerous if you take the proper care.  I think the problem you saw came from not anticipating how much memory the MemoryStream might use in your case.

On your item 2 - yes, you are correct that all methods that perform extraction are prefixed with "Extract".  I don't agree that the OpenReader() method should be renamed to also share that prefix.  The OpenReader() method does not in fact extract an entry; it opens a stream that can be used to extract an entry.  Renaming that method so that it begins with "Extract..." would lead to confusion, I believe.  I agree that the OpenReader name is not optimal - it does not open a reader, it opens a stream.  A better name might be "OpenExtractionStream()" or something like that.  I could take a workitem on that, but I wouldn't consider it high-priority.  This is one of those cases where the fix is not necessarily better than the problem: changing the name will break all the old code, so I try to make name changes only rarely, and with good justification.  OpenReader() isn't the best name, but it's not a disaster either.  So for now I'm going to keep it.

re: BTW#1 - Most users of the library, when they extract, extract to files - I'd say 80%.  The benefit of DotNetZip is that it is a programmable .NET library; resorting to spawning/exec'ing programs means you give up flexibility and programmability in a .NET language.  WinZip is faster - it can be 2x as fast, but it depends on the situation - while DotNetZip is more flexible for a .NET developer.  I haven't benchmarked zlib vs DotNetZip, but I would expect a similar delta.

Edited, 2009 Jun 16, 8:24am - re: BTW#2, yes, I agree that the handling of passwords in the interface has been a little uneven.  ExtractWithPassword() isn't strictly necessary, because of the Password property on the ZipFile class.  But it's a convenience that helps when different entries in the ZipFile are encrypted with different passwords. I had thought no one had asked for an OpenReader() overload that takes a password, but in fact I had previously added one.  You will find it in v1.8.3.25.

Thanks again for the input.