Cannot Read and Save to the same stream

Jan 19, 2010 at 8:00 PM

I really need to be able to operate on a stream without rewriting it.  

In the test case below, if the ZipFile is Save'd to another stream, it works...

But when operating on large ZipEntries that is not optimal, because I would then need to move/copy that stream back to the original after Save'ing.

If I only operate on files all the way through the test, it works.

The following test case produces a corrupt zip file; when extracting the large file with WinZip, it reports: Error in file #1: bad Zip file offset (Error local header signature not found): disk #1 offset 11959.

 

 

        public void ZipPackageStreamTest2()
        {
            const string filename = @"c:\ZipPackageStreamTest2.zip";
            File.Delete(filename);

            // First, create the zip on disk with one large entry and one small entry.
            using (var zip = new ZipFile(filename))
            {
                zip.Name = filename;
                var ms = new MemoryStream();
                ms.SetLength(10240001);
                zip.AddEntry("fake_large.txt", ms);
                zip.AddEntry("two.txt", "small entry: 2");
                zip.Save();
            }

            // Then re-open the zip from a stream, add an entry, and save back to
            // that same stream -- this is the step that produces the corrupt archive.
            var inputStream = new FileStream(filename, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite);
            using (ZipFile zip = ZipFile.Read(inputStream))
            {
                //zip.ReadStreamIsOurs = true;
                //zip.Name = filename;
                zip.AddEntry("three.txt", "small entry: 3");
                // zip.RemoveEntry("two.txt");
                //inputStream.Position = 0;
                zip.Save(inputStream);
                //zip.Save();
            }
        }

 

 

Coordinator
Jan 20, 2010 at 2:24 AM

I'm not sure what problem you're having.

If you want to update a zipfile, you can just instantiate with the filename and call Save().  This updates the original zipfile.  I take it this is not acceptable to you, for some reason, but I don't know why.

Maybe you could explain that. 
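
A minimal sketch of what I mean, assuming the zip already exists at that path:

        using (ZipFile zip = ZipFile.Read(@"c:\ZipPackageStreamTest2.zip"))
        {
            zip.AddEntry("three.txt", "small entry: 3");
            zip.Save();   // updates the original zipfile in place
        }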

 

 

 

Coordinator
Jan 20, 2010 at 2:44 AM

I took a workitem on your issue - rather than creating a corrupted zip file, the Save() should throw an exception when you pass in the same stream as the input stream.

 

Jan 20, 2010 at 3:22 PM

I want to update a zipfile while operating on streams.  You are correct that if I instantiate with the filename and then Save(), it updates the zip.

This is not acceptable because I am implementing a plugin that only passes streams around (it's a provider for System.IO.Packaging).  

The workaround of writing to another stream and then copying it back to the original stream means that, on large files (>4 GB), the time goes from a few minutes to over 20 minutes.
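
For reference, this is roughly what that workaround looks like (just a sketch, using Ionic.Zip and System.IO; the method name and temp-file handling are mine):

        static void UpdateViaTempAndCopyBack(Stream original)
        {
            // Save the update to a temp file, then copy the result back into
            // the caller's stream -- the copy-back is the expensive part on
            // multi-GB archives.
            string tempPath = Path.GetTempFileName();
            using (var temp = new FileStream(tempPath, FileMode.Create, FileAccess.ReadWrite))
            {
                using (var zip = ZipFile.Read(original))
                {
                    zip.AddEntry("three.txt", "small entry: 3");
                    zip.Save(temp);
                }

                original.Position = 0;
                temp.Position = 0;
                var buffer = new byte[65536];
                int n;
                while ((n = temp.Read(buffer, 0, buffer.Length)) > 0)
                    original.Write(buffer, 0, n);
                original.SetLength(original.Position);
            }
            File.Delete(tempPath);
        }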

It seems logical that a seekable stream should be able to handle this case (another zip library handles it).

How does the library update the original zipfile when operating on a filename?  I'll be taking a look at the source code...

Coordinator
Jan 20, 2010 at 4:22 PM

DotNetZip cannot correctly save an update of the zip file to the same stream the zip file is being read from.  This is because an update may involve removal of entries, or modification of metadata on entries (for example the compression level, or the encryption), in which case the entry data will change.  Suppose you have the case where the input zip file has 10 entries, and the metadata + data for each of them is 1000 bytes, for a total of 10,000 bytes.  Now suppose the application modifies the compression level setting to be "None" on the 3rd entry.

Now, when Save() is called, for all entries that are unchanged, DotNetZip simply copies the metadata + data through without modification.  Therefore entries 1 and 2 are copied through, from the input stream to the output stream, with no analysis or extra elaboration.  The 3rd entry is no longer 1000 bytes; because of the lower compression, it is now 1200 bytes.  Entry 4 is copied through unchanged, as with entries 1 and 2.  The position in the input stream where entry 4 can be found is 3000.  But if DotNetZip has saved entry 3 into the input stream, then the new entry 3 will have overwritten the first 200 bytes of the original entry 4.
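
To put the offsets side by side (same numbers as above):

        offset 0    : entry 1, 1000 bytes, unchanged, copied through
        offset 1000 : entry 2, 1000 bytes, unchanged, copied through
        offset 2000 : entry 3, was 1000 bytes; saved with no compression it is now 1200 bytes,
                      so the rewritten entry 3 occupies 2000..3200
        offset 3000 : entry 4, unchanged, but its first 200 bytes (3000..3200) have already
                      been overwritten before they can be copied through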

This illustrates why, during update, the input stream and the output stream cannot be the same.   Other similar problems can result when adding or removing entries during an update, or when modifying other metadata, including the encryption used, the entry name, the timestamp information, the entry comment, and so on.  Adding entries is not the same as "appending" entries.  

When saving to a filename, the library saves to a new file, using the same approach as described above: copying through unchanged entries, and re-constituting changed entries.  When the save is completed, DotNetZip deletes the old file, and renames the new file to the appropriate name.

As for using streams - your code can certainly use streams, but because the update of a zip file does not, in general, result in a stream of the same size, you cannot use the input stream as the output stream.  Your case is odd, because according to your code there is no change in the metadata for the existing entries; you are only adding one entry.  In general it won't work, but in that special case it might.  I'd have to look into it further.

-----

Even so, I would not recommend depending on that behavior.  I'm pretty sure there is a way to use DotNetZip effectively to accomplish what you want, if you drop the insistence on using a single stream for both input and output.

> The workaround of writing to another stream and then copying it back to the original stream means that on large files (>4GB) the performance goes from a few minutes to over 20 minutes.

I can see that copying 4gb around would be inefficient.  I don't understand why you would need to copy streams. 

Jan 20, 2010 at 5:07 PM
Cheeso wrote: DotNetZip cannot correctly save an update of the zip file to the same stream the zip file is being read from.  This is because an update may involve removal of entries, or modification of metadata on entries (for example the compression level, or the encryption), in which case the entry data will change.  Suppose you have the case where the input zip file has 10 entries, and the metadata + data for each of them is 1000 bytes, for a total of 10,000 bytes.  Now suppose the application modifies the compression level setting to be "None" on the 3rd entry.

Yes, I do understand this case.  When doing the initial write of the zip file, the large files are put at the beginning so that, if other files are added or deleted, the whole file does not need to be rewritten.  I really want to use DotNetZip because of the wonderful compression performance, and because you do such a good job supporting it!

Cheeso wrote: When saving to a filename, the library saves to a new file, using the same approach as described above: copying through unchanged entries, and re-constituting changed entries.  When the save is completed, DotNetZip deletes the old file, and renames the new file to the appropriate name.

Yes, when I was stepping through the code I noticed that. Therefore, even using filenames would not be sufficient because the library rewrites the whole file.  I'm not sure yet if you copy the compressed entry from the old file to the new file (which would save on recompressing).

Cheeso wrote: As for using streams - your code can certainly use streams, but because the update of a zip file does not, in general, result in a stream of the same size, you cannot use the input stream as the output stream.  Your case is odd, because according to your code there is no change in the metadata for the existing entries; you are only adding one entry.  In general it won't work, but in that special case it might.  I'd have to look into it further.

Thanks!  That sounds promising.  I guess that means figuring out why the file is corrupt...

-----

Cheeso wrote: Even so, I would not recommend depending on that behavior.  I'm pretty sure there is a way to use DotNetZip effectively to accomplish what you want, if you drop the insistence on using a single stream for both input and output.

> The workaround of writing to another stream and then copying it back to the original stream means that on large files (>4GB) the performance goes from a few minutes to over 20 minutes.

I can see that copying 4gb around would be inefficient.  I don't understand why you would need to copy streams. 

I am given a stream, which is the input stream, and then I create a temporary stream for the output.  In order to return to the caller, I need to have modified the original stream (the zip library's input stream), so I need to copy the temporary stream back to the input stream.  I avoided that copy by reusing the SafeFileHandle.

I suppose that in order to be perfectly safe with a single stream for input and output, a bunch of accounting would need to be done, e.g. has the compression changed? Has an entry been removed? And all the other bits that invalidate a file.  Then any entries after that point would need to be stored aside and re-processed.

It seems to me that the zip entry for the large file could stay in place, the entries after it could then be written, and then the central directory could be written after that.
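
Roughly, using the entry names from my test case (the central directory sits at the end of a zip file, so it is the part that would be rewritten):

        before:  [fake_large.txt][two.txt][central directory][end-of-central-directory]
        after:   [fake_large.txt][two.txt][three.txt][new central directory][new EOCD]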

------

I see that the second Save that I am doing wraps the FileStream in a CountingStream at whatever position it is at when the save happens... then it starts writing out entries.  The zip entries that haven't changed would need to be skipped on re-writing.


Coordinator
Jan 20, 2010 at 6:05 PM
Edited Jan 20, 2010 at 6:08 PM
linuxbox wrote:

I'm not sure yet if you copy the compressed entry from the old file to the new file (which would save on recompressing).

Yes, the old entry is simply copied over. There is no decompress/recompress cycle unless it is necessary - for example if you set the CompressionLevel on a compressed entry to None, then it will decompress and then restream the decompressed bytes.

Thanks!  That sounds promising.  I guess that means figuring out why the file is corrupt...

Yes.

I am given a stream, which is the input stream, and then I create a temporary stream for the output.  In order to return to the caller, I need to have modified the original stream (the zip library's input stream), so I need to copy the temporary stream back to the input stream.  I avoided that copy by reusing the SafeFileHandle.

Well, that's a strange interface. You are given a single stream and it is up to you to modify the contents of that stream? Typically, with the filter pattern, there is an input stream and an output stream. It's odd to need to do filter-type actions with a single stream; in general it will not work.  If the interface you are required to use mandates this, it seems like a drawback of the interface.

Can you tell me more about this interface? Where does it come from? Is it your own code? Or is it a public interface on a library produced by Microsoft or some other third party?  The stream you are reading from and writing to - is it a MemoryStream?   I'd like to know more about it.  If the interface that provides a single stream is in your own code, then I think the correct solution, rather than modifying DotNetZip to comply with your requirements, is to modify your requirements.  You can properly implement a filter pattern on that interface, with an input stream and an output stream. That would eliminate the trouble entirely.
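
Something with this shape (a sketch only; the method name is illustrative, and it assumes the interface can hand you distinct input and output streams):

        public static void UpdateZip(Stream input, Stream output)
        {
            using (var zip = ZipFile.Read(input))
            {
                zip.AddEntry("three.txt", "small entry: 3");
                zip.Save(output);   // unchanged entries are copied straight through
            }
        }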

I don't know System.IO.Packaging, and I don't know what a provider is. I did a quick search on those two terms, which turned up nothing obvious. Exactly how does a provider plug into System.IO.Packaging?  Can you refer me to a document?

I suppose that in order to be perfectly safe with a single stream for input and output, a bunch of accounting would need to be done, e.g. has the compression changed? Has an entry been removed? And all the other bits that invalidate a file.  Then any entries after that point would need to be stored aside and re-processed.

I think you are misunderstanding something. As I explained, if you write over the entries in the original stream, they are no longer available. Suppose I modify the library to do the accounting you suggest. Suppose entries 1, 2, and 3 are unchanged; they get skipped in the optimal case. Now suppose entry 4 has changed - the compression or encryption has changed. Entry 5 has not changed.  According to your idea, entry 5 will just be copied from its original location in the stream to its new location, which is also the original location of entry 4.  This means, as you said, the library must "store aside" all the data for entry 4. Where shall that data be stored? What if entry 4 is 1 GB in size? The library cannot and must not try to allocate 1 GB of memory, and storing it to a filesystem file is equally problematic.

The use of an input and an output stream solves this problem. The application provides the input and output stream, and the library need not "store aside" anything.

It seems to me that the zip entry for the large file could stay in place, the entries after it could then be written, and then the central directory could be written after that.

The case you want to cover is a special one: you want to append a single entry and not modify any other entries. I don't know if that is a generally interesting scenario. Remember, this is already possible using an input stream and a distinct output stream. Your special requirement is that you want to append an entry, and you want to do it using a single stream. I don't think that is generally useful.

 

Coordinator
Jan 20, 2010 at 6:10 PM

Maybe I am misunderstanding and what you are doing is implementing System.IO.Packaging for Mono?  

 

Jan 20, 2010 at 7:35 PM

I am reimplementing System.IO.Packaging.ZipPackage because the Microsoft implementation has issues with multithreading.  It uses Isolated Storage, which can halt if more than one thread uses it at the same time.  Each thread uses its own zip package, but when a package overflows the maximum memory allowed for a MemoryStream, it is stored in Isolated Storage.

I don't have time right now to look into the problem with my test case... I have to get this working... so I'm on to getting SharpZipLib (#ZipLib) working (with 64-bit support).

 

Coordinator
Jan 21, 2010 at 8:16 AM
Edited Jan 21, 2010 at 8:19 AM

Ah, ok.

Well I'm still thinking about the Append capability.  I've opened a new workitem to track it.

http://dotnetzip.codeplex.com/WorkItem/View.aspx?WorkItemId=10034