Extracting without ZipFile.Read

Nov 16, 2008 at 2:16 PM
My app uses the DotNetZip library and usually handles very large files (over 500 MB), so extracting them takes a very long time.
The one step that could be skipped is ZipFile.Read (I use it because I need to read zip files, not only create them). This method reads through the whole file first, and only then can you extract from it.
I know from reading the code that a direct pipe (zip -> extract) is not possible without calling ZipFile.Read, but it would definitely be a nice new feature.

What do you think? I already have a few ideas, such as adding a parameter to the extract method that gives the path to the zip file, so that reading and extracting could happen on the fly. It should not be a static method because (in my case at least) extraction progress events would then not be available.

Cheers,
Moz
Coordinator
Nov 16, 2008 at 3:20 PM
Edited Nov 16, 2008 at 3:33 PM
I'm interested in understanding your particular issue.
The ZipFile.Read() methods all use the internal ReadIntoInstance() method, which behaves as you described: it scans through the entire zip file to obtain metadata about the archive, including the names of the entries, how each one was compressed (or not), comments on those entries, encryption on the entries, and so on.

If you would like to extract only ONE of the entries of a large set, then I can see the potential benefit in implementing what you describe.
But if you are extracting a bunch of files, then I don't know that there is much room for optimization.

If the size of the entry is listed in the entry header, then the Read() operation just does a Seek() on the filestream, which is fast.  If the size of the entry is not included in the header, then the Read() operation needs to do, effectively, a search of the bytes looking for the end of the entry.  Whether the size of the entry is in the entry header depends on how the zip archive was constructed.   The search is what can take a long time, especially with larger files.  It may be unavoidable, even in the case where you want to extract a single entry from an archive.
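
To make the seek-versus-search distinction concrete, here is a minimal Python sketch (not DotNetZip code) of a front-to-back walk over the local entry headers. It relies only on the published zip layout: a 30-byte local header, and bit 3 of the general-purpose flags, which signals that the sizes are deferred to a trailing data descriptor. The names skip_entries and make_sample_zip are hypothetical, and the sketch assumes no ZIP64, no encryption, and ASCII entry names.

```python
import io
import struct
import zipfile

# Local file header: signature, version, flags, method, time, date,
# crc32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")
LOCAL_SIG = 0x04034B50
FLAG_DATA_DESCRIPTOR = 0x0008  # bit 3: sizes are stored after the data


def skip_entries(stream):
    """Walk local headers from the start of the archive.

    Returns (names, needs_search). needs_search is True when an entry
    leaves its size out of the header, so a plain seek cannot skip it.
    """
    names = []
    while True:
        raw = stream.read(LOCAL_HEADER.size)
        if len(raw) < LOCAL_HEADER.size:
            break
        (sig, _ver, flags, _method, _time, _date, _crc,
         csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack(raw)
        if sig != LOCAL_SIG:
            break  # reached the central directory (or unknown data)
        names.append(stream.read(name_len).decode("cp437"))
        if flags & FLAG_DATA_DESCRIPTOR:
            # Size unknown here: a reader would have to scan the bytes
            # for the end of the entry, which is the slow path.
            return names, True
        # Size known: jump over the extra field and the entry data.
        stream.seek(extra_len + csize, io.SEEK_CUR)
    return names, False


def make_sample_zip():
    """Hypothetical helper: build a small stored archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", b"hello")
        zf.writestr("b.txt", b"x" * 1000)
    return buf.getvalue()
```

Calling skip_entries(io.BytesIO(make_sample_zip())) visits both entries using seeks only, because Python's zipfile writes the sizes into the local header when the output stream is seekable.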

An alternative may be to seek to the end of the archive, read the central directory structure, and thereby learn the list of entries and the compressed sizes for each.  This would avoid the search operation I described, and may improve performance on large files.  Before I do this I would want to understand your scenario, how you measure, and whether this would even help.
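
As a rough illustration of that alternative, the following Python sketch (again, not DotNetZip code) locates the end-of-central-directory record near the end of the file and then reads only the central directory entries. The record layouts come from the zip format itself. The function name read_central_directory and the helper make_sample_zip are hypothetical, and the sketch assumes a single-disk archive without ZIP64 extensions and with ASCII entry names.

```python
import io
import struct
import zipfile

# End-of-central-directory record (22 bytes before the archive comment).
EOCD = struct.Struct("<IHHHHIIH")
EOCD_SIG = b"PK\x05\x06"
# Central directory file header (46 bytes before the variable fields).
CDIR = struct.Struct("<IHHHHHHIIIHHHHHII")
CDIR_SIG = 0x02014B50


def read_central_directory(stream):
    """Return [(name, compressed_size, local_header_offset), ...]."""
    stream.seek(0, io.SEEK_END)
    file_size = stream.tell()
    # The record sits at the end, followed by a comment of at most 65535 bytes.
    tail_len = min(file_size, EOCD.size + 0xFFFF)
    stream.seek(file_size - tail_len)
    tail = stream.read(tail_len)
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("end-of-central-directory record not found")
    (_sig, _disk, _cd_disk, _count_here, total,
     _cd_size, cd_offset, _comment_len) = EOCD.unpack_from(tail, pos)

    stream.seek(cd_offset)
    entries = []
    for _ in range(total):
        fields = CDIR.unpack(stream.read(CDIR.size))
        if fields[0] != CDIR_SIG:
            raise ValueError("bad central directory signature")
        csize = fields[8]
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        local_offset = fields[16]
        name = stream.read(name_len).decode("cp437")
        # Skip the per-entry extra field and comment.
        stream.seek(extra_len + comment_len, io.SEEK_CUR)
        entries.append((name, csize, local_offset))
    return entries


def make_sample_zip():
    """Hypothetical helper: build a small stored archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", b"hello")
        zf.writestr("b.txt", b"x" * 1000)
    return buf.getvalue()
```

The work done here depends on the number of entries rather than on the size of the entry data, which is why this route could help on large archives. Whether it helps in practice would still need to be measured.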
Nov 17, 2008 at 10:00 AM
Well, basically, in my particular case I don't really care what is in the archive. I just need to extract the archive directly to a specific location. After extraction the process should just exit gracefully.
So in my case I don't need to know the structure of the archive or any other information about it.
Coordinator
Nov 17, 2008 at 4:57 PM
So... is your objection to the observed performance, to the observed design, or to something else?

I think you want to do ExtractAll(), but you looked at the code and you don't think the ctor or the Read() method should read the file all the way through.

You have a valid point.  A reasonable approach would be to read only the central directory structure at the end of the zip archive, rather than reading through the entire zip file to find each zip entry.  I am not sure of the performance difference it would make, though.

Tell ya what - I will conduct some tests and see how it goes.

Coordinator
Nov 17, 2008 at 5:11 PM
This discussion has been copied to a work item.