Modifying Archive vs Creating New Archive

Jan 11, 2011 at 1:40 PM

Hi,

First of all, thank you for your zip library - it is really great.

But I have one question about performance: I need to add some files to an existing archive. Is it better to create a new archive containing all the files, or is there no difference between that and adding the files to the existing archive and saving it?

I ask because I saw a comment on one of the ZIP libraries for PHP, where the author said it is much better to create a new archive with the new files than to modify an existing archive, due to some performance problems with the ZIP format.

Thanks in advance.

Coordinator
Jan 12, 2011 at 2:44 AM

Hi -

In general, it will be faster to read an existing zip file into a ZipFile instance, add new entries or update existing ones, and then call .Save(), as compared to creating a new ZipFile with all new entries.
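As a sketch, the update path looks roughly like this; the file names here are hypothetical placeholders, and ZipFile.Read, UpdateFile, and Save are the relevant DotNetZip calls:

    using Ionic.Zip;

    class UpdateExample
    {
        static void Main()
        {
            // Open the existing archive rather than building a new one.
            using (ZipFile zip = ZipFile.Read("existing.zip"))
            {
                // UpdateFile adds the file if absent, or replaces the entry if present.
                zip.UpdateFile("report.txt");
                zip.UpdateFile("data.csv", "docs");  // "docs" = folder path inside the archive
                // Save() writes the archive back out; unchanged entries are copied,
                // not recompressed.
                zip.Save();
            }
        }
    }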

The DotNetZip ZipFile class is smart about which entries are updated and which are unchanged. For a ZipFile with 10001 entries, only one of which is new or updated, when the application calls Save(), DotNetZip will do a direct byte-copy from the old file to the new file for the unchanged 10000 entries, changing only the metadata as necessary - and in some cases even that is unnecessary. DotNetZip will then compress/encrypt the one entry that requires it. Since compression is the most expensive operation in the chain, this can save a great deal of time.

I'm not sure what the other commenter might have been referring to, but I think it is a misunderstanding on their part, if you've reported the gist of their comment accurately.  There's nothing in the zip spec that prevents an intelligent and efficient implementation of the Update use case in a zip library.  If there is an inefficiency, it is more likely a limitation of the particular PHP zip library under discussion, rather than of the spec itself, and the commenter didn't realize this.  

But there is a cost to the Update case. You need to read an existing zip file into memory, and DotNetZip takes time doing this; it checks various things as it reads the file in. This is a cost you don't pay when creating a zip file from all new entries. There's also a "cost" in terms of maintaining additional code.

The interesting question becomes: where is the crossover? At what point does the open-and-update case become faster than the just-create-a-zipfile case? For 2 files in the zipfile? For 20? For 200? I suspect it will depend on the files in question - their size and compressibility - as well as the performance characteristics of your IO subsystem. So you'll have to test it to be sure.
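If it helps, here is a rough timing harness for that test. It is only a sketch: the "data" folder, "extra.txt", and the pre-built archive.zip (assumed to already contain the same sample files) are all hypothetical stand-ins for your own inputs.

    using System;
    using System.Diagnostics;
    using System.IO;
    using Ionic.Zip;

    class CrossoverTest
    {
        static void Main()
        {
            // Hypothetical inputs: a folder of sample files, one new file to add,
            // and a pre-built archive.zip that already contains the sample files.
            string[] existingFiles = Directory.GetFiles("data");
            string newFile = "extra.txt";

            var sw = Stopwatch.StartNew();
            using (var zip = new ZipFile())       // case 1: create from scratch
            {
                zip.AddFiles(existingFiles);
                zip.AddFile(newFile);
                zip.Save("fresh.zip");
            }
            Console.WriteLine("create: {0} ms", sw.ElapsedMilliseconds);

            sw = Stopwatch.StartNew();
            using (var zip = ZipFile.Read("archive.zip"))  // case 2: open and update
            {
                zip.UpdateFile(newFile);
                zip.Save();
            }
            Console.WriteLine("update: {0} ms", sw.ElapsedMilliseconds);
        }
    }

Run it with varying numbers of files in the data folder to see where the two curves cross on your hardware.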