Adding 1 entry to large pre-existing archive, Question

Jan 21, 2010 at 7:09 AM

When adding one file to a large archive within, let's say, WinZip, it looks like WinZip has to write the ENTIRE file again just to absorb that one new file.  In testing this out with DotNetZip, things look much much better, but I notice it is still a pretty intensive operation.  So I have a 100 MB archive with 10,000 files, within 13 folders.  Adding one file and then saving:

zipArchive.AddEntry("folder9/file77a.xml", "Some Data");
zipArchive.Save();

takes just short of 1 second (0.9 s) on my fairly fast system.  That is, again, far outperforming what WinZip apparently does, but of course it would be even better if it could be quicker, namely if big parts of the archive didn't have to be touched or rewritten just to add one more file.  So if anyone is up to letting me know anything about this, I'd really appreciate it!  Namely: in the code above, after Save is called, how much of the 99 MB file is rewritten just to accommodate one extra (almost zero length) file?  Or are there any tricks I'm missing?

Thanks!

 

Coordinator
Jan 21, 2010 at 7:24 AM

This came up on another thread as well. 

The current implementation copies through the existing data for all existing ZIP entries, and then adds the data for the newly-added entry at the end.  When you call Save(), the semantics are not "append".  If you have a 99mb zip file to start with, it copies ~99 mb of data to a new filesystem file, before appending the (new) last entry.  It does not decompress and recompress any of the existing zip entries.
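As a rough illustration of what "copies through" means here, the following is a minimal C# sketch that moves the raw bytes of an entry from one stream to another without inflating or deflating them. The class and method names (CopyThrough, CopyRange) are made up for this example and are not DotNetZip's internals; the point is only that the cost is IO, not compression.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of DotNetZip.
public static class CopyThrough
{
    // Copies `count` raw bytes starting at `offset` from source to destination.
    // The bytes are never interpreted, so nothing is decompressed or recompressed.
    public static void CopyRange(Stream source, long offset, long count, Stream destination)
    {
        byte[] buffer = new byte[64 * 1024];
        source.Seek(offset, SeekOrigin.Begin);
        while (count > 0)
        {
            int n = source.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (n == 0) throw new EndOfStreamException();
            destination.Write(buffer, 0, n);
            count -= n;
        }
    }
}
```

For a 99 MB archive, a full save under this model still reads and writes about 99 MB, even though no entry data is recompressed.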

The reason DotNetZip does not just "append" to the existing zipfile is that the Update capability allows you to modify any of the existing entries in a zip, or in fact remove entries from a zip.  Because of that full update capability, the library cannot assume that all of the other entries are unchanged. 

There's been a request, recently, to allow "Append" operations.  I'm thinking about how that might be possible.  So far I don't see a huge need for it, since Update is generally pretty fast. 

 

Coordinator
Jan 21, 2010 at 7:35 AM

Copernicus, what do you think about AppendFile/AppendEntry() as a new set of static methods on the ZipFile class?  There would be various overloads:

void ZipFile.AppendFile(String zipFileName, String nameOfFilesystemFileToAppend);
void ZipFile.AppendFile(Stream seekableStream, String nameOfFilesystemFileToAppend);
void ZipFile.AppendEntry(String zipFileName, WriteDelegate d);
void ZipFile.AppendEntry(Stream seekableStream, WriteDelegate d);
void ZipFile.AppendEntry(String zipFileName, String entryContent);
void ZipFile.AppendEntry(Stream seekableStream, String entryContent);
void ZipFile.AppendEntry(String zipFileName, OpenDelegate o, CloseDelegate c);
void ZipFile.AppendEntry(Stream seekableStream, OpenDelegate o, CloseDelegate c);

The function would be to append a single new entry to an existing zip file.  It would generally be high-performance, and economical because no redundant writes would be performed. 

Just an idea.

 

Jan 21, 2010 at 8:26 AM

It could be easier to first create a second zip with one or more files and then provide a method for appending the second zip to the first one. If entries with the same name are detected, the entry from the second zip would be used. The previous content from the first zip would no longer be referenced.

Coordinator
Jan 21, 2010 at 3:17 PM

There's already a workitem for merging zip files.  If I produced the merging capability, would this Append capability be unnecessary?

Also I wrote some more about Append - see http://dotnetzip.codeplex.com/WorkItem/View.aspx?WorkItemId=10034

 

Jan 21, 2010 at 4:04 PM

For me, it would be perfect, expecting that for entries with the same path in the two zip files, the entry from the new file will overwrite the other one.

Coordinator
Jan 21, 2010 at 4:23 PM

Gilles, what you want is not an Append semantic.  I don't think you want Append.   You want a merge. 

Jan 21, 2010 at 6:12 PM

Cheeso, It would absolutely rock to have a ZipFile.OpenForAppend capability.  It would take archiving to a whole new level by changing the archive from being static and immutable (like the string type in .NET) to being pliable and dynamic (like StringBuilder, to continue the analogy)!  I think that having some dynamic capabilities such as this could really expand what DotNetZip (or any file archive system) could be used for. Going the route that you discussed in the workitem (http://dotnetzip.codeplex.com/WorkItem/View.aspx?WorkItemId=10034), about having to be explicit about opening only for appending entries, and throwing Exceptions if illegal operations are called, really sounds like the way to go to me.

But perhaps we could expand this to one more important area.  Let's say an archive has a directory named customerSet4, with a file named customer920.xml.  It's of equal importance, I believe, to have the same dynamic capabilities for making a change to that one file and saving the results, without having to alter the entire zip-file/archive.  I'd imagine that, for just appending a file, there would still have to be a check to make sure that the directory that would contain it doesn't already have a file with that same name.  So, with allowing a dynamic change of a pre-existing file (ZipFile.OpenForAlter ? ZipFile.OpenToAlterEntry / Entries ?), is it possible that the actual operation wouldn't be that much different?  Namely, "the central directory" could have the "entry pointer" (?) (for the entry that is being altered) de-referenced or erased, so that the actual "file" in the archive would simply be lost because of being dereferenced (or it could simply be written over with zeros).  Then it would just be a matter of doing the same AppendFile operation.  I am very enthusiastic about these possibilities.

Copernicus

ps

[[I don't currently know much at all about how archives actually work, but I'm hoping to learn more on this; so you wrote Cheeso in the Append Work Item about fixing up "the central directory".  Is the central directory then the work horse that makes the whole archive possible?, recording I'd imagine a hierarchical list of file entry names and their file position pointers?  Is there a resource you could point me to Cheeso that covers the underlying basics of archives?  I feel that, if we think outside of the box, archives could be used for more things than they are currently being tapped into for, but making them more dynamic with regard to changes would be a big part of that]]

Jan 21, 2010 at 6:32 PM

I vote for the Append syntax where individual files could be appended to an existing zip file.

 

Coordinator
Jan 21, 2010 at 6:34 PM

Copernicus, thanks for the feedback.

Today, it is possible to edit or alter a ZIP file.  What I am exploring in that workitem is a way to do so more efficiently. It is an optimization, really, a way to eliminate some IO, that is possible if and only if I restrict the kinds of alterations that can be made to the zipfile, to be Append only.

Today, if you want to edit or update a zip file, you can do so very simply.

using (var zip = ZipFile.Read("myzip.zip"))
{
  // drop one entry, replace the content of another, then write the result
  zip.RemoveEntry("directory/customer54.xml");
  zip.UpdateEntry("directory/customer55.xml", "New Data");
  zip.Save();
}

This kind of thing works today, and it does not "alter the entire zip file". Today, when DotNetZip is used to update a zip file, what it does for all unchanged entries in the zip is copy the bytes from the source to the destination. This works and is reliable, but it does incur IO cost.

That IO cost could be optimized away if I can enforce constraints on the kinds of updates. The idea is that if the changes are ONLY additive, then I could append the new entry data to the existing zip stream, without changing anything.

The updates you are talking about - updating a particular file or removing an entry and so on - would NOT be compatible with an Append operation. That sort of update involves making changes to the entries in the zipfile, and the size of the zipfile would change, and so on. As I said, that is already possible in today's library.

About the zip directory: each zip file is a series of records, one for each entry in the zip file, followed by a zip directory that lists the offsets in the file for the zip entry records. An "Append" operation would not need to modify any of the existing records. It would need to wipe out the existing directory, append any new entry records there, and then re-generate a new directory. For a 100mb file, an Append might operate on only the last 1k of the file, which would be speedier.
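To make that layout a bit more concrete, here is a minimal C# sketch that finds where the directory begins by reading only the tail of the file. It relies on the end-of-central-directory record defined in the ZIP format (signature 0x06054b50, with the entry count at byte 10 and the directory offset at byte 16 of the record). It assumes a plain single-disk archive without Zip64 extensions, and the names ZipTail and FindCentralDirectory are invented for this example; they are not DotNetZip APIs.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of DotNetZip.
// Assumes a single-disk archive without Zip64 extensions.
public static class ZipTail
{
    const uint EndOfCentralDirSignature = 0x06054b50; // bytes "PK\x05\x06" on disk
    const int EndRecordMinSize = 22;                  // fixed part, without the comment
    const int MaxCommentSize = 65535;

    static int ReadUInt16(byte[] b, int i)
    {
        return b[i] | b[i + 1] << 8;                  // zip fields are little-endian
    }

    static uint ReadUInt32(byte[] b, int i)
    {
        return (uint)(b[i] | b[i + 1] << 8 | b[i + 2] << 16 | b[i + 3] << 24);
    }

    // Returns the offset at which the central directory starts and outputs the
    // number of entries it lists. Only the tail of the stream is read.
    public static long FindCentralDirectory(Stream zip, out int entryCount)
    {
        int tailSize = (int)Math.Min(zip.Length, EndRecordMinSize + MaxCommentSize);
        byte[] tail = new byte[tailSize];
        zip.Seek(-tailSize, SeekOrigin.End);
        int read = 0;
        while (read < tailSize)
        {
            int n = zip.Read(tail, read, tailSize - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }

        // Scan backwards: an archive comment may follow the end record.
        for (int i = tailSize - EndRecordMinSize; i >= 0; i--)
        {
            if (ReadUInt32(tail, i) == EndOfCentralDirSignature)
            {
                entryCount = ReadUInt16(tail, i + 10);
                return ReadUInt32(tail, i + 16);
            }
        }
        throw new InvalidDataException("End-of-central-directory record not found.");
    }
}
```

Under this model, an append would start writing at the returned offset: the new entry record goes where the old directory began, followed by a regenerated directory and end record. Everything before that offset stays untouched.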

Jan 22, 2010 at 12:33 AM

Codeplex just swallowed a reply I wrote with an error ... grrrr. So this is a much less pretty reply because my fingers don't want to repeat it all (error #93aa3299-2a1c-4d0a-a60c-2551e9b40486).

I definitely don't want to presume anything here, it just may not be possible like you said.  But just to be clear, I was not saying that the actual "file" would be altered, it would remain completely untouched.  Instead, I am wondering if it would be possible to just erase the "directory" from within the central directory that points to that record (I think you're calling all the references to the "files" "directories" .. if that's correct).  The only harm the actual file would do then is that it would take up unreferenced space, but for most uses, that's not a problem.  It's kind of like with non-rewritable data CDs: in Nero I burn these and check the option to be able to write more data to the CD.  Even though the CD is not a rewritable CD, it still allows you to "replace" files or folders on a reburn.  How?  Well, you'll notice that the CD size just keeps getting bigger, so obviously, they're just dereferencing a file you're "replacing" and then APPENDING (back to our first question) a new, altered copy of that file.  Over time, you could periodically clean up the zip file with a full resave that could perhaps lose all those dereferenced files.  Maybe this still isn't possible, I just wanted to make sure I wasn't being misunderstood (i.e. as if the actual file in the archive could literally be written over - which is clearly completely impossible).  For many uses I can think of, this could be a very robust possibility.  I don't want to muddy the idea of doing appends, that would be a powerful option in its own right.  

Thanks!!

 

 

Coordinator
Jan 22, 2010 at 3:13 AM

ok, I see what you're saying.

It's certainly possible to do as you say - to leave dead space in the zip file.  I'll have to think about whether that would be a good thing to do.  I don't see, so far, the primacy of the requirement to not update the zip file.  I mean this isn't a write-once world, like a CDROM.  Considering that, I'm not sure why I would want a library that leaves what would be arbitrary amounts of dead space in the zip file.  I think it would be possible, but I don't think most people need this kind of capability.

 

Jan 22, 2010 at 8:19 AM

If you already track unchanged entries, the I/O optimization would simply be to leave the entries at the start of the file where they are until you encounter a new or updated one. It would not create empty space, and in the case of append-only operations, it would give the expected result.

Sort the entries by position in the files, putting the new ones at the end of the sort, e.g. sort by file offset, using long.MaxValue for new entries. Then skip entries until you find a new or updated one, and then process them all.
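The idea above could be sketched in C# roughly as follows. The types and names (EntryPlan, SavePlanner, Plan) are hypothetical and only model the bookkeeping; they are not DotNetZip classes. The sketch also assumes that the first entry record starts at offset 0 and that a removed entry shows up as a gap between consecutive offsets, which forces the rewrite to start there.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of a zip entry; illustrative only.
public class EntryPlan
{
    public string Name;
    public long Offset;   // position in the existing file; long.MaxValue for new entries
    public long Length;   // size of the entry record in the existing file
    public bool Changed;  // true for new or updated entries
}

public static class SavePlanner
{
    // Returns the file offset from which the archive must be rewritten, and
    // outputs the entries that have to be written from that point on, in order.
    public static long Plan(IEnumerable<EntryPlan> entries, out List<EntryPlan> toWrite)
    {
        // OrderBy is a stable sort, so new entries keep their insertion order.
        var sorted = entries.OrderBy(e => e.Offset).ToList();
        long rewriteFrom = 0;
        int i = 0;

        // Skip the leading run of unchanged, contiguous entries: their bytes
        // stay exactly where they are. A gap means an entry was removed.
        while (i < sorted.Count && !sorted[i].Changed && sorted[i].Offset == rewriteFrom)
        {
            rewriteFrom = sorted[i].Offset + sorted[i].Length;
            i++;
        }

        toWrite = sorted.Skip(i).ToList();
        return rewriteFrom;
    }
}
```

With only new entries added, the skipped run covers every existing entry, so the plan degenerates to a pure append. If the first entry in the file is the one that changed, nothing is saved compared to a full rewrite.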

Jan 22, 2010 at 4:12 PM

First, thanks again for entertaining these ideas Cheeso.

I think this functionality could be very useful to a number of people and implementations, though I admit that for many common uses of an archive this more dynamic, far less IO-intensive option is not needed (as you show in your post 4 above).  For me though, I have a number of uses that would benefit very much from not having to recopy the bulk of a gigabyte-plus zip file onto the disk. 

We're considering two things, ... the first and foremost is the ability to append a file; I added to that discussion the additional possibility of changing records too.  I'm glad that you're already at least somewhat considering that the append functionality could be a nice feature to add.  I definitely vote up on that one!  But while considering that, it seems to me that the additional functionality of being able to "change" a file in an IO-friendly manner as I've suggested (really: to abandon an old file and append it as a new one) would perhaps be no more than one or two more steps on top of the append file operation.  I think you're indicating that the ZipFile.OpenForAppend capability would already necessitate rewriting the central directory.  And when appending a file, it would already be necessary to make sure the appended file doesn't already exist.  If it did exist, the default would be to throw an exception: file already exists.  But maybe even the same function could have an overload: ZipFile.OpenForAppend(..., bool replaceIfExists) [for instance].  It would simply be a matter then of scratching that record pointer (or whatever it's called) from the central directory, while of course including the new reference to the newly appended file.
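A toy C# model of that bookkeeping might look like the sketch below. It tracks only names, offsets and lengths in memory and does no actual zip IO. AppendOnlyDirectory and its members are invented for this illustration, and replaceIfExists mirrors the overload suggested above, which does not exist in DotNetZip.

```csharp
using System;
using System.Collections.Generic;

// Toy in-memory model of the proposal; not DotNetZip code.
public class AppendOnlyDirectory
{
    // entry name -> (offset, length) of its record in the archive
    private readonly Dictionary<string, KeyValuePair<long, long>> entries =
        new Dictionary<string, KeyValuePair<long, long>>();

    public long EndOfData { get; private set; }  // where the next record would go
    public long DeadBytes { get; private set; }  // space held by dereferenced records

    public void Append(string name, long length, bool replaceIfExists)
    {
        KeyValuePair<long, long> old;
        if (entries.TryGetValue(name, out old))
        {
            if (!replaceIfExists)
                throw new InvalidOperationException("Entry already exists: " + name);
            // The old record is not touched; it simply stops being referenced.
            DeadBytes += old.Value;
        }
        // The directory now points the name at the newly appended record.
        entries[name] = new KeyValuePair<long, long>(EndOfData, length);
        EndOfData += length;
    }

    public long OffsetOf(string name)
    {
        return entries[name].Key;
    }
}
```

The DeadBytes counter is the cost Cheeso raised earlier in the thread: every replacement leaves the old record in the file until a full resave compacts it.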

Thanks

ps gillesmichard, what you're suggesting sounds very difficult and complicated to me, and often wouldn't be beneficial either ("until you encounter a new or updated one" ... ).