Fast & efficient way to store ZipEntry objects in database

May 21, 2010 at 7:17 PM

I allow people to upload zip files to my web site. Those zip files are stored in a database (SQL Server varbinary(max)) by the web server. What I need to do is pull the zip file out of the database, extract all of the entries, then put the entries back into a different database table (one row per entry).

The thing to keep in mind is that the files may be quite large (2GB).

Currently, I "chunk" the database value to disk, then use DotNetZip to extract each entry to disk. Then I re-zip each individual entry (on disk) and "chunk" the new zip file (with only one entry) back into the new database table. I am assuming that when ZipFile does its work on disk it makes good use of memory (i.e., it does not load the entire file into memory at once).
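To make that concrete, the current flow is roughly like the sketch below (simplified; paths are placeholders, the database chunking code is omitted, and I'm assuming flat entry names):

    // rough sketch of the current disk-based round trip
    using System.IO;
    using Ionic.Zip;

    string aggregateZip = @"C:\temp\upload.zip";   // chunked out of the DB to disk
    string workDir      = @"C:\temp\work";

    using (ZipFile zip = ZipFile.Read(aggregateZip))
    {
        foreach (ZipEntry entry in zip)
        {
            // decompress this entry to disk...
            entry.Extract(workDir, ExtractExistingFileAction.OverwriteSilently);

            // ...re-compress it into its own single-entry zip...
            string singleZip = Path.Combine(workDir, entry.FileName + ".zip");
            using (ZipFile single = new ZipFile())
            {
                single.AddFile(Path.Combine(workDir, entry.FileName), "");
                single.Save(singleZip);
            }
            // ...then chunk singleZip back into the new database table
        }
    }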

However, I really don't need/want all this stuff on disk. I would prefer to stream the zip file out of the database into a ZipFile object (but not all at once, I want to buffer it) and then stream each ZipEntry back into the new table (no need to uncompress it then recompress it like I am doing now).

If I can't buffer the stream that populates the ZipFile object then I might be able to live with that (holding the entire ZipFile in memory). But I would really, really like to be able to stream each ZipEntry back into the new database table without uncompressing it (e.g. using Extract(stream)). Is there a way to get at the uncompressed stream on the ZipEntry object?

Sorry if this doesn't make sense, I'm not intimately familiar with zip archives and/or DotNetZip. I have a solution that is working, it just doesn't seem very efficient to write everything to disk and to uncompress then re-compress each entry.

May 21, 2010 at 7:20 PM

Sorry, I meant to ask "Is there a way to get at the compressed stream on the ZipEntry object?"

Coordinator
May 21, 2010 at 8:39 PM
Edited May 21, 2010 at 8:41 PM

One way to do what you want might be to open the stream on the original database entry, delete every ZipEntry except the one you want, then save the resulting ZipFile to the new DB table. You would have to open the large aggregate ZipFile N times if there are N entries.

Doing that, DotNetZip will write just the compressed bytes for the one entry to the new save location. There would be no decompression and recompression.

The way DotNetZip works, it doesn't read all the bytes in a zip file when you open it. It *does* scan to the end of the zipfile to get the directory (metadata). So, suppose you have an aggregate zip file that is 2GB. If you read it, DotNetZip will scan to the end, read the directory (maybe this is 3k), then hold the offsets for all the ZipEntries. If you remove a ZipEntry from the ZipFile, DotNetZip removes that item from those that will be written upon the next save.

Suppose you do that for all entries except one, then call ZipFile.Save(), specifying a new location. DotNetZip will do a foreach over all remaining entries (just one), seek to the location of the compressed stream for that entry in the original zip file, read the raw (compressed) data, and write the raw data to the new save location. Strictly a byte-for-byte copy. The metadata at the end of the zipfile will be constructed anew for the new zip file, but for a single entry this is something like 30-80 bytes, depending on the length of the filename and other options. So this should be very fast, assuming the Seek() time on a 2GB stream in the database is fast. If you have 50 items in the zip file, you'll need to open it and seek to the end 50 times.
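In code, the per-entry loop might look something like the following sketch. The openSource and openDestination delegates are just placeholders for however you read and write your varbinary columns; they are not part of DotNetZip.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using Ionic.Zip;

    static class ZipSplitter
    {
        // openSource must return a fresh, seekable read stream over the aggregate
        // zip each time it is called; openDestination returns a write stream for
        // the new row that will hold the single-entry zip.
        public static void SplitIntoSingleEntryZips(
            Func<Stream> openSource, Func<string, Stream> openDestination)
        {
            List<string> entryNames;
            using (Stream source = openSource())
            using (ZipFile zip = ZipFile.Read(source))
                entryNames = zip.EntryFileNames.ToList();   // directory only; no entry data is read

            foreach (string name in entryNames)
            {
                using (Stream source = openSource())
                using (ZipFile zip = ZipFile.Read(source))
                {
                    // drop every entry except the one we want to keep
                    foreach (string other in zip.EntryFileNames.Where(n => n != name).ToList())
                        zip.RemoveEntry(other);

                    // the remaining entry's compressed bytes are copied as-is
                    using (Stream destination = openDestination(name))
                        zip.Save(destination);
                }
            }
        }
    }

The key point is that each iteration re-reads only the zip directory and then copies the one remaining entry's compressed bytes straight through.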

That should work.

If the Seek performance in the database stream is slow, you might optimize by copying the original aggregate zip file to the filesystem once, then opening it 50 times from the filesystem. The filesystem will deliver good seek performance. That way you would still avoid the unzipping of each entry that your current approach does.
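For example, something along these lines, staging the zip locally before running the per-entry loop (openSource is the same assumed placeholder as in the sketch above):

    // stage the aggregate zip in the filesystem once, then open the local copy
    // N times in the per-entry loop instead of re-opening the database stream
    string tempPath = Path.GetTempFileName();
    using (Stream source = openSource())
    using (FileStream local = File.Create(tempPath))
    {
        byte[] buffer = new byte[64 * 1024];
        int n;
        while ((n = source.Read(buffer, 0, buffer.Length)) > 0)
            local.Write(buffer, 0, n);   // one sequential pass out of the database
    }
    // ...then use ZipFile.Read(tempPath) inside the per-entry loop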

----

People have asked for a ZipEntry import/export capability, where you can read a ZipFile, then "export" one or more entries into a different zipfile. http://dotnetzip.codeplex.com/WorkItem/View.aspx?WorkItemId=7896   That would fit your needs directly, and it would require that you open the original ZipFile only once. I haven't implemented that, though. Yet?

May 22, 2010 at 6:23 PM

Thanks for the suggestion, Cheeso!

I combined your suggestion with a "VarbinaryStream" class I found here (http://www.eggheadcafe.com/software/aspnet/29988841/how-to-readwrite-chunked.aspx) and it works great. I don't have to write anything to disk. I just use the stream methods of DotNetZip, and all reads/writes go directly to the database. There is no disk I/O (other than on the database server, obviously) and memory usage never goes above 20MB, even for very large files. Nice!
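For anyone who lands here later, the gist of the chunked database access is something like the snippet below. This is not the actual VarbinaryStream class from that link, just the idea; the table and column names are placeholders.

    // read the varbinary(max) column in chunks with SUBSTRING, and append chunks
    // with the UPDATE ... .WRITE clause, so only one small buffer is in memory
    using System.Data;
    using System.Data.SqlClient;

    static class VarbinaryChunks
    {
        // Read one chunk of the varbinary(max) column; offset is 1-based.
        public static byte[] ReadChunk(SqlConnection conn, int id, long offset, int length)
        {
            using (var cmd = new SqlCommand(
                "SELECT SUBSTRING(Data, @offset, @length) FROM ZipFiles WHERE Id = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", id);
                cmd.Parameters.AddWithValue("@offset", offset);
                cmd.Parameters.AddWithValue("@length", length);
                return (byte[])cmd.ExecuteScalar();
            }
        }

        // Append one chunk to the column. The column must be initialized
        // (e.g. to 0x, not NULL) before .WRITE can append to it.
        public static void AppendChunk(SqlConnection conn, int id, byte[] chunk)
        {
            using (var cmd = new SqlCommand(
                "UPDATE ZipEntries SET Data.WRITE(@chunk, NULL, 0) WHERE Id = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", id);
                cmd.Parameters.Add("@chunk", SqlDbType.VarBinary, -1).Value = chunk;
                cmd.ExecuteNonQuery();
            }
        }
    }

Wrapping reads like that behind a seekable Stream subclass is what lets ZipFile.Read() and Save() work against the database directly.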

Coordinator
May 23, 2010 at 1:48 AM

Ahh, excellent, glad it worked for you.