Streaming Output

Apr 25, 2010 at 7:32 AM

Question: I am attempting to zip large amounts of data automatically on a schedule. The resulting zip file would be too large to hold in memory (GBs). Is it possible to stream the resulting zip file out to a stream object as it is generated, rather than storing the file contents in memory and then saving it all at once? So something like...

I'm new to this library, so please bear with me...

using (ZipFile zip = new ZipFile())
{
    zip.SetOutputStream(xxx);

    zip.AddEntry(xxx,yyy);
    zip.AddEntry(xxx,yyy);
    zip.AddEntry(xxx,yyy);

    zip.Flush();
}

 

Apr 26, 2010 at 12:25 PM

IIRC, the library supports deferred adding of data - if you add an entry with a stream or a WriteDelegate, the library reads the data of the individual items while saving, so memory usage should be moderate.

Coordinator
Apr 26, 2010 at 6:46 PM
Edited Apr 26, 2010 at 6:48 PM

Yes, you are correct, hardcodet: there are options for deferring.

Two interfaces: ZipFile and ZipOutputStream

First, there are two main interfaces for creating zip files in DotNetZip.  One is the ZipFile class, which you've seen.  The other is the ZipOutputStream class.  Let's take them in reverse.

ZipOutputStream

Using ZipOutputStream, you treat the zip as a writable stream.  Your code creates this stream, and then for each entry you want to appear in the zip file, specify the entry name, then write the data for that entry into the stream.  As the data is written it is zipped.  It is a full streaming model.  It is a forward-only write-stream.  The code for that looks like this:

using (var output= new ZipOutputStream(outputFileName))
{
    output.Password = "VerySecret!";
    output.Encryption = EncryptionAlgorithm.WinZipAes256;

    foreach (string inputFileName in filesToZip)
    {
        System.Console.WriteLine("file: {0}", inputFileName);

        output.PutNextEntry(inputFileName);
        using (var input = File.Open(inputFileName, FileMode.Open, FileAccess.Read,
                                     FileShare.Read | FileShare.Write ))
        {
            byte[] buffer= new byte[2048];
            int n;
            while ((n= input.Read(buffer,0,buffer.Length)) > 0)
            {
                output.Write(buffer,0,n);
            }
        }
    }
}

This is a handy model, and may be what you want.  It isn't satisfactory for all uses, though, because this model requires the use of bit 3 of the general-purpose bit flag, which tells a reader that the CRC and sizes for an entry follow the compressed data in a data descriptor.  This part of the zip spec somehow isn't supported on some platforms, or by some tools.  It's really not that exotic, so I don't know why, but in any case some 3rd party tools will choke when consuming zips produced in this way.  The other attribute of the ZipOutputStream is that it is an output stream only.  There's no support for updating a zip file, or for random access, or simply reading a zip file.
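
The example above writes to a named file. For the original question (sending the zip to some other stream as it is generated), a minimal sketch follows. It assumes the ZipOutputStream constructor overload that accepts a Stream and a leaveOpen flag; the class and method names (StreamingZipSketch, ZipTo) are made up for illustration and are not part of DotNetZip.

```csharp
using System.IO;
using Ionic.Zip;

static class StreamingZipSketch
{
    // Hypothetical helper: zips each input file straight into 'destination',
    // holding only one small buffer in memory at a time.
    public static void ZipTo(Stream destination, string[] filesToZip)
    {
        // leaveOpen = true: the caller keeps ownership of 'destination'.
        using (var output = new ZipOutputStream(destination, true))
        {
            var buffer = new byte[8192];
            foreach (string path in filesToZip)
            {
                // Entry name is the bare file name; assumes no two inputs share one.
                output.PutNextEntry(Path.GetFileName(path));
                using (var input = File.OpenRead(path))
                {
                    int n;
                    while ((n = input.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        output.Write(buffer, 0, n);
                    }
                }
            }
        }
    }
}
```

The destination could be a FileStream, a network stream, or an HTTP response stream. The bit 3 caveat described above still applies to zips produced this way.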

ZipFile

The alternative when producing a zipfile is to use the ZipFile class.  The main verbs you use with ZipFile are AddFile and AddEntry.  Contrary to your suggestion, in most cases using these verbs does not cause the entire contents of the entries to be stored in memory at any one time, ever.  If you are adding files via AddFile, at the time of the call to ZipFile.AddFile, DotNetZip stores in memory the metadata about the entry - the name, whether it will use encryption or not, where the data will come from when the zipFile is eventually saved, and so on.  Regardless of the size of the file you are adding, what's stored in memory is generally less than 256 bytes worth of data. At the time of ZipFile.Save, the source file is read and its data compressed and written to the zip, in a streaming manner. So, the entire contents of the entry is never held in memory. 

The only exception to that is when calling AddEntry() with a byte array or a string defining the content of the entry to be written into the zipfile.  The overloads accepting these types of inputs are intended to support the insertion of entries in the zip file with dynamically-sourced content - for example a readme.txt file that contains a few lines of text.  You can do this with the AddEntry overload that accepts a string. Of course in this case the entire string is in memory at one time.

Stream stream = ObtainStreamFromSomewhere(); 
using (ZipFile zip = new ZipFile())
{
  // The content for this entry will be read from a filesystem file,
  // at the time of the call to Save(). 
  // The entire data from the file is never held in memory.
  zip.AddFile("C:\\whatever\\Name-of-Entry1.txt", "files"); 

  // The contents for this entry will be read from the provided stream,
  // at the time of the call to Save().
  // The entire contents of the stream is never held in memory.  
  zip.AddEntry("files\\Name-of-Entry2.bin", stream);

  // The content for this entry will be obtained from a string.
  // Obviously, the string is held in memory.  
  zip.AddEntry("files\\Readme.txt","this is the content of the readme entry");

  // Save() will read data from each of the above sources and
  // write to the provided zip file, in a streaming fashion.
  zip.Save("c:\\archive.zip");
}
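
Save() in the example above takes a file name. To send the archive to a stream instead, which is what the original question asked for, ZipFile also has a Save overload that accepts a Stream. A minimal sketch, assuming that overload and AddDirectory; SaveToStreamSketch and SaveTo are hypothetical names used only for illustration:

```csharp
using System.IO;
using Ionic.Zip;

static class SaveToStreamSketch
{
    // Hypothetical helper: registers a directory tree, then writes the zip
    // to 'destination'. File contents are read only during Save().
    public static void SaveTo(Stream destination, string directoryToZip)
    {
        using (var zip = new ZipFile())
        {
            // Only metadata is recorded here; no file data is read yet.
            zip.AddDirectory(directoryToZip, "files");

            // Each file is read, compressed and written in turn.
            zip.Save(destination);
        }
    }
}
```

A caller might open a FileStream with File.Create and pass it to SaveTo along with the directory to archive; any other writable stream should work the same way.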

The WriteDelegate

In the prior reply, hardcodet suggested an option, the WriteDelegate.  WriteDelegate is a different source that you can use with the ZipFile class. The use of the WriteDelegate may be interesting to you, but it is really orthogonal to the issue of whether the entry data is stored in memory at one time.  As I said above, apart from the two exceptions (AddEntry with a byte array or with a string), the full content of an entry is never stored in memory.

So what does the WriteDelegate do?   It switches from a pull model to a push model.  What I mean is this:  When calling ZipFile.AddFile or ZipFile.AddEntry, your app provides a source for entry data to DotNetZip.  This source might be a filesystem file, a stream, a string, or a byte array.  When your app calls ZipFile.Save(), DotNetZip then retrieves the data for each entry, from the source you provided.  This is what I might call "Pull".  DotNetZip reads the source your app provided. 

There are some cases where the application does not have a source from which DotNetZip can "pull" content.  An example is a .NET DataSet.  There's a nice WriteXml() method on the DataSet class, which can write an xml representation of a dataset into a stream.  But there's no way to get a stream for the dataset, if you see what I mean.  The DataSet can write to a sink, but cannot act as a source of data.   The WriteDelegate solves that problem.  The way it works:  at the time of ZipFile.Save, for each entry that has a WriteDelegate as a source, DotNetZip will invoke your application code, and allow your code to write directly into the zip stream.  It's something like the model for ZipOutputStream, if that makes sense, but just for a single entry.  The code looks like this:

private void WriteEntry (String filename, Stream output)
{
    DataSet ds1 = ObtainDataSet();
    ds1.WriteXml(output);
}

private void Run()
{
    using (var zip = new ZipFile())
    {
        zip.AddEntry(zipEntryName, WriteEntry);
        zip.Save(zipFileName);
    }
}
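
The same push model also suits content that is generated on the fly and is too large to build up as a string first. The following is a sketch only: the entry name, the row format, and the names WriteDelegateSketch and SaveReport are invented for illustration. The delegate is supplied as a lambda with the same (entry name, stream) shape as WriteEntry above.

```csharp
using System.IO;
using System.Text;
using Ionic.Zip;

static class WriteDelegateSketch
{
    // Hypothetical helper: writes 'rowCount' generated CSV rows into one entry.
    public static void SaveReport(string zipFileName, int rowCount)
    {
        using (var zip = new ZipFile())
        {
            // The lambda runs during Save(), not at the time of AddEntry().
            zip.AddEntry("report.csv", (entryName, stream) =>
            {
                // Not disposed on purpose: disposing the writer would also
                // close the stream that DotNetZip handed to us.
                var writer = new StreamWriter(stream, Encoding.UTF8);
                for (int i = 0; i < rowCount; i++)
                {
                    writer.WriteLine("{0},{1}", i, i * 2);
                }
                writer.Flush();
            });
            zip.Save(zipFileName);
        }
    }
}
```

Each row is compressed as it is written, so memory use does not grow with the number of rows.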

All of this is described in various places in the fairly complete reference that is available at http://dotnetzip.codeplex.com/documentation .  

I've been meaning to write some programming guide material, to complement that reference.  The workitem for that is http://dotnetzip.codeplex.com/WorkItem/View.aspx?WorkItemId=9032 .  This response is the kind of information I would put in that programming guide.