Zip File length

Mar 3, 2011 at 8:23 PM

I'm streaming a zip file from a web application.

The data is not compressed and I would like to set the content length so that the user gets some feedback in the download box.

I do think I had this working at one point, but it's either stopped working or I'm going mad.

In a simple test case I seem to be about 88 bytes short.  For each file entry I calculate

Length += blob.Properties.Length + 30 + filename.Length + 46 + filename.Length + 22;

When I inspect the binary of the created file, there seem to be an extra 0x24 bytes of 'ExtraField' data in the local file header, and I suspect there is also the (12-byte) zip data descriptor at the end of each file's data.

I think my 22 should be applied once, to the whole file, not per entry.

But I can't make the numbers add up.  So.  I'm streaming a zip file with no compression and I want to work out the content length exactly before it leaves.  I know the lengths of the streams and the names.
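
For reference, this is the sum I think ought to be right, counting only the structures in the spec (local file header 30 bytes + name, central directory header 46 bytes + name, and the 22-byte end-of-central-directory record counted once).  It assumes ASCII names (so characters equal bytes) and no extra fields or data descriptors; 'blobs' and EntryName() are stand-ins for my real loop:

    long length = 22;                                          // end-of-central-directory record, once per archive
    foreach (var blob in blobs)
    {
        string name = EntryName(blob);                         // placeholder for however I name the entry
        length += 30 + name.Length + blob.Properties.Length;   // local file header + name + stored data
        length += 46 + name.Length;                            // central directory entry + name
    }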

Oh - if I remove the content length header, it works again!  But I'd still like to know...

 

Iain

Coordinator
Mar 9, 2011 at 2:06 PM

I'm not clear on the problem.  Can you provide some code?  I don't know what you mean by "I'm streaming a zip file" and "the data is not compressed."  What data? 

If you are trying to calculate the size of a zipfile externally, independently of the library, I think that is a bad idea.  You may get it right in one case, but not another.  It is a brittle approach and it will break if the implementation of the library changes. I think you should avoid that.

If you want to specify the content length in a reliable manner, I suggest this approach: create the zipfile, save it into a temporary directory on the server, then stream the zip file to the client.  The content-length will be correctly supplied by ASP.NET.  Of course you will need to remove the temporary file after the download completes.
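
Roughly, as a sketch (the entry name and GetContentStream() are placeholders for your own data access, and this assumes it runs inside an ASP.NET page or handler where Response is in scope, with using Ionic.Zip and using System.IO):

    // Build the zip on disk, then let ASP.NET supply Content-Length from the file size.
    string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".zip");
    using (var zip = new ZipFile())
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;   // store only, no deflate
        zip.AddEntry("file1.dat", GetContentStream("file1.dat"));  // placeholder for your own stream source
        zip.Save(tempPath);                                        // seekable output
    }
    Response.ContentType = "application/zip";
    Response.AddHeader("Content-Disposition", "attachment; filename=download.zip");
    Response.TransmitFile(tempPath);     // streams the file without buffering it all in memory
    Response.Flush();
    File.Delete(tempPath);               // or clean up later, if the server still holds the file open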

If you are doing full streaming (without creating a temp file, and without ASP.NET output buffering), then by definition the length of the zipfile is not knowable until the stream has been completely written.  (Any attempt to make assumptions about the size of a zipfile generated this way is liable to be wrong.)  In that case ASP.NET will use Transfer-Encoding: chunked, and the client won't know the total size of the download until it has completed.

On the other hand if you are creating a simple zip file with no compression, then maybe you don't need the Ionic library; you can just create it yourself, manually twiddling the bits, and you will know exactly how large each entry and each zip file will be.   

 

Mar 9, 2011 at 2:29 PM

Thanks for your reply.

I'm writing a zip file out to the Response.OutputStream in an ASP.NET application.  I'm using ZIP purely as a container, with the files uncompressed.  I know the length of each file segment in the zip file; what I'm not clear about is the length of each directory entry.

Basically, I open a stream for each file (they are Azure blobs as it happens), add zip entries, and then your library does the magic when I write it out.  Nothing terribly clever.
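
In outline it looks like this (simplified; OpenBlobStream() stands in for the Azure blob access and fileNames for my real list):

    Response.ContentType = "application/zip";
    Response.AddHeader("Content-Disposition", "attachment; filename=files.zip");
    // Response.AddHeader("Content-Length", predictedLength.ToString());  // the header that's off by ~88 bytes

    using (var zip = new ZipFile())
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;   // zip as a container only
        foreach (string name in fileNames)
            zip.AddEntry(name, OpenBlobStream(name));              // placeholder for CloudBlob.OpenRead()
        zip.Save(Response.OutputStream);                           // non-seekable output
    }
    Response.Flush();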

I've used the definition of the zip format on Wikipedia, but it still doesn't add up to what gets downloaded.  One complication is that the first directory entry seems to have an 'extra data field', though I can't work out what sort it is.

I've adopted this approach to minimise resource usage when streaming many potentially large zip files out of a small server, and I don't particularly want to stage them to disk and so on.

I take your point on the last suggestion;  I guess I picked Ionic because I don't want to spend the time doing it myself - lazy programmer! 

I suppose I think it would be a good idea if you could get this information from Ionic at specification time.  I also think it would be good if I could turn off whatever extra information is being put in the directory entries.

Iain

Coordinator
Mar 9, 2011 at 4:10 PM

Hi Iain,

By default the Ionic Zip library does add an Extra Data Field into each zip entry, to store higher resolution timestamps.  In the original zip spec, times had a 2-second resolution.  An extension to the zip spec allows higher-res timestamps to be encoded in an extra field.  For each entry, there is a timestamp for the created time, the modified time, and the accessed time, and each one is 8 bytes.  So that is 24 bytes plus a header, which I cannot recall the size of at this moment.  It is maybe 28 bytes total for that particular extra data field, plus some additional framing data for the extra data field in general.  (I can't recall if there is framing data for extra fields, but I think there is.  If you look in the comments of the source code for ZipEntry.Write.cs, it will be clearer.).
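
If it is the standard NTFS-times extra field (header ID 0x000A), my recollection of the layout is roughly this, which would account exactly for the 0x24 (36) extra bytes you observed:

     2 bytes   extra field header ID (0x000A)
     2 bytes   extra field data size
     4 bytes   reserved
     2 bytes   attribute tag (0x0001)
     2 bytes   attribute size (24)
    24 bytes   mtime + atime + ctime, each an 8-byte Windows FILETIME
    --------
    36 bytes   total (0x24)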

You can disable the generation of this extra data field by setting the EmitTimesInWindowsFormatWhenSaving property on the ZipEntry (or on the ZipFile, to cover all entries) to false, before saving the ZipFile. The other way to deal with it is to just increase your size total by the size of that extra field (36 bytes, if the breakdown above is right) to account for the high-resolution timestamp data.
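
For example, as a sketch (set it before calling Save; I believe the property is available on the ZipFile as well as on each ZipEntry):

    using (var zip = new ZipFile())
    {
        zip.EmitTimesInWindowsFormatWhenSaving = false;   // skip the NTFS-times extra field on every entry
        // ... add entries, then Save() ...
    }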

However, you may still run into the bit-3 issue.  I think this is what you referred to as the 12-byte Data Descriptor that follows the uncompressed entry data.  Here's the thing: the original zip format put the sizes of the data (compressed and uncompressed) into the zip entry header, and the compressed data followed the header.  When doing compression, one won't know the size of the resulting compressed blob until the compression is done.  So, when streaming a zip file, the compressed size is not known at the time the zip entry header, which is supposed to contain it, has to be emitted.  DotNetZip normally solves this by seeking backward in the output stream and inserting the correct value.  But ASP.NET's Response.OutputStream is not seekable, so that is not possible.  A modification made by PKWare to the zip spec in 1996-ish addresses this issue: if a zip entry has bit 3 set in its general-purpose bit flag, the compressed and uncompressed sizes in the zip entry header are ignored, and those values are written after the compressed data (optionally prefaced by a 4-byte signature).  Though it dates to 1996-ish, many tools still don't correctly read zip files formatted this way; for example, the default zip reader on MacOS.  If you don't care about this compatibility issue, then you should just add the data descriptor (12 bytes, or 16 with the optional signature) for each entry to your computed size.
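
For reference, the data descriptor written after each entry's data when bit 3 is set is laid out like this (ignoring ZIP64, which uses 8-byte sizes):

     4 bytes   optional signature (0x08074b50)
     4 bytes   CRC-32
     4 bytes   compressed size
     4 bytes   uncompressed size
    --------
    16 bytes with the signature, 12 without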

I say that you may still run into this because, when compression is OFF for a zip entry, the compressed size of the entry is of course the same as its uncompressed size, and both will be known as soon as all the uncompressed data has been read in.  DotNetZip uses a streaming input and output design internally; for each entry it (optionally) compresses and writes output to the output stream as it reads the uncompressed data from the input stream.  The upshot is that the library does not know the size of the uncompressed data until the writing (and optional compression) is done.  When the input is a regular filesystem file, though, there is an opportunity to query the file size before the reading begins.  That would allow DotNetZip to avoid the bit-3 encoding in the case of a zip entry that uses a regular filesystem file as input with NO compression.  In general in the DotNetZip implementation I tried NOT to use bit-3 encoding, because of the compatibility issue I described above.  Unfortunately, I don't recall if I implemented this particular tweak.  It should be easy to spelunk a zipfile generated this way to find out.

 

Mar 9, 2011 at 4:44 PM

Hi, Cheeso.

Many thanks for this.  It's very helpful.  I'm gonna have to go away and look at this though, since it's a bit involved!  One thing I HAVE noticed is that the zip files I'm making with Ionic are NOT recognised by the default MacOS reader.  From what you say, the only fix for this is to insert the actual length in the ZipEntry header.  Given that I can determine the length of the file (the blob - I'm not sure the stream representation includes the length) by external means, and the file is not compressed, is there some way I could explicitly set the 'compressed' length manually, to at least make it Mac-compatible?

Cheers

Iain

Coordinator
Mar 10, 2011 at 12:34 AM

Hi, Iain,

Yes, your observation about the MacOS reader agrees with reports I've gotten from other users.

The way around that is to produce zip files that do not use the bit-3 encoding. In DotNetZip the way to do this is to NOT write to a non-seekable stream. If you're producing the zip in an ASP.NET app, then writing to Response.OutputStream will produce a zip with bit-3 encoding, which will not be readable on MacOS. If you write to a filesystem file, you will not get bit-3 encoding, and the result will be readable by the MacOS reader. Using this approach you'd Save() the zip to a temporary filesystem file, then read that file and stream it directly to the client, using the System.IO.FileStream class.

After writing the contents of the filesystem file to Response.OutputStream, you'd then optionally delete the temporary zip file. If your app is smart, you may find it possible to re-use these zip files, which would avoid re-creating the zipfile for subsequent download requests. But your app would have to keep track of the cached zip files, and the contents of same.

What you have suggested above is to doctor the zip data to insert the actual compressed and uncompressed lengths into the ZipEntry headers. This is certainly possible, but it would seem to defeat the main reason to use a full streaming approach, which is the simplicity of handling. If you're gonna have to complicate things, you may as well NOT custom-modify the zip file; just use the zip library and the System.IO.FileStream methods to write to the download.

A better way, if you want to avoid creation of a filesystem file, might be to just save the zipfile to a MemoryStream. This would obviously only work if the zip entry content is small, where "small" is relative to your particular server and the amount of memory you have.  A MemoryStream is seekable, and so you'd not get the bit-3 encoding, which means you'd create a zipfile that was readable by the MacOS reader. After saving to the MemoryStream, you'd write the contents of the MemoryStream to Response.OutputStream, and bob's your uncle.
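
Something like this, as a sketch (again, OpenBlobStream() and fileNames are placeholders for your own code):

    // Build the zip in a seekable MemoryStream (no bit-3 encoding), then send it.
    using (var ms = new MemoryStream())
    {
        using (var zip = new ZipFile())
        {
            zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
            foreach (string name in fileNames)
                zip.AddEntry(name, OpenBlobStream(name));
            zip.Save(ms);                                             // seekable, so real sizes go in the headers
        }
        Response.ContentType = "application/zip";
        Response.AddHeader("Content-Disposition", "attachment; filename=download.zip");
        Response.AddHeader("Content-Length", ms.Length.ToString());   // exact length, known before sending
        ms.Position = 0;
        ms.CopyTo(Response.OutputStream);                             // .NET 4; on 3.5 use a manual copy loop
        Response.Flush();
    }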

In summary, you have options. I think manually modifying the zip data is the least attractive option.