CalculateSize(ZipOutputStream, CalculationType)

Jul 8, 2009 at 11:52 PM
Edited Jul 9, 2009 at 12:02 AM

I think this feature request deserves a discussion of its own, although we discussed this already on the streaming discussion.

 

///<summary>
/// Calculates requested sizes
///</summary>
///<param name="zipOutput">Zip stream to calculate on</param>
///<param name="calculationType">Choice of calculation type</param>
///<returns>The calculated size, in bytes</returns>
///<remarks>
/// See calculation types for calculation behavior
///</remarks>
public long CalculateSize(ZipOutputStream zipOutput, CalculationType calculationType)...



///<summary>
/// Enumeration of calculation types
///</summary>
enum CalculationType
{
    ///<summary>
    /// Calculate the output size for the current NON-DEFLATED files in the stream
    ///</summary>
    ///<remarks>
    /// Ignores any deflated files. Gives the size as it will be when saved.
    ///</remarks>
    OutputSize_NonDeflatedFiles,

    ///<summary>
    /// Calculate the self-extractor size for the current NON-DEFLATED files in the stream
    ///</summary>
    ///<remarks>
    /// Ignores any deflated files. Gives the size as it will be when saved.
    ///</remarks>
    SelfExtractorSize_NonDeflatedFiles,

    ///<summary>
    /// Calculate an estimate of the output size for the current files in the stream
    ///</summary>
    ///<remarks>
    /// Gives an estimate of the size as it will be when saved.
    ///</remarks>
    OutputSize_Estimate,

    ///<summary>
    /// Calculate the size of the extracted files
    ///</summary>
    ///<remarks>
    /// Gives the size as it will be when saved and then extracted.
    ///</remarks>
    ExtractedSize
}

 


If you don't have time, or do not wish to develop it, could you give me some pointers as to how you would do the first calculation type?
I'm willing to develop it, and send it to you, if you decide you do want it.
Or, to pay for the development by you, even though its open source. (I need it in the very near future).

Thanks, Moshe (pashute g-mail)

Jul 8, 2009 at 11:57 PM
Edited Jul 9, 2009 at 12:05 AM

Thanks (in advance)

Coordinator
Jul 9, 2009 at 1:18 AM
Edited Jul 9, 2009 at 1:19 AM

Yeah, worthy of discussion.
(Did you write this code, or is this from an existing source, like maybe SharpZipLib?)

First, what is a ZipOutputStream?  Is that a finished zip file?  If so, I think you could build what you want pretty easily.  If it is not a finished zip file, then I don't know how to calculate those things without actually doing the compressing and/or uncompressing.  As you know, that can be expensive, even without actually creating the files.  There's lots of bit manipulation involved, and checksums, so for a large file (let's say 1gb) it will take some time - on the order of 1-5 minutes.  More for larger archives.

Assuming you have an intact zip file,...

  • ExtractedSize: This is the aggregate size of all entries in the zip. Calculating the extracted size is easy: you just enumerate through the ZipEntry items and tally up the UncompressedSize property on each one. You'll need an int64 quantity of course.
  • OutputSize_NonDeflatedFiles:  I don't know exactly what you intend here, but it sounds to me like the size of all entries in the archive that are NOT compressed or deflated.  If they are not compressed or deflated, then the compressed size is the same as the uncompressed size.  You can use the same approach as with ExtractedSize, but only adding up entries that have CompressionMethod = 0.
  • SelfExtractorSize_NonDeflatedFiles: - same as above. 
  • OutputSize_Estimate: I think this is just the answer to the question "how big is the zip file" and this is easy.  Just new a FileInfo() for the file and get the length.
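For the intact-zip case, the tallying in the first two bullets can be sketched against DotNetZip's ZipFile / ZipEntry API (the archive name is a placeholder; this assumes the UncompressedSize and CompressionMethod members discussed in this thread):

```csharp
using Ionic.Zip;

// Sketch: tally sizes over an intact zip file.
long extractedSize = 0;   // ExtractedSize
long storedSize = 0;      // OutputSize_NonDeflatedFiles (entry data only)
using (ZipFile zip = ZipFile.Read("archive.zip"))
{
    foreach (ZipEntry e in zip)
    {
        extractedSize += e.UncompressedSize;
        if (e.CompressionMethod == 0)      // stored, i.e. not deflated
            storedSize += e.UncompressedSize;
    }
}
```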

Supposing you DON'T have an intact zip file, you cannot calculate these quantities.  Here's why:

  • ExtractedSize: Just for review, I think this is intended to be the aggregate size of all entries in the zip archive, I guess, right?   DotNetZip allows you to create ZipEntry items from stream input.  The library does not know the size/length of the uncompressed bytestream for such an entry, until it reads it.  And according to the model DotNetZip uses, it does not read the input stream until it is writing the output stream - in other words saving the zip archive.  It learns of the "Uncompressed Size" for each entry at the same time it learns of the "Compressed Size" of each entry:  after it has completely read the stream and compressed the bytes.  At that point the library has the final total for bytes read in (Uncompressed) and bytes written out (Compressed).  Does that make sense?  These quantities - CompressedSize and UncompressedSize - are attached to the ZipEntry instance and also stored in the zip archive in the metadata for the entry in the zip file. Without a zip file, I don't know how to produce this number. 
  • OutputSize_NonDeflatedFiles: Here you have the same problem as described above - the numbers are not known until the streams are read. That is enough to make it impossible to calculate, but there is a second issue: whether or not a given stream is deflated is not decided until the file is saved. DotNetZip is adaptive, and will use no compression on some streams that don't compress well. Also, DotNetZip can defer to the application - let the app decide at Save time whether to use compression or not. This also represents an obstacle.
  • SelfExtractorSize_NonDeflatedFiles: - same as above.  There's no way to know the size of the stream without reading it.
  • OutputSize_Estimate: The library doesn't have any good way to estimate compressibility of a particular byte stream.  Here again, DotNetZip (and the application that uses it) learns of the compression factor when the compression is done.  It might be possible to use a heuristic to estimate the compressed size. For example, you could compress the first 10% of a stream or entry, and then guess that the compression factor recorded for the first 10% will be the same as the compression factor for the full stream.  I don't know how valid this assumption would be, but it might be a good first take.  The problem here is that input streams in general are read-only - they are read once.  Imagine an ftp stream or an HTTP stream - you cannot back up.  You can only read it.  So, reading and compressing the first 10% of a stream will render the stream useless when the actual zip file is saved. 

You've asked about this several times in different ways, and I keep answering pretty much the same way: I don't know a good way to get these numbers without doing the compression. You *could* do it if you constrain the problem significantly. Like, for example, if you suppose that all input streams are seekable. And, you could also keep a historical record of "average compressed sizes", so you could make some good guesses about compressibility based on the size of the file and, say, the file extension. These constraints would really distort DotNetZip for its core usage scenarios, though.  Just to document how to use it would take me days, and if it takes me that long to write down an explanation, who's going to read it?   And how much additional complexity gets rolled into the library in order to accommodate this size estimate thing?   For all these reasons it is outside of what I would like to add to DotNetZip.

Maybe a more fruitful approach is to just do the compression, and you will know the numbers.  No estimates.  Given enough CPU and IO bandwidth, you can have the answer as fast as you like, regardless of the size of the file. You can compress to a bitbucket (Stream.Null), so you don't need to allocate disk storage to know the numbers.    It would still take time, but you would eliminate a bunch of IO. For small archives (hundreds of files, totalling 50mb or less) it would be 2-3 seconds. 
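That bit-bucket approach might look roughly like this - a sketch, assuming (as this thread describes) that each entry's CompressedSize is populated once Save() completes; the input folder is hypothetical:

```csharp
using System.IO;
using Ionic.Zip;

// Sketch: do the real compression, but discard the output bytes.
long totalCompressed = 0;
using (var zip = new ZipFile())
{
    zip.AddDirectory(@"c:\data\ToZip");       // hypothetical input
    zip.Save(Stream.Null);                    // compress to a bit bucket
    foreach (ZipEntry e in zip)
        totalCompressed += e.CompressedSize;  // exact, not an estimate
}
```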

Last thing: any work I choose to do on DotNetZip has an opportunity cost. If I spend time on one thing, that is time I cannot spend on something else.  One of the things I would like to do is build a parallel compressor, which will use multiple threads (and hence multiple CPUs) to compress entries in a zipfile. This has the potential to speed up zipping by 2x. And it would be usable by *everyone* immediately, and transparently.  There would be no additional complexity in the API - The same exact code would just run faster.  And because virtually every PC comes with multiple cores, the benefit would be available to everyone.   Comparing the cost and benefit of the parallel compressor to building a size estimator?  In my mind, it's not even close.  I can see you think it would be nice to have an estimate, but given the cost and complexity required to build what I think you want (with predictive algorithms, pattern matching, AI, historical averages and so on), and also the relatively low reliability of the resulting numbers, I don't see how it would ever be justified ahead of the parallel compressor work.

As a sort of side benefit, if I do get the parallel deflate thing to work, you will be able to produce actual numbers, twice as fast. 

I'm happy to discuss this more, but based on my understanding of what you want, I don't know how to build what you want, simply.

 

 

Jul 9, 2009 at 2:29 PM

>Yeah,. worthy of discussion.
>(Did you write this code, or is this from an existing source, like maybe SharpZipLib?)
Yes. I'm not at my work PC; this is pseudo-code, which I wrote from memory (so SharpZipLib came to mind by mistake). 
I meant the zip object - ZipInfo? Or whatever you write in the using(... = new ...)

>First, what is a ZipOutputStream?  Is that a finished zip file?  If so I think you could build what you want mostly pretty easily. 
Nope, not after saved.
>If it is not a finished zip file, then I don't know how to calculate those things without actually doing the compressing and/or uncompressing. 
>As you know, that can be expensive, even without actually creating the files. 
>There's lots of bit manipulation involved and checksums, so for a large file (let's say 1gb) it will take some time - on the order of 1-5 munutes. 
>More for larger archives.   
Exactly, that's why I want it for NON-DEFLATED files, and for deflated files I only want a guesstimate.

>Assuming you have an intact zip file,...

  • >ExtractedSize: This is the aggregate size of all entries in the zip. Calculating the extracted size is easy: you just enumerate through the ZipEntry items and tally up the UncompressedSize property on each one. You'll need an int64 quantity of course

Exactly!  Enumerate through the ZipEntry items and tally up the UncompressedSize prop. Right, a "long".

  • >OutputSize_NonDeflatedFiles:  I don't know exactly what you intend here, but it sounds to me like the size of all entries in the archive that are NOT compressed or deflated.  If they are not compressed or deflated, then the compressed size is the same as the uncompressed size.  You can use the same approach as with ExtractedSize, but only adding up entries that have CompressionMethod = 0.

No! I need to know the size of the output zipfile itself, before it is extracted, so I need to know the extra size of the zipfile header and footer, and also the size of each ZipEntry's header.
Maybe add a parameter  Calculate...(... out string problem)
Two behaviors are possible: Count all files (regardless of CompressionMethod), and the user can then use the value for statistics,
Or maybe throw an exception if there are entries that are deflated (compressed)
For streams, either ignore (with warning?) or throw exception.

In any case, this is the one that I really need.
And I can show you in various fora that many programmers have requested this, for streaming uncompressed files which are aggregated into one download.

  • >SelfExtractorSize_NonDeflatedFiles: - same as above.  

Same as above: OutputSize_NonCompressed + the self-extractor stub size.

  • >OutputSize_Estimate: I think this is just the answer to the question "how big is the zip file" and this is easy.  Just new a FileInfo() for the file and get the length. 

No - as above, we didn't call zip.Save(...), so we can only get an estimate.  See further:


Supposing you DON'T have an intact zip file, you cannot calculate these quantities.  Here's why:

  • >ExtractedSize: Just for review, I think this is intended to be the aggregate size of all entries in the zip archive, I guess, right?  
    DotNetZip allows you to create ZipEntry items from stream input. 
    The library does not know the size/length of the uncompressed bytestream for such an entry, until it reads it. 
    And according to the model DotNetZip uses, it does not read the input stream until it is writing the output stream - in other words saving the zip archive.  It learns of the "Uncompressed Size" for each entry at the same time it learns of the "Compressed Size" of each entry:  after it has completely read the stream and compressed the bytes.  At that point the library has the final total for bytes read in (Uncompressed) and bytes written out (Compressed).  Does that make sense?  These quantities - CompressedSize and UncompressedSize - are attached to the ZipEntry instance and also stored in the zip archive in the metadata for the entry in the zip file. Without a zip file, I don't know how to produce this number.

So as I said, throw a NotSupportedException("Cannot calculate undetermined size of input stream: " + entry.Name) if there are streams,
or ignore them and warn in the out problem param: "Calculation of size ignored undetermined size of input stream: " + entry.Name

  • >OutputSize_NonDeflatedFiles: Here you have the same problem as described above - the numbers are not known until the streams are read. That is enough to make it impossible to calculate, but there is a second issue: whether or not a given stream is deflated is not decided until the file is saved. DotNetZip is adaptive, and will use no compression on some streams that don't compress well. Also, DotNetZip can defer to the application - let the app decide at Save time whether to use compression or not. This also represents an obstacle.

As explained, ignore streams, or throw an exception if there are stream entries.
And about deflation or not: what needs to be given here is ExtractedSize + zip header/footer + entry headers.
If you are saying that files may end up compressed even when the compression method was set to zero, then I would say: throw an exception if ForceNoCompression is not set.

  • >SelfExtractorSize_NonDeflatedFiles: - same as above.  There's no way to know the size of the stream without reading it.

Same as above, + stub size.

  • >OutputSize_Estimate: The library doesn't have any good way to estimate compressibility of a particular byte stream.  Here again, DotNetZip (and the application that uses it) learns of the compression factor when the compression is done.  It might be possible to use a heuristic to estimate the compressed size. For example, you could compress the first 10% of a stream or entry, and then guess that the compression factor recorded for the first 10% will be the same as the compression factor for the full stream.  I don't know how valid this assumption would be, but it might be a good first take.  The problem here is that input streams in general are read-only - they are read once.  Imagine an ftp stream or an HTTP stream - you cannot back up.  You can only read it.  So, reading and compressing the first 10% of a stream will render the stream useless when the actual zip file is saved. 

See my next remark.

>You've asked about this several times in different ways, and I keep answering pretty much the same way:
>I don't know a good way to get these numbers, without doing the compression.
>You *could* do it if you constrain the problem significantly. Like, for example, if you suppose that all input streams are seekable. And, you could also keep a historical record of "average compressed sizes" , so you could make some good guesses about compressibility based on the size of the file, and say, the file extension.
>These constraints would really distort DotNetZip for its core usage scenarios though. 

>Just to document how to use it, would take me days , and if it takes me that long to write down an explanation, who's going to read it?   And how much additional complexity gets rolled into the library in order to accomodate this size estimate thing?   For all these reasons it is outside of what I would like to add to DotNetZip.

>Maybe a more fruitful approach is to just do the compression, and you will know the numbers.  No estimates.  Given enough CPU and IO bandwidth, you can have the answer as fast as you like, regardless of the size of the file. You can compress to a bitbucket (Stream.Null), so you don't need to allocate disk storage to know the numbers.    It would still take time, but you would eliminate a bunch of IO. For small archives (hundreds of files, totalling 50mb or less) it would be 2-3 seconds. 

Thank you for the detailed answer! Yes, I am talking about a heuristic. Let's leave this for now; I'll do a little research - perhaps there are (and I suppose there are) various discussions on this. A heuristic would be important for choosing automatically between compression methods dynamically...  But as I wrote above, this is not really important. 
The idea of doing a fast compression is interesting. In that case I would not call it Estimate but OutputSize_CompressedFiles, and warn the programmer in the documentation that this does a real compression, so it may take a few seconds. Also, if the input includes streams, it should fail with NotSupported, or ignore them and warn.


>Last thing: any work I choose to do on DotNetZip has an opportunity cost. If I spend time on one thing, that is time I cannot spend on something else. 
>One of the things I would like to do is build a parallel compressor, which will use multiple threads (and hence multiple CPUs) to compress entries in a zipfile.
>This has the potential to speed up zipping by 2x. And it would be usable by *everyone* immediately, and transparently. 
>There would be no additional complexity in the API - The same exact code would just run faster.  And because virtually every PC comes with multiple cores, the benefit would be available to everyone.  
>Comparing the cost and benefit of the parallel compressor to building a size estimator?  In my mind, it's not even close. 

>I can see you think it would be nice to have an estimate, but given the cost and complexity required to build what I think you want (with predictive algorithms, pattern matching, AI, historical averages and so on), and also the relatively low reliability of the resulting numbers, I don't see how it would ever be justified ahead of the parallel compressor work.

I don't need the estimate, only the size of the output with the headers. If you have any idea of how to get the header sizes, a pointer would help me, and I could do it.
So what I really need is one method:
CalculateZipSizeForNoCompressionFilesOnly( ZipInfo zipInfo )...

>As a sort of side benefit, if I do get the parallel deflate thing to work, you will be able to produce actual numbers, twice as fast.
OK, OK. We'll donate a nice sum.

>I'm happy to discuss this more, but based on my understanding of what you want, I don't know how to build what you want, simply.
I hope now I have been clearer. Originally you had to send your explanations three times before I "got" them; I hope I have been more attentive now.

Moshe

Jul 13, 2009 at 9:52 AM

Or in other words:

Where do I get the size of the zip header and footer, as well as each file's zip header size, before creating the zipfile -
for a ZipFile not yet saved, with no streams in it, and set with ForceNoCompression?

Meanwhile I'm trying your solution: creating an actual zip to a bit-bucket null stream. I'll tell you how it goes.

Thank you for all your fast replies and detailed answers,
and good luck with the DotParallelZip,


Moshe

 

Coordinator
Jul 13, 2009 at 5:41 PM

You can look in the AppNote.txt document for the ZIP specification, which describes the structure of a zip file in detail. 

There is no "zip header" - a zip file has no file-level header.  Typically a zip file begins with the local header of the first zip entry.

There is a header for the zip entry.  For a vanilla ZIP entry, the header is 30 bytes long, plus the filename length, plus the comment length. If you add in "extra" fields, as defined in the zip spec, the header will be longer.  The NTFS file times, for example, will add 32 bytes.  ZIP64 adds an extra field, of variable length. WinZip AES encryption adds an extra field of 9 bytes, if memory serves me correctly.  Therefore if you'd like a general solution, then you need to consider the sizes of all of those extra fields, which are considered to be part of the entry header. 

The size of the entry footer varies depending on the way you stream out the data.  There's an optional part of between 16 and 28 bytes. 

There is a central directory, whose size varies depending on the number of entries in the zip file and the type of each.  Following that there is an end-of-central-directory record, which can also vary in size.
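Putting those constants together, the no-compression case can be approximated with plain arithmetic. A sketch, assuming the simplest layout the spec allows - stored entries only, no extra fields, no ZIP64, no encryption, no data descriptors (the 30/46/22-byte fixed sizes come from APPNOTE.TXT; the class and method names are hypothetical):

```csharp
using System.Collections.Generic;
using System.Text;

static class ZipSizeMath
{
    // Predict the size of a zip containing only STORED entries, with no
    // extra fields, no ZIP64, no encryption, and no data descriptors.
    //   local file header        = 30 bytes + filename length
    //   central directory header = 46 bytes + filename length
    //   end-of-central-directory = 22 bytes + archive comment length
    public static long PredictStoredZipSize(IEnumerable<KeyValuePair<string, long>> files,
                                            string archiveComment)
    {
        long size = 0;
        foreach (KeyValuePair<string, long> f in files)
        {
            int nameLen = Encoding.UTF8.GetByteCount(f.Key);
            size += 30 + nameLen + f.Value;  // local header + entry data
            size += 46 + nameLen;            // central directory entry
        }
        return size + 22 + Encoding.UTF8.GetByteCount(archiveComment);
    }
}
```

For example, one stored 10-byte entry named "a.txt" with no comments comes to 30+5+10 + 46+5 + 22 = 118 bytes. Any extra field in use (e.g. the 32-byte NTFS times field mentioned above) or per-entry comment would have to be added in.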

 

 

Jul 14, 2009 at 10:51 AM

Back to the bitbucket idea:
If I want to see the BytesWritten of the ZipFile after Save(Stream.Null) was called (so we have a valid CountingStream),
is there any existing way other than me adding a ZipFile.BytesWritten property?


Would using your example of:

 else if (e.EventType == ZipProgressEventType.Saving_EntryBytesRead)

do any good?  I think the event is only sent for input streams, right?

I must say: This is BEAUTIFUL code. Clear and clean!
Moshe

Coordinator
Jul 14, 2009 at 12:31 PM

Well, I guess you could make your own CountingStream and wrap it around Stream.Null.

  using (var zip = new ZipFile())
  {
    zip.AddFiles(...);
    var x = new MosheCountingStream(Stream.Null);
    zip.Save(x);
  }

I didn't make CountingStream public because I didn't think it would be very useful for anyone else and didn't want to document and support it.
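A minimal version of such a wrapper (the class name is just a placeholder, like "MosheCountingStream" above) could be a write-only pass-through Stream that tallies the bytes written:

```csharp
using System;
using System.IO;

// Pass-through stream that counts the bytes written to an inner stream.
// Wrap Stream.Null with it, hand it to ZipFile.Save(), then read BytesWritten.
public class CountingStream : Stream
{
    private readonly Stream _inner;
    private long _bytesWritten;

    public CountingStream(Stream inner) { _inner = inner; }

    public long BytesWritten { get { return _bytesWritten; } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        _inner.Write(buffer, offset, count);
        _bytesWritten += count;
    }

    public override void Flush() { _inner.Flush(); }

    public override bool CanRead  { get { return false; } }
    public override bool CanSeek  { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length   { get { return _bytesWritten; } }
    public override long Position
    {
        get { return _bytesWritten; }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count)
    { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin)
    { throw new NotSupportedException(); }
    public override void SetLength(long value)
    { throw new NotSupportedException(); }
}
```

Note it deliberately reports CanSeek = false; DotNetZip will then write the zip in streaming mode, which is fine for a bit bucket.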