extracting large files (>4GB) from zip file

Jul 22, 2009 at 12:27 PM
Edited Jul 22, 2009 at 12:29 PM

I have some large zip files which contain files that are over 4GB when extracted (16GB in the current case). I can extract the files without problems using WinZip 8.1 and WinRAR 3.60, but when I try using DotNetZip I get the following error once 4GB has been extracted:

CRC error: the file being extracted appears to be corrupted. Expected 0x33384163, Actual 0x7C445495

I am using a Script Task in SSIS and the code used is as follows:

    Private Sub Decompress()
        ' Extract every entry next to the zip file, overwriting existing files.
        Using m_ZIP As ZipFile = ZipFile.Read(Me.ZIPFileName)
            Dim m_ZipEntry As ZipEntry
            For Each m_ZipEntry In m_ZIP
                Try
                    m_ZipEntry.Extract(Path.GetDirectoryName(Me.ZIPFileName), ExtractExistingFileAction.OverwriteSilently)
                Catch ex As Exception
                    ' The CRC error quoted above is reported here.
                    MessageBox.Show(ex.Message)
                End Try
            Next
        End Using
    End Sub

 

Could you let me know if there is something I am doing wrong, or whether it is possible to perform this extraction at all? The zip file is supplied by a 3rd party, and having had other dealings with them I doubt I will be able to get information from them as to how it was created.

Thanks in advance

 

Coordinator
Jul 22, 2009 at 2:20 PM

Your code looks fine.

What version of DotNetZip are you using?  Earlier versions had some problems with very large ZIP files.

How do you know the problem happens with 4gb extracted - how can you tell?

 

Jul 22, 2009 at 9:57 PM

Thanks for the confirmation of the code.

The version I am using is the latest version downloaded from here last week, 1.8 (not sure of the exact release as it is on my work pc and I am currently at home).

I can see it reaches 4GB (or thereabouts) because I keep the destination folder open, and watch it update, during the extraction.

Coordinator
Jul 22, 2009 at 11:05 PM

Hmmmm, ok.  I will run some tests. 

Jul 23, 2009 at 6:40 AM

Cheeso

Just to confirm the version I am using is 1.8.4.5.

Also the attached screenshot might help you. As you can see the file (Datoscsmo_ufd_20090627.txt) in the zip file shows as being 4GB, but when extracted it is 16GB. Might this be part of the problem?

Regards

Coordinator
Jul 23, 2009 at 11:01 AM

Yes - the size mismatch you mentioned could be part of the problem.

Where's the screenshot?  Is the screenshot produced by DotNetZip?  Or WinZip?

Is the zipfile produced by DotNetZip, or some other tool?

Can you restate the problem for me.  Originally the problem was, DotNetZip was throwing when you were extracting, and when the app had extracted more than 4gb on a single file.  Is that still the problem?

 

Jul 23, 2009 at 11:18 AM

I attached a screen shot in my previous e-mail reply. I don't seem to be able to paste into here so if you can't get it from the previous message let me know how I can get it to you if it is necessary. It was taken from Winzip, although I get the same results from WinRar and stepping through the code DotNetZip only recognizes the file size as 4,048,964 kb in size compared to the 16,536,980 KB of the unzipped file in Windows Explorer.

I have no idea as to how the zip file is produced as it is supplied by a 3rd party and, as mentioned before, they have been singularly unhelpful in helping us out with this data previously. Saying that I would be pretty certain that the file is created with an application such as WinZip.

The original problem is that DotNetZip throws the error "CRC error: the file being extracted appears to be corrupted. Expected 0x33384163, Actual 0x7C445495" on extraction upon reaching 4GB. At present I only have one file of this size and if I catch the error and move on to the next file within the zip file that extracts okay.

Thanks.

Coordinator
Jul 23, 2009 at 11:37 AM

ok, clear.  There's a gateway that posts the email that you send, to the forums.  When I respond to you I am using the forums on codeplex.com, not sending/receiving email.   In the forums UI, it is just HTML and you can embed a picture.  It needs to be available somewhere.  I often use tinypic.com for the purpose.  The steps are:  take screenshot, save screenshot, upload to tinypic, get URL, edit HTML on codeplex forum, paste URL to picture.

Are you saying that WinRar and WinZip can successfully extract the file? In the screenshot, is it clear whether the 4gb size refers to the uncompressed or the compressed size?  Which is it?

When reading or extracting a zip archive, DotNetZip first looks in the zipfile for the metadata for the entry.  This metadata includes the uncompressed size, the compressed size, the filename, the compression method, the CRC, a comment, and other stuff.   If the app then calls ZipEntry.Extract(), DotNetZip will decompress the bytes in the zip file, until reaching the "compressed size" number of bytes.  It then compares the CRC in the metadata with the resulting CRC of the uncompressed data.

Normally this all works nicely.  But what appears to be happening in your case is the CRC check is failing.  This can happen if

  1. The CRC stored in the metadata is invalid
  2. the compressed size stored in the metadata is invalid

If the problem is that there is not enough compressed data in the file, you should get a different error - a length check should fail.  Supposing the original compressed size was 4.8gb, and the zip file had only 4.2gb, the decompress will throw when it runs out of data.  If there is a problem with corruption of the compressed data in the zip file, you will get yet another different error, a ZLIB protocol error. 

Not all unzip tools verify the CRC of an entry on extract, but I believe WinZip does.

It sounds to me that you may have the latter case - the compressed size may not be valid.
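
The per-entry metadata described above can be printed for inspection. This is a minimal sketch, assuming the `FileName`, `CompressedSize` and `UncompressedSize` properties of `ZipEntry`; the module name and the command-line argument are illustrative only.

```vb
Imports System
Imports Ionic.Zip

Module ListEntrySizes
    Sub Main(ByVal args As String())
        ' args(0) is the path to the zip file to inspect (hypothetical input).
        Using zip As ZipFile = ZipFile.Read(args(0))
            For Each entry As ZipEntry In zip
                ' Sizes as recorded in the archive metadata, in bytes.
                Console.WriteLine("{0}: compressed={1}, uncompressed={2}", _
                                  entry.FileName, entry.CompressedSize, entry.UncompressedSize)
            Next
        End Using
    End Sub
End Module
```

Comparing these values with the size of the file another tool actually writes shows which stored field, if any, disagrees with the data.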

 

Jul 23, 2009 at 11:51 AM

Thanks for your quick reply. I will keep notes of how to attach an image for future reference.

As you say, it seems to be the compressed size or the CRC in the metadata that is causing the problem. We download a file from the 3rd party on a monthly basis, and looking at the old files, the compressed size stored in the metadata never equals the size of the extracted file, so something in the way the zip file is produced is not working correctly. I will get in touch with them about the problem it is causing us, but I don't expect a reply.

It looks like for this file only (your library works really well for all other zip files we receive) I'll have to look at an alternative method of extraction (probably by directly calling WinZip from my app).

Kind regards

Coordinator
Jul 23, 2009 at 12:29 PM

??

Just one thing - you said the "compressed size stored in the metadata never equals the size of the extracted file" .  In a zip, the uncompressed size stored in the metadata should equal the size of the extracted file.

As for alternative methods of extraction, if you use DotNetZip and open a read stream on the entry, the CRC check is not done by the library.  The app is expected to do that itself.  Check out the doc for ZipEntry.OpenReader().   That might work for you if the only problem is a CRC check.

 

 

Jul 24, 2009 at 10:44 AM

Sorry I got my Uncompressed and Compressed mixed up. The compressed size does show correctly, the uncompressed doesn’t (this depends a bit on how you class correctly, as it does show the same uncompressed size as displayed when the zip file is opened in Winzip/Winrar, but this is not the actual size of the final extracted file). Either way it looks as though it is more a problem with the actual file than your library.

I will try the OpenReader method if I get a chance.

Thanks for all your help.

Coordinator
Jul 24, 2009 at 11:14 AM

If WinZip, WinRar and DotNetZip all agree on the uncompressed size of an entry, and that size does not match the size of the actual file used to produce the entry, then it seems like there was an error during the creation of that zip file.

 

Aug 19, 2009 at 9:24 AM

Cheeso

I have tried using the OpenReader method and have pasted my code below, but I am still unable to fully extract the large files from the zip file. Further below I have added a screen shot which shows the zip file and the extracted file.

The file that is causing me problems is Datoscsmo_ufd_20090801.txt. As you can see, the uncompressed size of the file within the zip file is 1,142,685,138 bytes, which is the final value of totalBytesRead when the code completes. However, extracting the file using WinZip leaves me with a file of over 13GB (14,027,587,026 bytes according to the file properties).

This may be something I am unable to resolve using the DotNetZip library, but I would welcome any comments you may have.

Thanks

 

 

        Using zip As New ZipFile(Me.ZIPFileName)
            Dim e1 As ZipEntry
            For Each e1 In zip
                sNombreFichero = Path.Combine(Path.GetDirectoryName(Me.ZIPFileName), Path.GetFileName(e1.FileName))
                Using swWrite As Stream = File.Create(sNombreFichero)
                    Using s As Ionic.Zlib.CrcCalculatorStream = e1.OpenReader
                        Dim n As Integer
                        Dim buffer As Byte() = New Byte(4096) {}
                        Dim totalBytesRead As Integer = 0
                        Do
                            n = s.Read(buffer, 0, buffer.Length)
                            swWrite.Write(buffer, 0, buffer.Length)
                            totalBytesRead = (totalBytesRead + n)
                        Loop While (n > 0)
                    End Using
                End Using
            Next
        End Using



Aug 19, 2009 at 11:36 AM

use

Dim totalBytesRead as Long

i.e. use a 64 bits integer

 

Aug 19, 2009 at 12:51 PM

Gilles

Good spot that, but at the moment that is not causing the error. Even using Dim totalBytesRead as Long, the final extracted file size with DotNetZip is 1,142,685,138 bytes, compared to over 13GB when extracted using WinZip or WinRar. The problem seems to be that DotNetZip is only expecting that amount of data to be extracted and stops at that point, despite the fact that there must be more data within the zip file.
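
One observation on the two sizes reported in this thread, offered as an inference rather than a confirmed diagnosis: 14,027,587,026 bytes (the file WinZip writes) and 1,142,685,138 bytes (the stored uncompressed size) differ by exactly 3 x 2^32. The classic zip size fields are 32 bits wide, so this is the value that would be stored if the archive had been written without the ZIP64 extensions and the true size had wrapped around. The arithmetic can be checked with a few lines of VB (the module name is hypothetical):

```vb
Imports System

Module WrapCheck
    Sub Main()
        Dim extractedSize As Long = 14027587026L   ' size of the file written by WinZip
        Dim fieldRange As Long = 4294967296L       ' 2^32, the range of a 32-bit size field
        ' Remainder after discarding everything above 32 bits.
        Console.WriteLine(extractedSize Mod fieldRange)   ' prints 1142685138
    End Sub
End Module
```

The printed value equals the UncompressedSize reported for the entry, which is consistent with a size field that overflowed when the archive was created.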

Aug 19, 2009 at 1:53 PM

When the file is uncompressed (with the inflate algorithm), the end of the file can be detected from the data itself (when Read returns 0 bytes). If the library stops after reading UncompressedSize bytes while there is actually more data, the CRC will probably be wrong.

This makes me think that .zip files can, conversely, contain holes: data not described by any entry. That is not pleasant from a security point of view, but as I understand it, this is how self-extracting archives work.

Coordinator
Aug 19, 2009 at 3:26 PM

Lew, when you do

  swWrite.Write(buffer, 0, buffer.Length)

you probably want, instead,

  swWrite.Write(buffer, 0, n)

This is because the preceding stream read can be partially filled, and only the first n bytes are valid.
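
Putting that fix together with the earlier suggestion to count bytes in a Long, the copy loop could look like the sketch below. It uses only System.IO streams; the module and function names are hypothetical.

```vb
Imports System.IO

Module StreamCopy
    ' Copies everything from source to destination and returns the byte count.
    Function CopyStream(ByVal source As Stream, ByVal destination As Stream) As Long
        Dim buffer(4095) As Byte              ' 4096 bytes: VB array bounds are inclusive
        Dim total As Long = 0                 ' Long, so counts above 2GB do not overflow
        Dim n As Integer = source.Read(buffer, 0, buffer.Length)
        While n > 0
            destination.Write(buffer, 0, n)   ' write only the n bytes just read
            total += n
            n = source.Read(buffer, 0, buffer.Length)
        End While
        Return total
    End Function
End Module
```

In the code posted above, this would be called as CopyStream(s, swWrite) inside the two Using blocks.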

Secondly, what WinZip is showing and doing seems confusing to me. It reports an uncompressed size of 1gb, but then when you unpack, it generates a file of size 13gb. Is that right? Am I understanding correctly? If so, that indicates a problem in WinZip, to me.

What does DotNetZip tell you the uncompressed size is? You seem to be telling me that when extracting with DotNetZip, you get a file of size X, where X is the UncompressedSize of the ZipEntry. This seems to be correct behavior on the part of DotNetZip, unless I am misunderstanding something.

The surprise is that WinZip's reported UncompressedSize does not match the actual Uncompressed size of a file, when extracted by WinZip. This seems to be a problem in WinZip, if I am understanding the situation correctly.

Gilles: yes, the zip file can contain arbitrary data, and there is no requirement that the data for each zip entry immediately follows the previous one. It is not only self-extracting archives that work this way. And yes, this is a way that viruses have been delivered. Scanning engines must check zip files for malicious payloads, because of this.

Aug 20, 2009 at 7:11 AM

Cheeso

Thanks for your reply, but I still get the same result with DotNetZip, whereby the size of the extracted file is equal to the UncompressedSize of the ZipEntry.

With both WinZip and WinRar, viewing the zip file within the application shows the zipped file as having a size of 1GB, yet both applications extract a file of 13GB.

Basically your understanding of my situation is correct. Unfortunately, as mentioned before the file in question is one we receive on a monthly basis from a 3rd party who have, up to now, been wholly uncooperative when we have requested help. I have been able to find a work around using something other than DotNetZip but for consistency was hoping to use DotNetZip for these files as well.

Thanks for your help

 

Coordinator
Aug 20, 2009 at 3:06 PM

Lewshouse, I don't think I understand the situation.

You're telling me that DotNetZip is creating a file with THE EXPECTED SIZE.  The UncompressedSize IS the expected size of the file after de-compression.  And DotNetZip is producing a file of that size.  And you're telling me, you don't want that.  This is what I don't understand. 

It seems to be correct behavior on the part of DotNetZip.  I'm not trying to be difficult here. I understand that WinZip and WinRar both produce a file with 13gb size, but it does not seem to be correct behavior.  Even WinZip shows the UncompressedSize as 1.1gb, so why should it produce a file of size 13gb?  

If I have a file that is 1000 bytes normally, and I zip it up into 400 bytes, then I get an entry that has UncompressedSize = 1000, and CompressedSize = 400.  When I extract that file I expect it to be 1000 bytes in the filesystem.  This is what is happening with you, though the numbers are higher.  DotNetZip is producing a file upon extraction that has a size equal to the UncompressedSize.  This seems like correct behavior.

I don't know where the 13gb is coming from. 

If you're telling me that the original file is or was 13gb, and when zipped, the zipfile says the UncompressedSize was around 1.1gb (as I think you mentioned), then it seems like the zip file is broken.  And if the zipfile is broken, there needs to be a change by whoever is producing it.  And yes, I understand they're not responsive, but ... (shrug). 

I also understand that it's frustrating that WinZip produces a 13gb file.  But like I said, it seems to be broken behavior. Something isn't right.

Sometimes a zipfile can be internally inconsistent.  There are two places to store the metadata for each entry - the name, the uncompressed and compressed sizes, and so on.  One is called the "entry header", and the other is called the "Central directory".   In some cases the entry header and the central directory get out of sync.  There is a CheckZip method on the ZipFile class that can correct these problems, in some cases.  You might try running your zipfile through that method.  It may help.
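
A minimal sketch of that call, assuming the static single-argument form of `ZipFile.CheckZip` that returns a Boolean; the module name and the command-line argument are illustrative only.

```vb
Imports System
Imports Ionic.Zip

Module CheckArchive
    Sub Main(ByVal args As String())
        ' args(0) is the path to the zip file to check (hypothetical input).
        ' A True result is taken here to mean that no inconsistency was found.
        Dim consistent As Boolean = ZipFile.CheckZip(args(0))
        Console.WriteLine("CheckZip returned: {0}", consistent)
    End Sub
End Module
```

A True result only says that the two copies of the metadata agree with each other. It does not show that the sizes they record match the data actually stored.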

Aug 21, 2009 at 7:24 AM

Cheeso

Thanks for your replies and attempts to help in this matter. Unfortunately although DotNetZip functions correctly and extracts a file of the size expected from the UncompressedSize property I need to extract the whole, in this case, 13GB, and need to be able to do it programmatically without recourse to directly calling an application such as WinZip.

Since I started this job I have received four versions of the zip file from the 3rd party, and in each case the file extracted by WinZip/WinRar is significantly larger than the uncompressed size they show. Something is obviously going wrong during the creation of the zip file, but as I cannot get any response from the supplier (a large Spanish electricity supplier) I am unable to find out what it is.

For now I will have to look at alternatives which is a shame as if I change from DotNetZip for this supplier I will have to do so for the other 4 we use and therefore re-write code that functions perfectly.

Aug 21, 2009 at 8:13 AM

Cheeso,

Would it make sense to read the deflated stream until its end (Read returning 0), not taking into account the stored uncompressed size?

It seems other utilities are doing it that way, and it would not modify the behavior for correct files.
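
For anyone who wants to try that idea outside the library, here is a rough sketch. It does not use DotNetZip. It relies on System.IO.Compression.DeflateStream from the .NET Framework and on the fixed 30-byte layout of a zip local file header. The names InflateEntryAt, zipPath, headerOffset and outPath are hypothetical. It assumes an unencrypted, deflate-compressed entry whose local header offset is known (typically 0 for the first entry in an ordinary archive). Whether it recovers the full file from the archive discussed here is untested.

```vb
Imports System
Imports System.IO
Imports System.IO.Compression

Module InflateToEnd
    ' Inflates the deflate-compressed entry whose local file header starts at
    ' headerOffset, ignoring the sizes stored in that header. Returns bytes written.
    Function InflateEntryAt(ByVal zipPath As String, ByVal headerOffset As Long, _
                            ByVal outPath As String) As Long
        Dim total As Long = 0
        Using fs As FileStream = File.OpenRead(zipPath)
            fs.Seek(headerOffset, SeekOrigin.Begin)
            Dim reader As New BinaryReader(fs)
            If reader.ReadUInt32() <> &H4034B50UI Then   ' local file header signature
                Throw New InvalidDataException("No local file header at this offset.")
            End If
            fs.Seek(4, SeekOrigin.Current)               ' skip version needed and flags
            If reader.ReadUInt16() <> 8 Then             ' 8 = deflate
                Throw New NotSupportedException("Entry is not deflate-compressed.")
            End If
            fs.Seek(16, SeekOrigin.Current)              ' skip time, date, CRC and both sizes
            Dim nameLength As Integer = reader.ReadUInt16()
            Dim extraLength As Integer = reader.ReadUInt16()
            fs.Seek(nameLength + extraLength, SeekOrigin.Current)

            ' Read until the deflate stream itself reports its end (Read returns 0).
            Using inflater As New DeflateStream(fs, CompressionMode.Decompress, True)
                Using output As FileStream = File.Create(outPath)
                    Dim buffer(65535) As Byte
                    Dim n As Integer = inflater.Read(buffer, 0, buffer.Length)
                    While n > 0
                        output.Write(buffer, 0, n)
                        total += n
                        n = inflater.Read(buffer, 0, buffer.Length)
                    End While
                End Using
            End Using
        End Using
        Return total
    End Function
End Module
```

Because a deflate stream marks its own final block, the loop stops at the end of the compressed data rather than at a stored size. No CRC check is performed, so the output should be validated by other means.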

Coordinator
Aug 21, 2009 at 8:44 AM

GM - cannot really do that. The problem is, if I keep reading past the "uncompressed size" I am reading metadata in the zipfile for the next entry.  It will break.

LH - what about trying the CheckZip method?  Have you considered that?  It's a simple API call.  Worth a try. 

Aug 21, 2009 at 9:23 AM

Cheeso

I’ve tried the CheckZip method but that returns true.

Thanks

Lewshouse

Coordinator
Aug 21, 2009 at 3:52 PM

well then I'm at a loss for how to help.

you can't get me the zip file because of privacy concerns.

By your description, DotNetZip appears to be working correctly, although I understand it is not working the same as WinZip or WinRar.

I don't know how to fix this.

Good luck.

 

Aug 24, 2009 at 8:12 AM

Perhaps the compressedsize is smaller than the actual deflated data in entry, and Winzip continues inflating after compressedsize bytes have been read, until the inflate algorithm returns 0 bytes.

Aug 24, 2009 at 2:26 PM

If you can recompile the code from Cheeso, try the following change to OpenReader to see if the problem is actually here:

    private Ionic.Zlib.CrcCalculatorStream InternalOpenReader(string password)
    {
        // ...

        // change
        return new Ionic.Zlib.CrcCalculatorStream((CompressionMethod == 0x08) ? new Ionic.Zlib.DeflateStream(input2, Ionic.Zlib.CompressionMode.Decompress, true) : input2, _UncompressedSize);

        // into
        return (CompressionMethod == 0x08) ? new Ionic.Zlib.CrcCalculatorStream(new Ionic.Zlib.DeflateStream(input2, Ionic.Zlib.CompressionMode.Decompress, true)) : new Ionic.Zlib.CrcCalculatorStream(input2, _UncompressedSize);

 

Nov 22, 2009 at 12:56 PM

Hi, every time I extract a game or some other big file, it says that the cabinet file required for the installation is corrupt, and that there is a problem with this pack or that the CD/DVD may be corrupt.

The same file is easily extracted on my friend's PC. Kindly help me, please.

Coordinator
Nov 22, 2009 at 1:39 PM

Sorry, I can't understand what you're asking.

Also, I don't know anything about cabinet files. I think maybe you're in the wrong place.

If you really are using DotNetZip, open a new discussion, ..pls.