Dealing with Broken >4Gig java.util.zip Files

Dec 2, 2010 at 6:10 AM

The brain-damaged java.util.zip generates ZIPs containing >4Gig files without ZIP64 headers.  (If they didn't want to support ZIP64, the library should have thrown an exception instead of silently producing corrupt output.)

We're trying to read some files from a vendor who has acknowledged the problem and is working on a fix, but hasn't delivered one yet.  Their recommended work-around is to extract the files with a tool that tolerates the damage (WinRAR or Info-ZIP).  Since we are using DotNetZip to stream the decompressed data directly from the ZIP file into a database through a SqlBulkCopy, that suggestion would be painful to implement: we'd have to spawn an external EXE to extract, and then either do nasty things between software layers to read the individual files, or build a new ZIP file with proper headers and stream from that.
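For context, the load path looks roughly like this - a simplified sketch with a toy one-column schema; the file names, table name, and connection string are placeholders, not our real ones:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using Ionic.Zip;

    class Loader
    {
        static void Main()
        {
            // Stream one entry out of the zip and bulk-load it,
            // without ever extracting to a file on disk.
            using (var zip = ZipFile.Read("vendor-data.zip"))
            using (var reader = new StreamReader(zip["data.csv"].OpenReader()))
            using (var conn = new SqlConnection("<connection string>"))
            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "VendorData" })
            {
                conn.Open();

                var batch = new DataTable();
                batch.Columns.Add("Line", typeof(string));   // toy schema: one column per line

                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    batch.Rows.Add(line);
                    if (batch.Rows.Count == 10000)           // flush in batches
                    {
                        bulk.WriteToServer(batch);
                        batch.Clear();
                    }
                }
                if (batch.Rows.Count > 0)
                    bulk.WriteToServer(batch);
            }
        }
    }

The point is that nothing in this pipeline ever touches a temporary file, which is exactly what the "just extract it with WinRAR first" work-around would force us to give up.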

Is there some sane way to get DotNetZip to read such files?  Is there code somewhere that can repair >4Gig ZIP files that lack ZIP64 headers (if there is even enough information in the file to recreate the ZIP64 headers)?

Currently, DotNetZip seems to read only (actual uncompressed length modulo 2^32) bytes and then stop, so our code then fails with a CRC mismatch.  E.g., a 5 GiB entry yields only 1 GiB.

Thanks.

Coordinator
Dec 3, 2010 at 3:24 PM

Hmm, interesting situation.  Currently DotNetZip cannot read such files, because of the violation of the zip spec.

>   Is there code somewhere that can repair >4Gig ZIP files that lack ZIP64 headers (if there is even enough information in the file to recreate the ZIP64 headers)?

I think this would be the best approach.  I don't know of existing code to do this, but it shouldn't be difficult to build pretty quickly, assuming it's a tactical tool - built specifically for this situation - and you're not intending to release it for use by the general public.

As to whether you could recreate the zip64 information given the "broken" zip, I think the answer is "probably."  A zip file isn't too complicated, and it should be easy to walk through it searching for zip entries.  All you'd need to do is identify where each entry starts and ends, and then you could compute all the zip64-required information.  The start of an entry is marked by a "signature" - a well-known 4-byte sequence (0x50 0x4b 0x03 0x04, the bytes "PK" followed by 0x03 0x04).  Usually the first entry is at the start of the file, and each successive entry directly follows the end of the prior one.  If that holds, it should be almost trivial to identify the breaks in the zip file.  But the zip spec allows entries to be separated by "junk" - there's no guarantee the entries are consecutive.  If there is junk in the file, it's harder to determine where the actual entries start and stop, which means it's harder to generate the zip64 information.
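To make it concrete, the scan itself is something like this - an untested C# sketch that only hunts for the signature; a real repair tool would then parse each local header and reconstruct the zip64 records:

    using System;
    using System.Collections.Generic;
    using System.IO;

    class SignatureScan
    {
        static void Main(string[] args)
        {
            // Local file header signature as it appears on disk: "PK\x03\x04"
            byte[] sig = { 0x50, 0x4b, 0x03, 0x04 };
            var offsets = new List<long>();

            using (var fs = File.OpenRead(args[0]))
            {
                int b, matched = 0;
                long pos = 0;
                while ((b = fs.ReadByte()) != -1)
                {
                    if (b == sig[matched]) matched++;
                    else matched = (b == sig[0]) ? 1 : 0;

                    if (matched == sig.Length)
                    {
                        offsets.Add(pos - 3);   // offset of the 'P' byte
                        matched = 0;
                    }
                    pos++;
                }
            }

            foreach (long o in offsets)
                Console.WriteLine("possible entry at offset {0}", o);
        }
    }

Note that the 4-byte pattern can also occur by chance inside compressed data, so each hit is a candidate to be validated against the header fields, not taken as gospel.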

If you're comfortable with byte I/O and analysis, you could build this tool yourself with a PowerShell script or similar - Perl, C#, anything that can read a buffer of bytes and seek through a file.  If that sounds too hard, you might want to consider hiring someone who knows the zip spec to do it for you.  Know anyone like that?

If you want me to build it, I'd need a copy of a zip file that exhibits the problem.  I'd have to charge you for my time - it'd probably take 4-8 hours.


Dec 4, 2010 at 12:43 PM

I started looking at the zip APPNOTE.TXT but felt ill after reading about the optional signature for the data descriptor.

Taking a step back and looking at what I really need to accomplish, the problem is actually a little simpler than the general case, since the zip file itself is much less than 4 GiB (~800 MB).  The only value that is wrong in the ZIP file is the uncompressed length field for the one file that is larger than 4 GiB; the offsets, compressed sizes, and such are all correct.

I added a length-limiting Stream class and had _ExtractOne() pass an instance of it to GetExtractDecryptor(), with a limit of this._CompressedFileDataSize bytes, instead of watching the number of bytes read (LeftToRead).  Limiting the read on the compressed side seems a more direct way of preventing bad things from happening with corrupt compressed data than limiting it after decompression.  One should probably at least issue a warning if the total decompressed byte count differs from what the headers claim.  In the case of large files with AWOL ZIP64 headers, one could perhaps verify that the 32-bit uncompressed size is either ~0 or equal to the actual length modulo 2^32 ("Java-damaged mode")...?
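The wrapper itself is nothing fancy - roughly along these lines (a simplified sketch, not my exact diff):

    using System;
    using System.IO;

    // Read-only wrapper that hands out at most `limit` bytes from the
    // underlying stream, then reports end-of-stream - no matter what
    // the (possibly bogus) headers claim.
    class LengthLimitingStream : Stream
    {
        private readonly Stream _inner;
        private long _remaining;

        public LengthLimitingStream(Stream inner, long limit)
        {
            _inner = inner;
            _remaining = limit;
        }

        public override int Read(byte[] buffer, int offset, int count)
        {
            if (_remaining <= 0) return 0;   // synthesize EOF at the limit
            int want = (int)Math.Min(count, _remaining);
            int got = _inner.Read(buffer, offset, want);
            _remaining -= got;
            return got;
        }

        public override bool CanRead  { get { return true; } }
        public override bool CanSeek  { get { return false; } }
        public override bool CanWrite { get { return false; } }
        public override void Flush() { }
        public override long Length { get { throw new NotSupportedException(); } }
        public override long Position
        {
            get { throw new NotSupportedException(); }
            set { throw new NotSupportedException(); }
        }
        public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
        public override void SetLength(long value) { throw new NotSupportedException(); }
        public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    }

With that in place, the "Java-damaged" check reduces to comparing the 32-bit size field against 0xFFFFFFFF (~0) or against the actual decompressed count modulo 2^32.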

I can't imagine the overhead of one more stacked Stream in the read path is meaningful (one more virtual function call per low-level read).  <shrug>  I can send a diff if you are curious.

This change lets me get at the data I was looking for (with valid CRCs).

There is a typo in the comment inside ZipEntry.Extract.cs' _ExtractOne(): "coould"

Coordinator
Dec 5, 2010 at 8:38 PM

ok, good.

It sounds like you've got things sorted.  Thanks for the tip about the typo.

Regarding a fix for the broken archives coming from Java... I'd hate to insert a stack of code into the library to handle a special case I've seen just once (in about 4 years), with no guarantee it will ever happen again.  That said, I think you're on the right track that we want the library to be resilient in the face of non-compliant zip files.  I'll have to look at your specific suggestions to see whether it's feasible to put more resiliency into the library.

But, as you can imagine, there are eleventy-seven ways for a zip file to be damaged or malformed.  So which ones do I handle?

I hate to put it this way, but a sample of one error isn't really a compelling justification to re-jigger the streaming.