GZip stream ending prematurely

Mar 23, 2009 at 6:57 PM
Edited Mar 23, 2009 at 9:21 PM
I have a .gz file that holds a 10 GB file compressed down to 900 MB.

When I read it in using GZipStream, it finishes after just a little over 1 GB of data.

Both WinRAR and WinZip extract a 10 GB file from this .gz.

The file is located here. This file is generated by a vendor, so I have no control over what software creates it.

The same behavior happens with System.IO.Compression.GZipStream. Is there some idiosyncrasy that WinRAR and WinZip handle that the "normal" code doesn't?

Thanks!

- Brian
Coordinator
Mar 23, 2009 at 10:14 PM
I will have to have a look...
Mar 24, 2009 at 10:51 AM
Ok, thanks!
Mar 31, 2009 at 2:50 PM
Hi Cheeso,

Have you had any chance to look into this?

I appreciate it.

- Brian
Coordinator
Mar 31, 2009 at 6:41 PM
I'm downloading it now, Brian.  I'll let you know what I find.
Apr 3, 2009 at 2:20 PM
Hi Cheeso, were you able to reproduce this problem?

- Brian
May 24, 2012 at 2:33 PM
Edited May 24, 2012 at 4:22 PM

Hi there, I have the same problem, with a file located here. Here's what I know about the file:

  • It's created on a Unix system outside my control. I'm trying to decompress it on Windows 7 64-bit.
  • It uncompresses in WinZip, 7-Zip and WinRAR just fine.
  • When I inspect the file with gzip.exe from www.gzip.org (download win32 exe here) using the command line gzip -l -v tm.xml.gz, it reports the uncompressed size as 1981 and the compression ratio as -9415.2% (obviously incorrect).
  • Testing it with gzip using the -t -v command line arguments reports the file as "OK".
  • The actual uncompressed size should be about 807,183 bytes.
  • GZipStream always decompresses the file to a size of only 4,103 bytes.
  • I have stepped through your code in the VS debugger, and it seems that Inflate.InflateFast returns ZlibConstants.Z_STREAM_END too early, based on the evaluation on line 1367, else if((e & 32) != 0). In other implementations I have seen, this condition usually indicates an end of block rather than an end of stream. However, this is a guess!

After debugging for a while I had to stop, because I don't have enough knowledge of GZip.

UPDATE: I have found multiple GZip header signatures throughout the file (202 to be precise). I'm thinking this may be a multi-part GZip file as described here (BGZF, "Blocked GNU Zip Format").
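
For anyone who wants to repeat that check, here is a rough sketch of the signature scan, written in Python rather than against the .NET library. The function name find_candidate_members is made up for illustration. It only looks for the three bytes that start a gzip header (1f 8b 08), so it is a heuristic: the same bytes can occur by chance inside compressed data, and every hit still has to be confirmed by actually decompressing from that offset.

```python
import gzip

# First three bytes of a gzip member header:
# ID1 = 0x1f, ID2 = 0x8b, CM = 8 (deflate).
GZIP_MAGIC = b"\x1f\x8b\x08"


def find_candidate_members(data: bytes) -> list:
    """Return every offset at which a gzip header signature appears.

    Heuristic only: a hit is a candidate member start, not proof of one.
    """
    offsets = []
    pos = data.find(GZIP_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(GZIP_MAGIC, pos + 1)
    return offsets


if __name__ == "__main__":
    # Build a small two-member file to try the scan on.
    blob = gzip.compress(b"a" * 1000) + gzip.compress(b"b" * 1000)
    print(find_candidate_members(blob))
```

A count well above 1 on a real file is a strong hint that it is made of several concatenated gzip members.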

UPDATE_2: I have split the file into 202 separate chunks, and each chunk is indeed a valid gzip file. The file as a whole is compliant with the GZip spec; however, the Java and DotNet GZipStream classes don't seem to support multi-part "Blocked GZip" files.

It would be really great to either support it, or detect that it's a multi-part GZip file and return a meaningful error message. At the moment, it just decompresses the first GZip block and ignores the rest of the file. I'm now going to look into how I can alter the code to support this, or write some kind of pre-processor that decompresses each chunk in turn and concatenates the results.
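
To make the idea concrete, here is a minimal sketch of the "keep going after the first member" approach. It uses Python's zlib module, not DotNetZip, and the function names are my own. It assumes the whole file fits in memory and that nothing but gzip members follows the first one; trailing padding or garbage would raise zlib.error.

```python
import gzip
import zlib


def decompress_all_members(data: bytes) -> bytes:
    """Decompress every gzip member in `data` and join the results."""
    out = []
    while data:
        # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)
        out.append(member.decompress(data))
        out.append(member.flush())
        if not member.eof:
            raise ValueError("truncated gzip member")
        # Bytes after the end of this member belong to the next one.
        data = member.unused_data
    return b"".join(out)


def first_member_only(data: bytes) -> bytes:
    """Mimic the behaviour described above: stop after the first member."""
    member = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return member.decompress(data)


if __name__ == "__main__":
    parts = [b"first ", b"second ", b"third"]
    blob = b"".join(gzip.compress(p) for p in parts)
    print(first_member_only(blob))
    print(decompress_all_members(blob))
```

The loop is the important part: when one member reports end of stream, the decoder is re-created and fed whatever input is left, instead of treating end of member as end of file. For a multi-gigabyte file the same loop would need to read and write in chunks rather than hold everything in memory.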

thanks

Kris