GZipStream isn't making the output file consistent

Aug 10, 2011 at 9:00 PM

Below is sample code to reproduce the issue.  

If you take the exact same unmodified source file and use GZipStream to compress it repeatedly, the compressed file's hash does not stay consistent. If the compression settings don't change and the source file is unmodified, the output file should be identical every time. Watching the hashes go by on the screen, a new hash appears about every second, so is GZipStream embedding the current timestamp in the file or something?

I tested with 7zip GUI app for GZip and the file hash for the compressed file remains constant as long as the source file is unmodified and the compression settings are the same.

Maybe I am doing something wrong with GZipStream, or 7zip is doing something wrong or non-standard. Is this normal for compressed files? I would think not. Please advise.

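using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using Ionic.Zlib; // DotNetZip's GZipStream

class Program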
{
    private const int WORKING_BUFFER_SIZE = 4096;

    static void Main(string[] args)
    {
        string fileToCompress = Path.Combine(Environment.GetEnvironmentVariable("windir"), "notepad.exe");
        HashSet<string> uniqueHashes = new HashSet<string>();
        int workingSet = 1000;
            
        for (int i = 0; i < workingSet; i++)
        {
            string outputFile = Path.Combine(@"C:\", Guid.NewGuid() + ".gz");
            Compress(fileToCompress, outputFile);
            string hash;
            using (Stream stream = File.OpenRead(outputFile))
            {
                hash = ComputeSha1(stream);
                uniqueHashes.Add(hash);
            }
            Console.WriteLine("{0}", hash);
            File.Delete(outputFile);
        }

        Console.WriteLine("{0} unique hashes generated out of {1} compression attempts.", uniqueHashes.Count, workingSet);
    }

    private static void Compress(string fileToCompress, string outputFile)
    {
        using (Stream input = File.OpenRead(fileToCompress))
        using (Stream output = File.Create(outputFile))
        using (GZipStream compressor = new GZipStream(output, CompressionMode.Compress, CompressionLevel.BestCompression, false))
        {
            byte[] buffer = new byte[WORKING_BUFFER_SIZE];
            int n;
            while ((n = input.Read(buffer, 0, buffer.Length)) != 0)
                compressor.Write(buffer, 0, n);
        }
    }

    public static string ComputeSha1(Stream stream)
    {
        using (SHA1 hasher = SHA1.Create())
        {
            stream.Seek(0, SeekOrigin.Begin); // rewind so the whole stream is hashed
            byte[] hashBytes = hasher.ComputeHash(stream);
            return BitConverter.ToString(hashBytes).Replace("-", string.Empty);
        }
    }
}

Aug 10, 2011 at 9:20 PM

Just to test against 7zip, I used the official release of GZip from gzip.org, and its output hash changes about every second as well. So this may be working as designed, and it is 7zip that isn't doing things quite right, merely producing compatible output.

I noticed 7zip doesn't add the CRC, Host, or original and packed size correctly to the file.  

I also noticed DotNetZip GZipStream does set all those correctly EXCEPT for Host which it sets to Unknown when it should be NTFS.

Anyway, the timestamp that seems to cause this may be part of the standard and perfectly normal. It just makes certain types of tests harder.
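
For tests that only care about the compressed payload, one possible workaround is to hash the file with the timestamp masked out. The sketch below assumes the layout from RFC 1952, where the 4-byte modification time sits at offsets 4 to 7 of the gzip header. The class and method names (GZipHashing, Sha1IgnoringMTime) are made up for illustration and are not part of DotNetZip.

```csharp
using System;
using System.Security.Cryptography;

public static class GZipHashing
{
    // Hypothetical helper: SHA-1 of a gzip file with the MTIME field zeroed,
    // so two files that differ only in their timestamp hash identically.
    public static string Sha1IgnoringMTime(byte[] gzipBytes)
    {
        // A gzip member starts with a 10-byte fixed header whose first two bytes are 1F 8B.
        if (gzipBytes == null || gzipBytes.Length < 10 ||
            gzipBytes[0] != 0x1F || gzipBytes[1] != 0x8B)
        {
            throw new ArgumentException("Not a gzip stream.", "gzipBytes");
        }

        // Work on a copy so the caller's buffer is left untouched.
        byte[] copy = (byte[])gzipBytes.Clone();
        Array.Clear(copy, 4, 4); // zero header bytes 4..7 (MTIME)

        using (SHA1 hasher = SHA1.Create())
        {
            return BitConverter.ToString(hasher.ComputeHash(copy)).Replace("-", string.Empty);
        }
    }
}
```

In the test loop above, this would be called as GZipHashing.Sha1IgnoringMTime(File.ReadAllBytes(outputFile)) in place of ComputeSha1. It only masks the timestamp; any other header difference (file name, OS byte) still changes the hash.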

Coordinator
Aug 11, 2011 at 4:59 AM

You are correct: the gzip output will change, the way you are doing it. The RFC (RFC 1952) says that GZIP includes an MTIME field, holding the last-modified time of the source of the archive's contents. The RFC specifies that if the data did not come from a file, the compressor should use the time that compression started, in other words, DateTime.Now. The precision of the MTIME field is 1 second, so it makes sense that the output would change about every second. You can explicitly set the LastModified property on the GZipStream before making the first Write() to it, in order to set the MTIME. It would make sense to use the last-modified time of the input file for that property. If you set it explicitly this way, you should get the same hash value for the output over repeated runs.

The OS field is set to 0xFF (255), which the RFC lists as "unknown". I suppose the stream could set it to 11 for "NTFS" (which is a file system, not an OS, anyway), but I have always viewed that bit of metadata as irrelevant. Ostensibly, it was intended to help decompressors figure out what to do with the newlines in the file, but I think there are better ways of doing that. So the GZipStream in DotNetZip just sets the OS to "unknown". I don't think that's wrong, nor would 11 be wrong, except that DotNetZip runs on non-NTFS platforms as well, so 11 wouldn't always be right.
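
To see what a given .gz file actually carries in these two fields, the fixed 10-byte header can be read directly. This is a minimal sketch based on the RFC 1952 layout (MTIME at offsets 4 to 7, little-endian seconds since 1970-01-01 UTC; OS at offset 9). GZipHeaderInfo is a hypothetical helper name, not a DotNetZip API.

```csharp
using System;
using System.IO;

public static class GZipHeaderInfo
{
    // Hypothetical helper: reads the fixed gzip header and reports MTIME and OS.
    // mtimeUtc is null when the field is 0, which the RFC defines as "no time stamp available".
    public static void Read(Stream gz, out DateTime? mtimeUtc, out byte os)
    {
        byte[] h = new byte[10];
        int read = 0;
        while (read < h.Length)
        {
            int n = gz.Read(h, read, h.Length - read);
            if (n == 0)
                throw new InvalidDataException("Stream is shorter than a gzip header.");
            read += n;
        }

        if (h[0] != 0x1F || h[1] != 0x8B)
            throw new InvalidDataException("Not a gzip stream.");

        // MTIME: 4 bytes, least significant byte first.
        uint seconds = (uint)h[4] | ((uint)h[5] << 8) | ((uint)h[6] << 16) | ((uint)h[7] << 24);
        mtimeUtc = seconds == 0
            ? (DateTime?)null
            : new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc).AddSeconds(seconds);

        // OS: e.g. 0 = FAT, 3 = Unix, 11 = NTFS, 255 = unknown.
        os = h[9];
    }
}
```

Calling it on a stream from File.OpenRead on two outputs produced a few seconds apart should show whether MTIME is the only field that moved.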

Aug 11, 2011 at 3:23 PM

I agree that the OS/host setting is not critical these days. I do find it strange that implementations, even the official GZip one, seem to set it to the file system rather than the OS.

I am curious that GZipStream is not getting the LastModified time from the stream it is compressing, shouldn't that be available and it wouldn't have to use DateTime.Now? Or is it the LastModified time of the Archive being created, which if new was just created so it is basically DateTime.Now?

Coordinator
Aug 11, 2011 at 3:56 PM

> I am curious that GZipStream is not getting the LastModified time from the stream it is compressing, shouldn't that be available and it wouldn't have to use DateTime.Now?

The GZipStream compresses during calls to Write(). The GZipStream has no idea where you are getting the data you pass to Write().  The LastModified time is not "available" to GZipStream unless you set it, explicitly.  

> Or is it the LastModified time of the Archive being created, which if new was just created so it is basically DateTime.Now?

No, that's not it.

 

Aug 11, 2011 at 3:59 PM

Thanks for the clarification.

Coordinator
Aug 11, 2011 at 4:02 PM

Like this:

    private static void Compress(string fileToCompress, string outputFile)
    {
        using (Stream input = File.OpenRead(fileToCompress))
        using (Stream output = File.Create(outputFile))
        using (GZipStream compressor = new GZipStream(output, CompressionMode.Compress, CompressionLevel.BestCompression, false))
        {
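            // Set the header metadata before the first Write() so it is emitted with the header.
            var fi = new FileInfo(fileToCompress);
            compressor.FileName = fileToCompress; // <---- original file name stored in the header
            compressor.LastModified = fi.LastWriteTime; // <----- fixed MTIME instead of DateTime.Now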
            byte[] buffer = new byte[WORKING_BUFFER_SIZE];
            int n;
            while ((n = input.Read(buffer, 0, buffer.Length)) != 0)
                compressor.Write(buffer, 0, n);
        }
    }