How to write a good incremental backup strategy?

Nov 2, 2009 at 12:58 AM
Edited Nov 2, 2009 at 1:03 AM

Hello to everybody!

I'm writing a C# class for a backup utility with an incremental backup option, but my strategy is terribly slow. How can I optimize it? Am I doing something wrong?

For my tests, I used:

  • a folder (my Thunderbird profile folder: about 660MB on disk, roughly 6500 files and 86 folders);
  • an existing (not yet updated) zip file of that Thunderbird profile folder.

The first, complete zip takes 3 minutes and 23 seconds (the compressed zip is about 340MB).

The incremental (partial) backup takes 2 minutes and 35 seconds (the compressed zip of the incremental backup is about 160MB): one minute and 17 seconds of that is lost comparing the files on disk against the entries' ModifiedTime (that is, one minute and 17 seconds pass before the Save() method is even invoked).

Well, now let me show you my code:

ZipFile OldZip = ZipFile.Read(nameOfZipFile);
ZipFile NewZip = new ZipFile();

NewZip.UseZip64WhenSaving = Zip64Option.AsNecessary;
NewZip.CompressionLevel = CompressionLevel.Level2;
NewZip.UseUnicodeAsNecessary = true;

foreach (FileInfo currentFileOnDisk in FileList)
{
    // Map the on-disk path to the entry name used inside the old zip
    string entryName = currentFileOnDisk.FullName.Replace(mailclientsClass.thunderbirdProfilePath, "ThunderbirdProfile");
    ZipEntry candidateZipEntry = OldZip[entryName];

    // Add the file if it is new, or if it changed since the last backup
    if (candidateZipEntry == null || currentFileOnDisk.LastWriteTimeUtc > candidateZipEntry.ModifiedTime)
        NewZip.AddFile(currentFileOnDisk.FullName, Path.GetDirectoryName(entryName));
}

NewZip.Save(nameOfZipFile + incrementalCounter.ToString());

Yes, I know this code is not a complete incremental backup strategy; I still need more instructions to handle removed files and so on. But for now my problem is the slowness. Is it possible to optimize it?

 

Many thanks!

Bye,

Alessandro.

Coordinator
Nov 4, 2009 at 9:22 AM
Edited Nov 4, 2009 at 9:48 AM

Ciao Alessandro,

I have a couple of suggestions.

  1. The ZipOutputStream class can be slightly faster to use than the ZipFile class.  In my tests on a selection of regular filesystem files, the difference is about 5-10%.  It is less powerful, though, and cannot update existing zip files; the compression is equivalent.  You may want to check it out.
  2. You can fiddle with the CodecBufferSize and BufferSize properties on the ZipFile instance.  These can significantly affect performance.  The optimal setting depends on the sizes of the files you are zipping, the memory you have available, and the relative speed of your disk access.  I try to make a good guess for what they should be, but you may find some improvement by altering them.
  3. In v1.9.0.29, which I have just released, there is a multi-threaded deflate implementation, which can cut the time to compress a file by 45% on a simple dual-core laptop, or more on a 4-processor machine.  There is one new programming interface, a property called ParallelDeflateThreshold, which lets you set the file size above which multi-threaded deflate is used.  Multi-threaded deflate is beneficial when the file being compressed is larger than about 300k; below that it is detrimental.  But this varies by machine, so you can set the threshold as appropriate for your scenario.  Multi-threaded deflate also works with ZipOutputStream, though it tends to be less effective at lower compression levels.
  4. You may also save some time by calling File.GetLastWriteTimeUtc instead of constructing a FileInfo for each file.  FileInfo gathers other information that you may not need, and it can be faster to get just the modification time, if that is all you want.
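Suggestion 4 can be sketched like this. ShouldBackUp is a hypothetical helper name (not part of DotNetZip), and the stored-timestamp parameter stands in for whatever the old zip entry reports as its ModifiedTime:

```csharp
using System;
using System.IO;

static class IncrementalCheck
{
    // Decide whether a file needs to go into the incremental backup.
    // storedStampUtc is the timestamp recorded at the last backup,
    // or null if the file has never been backed up.
    public static bool ShouldBackUp(string path, DateTime? storedStampUtc)
    {
        // File.GetLastWriteTimeUtc avoids building a full FileInfo
        // when the modification time is all we need.
        DateTime onDiskUtc = File.GetLastWriteTimeUtc(path);
        return storedStampUtc == null || onDiskUtc > storedStampUtc.Value;
    }
}
```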

 

Nov 5, 2009 at 9:38 AM

Wow! Wow! Wow! Mr. Dino, 1.9.0.29 is really amazing!

I repeated my test (my PC is a Dell Inspiron 6400 with a Core 2 Duo processor) and it now takes only 1 minute and 52 seconds, a speedup of about 80% (yes, I know you will not believe me, but 1.8.4.26 completed the same test in 3 minutes and 23 seconds).

I added only these options:

 

NewZip.ParallelDeflateThreshold = 2097152; // 2MB
NewZip.BufferSize = 65536 * 8;             // 512KB

 

But I will do more tests, because I think I can get more performance out of your library! For example, I will try ZipOutputStream and your other precious advice.

I'm also going to use a catalog in SQLite to track the last-modified files (right now I open each zip and check the modified time of each zip entry).
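A minimal sketch of the catalog idea, using a flat text file as a stand-in for SQLite (the class name and the line format are illustrative, not from the thread): the point is to persist each path with its last-write ticks, so the next run can decide what changed without opening the old zip at all.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class BackupCatalog
{
    // path -> last write time (UTC ticks) recorded at the previous backup
    private readonly Dictionary<string, long> entries = new Dictionary<string, long>();

    public static BackupCatalog Load(string catalogPath)
    {
        var catalog = new BackupCatalog();
        if (File.Exists(catalogPath))
        {
            foreach (string line in File.ReadAllLines(catalogPath))
            {
                // Each line: "<ticks>|<path>"
                int sep = line.IndexOf('|');
                if (sep > 0)
                    catalog.entries[line.Substring(sep + 1)] = long.Parse(line.Substring(0, sep));
            }
        }
        return catalog;
    }

    // True if the file is new or has changed since it was last recorded.
    public bool IsModified(string path)
    {
        long stored;
        if (!entries.TryGetValue(path, out stored))
            return true; // never backed up
        return File.GetLastWriteTimeUtc(path).Ticks > stored;
    }

    public void Record(string path)
    {
        entries[path] = File.GetLastWriteTimeUtc(path).Ticks;
    }

    public void Save(string catalogPath)
    {
        var lines = new List<string>();
        foreach (KeyValuePair<string, long> e in entries)
            lines.Add(e.Value + "|" + e.Key);
        File.WriteAllLines(catalogPath, lines.ToArray());
    }
}
```

A real SQLite catalog would replace Load/Save with a table keyed on the path, but the decision logic stays the same.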

 

P.S.: I will keep you informed if I get more performance with your advice!

Bye,

and...

while (true)
{
    Console.WriteLine("THANKS SO MUCH!");
}

 

Coordinator
Nov 5, 2009 at 11:21 AM
Edited Nov 5, 2009 at 11:29 AM

Wow, that's a nice gain in performance.

Listen, Alex, about the ParallelDeflateThreshold: it's a new part of DotNetZip, and maybe I haven't documented it clearly enough.  I want to make sure you understand what it is doing.

By setting it to 2MB, you are saying that only files larger than 2MB should be deflated with multiple threads.  In my experience, on my machine, any file larger than 512k showed a significant performance advantage with parallel deflate.  So you may get more of a gain if you lower that number from 2MB to maybe 1MB or 875k.  But maybe you have already tested it and found that 2MB is the right number.

Also see my other post about the performance analysis of v1.9.0.29.  It shows the effects I measured when modifying the CodecBufferSize and the (IO) BufferSize.

Good luck!

P.S.: the parallel deflate capability is an on/off thing for ZipOutputStream.  With ZipFile, the library knows the size of each file it is zipping, so it can decide whether to use parallel deflate based on that size.  With ZipOutputStream, because of the different programming model, there is no way for the library to know how large a file will be before it is written.  So, with ZipOutputStream, the only meaningful values for the ParallelDeflateThreshold property are 0 (always use parallel deflate) and -1L (never); any other value implies never.
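The decision described above amounts to something like the following. This is an illustrative sketch of the documented semantics, not the library's actual source; the null file size models the ZipOutputStream case, where the size is unknown in advance:

```csharp
using System;

static class ParallelDeflatePolicy
{
    // Mirrors the documented ParallelDeflateThreshold semantics:
    //   -1 : never use parallel deflate
    //    0 : always use it
    //   >0 : use it only when the file size is known and exceeds the threshold
    // knownFileSize is null when the size is not known in advance
    // (the ZipOutputStream case), so any positive threshold means "never".
    public static bool UseParallelDeflate(long threshold, long? knownFileSize)
    {
        if (threshold < 0) return false;
        if (threshold == 0) return true;
        return knownFileSize.HasValue && knownFileSize.Value > threshold;
    }
}
```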

One last thing: be sure you are using the RELEASE DLLs for DotNetZip.  The difference between a DEBUG DLL and a RELEASE DLL can be 40-50%.  I have both RELEASE and DEBUG DLLs on the downloads page; for performance, you want RELEASE.

 

Nov 6, 2009 at 11:22 AM

Thanks so much for the explanations!

Yesterday I tried ParallelDeflateThreshold set to 2MB for a quick test, but after reading your explanation, I tried it set to 512KB.

I also tried ParallelDeflateThreshold set to 750KB, but I didn't observe any difference; with it set to 2MB, it seems to be slightly slower (2 seconds).

I set BufferSize to 128KB too. Now, with ParallelDeflateThreshold at 512KB and BufferSize at 128KB, I seem to get the best performance on my machine.

Thanks to your DotNetZip 1.9, my program is now faster than KLS MailBackup. You have done a really great job with parallel deflate!

P.S.: I downloaded the file DotNetZipLib-Runtime-v1.9.zip and I use the Reduced DLL. I think that is the RELEASE version, isn't it?

Thanks so much for your work!

Coordinator
Nov 6, 2009 at 12:12 PM

I'm glad it's working for you!

About the RELEASE version - yes, you have the right one. The DevKit download contains both a DEBUG DLL and a RELEASE DLL. The Runtime download, which is the one you grabbed, includes only the RELEASE version.