Stream directly to BinaryReader

Jun 17, 2010 at 8:34 PM
Edited Jun 17, 2010 at 8:52 PM

Hi Cheeso,

Could I use DotNetZip to stream directly into a BinaryReader?


    Dim fs As FileStream = New FileStream(bakfile, FileMode.Open)
    Dim br As BinaryReader = New BinaryReader(fs)
    Dim lngStartPos As Long = 3756

    br.BaseStream.Seek(lngStartPos, SeekOrigin.Begin)

I only need to read a small amount (as above), so could I save quite a bit of time compared with extracting the file and then reading it with the BinaryReader?

Thanks.

Coordinator
Jun 17, 2010 at 10:55 PM

I don't know what you mean by "stream into" a BinaryReader.  It's my understanding that a BinaryReader is for reading data, so an application wouldn't ever "stream into" a BinaryReader.  An application can only read from a BinaryReader.

I also don't get your goal.  You said you only need to "stream a small amount", but I don't know what that means either.  Are you trying to read a zip file, or write one?

Maybe you are trying to read a zip file whose contents are embedded within a larger filesystem file. If the zip file is very small, say less than 16k, I'd suggest reading the zip content into a MemoryStream, then instantiating the ZipFile instance from that MemoryStream. If the zip content is larger and you want to use a streaming approach, then I can imagine defining an OffsetStream type that would read from the offset you desire. You could also specify the read limit for such a stream. There is an OffsetStream like this in the DotNetZip source base, but it's not a public class. It's not documented and it's not something that I "support" for users. It was designed and written for internal use only. But you might be able to gain some insight for your own purposes by looking at that code.
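
For the small-file case, a minimal sketch of copying a byte range of a file into a MemoryStream might look like the following. `ReadSlice` is a hypothetical helper, not part of DotNetZip, and the offset and length of the embedded zip content are assumed to be known in advance. With DotNetZip, the resulting stream would then be passed to `ZipFile.Read(Stream)`.

```csharp
using System;
using System.IO;

public static class SliceDemo
{
    // Hypothetical helper: copy 'length' bytes starting at 'offset' into a MemoryStream.
    public static MemoryStream ReadSlice(string path, long offset, int length)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            byte[] buffer = new byte[length];
            int total = 0;
            while (total < length)
            {
                // Read may return fewer bytes than requested, so loop until done.
                int n = fs.Read(buffer, total, length - total);
                if (n == 0) throw new EndOfStreamException("File is shorter than offset + length.");
                total += n;
            }
            return new MemoryStream(buffer, false);   // read-only view of the slice
        }
    }

    public static void Main()
    {
        // Stand-in data: a ten-byte temporary file; the slice covers bytes 4..6.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 });
        using (MemoryStream ms = ReadSlice(path, 4, 3))
        {
            Console.WriteLine(BitConverter.ToString(ms.ToArray()));   // prints 04-05-06
        }
        File.Delete(path);
    }
}
```

This only makes sense when the slice fits comfortably in memory; for larger content a wrapping stream is the better fit.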

Good luck!

 

Jun 17, 2010 at 11:10 PM

 

I read a file using the FileStream up to a specific point, then read the bytes to work out the version of the file:

    Dim fs As FileStream = New FileStream(bakfile, FileMode.Open)
    Dim a As Byte
    Dim b As Byte
    Dim c As Integer
    Dim br As BinaryReader = New BinaryReader(fs)
    Dim lngStartPos As Long = 3756

    br.BaseStream.Seek(lngStartPos, SeekOrigin.Begin)
    a = br.ReadByte()
    b = br.ReadByte()

Now, rather than extracting this file from the zip and reading it in, I thought it might be possible to stream the file from the zip to the BinaryReader up to this point?
The file extraction destination is based on this value, so doing this would avoid extracting to a temporary location and then moving the (sometimes large) file to its final destination.
These files (SQL backup files) can be several GB in size, so reading the whole file into memory is not an option.

So the only way to stream to a position is by using OffsetStream?

Thanks again.

Coordinator
Jun 17, 2010 at 11:21 PM

It depends on the structure of the data in the file. The metadata inside the zip file format contains offsets. If those offsets are relative to the start of the original file, then things will just work: pass the original FileStream into the ZipFile.Read() method. If the offsets are relative to the mid-stream position, then you'll need something like the OffsetStream to wrap the file.

Jun 17, 2010 at 11:35 PM

There's no metadata. I'm reading binary data byte by byte (seeking from start position 3756) from a binary file within the zip.

Thanks.

Coordinator
Jun 17, 2010 at 11:45 PM

I'm not clear on what you are doing.

If you have a file that contains only zip content, then you can read it with ZipFile.Read(), passing a FileStream or a file name. If at that point you want to read from one or more of the compressed entries in the zip file, you can call ZipEntry.OpenReader(). I don't understand why you are reading 2 bytes from the zip file, but as best I can tell, it does not affect how you read the zip file.

 

Sep 12, 2010 at 3:01 AM

Hi Cheeso, hope you are well?

I'm trying to read the .bak file at the position below and then test this value to determine the file version.
When not zipped this is fairly easy using a stream reader. Would it be possible to do this with zipped files?

 

 

    ' Needs: Imports System.IO and Imports Ionic.Zip
    Dim ZipToUnpack As String = "C:\Test\SQL2000.zip"
    Using zip As ZipFile = ZipFile.Read(ZipToUnpack)
        For Each backup As ZipEntry In zip
            ' Extract(Stream) needs an open, writable stream; a temporary file is assumed here.
            Dim wFile As New FileStream(Path.GetTempFileName(), FileMode.Create, FileAccess.ReadWrite)
            Dim ee As ZipEntry = zip("SQL2000.bak")
            ee.Extract(wFile)

            Dim a As Byte
            Dim b As Byte
            Dim c As Integer
            Dim br As BinaryReader = New BinaryReader(wFile)
            Dim lngStartPos As Long = 3756

            br.BaseStream.Seek(lngStartPos, SeekOrigin.Begin)
            a = br.ReadByte()
            b = br.ReadByte()
            c = b * 256 + a   ' little-endian 16-bit value
            br.Close()
            wFile.Close()

            Select Case c
                Case Is >= 539
                    MessageBox.Show("SQL2000")
                Case Else
                    MessageBox.Show("Unknown")
            End Select
        Next
    End Using

C#:

    // Needs: using System.IO; using Ionic.Zip;
    string ZipToUnpack = "C:\\Test\\SQL2000.zip";
    using (ZipFile zip = ZipFile.Read(ZipToUnpack)) {

        foreach (ZipEntry backup in zip) {
            // Extract(Stream) needs an open, writable stream; a temporary file is assumed here.
            FileStream wFile = new FileStream(Path.GetTempFileName(), FileMode.Create, FileAccess.ReadWrite);
            ZipEntry ee = zip["SQL2000.bak"];
            ee.Extract(wFile);

            byte a = 0;
            byte b = 0;
            int c = 0;
            BinaryReader br = new BinaryReader(wFile);
            long lngStartPos = 3756;

            br.BaseStream.Seek(lngStartPos, SeekOrigin.Begin);
            a = br.ReadByte();
            b = br.ReadByte();
            c = b * 256 + a;   // little-endian 16-bit value
            br.Close();
            wFile.Close();

            // Matches the VB "Case Is >= 539" test above.
            if (c >= 539) {
                MessageBox.Show("SQL2000");
            } else {
                MessageBox.Show("Unknown");
            }
        }
    }

 

Coordinator
Sep 19, 2010 at 6:15 PM

You can read a file from within a zip archive, without extracting it, using the ZipEntry.OpenReader() method.  That gives you a read-only stream, which returns the decompressed content for that entry.  At that point you can seek in that stream to the desired position, and read the data there. 

 

Sep 20, 2010 at 10:34 PM

Hi Cheeso,

Thanks for the reply. The stream returned by OpenReader (CrcCalculatorStream) doesn't appear to support Seek.
I get a "method not implemented" error when trying to seek via:


    Using zip As New ZipFile("C:\Test\SQL2005.zip")
        Dim e1 As ZipEntry = zip.Item("SQL2005.bak")
        Using s As Ionic.Zlib.CrcCalculatorStream = e1.OpenReader
            Dim a As Byte
            Dim b As Byte
            Dim c As Integer
            Dim br As BinaryReader = New BinaryReader(s)
            Dim lngStartPos As Long = 3756
            br.BaseStream.Seek(lngStartPos, SeekOrigin.Begin)
            a = br.ReadByte()
            b = br.ReadByte()
            c = b * 256 + a
            br.Close()
            Select Case c
                Case Is >= 539
                    MessageBox.Show("SQL2000")
                Case Else
                    MessageBox.Show("Unknown")
            End Select
        End Using
    End Using

Thanks.

Coordinator
Sep 22, 2010 at 11:48 AM

Ahh yes. The CrcCalculatorStream is a read-forward stream that calculates a crc as it reads. The resulting crc is used to verify the contents of the zipped entry. For that reason it was not straightforward to implement Seek(). If you want to seek, you could try one of three things:

1. Implement your own Seek() that simply reads and discards bytes until it gets to the position it wants.
2. Wrap the CrcCalculatorStream in a BufferedStream, with a buffer size large enough to support your Seek() calls.
3. Just read the entire contents of the CrcCalcStream into a memory buffer or MemoryStream, then seek on that buffer.
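
The first option might be sketched as follows. This is only an illustration under stated assumptions: `SkipForward` and `ReadUInt16At` are hypothetical helper names, and a `DeflateStream` over a `MemoryStream` stands in for the forward-only stream that `ZipEntry.OpenReader()` returns. As far as I can tell, `BufferedStream` reports `CanSeek` from the stream it wraps, so the sketch sticks to read-and-discard.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class ForwardSkipDemo
{
    // Hypothetical helper: advance a forward-only stream by reading and discarding bytes.
    public static void SkipForward(Stream s, long count)
    {
        byte[] scratch = new byte[8192];
        while (count > 0)
        {
            int n = s.Read(scratch, 0, (int)Math.Min(scratch.Length, count));
            if (n == 0) throw new EndOfStreamException("Stream ended before the target offset.");
            count -= n;
        }
    }

    // Hypothetical helper: read the little-endian 16-bit value at 'offset' (b * 256 + a).
    public static int ReadUInt16At(Stream s, long offset)
    {
        SkipForward(s, offset);
        int a = s.ReadByte();
        int b = s.ReadByte();
        if (a < 0 || b < 0) throw new EndOfStreamException();
        return b * 256 + a;
    }

    public static void Main()
    {
        // Stand-in for a compressed entry: 4000 bytes with a marker at offset 3756.
        byte[] plain = new byte[4000];
        plain[3756] = 0x1B;   // low byte
        plain[3757] = 0x02;   // high byte, so the value is 0x021B = 539

        var packed = new MemoryStream();
        using (var z = new DeflateStream(packed, CompressionMode.Compress, true))
            z.Write(plain, 0, plain.Length);
        packed.Position = 0;

        // With DotNetZip, this stream would come from entry.OpenReader() instead.
        using (var unzip = new DeflateStream(packed, CompressionMode.Decompress))
        {
            Console.WriteLine(ReadUInt16At(unzip, 3756));   // prints 539
        }
    }
}
```

Only the bytes up to the target offset are decompressed, and nothing larger than the scratch buffer is held in memory.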
Jan 6, 2011 at 5:57 PM

I am planning on making the ZipEntry stream seekable.

I would flip a boolean property if the consumer of the stream wants to seek (perhaps switching automatically when the consumer seeks). Internally, the CrcCalculatorStream would then check whether it is in seekable mode and, if so, calculate the CRC lazily: when the consumer asks for the CRC, it would seek to the beginning of the entry, then read and calculate. As mentioned above, to mitigate performance issues the CRC stream could compute the CRC in chunks and merge them together.

 

Coordinator
Jan 6, 2011 at 6:12 PM
Edited Jun 13, 2011 at 9:32 PM

I believe that the Crc is referenced implicitly when an app calls OpenReader() and then reads the stream to its end.  In that case the library does an implicit CRC check and throws if there is no match.  If a reader doesn't read to the end of the stream, then there's no way to check the CRC so this isn't done.  I'm not looking at the code right now, but if I recall correctly, that is how it is designed.

The problem with your seek-back-and-recalculate-the-CRC approach is that a stream may be very large. There are zip files with entries of 5 GB. Your idea will work for small files, but re-reading 5 GB of data that you have just read, in order to calculate a CRC, seems like a loss. At the very least it would be very surprising for applications. Also you'd get duplicate Read events raised by DotNetZip. I'm not sure this is the best way to address the problem you're trying to solve. And actually, I'm not clear on exactly what the problem is.

It may be easier for you, if, rather than modifying the ZipEntry stream class (CrcCalculatorStream), you just wrap it into a BufferedStream.  This would give you seek capability, would require no changes to existing library code, and would give you explicit control over the size of the buffer to maintain. 

 

Jan 6, 2011 at 8:20 PM

DotNetZip will be slower only in the case where somebody uses it to calculate a CRC on a large entry after seeking (not if they just read*).

Using a BufferedStream as a wrapper around the entry could cause the consumer of the entry to allocate 5 GB of memory on the memory stream. Not good.

The Use Case:

1. Seek, seek then read around a large ZipEntry without copying the stream and don't use the CRC.

I have a change set.  Will you take a look at it?  Or do you really want a complete solution as per crc_merge parts and all the accounting that goes along with it?

 

Coordinator
Jan 7, 2011 at 12:39 PM

In short, the change you're proposing will introduce more problems than it solves. It will also produce surprising performance behavior when Seek is used. For those reasons I'm not interested in adopting your proposed changes into DotNetZip.

I believe the problem you are solving is your application's problem; I don't think it needs to be solved inside DotNetZip. I believe the best approach in that case is for your application to use the existing .NET Framework building-block classes, like BufferedStream if appropriate. A BufferedStream will not allocate 5 GB unless you tell it to. Another way to solve this for yourself is to write a wrapper stream that, when Seek() is called, reads data into a bit bucket until it gets to where the caller wants to be.
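
A wrapper of that kind might look roughly like this. It is a sketch under assumptions, not DotNetZip code: `ForwardSeekStream` is a hypothetical class name, it only supports forward seeks, and a `DeflateStream` stands in for the stream returned by `ZipEntry.OpenReader()`.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical wrapper: forward-only Seek() implemented by reading into a bit bucket.
public sealed class ForwardSeekStream : Stream
{
    private readonly Stream _inner;
    private readonly byte[] _bitBucket = new byte[8192];
    private long _position;

    public ForwardSeekStream(Stream inner) { _inner = inner; }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return true; } }    // forward only
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { return _position; }
        set { Seek(value, SeekOrigin.Begin); }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int n = _inner.Read(buffer, offset, count);
        _position += n;
        return n;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        if (origin == SeekOrigin.End) throw new NotSupportedException("Length is unknown.");
        long target = origin == SeekOrigin.Begin ? offset : _position + offset;
        if (target < _position) throw new NotSupportedException("Cannot seek backward.");
        while (_position < target)
        {
            // Every skipped byte is still read (and decompressed) by the inner stream.
            int n = Read(_bitBucket, 0, (int)Math.Min(_bitBucket.Length, target - _position));
            if (n == 0) throw new EndOfStreamException();
        }
        return _position;
    }

    public override void Flush() { }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class Program
{
    public static void Main()
    {
        byte[] plain = new byte[4000];
        plain[3756] = 0x1B; plain[3757] = 0x02;              // 0x021B = 539, little-endian
        var packed = new MemoryStream();
        using (var z = new DeflateStream(packed, CompressionMode.Compress, true))
            z.Write(plain, 0, plain.Length);
        packed.Position = 0;

        var source = new DeflateStream(packed, CompressionMode.Decompress);
        using (var br = new BinaryReader(new ForwardSeekStream(source)))
        {
            br.BaseStream.Seek(3756, SeekOrigin.Begin);
            Console.WriteLine(br.ReadUInt16());              // prints 539
        }
    }
}
```

The calling code keeps its `BinaryReader` and `Seek` calls unchanged, while the cost of the skip stays visible in the wrapper rather than inside the library.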

Seeking in a compressed stream is a strange thing. Suppose the caller asks to seek forward 1000 bytes: how far forward must the stream seek in the compressed stream? 400 bytes? 300? In fact there is no direct mapping that tells us which compressed byte corresponds to which uncompressed byte. The only way to seek forward 1000 bytes in the uncompressed stream is to read forward and decompress, counting the output bytes; when 1000 is reached, the seek has completed. You must decompress to seek. Likewise there is no way to know how far to seek backward. When seeking backward in a compressed stream, the only way is to start at the beginning of the compressed stream and decompress all the intervening bytes again. Decompression is, by far, the most CPU-intensive operation when reading data out of a zip file. The CRC calculation is minor in comparison. So implementing a seekable decompression stream would be a disaster for performance reasons.

It will hide the actual behavior, which is that the stream must read and decompress all the bytes that get traversed by a Seek. 
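
The lack of a fixed mapping between compressed and uncompressed offsets can be seen with a small experiment. This is a sketch using the framework's `DeflateStream` rather than DotNetZip's own codec, and `CompressedSize` is a hypothetical helper name; the exact byte counts depend on the runtime, so none are quoted here.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class CompressedOffsetDemo
{
    // Deflate 'data' and return the number of compressed bytes produced.
    public static long CompressedSize(byte[] data)
    {
        var packed = new MemoryStream();
        using (var z = new DeflateStream(packed, CompressionMode.Compress, true))
            z.Write(data, 0, data.Length);
        return packed.Length;
    }

    public static void Main()
    {
        byte[] zeros = new byte[1000];            // highly compressible
        byte[] noise = new byte[1000];            // essentially incompressible
        new Random(12345).NextBytes(noise);

        // The same 1000 uncompressed bytes occupy very different compressed lengths,
        // so a seek distance cannot be translated without decompressing.
        Console.WriteLine("1000 zero bytes   -> {0} compressed bytes", CompressedSize(zeros));
        Console.WriteLine("1000 random bytes -> {0} compressed bytes", CompressedSize(noise));
    }
}
```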

This is why it's not implemented in DotNetZip. If someone wants to seek in the stream that gets decompressed out of the zipfile as it is read, they are free to implement something slim on top of the existing CrcCalculatorStream.  I gave you a couple suggestions above. I'm sure there are other options.

Good luck!