Confused about filename encoding

Jun 15, 2011 at 4:47 PM
Edited Jun 15, 2011 at 4:53 PM

Hi,

I'm rather confused about the encoding (and decoding) of filenames in DotNetZip. DotNetZip is being used both to compress and to extract.

If I create a ZipFile using ZipFile(Encoding.Unicode), that means the ProvisionalAlternateEncoding is UTF16, and any filename that cannot be encoded with IBM437 will be encoded with UTF16, correct? That is, UTF16 is the alternate encoding to IBM437 (instead of UTF8).

On the other side, ZipFile.Read(stream, Encoding.Unicode) is used, and I'm getting some weird behavior. The documentation seems to indicate that the use of the encoding is backwards from the compress side: it says that the provided encoding is used on any entry where UseUnicodeAsNecessary is false. That seems to suggest UTF16 is the alternate encoding to UTF8 (instead of IBM437).

The weird behavior I am getting is that a filename consisting of regular English (IBM437-encodable) characters is coming out the other side as Chinese characters. o_O' That actually seems to be consistent with the documentation, so perhaps it's being encoded as IBM437 and then interpreted as UTF16...
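The suspected mechanism can be reproduced outside DotNetZip. The following is only a minimal sketch in Python, not DotNetZip code: it assumes the reader decodes the stored bytes as little-endian UTF-16 (which is what .NET's Encoding.Unicode means), and the file name is a made-up example.

```python
# Hypothetical file name, chosen only for illustration.
name = "readme.txt"

# Writer side: every character fits in IBM437 (code page 437),
# so the name is stored as one byte per character.
raw = name.encode("cp437")

# Reader side: the same bytes are decoded as UTF-16 little-endian,
# so each PAIR of bytes becomes a single code unit.
garbled = raw.decode("utf-16-le")

print(garbled)
print([hex(ord(ch)) for ch in garbled])
```

Each pair of ASCII letters lands in the CJK Unified Ideographs range (U+4E00 to U+9FFF), which would explain English names turning into Chinese-looking characters. A name with an odd number of bytes would fail to decode in this sketch instead.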

So how do I do this properly? Should I use ZipFile.Read(stream), then set ProvisionalAlternateEncoding to UTF16? (Is ZipFile.Read(stream, encoding) specifically for compatibility?) Oh, also, I take it that I don't have to set UseUnicodeAsNecessary if I set the ProvisionalAlternateEncoding?

Thanks

Coordinator
Jun 18, 2011 at 5:58 AM

I don't understand all the questions.

If you want Unicode, why not use UTF8, which is the standard way of doing this in ZIP files according to PKWARE? Why use UTF16?

If you set UseUnicodeAsNecessary before calling ZipFile.Save() the first time, any entries saved into the zip file that require Unicode will be encoded with UTF8 (according to PKWARE's specification). By "require Unicode" I mean any entry whose filename has characters that cannot be properly encoded (round-tripped) in IBM437.
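The "require Unicode" test described above can be sketched as a round-trip check. This is an illustration in Python, not DotNetZip's actual implementation; the helper names needs_unicode and encode_entry_name are hypothetical.

```python
def needs_unicode(name: str) -> bool:
    """Return True if the name cannot be round-tripped through IBM437."""
    try:
        return name.encode("cp437").decode("cp437") != name
    except UnicodeEncodeError:
        # At least one character has no IBM437 representation.
        return True


def encode_entry_name(name: str):
    """Pick IBM437 when it is lossless, otherwise fall back to UTF-8."""
    if needs_unicode(name):
        return name.encode("utf-8"), "utf-8"
    return name.encode("cp437"), "cp437"


for candidate in ["readme.txt", "caf\u00e9.txt", "\u65e5\u672c\u8a9e.txt"]:
    print(candidate, encode_entry_name(candidate)[1])
```

Note that IBM437 covers more than ASCII (it includes "é", for example), so only names with characters outside that code page fall back to UTF-8 here.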

Subsequently reading a zip file generated this way requires only ZipFile.Read(filename) or ZipFile.Read(stream).  Because UTF8 is part of the ZIP Specification, DotNetZip knows how to properly read a zip file with UTF8 filenames.
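The reason no encoding argument is needed on read is that the ZIP format marks UTF-8 names per entry: bit 11 (0x800) of the general purpose bit flag. As a rough illustration, here is Python's standard zipfile module, used as a stand-in for DotNetZip, writing two made-up entries and reading the flag back. It shows the file format behavior, not DotNetZip's API.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # Hypothetical entries: one plain ASCII name, one that needs Unicode.
    zf.writestr("readme.txt", b"plain name")
    zf.writestr("\u65e5\u672c\u8a9e.txt", b"unicode name")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    # The reader checks the flag on each entry to decide how to decode its name.
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG) for info in zf.infolist()}

for filename, is_utf8 in flags.items():
    print(filename, "UTF-8 flag set" if is_utf8 else "UTF-8 flag not set")
```

Because the flag travels with each entry, a reader can decode flagged names as UTF-8 without being told. An entry stored in some other encoding, such as UTF16, carries no such marker, so both sides have to agree on it out of band.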

I hope this clarifies things. If not, maybe you could take a step back and tell me just what it is that you are trying to do and what your particular scenario involves. Then I can try to answer a specific question.