Closed

Support Unicode File/Folder Name

description

I tested this component using the AddXXX methods. When a file name contains Unicode characters, e.g. Chinese, its name in the zip file comes out unrecognizable.

It seems that it does not support Unicode file/folder names.

Thanks,

Tao

(NB: Unicode is supported by the pkzip format, via UTF-8.)

Closed Oct 8, 2008 at 3:50 PM by Cheeso
fixed in v1.6 prelim release.

comments

Graymalkin wrote Mar 18, 2008 at 3:03 AM

The same happens with Portuguese, where something like "Configurações.xml" becomes "Configuraþ§es.xml" in the zip file.

Thanks!

codedenmark wrote Mar 31, 2008 at 11:43 AM

Same with Danish, "æøå1024.jpg" becomes "µ°Õ1024.jpg".
Thanks.
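
Both garbled names above are consistent with one specific mix-up: the name is written as single ISO-8859-1 bytes, and the reading tool interprets those bytes as DOS code page 850. The following is a minimal sketch of that hypothesis, not a statement about what the library actually does internally. The class and method names are made up for illustration, and the `CodePagesEncodingProvider` line assumes .NET Core 3.0 or later (on the classic .NET Framework, code page 850 is available without it and that line can be dropped).

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: encode the name as ISO-8859-1, then decode
    // the very same bytes as code page 850, as a DOS-era zip tool might.
    public static string Latin1ReadAsCp850(string name)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy
        // code pages must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] stored = Encoding.GetEncoding("ISO-8859-1").GetBytes(name);
        return Encoding.GetEncoding(850).GetString(stored);
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Latin1ReadAsCp850("Configurações.xml")); // Configuraþ§es.xml
        Console.WriteLine(Latin1ReadAsCp850("æøå1024.jpg"));       // µ°Õ1024.jpg
    }
}
```

If this is what is happening, the bytes in the archive are not damaged; writer and reader simply disagree about which code page they are in.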

Cheeso wrote Apr 3, 2008 at 8:51 PM

I think unicode is a great thing to add, but unfortunately, I don't know a thing about it. Anyone want to pitch in and help on this?

Cheeso wrote Apr 3, 2008 at 8:54 PM

I understand that the APPNOTE.TXT from pkzip has had a way to support Unicode since September 2006, according to
http://en.wikipedia.org/wiki/.zip_file

I will have to look into it.

Cheeso wrote Apr 3, 2008 at 8:55 PM

Whoops, no, that is September 2007.

herreruud wrote Jun 9, 2008 at 9:22 AM

Same characters (æøåÆØÅ) fail in Norwegian. Is there any way to change the encoding for the filenames, maybe to ISO-8859-1 or some such?

/Thanks

herreruud wrote Jun 9, 2008 at 9:34 AM

Hi,

Since you requested help with this issue I looked quickly into your source and saw that in order to save filenames you use System.BitConverter.GetBytes(char Value) to get the bytes from a char.
Instead, you could do this on an encoding; for instance:

System.Text.Encoding.Unicode.GetBytes(char[] chars / string s) or
System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes....

Unless I'm really missing something here, I think that would be all it takes.
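
To make the suggestion concrete, here is a small sketch comparing what the different calls produce for one of the names from this thread. It only illustrates the .NET calls themselves; the `NameBytes` class and its `Encode` helper are hypothetical and are not taken from the library source.

```csharp
using System;
using System.Text;

static class NameBytes
{
    // Hypothetical helper: turn a file name into bytes using an
    // explicitly chosen encoding.
    public static byte[] Encode(string name, Encoding encoding)
    {
        return encoding.GetBytes(name);
    }

    static void Main()
    {
        string name = "æøå1024.jpg";

        // BitConverter yields the two bytes of the UTF-16 code unit,
        // in the byte order of the machine it runs on.
        Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes('æ')));

        // One byte per character for this name.
        Console.WriteLine(BitConverter.ToString(
            Encode(name, Encoding.GetEncoding("ISO-8859-1"))));

        // Two bytes for each of æ, ø and å; one byte per ASCII character.
        Console.WriteLine(BitConverter.ToString(Encode(name, Encoding.UTF8)));

        // Encoding.Unicode is UTF-16 little-endian: two bytes per character here.
        Console.WriteLine(BitConverter.ToString(Encode(name, Encoding.Unicode)));
    }
}
```

Note that `Encoding.Unicode` means UTF-16, which is a different byte format from the UTF-8 mentioned in the NB at the top of this issue, so the choice of encoding matters for whatever tool reads the archive afterwards.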

/C

Cheeso wrote Jun 9, 2008 at 3:12 PM

It seems simple, doesn't it? But there's a bunch of work testing the various combinations. For example, Java uses a modified UTF-8 encoding, and in order to be a good citizen, this library would have to handle Java's encoding, as well as the normal UTF-8 encoding produced by WinZip or the "Windows compressed folders" thing. Beyond UTF-8, there is the topic of supporting arbitrary Unicode code pages, which is what is being asked for in this issue. I think that would be necessary in order to support Norwegian, Chinese, Hebrew, etc. (I am no Unicode expert, so please tell me if that is not so.) The Zip spec does not clarify exactly how to use arbitrary code pages. So there would be some research involved there, some reverse engineering, and then a bunch of additional testing.

I think it would be easy to do an ISO-8859-1 encoding that worked only with this library on both ends of the compression. The challenge is compatibility and interoperability - that is what people really want.

Cheeso wrote Jul 9, 2008 at 6:33 PM

I looked further into the situation. On my Vista SP1 machine, if I create a file with any of the names offered in the comments, like "Configuraþ§es.xml" or "æøå1024.jpg", and then try to drag-n-drop it into a compressed (zip) folder using Explorer, it does not work. Vista tells me to please rename the file, because it has a character that cannot be used in a compressed folder! How about that?

I also read the zip spec again, and it does not support Unicode in general, but rather UTF-8, which, as I understand it, would not be sufficient to handle Chinese characters. It would work for Danish, Norwegian and Portuguese (etc), but not Chinese.
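
For reference, UTF-8 is an encoding of the full Unicode character set, so it can represent Chinese names as well as Danish or Portuguese ones. A minimal sketch of a round trip follows; the `Utf8RoundTrip` class and the Chinese file name are made up for illustration.

```csharp
using System;
using System.Text;

static class Utf8RoundTrip
{
    // Hypothetical helper: true if the name is unchanged after being
    // encoded to UTF-8 bytes and decoded again.
    public static bool SurvivesUtf8(string name)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(name);
        return Encoding.UTF8.GetString(bytes) == name;
    }

    static void Main()
    {
        // Made-up Chinese file name, plus two names from this thread.
        Console.WriteLine(SurvivesUtf8("中文文件.txt"));
        Console.WriteLine(SurvivesUtf8("Configurações.xml"));
        Console.WriteLine(SurvivesUtf8("æøå1024.jpg"));

        // Each of the four Chinese characters takes three bytes in UTF-8.
        Console.WriteLine(Encoding.UTF8.GetBytes("中文文件.txt").Length);
    }
}
```

Whether a given unzip tool actually decodes the name as UTF-8 is a separate question, which is the interoperability concern raised earlier in this thread.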

Cheeso wrote Jul 10, 2008 at 6:28 PM

I got confirmation that WinVista, in its "compressed folders" feature, does not support Unicode or even UTF-8 characters in the filename to be zipped.
I can modify DotNetZip to produce a correctly encoded zip file containing files that have names with UTF-8 characters. But I don't know how to then test the resulting zip file.
How do I determine if it is actually legal and compatible, if Vista will just puke on it? I don't have access to Mac, Linux, and 10 other zip utilities.
Can anyone help here?
Suggestions? Without a way to test what I've done, I don't know that I'll be able to deliver this capability.
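
One check that needs no other zip utility is to read the header bytes directly. In the zip spec, bit 11 of the general purpose bit flag says that the file name is UTF-8; when the bit is clear, the name is expected to be in IBM code page 437. The sketch below looks only at the first local file header and assumes it sits at offset 0, which does not hold for every archive (self-extracting ones, for example). `ZipNameInspector` is a hypothetical helper, not part of DotNetZip, and the `CodePagesEncodingProvider` line assumes .NET Core 3.0 or later.

```csharp
using System;
using System.IO;
using System.Text;

static class ZipNameInspector
{
    // Bit 11 of the general purpose bit flag marks the name as UTF-8.
    const int Utf8Flag = 0x0800;

    // Decodes the file name stored in the local file header at offset 0.
    public static string FirstEntryName(byte[] zip, out bool flaggedUtf8)
    {
        // A local file header starts with the signature bytes 'P' 'K' 3 4.
        if (zip.Length < 30 || zip[0] != 0x50 || zip[1] != 0x4B ||
            zip[2] != 0x03 || zip[3] != 0x04)
            throw new InvalidDataException("No local file header at offset 0.");

        int flags = zip[6] | (zip[7] << 8);        // little-endian, offset 6
        int nameLength = zip[26] | (zip[27] << 8); // little-endian, offset 26
        if (30 + nameLength > zip.Length)
            throw new InvalidDataException("Truncated file name.");

        flaggedUtf8 = (flags & Utf8Flag) != 0;

        // Assumption: .NET Core / .NET 5+, where code page 437 is opt-in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = flaggedUtf8 ? Encoding.UTF8 : Encoding.GetEncoding(437);

        // The name starts right after the 30-byte fixed part of the header.
        return encoding.GetString(zip, 30, nameLength);
    }

    static void Main(string[] args)
    {
        bool utf8;
        string name = FirstEntryName(File.ReadAllBytes(args[0]), out utf8);
        Console.WriteLine("{0} (UTF-8 flag: {1})", name, utf8);
    }
}
```

This only confirms that an archive is self-consistent with the spec; it says nothing about how any particular unzip tool will behave, so testing against real tools is still needed.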

herreruud wrote Aug 29, 2008 at 1:48 PM

Hi,

Sorry I'm a bit late in replying.
If you can modify the lib as suggested I will be happy to test it.

/C

kallex wrote Sep 2, 2008 at 8:48 PM

Hi!

Not sure why the files in comments don't work, but I attached an example zip file created in Vista (tested with XP SP2 as well) including Finnish/Swedish characters åäö.

Br,

Kalle

kallex wrote Sep 2, 2008 at 8:53 PM

Here's another example with the earlier mentioned Portuguese and Danish names included in the zip (made with Vista's "Send To > Compressed (zipped) Folder").

Br,

Kalle

Cheeso wrote Sep 17, 2008 at 5:48 AM

I fixed this in change set 24455. I will need lots of testing from everyone!

DomZ wrote Oct 9, 2008 at 2:04 PM

Hi,

Thanks for this great project.

I have done some tests on Unicode handling with the 1.6 preview release.
I used the WinForm example provided with the sources (I added the following line to support Unicode):
zip1.UseUnicode = true;

That's working fine on file names, but not on folder names.

I've attached a rar with 6 files:
  • [ansi] original files.rar => Contains a folder tree with directories and files that use only ANSI characters
  • [ansi] Test with DotNetZip 1.6 preview.zip => Contains the encoded output of the ANSI test case done with DotNetZip 1.6 preview (you can see that Windows can extract the file, so names are not encoded with Unicode. Perfect!)
  • [unicode] original files.rar => Contains a folder tree with directories and files that use a mix of Unicode and other charsets (Arabic, Chinese, Japanese, Swedish, Hebrew, Greek, Russian, ...)
  • [unicode] Test with 7zip 4.60.zip => The output done with 7zip 4.60 beta (Unicode is correctly handled; Windows can't unpack it because Unicode is used)
  • [unicode] Test with WinRAR 3.80.zip => The output done with WinRAR 3.80 (Unicode is correctly handled; Windows can't unpack it because Unicode is used)
  • [unicode] Test with DotNetZip 1.6 preview.zip => The output done with DotNetZip 1.6 preview (Unicode is correctly handled in file names BUT not in folder names; Windows can't unpack it because Unicode is used)

Cheeso,

Can you try to see why the folder names are not correctly handled when using Unicode?

Thanks a lot

Regards
Dominique