Really need help on Multi-byte char sets

Coordinator
Aug 21, 2008 at 7:30 PM
I'd like to work on workitem 3152 - support for Unicode.
http://www.codeplex.com/DotNetZip/WorkItem/View.aspx?WorkItemId=3152

Actually, the name of the workitem is misleading: the zip format does not support Unicode in general. Since September of last year, however, the zip format does support UTF-8 for filenames and comments on zip entries.

A bunch of people have asked for UTF8 support, and I would like to add that into DotNetZip, but I am not confident in my ability to test what I produce.

I don't have any experience testing for multiple languages. Windows Vista,  in its "compressed folders" feature, does not support UTF-8 characters in the filename to be zipped.  I don't know how to verify that what I produce will be usable by anything other than DotNetZip.

So I'm looking for help.
People who have suggestions on how I can test what I produce.
People who can actually help test what I produce.

Let me know!
Sep 2, 2008 at 6:41 AM
Dear Cheeso,

I would love to be a tester...
If you need a tester, let me know, please
Sep 3, 2008 at 7:26 PM
Have you used http://www.7-zip.org/ or winrar to create test archives?
Coordinator
Sep 3, 2008 at 9:44 PM
Thanks for volunteering jawc.  No, I have not used 7zip or winrar.  Are those the archivers that you use for handling zipfiles containing files with UTF-8 filenames?
Sep 4, 2008 at 12:19 AM
I haven't had a task to work with such archives. Those are just some of the good archivers you can use. 7zip is OSS (don't know the license) and should have a C# wrapper if you need.

Sep 4, 2008 at 5:04 PM
Dear Cheeso,

In fact, I am using Winrar to test...
UTF-8 doesn't work...(I have a filename with chinese character)
Sep 6, 2008 at 11:18 PM
Hi!

I spent some time debugging the current implementation and came to a conclusion that might be the reason why it's not UTF-8 but still it's not working...


The file names use an extended character set on top of ASCII, which managed (.NET) code does not recognise unless it is specifically told to. Which extended set applies depends on the user's regional settings; for instance, in Finland our characters seem to work with code page 850.


To get things going, I got the characters we need working by doing the following:

 

protected internal static string StringFromBuffer(byte[] buf, int start, int maxlength) {
    return Encoding.GetEncoding(850).GetString(buf, start, maxlength);
}

This works even with the earlier suggestion of "ISO-8859-1" still in place on the other line(s):

 

bytes[i + j] = System.BitConverter.GetBytes(c[j])[0];


But I suppose the proper implementation for those portions, too, would be to use the system code page.

I hard-coded the 850 selection here, but there is a P/Invoke call that you can use to get the OEM code page and then create the encoding based on that.
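
A minimal sketch of looking up the OEM code page instead of hard-coding 850. The P/Invoke call is presumably GetOEMCP in kernel32.dll; this sketch swaps in the managed property TextInfo.OEMCodePage, which reports the same kind of value without a native call. The class and method names are hypothetical and not part of DotNetZip. It assumes a modern .NET runtime, where legacy code pages have to be registered through CodePagesEncodingProvider; on the .NET Framework the code pages are built in and that line is not needed.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of DotNetZip.
static class OemEncodingSketch
{
    // Returns an Encoding for the OEM code page of the current culture
    // (for example 850 on a Finnish system, 437 on a US English one).
    public static Encoding GetOemEncoding()
    {
        // Assumption: modern .NET. Legacy code pages such as 850 are only
        // available after the code-pages provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        int oemCodePage = CultureInfo.CurrentCulture.TextInfo.OEMCodePage;
        return Encoding.GetEncoding(oemCodePage);
    }

    // Same shape as the StringFromBuffer above, minus the hard-coded 850.
    public static string StringFromBuffer(byte[] buf, int start, int length)
    {
        return GetOemEncoding().GetString(buf, start, length);
    }
}
```

This only helps when the archive was written on a machine with the same OEM code page as the one reading it; the zip file itself does not record which code page was used.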

I hope this helped.

Br,

Kalle

 

Coordinator
Sep 9, 2008 at 10:01 PM

jawc, what do you mean by "UTF-8 does not work"?

Do you mean that if you, using WinRAR, zip up a file with Chinese characters in the name, the operation fails?  And that is because Chinese characters require UTF-16, yes?  I think UTF-8 is not sufficient for Chinese.  Is that right?  (Sorry, I am somewhat of a novice on Unicode.)

If you mean something else, please elaborate. 

Coordinator
Sep 9, 2008 at 10:05 PM

Kalle, thanks. 
It is not simply a matter of obtaining the proper bytes within the code page. 

I spent some time debugging the current implementation and came to a conclusion that might be the reason why it's not UTF-8 but still it's not working...

I don't know quite what you mean by "why it's not UTF-8". I never built DotNetZip to do UTF-8. Actually DotNetZip was created before PKWare specified how to handle UTF-8 in a zip archive. So UTF-8 is a new feature request.

I think it is not a huge problem to build the UTF-8 encoding - it is a matter of following the rules in the PKWare spec. The problem is, I don't have a definitive zip engine against which to test. My Windows is English, and does not allow UTF-8 in zip archives. If I try to open a zipfile that contains UTF-8 chars in Windows Explorer, Windows Explorer chokes. I don't know any other tool that implements UTF-8. So ... I can build it but have no way to verify that what I build is interoperable with anything else.

Sep 12, 2008 at 3:18 AM
Dear Cheeso,

===>Do you mean that if you, using WinRAR, zip up a file with Chinese characters in the name, the operation fails?
Yes, you are right.

===>And that is because Chinese characters require UTF-16, yes?  I think UTF-8 is not sufficient for Chinese.  Is that right?  (Sorry, I am somewhat of a novice on Unicode.)
Well, I tried the following code. Neither UTF-8 nor UTF-16 works; only code page 950 does.


        internal static string StringFromBuffer(byte[] buf, int maxlength)
        {
            // return Encoding.UTF8.GetString(buf);              // ==> failed

            // return Encoding.GetEncoding(950).GetString(buf);  // ==> works

            return Encoding.Unicode.GetString(buf);              // ==> failed
        }

I thought WinRAR used UTF-8... I still don't understand why UTF-8 doesn't work.

PS. Traditional Chinese uses code page 950.

PS. I am using the English version of Win2003 Server for development.
However, there is an option called "Install files for East Asian languages" under the regional settings in Control Panel.
With it enabled, you can read Chinese, Japanese, and Korean on your system,
and then I can send you a zip file with Chinese characters in it.


Coordinator
Sep 12, 2008 at 4:02 PM
It sounds like neither of us knows the answer.
I think that UTF-8 is required for languages like Portuguese and Spanish and Danish.
But UTF-16 is required for Chinese.
But I am not certain.

What I do know is that the zip format supports UTF-8, but not UTF-16.

Also, I think that Windows Vista compressed folders do not support zip files with UTF-8 filenames.  The zip spec changed AFTER Vista was finalized.

As a result I think if I create a library that can read & write zip files with entries that contain UTF-8 filenames, nothing else - no other application or system besides an app built on DotNetZip - will be able to read the zip files.

Coordinator
Sep 17, 2008 at 7:20 AM
Unicode support is now ready for test!
It is available in the v1.6 prelim release, as of one minute ago.
Please test.
---
I did some homework.
It's not true that UTF-16 is required for Chinese.  UTF-8 is sufficient for all Unicode characters.
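
A small sketch that checks this claim, using only the standard System.Text encodings. The class name, method name, and sample filename are illustrative.

```csharp
using System;
using System.Text;

// Illustrative check that UTF-8 can carry Chinese text without loss.
static class Utf8CoverageSketch
{
    // Encodes the text as UTF-8, decodes it again, and reports whether
    // the result is identical to the input.
    public static bool RoundTripsAsUtf8(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes) == text;
    }

    // Number of bytes the text occupies in UTF-8.
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Number of bytes the text occupies in UTF-16 (Encoding.Unicode).
    public static int Utf16Length(string text)
    {
        return Encoding.Unicode.GetByteCount(text);
    }
}
```

For the filename "\u4E2D\u6587.txt" (two Chinese characters plus ".txt"), Utf8Length gives 10 (three bytes per Chinese character, one per ASCII character), Utf16Length gives 12, and RoundTripsAsUtf8 returns true. The two encodings differ in size, not in which characters they can represent.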

After doing a bit of thinking and playing, I have extended the library so that it now does UTF-8 encoding of filenames and comments.  There is one change to the interface: the UseUnicode bool property on the ZipFile class.  Set this flag to true before adding files, and if those files have names that need Unicode encoding, they will be encoded properly.
Check the doc on this flag for full details.

Unicode, though, is not interoperable with Windows Explorer (compressed folders).  So I have also modified the library to use IBM437 encoding, rather than ASCII encoding, when not using Unicode. This will cover some of the common scenarios, and is compatible with Windows Explorer. 
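
To illustrate why IBM437 covers more than ASCII, here is a sketch that round-trips a filename through each encoding. It is not DotNetZip code; the class and method names are hypothetical. It assumes a modern .NET runtime, where code page 437 has to be registered through CodePagesEncodingProvider (on the .NET Framework it is built in).

```csharp
using System;
using System.Text;

// Illustrative comparison of ASCII and IBM437 for entry names.
static class Ibm437Sketch
{
    // Encodes the text with the given encoding and decodes it again.
    // Characters the encoding cannot represent come back altered.
    public static string RoundTrip(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }

    // Returns the IBM437 (original IBM PC) code page.
    public static Encoding Ibm437()
    {
        // Assumption: modern .NET, where legacy code pages must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(437);
    }
}
```

RoundTrip("caf\u00E9.txt", Encoding.ASCII) returns "caf?.txt", because ASCII has no e-acute, while RoundTrip("caf\u00E9.txt", Ibm437Sketch.Ibm437()) returns the name unchanged. Chinese characters survive neither encoding, which is why they need the UTF-8 path.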

Sep 17, 2008 at 12:11 PM
Dear Cheeso,

I have downloaded the latest version, which is "DotNetZip-24459".
I still get the same problem when I try to extract a file with a Chinese filename (from a zip file built with WinRAR).

However, if I use DotNetZip to both zip and unzip the file, everything is fine.  I can even zip the file with DotNetZip and unzip it with WinRAR without any problem.

That's my test report ~
Coordinator
Sep 17, 2008 at 2:44 PM

Hey Jawc, can you send me a zip file produced by WinRAR that includes files with Chinese characters in their names?

I think WinRAR may be using the Info-ZIP format, which I did not implement. (There are 2 ways to do UTF-8 inside a zip file and I implemented only one of them.)

thanks!

If necessary open a workitem and attach the zipfile there.

 

Sep 18, 2008 at 2:43 AM
Dear Cheeso,

I have opened a workitem.
Check out~

If you need other help, let me know, please
Thanks
Coordinator
Sep 18, 2008 at 4:30 AM
K, thanks. Will check it out.
Coordinator
Oct 4, 2008 at 1:35 AM

jawc, did you see I posted some code that handles your .zip file?

Can you try it out for me?   I'd love to get some feedback on it.

Oct 4, 2008 at 4:15 AM
Dear Cheeso,

Yes, I did...
I have left my comment here:

http://www.codeplex.com/DotNetZip/WorkItem/View.aspx?WorkItemId=6199
Coordinator
Oct 4, 2008 at 11:41 PM

@Jawc, yes, correct - in the way the winrar archive was written,

I may be wrong about this, because I am answering based on my inspection of a single archive.  It is possible the zip produced by winrar complies with a specification and somewhere in that spec, it describes how to specify the encoding that was used when the archive was produced.   But I did not see that in the zip file from winrar.

The zip spec says that there are 2 encodings to use: UTF-8 or IBM437.  Your zip was produced with code page 950.  I found no indication inside the zip that this was the code page used.
There is a way, in the zip spec, to attach extended "metadata" information to each entry in the archive.  I thought perhaps that WinRAR would include the code page in this extended metadata section.  But I found no extended metadata.

To summarize:
According to the zip spec, there is no way to specify an arbitrary encoding used when encoding the names of entries in a zipfile.  This means a zip library cannot automatically detect which encoding to use as it reads the zip.

It is possible that WinRAR has done something "outside the spec", in which case I could modify DotNetZip to accommodate that.  There are some obvious places to look, within the zipfile, for the encoding that was used.  But when I inspected the zipfile produced by WinRAR, I found no indication of the code page used when encoding the filename.

I will look to see if I can find a specification for how winrar does it - if the codepage really is not included in the zipfile metadata at all.
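
The spec rule described above can be sketched as follows. In the PKWARE APPNOTE, bit 11 of an entry's general purpose bit flag marks the filename and comment as UTF-8; when it is clear, the spec's encoding is IBM437. The helper below is hypothetical and not DotNetZip's actual code; the fallback parameter is an assumption added for illustration, so that a caller who already knows an archive was written with some other code page (such as 950) can say so.

```csharp
using System;
using System.Text;

// Hypothetical helper, not DotNetZip's implementation.
static class EntryNameDecodingSketch
{
    // Bit 11 of the general purpose bit flag: names and comments are UTF-8.
    const ushort Utf8Flag = 0x0800;

    // Decodes the raw filename bytes of a zip entry.
    // 'fallback' is used when bit 11 is clear. Per the spec it would be
    // IBM437; passing another encoding is the caller's out-of-band guess,
    // since the archive itself does not name the code page.
    public static string DecodeEntryName(byte[] rawName, ushort generalPurposeFlags, Encoding fallback)
    {
        bool isUtf8 = (generalPurposeFlags & Utf8Flag) != 0;
        Encoding encoding = isUtf8 ? Encoding.UTF8 : fallback;
        return encoding.GetString(rawName);
    }
}
```

With the flag set, the bytes E4 B8 AD decode to the single character U+4E2D. With the flag clear, the same call relies entirely on the fallback, which is exactly the situation with the WinRAR archive: nothing in the file says the fallback should be 950.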