too many encoding definitions

Apr 28, 2009 at 7:05 AM
We can find many ISO-8859-1, IBM437, and UTF-8 encodings scattered throughout the code.

It would be better to have a single, central static encoding definition for developers to set.
Coordinator
Apr 28, 2009 at 5:02 PM
Edited Apr 28, 2009 at 9:59 PM
Thanks for your comment.  I agree with the general principle that simpler is better.
I've tried to follow that in the zip library in general.   Relating to Unicode, my "simplicity is better" approach led me, in versions prior to v1.5, to provide no unicode support at all in the library!  Very simple!

But some users wanted more power, at the expense of simplicity.  They wanted unicode.  In fact it was the #1 requested feature.  In response, the first unicode support came in v1.6 of DotNetZip.  This introduced new properties and new behaviors, which meant the code was not as simple, and it was more complicated to document.  The default behavior was the same, except for a few cases.  The balance between simplicity and power shifted a bit. 

Then some users who had tried WinRar found that zips generated by WinRar did not comply with the ZIP spec but instead used the default code page on the computer.  Therefore ZIPs generated by DotNetZip would not be read by WinRar and vice-versa.  So I introduced some new capability that complied with the reality of the marketplace as well as with the rules in the spec.  And this meant more code, more properties and more subtle possibilities in behavior.  Imagine, a property named "ProvisionalAlternateEncoding"!!   And there is quite a lot of explaining in the doc for just that one property.  Why so complicated?  I'm still trying to balance simplicity/usability versus power, along with correctness and compatibility.  These are all in tension: a strictly correct library will not be compatible, and a strictly compatible library will not be correct.   The very simplest library will lack power.  As always, I am taking direction from the community, and I tried to strike the right balance with the Unicode changes for v1.7. 

Other users wanted GZIP capability, so I added that as well, which required the use of ISO-8859-1, in accordance with the GZIP spec.

I can understand the general lament that "unicode is complicated".  But that is a reality I cannot change.  I am only trying to produce a zip library that recognizes that reality in the best way.

If on the other hand you say, "there are too many uses of encoding in the library" I am not sure what to do with that information.  It's like saying "This music has too many notes."  Which notes should be removed? 

If you have some more specific ideas on how to optimize or simplify the interface of DotNetZip, I am open to hearing them. 

If you are suggesting the internals of the code could be cleaner, that is quite true. Do you have specific suggestions there?

If you have more specific questions on how to use the unicode stuff, I'll see if I can answer those as well.

In general, here are the usage guidelines for Unicode in DotNetZip:
  • Don't use any of the Unicode properties if your file names are ASCII (or at least IBM437-compliant). If you don't use the Unicode-related properties in your app, then the original IBM437 encoding is used exclusively.  This seems safest and is often what you want.
  • If you use a single code page, and want compatibility with WinRar and other tools that create or read zip files using the machine's default code page, provide a specific code page with ProvisionalAlternateEncoding (e.g., big5, iso-8859-1).
  • If you have files with names that use various code-pages, specify UseUnicodeAsNecessary = true and the right things will happen, but understand that Windows Explorer compatibility will suffer. 

I guess I should put that in the doc.

Apr 28, 2009 at 8:51 PM
Unicode is the only way to have zip fully compatible with worldwide languages.
This is something that is in ZIP specifications and it's working very well in DotNetZip (which is a major advantage of this library).

Thanks Cheeso for supporting this feature !
Apr 29, 2009 at 1:53 AM
Edited Apr 29, 2009 at 1:57 AM
Firstly, I would really like to thank Cheeso for your excellent code.

Secondly, thank you, Cheeso, for your kind and detailed reply.

Thirdly, I apologize for not explaining clearly enough. What I mean is that there are quite a few public/static System.Text.Encoding fooEncoding = IBM437/UTF8/ISO_8859_1 definitions (or something like that) here and there, which really confuse me.

Yes, I should read the doc carefully; sorry about that. But the encoding really does cause problems on my computer. I am not using the English version of Windows; I am using the Simplified Chinese version, which means that with IBM437/ISO_8859_1, file names and text content are not handled correctly.

Currently, I have changed all the IBM437 and ISO-8859-1 encodings to Encoding.Default, which solves my problem. But it would be really nice if there were a "central" property that the end user could set in the options window of a real-world business application.

And yes, I know you are working really hard on those encodings, and I really appreciate that.


Coordinator
Apr 29, 2009 at 6:43 PM

Hello unruledboy. 

ISO-8859-1 is specified by the GZIP format.  It should never be Encoding.Default, or you will generate a GZIP stream that is not compatible with any other GZIP library or tool. It could be that you don't use the GZIP function in the library at all, in which case it does not matter.
IBM437 is specified as the default encoding for zip files by PKWARE in the ZIP specification.  UTF-8 is also specified in the ZIP specification.
Changing any of these encodings to Encoding.Default in the source code will produce a library that does not comply with the specifications.  It may produce zipfiles that are unreadable by other tools.

I think that you want to be able to produce and read zipfiles with entries that have filenames containing Simplified Chinese characters.
If this is true, you can accomplish what you want through the documented, public interface of DotNetZip, without modifying the library source code.

If you want to create zip files that contain entries with filenames using a particular encoding, there are constructors for ZipFile that accept an Encoding.  You can pass Encoding.Default there. On your machine, this will produce a zipfile that uses the Simplified Chinese code page, and your filenames will be properly stored.  You must be careful to unpack such a zipfile using the same code page. Check the doc on DotNetZip for more information on that.
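
The "same code page on both sides" caveat can be seen with nothing more than raw bytes. The Python sketch below is an illustration only, not DotNetZip code; `roundtrip` is a hypothetical helper, and `gbk` is used as a stand-in for the Simplified Chinese Windows code page.

```python
def roundtrip(name: str, write_code_page: str, read_code_page: str) -> str:
    """Encode a file name as a zip writer would, then decode as a reader would."""
    stored_bytes = name.encode(write_code_page)
    return stored_bytes.decode(read_code_page)


if __name__ == "__main__":
    original = "报告.txt"

    # Writer and reader agree on the code page: the name survives.
    print(roundtrip(original, "gbk", "gbk") == original)

    # Reader falls back to the ZIP default (IBM437): the bytes decode
    # without error, but to different characters.
    print(roundtrip(original, "gbk", "cp437") == original)

    # IBM437 itself has no Chinese characters, so it cannot store the name.
    try:
        original.encode("cp437")
    except UnicodeEncodeError:
        print("IBM437 cannot represent this name")
```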

You can also set the ZipFile.UseUnicodeAsNecessary property, and it will produce zipfiles encoded in UTF-8, which will also allow Chinese characters.   However, as I described in the doc, a UTF-8 encoded zip file, while it is correct, may not be compatible with Windows Explorer. 
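
For comparison, Python's standard `zipfile` module follows a similar "UTF-8 only when needed" policy, which makes it handy for seeing what such a zipfile looks like. The sketch below is not DotNetZip code and `utf8_flags` is a name invented for the example. It writes two entries and reports which ones carry the UTF-8 marker (general purpose bit 11, value 0x800) defined in the ZIP specification. Note that Python switches to UTF-8 for any non-ASCII name, which may be a lower threshold than DotNetZip uses.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: entry name is UTF-8


def utf8_flags(names: list[str]) -> dict[str, bool]:
    """Write one small entry per name and report which use the UTF-8 flag."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"data")
    # Reopen the archive and inspect the flags recorded for each entry.
    with zipfile.ZipFile(buf) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}


if __name__ == "__main__":
    for name, is_utf8 in utf8_flags(["readme.txt", "报告.txt"]).items():
        print(name, "UTF-8" if is_utf8 else "default encoding")
```

Tools that ignore this flag will decode the UTF-8 bytes with some other code page, which is the kind of incompatibility described above.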

The published interface that supports Unicode is sufficient to allow programs to create and read zipfiles that have entries with filenames containing Simplified Chinese characters, or characters from any code page.  You do not need to change the library source code to handle Simplified Chinese.    

I hope this helps you out!