
Force UTF-8 in filenames

May 25, 2011 at 8:03 PM
Edited May 25, 2011 at 8:03 PM

Hello everybody,

Is there a way to force IonicZip to always use UTF-8 when storing filenames?

As far as I understood from the documentation for UseUnicodeWhenNecessary and ProvisionalAlternateEncoding, UTF-8 is only used if IBM437 does not suffice. I need to override this code page with UTF-8, no matter what.





May 26, 2011 at 1:04 PM

Could you explain WHY you need UTF-8 "no matter what"?

What specific problems are you experiencing with UseUnicodeWhenNecessary = true?
Do you have test code that demonstrates these problems?


May 26, 2011 at 1:12 PM
Edited May 26, 2011 at 7:55 PM

I have an application that does some processing on ePub files. An ePub book is a standard zip file with a specific structure that contains HTML or XHTML files. The links in these (X)HTML files (to other files in the book) are UTF-8 encoded, the de facto standard in (X)HTML. So I also need to encode the zip file names with UTF-8; otherwise the ePub readers will not be able to navigate the books. Defaulting to IBM437 whenever possible is not an option because I cannot alter the (X)HTML files: they contain copyrighted material and I'm not allowed to change even a dot.

May 27, 2011 at 7:50 PM

> So, I need to also encode the zip file names with UTF-8,

I didn't follow that leap. Can you explain to me why, if the file is UTF8 encoded, the filename itself must also be UTF8 encoded?

Is it true that epub readers cannot read the files in question? The ones I tried were able to read them. Is this conjecture, or have you seen actual failures? If you have an actual failure, can you describe it please: what tool are you using, and what error or exception do you get?

May 27, 2011 at 8:56 PM
Edited May 29, 2011 at 1:08 AM

It is an actual error; here is the situation. I have an XHTML file called Forhæng.xhtml. The ePub's toc.ncx references the file as Forh%C3%A6ng.xhtml, that is, UTF-8. But that specific character is perfectly encodable with IBM437, so when stored in the zip, the file name ends up as Forh%91ng.xhtml, that is, IBM437. Also, another file inside the book has an anchor whose href attribute uses the same UTF-8 encoded file name.

When navigating the toc, the ePub reader cannot find the file inside the zip, because the entry is named differently. I cannot touch toc.ncx as I have it from the publisher. So the only option is to force the zip file name to be UTF-8 encoded so that it matches the one in the toc.
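The mismatch described above can be reproduced with the .NET base class library alone. The following is a minimal sketch, not DotNetZip code: `PercentEncode` and `GetIbm437` are hypothetical helpers written for this illustration, and the provider registration assumes .NET Core 3.0 or later (on .NET Framework, code page 437 is available without it and that line can be dropped).

```csharp
using System;
using System.Text;

public static class EntryNameEncodingDemo
{
    // Hypothetical helper: percent-encodes every byte that is not an
    // ASCII letter, digit, '-', '.', '_' or '~'.
    public static string PercentEncode(string name, Encoding encoding)
    {
        var sb = new StringBuilder();
        foreach (byte b in encoding.GetBytes(name))
        {
            char c = (char)b;
            bool unreserved = b < 0x80 &&
                (char.IsLetterOrDigit(c) || c == '-' || c == '.' || c == '_' || c == '~');
            if (unreserved)
                sb.Append(c);
            else
                sb.Append('%').Append(b.ToString("X2"));
        }
        return sb.ToString();
    }

    // Hypothetical helper: returns the IBM437 encoding.
    public static Encoding GetIbm437()
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where legacy code pages
        // must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(437);
    }

    public static void Main()
    {
        string name = "Forh\u00E6ng.xhtml"; // "Forhæng.xhtml"

        Console.WriteLine(PercentEncode(name, Encoding.UTF8)); // Forh%C3%A6ng.xhtml
        Console.WriteLine(PercentEncode(name, GetIbm437()));   // Forh%91ng.xhtml
    }
}
```

The same name yields two different byte sequences, so a reader that compares the bytes behind the link with the bytes stored in the zip entry will not find a match.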


UPDATE (2:30 hours later): I just got a new ePub file (that is, a ZIP archive) which contains a file named Sværd.xhtml (same character as above), with the zip entry name UTF-8 encoded. I cannot explicitly access that file because DotNetZip encodes the entry name parameter with IBM437. So forcing the encoding to UTF-8 has become even more necessary.


Not being able to always encode entry names with UTF-8 is a problem for both reading and writing ZIP files. Other ZIP tools are already doing it; this new file I got proves it. We live in a UTF-8 world already :)


UPDATE2 (a day later): found the method public static ZipFile Read(string fileName, TextWriter statusMessageWriter, System.Text.Encoding encoding) which allows me to specify an Encoding. So the reading problem is solved.
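A minimal sketch of how that overload might be used, assuming a DotNetZip version that still exposes it (other versions may offer the same option through a different API). The path "book.epub" and the location of the entry at the archive root are placeholders, not details from this thread.

```csharp
using System;
using System.IO;
using System.Text;
using Ionic.Zip; // DotNetZip; assumes a version that has this Read overload

public static class ReadWithUtf8Names
{
    public static void Main()
    {
        TextWriter status = null; // no status messages

        // "book.epub" is a placeholder path.
        using (ZipFile zip = ZipFile.Read("book.epub", status, Encoding.UTF8))
        {
            // Assumes the entry sits at the archive root; adjust the path as needed.
            ZipEntry entry = zip["Sv\u00E6rd.xhtml"]; // "Sværd.xhtml"
            if (entry == null)
            {
                Console.WriteLine("Entry not found.");
                return;
            }

            using (var content = new MemoryStream())
            {
                entry.Extract(content);
                Console.WriteLine("{0}: {1} bytes", entry.FileName, content.Length);
            }
        }
    }
}
```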

I've changed the DotNetZip source code myself to force UTF-8 encoding when saving zip files. I added a property, bool ZipEntry.ForceProvisionalAlternateEncoding, which overrides the default IBM437 behavior. I don't have the luxury of time :)

May 29, 2011 at 9:35 PM

Ah, I see. Well, you've gone through a bunch of details there.

A Zip file that is compliant with the spec is either IBM437 or UTF8. The zip file is marked to say which one. If it is not marked, it's a non-compliant zip file. You should NEVER have to specify UTF8 encoding when reading a zip file, because UTF8 encoding is already marked in the zip. In your case perhaps the zip file was not properly marked.

I'm glad you looked into the code and all, but I still think there is no need to change the source code to accomplish what you want. It's very easy to save a zip file with UTF8 encoding for filenames: specify filenames that require UTF8 encoding. In your case, reading the epub file, which I suspect was broken, gave you ZipEntry names that were not UTF8 encoded. This is a sign that the epub file was not properly constructed. Even in this case, it seems to me, all you need to do is use the overload of ZipFile.Read(), specifying UTF8, and then you have UTF8 encoded filenames. You should not need to modify any DotNetZip source to accomplish this.
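For reference, the marking discussed here is bit 11 (mask 0x0800) of the general purpose bit flag in each entry's header, as defined by the ZIP specification. The sketch below checks that bit in the first bytes of a local file header; `ZipUtf8Flag` and `HeaderDeclaresUtf8` are hypothetical names used only for illustration and are not part of DotNetZip.

```csharp
using System;

public static class ZipUtf8Flag
{
    // Local file header signature "PK\x03\x04", read as little-endian.
    private const uint LocalFileHeaderSignature = 0x04034b50;

    // Bit 11 of the general purpose bit flag: names and comments are UTF-8.
    private const ushort Utf8FlagMask = 0x0800;

    // Hypothetical helper: inspects the first 8 bytes of a local file header.
    public static bool HeaderDeclaresUtf8(byte[] header)
    {
        if (header == null || header.Length < 8)
            throw new ArgumentException("Need at least 8 bytes of a local file header.");

        uint signature = (uint)(header[0] | header[1] << 8 | header[2] << 16 | header[3] << 24);
        if (signature != LocalFileHeaderSignature)
            throw new ArgumentException("Not a local file header.");

        // Bytes 4-5 hold "version needed to extract"; bytes 6-7 hold the flags.
        ushort flags = (ushort)(header[6] | header[7] << 8);
        return (flags & Utf8FlagMask) != 0;
    }

    public static void Main()
    {
        byte[] marked   = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x00, 0x08 };
        byte[] unmarked = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x00, 0x00 };

        Console.WriteLine(HeaderDeclaresUtf8(marked));   // True
        Console.WriteLine(HeaderDeclaresUtf8(unmarked)); // False
    }
}
```

An archive whose entry names are UTF-8 bytes but whose headers leave this bit clear would explain why a reader falls back to IBM437 unless told otherwise.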


May 29, 2011 at 9:56 PM

Well, I'm not sure about the reading problem. I didn't actually study the code, since I found that method which allows me to force the encoding. So I can't really tell whether the zip file was properly built; I'll have a look when time allows.


But, about writing UTF-8 encoded file names, here is a code snippet from ZipEntry.Write.cs, the _GetEncodedFileNameBytes() method, which is used to generate the name stored in ZipEntry's header:

// workitem 6513: when writing, use the alternative encoding only when ibm437 will not do.
byte[] result = ibm437.GetBytes(s1);
// need to use this form of GetString() for .NET CF
string s2 = ibm437.GetString(result, 0, result.Length);
_CommentBytes = null;
if (s2 != s1)
{
    // Cannot encode with ibm437 safely (the round trip changed the name).
    // Therefore, use the provisional encoding
    result = _provisionalAlternateEncoding.GetBytes(s1);
    if (_Comment != null && _Comment.Length != 0)
        _CommentBytes = _provisionalAlternateEncoding.GetBytes(_Comment);

    _actualEncoding = _provisionalAlternateEncoding;
    return result;
}

As you can see, the provisional encoding is used ONLY IF ibm437 is not enough. That special character from my previous post is encodable with ibm437, so there is no way to force it into UTF-8 with the original code.
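The decision described in this post can be sketched outside the library as a round-trip test. This is an illustration under assumptions, not the DotNetZip implementation: `RoundTrips`, `ChooseEncoding`, and the `forceAlternate` parameter are hypothetical (the last one loosely modeled on the ForceProvisionalAlternateEncoding property mentioned earlier in the thread), and the provider registration assumes .NET Core 3.0 or later.

```csharp
using System;
using System.Text;

public static class RoundTripCheck
{
    // True when encoding and decoding the name gives back the same string.
    public static bool RoundTrips(string name, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(name);
        return encoding.GetString(bytes, 0, bytes.Length) == name;
    }

    // Hypothetical selection logic: IBM437 when it round-trips, otherwise the
    // alternate encoding. forceAlternate skips the round-trip test entirely.
    public static Encoding ChooseEncoding(string name, Encoding ibm437,
                                          Encoding alternate, bool forceAlternate)
    {
        if (forceAlternate)
            return alternate;
        return RoundTrips(name, ibm437) ? ibm437 : alternate;
    }

    public static void Main()
    {
        // Assumption: .NET Core 3.0+ / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ibm437 = Encoding.GetEncoding(437);

        string danish = "Forh\u00E6ng.xhtml";   // "æ" exists in IBM437
        string cyrillic = "\u0416.xhtml";       // "Ж" does not exist in IBM437

        Console.WriteLine(RoundTrips(danish, ibm437));
        Console.WriteLine(RoundTrips(cyrillic, ibm437));

        // Without forcing, the Danish name stays IBM437; forcing yields UTF-8.
        Encoding normal = ChooseEncoding(danish, ibm437, Encoding.UTF8, false);
        Encoding forced = ChooseEncoding(danish, ibm437, Encoding.UTF8, true);
        Console.WriteLine(ReferenceEquals(normal, ibm437));
        Console.WriteLine(ReferenceEquals(forced, Encoding.UTF8));
    }
}
```

Because "æ" survives the IBM437 round trip, a selection rule of this shape never picks UTF-8 for that name unless something like the force flag is present.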

May 31, 2011 at 8:19 PM

Yes, I know the code, I'm very familiar with how it decides to use UTF8 or not.

If the character is encodable with IBM437, then I don't understand why it needs to be UTF8.

Is there something in the epub spec that says "must use UTF8"? I did not see anything like that when I read about epub, but I may have missed something. Could you please point me to the appropriate reference?

May 31, 2011 at 8:30 PM
Edited May 31, 2011 at 8:35 PM

No offense intended, but you're not reading my posts carefully. It's about the links in the xhtml files (that make the epub).

The (X)HTML standard doesn't care about IBM437. It only cares about ASCII and UTF-8. So any non-ASCII character specified in a link will be encoded with UTF-8. That link points to another file in the zip archive, so the link and the zip entry name must match. The former is UTF-8, the latter is IBM437, so they don't match when it comes to non-ASCII chars.

This has nothing to do with the epub itself, but with XHTML versus zip.


My third post in this thread has a very clear example about it.

Jun 2, 2011 at 2:34 PM

Well I don't understand.

I can see the need to explicitly request a UTF-8 encoding of filenames, though.  That's a good request. 

Jun 2, 2011 at 2:34 PM
This discussion has been copied to a work item.