AddFileFromString() method saves text encoded in UTF-8

Jun 5, 2009 at 8:13 AM

Hello.

First, congratulations, your DotNetZip is a great piece of code!

I’m a programmer from Spain (Europe) and I have an issue with your AddFileFromString() method:

The string passed is in Unicode (16 bits per character), but you write it (intentionally or unintentionally) encoded as UTF-8 into the zipped file. In Spain we use (as you do in the USA) the Windows CP-1252 code page (Western), but in Spanish we frequently use letters beyond A-Z (such as the letter «Ñ» in «España», ‘Spain’ in Spanish; other European languages, such as French, German, Portuguese, Danish, etc., also use such characters). The ISO-8859-1 part of the Windows CP-1252 code page is an 8-bit encoding, so we need only a single byte to store a single character. But when your method transcodes the string into UTF-8, character codes 160 to 255 (A0h to FFh in hexadecimal) are stored as two bytes. So the resulting file is longer in bytes than in characters. When those text files are read with programs that expect only ANSI text, strange things happen with the UTF-8 encoded characters: they appear as “EspaÃ±a” instead of “España”.
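
The size difference and the garbling described above can be reproduced in a few lines. This is only an illustrative sketch in Python (standard library, not DotNetZip); the variable names are made up for the example.

```python
# The word from the post, containing one character outside A-Z.
text = "España"

ansi_bytes = text.encode("cp1252")  # Windows-1252: one byte per character
utf8_bytes = text.encode("utf-8")   # UTF-8: "ñ" becomes the two bytes C3 B1

# Reading the UTF-8 bytes as if they were ANSI text produces the garbling.
misread = utf8_bytes.decode("cp1252")

print(len(text), len(ansi_bytes), len(utf8_bytes))  # 6 6 7
print(misread)                                      # EspaÃ±a
```

The extra byte per accented character is why the zipped file ends up longer in bytes than in characters.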

I tried converting the strings to MemoryStream objects and using the AddFileStream() method, and it works, but this forces me to keep a lot of MemoryStreams open while adding a lot of text files to the ZIP file (I’m talking about thousands of text files per ZIP file, which is my case), whereas the AddFileFromString() method does not.

Well, UTF-8 is “text” after all, but not ANSI text. I suggest you implement an ANSI version of your method (or control its behaviour with a new property, or a new parameter) in order to ensure ANSI output from an ANSI string. This way, the resulting files will be output exactly as expected in size and encoding. A Unicode version (or behaviour, or parameter) would be welcome as well, since you might wish to store Unicode strings (Arabic, Cyrillic, Chinese, etc.) as 16-bits-per-character text files (this case would also need both the big-endian and little-endian flavours), but personally I don’t need this.

Another suggestion (not incompatible with the former) is to implement a new AddFileFromByteArray() method, to which you could pass an arbitrary binary-content Byte array the same way AddFileFromString() does with a string (that is, without the need to keep MemoryStream objects open until calling the Save() method).

And one more suggestion: you could implement a “ZipItem.RealizeStream()” method that compresses and/or encrypts the ZipItem's associated input stream in memory and releases the original input stream, in order to free the resource before the program calls the Save() method.

Here is an ANSI string if you want to test (the full range of ISO-8859-1, or ANSI Latin-1, “high” characters is given; the first “space” is the Non-Break Space character):

 ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Meanwhile, I’ll deal with the MemoryStream objects.

If all of this is too much work for you, I’ll try to get the source code and change it, in order to get the proper ANSI behaviour that I need, but this could be an adventure for me.

Thank you very much in any case, your DotNetZip library makes things much easier for me!

Yours,

Ricardo Cancho

Madrid – Spain

 

Coordinator
Jun 5, 2009 at 1:09 PM
Edited Jun 5, 2009 at 2:04 PM

Hello Ricardo, and thanks for your mail, very clear and well conceived.

First, there is a workaround, available in v1.8.  In v1.8, DotNetZip allows you to specify null for the stream (Nothing in VB), in the call to AddFileFromStream().  In this case, your app can open the stream in the SaveProgress event, just at the moment the stream is being read, when saving the zip file.  It can then close the stream just after the library has finished reading, in the same way.  I have called this "just-in-time stream provisioning", and it allows an application to keep only one stream open at a time, avoiding the scale issue you described as the number of files and streams becomes large.
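
To make the idea concrete, here is a rough sketch of the same pattern in Python (standard library only). It is not DotNetZip code: in DotNetZip the hook is the SaveProgress event, while the `LazyZipWriter` class and its methods below are hypothetical names invented for this illustration.

```python
import io
import zipfile


class LazyZipWriter:
    """Hypothetical helper: records entries now, opens their streams only while saving."""

    def __init__(self):
        self._entries = []  # list of (entry name, callable returning a binary stream)

    def add_entry(self, name, open_stream):
        # Nothing is opened here; only the recipe for opening is stored.
        self._entries.append((name, open_stream))

    def save(self, target):
        with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
            for name, open_stream in self._entries:
                # Open the source just in time, copy it, and close it
                # before moving on, so only one stream is open at a time.
                with open_stream() as source:
                    zf.writestr(name, source.read())


writer = LazyZipWriter()
for i in range(3):
    # The default argument binds the current value of i for each provider.
    writer.add_entry(f"file{i}.txt",
                     lambda i=i: io.BytesIO(f"content {i}".encode("cp1252")))

archive = io.BytesIO()
writer.save(archive)
```

The point of the design is that the number of simultaneously open streams stays at one, no matter how many entries are added.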

How this works is described, with example code, in the doc: http://cheeso.members.winisp.net/DotNetZipHelp/html/daf87dcb-ac4c-58c8-7f8b-0a03a7e586b4.htm .   (Looking at the doc, I can see that there is not enough information describing this behavior in the reference for AddFileFromStream() - a big oversight.  I will correct the doc.)   I am interested to see if you believe this feature will satisfy your requirement. 

I think your suggestion to add an overload for AddFileFromString, to allow an Encoding parameter, is also a good one, and as you say, is not incompatible or redundant with what is already in the interface.

I am hesitant to introduce the AddFileFromByteArray method, because I think this can be better accomplished using the existing streams support.   But on this I could be convinced.

Coordinator
Jun 5, 2009 at 2:10 PM

Ricardo, you pointed out that the AddFileFromString() method always encodes strings using UTF-8.  This is true, though the documented behavior of AddFileFromString() is to add the string in the default text encoding.  Therefore this is a bug.  I've opened up a workitem to change AddFileFromString to behave as documented.  In addition, on that workitem, I will introduce a new overload to the AddFileFromString to allow a text encoding.  I have also opened a separate workitem to track the request for better documentation of the just-in-time stream provisioning capability.  Here are the links for your convenience.

AddFileFromString behavior: http://dotnetzip.codeplex.com/WorkItem/View.aspx?WorkItemId=7858

Better doc: http://dotnetzip.codeplex.com/WorkItem/View.aspx?WorkItemId=7859

 

Jun 5, 2009 at 3:14 PM

Thank you for your interest in the subject!

Hum... the "just-in-time stream provisioning" workaround is an interesting technique, no doubt, but (if I understand your explanations) it doesn't fit my requirements.

Let me explain. The thousands of text files my program zips into a single one (with your library) are synthesised in memory in real time while reading on the fly from already encrypted files, which are encrypted with a proprietary, non-standard algorithm to increase security. I don't have (and I *must not* have) any .NET IO Stream-derived objects to manage the encrypted source files. So the easiest and quickest way to perform the in-memory zipping is to read every file into a single string and use the AddFileFromString() method, no doubt.

But the UTF-8 issue changed my plans. Now I read into MemoryStream objects and use the AddFileStream method, which works fine, except that I must keep them open until calling the Save() method. Even this is acceptable (our customer is satisfied with it), but the text files stay in memory as plain text for longer (minutes) than I would like; I would prefer to synthesize the string, zip it, overwrite it and dispose of it as quickly as possible. I'm not especially obsessed with security myself, but I must deal with files subject to special legal considerations in Spain, and I must treat them with extreme care; I don't want to go to jail! X-D

One of the reasons I chose your DotNetZip library was that it supports AES-128 encryption, unlike many (if not all) other ZIP-related products. And I must say that, for this purpose, your product works as expected. Amazing.

As I am happy with DotNetZip, my contribution was made in order to improve your fantastic (even *magic*) library. I know that, because they are developed in English, many I18N issues are not discovered until products made in the USA are used in foreign countries, as is the case here. So a universal DotNetZip will be welcome in every corner of the planet!

See you.

Ricardo Cancho

Jun 5, 2009 at 4:00 PM

About a hypothetical new AddFileFromByteArray() method:

I think that the possibility of synthesizing any arbitrary binary content (let's say, some encrypted text or a JPEG image) exclusively in memory and dumping it directly into a ZIP file would be an excellent resource, better than dumping it first into a temporary file on disk and then zipping it with the AddFile() method, especially in cases where data security considerations come knocking at your door!

Of course, again a MemoryStream object could be used for this purpose, if it (or they) could be flushed quickly into the ZIP file, and its memory disposed right after.

So the need won't be merely "to write binary data into the ZIP file" but "to write binary data into the ZIP file RIGHT NOW!". Let's say, "on-the-fly".

Maybe a good implementation of the "just-in-time stream provisioning" technique could accomplish that, but... what could be simpler than a simple call? :-D

Perhaps a new "NoBuffering" property or similar could control how the in-memory strings, byte arrays and (of course) MemoryStreams that you add as ZipItems are flushed into the ZIP file. If you want to open another work item for this, it's up to you... I'm already happy with the DotNetZip library as it is. The issues raised are, in any case, minor bugs to me.

See you soon.

Ricardo Cancho.

 

Coordinator
Jun 5, 2009 at 5:18 PM

Ricardo - hmm, about that NoBuffering suggestion.   Maybe a new property called "AutoSave" on the ZipFile, which, when set, tells the ZipFile to call Save() after every AddXxxx().  That could be interesting, and would mean applications would not need to keep strings, byte arrays, and streams around.   

Maybe AddFileFromByteArray is not such a bad idea after all.

Coordinator
Jun 6, 2009 at 1:38 AM
Edited Jan 10, 2010 at 3:53 PM

Ricardo, I just released v1.8.3.17.

It has an AddFileFromString method that accepts a System.Text.Encoding.  Also I documented the just-in-time stream provisioning.
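
For readers who want to see what an encoding-aware "add text" call amounts to, here is a small sketch in Python (standard library only, not the DotNetZip API). The helper name `add_text_entry` and the entry names are hypothetical; the idea is simply to encode the string explicitly and store the resulting bytes.

```python
import io
import zipfile


def add_text_entry(zf, name, text, encoding="cp1252"):
    """Hypothetical helper: encode the text explicitly, then store the bytes."""
    zf.writestr(name, text.encode(encoding))


archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    add_text_entry(zf, "ansi.txt", "España", "cp1252")
    add_text_entry(zf, "utf8.txt", "España", "utf-8")

# Read the archive back and compare the uncompressed entry sizes.
with zipfile.ZipFile(archive) as zf:
    sizes = {info.filename: info.file_size for info in zf.infolist()}

print(sizes)  # {'ansi.txt': 6, 'utf8.txt': 7}
```

With the encoding chosen by the caller, the ANSI entry is exactly one byte per character, which is the behaviour requested at the start of this thread.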

You can get these fixes now.

Addendum: In v1.9, the ZipFile.AddFileFromString() method was renamed to ZipFile.AddEntry()

Jun 8, 2009 at 1:13 PM

Thank the Lord! At last, good news on a Monday morning...

Thank you very much. I'll try it as soon as possible.

Yours.