Problem with Russian language - getting abrakadabra (rubbish) instead of letters

Aug 4, 2011 at 12:19 PM
Edited Aug 4, 2011 at 12:19 PM

I am using DotNetZip v1.9.1.6. Everything works well, but when I extract a ZIP file, I get abrakadabra instead of Russian letters in the file names.
I tried all the encodings and several different cultures, but nothing helped. Is there a way to extract Russian names correctly? Thanks in advance!

Here's the code.

    Sub Main()

        Dim sourceFolder = New DirectoryInfo(SOURCE_FOLDER)

        Call PrepareFolders()

        Console.WriteLine("Culture: {0}", Thread.CurrentThread.CurrentCulture.Name)

        For Each f In sourceFolder.EnumerateFiles
            Using zip = New ZipFile(f.FullName)
                zip.AlternateEncoding = Encoding.Unicode
                zip.AlternateEncodingUsage = ZipOption.Always
                Console.WriteLine("Unzipping file: {0}", f.Name)
                zip.ExtractAll(UNZIPPED_FOLDER)
            End Using
        Next

        Console.WriteLine()

    End Sub
Coordinator
Aug 4, 2011 at 1:12 PM

Hmm, well. I see some problems with the code you have.

Setting the AlternateEncoding property where you do is ineffectual: the zipfile has already been read by the time the property is set, so it has no effect on the entries that have been read.

Also: the point is moot anyway, because Unicode, in the form of UTF-8, is part of the zip specification. If the entry names in the zipfile are encoded with UTF-8, then the zipfile itself will store that fact, and when DotNetZip reads the zipfile, it will use UTF-8 as appropriate. (Note that Encoding.Unicode in .NET is UTF-16, which is not the Unicode encoding the zip specification uses.)

If you are getting junk (I think that's what "abrakadabra" means) for filenames on the entries, then the problem is likely that the zipfile was not encoded with Unicode when you saved it.  How was the zipfile generated?  It must be properly generated and encoded in order for you to successfully read it. If you used DotNetZip to create the zip, did you specify the AlternateEncoding property before adding any entries?

If DotNetZip did not create that zip, then it is possible that it does not use Unicode encoding. Maybe it uses a Russian code page, or THE Russian code page. If this is the case, then instead of using the new ZipFile() constructor to read the existing zipfile, use ZipFile.Read(), with the Read() overload that accepts a ReadOptions argument. In the ReadOptions, specify the Russian code page.
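A minimal sketch of the ZipFile.Read() approach described above, assuming the DotNetZip assembly (Ionic.Zip) is referenced and the program runs on the .NET Framework. The paths are hypothetical, and code page 866 is only an assumed candidate; 1251 is another common Russian code page, so the right value depends on how the archive was written.

```vbnet
Imports System.Text
Imports Ionic.Zip

Module ReadWithCodePage
    Sub Main()
        ' Hypothetical paths, for illustration only.
        Dim archivePath = "C:\incoming\archive.zip"
        Dim targetFolder = "C:\unzipped"

        ' The encoding must be supplied before the zip directory is read,
        ' so it is passed in through ReadOptions.
        Dim options = New ReadOptions()
        options.Encoding = Encoding.GetEncoding(866) ' assumed code page

        Using zip = ZipFile.Read(archivePath, options)
            zip.ExtractAll(targetFolder)
        End Using
    End Sub
End Module
```

The encoding is applied while the entries are being read, which is why setting a property on the ZipFile afterwards comes too late.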

clear?

-----

Regarding your setting AlternateEncoding - I can see that it is confusing to try to figure out when the various properties apply.  Some of them apply during read, some of them apply during Save, and so on.  I'm working on refactoring the classes to make the proper usage clearer.

 

Aug 4, 2011 at 2:16 PM

Those zip files weren't created with DotNetZip. I received them from another organization. Here's what I found out: the problem was the encoding. Both ZipFile.Read and New ZipFile work once the right encoding is passed in. I changed the encoding, and everything went fine! Thanks! As you can see, code page 866 was used for these zip files. Here's the working code:

    Enum CodePage
        OEMCyrillic = 855
        CyrillicDOS = 866
        CyrillicWindows = 1251
        CyrillicKOI8R = 20866
        CyrillicRussian = 20880
        CyrillicISO = 28595
    End Enum

    Sub Main()

        Dim sourceFolder = New DirectoryInfo(SOURCE_FOLDER)

        Call PrepareFolders()

        Try
            For Each f In sourceFolder.EnumerateFiles
                Console.WriteLine("Unzipping file: {0}", f.Name)
                Using zip = New ZipFile(f.FullName, Encoding.GetEncoding(CodePage.CyrillicDOS))
                    zip.ExtractAll(UNZIPPED_FOLDER)
                End Using
            Next
        Catch ex As Exception
            Console.WriteLine(ex.Message)
        End Try

    End Sub
P.S. This program is great! Thanks a lot! :)
Coordinator
Aug 4, 2011 at 3:18 PM

Super - I'm glad it worked for you.

To summarize the key point here: if the ZIP file has been created with a particular encoding, and that encoding is neither IBM437 (the default for zip files) nor UTF-8 (the other option allowed in the zip spec), then the reading application needs to specify the code page used to encode the filenames in the zip file, at the time of reading. The code page used to encode the filenames is often the default code page of the machine that created the zip, but not always. Using a code page other than IBM437 or UTF-8 is not allowed by the zip specification, but it happens anyway.
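To illustrate why the names come out garbled, here is a small sketch that encodes a name with one code page and decodes the same bytes with another. It assumes the .NET Framework, where code pages 437 and 866 are available through Encoding.GetEncoding; the module name, function name, and sample file name are made up for the example.

```vbnet
Imports System
Imports System.Text

Module MojibakeDemo
    ' Encodes a name with one code page, then decodes the same bytes with another.
    Function Reinterpret(name As String, writtenAs As Integer, readAs As Integer) As String
        Dim raw = Encoding.GetEncoding(writtenAs).GetBytes(name)
        Return Encoding.GetEncoding(readAs).GetString(raw)
    End Function

    Sub Main()
        Dim original = "Привет.txt"
        ' Written as code page 866 but read as IBM437: the Cyrillic letters turn into junk.
        Console.WriteLine(Reinterpret(original, 866, 437))
        ' Written and read as code page 866: the name survives the round trip.
        Console.WriteLine(Reinterpret(original, 866, 866))
    End Sub
End Module
```

The bytes are identical in both calls; only the code page chosen by the reader differs.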

Aug 5, 2011 at 5:41 AM

So, this is the problem - "specify the code page used to encode the filenames in the zip file, at the time of reading". I surfed the internet, and found out that it's impossible to detect file encoding. The following code reports UTF8 (which is not what I expected):

    Sub Main()

        Dim f = "C:\ZipFile.zip"

        Using sre = New StreamReader(f, True)
            Console.WriteLine(sre.CurrentEncoding)
        End Using

        Console.WriteLine()

    End Sub
 
To determine the encoding, I would have to open the file as a FileStream and read the first bytes... But I doubt that I could ever detect the Cyrillic DOS encoding that way. :)
Coordinator
Aug 5, 2011 at 5:07 PM

No....not quite.

>  I surfed internet, and found out that it's impossible to detect file encoding.

Right, but that is a slightly different issue. It is similar to the one you are encountering with the zip file, but not exactly the same.

Here's what I mean: your internet search, I guess, told you that, given an encoded text file, it is impossible to detect the encoding used to produce that file. But a zip file is not an encoded text file. Instead, a zip file is a binary file (and thus is not "encoded"). Within the zip file, though, there are sections that do represent encoded text. Specifically, the names of the entries (files) in the zip are encoded text values. The default encoding used for the entry names in a zip file is IBM437; there is an option to use UTF-8. There is a single bit in each zip entry (bit 11 of the entry's general-purpose bit flag) which, if set, says that UTF-8 was used. If it is not set, then the reader application is to assume that IBM437 was used. So no "detection" (or maybe more precisely, "deduction") of the encoding occurs upon reading a valid zip file that complies with PKWare's specification. A valid, spec-compliant zipfile itself tells the reading application (DotNetZip) explicitly and clearly which of the two legal encodings is used. No deduction.
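As a rough sketch of that rule: the flag in question is bit 11 (mask &H800) of the 16-bit general-purpose bit flag, stored little-endian at offset 6 of each local file header. The module and function names below are hypothetical, and the sample bytes are a hand-built header prefix rather than data from a real archive.

```vbnet
Imports System

Module Utf8FlagDemo
    ' Bit 11 of the general-purpose bit flag marks UTF-8 entry names.
    Const Utf8NameFlag As Integer = &H800

    ' Reads the little-endian 16-bit flag field at offset 6 of a local file header.
    Function FlagsFromLocalHeader(header As Byte()) As Integer
        Return CInt(header(6)) Or (CInt(header(7)) << 8)
    End Function

    Function NamesAreUtf8(flags As Integer) As Boolean
        Return (flags And Utf8NameFlag) <> 0
    End Function

    Sub Main()
        ' Signature "PK" 03 04, version needed 20, then the flag field.
        Dim withFlag As Byte() = {&H50, &H4B, &H3, &H4, &H14, &H0, &H0, &H8}
        Dim withoutFlag As Byte() = {&H50, &H4B, &H3, &H4, &H14, &H0, &H0, &H0}
        Console.WriteLine(NamesAreUtf8(FlagsFromLocalHeader(withFlag)))    ' True
        Console.WriteLine(NamesAreUtf8(FlagsFromLocalHeader(withoutFlag))) ' False
    End Sub
End Module
```

When the bit is clear, the header carries no other hint about the code page, so a reader that follows the specification falls back to IBM437.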

The zipfile you are using is non-compliant, I believe. It uses neither IBM437 nor UTF-8 to encode the entry names. The "UTF-8 bit" is not set, which means the zip file *should* use IBM437 for the entry names. But it does not, which is why you get "abrakadabra" instead of valid names when you use IBM437. The zip specification does not provide a way to explicitly indicate the encoding used for these entry names if it is neither IBM437 nor UTF-8. And, just as it is impossible to deduce or detect the encoding used on an encoded text file (as your internet search showed you), it is also impossible to deduce how an arbitrary stream of bytes was encoded. This is what I meant when I wrote, "specify the code page used to encode the filenames in the zip file, at the time of reading." The reading application needs to "know" what encoding was used by the writing application if the zip does not strictly comply with the specification. It is not possible to deduce the encoding.

Reading the first bytes of the zipfile will not tell you what encoding it uses for the entry names.  In fact there is no algorithmic way to learn the encoding used for the entry names, if the zip does not comply with the specification - in other words if the zipfile itself has used something other than IBM437 or UTF-8. 

Most "violations" of the spec in this regard just use the default code page of the local computer. This is a reasonable, if unnecessary, approach; such a zip can be read successfully if the reader application simply observes the same convention and uses the default code page. It will not work if the zipfile has been transmitted across cultural borders. If you create a zipfile in St Petersburg, which implicitly uses the Russian code page, and then send the zip to someone in Caracas, Venezuela, then that person will not be able to read the zip successfully unless you also tell them to explicitly use the Russian code page to read it. You also need a zip library that is flexible enough to use an arbitrary code page during reading. DotNetZip is flexible enough; I don't think other libraries and tools can do this. If you use WinZip, I don't believe there is a way to tell WinZip to use a specific code page to decode the entry names in an arbitrary zip file.

Does that make sense?

As for "not possible to deduce" - one approach you could take is to try reading the zipfile using various encodings and then determine whether the filenames "look right". The problem is that a given stream of bytes may decode into a legal Portuguese filename, and may also decode into a legal Russian filename, a legal Japanese filename, and so on. There is no way to know which of the successful decodings is the "correct" one. Whether a name "looks right" is difficult to decide in software. One reasonable heuristic may be for your application to use the local default code page (Cyrillic), and if that works, then you're done. If it does not work, then fail.
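A rough sketch of the "try several encodings" idea, assuming the .NET Framework and that the application only expects Russian file names. It decodes the raw name bytes with each candidate code page and keeps the one that yields the highest share of Russian letters. The scoring rule, the candidate list, and all names here are assumptions made for illustration; the result is a guess that can easily be wrong, for example on short names or names in other languages.

```vbnet
Imports System
Imports System.Linq
Imports System.Text

Module CodePageGuess
    ' True for the letters of the Russian alphabet (U+0410..U+044F plus the two forms of "yo").
    Function IsRussianLetter(c As Char) As Boolean
        Dim code = Convert.ToInt32(c)
        Return (code >= &H410 AndAlso code <= &H44F) OrElse code = &H401 OrElse code = &H451
    End Function

    ' Fraction of the letters in the decoded name that are Russian letters.
    Function RussianScore(decoded As String) As Double
        Dim letters = decoded.Where(Function(c) Char.IsLetter(c)).ToArray()
        If letters.Length = 0 Then Return 0.0
        Return letters.Count(Function(c) IsRussianLetter(c)) / letters.Length
    End Function

    ' Returns the candidate code page whose decoding scores highest.
    Function BestCodePage(rawName As Byte(), candidates As Integer()) As Integer
        Return candidates.OrderByDescending(
            Function(cp) RussianScore(Encoding.GetEncoding(cp).GetString(rawName))).First()
    End Function

    Sub Main()
        ' Simulate the raw bytes of an entry name written with code page 866.
        Dim rawName = Encoding.GetEncoding(866).GetBytes("Привет.txt")
        Console.WriteLine(BestCodePage(rawName, {437, 866, 1251})) ' prints 866
    End Sub
End Module
```

Note that the same bytes also decode to Cyrillic characters under code page 1251, just not to Russian letters, which is why the score counts only the Russian alphabet. This is exactly the ambiguity described above, so a guess like this is best confirmed by a person.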

 

Aug 8, 2011 at 5:44 AM

Oh, Dino! Thanks for such a thorough explanation! :) Finally, I found out the code page used in the zip files - it's Cyrillic DOS (866)! Now everything works brilliantly! Your program is great! With it, it's super-mega-easy to create zip files! Thanks a lot! :)

Eugene