Tuesday, December 24, 2013

Character encoding



When files are moved between different operating systems, or stored in a common file system such as AFS, you may sometimes find that characters such as ÅÄÖ are shown incorrectly.
A character encoding determines which binary sequence is used to represent each letter or other character. Many different ways to encode text have been used over the years. CSC's Unix systems have traditionally used “Latin-1” (ISO-8859-1), which contains the letters used in western European languages. Other operating systems have used other encodings, e.g. “Mac Roman” on Mac OS, “CP-1252” on MS Windows, or “CP-437” on MS-DOS. All of these are extensions of ASCII (basically, unaccented English letters, digits and punctuation), which means that ASCII characters are displayed correctly whichever of these encodings is assumed. Accented letters, however, are encoded differently. In particular, the Swedish letters ÅÄÖ may be displayed incorrectly when the wrong encoding is assumed.
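To make the byte-level difference concrete, here is a minimal sketch that encodes “A” and “Å” in three charsets that every Java runtime must support. The class and method names are illustrative only, and the non-ASCII letter is written as a \u escape so that the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingBytes {
    /** Returns the bytes of text in the given charset as space-separated hex. */
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = {"A", "\u00C5"}; // "A" and "Å"
        Charset[] charsets = {
            StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8
        };
        for (String s : samples) {
            for (Charset cs : charsets) {
                // Print the code point rather than the character, so the
                // output is readable whatever encoding the console uses.
                System.out.println(String.format("U+%04X", s.codePointAt(0))
                        + " in " + cs.name() + ": " + hex(s, cs));
            }
        }
    }
}
```

“A” is the single byte 41 in all three, while “Å” is C5 in ISO-8859-1 and C3 85 in UTF-8. ASCII cannot represent “Å” at all, so getBytes substitutes 3F (a question mark).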
These days, most operating systems support UTF-8, but you may need to configure your applications to use it. To do so you choose a locale, which defines many settings specific to a language and region, for example:
  • Number formatting (e.g. using “1 234,5” or “1,234.5”)
  • Date and time formatting
  • String collation (i.e. sort order, so that “ångström” is sorted under A in English but Å in Swedish)
The locale is written as «language»_«region».«encoding», e.g. “en_US.UTF-8” (American English, UTF-8) or “en_GB.ISO8859-1” (British English, Latin-1).
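As an illustration of how a locale changes formatting, the sketch below formats the same number for two locales in Java. Note the assumptions: Java names locales with tags such as sv-SE rather than the POSIX form sv_SE.UTF-8, a Java Locale carries no encoding, and the class name is made up for this example.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class LocaleFormatting {
    /** Formats a number using the conventions of the given locale. */
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5, Locale.US)); // 1,234.5

        // Swedish uses a decimal comma and a space-like grouping character;
        // exactly which space character is used depends on the locale data
        // shipped with the Java runtime, so no output is shown here.
        System.out.println(format(1234.5, Locale.forLanguageTag("sv-SE")));
    }
}
```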

Converting a file

To convert the contents of a file, you can open it in a locale-aware editor and “save as...” with a different encoding, or use the iconv command-line tool:
iconv -f iso8859-1 -t utf-8 < original.txt > new.txt
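The same conversion can be sketched in Java: decode the bytes using the source encoding, then encode the resulting text in the target encoding. This is a rough equivalent under stated assumptions, not a replacement for iconv. The class and method names are hypothetical, and the file names in main simply mirror the command above.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class Transcode {
    /** Decodes input using one charset and re-encodes the text in another. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    /** Converts a whole file; it is read into memory, so keep files small. */
    static void convertFile(Path source, Path target, Charset from, Charset to)
            throws IOException {
        Files.write(target, transcode(Files.readAllBytes(source), from, to));
    }

    public static void main(String[] args) throws IOException {
        convertFile(Paths.get("original.txt"), Paths.get("new.txt"),
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
    }
}
```

One difference to be aware of: this sketch silently replaces undecodable input with U+FFFD, whereas iconv by default stops with an error.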
When logging in remotely (with SSH), you can normally configure your local locale settings to be forwarded to the remote host. Unfortunately, not all SSH servers support this. Currently (as of November 2010), CSC's Solaris SSH server does not permit forwarding of environment variables, which is needed for this to work. The relevant locales (en_US.UTF-8, sv_SE.UTF-8) are available on Solaris, and you can set them manually, but they won't be used by default.

Problem: ÅÄÖ shown as ���

Your application outputs Latin-1 bytes, but your terminal (or editor) tries to decode them as UTF-8. The bytes for ÅÄÖ are not valid UTF-8 on their own, so each one is shown as the replacement character �. Configure your application to use UTF-8 (see below), or change your terminal settings to use ISO-8859-1.
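This failure can be reproduced in a few lines of Java. The sketch below (class and method names are illustrative) encodes text as Latin-1 and then decodes the bytes as UTF-8, which is what a UTF-8 terminal effectively does.

```java
import java.nio.charset.StandardCharsets;

final class Latin1ReadAsUtf8 {
    /** Encodes text as Latin-1, then wrongly decodes those bytes as UTF-8. */
    static String misread(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    /** Lists the code points of a string, e.g. "U+00C5 U+00C4". */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "ÅÄÖ" written with escapes to stay independent of source encoding
        System.out.println(codePoints(misread("\u00C5\u00C4\u00D6")));
        // prints: U+FFFD U+FFFD U+FFFD
    }
}
```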

Problem: ÅÄÖ shown as Ã…Ã„Ã–

Your application outputs UTF-8, but the bytes are displayed as Latin-1, so each letter appears as two characters, the first of which is Ã. Configure your application to use ISO-8859-1 (see below), or change your terminal settings to use UTF-8.
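The reverse mix-up can be sketched the same way: encode as UTF-8, decode as Latin-1. Because Latin-1 maps every byte to a character, this damage is reversible, which the second method illustrates. Names are illustrative.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ReadAsLatin1 {
    /** Encodes text as UTF-8, then wrongly decodes those bytes as Latin-1. */
    static String misread(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
    }

    /** Undoes misread() by reversing the two steps. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misread("\u00C5\u00C4\u00D6"); // "ÅÄÖ"
        // Six characters: U+00C3 U+0085 U+00C3 U+0084 U+00C3 U+0096
        System.out.println(garbled.length());                          // 6
        System.out.println(repair(garbled).equals("\u00C5\u00C4\u00D6")); // true
    }
}
```

In strict Latin-1 the second character of each pair is an invisible control character. Displays that treat the bytes as Windows-1252 instead show them as …, „ and –.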

Problem: ÅÄÖ shown as ï¿½ï¿½ï¿½

Your application is printing U+FFFD, the Unicode replacement character (�, usually displayed as a question mark on an inverted background). Its UTF-8 encoding is three bytes (EF BF BD), and those bytes are then displayed as if they were Latin-1, giving ï¿½ for each character. Check the settings for all applications, including the terminal window, to ensure that they all agree on which encoding to use.
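One plausible way to end up here is two mistakes in a row. The sketch below chains them, assuming the U+FFFD characters came from Latin-1 bytes that were first misread as UTF-8. The text above does not say where they originate, so treat that first step as an assumption. Names are illustrative.

```java
import java.nio.charset.StandardCharsets;

final class DoubleMisread {
    static String doubleMisread(String text) {
        // Mistake 1: Latin-1 bytes decoded as UTF-8 become U+FFFD.
        String replaced = new String(
                text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        // Mistake 2: the UTF-8 bytes of U+FFFD (EF BF BD) shown as Latin-1.
        return new String(
                replaced.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        doubleMisread("\u00C5").codePoints()
                .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        System.out.println(sb.toString().trim());
        // prints: U+00EF U+00BF U+00BD
    }
}
```

Unlike the previous two problems, this one cannot be undone afterwards: once a letter has been replaced by U+FFFD, the original byte is lost.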

Select locale (application settings)

If your application is locale-aware (most are, though some legacy CSC applications are not), you can select the locale by setting the LC_ALL environment variable:
export LC_ALL=en_US.UTF-8 ## bash
setenv LC_ALL en_US.UTF-8 ## tcsh
and then run your application. To only configure the character encoding, change the LC_CTYPE environment variable instead.
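To show how these variables interact, here is a small sketch of the usual POSIX lookup order for character handling: LC_ALL overrides LC_CTYPE, which overrides LANG, and empty values count as unset. The class and method names are hypothetical, and the fallback to “C” when nothing is set is an assumption about typical systems.

```java
import java.util.Map;

final class EffectiveCtype {
    /** Returns the locale name that governs character encoding. */
    static String ctypeLocale(Map<String, String> env) {
        for (String name : new String[] {"LC_ALL", "LC_CTYPE", "LANG"}) {
            String value = env.get(name);
            if (value != null && !value.isEmpty()) {
                return value;
            }
        }
        return "C"; // assumed default when no variable is set
    }

    /** Extracts "UTF-8" from "en_US.UTF-8"; returns "" if no encoding is named. */
    static String encodingOf(String localeName) {
        int dot = localeName.indexOf('.');
        if (dot < 0) {
            return "";
        }
        int at = localeName.indexOf('@', dot); // optional modifier, e.g. "@euro"
        return at < 0 ? localeName.substring(dot + 1)
                      : localeName.substring(dot + 1, at);
    }

    public static void main(String[] args) {
        String locale = ctypeLocale(System.getenv());
        System.out.println(locale + " -> encoding: " + encodingOf(locale));
    }
}
```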
You can also select which locale to use when you log in locally, but this may cause trouble when you use a different operating system. We recommend that you use the default settings and re-configure the applications instead.

The same encodings, such as UTF-8 and ISO-8859-1, also matter when writing Java programs.

Java charset problem on Linux

In short, don't use -Dfile.encoding=... to work around encoding problems. Consider a string literal in a source file:
    String x = "½";
Since U+00BD (½) is represented by different byte values in different encodings:
windows-1252     BD
UTF-8            C2 BD
ISO-8859-1       BD
...you need to tell the compiler which encoding your source file uses:
javac -encoding ISO-8859-1 Foo.java
Next, consider printing the string:
    System.out.println(x);
System.out is a PrintStream, so it encodes the characters in the system default encoding before emitting the bytes, roughly like this:
    System.out.write(x.getBytes(Charset.defaultCharset()));
That may or may not work as you expect, depending on the platform: the byte encoding must match the encoding the console expects for the characters to show up correctly.
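To see exactly which bytes a PrintStream emits, you can give it an explicit charset and capture the output in memory. This is a minimal sketch: the class and method names are illustrative, and the three-argument constructor taking a Charset requires Java 10 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PrintStreamEncoding {
    /** Returns the bytes a PrintStream using the given charset emits for text. */
    static byte[] printed(String text, Charset cs) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, cs);
        out.print(text); // print, not println, to avoid the line separator
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String x = "\u00BD"; // "½", escaped to stay independent of source encoding
        System.out.println(printed(x, StandardCharsets.UTF_8).length);      // 2
        System.out.println(printed(x, StandardCharsets.ISO_8859_1).length); // 1
    }
}
```

Wrapping System.out in such a PrintStream is one way to control the output encoding explicitly, provided it matches what the console expects.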

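When reading a file, name the encoding explicitly instead of relying on the platform default:
// UnicodeReader is not part of the JDK; it is assumed here to be a
// third-party, BOM-aware Reader whose second argument is the fallback encoding.
FileInputStream fis = new FileInputStream(new File(fileName));
UnicodeReader ur = new UnicodeReader(fis, "UTF-8");
BufferedReader in = new BufferedReader(ur);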
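UnicodeReader is not a JDK class. Assuming its job is to skip a leading byte order mark (BOM) before decoding, the following JDK-only sketch shows the core of that idea for UTF-8. The class and method names are made up for this example, and it handles only the UTF-8 BOM (EF BB BF).

```java
import java.nio.charset.StandardCharsets;

final class BomAwareDecode {
    /** Decodes UTF-8 bytes, skipping a leading UTF-8 byte order mark if present. */
    static String decodeUtf8(byte[] data) {
        int offset = 0;
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            offset = 3; // skip the BOM so it does not appear as U+FEFF in the text
        }
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 0xC2, (byte) 0xBD};
        System.out.println(decodeUtf8(withBom).length()); // 1
    }
}
```

Without the check, Java's UTF-8 decoder keeps the BOM as a U+FEFF character at the start of the text.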
