Monday, April 7, 2014

Character Encoding

A character encoding system consists of a code that pairs each character from a given repertoire with something else (such as a bit pattern, a sequence of natural numbers, octets, or electrical pulses) in order to facilitate the storage of data (generally numbers or text) or its transmission through telecommunication networks. Terms such as character set, character map, codeset, and code page are often used almost interchangeably, although they have related but distinct meanings.

Code unit
A code unit is a bit sequence used to encode the characters of a repertoire.
With US-ASCII, the code unit is 7 bits.
With UTF-8, the code unit is 8 bits.
With EBCDIC, the code unit is 8 bits.
With UTF-16, the code unit is 16 bits.
With UTF-32, the code unit is 32 bits.
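
To make the code unit sizes above concrete, here is a minimal Python sketch that counts how many code units a single character needs in each Unicode encoding. The helper name code_units and the sample characters are only illustrative assumptions, not part of any standard.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Count the code units needed to encode `text` in `encoding`.

    `unit_bytes` is the size of one code unit in bytes
    (1 for UTF-8, 2 for UTF-16, 4 for UTF-32).
    """
    return len(text.encode(encoding)) // unit_bytes


# The explicit little-endian variants are used so that Python does not
# prepend a byte order mark, which would inflate the count.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji
for ch in samples:
    print(
        repr(ch),
        "UTF-8:", code_units(ch, "utf-8", 1),
        "UTF-16:", code_units(ch, "utf-16-le", 2),
        "UTF-32:", code_units(ch, "utf-32-le", 4),
    )
```

A character outside the first 65,536 code points (the emoji here) needs two 16-bit code units in UTF-16 but still only one 32-bit code unit in UTF-32.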

A character encoding tells the computer how to interpret raw zeroes and ones as real characters. It usually does this by pairing numbers with characters.
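
As a rough illustration of this pairing, the Python sketch below takes two raw bit patterns, decodes them as ASCII, and then shows the numbers behind the resulting characters. The two bytes are an arbitrary example chosen for this sketch.

```python
# Two raw 8-bit patterns, written out as zeroes and ones.
raw = bytes([0b01001000, 0b01101001])

# Decoding applies the encoding's number-to-character pairing.
text = raw.decode("ascii")
print(text)

# ord() gives the number paired with a character; chr() goes the other way.
numbers = [ord(c) for c in text]
print(numbers)
print("".join(chr(n) for n in numbers))
```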

There are many different types of character encodings floating around, but the ones we deal with most frequently are ASCII, 8-bit encodings, and Unicode-based encodings.

ASCII is a 7-bit encoding based on the English alphabet.

8-bit encodings are extensions to ASCII that add a potpourri of useful, non-standard characters like é and æ. They can only add 128 characters (the byte values 128 to 255 that 7-bit ASCII leaves unused), so they usually support only one script at a time. When you see a page on the web, chances are it's encoded in one of these encodings.
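
Because each 8-bit encoding fills the upper half of the byte range with its own characters, the same byte can mean different things depending on which encoding is assumed. The Python sketch below shows this for one byte value; the helper name decode_byte and the choice of byte and encodings are just for illustration.

```python
def decode_byte(value: int, encoding: str) -> str:
    """Decode a single byte value using the given 8-bit encoding."""
    return bytes([value]).decode(encoding)


# Byte values below 128 are plain ASCII and decode the same everywhere.
for name in ("iso-8859-1", "windows-1251", "iso-8859-7"):
    print(name, "0x41 ->", decode_byte(0x41, name))

# Byte values from 128 up depend entirely on the encoding in use.
for name in ("iso-8859-1", "windows-1251", "iso-8859-7"):
    print(name, "0xE9 ->", decode_byte(0xE9, name))
```

The byte 0xE9 comes out as a Latin letter, a Cyrillic letter, or a Greek letter depending on the encoding, which is why a page decoded with the wrong 8-bit encoding turns into gibberish.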

Unicode-based encodings implement the Unicode standard and include UTF-8, UTF-16 and UTF-32/UCS-4. They go beyond 8 bits per character and support almost every language in the world. UTF-8 is gaining traction as the dominant international encoding of the web.
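
One way to see the difference in practice is to try encoding text that mixes several scripts. The Python sketch below is a minimal example; the helper name can_encode and the sample string are assumptions made for this illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Latin, Greek and Japanese characters in one string.
mixed = "caf\u00e9 \u03a9 \u65e5\u672c"

for name in ("ascii", "iso-8859-1", "iso-8859-7", "utf-8", "utf-16", "utf-32"):
    print(name, can_encode(mixed, name))
```

The 7-bit and 8-bit encodings each reject some part of the string, while all three Unicode-based encodings accept the whole thing.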

The first step of our journey is to find out what the encoding of your website is. The most reliable way is to ask your browser:

Mozilla Firefox
Tools > Page Info: Encoding

Internet Explorer
View > Encoding: the bulleted item is the unofficial name

Internet Explorer won't give you the MIME (i.e. useful/real) name of the character encoding, so you'll have to look it up using its description. Some common ones:

IE's Description              MIME Name

Arabic (Windows)              Windows-1256
Baltic (Windows)              Windows-1257
Central European (Windows)    Windows-1250
Cyrillic (Windows)            Windows-1251
Greek (Windows)               Windows-1253
Hebrew (Windows)              Windows-1255
Thai (Windows)                TIS-620
Turkish (Windows)             Windows-1254
Vietnamese (Windows)          Windows-1258
Western European (Windows)    Windows-1252

Arabic (ISO)                  ISO-8859-6
Baltic (ISO)                  ISO-8859-4
Central European (ISO)        ISO-8859-2
Cyrillic (ISO)                ISO-8859-5
Estonian (ISO)                ISO-8859-13
Greek (ISO)                   ISO-8859-7
Hebrew (ISO-Logical)          ISO-8859-8-I
Hebrew (ISO-Visual)           ISO-8859-8
Latin 9 (ISO)                 ISO-8859-15
Turkish (ISO)                 ISO-8859-9
Western European (ISO)        ISO-8859-1

Chinese Simplified (GB18030)  GB18030
Chinese Simplified (GB2312)   GB2312
Chinese Simplified (HZ)       HZ
Chinese Traditional (Big5)    Big5
Japanese (Shift-JIS)          Shift_JIS
Japanese (EUC)                EUC-JP
Korean                        EUC-KR
Unicode (UTF-8)               UTF-8
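
MIME names like these are also what a web server puts in the charset parameter of its Content-Type header, so once you have such a header value you can pull the name out in code. The Python sketch below uses only the standard library; the function names charset_from_content_type and python_knows, and the sample header values, are made up for this illustration.

```python
import codecs
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value.

    Returns the name in lower case, or None if no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def python_knows(mime_name: str) -> bool:
    """Check whether Python ships a codec for the given encoding name."""
    try:
        codecs.lookup(mime_name)
        return True
    except LookupError:
        return False


# Sample header values (assumed, not taken from a real server).
for header in ("text/html; charset=Windows-1252", "text/html; charset=UTF-8", "text/html"):
    name = charset_from_content_type(header)
    print(header, "->", name, "| codec available:", name is not None and python_knows(name))
```

A header without a charset parameter yields None, in which case the browser method described above is still the fallback.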