Wednesday, May 5, 2010

Indian language computing - from GIST to Unicode


  1. Synopsis
  2. Preamble
  3. Indic computing initiatives thus far
  4. UNICODE - A giant step towards global standardization
  5. Advantages of Unicode
  6. ISCII and other initiatives - A history and current perspective
  7. ISCII vs. Unicode
  8. Disadvantages of Unicode
  9. Conclusions
  10. References
Synopsis:

This paper presents an overview of language technologies, both proprietary and standard, with a specific emphasis on solutions for Indic computing challenges. The evolution of proprietary solutions is discussed in detail with regard to their significance and their shortcomings. A compelling case is made for much wider usage of the Unicode standard. Finally, the technology solutions embodied in the Aksharamala product suite are presented in comparison with the proprietary technologies.

Preamble:

Proprietary technologies abound in the fertile field of Indic computing. The reference here is to the many proprietary, non-standard encoding schemes, such as Shusha, that were well-meaning but, being non-standard, are not scalable, flexible, or portable. Using them locks the user into specific schemes and usage behaviors that are highly susceptible to change, which makes users wary of such solutions. This poses significant roadblocks to wider use of Indian languages and alienates large native populations from harnessing computers for their benefit and progress. In recent years the Unicode standard has been a significant influence and an able guide for standardization and universality in this domain. This endeavor has far-reaching and permanent consequences: it virtually eliminates proprietary strangleholds, gives application developers a generic platform to target, and reduces the cost of keeping up with changing flavors. We present here an explanation of this concept and its impact.

Indic computing initiatives thus far

The need for Indic computing predates the Unicode initiative. In their efforts to offer solutions in this domain, vendors devised several techniques for representing Indic character sets. Given the pervasive and popular ASCII model of language representation, these efforts focused on using the ASCII and Extended ASCII space for the Indic character sets. The solutions were quite creative and path-breaking, given that there was no tangible OS support to build on. The result was variety and, unfortunately, a proprietary tinge to the proceedings. The process turned out to be heterogeneous: each vendor went about it in its own fashion, leaving a situation that was all but lost to standardization efforts. Consistency of design was lacking even across versions released by the same vendor. The basic design guide was neither glyph-based nor code-based. With such subjectivity, and given the competition among vendors to get to market first, it is no surprise that a standards-based approach did not emerge. A few silver linings to this multiple-front push are:

  • Indic computing was made possible albeit in subjective ways
  • Creative approaches were utilized in the encoding schemes and design methods
  • Precursors of subsequent standards emerged
  • The need for standardization surfaced and demanded attention

It can be safely concluded that there was no common thread in the implementation process of these vendors. The following observations substantiate this conclusion.

  • Several versions of the same font, each using a different encoding, were in circulation (often under the same name)
  • Font creators squeezed Indic glyphs into the single-byte space (for example, Devanagari 'a' is stored as one glyph and 'aa' is composed as 'a' plus the sign 'aa', saving one slot)
  • Problems arose with newer versions of popular applications because the fonts misused special code points (of ASCII and Extended ASCII)
    • The major example is the broken text that appeared when Microsoft released Internet Explorer 5.5
    • Applications transformed the text because they had no knowledge of how these fonts overlapped the ASCII space
    • Browser incompatibilities led to new mappings for the same glyphs
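
To make the glyph-squeezing point concrete, here is a minimal Python sketch of how text typed in such a font might be converted to Unicode. The mapping table is entirely hypothetical (the key assignments are invented for illustration and do not describe any particular vendor's font). It only shows why a converter has to match multi-byte glyph sequences such as 'a' plus sign 'aa' before single bytes.

```python
# Hypothetical legacy font table: the ASCII keys below are invented for
# illustration and do not correspond to any real vendor encoding.
LEGACY_TO_UNICODE = {
    "vk": "\u0906",  # glyph 'a' followed by sign 'aa' -> DEVANAGARI LETTER AA
    "v": "\u0905",   # glyph 'a'                       -> DEVANAGARI LETTER A
    "k": "\u093E",   # sign 'aa'                       -> DEVANAGARI VOWEL SIGN AA
    "d": "\u0915",   # glyph 'ka'                      -> DEVANAGARI LETTER KA
}


def legacy_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Convert font-encoded text to Unicode, trying the longest keys first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # Unmapped bytes (spaces, digits, punctuation) pass through as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(legacy_to_unicode("vk"))  # one Unicode letter from two legacy glyphs
    print(legacy_to_unicode("dk"))  # consonant followed by a vowel sign
```

Because every legacy font used its own table, a real converter would need one such table per font and per font version, which is the portability problem described above.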

UNICODE - A giant step towards global standardization

Meanwhile, efforts were underway, via the Unicode Consortium, to arrive at a lasting and generic solution for representing all the world's languages. The overriding principle of this watershed effort is (quoted verbatim from the Unicode site):

" Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language
"

This is an impressive goal, and it has been achieved. The latest version at the time of writing (Version 4.0) significantly improves on its predecessor, and Unicode has become a must for any new software, operating system, or application to support in its entirety.
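
The "unique number for every character" principle can be observed directly in any Unicode-aware language. The following Python sketch uses only the standard library; the helper name `describe` is ours, chosen for illustration.

```python
import unicodedata


def describe(ch):
    """Return the code point (as U+XXXX) and the official Unicode name of ch."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


if __name__ == "__main__":
    # Latin A, Devanagari A and Telugu A each have their own number,
    # independent of font, platform, or program.
    for ch in ["A", "\u0905", "\u0C05"]:
        code_point, name = describe(ch)
        print(code_point, name)
```

Unlike the font-based schemes discussed earlier, the number identifies the character itself and not a slot in a particular font.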

Advantages of Unicode

Unicode offers some significant advantages that make a lot of business sense.

  • A single version of software instead of language-specific versions, which increase complexity and development costs
  • The standard is supported by most OS and application vendors, which ensures platform, vendor, and application independence
  • Incorporating Unicode into applications and websites offers significant cost savings over proprietary solutions
  • It allows data to be transported through several different systems without corruption
  • Since every character is assigned a unique number, the representation is a true standard

ISCII and other initiatives - A history and current perspective

The following segment briefly explains the ISCII, ISFOC and INSCRIPT initiatives. While this body of work is indeed path-breaking, it does not necessarily offer the best solutions possible. Given the lack of any precedent, these standards precipitated serious advances in Indic computing. However, with the rest of the world moving towards global standards, these approaches face the possibility of going out of circulation. In this context, Unicode technologies augur well for the future.

ISCII (Indian Standard Code for Information Interchange)

With the advent of computerization, considerable work was undertaken to facilitate the use of Indian languages on computers. These activities were generally limited to specific languages and were independent exercises by various organizations, which made data interchange impossible. In such a scenario, it was important to have a common standard for coding Indian scripts. The ISCII standard was evolved during 1986-88 by a standardization committee under the Department of Electronics, of which C-DAC was a member. In 1991, the Bureau of Indian Standards adopted it as the Indian Standard Code for Information Interchange (ISCII). The ISCII document is available as IS 13194:1991 from the BIS offices.

ISFOC Standard for Fonts

C-DAC evolved character-slice (glyph) coding standards for Indic fonts, known as the ISFOC standards. Unlike ISCII, these code charts are different for each script, and they are represented in 8 bits only. Only users of C-DAC products on Windows have these fonts already on the desktop.

INSCRIPT Keyboard Layout

When it comes to the use of computers, the options available for data entry are a major concern. For data entry in Indian languages, the default option is the INSCRIPT (INdian SCRIPT) layout. This layout uses the standard 101-key keyboard. The mapping of the characters is such that it remains common for all the Indian languages (written left to right). This is possible because the basic character set of the Indian languages is common.

The characters of Indian language alphabets can be categorized into Consonants, Vowels, Nasals and Conjuncts. Every consonant represents a combination of a particular sound and a vowel. The vowels are representations of pure sounds. The Nasals are characters representing nasal sounds along with vowels. The conjuncts are combinations of two or more characters. The Indian language alphabet table is divided into Vowels (Swar) and Consonants (Vyanjan). The vowels are divided into long and short vowels and the consonants are divided into vargs.

The INSCRIPT layout takes advantage of these observations, which keeps the design simple. In the INSCRIPT keyboard layout, all the vowels are placed on the left side of the keyboard and the consonants on the right side. The placement is such that the characters of one varg are split over two keys.

ISCII in a nutshell

  1. ISCII is the result of a student's research into finding a suitable mechanism for encoding Indian languages. It is closely modeled on the lines of ASCII.
  2. ISCII was a pioneering effort that gave impetus to Indic computing.
  3. There are several versions of ISCII. They are:
    1. PC-ISCII (also known as GIST, Graphics and Intelligence based Script Technology)
    2. ISCII 89
    3. ISCII 91
    4. Script dependent ISCII (Defined for a specific language)
  4. ISCII is, by all reckonings, a proprietary standard. Several vendors still shy away from using it as a standard.

Disadvantages of ISCII

  1. Typical database applications fail to sort ISCII data correctly, as its collating sequence is not in line with the alphabetic sequence of ASCII (read: English).
  2. There is no major OS or application support, which forces developers to expend extreme effort reinventing text rendering and input support. Developers must bundle their own implementations of these functions, which leads to bigger code footprints and longer execution times. Further, the resulting reinvented code base is
    1. Not reusable
    2. Non-portable
    3. Unwieldy
    4. Expensive to create, extend and maintain
  3. ISCII is also not fully endorsed by major vendors and developers of Indic software, which causes serious data portability issues.

Perhaps these are the main reasons why there are no Indian language versions of popular software titles from Indian vendors.

ISCII vs. Unicode

1. The Unicode standard's encoding model for Indian languages is based on the Indian Standard Code for Information Interchange (ISCII), but the two standards are currently incompatible. That is, a round trip from ISCII to Unicode and back cannot be guaranteed to reproduce the original data. This is because ISCII contains letters not in Unicode (e.g. Bengali letter Va) and Unicode contains letters not in ISCII (e.g. Oriya letter Wa). However, the Unicode standard mostly encodes characters for these scripts in the same relative positions as those coded in positions A0-F4 of the ISCII-1988 standard. It is important to note that the Unicode standard for Indian languages is based on ISCII-88 and not on ISCII-91, which is the latest official standard. This implies a clear-cut divergence that may not necessarily be bridged.

2. Certain combining secondary forms of consonants can be explicitly encoded in ISCII. These forms cannot be explicitly encoded using the currently specified Unicode Indic encoding model. It would be reasonable to presume that Unicode will supersede its ancestor in coming releases of the standard.

3. There are forms of letters in Indian languages that do not find a place in ISCII at all. Some vendors, who have implemented ISCII rather unwillingly, have inconsistently encoded such letters. The encoding of such letters will propel the Unicode standard beyond that of the now outdated ISCII.

4. The Unicode standard is continually upgraded and improved for appropriateness and completeness. This makes it incompatible with outdated and incompletely specified encoding systems such as ISCII.
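
The "same relative positions" remark in point 1 has a visible consequence: because the Indic blocks in Unicode follow a common ISCII-derived layout, many letters sit at the same offset within each script's 128-code-point block. The Python sketch below illustrates this by shifting a character from one block to another. It is only an illustration of the parallel layout, not a real transliterator, since not every offset is assigned in every script; the function name `shift_script` and the three-script table are our own choices.

```python
import unicodedata

# Start of each script's 128-code-point block in Unicode.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "telugu": 0x0C00,
}
BLOCK_SIZE = 0x80


def shift_script(text, source, target):
    """Move every `source`-script character to the same offset in `target`.

    Characters outside the source block are left unchanged. Offsets that
    are unassigned in the target script are NOT detected here.
    """
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    ka = "\u0915"  # DEVANAGARI LETTER KA
    for script in ("bengali", "telugu"):
        shifted = shift_script(ka, "devanagari", script)
        print(f"U+{ord(shifted):04X}", unicodedata.name(shifted))
```

The gaps in this scheme (letters present in one script or standard but not the other) are exactly where the round-trip guarantee between ISCII and Unicode breaks down.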

Keyboard Layouts

These non-standard approaches and solutions led vendors to create variations of standard keyboards, such as those from Godrej, Remington, etc. This discouraged, and continues to discourage, the average computer user from trying out these technologies, not to mention the increased costs of training and deployment.

Disadvantages of Unicode

Unicode, though it has all the makings of a true global standard, does have its issues in addressing the needs of those it aims to represent. A few are discussed below.

  1. Requires OS support
    • Unicode requires full Operating System support to work efficiently.
    • It also requires significant extensions to tools and libraries. The C library requires new functions with new names that work with Unicode data. This can bifurcate the API set of an operating system; for example, Win32 splits the subset of its API functions that take strings into two parts.
  2. Lack of Input methods
    • While most current versions of operating systems do support Unicode, and do bundle Unicode fonts as part of their product suites, the lack of proper and widely used Input methods is a significant drawback. In the case of Indic languages, the solutions are still proprietary, ASCII-based, and application specific. Aksharamala is the first truly Unicode-enabled, application-independent Input mechanism.
  3. Lack of Application compatibility
    • Several popular software applications do not necessarily support Unicode as yet. This does pose a problem when data (vernacular most typically) is sought to be exported or imported from such an application.
  4. It takes more space to store plain text using Unicode. Its repertoire is aligned with the ISO/IEC 10646 standard, and in a 16-bit encoding such as UTF-16 each character requires an extra 8 bits over and above a single-byte representation. Of course, UTF-8 stores each ASCII character in a single byte, so ASCII text takes about 50% less space in UTF-8 than in a 16-bit encoding; Indic characters, however, take three bytes each in UTF-8.
  5. Transmission of Unicode data can also use more bandwidth, as a result of the additional byte or bytes it requires over a single-byte encoding. Despite this drawback, Unicode, in the form of UTF-8, is becoming a popular mechanism on the Web.
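
The storage and bandwidth costs in items 4 and 5 are easy to measure. The short Python sketch below compares the encoded size of an ASCII word and a Devanagari word; the helper name `encoded_sizes` and the sample words are our own, chosen for illustration.

```python
def encoded_sizes(text):
    """Return the number of code points and the byte length in two encodings."""
    return {
        "code_points": len(text),
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" avoids counting the 2-byte byte-order mark.
        "utf-16": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    english = "namaste"
    hindi = "\u0928\u092E\u0938\u094D\u0924\u0947"  # the same word in Devanagari

    print(encoded_sizes(english))  # ASCII: 1 byte per character in UTF-8
    print(encoded_sizes(hindi))    # Devanagari: 3 bytes per character in UTF-8
```

For Indic text, then, UTF-16 is the more compact of the two encodings, while UTF-8 is the more compact one for text that is mostly ASCII, such as markup.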

Conclusions

Just as most technologies see convergence with regard to standards, the representation of languages is also moving inevitably towards Unicode. Vendors need to take this into account, if they have not already, in making their product offerings standards-based. More importantly, this will provide end users with a convenient, uniform interface, which definitely makes business sense. Early initiatives such as ISCII, still the core of many a vendor product, are outdated and outmoded.

It is also a fact that Unicode has its shortcomings in addressing Indic support. The choice for any vendor seems like a choice between a blind horse (ISCII) and a lame horse (Unicode). The lame horse is the better bet, as its direction is correct and there is every chance that it will reach its destination in due course. Making a choice that is standard also has many social and commercial implications. Uniform acceptance of a standard will give vendors and developers an incentive to base their products and services on that standard, creating the critical mass needed for vernacular software to make it to the mainstream. This will definitely mean much more market reach in the now rapidly developing Indic computing space.

From a social perspective, the digital divide between the English-speaking elite and the Indian masses will definitely be bridged. This is possible because the e-governance initiatives taking seed in several states of the Indian Union can be in the vernacular, and so can serve and be taken advantage of by the rural masses, who constitute a major percentage of Indian society. One-of-a-kind initiatives such as eSeva from the state government of Andhra Pradesh are excellent paradigms for empowering the masses and for technology permeation for the good of society. The reach of such programs multiplies if they are delivered in the vernacular using global standards. Standards also mean that vendors can compete in a fair system and users are not locked into proprietary prisons. The importance of standards and of initiatives in the vernacular cannot be over-emphasized; they constitute the bedrock of progress in several fields of human endeavor.

While the authors acknowledge that some of these possibilities are probably remote they have been listed to underscore and highlight the crying need for standards and resulting critical mass, given such stupendous possibilities. We sincerely hope and wish our initiative, Aksharamala, will be a precursor to what we would like to see happen.

Product Showcase:

One of the first products totally supporting Unicode and addressing the Indian language space is Aksharamala. It comes with a rich set of features that not only give conducive and standard interfaces but also provide tools and technologies for:

  1. Conversion of legacy content to Unicode
  2. API layer for creating application software in Indian languages
  3. Creation of web-based applications and Web pages in Indian languages (using the IE companion bar).
  4. Support for Chat and Email that enable interaction in Indian languages.
  5. Creation of content in Indian languages using familiar tools such as FrontPage, the Microsoft Office product suite, etc. (see the application compatibility list for more information).
  6. Easy Extensibility for enhanced keyboard support, languages, fonts and transliteration schemes.

References:

The Unicode consortium http://www.unicode.org
How is Unicode beneficial? http://www.i18nguy.com/UnicodeBenefits.html
Devanagari Fonts http://www.nepali.info/nepali/fonts/
Gaiji: Characters, Glyphs, Both, or Neither? http://www.imug.org/pastevents03.html#010816
About ISCII http://acharya.iitm.ac.in/multi_sys/exist_codes.html
Indic Scripts and Languages http://www.unicode.org/faq/indic.html
Proposed Changes in UNICODE http://tdil.mit.gov.in/pchangeuni.htm
ISCII Resources http://www.cdacindia.com/html/gist/standard/iscii.asp
http://tdil.mit.gov.in/standards.htm
Unicode for Indian Languages: A discussion http://acharya.iitm.ac.in/multi_sys/unicode/intro.html
