Tamil All Character Encoding
Keyboard drivers and Fonts
Keyboard drivers for this encoding scheme are available free of charge on the Tamil Virtual University website. They use the Tamil99 and Tamil Typewriter keyboard layouts, which are approved by the Tamil Nadu Government, and map input keystrokes to the corresponding TACE16 characters. To read files created with TACE16, Unicode Tamil fonts for this encoding scheme are available on the same website. These fonts map glyphs not only to the TACE16 code points but also to the present Unicode code points for both ASCII and Tamil characters, which provides backward compatibility for reading existing files created with the present Unicode encoding for Tamil.
|Newly added. Not present in Unicode_v6.3.|
|Allocated for research (NLP)|
|For future use|
Analysis of TACE16 over present Unicode standard for Tamil language
The neutrality of this section is disputed. (January 2015)
Issues with the present Unicode for Tamil language
- Unicode Tamil has code positions for only 31 of the 247 Tamil characters. These 31 characters comprise 12 vowels, 18 agara-uyirmey, and one aytham. Five Grantha agara-uyirmey are also given code space in Unicode Tamil. The other Tamil characters have to be rendered by separate software. Only about 10% of the Tamil characters are given code space in the present Unicode Tamil; the remaining 90% or so, all of which are used in general text interchange, are not.
- The uyir-meys that are left out of the present Unicode Tamil are simple characters, just as A, B, C, and D are characters in English. Uyir-meys are not glyphs, ligatures, or conjunct characters, as Unicode assumes. In Tamil, ka, kA, ki, kI, etc., are characters.
- In any plain Tamil text, Vowel Consonants (uyir-meys) form 64 to 70%; Vowels (uyir) form 5 to 6% and Consonants (meys) form 25 to 30%. Breaking high frequency letters like vowel-consonants into glyphs is highly inefficient.
- This type of encoding, which requires a rendering engine to realize a character, is not suitable for applications such as system software development in Tamil, searching and sorting, and natural language processing (NLP) in Tamil. It consumes extra time and space, making computation highly inefficient. Such applications require a Level-1 implementation, in which every character of the language has a code position in the encoding, as is the case for English.
- This encoding is based on ISCII - 1988 and therefore, the characters are not in the natural order of sequence. It requires a complex collation algorithm for arranging them in the natural order of sequence.
- It uses multiple code points to render single characters. Multiple code points lead to security vulnerabilities and ambiguous combinations, and require the use of normalization.
- Simple operations such as counting letters, sorting, and searching are inefficient.
- It requires hidden characters of the ZWJ/ZWNJ type.
- It needs an exception table to prevent illegal combinations of code points.
- The Unicode Indic block is built on an enormous, complex, error-prone edifice, based on an encoding that is not built to last.
- The very first code point is described as “Tamil Sign Anusvara - Not used in Tamil”.
- It assumes the same collation as Devanagari, and it incorrectly allows ambiguous encodings to render the same character.
- It encodes 23 vowel-consonants (23 consonants + அ) and calls them consonants, against Tamil grammar.
- Unnatural for Speech to Text/Text to Speech.
- Inefficient to store, transmit, and retrieve (for example, in file reading and writing and over the Internet).
- Complex processing hinders development.
- It needs normalization for string comparison.
- A sequence of characters may correspond to a single glyph; for example, ச + ெ◌ + ◌ா = சொ. Characters are not graphemes: according to Unicode, சொ is a grapheme, but ச, ெ◌, and ◌ா are characters.
- Requires Dynamic Composition - a text element encoded as a sequence of a base character followed by one or more combining marks.
- There are two methods of rendering the Vowel Consonants. This leads to ambiguity in rendering characters.
- The present Unicode is not efficient for parsing. For example, count the letters in the name திருவள்ளுவர். Even a Tamil child in primary school can say that this name has seven letters. According to Unicode, it has twelve characters: த ◌ி ர ◌ு வ ள ◌் ள ◌ு வ ர ◌்
- To properly count the letters in this name, an expert developer had to write a complex program and present it as a technical paper at a Tamil computing conference. By comparison, counting the letters in an English word is an exercise left to a beginning programmer. Such problems arise because a simple script such as Tamil is treated as a complex script by Unicode. Letter counting is provided, for example, by the function tamil.utf8.get_letters in the Python library open-tamil.
- The Unicode standard policy is to encode only characters, not glyphs. However, the Unicode Tamil standard includes the vowel signs as combining characters. These signs, which have no meaning to a Tamil reader on their own, are displayed as is by character shaping engines that detect a blank space between them and a base character. Thus Unicode introduces the dotted circle as a Tamil character. (See https://ezhillang.wordpress.com/2014/01/26/open-tamil-text-processing-%E0%AE%89%E0%AE%B0%E0%AF%88-%E0%AE%AA%E0%AE%95%E0%AF%81%E0%AE%AA%E0%AF%8D%E0%AE%AA%E0%AE%BE%E0%AE%AF%E0%AF%8D%E0%AE%B5%E0%AF%81/)
- Unicode Tamil is not fully supported on many platforms, primarily because Tamil is treated as a complex script that requires complex processing.
- Because all of the inefficiencies above consume more processor cycles, and therefore more electricity, than necessary, a machine that processes Unicode Tamil will use more power over its lifetime, which might also shorten the life of the machine. As a simple example, processing the single Tamil character kI (கீ) means processing both a consonant and a vowel modifier, which doubles the processor cycles (and in turn the electricity) consumed. Across all the machines and servers worldwide that process Unicode Tamil characters, the extra power consumption will be huge.
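The letter-counting and normalization points in the list above can be checked with Python's standard unicodedata module. The following is a minimal sketch, not the open-tamil implementation. The helper name group_letters is hypothetical, and the sketch assumes that attaching every combining mark (general category Mn or Mc) to the preceding base character is a good enough approximation of a Tamil letter; it ignores special cases such as conjuncts.

```python
import unicodedata


def group_letters(text):
    """Group each base character with the dependent signs that follow it.

    Hypothetical helper: a rough approximation of Tamil letters, assuming
    every Mn/Mc code point belongs to the preceding base character.
    """
    letters = []
    for ch in text:
        if letters and unicodedata.category(ch) in ("Mn", "Mc"):
            letters[-1] += ch      # vowel sign or pulli: attach to the base
        else:
            letters.append(ch)     # vowel, consonant, or anything else
    return letters


# The name திருவள்ளுவர் written out as explicit code points.
NAME = "\u0ba4\u0bbf\u0bb0\u0bc1\u0bb5\u0bb3\u0bcd\u0bb3\u0bc1\u0bb5\u0bb0\u0bcd"

print(len(NAME))                  # 12 code points
print(len(group_letters(NAME)))   # 7 letters

# Two code point sequences for the same vowel-consonant (consonant ச with
# the vowel sign for o): a two-part spelling and a single vowel sign.
TWO_PART = "\u0b9a\u0bc6\u0bbe"
ONE_PART = "\u0b9a\u0bca"

print(TWO_PART == ONE_PART)       # False: raw comparison sees different strings
print(unicodedata.normalize("NFC", TWO_PART)
      == unicodedata.normalize("NFC", ONE_PART))   # True after normalization
```

The grouping loop is the extra processing step the list refers to: the letters have to be reassembled from code points before they can be counted, and strings have to be normalized before they can be compared.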
Analysis of TACE16 over Unicode Tamil
- TACE16 is more efficient than Unicode Tamil by about 5.46 to 11.94 percent for data storage applications.
- TACE16 is more efficient than Unicode Tamil by about 18.69 to 22.99 percent for sorting index data.
- TACE16 is more efficient than Unicode Tamil by about 25.39% when the entire data is in Tamil. The default (binary) collation sequence that results from the code space values in the new TACE16 does not follow Tamil dictionary order. Some of the uyir-meys (agara-uyirmeys) take precedence over vowels and other uyir-meys in the new TACE16, because the vowels and agara-uyirmeys are in the 0B80 - 0B8F block while the other uyir-meys are in 0800 to 08FF. For this reason, sorted Unicode data looks better than sorted TACE16 data.
- TACE16 sorts faster than Unicode Tamil by about 0.31 to 16.96 percent.
- Index creation on TACE16 data is 36.7% faster than on Unicode data.
- For Full key Search on Indexed Fields, TACE16 performed better than Unicode Tamil by up to 24.07%. In the case of non-indexed fields also TACE16 performed better than Unicode Tamil by up to 20.9%.
- Rendering of static Tamil Data was fine with TACE16.
Advantages of TACE16 over Unicode Tamil
The TACE16 character encoding scheme not only overcomes all of the issues with the present Unicode encoding standard for Tamil described above, but also provides major performance improvements in both processing time and processing space, the main factors in the efficient and speedy execution of any computer program. This system has the following additional advantages:
- The encoding is Universal since it encompasses all characters that are found in general Tamil text interchange.
- The Collation is sequential in accordance with the code value.
- The encoding is unambiguous.
- Any given code point always represents the same character.
- There is no ambiguity as in the present Unicode Tamil.
This system has the following advantages for computer programming:
- The basic software design needed to accommodate Tamil characters and their processing is simplified.
- Sorting and searching are very simple.
- On a given machine, TACE16 takes fewer processor cycles (and in turn less electricity) than Unicode Tamil. In that sense, TACE16 is greener than Unicode Tamil.
- TACE16 allows programming based on Tamil grammar, which is not easy in Unicode Tamil (it needs extra framework development).
- The encoding is very efficient to parse: characters can be parsed with simple arithmetic operations. Of the two methods, the second is very efficient in terms of performance over a large character set. Both methods follow the basic Tamil grammar rule that Consonant + Vowel = Vowel-Consonant (UyirMei), which Unicode Tamil does not follow.
    Method 1 (by simple arithmetic operations): க் + இ = கி
        E210 (க்) + E203 (இ) = 1C413
        1C413 - E200 (constant) = E213 (கி)
    Method 2: க் (E210) + இ (E203) = கி (E213)
        E210 (க்) | ( E203 (இ) & 000F (constant) ) = E213 (கி)
- It is very efficient to divide a vowel-consonant (UyirMei) character into its corresponding vowel and consonant. This is very efficient in terms of performance over large data.
    /* To get the vowel */
    E213 (கி) & F20F (constant) = E203 (இ)
    /* To get the consonant */
    E213 (கி) & FFF0 (constant) = E210 (க்)
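The composition and decomposition rules above can be written as a short runnable sketch. It uses only the hexadecimal constants and the three code values (E210, E203, E213) given in the text. The function and constant names are illustrative and do not come from any TACE16 library.

```python
TACE16_BASE = 0xE200     # constant subtracted in Method 1
VOWEL_INDEX = 0x000F     # low four bits carry the vowel index
VOWEL_MASK = 0xF20F      # mask from the text for recovering the vowel
CONSONANT_MASK = 0xFFF0  # mask from the text for recovering the consonant


def compose_arithmetic(consonant, vowel):
    """Method 1: add the two codes and subtract the base constant."""
    return consonant + vowel - TACE16_BASE


def compose_bitwise(consonant, vowel):
    """Method 2: OR the vowel index into the consonant code."""
    return consonant | (vowel & VOWEL_INDEX)


def split_uyirmey(code):
    """Return (consonant, vowel) for a vowel-consonant code."""
    return code & CONSONANT_MASK, code & VOWEL_MASK


KA_MEY, I_VOWEL, KI = 0xE210, 0xE203, 0xE213   # values from the text

print(hex(compose_arithmetic(KA_MEY, I_VOWEL)))   # 0xe213
print(hex(compose_bitwise(KA_MEY, I_VOWEL)))      # 0xe213
print([hex(part) for part in split_uyirmey(KI)])  # ['0xe210', '0xe203']
```

Both composition methods give the same result here because the consonant code has zero in its low four bits, so adding the vowel index and OR-ing it in are equivalent.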
- It is very efficient to find whether a character is a vowel, a consonant, a vowel-consonant (UyirMei), or a number.
    c = the TACE16 encoding for a Tamil character

    /* To check whether a character is a vowel */
    ( ( c >= E201 ) && ( c <= E20C ) ) == true   // => Vowel

    /* To check whether a character is a consonant */
    x = ( c & 000F (constant) )
    ( ( x == 0 ) && ( ( c > E200 ) && ( c < E390 ) ) ) == true   // => Consonant

    /* To check whether a character is a vowel-consonant (UyirMei) */
    x = ( c & 000F (constant) )   // => Unique number for each vowel, starting from 1
    ( ( ( x >= 1 ) && ( x <= 12 ) ) && ( ( c >= E211 ) && ( c < E38D ) ) ) == true   // => Vowel-consonant (UyirMei)

    /* To check whether a character is a Tamil number */
    x = ( c & 000F (constant) )
    ( ( ( c & E18F (constant) ) == c ) && ( x <= 12 ) ) == true   // => Tamil number
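As a hedged illustration, the range and mask checks above translate directly into Python. The bounds and masks are copied from the text as given; the function names and the returned labels are assumptions made for this sketch.

```python
def is_vowel(c):
    return 0xE201 <= c <= 0xE20C


def is_consonant(c):
    # Low four bits are zero and the code lies in the consonant rows.
    return (c & 0x000F) == 0 and 0xE200 < c < 0xE390


def is_uyirmey(c):
    # Low four bits hold a vowel index from 1 to 12.
    return 1 <= (c & 0x000F) <= 12 and 0xE211 <= c < 0xE38D


def is_tamil_number(c):
    # Same test as in the text: no bits outside E18F may be set.
    return (c & 0xE18F) == c and (c & 0x000F) <= 12


def classify(c):
    """Return a label for one TACE16 code value (labels are illustrative)."""
    if is_vowel(c):
        return "vowel"
    if is_consonant(c):
        return "consonant"
    if is_uyirmey(c):
        return "vowel-consonant"
    if is_tamil_number(c):
        return "number"
    return "other"


for code in (0xE203, 0xE210, 0xE213, 0xE185):
    print(hex(code), classify(code))
```

Each test is a fixed number of comparisons and bit operations on a single 16-bit value, with no lookup table and no need to inspect neighbouring code points.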
- It is very easy to convert numbers to Tamil numbers (new Tamil number format) and vice versa (same as Unicode Tamil).
    /* Direct digit-to-digit conversion is enough in both directions */

    /* To convert a number to the new format of Tamil number */
    n = single-digit number (0-9)
    ( n | E180 (constant) )   // => Tamil number

    /* To convert the new format of Tamil number to a number */
    c = single-digit Tamil number character (௦-௯)
    ( c & 000F (constant) )   // => Number
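A minimal sketch of the digit conversion for multi-digit values follows, assuming (as the constant E180 in the text suggests) that the ten Tamil digits occupy consecutive codes starting at E180. The function names are hypothetical, and only non-negative integers are handled.

```python
TAMIL_DIGIT_BASE = 0xE180   # constant from the text
DIGIT_MASK = 0x000F         # low four bits carry the digit value


def to_tace16_digits(number):
    """Convert a non-negative integer to a list of TACE16 digit codes."""
    return [int(d) | TAMIL_DIGIT_BASE for d in str(number)]


def from_tace16_digits(codes):
    """Convert a list of TACE16 digit codes back to an integer."""
    return int("".join(str(c & DIGIT_MASK) for c in codes))


codes = to_tace16_digits(2015)
print([hex(c) for c in codes])     # ['0xe182', '0xe180', '0xe181', '0xe185']
print(from_tace16_digits(codes))   # 2015
```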
The open-tamil project provides many of the common operations, e.g. extracting letters from a UTF-8 encoded Unicode string, sorting, and searching, whereby Level-1 compliance of Tamil text processing is achieved without using TACE16.
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # Python 3 version; requires the open-tamil package.
    import tamil.utf8 as utf8

    # Split the string into Tamil letters, then print each letter
    # and write it to the file 'singl', one letter per line.
    with open('singl', 'w', encoding='utf-8') as ff:
        letters = utf8.get_letters("கூவிளம் என்பது என்ன சீர்")
        for letter in letters:
            ff.write(letter)
            ff.write('\n')
            print(letter)
This generates the output: கூ வி ள ம் எ ன் ப து எ ன் ன சீ ர்