Tamil All Character Encoding
Keyboard drivers and fonts
The keyboard drivers for this encoding scheme are available free of charge on the Tamil Virtual University website. They use the Tamil99 and Tamil Typewriter keyboard layouts, which are approved by the Tamil Nadu Government, and map input keystrokes to the corresponding characters of the TACE16 scheme. To read files created using the TACE16 scheme, the corresponding Unicode Tamil fonts are also available on the same website. These fonts map glyphs not only to the characters of the TACE16 format but also to the present Unicode encoding for both ASCII and Tamil characters, so they provide backward compatibility for reading existing files created using the present Unicode encoding scheme for Tamil.
|Newly added. Not present in Unicode_v6.3.|
|Allocated for research (NLP)|
|For future use|
Analysis of TACE16 over present Unicode standard for Tamil language
Issues with the present Unicode for Tamil language
- Unicode Tamil has code positions for only 31 of the 247 Tamil characters. These 31 characters comprise 12 vowels, 18 agara-uyirmey and one aytham; the count excludes the five Grantha agara-uyirmey, which are also given code space in Unicode Tamil. The other Tamil characters have to be rendered using separate software. Only about 13% of the Tamil characters are given code space in the present Unicode Tamil; the remaining 87% or so, which are used in general text interchange, are not.
- The Uyir-meys that are left out in the present Unicode Tamil are simple characters, just like A, B, C, D are characters to English. Uyir-meys are not glyphs, nor ligatures, nor conjunct characters as assumed in Unicode. ka, kA, ki, kI, etc., are characters to Tamil.
- In any plain Tamil text, Vowel Consonants (uyir-meys) form 64 to 70%; Vowels (uyir) form 5 to 6% and Consonants (meys) form 25 to 30%. Breaking high frequency letters like vowel-consonants into glyphs is highly inefficient.
- This type of encoding, which requires a rendering engine to realize a character during computation, is not suitable for applications such as system software development in Tamil, searching and sorting, and natural language processing (NLP) in Tamil. It consumes extra time and space, making computation highly inefficient. Such applications require a Level-1 implementation, in which all the characters of a language have code positions in the encoding, as in English.
- This encoding is based on ISCII (1988) and therefore, the characters are not in the natural order of sequence. It requires a complex collation algorithm for arranging them in the natural order of sequence.
- It uses multiple code points to render single characters. Multiple code points lead to security vulnerabilities and ambiguous combinations, and require the use of normalization.
- Simple operations such as counting letters, sorting and searching are inefficient.
- It requires ZWJ/ZWNJ type hidden characters.
- It needs an exception table to prevent illegal combinations of code points.
- The Unicode Indic block is built on an enormous, complex, error-prone edifice, based on an encoding that is not built to last.
- The very first code point says "Tamil Sign Anusvara - Not used in Tamil".
- It assumed that collation was the same as for Devanagari, and it incorrectly uses ambiguous encoding to render the same character.
- It encodes 23 vowel-consonants (23 consonants + அ) and calls them consonants, contrary to Tamil grammar.
- Unnatural for Speech to Text/Text to Speech.
- It is inefficient to store, transmit and retrieve (for example, in file reading and writing, over the Internet, etc.).
- Complex processing hinders development.
- It needs normalization for string comparison.
- A sequence of characters may correspond to a single glyph, for example ச + ◌ெ + ◌ா = சொ. Characters are not graphemes: according to Unicode, சொ is a grapheme, but ச, ◌ெ and ◌ா are characters.
- Requires Dynamic Composition - a text element encoded as a sequence of a base character followed by one or more combining marks.
- There are two methods of rendering the Vowel Consonants. This leads to ambiguity in rendering characters.
- The present Unicode is not efficient for parsing. For example, the name திருவள்ளுவர் looks like it should have seven letters. However, according to Unicode, this name has twelve characters: த ◌ி ர ◌ு வ ள ◌் ள ◌ு வ ர ◌்
- To count the letters in this name properly, an expert developer had to write a complex program and present it as a technical paper at a Tamil computing conference. By comparison, counting the letters in an English word is an exercise left to a beginning programmer. Such problems arise because a simple script such as Tamil is treated as a complex script by Unicode. For example, the Python library open-tamil, which uses the present Unicode standard for Tamil, counts the Tamil letters in a given text by first calling the function tamil.utf8.get_letters to parse the text into a list and then taking the length of that list. This kind of extra programming logic, or an additional framework layer, is needed when a simple script such as Tamil is treated as a complex script.
- The Unicode standard's policy is to encode only characters, not glyphs. However, the Unicode Tamil standard includes the vowel signs as combining characters. These signs, which have no meaning to a Tamil reader on their own, are displayed as is by character-shaping engines that detect a blank space between them and a base character. Thus Unicode introduces the dotted circle as a Tamil character.
- Unicode Tamil is not fully supported on many platforms, primarily because Tamil is treated as a complex script that requires complex processing.
- Since all the inefficiencies mentioned above consume more processor cycles than necessary, they increase the overall lifetime power (electricity) usage of a machine that processes Unicode Tamil. For example, to process the single Tamil character kI (கீ), the machine has to process both the consonant and the vowel modifier, which doubles the number of processor cycles consumed.
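The letter-counting point in the list above can be illustrated with a minimal Python sketch that uses only the standard library. It is an illustration, not the open-tamil implementation: the helper name `split_letters` is hypothetical, and attaching every combining mark to the preceding base character is a simplifying assumption that does not cover conjuncts such as க்ஷ.

```python
import unicodedata

# The name திருவள்ளுவர் written out as explicit Unicode code points.
NAME = "\u0ba4\u0bbf\u0bb0\u0bc1\u0bb5\u0bb3\u0bcd\u0bb3\u0bc1\u0bb5\u0bb0\u0bcd"


def split_letters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: assumes every character whose general category
    starts with 'M' (Mn or Mc) belongs to the preceding base character.
    """
    letters = []
    for ch in text:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch   # vowel sign or pulli: extend the last letter
        else:
            letters.append(ch)  # base character: start a new letter
    return letters


code_point_count = len(NAME)             # what a plain len() reports
letter_count = len(split_letters(NAME))  # what a Tamil reader would count
print(code_point_count, letter_count)
```

Under these assumptions the sketch reports 12 code points and 7 letters, matching the counts given for this name in the list.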
Analysis of TACE16 over Unicode Tamil
- TACE16 is more efficient than Unicode Tamil by about 5.46 to 11.94 percent for data storage applications.
- TACE16 is more efficient than Unicode Tamil by about 18.69 to 22.99 percent for sorting index data.
- TACE16 is more efficient than Unicode Tamil by about 25.39% when the data is entirely in Tamil. The default (binary) collation sequence obtained from the code-space values in the New TACE16 does not follow Tamil dictionary order: some of the uyir-meys (agara-uyirmeys) take precedence over vowels and other uyir-meys, because the vowels and agara-uyirmeys are in the 0B80 - 0B8F block and the other uyir-meys are in 0800 to 08FF. For this reason, sorted Unicode data looks better than sorted TACE16 data.
- Sorting is faster with TACE16 than with Unicode Tamil by about 0.31 to 16.96 percent.
- Index creation on TACE16 data is 36.7% faster than on Unicode data.
- For full key search on indexed fields, TACE16 performed better than Unicode Tamil by up to 24.07%. On non-indexed fields, TACE16 also performed better than Unicode Tamil, by up to 20.9%.
- Rendering of static Tamil Data was fine with TACE16.
Advantages of TACE16 over Unicode Tamil
The TACE16 character-encoding scheme not only overcomes all the issues with the present Unicode encoding standard for Tamil language which are mentioned above, but also provides major performance improvements in both processing time and processing space. This system has the following additional advantages:
- The encoding is Universal since it encompasses all characters that are found in general Tamil text interchange.
- The Collation is sequential in accordance with the code value.
- The encoding is unambiguous.
- Any given code point always represents the same character.
- There is no ambiguity as in the present Unicode Tamil.
The Unicode Tamil encoding had many issues and there was a proposal to reencode Tamil. This was rejected by Unicode, who said that the reencoding would be damaging and there was no convincing evidence Unicode Tamil encoding is bad.
This system has the following advantages for computer programming:
- Software to accommodate Tamil characters and their processing is simplified.
- Sorting and searching are very simple.
- For a machine, TACE16 takes fewer CPU cycles (and so uses less electricity) than Unicode Tamil.
- TACE16 allows programming based on Tamil grammar, which is not easy in Unicode Tamil (it needs extra framework development).
- The encoding is very efficient to parse: characters can be parsed by simple arithmetic operations, using either of the two methods that follow. In computer programming, the second method performs very efficiently over a large character set. Both methods also follow the basic Tamil grammar rule that consonant + vowel = vowel-consonant (UyirMei), which is not followed in Unicode Tamil.
    Method 1 (by simple arithmetic operations):
    க் + இ = கி
    E210 (க்) + E203 (இ) - E200 (Constant) = E213 (கி)

    Method 2:
    க் (E210) + இ (E203) = கி (E213)
    E210 (க்) | (E203 (இ) & 000F (Constant)) = E213 (கி)
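The two composition methods can be written as a short Python sketch. It uses only the code values quoted above (E200, E203, E210, E213); the function and constant names are hypothetical and chosen for illustration.

```python
# Code values quoted in the text; no other TACE16 values are assumed.
VOWEL_BASE = 0xE200  # the "E200 (Constant)" used by Method 1
KA_MEY = 0xE210      # க்
VOWEL_I = 0xE203     # இ


def compose_add(consonant, vowel):
    """Method 1: add the vowel's offset from the vowel row to the consonant."""
    return consonant + vowel - VOWEL_BASE


def compose_or(consonant, vowel):
    """Method 2: OR the vowel's low four bits into the consonant."""
    return consonant | (vowel & 0x000F)


print(hex(compose_add(KA_MEY, VOWEL_I)))  # 0xe213, i.e. கி
print(hex(compose_or(KA_MEY, VOWEL_I)))   # 0xe213, i.e. கி
```

Both functions give the same result here because the consonant's low four bits are zero, so adding the vowel index and OR-ing it in are equivalent.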
- It is very efficient to divide a vowel-consonant (UyirMei) character into its corresponding vowel and consonant. This is very efficient in terms of performance over large data.
    /* To get Vowel */
    E213 (கி) & 'F20F (Constant)' = E203 (இ)

    /* To get Consonant */
    E213 (கி) & 'FFF0 (Constant)' = E210 (க்)
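The same masks can be applied in Python as a minimal sketch. The masks F20F and FFF0 and the code values are those quoted above; the function name `split_uyirmey` is hypothetical.

```python
VOWEL_MASK = 0xF20F      # "F20F (Constant)" from the text
CONSONANT_MASK = 0xFFF0  # "FFF0 (Constant)" from the text


def split_uyirmey(code):
    """Return the (consonant, vowel) code values of a vowel-consonant."""
    return code & CONSONANT_MASK, code & VOWEL_MASK


consonant, vowel = split_uyirmey(0xE213)  # கி
print(hex(consonant), hex(vowel))         # 0xe210 0xe203, i.e. க் and இ
```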
- It is very efficient to find whether a character is vowel or consonant or vowel-consonant (UyirMei) or numbers.
    /* Operators used below:
     *   |  - Bitwise OR
     *   &  - Bitwise AND
     *   !  - Logical NOT
     *   ^  - Bitwise XOR
     *   || - Conditional OR
     *   && - Conditional AND
     */
    c = the TACE16 encoding for a Tamil character

    /* To check whether a character is a vowel */
    /* Method 1 */
    ((c >= E201) && (c <= E20C)) == true   // => Vowel
    /* Method 2 - if code positions E200, E20E, E20F are not used for any other purpose */
    (((c & 'E20F (Constant)') == c) && (c != E20D)) == true   // => Vowel
    ((!((c & 'E20F (Constant)') ^ c)) && (c != E20D)) == true   // => Vowel

    /* To check whether a character is a consonant or vowel-consonant (UyirMei) */
    x = (c & '000F (Constant)')   // if c is a vowel or vowel-consonant, x is the unique number of its vowel, starting from 1
    (((c >= E210) && (c <= E38C)) && (x == 0)) == true   // => Consonant
    (((c >= E210) && (c <= E38C)) && ((x >= 1) && (x <= 12))) == true   // => Vowel-Consonant (UyirMei)

    /* To check whether a character is a Tamil number */
    /* Method 1 */
    ((c >= E180) && (c <= E18C)) == true   // => Tamil Number
    /* Method 2 - if code positions E18D-E18F are not used for any other purpose */
    (c & 'E18F (Constant)') == c   // => Tamil Number
    (!((c & 'E18F (Constant)') ^ c)) == true   // => Tamil Number
    /* If code positions E18D-E18F are used for any other purpose, use either Method 1 or the check below */
    ((!((c & 'E18F (Constant)') ^ c)) && ((c & '000F (Constant)') <= 12)) == true   // => Tamil Number
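The range checks above can be gathered into one runnable Python function. This is a sketch that relies only on the ranges quoted in the pseudocode (E201-E20C for vowels, E210-E38C for consonants and vowel-consonants, E180-E18C for numbers); the function name `classify` and its return labels are hypothetical.

```python
def classify(c):
    """Classify a TACE16 code value using the ranges quoted in the text."""
    low = c & 0x000F  # position within a 16-code row (the vowel number)
    if 0xE201 <= c <= 0xE20C:
        return "vowel"
    if 0xE180 <= c <= 0xE18C:
        return "number"
    if 0xE210 <= c <= 0xE38C:
        if low == 0:
            return "consonant"
        if 1 <= low <= 12:
            return "vowel-consonant"
    return "other"  # anything outside the ranges given in the text


for code in (0xE203, 0xE210, 0xE213, 0xE185, 0xE20D):
    print(hex(code), classify(code))
```

The sketch uses the range comparisons (Method 1) rather than the mask-only checks, because the mask checks are correct only under the stated condition that the neighbouring code positions are unused.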
- It is very easy to convert numbers to Tamil numbers (new Tamil number format) and vice versa (same as Unicode Tamil).
    /* To convert a number to the new format of Tamil number and vice versa, direct digit-to-digit conversion is enough. */

    /* To convert a number to the new format of Tamil number */
    n = single digit number (0-9)
    /* Method 1 */
    (n + 'E180 (Constant)')   // => Tamil Number
    /* Method 2 */
    (n | 'E180 (Constant)')   // => Tamil Number

    /* To convert the new format of Tamil number to a number */
    c = single digit Tamil number character (௦-௯)
    (c & '000F (Constant)')   // => Number
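The digit-to-digit conversion can be sketched in Python as follows. The sketch assumes that the Tamil digits 0-9 occupy consecutive code values starting at E180, as the `n | E180` rule above implies; the function names are hypothetical, and the functions work on lists of code values rather than on strings.

```python
TAMIL_DIGIT_BASE = 0xE180  # assumed code value of Tamil zero in TACE16


def to_tace16_digits(number):
    """Map each decimal digit of a non-negative int to a TACE16 code value."""
    return [TAMIL_DIGIT_BASE | int(digit) for digit in str(number)]


def from_tace16_digits(codes):
    """Rebuild the integer by masking off the low four bits of each code."""
    value = 0
    for code in codes:
        value = value * 10 + (code & 0x000F)
    return value


codes = to_tace16_digits(2016)
print([hex(code) for code in codes])  # ['0xe182', '0xe180', '0xe181', '0xe186']
print(from_tace16_digits(codes))      # 2016
```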
The open-tamil project provides many of the common operations, such as extracting letters from a Unicode UTF-8 encoded string, sorting and searching. Even though the project claims Level-1 compliance for Tamil text processing without using TACE16, it is still written on top of the extra programming logic that the present Unicode standard for Tamil requires.
    #!/usr/bin/env python
    import codecs
    import tamil.utf8 as utf8

    # Split the text into Tamil letters, then write and print them one by one.
    # The with-statement closes the file, so no explicit close() is needed.
    with codecs.open('singl', 'w', encoding='utf-8') as ff:
        letters = utf8.get_letters(u"கூவிளம் என்பது என்ன சீர்")
        for letter in letters:
            ff.write(letter)
            print(letter)
            ff.write(' ')

This generates the output: கூ வி ள ம் எ ன் ப து எ ன் ன சீ ர்
- TSCII (Tamil Script Code for Information Interchange)
- AnyTaFont2UTF8 An Open source project for all Tamil Encoding/Font Mapping characters.
- Report on the final recommendations of the task force on TACE16
- Tamil Nadu Government's Tender Document for development of Tamil fonts and Tamil keyboard driver for 16-bit encodings (Unicode and TACE16)
- "தமிழ் எழுத்துருக்கள் | தமிழ் இணையக் கல்விக்கழகம் Tamil Virtual Academy".
- Tamil Nadu Government's Order(G.O.), Keyboard Drivers and Fonts
- https://github.com/arcturusannamalai/open-tamil open-tamil
- https://ezhillang.wordpress.com/2014/01/26/open-tamil-text-processing-%E0%AE%89%E0%AE%B0%E0%AF%88-%E0%AE%AA%E0%AE%95%E0%AF%81%E0%AE%AA%E0%AF%8D%E0%AE%AA%E0%AE%BE%E0%AE%AF%E0%AF%8D%E0%AE%B5%E0%AF%81/ tamil.utf8.get_letters
- https://www.unicode.org/L2/L2012/12033-tamil-presentation.pdf
- "Archive of Notices of Non-Approval".
- https://pypi.org/project/Open-Tamil/ open-tamil project