Talk:Unicode/Archive 3



Can someone who knows who AAT are add to the AAT disambiguation page appropriately and also send the link on this page to the right place? Thanks EddEdmondson 08:59, 19 Jun 2004 (UTC)

Done. (AAT is Apple Advanced Typography, but we have no article on it at present.) --Zundark 09:40, 19 Jun 2004 (UTC)

Revision history has a future date

Please justify this. If it is not justified within a few days I will be reverting. Plugwash 12:15, 25 Dec 2004 (UTC)

I've reverted it already. Future dates are never justified for this sort of thing, because schedules can change. --Zundark 12:39, 25 Dec 2004 (UTC)

Unicode critique

There is a link in the external link section, The strongest denunciation of Unicode, and a response to it, which leads to a rather old paper in this context (2001). Isn't there something more recent? Otherwise I propose that these issues be treated from today's viewpoint within the article rather than giving this link. Hirzel 5 July 2005 10:14 (UTC)

I don't think we are supposed to critique the paper. It is equally important to discuss an old paper as a new one. -- Taku July 5, 2005 10:30 (UTC)
Well, the article discusses the topic why Unicode 3.1 does not solve the problems of far east scripts. In the meantime we have Unicode 4.1 which added a lot of additional characters. I would like to know how the situation is today and not how somebody evaluated the situation 4 years ago. Hirzel 6 July 2005 20:45 (UTC)
I think the article(s) should be mentioned, mainly because implementations of Unicode lag way behind the latest published versions. Unicode 2.1 support is still fairly common. Feel free to add some kind of qualifier/disclaimer that says that some or all of the issues raised were addressed in Unicode 4.1 or whatever. — mjb 6 July 2005 22:19 (UTC)
It's hard to say which of the issues were not addressed by the time the article was released. --Prosfilaes 6 July 2005 23:07 (UTC)
Be vague, then. "Some, perhaps many, of the issues raised in this article were addressed in later versions of The Unicode Standard." — mjb 7 July 2005 06:07 (UTC)
It is interesting to see the problems there were in 2001. However, I have a feeling the link should be in a completely different section, "Unicode history" or something like that. The article runs to 15 only partially interesting pages, and its main point - that Unicode has no more than 65,000 characters - is now obsolete. Mlewan 14:32, 16 May 2006 (UTC)
When it says that Unicode is "a 16-bit character definition allowing a theoretical total of over 65,000 characters", it was wrong when it was written. Pretty much everything it says was wrong when it was written.--Prosfilaes 19:06, 16 May 2006 (UTC)
Prosfilaes, I'm missing something here. You say that pretty much everything it says was wrong, and yet you reverted my change to the link title. I could have phrased it better, I know that, but I'd really like to get rid of the word "strong". An obsolete and erroneous article cannot possibly be strong, can it? Mlewan 06:44, 17 May 2006 (UTC)
It can be a "strong denunciation"; that just means it's harshly critical, not that it's a strong argument.--Prosfilaes 07:21, 17 May 2006 (UTC)
Yes, but the word is ambiguous. It also has the meaning "convincing". If you had written "violent", "vitriolic" or "rabid", it would have been clear. I'd like to spare other people from what I did: reading through all 15 pages looking for the "strong" argument without finding anything valid at all. Mlewan 07:48, 17 May 2006 (UTC)

Having some trouble

Redirect to Talk:Code2000

Inka hieroglyphics?

Umm, wasn't it that the Inkas had no forms of writing (apart from the k'ipu) (unsigned comment by

I've removed it; it was added by an anon who did not give any justification in the edit summary, and there was no mention of hieroglyphics in the Inca article. Plugwash 20:26, 12 August 2005 (UTC)
It's not mentioned on the Unicode roadmaps. --Prosfilaes 23:03, 12 August 2005 (UTC)

A little clarification about Tolkien's scripts and Klingon?

I don't mean to be a spoilsport, but these bits just don't seem to fit in _at_ all. I was reading through it just then and I thought an anonymous user must have added it in for a laugh. I think a rewording's in order, but perhaps it's just me. I definitely don't think it deserves quite as much as has been written about it, though. :-/ Someone care to back me up here, I'm not too sure of myself? Edit: Under the 'Development' Category --Techtoucian 10:16, 07 Jan 2005 (UTC)

I think they fit, if only because they show how the Unicode consortium actually considers scripts which to some seem no more than a 'laughing matter' -- certainly Tengwar and Cirth see more actual use than some of the scripts which are already encoded.

-- Jordi· 12:41, 7 Jan 2005 (UTC)

They belong. Tengwar and Cirth will certainly be encoded one day. Klingon is famous for being rejected for good reasons. ;-) Evertype 16:33, 12 February 2006 (UTC)

U+xxxxxx notation

"When writing about Unicode code points it is normal to use the form U+xxxx or U+xxxxxx where xxxx or xxxxxx is the hexadecimal code point." Could there be an explanation about why there are two notations? -- 16:32, 16 September 2005 (UTC)

There is just one notation, but as most characters are in the BMP, the leading zeroes can be left out. U+1234 (ሴ) is the same as U+001234.

-- Jordi· 17:05, 16 September 2005 (UTC)

But U+48 is an invalid notation for U+0048 --Pjacobi 17:10, 16 September 2005 (UTC)
Normally 4 digits are used for stuff in the BMP and 6 digits for everything else. I'm not sure if there is a rule forbidding other lengths, but I've never seen them used. Plugwash 17:23, 16 September 2005 (UTC)
The rules in the Unicode standard defining this format declare that no fewer than 4 digits are to be used. I've seen 5 digits all the time.--Prosfilaes 23:21, 16 September 2005 (UTC)
The rules for U+ notation have changed a few times. Current rules are U+xxxx, U+xxxxx, or U+xxxxxx, as appropriate, to denote a character by its code point. In older versions of Unicode, code points and code values were treated separately using U- and U+ notation, respectively. The number of digits required varied. Unicode 3.0 said U- notation always had to have 8 digits, and U+ always had to have 4. That's obsolete now though. — mjb 23:57, 16 September 2005 (UTC)
I've just rewritten that piece to reflect some of what has been discussed in this section. It could probably still do with more improvement, though, and possibly even making a section of its own. Plugwash 01:01, 17 September 2005 (UTC)
I've never seen 6 digits with a trailing 0.--Prosfilaes 03:06, 17 September 2005 (UTC)
You mean "leading 0", don't you? Leading zeros are not used, unless there would otherwise be fewer than four digits. See section 0.3 ("Notational Conventions") of the preface to the Unicode Standard. --Zundark 09:33, 17 September 2005 (UTC)
It seems I misremembered, probably because most of what I have seen referring to stuff outside the BMP relates to the limits of Unicode (hence 6 digits). I've updated the article to reflect the fact that 5 digits are indeed used. Plugwash 14:58, 17 September 2005 (UTC)
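As an illustration of the convention discussed above (pad to at least four hex digits, use more only when needed), here is a minimal Python sketch; the helper name `u_notation` is mine, not anything from the standard:

```python
def u_notation(cp: int) -> str:
    """Format a code point in U+ notation: at least 4 hex digits,
    more only when the code point needs them."""
    return f"U+{cp:04X}"

print(u_notation(0x48))      # U+0048 (padded to four digits)
print(u_notation(0x1234))    # U+1234
print(u_notation(0x1D6A5))   # U+1D6A5 (five digits, no extra padding)
print(u_notation(0x10FFFF))  # U+10FFFF
```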


Someone claimed higher up this talk page, near the start of a long section, that VISCII had more precomposed characters than Unicode. Our page on VISCII disagrees; would anyone like to clarify? Plugwash 01:55, 18 September 2005 (UTC)

Of course not. VISCII only contains characters for Vietnamese; Unicode covers so many more languages and scripts. Vietnamese may have a complex alphabet, but it certainly doesn't need more precomposed characters than all the other modern languages combined. – Minh Nguyễn (talk, contribs) 06:32, 27 January 2006 (UTC)
Besides, I think VISCII was taken as an input character set to Unicode, so they added all the VISCII compositions as precomposed characters. I can't see any other reason to make LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE (U+1EB2) a precomposed character... --Alvestrand 07:12, 27 January 2006 (UTC)


Does the link in the first paragraph really refer to APL -- the programming language? Is this a typo; meant to be IPA? (International Phonetic Alphabet)


I think it's correct. IPA is already covered by the term linguistic. If you take a look at the article, you'll see that APL requires a special set of symbols. Michael Z. 2005-10-9 23:07 Z

"Endianness is a made-up word"?

What computer jargon is not made-up? And I have never heard of "significant byte order"; if I hadn't seen it used with "endianness" in the log message, I wouldn't even know what "significant byte order" is. "Endianness" at least is comprehensible. (PS: In fact, if whoever did the edit tried to even do a quick google, he/she would know that the use of the word "endianness" is very widespread and does not qualify as a "made-up" word.)—Wing 08:15, 22 October 2005 (UTC)

I've also seen "byte sex" for this distinction, but by now "endianness" is most common. --Pjacobi 09:15, 22 October 2005 (UTC)
I remember now… what he/she meant is probably byte order (“MSB first” and “LSB first” back when I first saw it), which I indeed had seen and used way before I ever saw “endianness”.—Wing 15:57, 22 October 2005 (UTC)
Agreed, "byte order" is the normal term; I don't think I've ever seen it written "significant byte order", though. Plugwash 18:53, 22 October 2005 (UTC)
Google is not a judge of the proper use of words/phrases. Though, going by your argument, "byte order" returns ~900,000 pages and "endianness" returns ~300,000 ("significant byte order" returns 347 and so probably shouldn't be used). And for reference, "significant byte order" is a phrase, not a word (regarding edit comment). If you do not know what "byte order" or "significant byte order" mean then you do not understand "endianness".—Kbolino 23:16, 4 November 2005 (UTC)
a search for "endian" claims to return 4.68 million pages. The world is an inconsistent place.... --Alvestrand 07:18, 27 January 2006 (UTC)
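The distinction under discussion is easy to see in practice; a small Python sketch serializing the same code point with the two UTF-16 byte orders:

```python
# U+1234 in the two UTF-16 byte orders: the most significant byte of
# the 16-bit code unit comes first in big-endian, last in little-endian.
ch = "\u1234"
print(ch.encode("utf-16-be").hex())  # 1234
print(ch.encode("utf-16-le").hex())  # 3412
```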

Linguistics Connection

Is there a connection between Unicode and phonetics (such as the IPA)? With one symbol for each sound, it is similar to Unicode's philosophy. Maybe the next thing would be to have all languages standardized with a sound per symbol in Unicode.

No. --Prosfilaes 00:54, 15 December 2005 (UTC)
Unicode does not standardize sound - many characters in Unicode have completely different sounds in different languages (Japanese/Chinese, for instance). However, IPA characters have code positions in Unicode. --Alvestrand 07:19, 27 January 2006 (UTC)

firefox and hex boxes

"will display a box showing the hexadecimal": maybe mention that firefox does this too...

Technical note

It would be good to have a link from technical reasons giving a brief explanation of the problem and what to do about it. For example, somewhere on Wiki there is a page suggesting some downloadable fonts for WinXP that are more complete than Arial Unicode MS. --Red King 00:44, 18 February 2006 (UTC)

Compose key in X Window


It should be mentioned in the section about different input methods for Unicode characters that all X Window applications (including Gnome and KDE, but not only them) support using a Compose key. And it should be added as well that any key (e.g., the hated CapsLock) can be redefined as the Compose key if the keyboard does not contain it natively.

Ceplm 15:15, 21 February 2006 (UTC)


Why has Unicode no code point for the digraph "ch"? -- 15:43, 3 March 2006 (UTC)

Because it can be written ch, and adding a code point for ch would cause a lot of confusion and be a Bad Thing. --Prosfilaes 17:22, 3 March 2006 (UTC)

Why has Unicode no code points for digraphs assigned? -- 11:02, 6 March 2006 (UTC)

Afaict (with a few exceptions that are there either for historical reasons or because a digraph is actually considered a separate letter in some language) digraphs are not encoded because they are merely presentational artifacts that don't add any meaning to the text. --Plugwash 11:33, 6 March 2006 (UTC)

Can a code point in the private use area of Unicode become assigned for a digraph? -- 09:22, 7 March 2006 (UTC)

Of course. You can assign any code point in the private use area for anything you like. Whether it's sensible to do so is another matter. --Zundark 15:01, 7 March 2006 (UTC)
Code points in the private use area are not and never will be officially assigned to anything. You can use them for whatever you like in your own fonts but you probably won't persuade many others to follow your assignments. --Plugwash 16:17, 7 March 2006 (UTC)
The private use area is guaranteed not to be assigned *by Unicode* to anything. When you use codes in the private use area, you have absolutely no protection against someone else using the same codepoint for something completely different. That's what "private use" means, after all. --Alvestrand 16:24, 7 March 2006 (UTC)

Why has Unicode no code point for the Apple logo? -- 10:55, 4 March 2006 (UTC)

Because it doesn't cover corporate logos. Apple has assigned a code point for its logo in the private use area of Unicode. --Zundark 12:48, 4 March 2006 (UTC)
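The private use ranges themselves are fixed in the standard, so a membership check is straightforward; a Python sketch (the function name is mine, and Apple's use of U+F8FF for its logo is a convention of Apple's fonts, not a Unicode assignment):

```python
def is_private_use(cp: int) -> bool:
    """True if the code point lies in one of the three private use
    ranges: U+E000..U+F8FF in the BMP, plus planes 15 and 16."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

print(is_private_use(0xF8FF))  # True  -- the point Apple's fonts use
print(is_private_use(0x0041))  # False -- LATIN CAPITAL LETTER A
```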

Greek and Cyrillic

Why has Unicode separate code points for Greek and Cyrillic? -- 10:45, 12 March 2006 (UTC)

Because they were encoded based on different sources, and size was never made an issue. The only real unification done in Unicode is the Han. --Alvestrand 13:25, 12 March 2006 (UTC)

Circuit components

Why has Unicode no code points for circuit components assigned? -- 14:29, 18 March 2006 (UTC)

Circuit components are not used in natural languages. Also, circuit component symbols, as far as I know, never existed in any historical character encodings (at least, none that the Unicode committees were interested in preserving compatibility with). —mjb 08:22, 19 March 2006 (UTC)
Otoh the musical symbols got in, and from what I remember of the rationale (I can't be bothered finding it right now) it could be applied pretty much word for word to electronic symbols. I guess it's just a case of no one being bothered to write up a proposal. Plugwash 12:18, 19 March 2006 (UTC)
I think I found one of the original docs: [1] - it has a section with examples of symbols used as text. I suspect the main reason for its encoding was to get someone (anyone!) to approve a canonical list of names for the symbols, though.... --Alvestrand 12:59, 19 March 2006 (UTC)

Number of bits

"Depending on the variant used, Unicode requires either 16 or 20 bits for each character represented." is completely wrong, and part of the reason I'm unhappy with these rewrites. Unicode never uses 20 bits for each character. In theory, it takes 20.1 bits to encode a Unicode character. In practice, the only constant length scheme uses 32 bits for each character.

Likewise, Unicode is not designed to encode all "human" languages. It's designed to encode all languages; it just happens that humans are the only group we know of that has a written language.--Prosfilaes 00:03, 25 March 2006 (UTC)
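The "20.1 bits" figure above is just the information content of the code-point range; a quick check with nothing beyond the Python standard library:

```python
import math

# Code points run 0..0x10FFFF, i.e. 0x110000 possible values.
bits = math.log2(0x110000)
print(round(bits, 2))                    # 20.09 -- the "20.1 bits" above
print(math.ceil(bits))                   # 21 -- minimum for fixed width
print(len("A".encode("utf-32-be")) * 8)  # 32 -- what UTF-32 actually uses
```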


What the crap, Unicode 5.0 was supposed to be released some time after February 2006, but it's been a while and it's still not out. Possibly they're running behind schedule? 20:03, 31 March 2006 (UTC)

It's coming soon... they're wrapping up the beta about now. Sukh | ਸੁਖ | Talk 20:57, 31 March 2006 (UTC)
thanks for the answer :) 23:17, 14 April 2006 (UTC)

Bold and italic accented letters

Why has Unicode no code points for bold or italic accented letters assigned? -- 09:58, 21 April 2006 (UTC)

The principle followed in Unicode is that of encoding abstract characters. Properties like weight and slant are considered to be details of the same sort as variations among typefaces and so are not in general encoded. The exception is that there are bold and italic versions of some letters used as mathematical symbols because in mathematics a different weight or slant may distinguish one character from another. Bill 10:17, 21 April 2006 (UTC)

Language of Han characters

Why has Unicode not separate code points for Japanese, Korean, simplified Chinese, and traditional Chinese Han characters? -- 10:49, 21 April 2006 (UTC)

See Han unification; this is probably one of the most controversial decisions Unicode made. I suspect the main reason was the sheer number of characters needed (remember Unicode used to be 16-bit fixed width). Plugwash 11:02, 21 April 2006 (UTC)

Guaraní language

Why has Unicode no code point for g-tilde? -- 12:35, 21 April 2006 (UTC)

From the Unicode principles:
Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposable character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin 1, which includes many precomposed characters such as "ü" and "ñ".
I guess no established standard encoded g-tilde, so g + combining tilde is The Way. --Alvestrand 19:15, 21 April 2006 (UTC)
What language uses g-tilde anyway? Afaict the only reason precomposed characters are there at all was to allow applications to transition to Unicode in stages (e.g. adopt Unicode as their text storage format without having to support combining characters). Plugwash 00:11, 22 April 2006 (UTC)
The clue was in the heading - Guaraní, widely spoken in Paraguay, uses g-tilde. AFAIK, it's the only language that does. All the same, it strikes me as an omission in Unicode, since there are all manner of precomposed characters included which are used only in languages with considerably fewer speakers than Guaraní has. Sure, one can always use a combining tilde, but that isn't going to be an easy thing to explain if one happens to be trying to teach IT in Paraguay (and what's more, I haven't yet found a font where it looks right on the upper case G). AndyofKent 16:31, 12 August 2006 (UTC)
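The combining-sequence approach described above can be demonstrated with Python's standard unicodedata module; note that NFC normalization leaves g-tilde decomposed, precisely because no precomposed code point exists for it:

```python
import unicodedata

# Guaraní g-tilde as a combining sequence: U+0067 LATIN SMALL LETTER G
# followed by U+0303 COMBINING TILDE.
g_tilde = "g\u0303"
print([hex(ord(c)) for c in g_tilde])  # ['0x67', '0x303']

# NFC cannot compose it -- there is no precomposed g-tilde to map to.
print(unicodedata.normalize("NFC", g_tilde) == g_tilde)  # True
```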

Hexadecimal Digits

Why has Unicode not separate code points for the Latin letters A-F and the hexadecimal digits A-F? -- 09:25, 22 April 2006 (UTC)

This is not the right place for such questions. Go ask on the Unicode mailing list or some other place for asking questions about Unicode. This is about writing an article.--Prosfilaes 12:49, 22 April 2006 (UTC)
And please don't delete other people's comments on talk pages.--Prosfilaes 17:47, 22 April 2006 (UTC)
Before asking questions anywhere about why such and such is not in Unicode, please read Where is my character? and Combining mark FAQ on the Unicode web site. Bill 19:49, 23 April 2006 (UTC)
It's simple: because no existing codepage had them separated. Unicode was designed to be a superset of all existing codepages: so, if some characters were separate in any one codepage, then they were separate in Unicode too. And if some characters were separate in no codepage, then, most often, they were not separate in Unicode either. — Monedula 11:55, 24 April 2006 (UTC)


I see some problems with the text in this section:

Web browsers have been supporting severals UTFs, especially UTF-8, for many years now.

That seems to mix up Unicode encodings with Unicode as a document character set.

Display problems result primerally from font related issues.

How is that different to any other use, such as wordprocessing or email?

In particular Internet Explorer doesn't render many code points unless it is explicitly told to use a font that contains them.

Needs to say which version of IE, on which platform, and it seems to be incorrect anyway if referring to Windows IE6 or IE7

All W3C recommendations are using Unicode as their document character set, the encoding being variable, ever since HTML 4.0.

In fact Unicode has been the document character set ever since HTML 2.0.

Although syntax rules may affect the order in which characters are allowed to appear, both HTML 4 and XML (including XHTML) documents, by definition, comprise characters from most of the Unicode code points, with the exception of:

This varies between XML1.0 and 1.1

  • any code point above 10FFFF.

There are no Unicode code points above 10FFFF by definition. --Nantonos 18:10, 27 April 2006 (UTC)

Unicode has been the document character set ever since HTML 2.0. — that is completely incorrect. In HTML 2.0 and 3.2, it was ISO 646:1983 IRV (ASCII, basically) minus all control codes except tab, LF, and CR, plus ISO IR-001/ECMA 94 minus control codes (ISO/IEC 8859-1, basically).[2][3]. In HTML 4.0 the document character set is Unicode's entire range up to 10FFFF, minus the surrogate range and the same control characters that were disallowed in the previous versions. XML 1.0 doesn't use the term "document character set" but it does have the same concept: regardless of how the document is encoded (UTF-8, ASCII, whatever), it consists only of characters from a limited repertoire that is almost, but not quite the same as in HTML 4. This info is in the XML specs but really is not relevant to this article.—mjb 01:17, 1 May 2006 (UTC)
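The allowed range mjb describes for XML 1.0 can be sketched as a predicate (the function name is mine; the ranges are the spec's Char production):

```python
def xml10_char(cp: int) -> bool:
    """XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
    | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
    Surrogates and most C0 controls are excluded."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

print(xml10_char(0x41))    # True  -- 'A'
print(xml10_char(0xD800))  # False -- surrogate
print(xml10_char(0x0))     # False -- NUL
```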

Alt+0### is not a Unicode input method

The "###" in "Alt+0###" is not a Unicode code point and therefore this is not a Unicode input method. On my US Win2K system it appears to be giving me Windows-1252 characters. For example, Alt+0128 gives me the Euro symbol, and anything above 255 is entered as if I had typed it modulo 256 (e.g., 256=000, 257=001, and so on). This misinformation about it being "Unicode" is repeated on Alt codes as well. Please research the actual behavior and fix the articles. Thanks!—mjb 00:48, 1 May 2006 (UTC)

The answer is it depends on the app and many people are confused about it. I've corrected it many times but errors have always been introduced.
In a plain Windows edit box with no special provisions made (as in Notepad), Alt+ is ANSI only.
In a plain Windows rich edit box, and in many apps that make their own provisions (including MS Word and, by the looks of things, Firefox), a leading 0 or a code of more than 3 digits makes it enter Unicode. Plugwash 00:34, 2 May 2006 (UTC)
Actually testing now, it seems even more complex than I thought. I'll need to have a better look into this and see if I can figure out the patterns. Plugwash 00:41, 2 May 2006 (UTC)
Thanks. I know it's not simple, and is poorly documented, which is why I didn't research it myself (we all pick our 'battles' around here)…—mjb 23:33, 3 May 2006 (UTC)
See if is of any help.—mjb 17:26, 13 May 2006 (UTC)

Fullwidth accented letters

Why has Unicode no code points for fullwidth accented letters assigned? -- 14:50, 3 May 2006 (UTC)

Please don't feed the troll.--Prosfilaes 17:49, 3 May 2006 (UTC)
Fullwidth forms exist only for compatibility with legacy encodings. Therefore they only contain characters which existed in such encodings. Plugwash 18:27, 3 May 2006 (UTC)
As Prosfilaes says, please don't feed this troll. This person asks over and over again questions that are not appropriate for this talk page to which he or she could easily find the answers on the Unicode Consortium web site, to which he or she has been referred. Please just ignore these questions. Bill 18:36, 3 May 2006 (UTC)

Latin small letter dotless J

Is there a code point in Unicode for "Latin small letter dotless J" assigned? -- 10:16, 13 May 2006 (UTC)

Yes, it's U+0237. There's also an italic version for mathematics: U+1D6A5.—mjb 17:25, 13 May 2006 (UTC)
Note that this person, whose IP fluctuates but stays in the same range, has asked a long series of these questions, and promptly after you answered, he asked another. The questions are off-topic, since they don't go towards improving the article, and if they are honest, the writer needs to learn how to look for the answers himself, most of which are easy to find on the Unicode website.--Prosfilaes 21:28, 13 May 2006 (UTC)

Latin capital letter J with dot above

Is there a code point in Unicode for "Latin capital letter J with dot above" assigned? -- 18:21, 13 May 2006 (UTC)

No. — Monedula 11:01, 31 May 2006 (UTC)

Why has Unicode no code point for "Latin capital letter J with dot above" assigned? -- 10:29, 3 June 2006 (UTC)

The code points are usually assigned from the question "why should we?" rather than "why not?" Is there a language that uses "Latin capital letter J with dot above"? If not, no need to add the code point. Mlewan 06:22, 5 June 2006 (UTC)
Even if there were a language using it, it wouldn't get added, because it can be formed with a combining sequence. Adding precomposed characters was a temporary concession to help existing software developers migrate to Unicode. Plugwash 15:32, 5 June 2006 (UTC)

Should we insert the Unicode logo on the page?

The following reply was received from the Consortium on May 30, 2006:

You have permission to display the Unicode logo in Wikipedia as long as:
1. The logo links to our website
2. You include the following acknowledgement in the fine print:
The Unicode(R) Consortium is a registered trademark, and Unicode (TM) is a trademark of Unicode, Inc.

Monedula 11:15, 31 May 2006 (UTC)

Why has Unicode no code point for the Unicode logo assigned? -- 15:36, 8 June 2006 (UTC)

Because Unicode does not encode logos. Not even its own. Especially not its own, in fact. ;-> Cherlin

Precomposed characters

Why has Unicode the encoding of new precomposed characters stopped? -- 19:13, 8 June 2006 (UTC)

Introducing a new representation (a precomposed character) for an already encodable (using combining sequences) character is a bad thing because it will break compatibility with those that can handle the character but don't understand the new representation. There is also the issue of normalisation rules that say the normalised form of a character must not change between revisions (so the new precomposed characters could not actually be used in normalised text). Unicode only had precomposed characters in the first place as a carrot to get users of legacy encodings to switch, nothing more. Plugwash 22:00, 8 June 2006 (UTC)
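Plugwash's point about normalisation can be seen with Python's standard unicodedata module: NFC maps existing combining sequences onto their precomposed forms, and those mappings are frozen, which is why a newly added precomposed character could never appear in normalised text:

```python
import unicodedata

# u + U+0308 COMBINING DIAERESIS composes to the precomposed U+00FC...
print(unicodedata.normalize("NFC", "u\u0308") == "\u00fc")  # True

# ...and NFD decomposes it again. These mappings cannot change between
# Unicode versions (the normalisation stability policy).
print(unicodedata.normalize("NFD", "\u00fc") == "u\u0308")  # True
```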

Why has Unicode many code points for precomposed characters assigned? -- 08:18, 9 June 2006 (UTC)


Why has Unicode no code points for ligatures assigned? -- 07:25, 9 June 2006 (UTC)

Shut up and read ! (talk) 23:56, 29 November 2007 (UTC)

Double Byte Character Sets before Unicode: Only in East Asia?

Why are all Double Byte Character Sets, before Unicode was published, from East Asia? -- 07:31, 9 June 2006 (UTC)

Really, this is not the point of the talk page. Go somewhere else to ask these questions.--Prosfilaes 07:37, 9 June 2006 (UTC)

Confusing typo in Ligatures section

The phrase "the special marks are preceed the main letterform in the datastream" appears in the Ligatures section and seems to have a typo. Which of the following is it trying to say?

1: "the special marks preceed the main letterform in the datastream"
2: "the special marks are preceeded by the main letterform in the datastream"

These two phrases have opposite meanings, of course. --Silent Blue 13:46, 16 June 2006 (UTC)