Talk:Unicode: Difference between revisions

Tarquin (talk | contribs)
Input methods: can't get this to work on Windows 2000


Another form to enter Unicode characters is Alt + '''PLUS SIGN''' + HEX CODE. It worked for me on Windows XP. --<span style="font-family: DejaVu Sans">[[User:ɑʀʇʉʀɵ|ʀʇʉʀɵ]]</span> 23:57, 25 January 2007 (UTC)

The article says: "it is possible to create Unicode characters by pressing Alt + PLUS + #, where # represents the hexadecimal code point up to FFFF; for example, Alt + PLUS + F1 will produce the Unicode character ñ." but I can't get this to work on Win2000. In MS Word, I just get the latest from the Recent Files opened. Does "+" mean the keys should be pressed sequentially or all together?
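As far as I know, the hex Alt-code method on Windows 2000/XP also requires the `EnableHexNumpad` registry value (a REG_SZ set to "1" under `HKCU\Control Panel\Input Method`, followed by logging off and on), which may be why it fails out of the box; "+" means holding Alt while pressing the numpad plus and then the hex digits. The code-point arithmetic itself can be sketched in Python (the helper name here is mine, not from the article):

```python
# Sketch of what a hex Alt-code denotes: the hex digits name a Unicode
# code point directly.  (Actually typing Alt + numpad-plus + digits on
# Windows additionally requires the EnableHexNumpad registry setting.)

def alt_hex(code: str) -> str:
    """Return the character the hex Alt-code sequence denotes."""
    return chr(int(code, 16))

print(alt_hex("F1"))    # ñ (U+00F1), the article's example
print(alt_hex("2260"))  # ≠ (U+2260)
```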


== Nifty resource. ==

Revision as of 12:39, 18 June 2007


Intro

This paragraph was added to the start of the article:

Unicode is a standard used in computer software for encoding human readable characters in digital form. The most common encoding is the ASCII code, which can encode a maximum of 127 characters, which is enough for the English language. As computer use spread to other languages, the shortcomings of ASCII became more and more apparent. There are many other languages with many other characters; Asian languages in particular contain many, many characters.

I removed it because it is inaccurate (It overplays Unicode-as-a-standard rather than Unicode as a consortium that produces lots of standards), confusing (its mention of ASCII is not clearly historical), and adds no information that isn't already in the article. I assume, however, that it was added because someone thought the existing first paragraph was unclear, so I'm open to suggestions about how to improve it. -- 12:20, 9 November 2001 (UTC) Lee Daniel Crocker

---

I think section 0 could be improved. It currently reads: In computing, Unicode is the international standard whose goal is to specify a code matching every character needed by every written human language, including many dead languages in small scholarly use, to a single unique integer number, called a code point.

To: In computing, Unicode is the international standard whose goal is to specify a code matching every character needed by every written human language, including many dead languages in small scholarly use such as foo and bar as well as [some other good example, perhaps a made-up language?], to a single unique integer number, called a code point.

I think the intro would be better by adding two examples there. Furthermore, I think "is the international standard" should be "is an international standard", or has it been approved by a major authority as the standard? -- Ævar Arnfjörð Bjarmason 17:44, 2004 Oct 5 (UTC)

The parts of Unicode which are also in ISO 10646 most likely define it as the standard, also given the fact that maintenance of ISO 8859 has been put into hibernation. Pjacobi 20:27, 5 Oct 2004 (UTC)


And, arguing about section 0, what about: in internationalization of software. A Thai programmer writing a program with a Thai user interface for Thai customers doesn't fit the definition of internationalization at all. -- Pjacobi 20:30, 5 Oct 2004 (UTC)

The interesting thing for most people is that it provides a way to store text in any language in a computer. Starting off by mentioning "unique integer numbers" doesn't make Unicode easier to understand. Even as a computer programmer, I have a bit of trouble reading that sentence and understanding what it means. And it's not really true as given; "characters" in Unicode is a polite fiction. Many characters (Maltese "ie", Lakota p with bar above, many Khmer characters) are more than one character in Unicode-ese. Going to rewrite boldly. --Prosfilaes 21:47, 11 Oct 2004 (UTC)

Downloads

I am not a techie! Nevertheless I can see the usefulness of much of the material available in Unicode. Neither am I the sort of anti-techie that complains that anything in other than plain-Jane unaccented English alphabetical characters must be thrown out of Wikipedia, or that articles should not be displaying meaningless question marks. I was visiting the chess page, and someone there has made a valiant effort to produce diagrams of how the pieces move by using only ordinary keyboard characters. I'm sure that he would not take it as a sign of disrespect when I say that it looks like shit.

I see no such chart there. Evertype 15:43, 2004 Jun 20 (UTC)
That's because they've been removed. They were there when the comment was posted in March 2002. --Zundark 16:11, 20 Jun 2004 (UTC)

I'm sure that most of us would like to see the special symbols, letters, or chinese characters at the appropriate time and place. At the same time I understand that for many Wikipedians there are technical reasons which prevent their hardware from dealing with this material (eg. limited memory). Then there are others for whom only the appropriate software is missing. Even some of the people with hardware restrictions may be able to handle Greek or Russian, though probably not Chinese. In cases where I've tried to find the code, I've ended up wading through reams of technical discussions. These discussions may be very interesting, but they don't provide a solution to my immediate problem.

The practical suggestion may be a notice at the head of any article containing symbols not in ISO 8859-1, saying in effect: "This article contains non-standard characters. You may download these characters by activating this LINK". Eclecticology

Just because an HTML document contains characters that are not in the ISO 8859-1 range doesn't mean that the characters are nonstandard. HTML 4.0 allows nearly all of Unicode to be used in a document, and all web browsers make an attempt to handle any character they encounter. The problem is merely that the underlying operating systems upon which the browsers rely to provide character rendering tend to be either not Unicode-aware or just do not have a good selection of fonts (character-to-glyph mappings) installed.
There's no reliable way to guess the user's character rendering capabilities, so we really don't know when to tell people when it would be a good idea to download font files, and fonts tend to be OS-specific anyway. I prefer just to acknowledge in the prose that any non-ASCII characters may or may not render as they are *supposed* to. I don't think we should dumb down the HTML and avoid those characters though. - mjb 18:21 Feb 20, 2003 (UTC)
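The point above about HTML can be illustrated with a small sketch: a page whose physical charset is plain ASCII can still carry any Unicode character via numeric character references, which the browser resolves. Python's stdlib `html` module shows the mapping (the example string is mine):

```python
# Hedged illustration: HTML numeric character references keep the
# document bytes ASCII while naming arbitrary Unicode code points;
# resolution happens at the markup layer, not the charset layer.
import html

markup = "Greek pi: &#x3C0;, ellipsis: &#x2026;"
print(html.unescape(markup))  # Greek pi: π, ellipsis: …
```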

In cases where I've tried to find the code, ...

What exactly were you looking for? Do you have the Unicode value, and you're looking for a typical glyph (like an ASCII chart)? Are you looking for the Unicode value?

These discussions may be very interesting, but they don't provide a solution to my immediate problem.

What exactly is your "immediate problem" ?


Is there a reason to use '<code>foo</code> <code>bar</code> <code>baz</code> ...' instead of '<code>foo bar baz ...</code>'? -- Miciah

UTF-7

Isn't there a UTF-7? Or is it an invention of Microsoft (it's in .NET)? CGS 21:54, 16 Sep 2003 (UTC).

Yes[1], but it's virtually never used. --Brion 23:33, 16 Sep 2003 (UTC)

The oldest of Unicode's encodings is UTF-16, a variable-length encoding that uses either one or two 16-bit words, manifesting on most platforms as 2 or 4 8-bit bytes, for each character. {NB: This can't be true; UCS-2 has to predate UTF-16!}

66.44.102.169 wrote "{NB: This can't be true; UCS-2 has to predate UTF-16!}" in the article. UTF-16 was previously UCS-2 but I'm not sure that makes the statement untrue as such but I reworded it anyway. Angela.

In the way the terminology today is used, UCS-2 doesn't have surrogate support, and certainly a 16-bit encoding without surrogate support existed before one with it. I don't think either of these were called UCS-2 and UTF-16 at the time though. Morwen 11:52, 6 Dec 2003 (UTC)
I wrote the comment that UTF-16 can't be the oldest encoding. Currently text may be encoded in UCS-2 or it may be encoded in UTF-16. Many Windows application designers pay no heed to the difference, but by their assumptions clearly support UCS-2 and not UTF-16. I speak of the MS-Windows world, wherein UCS-2LE holds dominant sway. In fact, Microsoft documents very commonly use the term "Unicode" as a synonym for UCS-2LE. Anyway, I meant that in the current time, we have both UCS-2 encodings and UTF-16 encodings, and I suspect we will all agree that the UCS-2 encodings (by whatever name) predate the UTF-16 encodings. :)
Windows supports UTF-16 surrogates, though the support isn't enabled by default. For an application that doesn't support stuff outside the BMP there is no difference between UCS-2 and UTF-16; for one that does, UCS-2 would be pointless to support since you can read UCS-2 data as UTF-16. So they can be considered to be essentially the same encoding with different feature restrictions. Plugwash 02:19, 18 September 2005 (UTC)
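The UCS-2 vs UTF-16 distinction discussed in this thread can be sketched in Python (a hedged illustration; the variable names are mine): a BMP character occupies one 16-bit unit, while a supplementary character becomes a surrogate pair that a UCS-2-only application would misread as two separate characters.

```python
# BMP character: one 16-bit unit; supplementary character: two units
# (a surrogate pair) under UTF-16.  UCS-2 has no surrogate support.
bmp = "\u20ac"         # EURO SIGN, inside the BMP
astral = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

print(bmp.encode("utf-16-be").hex())     # 20ac     (one 16-bit unit)
print(astral.encode("utf-16-be").hex())  # d834dd1e (surrogate pair)
```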

A Brain Dropping follows:

Wasn't Unicode created to encode all languages - not just 'human' languages? In the future then, why couldn't Unicode conceivably be used to encode extraterrestrial languages as well?(well, why not? hehehehe) Therefore, shouldn't the 'human' be removed from this page? One possible alternative: Unicode is the international standard created, whose goal is to specify a code matching every character needed by every known written language to a single unique integer number, called a code point.

The Universal Character Set, whether in its Unicode Standard or its ISO/IEC 10646 manifestation, was made to encode the writing systems of the world, not the languages of the world. Evertype 15:43, 2004 Jun 20 (UTC)
I'm not an expert on this, but I don't believe the Unicode Consortium seeks to encode writing systems whose existence we don't yet know of (and if we ever meet aliens who use more than 2^32 characters, Unicode will have a problem). I believe that Tolkien's Elvish scripts are in there, but other fictional scripts like Klingon are not. So they're not being wholly anthropocentric. adamrice 00:02, 11 Jul 2004 (UTC)
Tengwar and Cirth are not yet encoded, but are roadmapped for encoding. To answer the brain-dropping: Were we to meet aliens who had an encodable writing system, it is likely that their characters would fit. Evertype 11:59, 2004 Jul 11 (UTC)
I find that statement a little overreaching. The aliens could easily have a writing system with a million symbols, or several Chinese-size writing systems, or a history of writing that dates back millions of years instead of five or six thousand. Or simply be a group of ten different species of aliens with writing histories as complex as ours. The best we can say is that humans, all told, will use about 3 planes, and there's 17 planes of characters. --Prosfilaes 03:49, 12 Jul 2004 (UTC)
More likely, I think, is the possibility that another intelligent species of life will perceive the universe in entirely different ways than humans do, and will ascribe little to no value at all to written language as we know it. -Factorial, 15:52, 28 Nov 2005 UTC
There is currently no Klingon alphabet/writing system that is suitable for encoding. The glyphs shown in Star Trek are merely a nearly-1:1-mapping of the Latin alphabet. The Star Trek folks seem to have a real Klingon alphabet, but have not yet published it. JensMueller 10:06, 5 Sep 2004 (UTC)
Any alphabet is going to have a nearly-1:1 mapping to the phonemes of the language, and hence to the Latin transcription. Why must a "real" Klingon alphabet be ill-fitted to the Klingon language? --Prosfilaes 04:11, 9 Sep 2004 (UTC)
There has been no sufficient Klingon alphabet published, and nearly all Klingon users online are using a "roman transcription". The reason the Klingon encoding proposal was turned down was that it was hardly used (and also because the sounds were mapped to English sounds, rather than Klingon). If a canonical Klingon alphabet were to appear, I guess a Unicode encoding is likely. Klingon seems to have 26 sounds, according to the Wiki article and www.kli.org, which shouldn't be too difficult to find a mapping area for.
The Klingon pIqaD alphabet has a mapping in the Private Use Area of Unicode, and has recently come into occasional use on the Internet. See, for example, Chatting in pIqaD and qurgh's blog. The requirement for getting Klingon pIqaD an assignment of regular Unicode code points is some level of use in data interchange. We can expect that it will qualify at some time in the future. --Cherlin

Perhaps for "Issues?"

One of classicists' issues with Unicode has been the omission of the LATIN Y WITH MACRON characters. While the omission has been corrected in Unicode 3, most user agents don't know to render anything for that codepoint. Somewhere in that story is an issue that perhaps might make sense in the article -- either the omission of the letter, or the outdated support available by user agents (I don't see Microsoft rushing to update its fonts and packaging them as an update to Windows or Internet Explorer just to comply with recent standards).

Definitely not. This is not an "issue" with Unicode, but with implementation. We add support for classicists constantly, and it wasn't Y WITH MACRON alone either. Evertype 21:49, 2004 Jul 10 (UTC)

UTF-8 as the basis for multilingual text?

"UNIX-like operating systems such as GNU/Linux, BSD and Mac OS X have adopted Unicode, more specifically UTF-8, as the basis of representation of multilingual text."

Mac OS X stores a lot of text in UTF-8, but the other UTF's are also supported throughout the system and widely used. I agree that UTf-8 is currently the most widely used Unicode encoding (because it is the most legacy-compatible encoding), and that is important enough to mention in the leading section, but perhaps it should be rephrased so that it doesn't mislead the reader into believing those OSes don't support other kinds of Unicode? — David Remahl 04:16, 9 Sep 2004 (UTC)

AFAICT UCS-2/UTF-16 (which can essentially be regarded as different implementation levels of the same thing) is the norm in the Windows and Java worlds, whilst UTF-8 is the norm in the Unix and web worlds. Yes, other encodings are supported by the conversion libs, but UTF-8 and UTF-16 are the ones the systems are built around.
Microsoft Word for Windows gives me the option to save plain text in 4 flavors of Unicode: Big-Endian (UTF-16BE?), UTF-7, UTF-8, or "Unicode". If I choose "Unicode" what exactly do I get? From the above, UTF-16, is that correct? --84.188.146.200 02:35, 9 February 2006 (UTC)
Hmm, which version of Word are you using, and can you say exactly where you found those options? I can't find them in Word 2K here. I'd guess its plain Unicode option is little-endian UTF-16 though. Plugwash 22:46, 18 February 2006 (UTC)
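The guess above is consistent with Windows' native "Unicode" text format generally being little-endian UTF-16 with a byte-order mark, which can be sketched in Python (a hedged illustration, not a statement about any particular Word version):

```python
# Windows-style "Unicode" plain text: a UTF-16 little-endian stream
# prefixed with the byte-order mark FF FE.
import codecs

text = "ñ"
blob = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(blob.hex())  # fffef100

# Python's generic "utf-16" decoder consumes the BOM and picks
# the endianness from it:
print(blob.decode("utf-16"))  # ñ
```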

Largest and most complete

This phrase has appeared recently without much discussion. "Unicode is the most complete character set, and one of the largest." Could anyone give justification? -- Taku 06:14, Oct 12, 2004 (UTC)

ISO/IEC 10646? Unicode reserves 1,114,112 (2^20 + 2^16) code points, and currently assigns characters to more than 96,000 of those code points. No other encoding even comes close. {Ανάριον} 09:09, 12 Oct 2004 (UTC)
As for the completeness, take a look at mapping of Unicode characters for all the scripts encoded. {Ανάριον} 09:10, 12 Oct 2004 (UTC)
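The code-space figure quoted above can be checked with simple arithmetic: 17 planes of 2^16 code points each give 2^20 + 2^16 = 1,114,112 possible code points, the last being U+10FFFF.

```python
# Verify the quoted Unicode code-space size: 17 planes x 65,536
# code points per plane = 2^20 + 2^16 = 1,114,112.
planes = 17
per_plane = 2 ** 16
print(planes * per_plane)           # 1114112
print(2 ** 20 + 2 ** 16)            # 1114112
print(hex(planes * per_plane - 1))  # 0x10ffff, the last code point
```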


GB18030 is by definition as large as Unicode, but except for the pre-existing mappings, all GB18030 code points and Unicode code points, including yet-unassigned ones, are algorithmically mapped. So it is more like a strange encoding form of Unicode.
For certain scripts, there are character sets with more precomposed glyphs, e.g. VISCII for Vietnamese, TSCII for Tamil, or some scholarly encoding for pointed Hebrew. But they don't count as larger, as they don't support more than one or two scripts, and they don't count as more complete, as the encoded characters are uniquely representable in Unicode as sequences including combining characters.
So yes, according to all my knowledge and research, Unicode is the most complete and one of the largest character sets for which information is freely available in languages I can read.
If you have knowledge of implemented character sets (not counting proposals, which are cheap to make) which are more complete than Unicode, please elaborate.
Otherwise, I'll revert your reversal.
Pjacobi 09:12, 12 Oct 2004 (UTC)
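The GB18030 relationship described above can be sketched in Python, whose stdlib ships a `gb18030` codec: because the mapping covers the whole Unicode code space algorithmically, arbitrary Unicode text round-trips through it (the sample string is mine).

```python
# GB18030 maps the entire Unicode code space (partly via tables, the
# rest algorithmically), so any Unicode text survives a round trip --
# behaving more like a Unicode encoding form than a rival character set.
sample = "Grüße, 中文, \U0001f600"  # Latin, CJK, and an astral character
encoded = sample.encode("gb18030")
assert encoded.decode("gb18030") == sample
print(len(encoded), "bytes for", len(sample), "code points")
```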
Unicode is meant to be a superset of all known character sets, so it is hardly possible that there are character sets not covered by it. (And surely all VISCII and TSCII characters are included in Unicode as they are). — Monedula 11:11, 12 Oct 2004 (UTC)
Unicode is only a superset of the character sets in use when they started out. In practice, they were a superset of most new character sets up to 2000, at which point they stopped encoding new precomposed characters. (So they're still a superset of them, in a practical sense.) One of the Chinese standards that encoded every minor variation on the ideographs wasn't added to Unicode, which decided to adopt a more unifying encoding policy, but all the Chinese and Japanese standards in widespread use are subsets. --Prosfilaes 21:53, 12 Oct 2004 (UTC)
I agree with your first half-sentence, but Unicode has decided to not encode any more precomposed characters. Only for pre-existing national and international standards was there a consensus to include the precomposed characters. Now, new suggestions for precomposed characters are routinely declined, and for good reasons. In fact it is hoped that in some future version (6.0?) all existing precomposed characters will become deprecated. A similar case exists for glyph variants. What got in is in, but new additions will be declined. See also: http://www.unicode.org/standard/where/
So there is no chance (neither is there a necessity) that TSCII codepoint 0xE0 "tU" will be assigned a single codepoint in Unicode. Instead it transcodes as [U+0ba4 U+0bc2] --Pjacobi 12:37, 12 Oct 2004 (UTC)
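The transcoding claim above can be checked with Python's `unicodedata`: the TSCII "tU" cell maps to the two-code-point sequence U+0BA4 U+0BC2, and since no precomposed form exists, NFC normalization leaves the pair alone.

```python
# TSCII's single "tU" cell corresponds to a combining sequence in
# Unicode; NFC has nothing to compose it into, so it stays two
# code points.
import unicodedata

tu = "\u0ba4\u0bc2"  # TAMIL LETTER TA + TAMIL VOWEL SIGN UU
print([unicodedata.name(c) for c in tu])
assert unicodedata.normalize("NFC", tu) == tu  # nothing to compose
```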
Ignoring the whole Han ideographs, and ignoring the sets that are basically new encodings of Unicode, what is there? TRON? --Prosfilaes 21:53, 12 Oct 2004 (UTC)
The idea of switching between several character encodings isn't unique to TRON; it was already included in ISO-2022. And both have in common that it makes implementation difficult and scatters the design process of new script encodings instead of unifying it. I don't think much of it is still in use. Heck, you can't even use both (TRON and full ISO-2022 with escape switching) on the Web or in e-mail. Most Unicode criticism on the TRON advocates' pages is just outdated or a result of misunderstandings. --Pjacobi 22:32, 12 Oct 2004 (UTC)

I am convinced that Unicode is probably the largest and most complete character set, but can we still ignore criticism of Unicode? What I have often heard about Unicode is that it is inadequate in handling old text or text containing outdated characters. Maybe most of the criticism is pointless or a result of misunderstandings, but I still hear it, and I don't think we should make a general statement which not everyone agrees with. Unicode is meant to be the largest and the most complete, but whether it really is so is disputed, even if such dispute is nonsense in actuality. -- Taku 22:41, Oct 12, 2004 (UTC)

Yes, of course the criticisms must be included, but we must try hard to find the right criticisms and the right way to present them. Don't forget the long expertise of Unicode in this field and the large number of field experts contributing to the evolving Unicode effort. We would achieve nothing but spoil the credibility of the Wikipedia if we hastily add criticisms of mediocre quality.
A generic problem with Unicode is the long process it takes to get additions done. This is the downside of centralism. And you need somebody with "weight" to get major additions and changes done: either a national standards body or field experts of value.
A brainstorming list of criticisms:
  • Unicode got the Hebrew points for biblical texts all wrong (or something like that, I'm no expert)
  • Unicode has unified scripts (and requires different fonts and markup to differentiate), which should not have been unified.
  • Unicode has not unified scripts, which should have been unified, as they are only font differences.
  • Unicode has too few presentation forms for complex shaping scripts
  • Unicode has not enough presentation forms for complex shaping scripts
  • Unicode has too few precomposed glyphs
  • Unicode has not enough precomposed glyphs
As you can see, some criticisms arise out of the fact that decisions must be made in a standard on questions which are viewed differently by different people.
Pjacobi 00:32, 13 Oct 2004 (UTC)
If you want to say that Unicode is not the largest and most complete, then there must be something that's larger or more complete. If you tell us what it is, we can discuss it.
Most of the complaints about Unicode don't stem from size or completeness. Most of the scripts and characters that are left are very obscure and almost invariably not used for writing new material. The complaints come from how Unicode treats the existing scripts; often the question is whether two entities should be treated as distinct. Since in all these cases, they are distinct in some ways, and not in others, there's no "right" answer that will satisfy everyone. The Chinese and Japanese encodings that are supposedly more "complete" are in reality more fine-grained, in that they separate characters that Unicode unifies. --Prosfilaes 20:01, 13 Oct 2004 (UTC)

Reorganization

I made some reorganization of the sections and the continuing work on the leading section. I think the new 4 big sections make good sense: origin and development, mapping and encoding, process and issues and in use. In addition to this, we probably need:

  • difference in character and glyph; we should give some example
  • difference in mapping and encoding; particularly, what is code point, what is plane?
  • short summary of utf; what is utf? and why we want it
  • size comparison, particularly what Unicode chooses not to include; perhaps Pjacobi is right that some criticisms are wrong, but it is still true that many people advertise their sets as being larger and more complete. We need some response to them.

If I have some time, I will try to address them but you can also help me. Finally, I'm sorry for the late reply to the "unicode as largest and the most complete" question. I slightly reworded the mention. Please make further edits if you think necessary. -- Taku 20:42, Oct 17, 2004 (UTC)

You must give some concrete examples who advertises which character set to be larger in what specific sense. --Pjacobi 22:18, 17 Oct 2004 (UTC)
ok, the press release of chokanji 3 [2] (in Japanese) says it supports 170,000 kanji while Unicode handles 20,000 chinese characters, 12,000 of which are kanji. -- Taku 02:15, Oct 18, 2004 (UTC)
a) If I'm not mistaken, the press release is dated 2001-01-06. So, nearly four years later, are there any implementations? Can you give the URL of a single webpage in this charset? Is the IANA registration in progress? Does somebody work on GNU iconv support? Does somebody work on IBM ICU support? You can't compare vaporware to a widely implemented standard.
b) As of version 4.0, Unicode supports 71,000 Han characters; it is horribly outdated or misinformed to state the number 20,000. And the PRC is busily adding more. It is a political decision of JIS not to propose adding more kanji, either because JIS doesn't see the necessity or for other reasons.
Pjacobi 06:12, 18 Oct 2004 (UTC)
We don't compare. You wanted "some concrete examples who advertises which character set to be larger in what specific sense." So this is the answer. Again, I didn't mean that what they are saying is fair. I won't use their product because there is just so little compatibility, and besides, I don't have any practical problem with Unicode. I mean, I agree with you, so I am not sure whom you are trying to convince. -- Taku 13:14, Oct 18, 2004 (UTC)

Thank you for giving the concrete example. Yes, I specifically asked for it. I apologize for replying in flame-war style. --Pjacobi 14:41, 18 Oct 2004 (UTC)

Many documents in non-western languages, for instance, are still represented in other character sets. Which languages? Which character sets? In this generality it doesn't help. Please state languages and character sets used. And remember, GB18030 is now fully harmonized with Unicode and cannot be considered a different character set, but a Unicode encoding form standardized by somebody other than ISO or the Unicode Org, namely the Guobiao. --Pjacobi 22:23, 17 Oct 2004 (UTC)

Maybe Shift-JIS? I don't think it is the only character set used besides Unicode. If you know more, that would help. -- Taku 02:15, Oct 18, 2004 (UTC)
The largest use of non-Unicode charsets is still EBCDIC, ASCII and ISO-8859-1, as seen on this Wiki. So this doesn't look like a west-vs-east problem to me. The difference is that almost universally all other charsets are considered to be subsets of Unicode nowadays. And especially the HTML and XML character model explicitly states that while the physical charset may vary, the logical charset is always Unicode. Also in programming, it is nearly always assumed that everything can be converted (and most things reversibly) to Unicode.
So if I can judge this correctly, the Unicode character encoding model is only challenged by some users of Japanese, and not much is known outside of Japan of this. As said above, I'm very skeptical about the practical relevance of the Unicode challengers. But the interesting point, why this happens in Japan, seems to be good stuff for a separate article about Japanese character encoding.
Pjacobi 06:23, 18 Oct 2004 (UTC)
I had absolutely no intention to make a case like a west-vs-east problem. If you think some sentences are problematic, then go ahead and edit. I just wanted to illustrate the adoption of Unicode, and the sentence was absolutely never meant to imply that the use of Unicode is problematic or anything. Besides, I am not sure what you are saying. I don't think you believe any non-Unicode character sets have died out completely. We want to show when Unicode is used and when it is not. I mean, what do you want after all? -- Taku 13:14, Oct 18, 2004 (UTC)
Sorry for being unclear. And apologies for not contributing to the article itself in the moment. I am of the opinion some non-trivial additions are dearly needed (on the character model, on character vs glyphs vs graphemes), but I feel unable to do it myself. Perhaps I'll try it next week.
No, I surely don't want to say non-Unicode character sets have died out completely. What I tried to say is that the character encoding model of Unicode is a nearly universal success, and nowadays other character sets are mostly seen as subsets of Unicode. This wasn't the case ten years ago.
Pjacobi 14:41, 18 Oct 2004 (UTC)

It's fine. I was just puzzled about what upset you so much. As a matter of fact, I am neither the backer of unicode nor the detractor. I am only interested in making the article informative for those who have questions about unicode. It's very surprising that many people don't know well about unicode, even computer programmers. The article could be a help for them. -- Taku 15:54, Oct 23, 2004 (UTC)

Fully agree. When supporting charset issues (as I do sometimes for Firebird SQL) it's quite amazing that some programmers at first don't even see a problem in the different mappings between characters and bytes. --Pjacobi 17:55, 23 Oct 2004 (UTC)

Phishing

In the section that talks about pre-composed characters vs. composing with several codepoints, how about mentioning that this capability opens up lots of opportunities for phishing once URLs in UTF-8 are more universally accepted? For example, once accented characters are common in website addresses, links with a pre-composed "è" and a separate "e" plus an accent will point to different sites, but look identical to the user (in fact the intent is for them to look the same). I don't know if this info belongs here, but it's an interesting tidbit. Rlobkovsky 00:06, 6 Dec 2004 (UTC)

If and when URIs start supporting characters beyond ASCII in a standard way, some decomposing must take place, as according to the principles behind Unicode the precomposed character à is exactly equivalent to ` + a. Any future internet domain funkèynáme.ext will have to point to the same IP(v6?) address for all its possible decompositions.

-- Jordi· 12:26, 28 Dec 2004 (UTC)
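The spoofing concern in this thread can be sketched with Python's `unicodedata`: a precomposed "è" and the sequence "e" plus a combining grave accent render identically but compare unequal, which is why un-normalized identifiers can collide visually (a hedged illustration; the variable names are mine).

```python
# Precomposed vs decomposed forms look the same but are different
# code-point sequences; normalization (here NFC) makes them compare
# equal, which is the usual defense for identifiers such as domains.
import unicodedata

precomposed = "\u00e8"   # LATIN SMALL LETTER E WITH GRAVE
decomposed = "e\u0300"   # e + COMBINING GRAVE ACCENT

assert precomposed != decomposed  # distinct code-point sequences
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```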

Sentence

"To address the short coming, Unicode is being revised periodically with the addition of more characters and increase in the size of characters potentially represented in unicode."

It's something of a moot point now, but in case it comes up in the future, the reason I cut that sentence is because it was inaccurate. They don't add more characters to address the shortcoming (one word) that people don't use Unicode; there are probably fewer than a hundred thousand people who would use any of the scripts that are going to be added to Unicode. And for several of the scripts, like Egyptian Hieroglyphics or Hungarian Runic or Tengwar, there's no commercial interest in the script, and there's little to no academic interest in encoding the script (the Egyptologist community has basically told Unicode to go away and come back in a few decades). Hobbyist demand for unencoded scripts isn't a huge shortcoming that Unicode is trying to overcome.

What does "increase in the size of characters potentially represented in unicode" mean? I assume by size, you mean number (since you can increase the size of characters just by using a larger font), but I'm not sure what "potentially" means here. As I read it, it's redundant with "addition of more characters". --Prosfilaes 03:38, 11 Dec 2004 (UTC)

The simplest representation of Unicode (giving every character the same number of bits, rather than a more complicated variable-width encoding) has historically increased from 16 bits to about 20 bits. There are (currently) about 2^20 "potential" characters. I suspect the original author expected that in the future *more* than (roughly) 20 bits will be required, and that the consortium is planning to "periodically" increase the number of bits. --DavidCary 22:17, 11 Feb 2005 (UTC)

The consortium doesn't plan to increase the number of bits. In 15 years, two planes of characters have almost been filled, out of 15. Just as importantly, those two planes include virtually every character used in a computer; a few people use Tengwar or pIqaD or Cuneiform or Egyptian hieroglyphics, but they're incredibly rare and they amount to a few thousand characters, not the more than a half million it would take to require expansion. And honestly, if it was a matter of expanding for those or ignoring them, their concerns are minor enough and the changes in every piece of Unicode software major enough I suspect they would get ignored. --Prosfilaes 00:30, 1 Jun 2005 (UTC)
It depends on exactly how you define filled.
The BMP (plane 0) is basically full, mostly with fully allocated and standardised codepoints.
The SMP (plane 1) is mostly stuff in various stages of approval, but still has quite a bit of room marked as completely unknown (less than half though).
The SIP (plane 2) is more than half filled by "CJK Unified Ideographs Extension B", and most of the rest is pencilled in for yet more CJK stuff.
The SSP (plane 14) is mostly empty right now.
IIRC planes 15 and 16 are reserved for private use, but I'm not sure.
So if you count the areas that are pencilled in for future scripts, then a LOT more than 2 planes are in use.


I define filled as allocated. Pencilling in is meaningless; many of the pencilled-in scripts don't have anyone interested in formalizing them for Unicode. Everything in the SMP right now can fit in the unused space in the SIP. 128 characters are used in SSP, which is barely touched. Even with everything pencilled-in, only three planes would be filled, which doesn't qualify as a LOT more.--Prosfilaes 01:22, 27 Jun 2005 (UTC)
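The plane numbers thrown around above are just the high-order bits of the code point; a small Python illustration (example characters are my own, chosen to land in the planes named in this thread):

```python
# A code point's plane is its value divided by 0x10000, i.e. shifted
# right by 16 bits.
def plane(cp: int) -> int:
    return cp >> 16

assert plane(0x4E2D) == 0    # BMP: a common CJK ideograph
assert plane(0x1D11E) == 1   # SMP: MUSICAL SYMBOL G CLEF
assert plane(0x20000) == 2   # SIP: CJK Unified Ideographs Extension B
assert plane(0xE0001) == 14  # SSP: tag characters
assert plane(0xF0000) == 15  # private-use plane
```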

Chinese Punctuation

"Unicode also has a number of serious mistakes in the area of CJK punctuation. For example, it mistakenly treats partial punctuation marks in the various CJK encodings as full punctuation marks, for instance treating half of a CJK ellipsis as the same as an English ellipsis, even though the two glyphs are both semantically and visually dissimilar (considering that the CJK ellipsis can be centred between the baseline and ascender, but the English ellipsis must always be placed on the baseline)." --Gniw 06:53, 6 Feb 2005 (added to article)

This page should not be a page of everyone's minor complaints about Unicode. I've read the Unicode list for four or five years, I've read the Standard, I've read both pro- and anti-Unicode pages (including all the Tron pages in English, and they include about every general or Japanese-specific Unicode complaint possible) and I've never heard this before. Given that it seems to be one person's complaint, I don't think it's worthy of being added to an encyclopedia article. --Prosfilaes 21:48, 6 Feb 2005 (UTC)

This is not a minor complaint if you do bilingual typesetting or write bilingual (Chinese and English) web pages. The ellipsis misidentification in Unicode makes mixed English-and-Chinese web pages very ugly. But given the sad state of punctuation typesetting taught at art schools these days, and the way English computing has changed Chinese typesetting, I'm not surprised that no one has talked about this. Ah Wing 22:49, 9 Feb 2005 (UTC)
I stand on my position. This is an encyclopedia, not a list of what's wrong with Unicode. If there's no English pages on the issue, then most of the people who could fix the issue have never heard of it; and if no one has ever seen fit to bring it before them, I hardly see it as a major issue. I wouldn't post bug reports about a program on Wikipedia, so I don't see this as appropriate.
But please, if someone else has an opinion on this, please chime in.--Prosfilaes 03:45, 11 Feb 2005 (UTC)
Why isn't this a big issue? The triviality of this is precisely the reason it is important; it shows that Unicode has mistakes that even primary school students should be able to spot, yet here it is in the standard. This just shows how sloppy Unicode is regarding CJK.
Do you really think that if the people likely to be affected by the issue have mentioned it, but the discussion happens not to be in English, then it is not an issue?!
What you mean is "the use of English is a requirement for an issue to be recognized as an issue" or "no matter whether people have discussed it or not, if it has never been discussed in English then it cannot possibly be an issue". Or, in short, "English is the measure of all things". If this is not Western imperialism I don't know what it is, and you don't understand why the Japanese are opposed to Unicode? Opposition to Unicode is not really so much of a technical problem but more a perception of a lack of respect, the fact that my contribution was deleted on New Year tells a lot. 24.101.156.72 19:18, 11 Feb 2005 (UTC)
If a Chinese encyclopedia wrote an article complaining about some problem in the English Wikipedia, and they never mentioned it to anyone who could fix it, we'd be a little pissed. Bring the issue before us, and if we choose not to fix it, then there's a valid complaint, but we can't fix what we don't know about. If it doesn't matter enough to bring it to the people who can fix it, or the people discussing it don't respect the standard enough to try and fix it, it's not an important issue.
I think it says a lot that you're not discussing the issue, you're complaining about imperialism and that somehow people shouldn't correct articles on holidays. I will repeat again: this is a thirty-year-old problem made by Chinese standards. You can't do better using Big5 or any other Chinese standard. Which says a lot to me about the importance of the problem.
While we're on the subject of "Western Imperialism", I will note that the US-based Summer Institute of Linguistics and the Ireland-based Michael Everson have been instrumental in getting new scripts (e.g. several Philippine scripts like Buhid) into the standard, while the Japanese standards body sent a letter to the ISO working group asking for such new standards efforts to cease. Such accusations are insulting and provably inaccurate. --Prosfilaes 23:46, 11 Feb 2005 (UTC)
Thank you for saying so, Prosfilaes. Buhid was years ago, though. Recent scripts I worked on encoding include N'Ko, Vai, Balinese, Cuneiform, Phoenician and lots more. And the letter from JIS was also years ago, and JIS is now Secretariat for SC2, so they're hardly asking to stop the work from proceeding. Evertype 16:32, 12 February 2006 (UTC)[reply]
Excuse me. Do you know what a "double byte character set" is? Big5 (as well as GB, EUC-KR, EUC-JP, and Shift JIS) is a DBCS, and by the very nature of a DBCS, you can't encode a whole CJK ellipsis. We have to encode half of the ellipsis. Now when the Unicode committee looks at the CJK national character sets and decides that half a CJK ellipsis is equal to a full English ellipsis, that is incredible sloppiness. This is not a "thirty-year-old problem made by Chinese standards" in the context of Unicode.
If you have a problem, write up a disunification paper, give evidence, and submit it to the UTC or through your ISO National Body member in SC2. Badmouthing Unicode on this talk page will accomplish nothing. Evertype 16:32, 12 February 2006 (UTC)[reply]
And how do you want me to discuss the issue? When whatever I write will simply get deleted. 66.163.1.120 00:05, 12 Feb 2005 (UTC)
It's not incredible sloppiness. It's a unification decision that had some negative side effects. (And we could discuss the incredible sloppiness involved in assuming that every non-ASCII character was double-width, one that still sometimes plagues Russians who get the pleasure of dealing with double-width Cyrillic.) And I want you to discuss it here, on the talk page, instead of making changes on the main page, until some sort of consensus is reached. (And I'd really like a third party to chime in.) --Prosfilaes 01:32, 12 Feb 2005 (UTC)
I cannot understand why this is not sloppiness. The two are completely different. As I originally wrote, (1) they are different in form (the CJK ellipsis can be set on the baseline, or between the baseline and the ascender; the English ellipsis can only be set on the baseline) and (2) they are different in meaning (two "ideographic three dot leader"s, as some Japanese people think it should be called, are required to make one true ellipsis, the leader itself is meaningless; one "horizontal ellipsis" (U+2026) is meaningful by itself). The two cannot be unified no matter whether they consider unification to be based on form or on meaning.
Ok, you might argue that this only means they are unable to spot the differences. But they put so much effort into distinguishing between almost-indistinguishable variations in ideogram forms (many are really typographic stylistic variations that unfortunately came to be associated with different countries); not making a comparable effort to distinguish these two glyphs certainly sounds extremely strange. Even if they had checked the punctuation sections of a Chinese or Japanese dictionary they would have realized that the "ideographic three dot leader" is not itself a punctuation mark. And this has the added benefit that dictionaries usually set the ellipsis between the baseline and the ascender, so they would simultaneously realize that the two are different in form. In short, there is simply no basis for "unification": yet they got "unified". Aside from "incredible sloppiness" I really cannot explain this.
(I do accept that Unicode unifications are sometimes based on form, though I think this is contrary to the spirit of Unicode unifications. I personally don't like the CJK unification myself, and you won't understand why I feel this way until you try to work on a Unicode font yourself. But if you ask for my objections to unification decisions, I'll say the unification of the umlaut and the diaeresis really make no sense considering they dis-unify a lot of other things (I'm talking about western script, not CJK) that look 100% identical. In the case of the CJK vs English ellipses, form is not even a question, since they are different in form.)
I do agree with the double-width mess. For us the opposite problem occurs, that all the box-drawing characters become single-width, making Unicode almost useless in terminal emulators if box-drawing characters are to appear anywhere. --Wing 03:45, 12 Feb 2005 (UTC)
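The width problem described here is visible in Unicode's East_Asian_Width property, which Python's unicodedata module exposes (a side note, not from the discussion itself): the ellipsis, Cyrillic letters and box-drawing characters are all classed "ambiguous", so terminals have to guess.

```python
import unicodedata

# "A" = ambiguous width: rendered double-width in many CJK contexts,
# single-width elsewhere - the root of both complaints above.
assert unicodedata.east_asian_width('\u2026') == 'A'  # HORIZONTAL ELLIPSIS
assert unicodedata.east_asian_width('\u0410') == 'A'  # CYRILLIC CAPITAL LETTER A
assert unicodedata.east_asian_width('\u2514') == 'A'  # BOX DRAWINGS LIGHT UP AND RIGHT
assert unicodedata.east_asian_width('A') == 'Na'      # plain Latin is narrow
```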
First, I stand by my point: for 15 years, this unification has stood, and no one has complained to Unicode. For probably ten of those years, there would have been no problem disunifying the characters, yet not a single standards body made the request. If they were so completely inappropriately unified, there has been incredible sloppiness and apathy on the part of the users of the affected scripts.
You make too many assumptions about what I do and don't understand. I believe I understand the reasons why people disagree with CJK unification, and seriously doubt that making a font would make a bit of difference. The whole question is whether the difference is a difference in preferred fonts or a difference in script.
You are apparently a splitter. Besides the fundamental backward compatibility problems, I can't imagine trying to explain to the people at Distributed Proofreaders that coöperate uses a different ö from Köln. Splitting these would cause a world of pain to the advantage of a few librarians. In any case, the various opinions on when to split and when to unify are a much more general and interesting topic to add to the page. --Prosfilaes 00:31, 13 Feb 2005 (UTC)
Well, I think I am correct in assuming that you have never worked on a Unicode font. Before I attempted to work on a Unicode font some time ago, I thought just like you (being content with the state of the Han unification).
In the current state of the Han unification, there are many characters that are not unified. However, after adding a radical, the new characters are all unified.
If I want to make one Unicode font containing all the ideograms (not an unreasonable thing, since making such a font requires so much effort), which style should I choose? If adding the radicals would not make the new characters unified, I'd be all happy too (it would just mean that all variants are distinguished, as opposed to variants being not distinguished); as it is, no matter which style I choose, I end up with a font that is wrong.
Regarding the ellipsis itself, it is not a difference in font. Would you consider an ellipsis-like glyph that is raised above the baseline (to about x-height) suitable for typesetting English? From your viewpoint, this is exactly what unification of U+2026 and the hypothetical "ideographic three dot leader" means.
In a sense, the mis-unification of the ellipsis and the "ideographic three dot leader" can be thought of as equivalent to the problem of having full-width Cyrillic letters (in that both mistakenly equate a glyph that's only appropriate in C/J/K with an incompatible western glyph). If you find full-width Cyrillic letters unacceptable and "incredible sloppiness", I fail to understand why an ellipsis raised to x-height for English is acceptable or is not the result of sloppiness.
I would not object to your calling us having "incredible apathy" regarding Unicode. We have already acquired "incredible apathy" after using the suboptimal national character sets for so long; and many of our typesetting and/or punctuation conventions have been destroyed by Western-centric computing for so long (can you imagine that just about ten years ago even westerners knew that in C/J/K numbers should be grouped by myriads, but now many Chinese do not even know this, and instead group digits by thousands and then laboriously count the digits every time a large number is read… and many Chinese are so used to western-style underlining that they are now desensitized to the grammatical mistakes they make every time they underline Chinese words that are not proper names…) I definitely think that this is pathetic enough, and there is no need for Unicode to make this kind of mistake to further worsen the situation.
I am not saying that the knowledge of proper punctuation has not deteriorated in the West; but at least the deterioration has not been codified into an international standard (unless I count this ellipsis mis-unification)… --Wing 04:30, 13 Feb 2005 (UTC)
PS: Perhaps there is; other than this ellipsis thing, there is also this hyphen-dash confusion. It seems to be just as bad…
AFAICT the hyphen-dash issue comes from the fact that ASCII and other encodings of its era came from the days when characters on computers were fixed width. Given that, and the limited number of code values available in ASCII, it seemed totally reasonable to unify the hyphens, dashes and minus signs. There was also the unification of beta and sharp s in IBM code page 437. Plugwash 02:46, 1 Jun 2005 (UTC)

The year wikilinks in the revisions list are a little confusing; I clicked through thinking I was going to be led to that particular revision, but found myself on a general-year page. Could you reconsider these links please? Thanks. Courtland

Good luck in changing this deeply encrusted policy of irrelevantly wikilinking each and every year number.--84.188.146.200 02:58, 9 February 2006 (UTC)[reply]
There has recently been some movement away from this, with some people doing mass unlinking of years. But regardless of that, the years can be linked to something else, the people who link every year don't seem to mind where the link goes, just that the years are nice and blue. Qutezuce 03:20, 9 February 2006 (UTC)[reply]
Is there a policy on year-linking, or a place where this is being discussed? I can see arguments on both sides of the fence - especially in historical articles, I think it's fun to click on years and find "what else happened then". But since the linking of years is done so much, it seems wise to include more text in the link than the year if you want to link something else. --Alvestrand 04:02, 9 February 2006 (UTC)[reply]
The issue is talked about on Wikipedia:Only make links that are relevant to the context. Qutezuce 04:16, 9 February 2006 (UTC)[reply]

Unicode adoption in e-mail

The adoption of Unicode in e-mail has been very slow. Most East-Asian text is still encoded in a local encoding such as ISO-2022-JP, and many commonly used e-mail programs still cannot handle Unicode data correctly. This situation is not expected to change in the foreseeable future.

This doesn't look like an accurate picture to me. Mac OS X's default Mail.app client has transparently supported Unicode since 2001. Didn't Windows 95's Internet Mail and News or Outlook Express have Unicode support even earlier? I don't know how widely used Unicode is, but hasn't it been very widely supported for years? Michael Z. 2005-04-12 21:20 Z

Keep in mind that the fact that some programs support Unicode does not mean they can handle text encoded in Unicode correctly. The situation may have changed since then, but I used to hear that you should not send mail in Unicode because many programs have problems with it. You see, I heard a report that even Gmail does not correctly handle the subject of e-mails. More research would certainly help, but I don't think the above is far from reality. -- Taku 02:35, Apr 13, 2005 (UTC)
The situation is changing all the time - for one thing, Outlook now seems to have debugged most of its Unicode support. I've been sending email in UTF-8 for a year or so, and very few people report problems with it. Of course, it helps that most of my mail is in English, and the rest is in Norwegian.... some systems handle UTF-8 OK, but only if the output is within the Latin1 charset, for instance.
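For what it's worth, the sending side is mechanical: MIME "encoded words" (RFC 2047) let a client label a UTF-8 subject so older software can pass it through unmangled. A sketch using Python's standard email library (the sample subject is made up):

```python
from email.header import Header

# Wrap a non-ASCII subject line in an RFC 2047 encoded word, tagged
# with the utf-8 charset, for transport through 7-bit mail software.
subject = Header('Møte på kontoret', charset='utf-8').encode()
print(subject)  # something like '=?utf-8?q?M=C3=B8te_p=C3=A5_kontoret?='
assert subject.startswith('=?utf-8?')
```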

Input methods

On Windows XP, any Unicode character can be input by pressing Alt, then, with Alt down (and using only the numeric keypad keys), pressing the decimal digits of the Unicode characters one after the other. For example, Alt, then, with Alt still down, 9, then 6 and then 0 yields π (Greek lowercase letter Pi). For values less than 256, precede the digits with a 0, to avoid code page translation (see Extended ASCII), e.g. Alt 0, 1, 6, 5 yields ¥.

This just doesn't work when I try it. Pressing Alt-9-6-0 gives me └, which appears to be "Box Drawings Light Up And Right", character x2514/9,492 (└). However, Alt-0-x-x-x does work for me and always has (I can get the yen symbol fine). Does this statement need correction or clarification? —Simetrical (talk) 01:57, 8 May 2005 (UTC)[reply]

Forgot to mention, I do use Windows XP, English-language SP 2 to be precise. —Simetrical (talk) 02:31, 8 May 2005 (UTC)[reply]

I use WinXP, Spanish-language SP2, and it does not work for me, either. Nor does it work for anyone I know who uses WinXP. By the way, the character '└' can also be obtained by pressing Alt+192 - moreover, I have found that under WinXP, Alt+number produces the same output as Alt+number modulo 256 (provided that any zeroes before the original number are preserved). So, Alt+289 produces '!', Alt+416 produces 'á', and Alt+0416 produces ' ', the non-breaking space.
I think that paragraph should be removed. --Fibonacci 21:53, 21 May 2005 (UTC)[reply]
It seems to depend on the edit control in use. Stuff that uses the standard EDIT control (e.g. Notepad) doesn't allow Unicode entry with Alt+numpad, whereas stuff that uses the standard RICHEDIT control (e.g. WordPad) does (tested on English WinXP non-SP2; not sure if it's original or SP1). Plugwash 22:37, 21 May 2005 (UTC)[reply]
The way I understand it, a four-digit or longer number enters the Unicode character. A three-digit number under 256 enters the character in the current code page, which I suppose would be Win CP-1252 for English and some European languages (don't know if that includes Spanish). It appears that three-digit numbers over 255 are processed with some funky math (Shouldn't numbers over 255 be Unicode? Can anyone think of a reason for using modulo-256 except programmer laziness?). Michael Z. 2005-05-25 17:45 Z
NO NO NO.
In apps that use the Windows EDIT control (i.e. Notepad) you CANNOT enter Unicode with Alt+numpad (unless the app makes special provisions, which some apps seem to do), and numbers entered with Alt+numpad are treated modulo 256 regardless of length.
In apps that use the Windows RICHEDIT control, numbers over 256 and all numbers 4 digits or more are Unicode (for numbers like 052 the local code page matches Unicode anyway, so it's impossible to really tell).
Other apps that set up their own edit controls may behave differently again. Plugwash 18:40, 25 May 2005 (UTC)[reply]
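The legacy behaviour described in this thread (modulo 256, OEM code page without a leading zero, ANSI code page with one) can be sketched in Python; the specific code pages are an assumption for a US-English system:

```python
# Sketch of Alt+numpad in a plain EDIT control, as described above: the
# typed number is reduced modulo 256, then decoded via the OEM code page
# (CP437 here) - or the ANSI code page (CP1252) if it starts with a zero.
def alt_numpad(digits: str, oem: str = 'cp437', ansi: str = 'cp1252') -> str:
    codepage = ansi if digits.startswith('0') else oem
    return bytes([int(digits, 10) % 256]).decode(codepage)

assert alt_numpad('960') == '└'      # 960 % 256 == 192, not U+03C0 π
assert alt_numpad('289') == '!'      # 289 % 256 == 33
assert alt_numpad('416') == 'á'      # 416 % 256 == 160 in CP437
assert alt_numpad('0416') == '\xa0'  # CP1252 160 is NO-BREAK SPACE
assert alt_numpad('0165') == '¥'     # the Alt+0165 example from the article
```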
In Windows (at least versions XP, 2000, 2003) you need to have in your registry at HKEY_Current_User/Control Panel/Input Method, the value EnableHexNumpad set to "1". If you have to add it, set the type to be REG_SZ. WARNING: Don't mess with the windows registry unless you know what you are doing. The only problem is, how do you add a hex number like 39A from your numpad? The numpad doesn't include A-F and keying 'A' (for example) invokes a menu entry since the ALT key is pressed. Any ideas on this one? EGT. 15:25 12 July 2005 (GMT+2)
You have to remember that Unicode code points are in hex, so to enter a Unicode character with the number pad, use the decimal value for the hex value. In the case of the Unicode value 39A, enter the decimal value 0922. Mr. McCoy 11:52 Jan 12 2006 (GMT -8)
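That conversion is plain hex-to-decimal, shown here for EGT's example U+039A:

```python
# Hex 39A is decimal 922, so the decimal Alt code is Alt+0922.
cp = int('39A', 16)
assert cp == 922
assert chr(cp) == '\u039a'   # GREEK CAPITAL LETTER KAPPA
print(f'Alt+{cp:04d}')       # Alt+0922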

Another form to enter Unicode characters is Alt + PLUS SIGN + HEX CODE. It worked for me on Windows XP. --ʀʇʉʀɵ 23:57, 25 January 2007 (UTC)[reply]

The article says: "it is possible to create Unicode characters by pressing Alt + PLUS + #, where # represents the hexadecimal code point up to FFFF; for example, Alt + PLUS + F1 will produce the Unicode character ñ." but I can't get this to work on Win2000. In MS Word, I just get the latest from the Recent Files opened. Does "+" mean the keys should be pressed sequentially or all together?

Nifty resource.

I found, at some point, a nifty resource for Unicode at fileformat.info. It has some rather decent tools for looking up individual codepoints, like U+0023 or U+20AC. Each page includes a browser test and font support info. Perhaps it would be useful to link U+F00F the same way we link PMID, ISBN and RFC IDs now. grendel|khan 16:50, 2005 May 25 (UTC)

-1. Not as long as they keep those ads running. An idea more in line with the Wikipedia ethos would be to link to the Wiktionary entries, like . However, this cannot be done consistently without some manual intervention for certain intercepted characters like "+", "]", etc., and the Wiktionary entries for things like € and Latin letters are not very exciting, if present at all. — mjb 1 July 2005 02:57 (UTC)
+1. I didn't realize there was a policy against linking to sites with ads. If there is enough interest, I'm sure we can find a way to get rid of the ads. Please let me know! Andrew M, FileFormat.Info author. 4 July 2005
There isn't such a policy, but of course we prefer resources with more usefulness and less advertising. The fileformat.info site seems pretty good. I would put it in right after the Letter Database, which seems to offer similar functionality without ads. Michael Z. 2005-07-5 04:21 Z
Of course there's no set policy, but the examples so far are leaning in that direction, and I'm sure I'm not the only one who prefers it that way. Traffic from Wikipedia and the sites that mirror its content would be a windfall for an ad-supported site; we should be very careful who we choose to "support" in this way. Rather than favoring one particular info/library/retail source for books, the automatic links on ISBN numbers go to a generated portal. Automatic RFC links to faqs.org are fairly innocuous, as well; RFCs are static documents and all that is done at faqs.org is some minor reformatting and hyperlinking. I think if faqs.org were to be using frames and ads like zvon.org, people would not be so happy about it, and would be more likely to favor linking directly to the plain text documents in the original IETF repository. So for character information, I want to see something equally neutral and encyclopedic. fileformat.info is good, but not thorough or encyclopedic enough, even without ads. — mjb 5 July 2005 06:44 (UTC)

I think there should be a lot more discussion about a character-linking strategy. When it comes to character information, the information sources Wikipedia, Wiktionary, fileformat.info, and the Letter Database are all great in their own way, but none of them are complete. For some characters, some sources are better than others. Wikipedia has great info on Latin script punctuation, Wiktionary has great entries for East Asian characters, fileformat.info has a lot more character set info than the Letter Database, and the Letter Database has some unique properties of its own, like language data. Compare the entry for 京 at Wiktionary, fileformat.info, and the Letter Database (ouch!).

Another complication is that what we call "characters" are actually a codified abstraction of graphemes and constructs of similar utility (control codes, zero-width joiners, and such); how might this affect what kind of information we want to link to? Take the Latin script for example: it has one hyphen grapheme, but Unicode has codified it as a half-dozen characters in order to accommodate different rendering behaviors, languages and legacy encodings. And East Asian scripts have other complications, as noted in Han unification. For example, decisions were made that can result in one logogram appearing at multiple code points depending on purpose, and similar logograms appearing at one code point but requiring sometimes substantially different renderings depending on language. So far, none of the info sources take any of this into account, although some of the cryptic Han data in fileformat.info might be indicative of a few of these properties, I'm not sure. In any case, I'd question whether it is sufficient to only provide "character" data when grapheme info may be useful more often, depending on whether the researcher is coming at it from a lay person / linguist's point of view, or from a programmer / computer professional's point of view. I suggest developing some kind of meta-article, along the line of the ISBN pages. — mjb 5 July 2005 06:44 (UTC)
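The hyphen example above can be made concrete: Unicode codifies the single Latin hyphen grapheme as several code points, which Python's unicodedata will happily enumerate (an illustration, not from the original comment):

```python
import unicodedata

# One grapheme, many code points: the Latin hyphen/dash family.
for ch in '\u002d\u2010\u2011\u2012\u2013\u2014\u2212':
    print(f'U+{ord(ch):04X}  {unicodedata.name(ch)}')

assert unicodedata.name('\u2010') == 'HYPHEN'
assert unicodedata.name('\u2212') == 'MINUS SIGN'
```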

There are two very different approaches to resource construction: manual (Wiktionary) and automatic (FileFormat.Info). The manual sources will always have more in-depth info, but the automatic ones will have better coverage (and, if both the site and data source are maintained, be more up to date). An automatic site is much easier to link to. The Unicode information at FileFormat.Info is from the Unicode Character Database, the Java run-time, and the dotNet runtime. While I made a spot for per-character custom information, there isn't any, since I'm not any sort of authority. I could add Wiktionary links if I could figure out a standard URL (or a standard and a list of exceptions). Note: I'm definitely link-whoring, but not because I'm hoping for a windfall: Unicode searchers don't seem to be worth much in the advertising world. I'm willing to give up the ads. Andrew M. 5 July 2005

Unicode 4.1.0

Can someone give me a link so that I can download Unicode 4.1.0 for free? JarlaxleArtemis 00:14, May 27, 2005 (UTC)

http://www.unicode.orgMonedula 05:56, 27 May 2005 (UTC)[reply]


Unicode critique

There is a link in the external links section, The strongest denunciation of Unicode, and a response to it, which leads to a rather old paper in this context (2001); isn't there something more recent? Otherwise I propose that these issues be treated from today's viewpoint within the article rather than giving this link. Hirzel 5 July 2005 10:14 (UTC)

I don't think we are supposed to critique the paper. It is equally important to discuss an old paper as a new one. -- Taku July 5, 2005 10:30 (UTC)
Well, the article discusses the topic why Unicode 3.1 does not solve the problems of far east scripts. In the meantime we have Unicode 4.1 which added a lot of additional characters. I would like to know how the situation is today and not how somebody evaluated the situation 4 years ago. Hirzel 6 July 2005 20:45 (UTC)
I think the article(s) should be mentioned, mainly because implementations of Unicode lag way behind the latest published versions. Unicode 2.1 support is still fairly common. Feel free to add some kind of qualifier/disclaimer that says that some or all of the issues raised were addressed in Unicode 4.1 or whatever. — mjb 6 July 2005 22:19 (UTC)
It's hard to say which of the issues were not addressed by the time the article was released. --Prosfilaes 6 July 2005 23:07 (UTC)
Be vague, then. "Some, perhaps many, of the issues raised in this article were addressed in later versions of The Unicode Standard." — mjb 7 July 2005 06:07 (UTC)
It is interesting to see the problems there were in 2001. However, I have a feeling the link should be in a completely different section, "Unicode history" or something like that. The article is 15 only partially interesting pages, and its main point - that Unicode has no more than 65,000 characters - is now obsolete. Mlewan 14:32, 16 May 2006 (UTC)[reply]
When it says that "being a 16-bit character definition allowing a theoretical total of over 65,000 characters", it was wrong when it was written. Pretty much everything it says was wrong when it was written.--Prosfilaes 19:06, 16 May 2006 (UTC)[reply]
Prosfilaes, I'm missing something here. You say that pretty much everything it says was wrong, and yet you reverted my change to the link title. I could have phrased it better, I know that, but I'd really like to get rid of the word "strong". An obsolete and erroneous article cannot possibly be strong, can it? Mlewan 06:44, 17 May 2006 (UTC)[reply]
It can be a "strong denunciation"; that just means it's harshly critical, not that it's a strong argument.--Prosfilaes 07:21, 17 May 2006 (UTC)[reply]
Yes, but the word is ambiguous. It also has the meaning "convincing". If you had written "violent", "vitriolic" or "rabid", it would have been clear. I'd like to spare other people from what I did: reading through all 15 pages looking for the "strong" argument without finding anything valid at all. Mlewan 07:48, 17 May 2006 (UTC)[reply]

Having some trouble

Redirect to Talk:Code2000

Inka hieroglyphics?

Umm, wasn't it that the Inkas had no forms of writing (apart from the k'ipu) (unsigned comment by 200.119.238.115)

I've removed it; it was added by an anon who did not give any justification in the edit summary, and there was no mention of hieroglyphics in the Inca article. Plugwash 20:26, 12 August 2005 (UTC)[reply]
It's not mentioned on the Unicode roadmaps. --Prosfilaes 23:03, 12 August 2005 (UTC)[reply]

A little clarification about Tolkien's scripts and Klingon?

I don't mean to be a spoilsport, but these bits just don't seem to fit in _at_ all. I was reading through it just then and I thought an anonymous user must have added it in for a laugh. I think a rewording's in order, but perhaps it's just me. I definitely don't think it deserves quite as much as has been written about it, though. :-/ Someone care to back me up here? I'm not too sure of myself. Edit: Under the 'Development' Category --Techtoucian 10:16, 07 Jan 2005 (UTC)

I think they fit, if only because they show how the Unicode consortium actually considers scripts which to some seem no more than a 'laughing matter' -- certainly Tengwar and Cirth see more actual use than some of the scripts which are already encoded.

-- Jordi· 12:41, 7 Jan 2005 (UTC)

They belong. Tengwar and Cirth will certainly be encoded one day. Klingon is famous for being rejected for good reasons. ;-) Evertype 16:33, 12 February 2006 (UTC)[reply]

U+xxxxxx notation

"When writing about Unicode code points it is normal to use the form U+xxxx or U+xxxxxx where xxxx or xxxxxx is the hexadecimal code point." Could there be an explanation about why there are two notations? --132.203.92.116 16:32, 16 September 2005 (UTC)[reply]

There is just one notation, but as most characters are in the BMP, the leading zeroes can be left out. U+1234 (ሴ) is the same as U+001234.

-- Jordi· 17:05, 16 September 2005 (UTC)[reply]

But U+48 is an invalid notation for U+0048 --Pjacobi 17:10, 16 September 2005 (UTC)[reply]
Normally 4 digits are used for stuff in the BMP and 6 digits for everything else. I'm not sure if there is a rule forbidding other lengths, but I've never seen them used. Plugwash 17:23, 16 September 2005 (UTC)[reply]
The rules in the Unicode standard defining this format declare that no fewer than 4 digits are to be used. I've seen 5 digits all the time.--Prosfilaes 23:21, 16 September 2005 (UTC)[reply]
The rules for U+ notation have changed a few times. Current rules are U+xxxx, U+xxxxx, or U+xxxxxx, as appropriate, to denote a character by its code point. In older version of Unicode, code points and code values were treated separately using U- and U+ notation, respectively. The number of digits required varied. Unicode 3.0 said U- notation always had to have 8 digits, and U+ always had to have 4. That's obsolete now though. — mjb 23:57, 16 September 2005 (UTC)[reply]
I've just rewritten that piece to reflect some of what has been discussed in this section. It could probably still do with more improvement though, and possibly even making a section of its own. Plugwash 01:01, 17 September 2005 (UTC)[reply]
I've never seen 6 digits with a trailing 0.--Prosfilaes 03:06, 17 September 2005 (UTC)[reply]
You mean "leading 0", don't you? Leading zeros are not used, unless there would otherwise be fewer than four digits. See section 0.3 ("Notational Conventions") of the preface to the Unicode Standard. --Zundark 09:33, 17 September 2005 (UTC)[reply]
It seems I misremembered, probably because most of what I have seen referring to stuff outside the BMP is stuff relating to the limits of Unicode (hence 6 digits). I've updated the article to reflect the fact that 5 digits are indeed used. Plugwash 14:58, 17 September 2005 (UTC)[reply]
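As an aside, the convention discussed above (at least four hex digits, leading zeros dropped beyond that minimum) is easy to express in code. A minimal Python sketch; the function name is made up for illustration:

```python
def u_notation(cp: int) -> str:
    """Format a code point in U+ notation: at least four hex digits,
    with leading zeros used only to pad up to that minimum."""
    return f"U+{cp:04X}"

print(u_notation(0x48))     # U+0048 (padded to four digits, never "U+48")
print(u_notation(0x1234))   # U+1234
print(u_notation(0x1D6A5))  # U+1D6A5 (five digits, outside the BMP)
```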

Old talk

AAT

Can someone who knows who AAT are add to the AAT disambiguation page appropriately and also send the link on this page to the right place? Thanks EddEdmondson 08:59, 19 Jun 2004 (UTC)

Done. (AAT is Apple Advanced Typography, but we have no article on it at present.) --Zundark 09:40, 19 Jun 2004 (UTC)

Revision history has a future date

please justify this. If it is not justified within a few days i will be reverting Plugwash 12:15, 25 Dec 2004 (UTC)

I've reverted it already. Future dates are never justified for this sort of thing, because schedules can change. --Zundark 12:39, 25 Dec 2004 (UTC)

VISCII

Someone claimed higher up this talk page, near the start of a long section, that VISCII had more precomposed characters than Unicode. Our page on VISCII disagrees; would anyone like to clarify? Plugwash 01:55, 18 September 2005 (UTC)[reply]

Of course not. VISCII only contains characters for Vietnamese; Unicode covers so many more languages and scripts. Vietnamese may have a complex alphabet, but it certainly doesn't need more precomposed characters than all the other modern languages combined. – Minh Nguyễn (talk, contribs) 06:32, 27 January 2006 (UTC)[reply]
Besides, I think VISCII was taken as an input character set to Unicode, so they added all the VISCII compositions as precomposed characters. I can't see any other reason to make LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE (U+1EB2) a precomposed character... --Alvestrand 07:12, 27 January 2006 (UTC)[reply]

APL?

Does the link in the first paragraph really refer to APL -- the programming language? Is this a typo; meant to be IPA? (International Phonetic Alphabet)

--Pkirlin

I think it's correct. IPA is already covered by the term linguistic. If you take a look at the article, you'll see that APL requires a special set of symbols. Michael Z. 2005-10-9 23:07 Z

"Endianness is a made-up word"?

What computer jargon is not made-up? And I have never heard of "significant byte order"; if I didn't see it used with "endianness" in the log message, I wouldn't even know what "significant byte order" is. "Endianness" at least is comprehensible. (PS: In fact, if whoever did the edit tried to even do a quick google, he/she would know that the use of the word "endianness" is very widespread and does not qualify as a "made-up" word.)—Wing 08:15, 22 October 2005 (UTC)[reply]

I've also seen "byte sex" for this distinction, but by now "endianness" is most common. --Pjacobi 09:15, 22 October 2005 (UTC)[reply]
I remember now… what he/she meant is probably byte order (“MSB first” and “LSB first” back when I first saw it), which I have indeed seen and used way before I ever saw “endianness”.—Wing 15:57, 22 October 2005 (UTC)[reply]
Agree "byte order" is the normal term; I don't think I've ever seen it written "significant byte order" though. Plugwash 18:53, 22 October 2005 (UTC)[reply]
Google is not a judge of the proper use of words/phrases. Though, going by your argument, "byte order" returns ~900,000 pages and "endianness" returns ~300,000 ("significant byte order" returns 347 and so probably shouldn't be used). And for reference, "significant byte order" is a phrase, not a word (regarding edit comment). If you do not know what "byte order" or "significant byte order" mean then you do not understand "endianness".—Kbolino 23:16, 4 November 2005 (UTC)[reply]
a search for "endian" claims to return 4.68 million pages. The world is an inconsistent place.... --Alvestrand 07:18, 27 January 2006 (UTC)[reply]

Linguistics Connection

Is there a connection between Unicode and phonetics (such as the IPA)? With one symbol for each sound, the IPA's approach is similar to Unicode's philosophy. Maybe the next thing would be to have all languages standardized with a sound per symbol in Unicode.

No. --Prosfilaes 00:54, 15 December 2005 (UTC)[reply]
Unicode does not standardize sound - many characters in Unicode have completely different sounds in different languages (Japanese/Chinese, for instance). However, IPA characters have code positions in Unicode. --Alvestrand 07:19, 27 January 2006 (UTC)[reply]

firefox and hex boxes

"will display a box showing the hexadecimal": maybe mention that firefox does this too...

Technical note: Due to technical limitations, some web browsers may not display some special characters in this article.

It would be good to have a link from technical reasons giving a brief explanation of the problem and what to do about it. For example, somewhere on Wiki there is a page suggesting some downloadable fonts for WinXP that are more complete than Arial Unicode MS. --Red King 00:44, 18 February 2006 (UTC)[reply]

Compose key in X Window

Hi,

it should be mentioned in the section about different input methods for Unicode characters that all X Window applications (including Gnome and KDE, but not only them) support using a Compose key. And it should be added as well that any key (e.g., the hated CapsLock) can be redefined as the Compose key if the keyboard does not contain one natively.

Ceplm 15:15, 21 February 2006 (UTC)[reply]

Digraphs

Why has Unicode no code point for the digraph "ch"? --84.61.38.22 15:43, 3 March 2006 (UTC)[reply]

Because it can be written ch, and adding a code point for ch would cause a lot of confusion and be a Bad Thing. --Prosfilaes 17:22, 3 March 2006 (UTC)[reply]

Why has Unicode no code points for digraphs assigned? --84.61.50.36 11:02, 6 March 2006 (UTC)[reply]

Afaict (with a few exceptions that are there either for historical reasons or because a digraph is actually considered a separate letter in some language) digraphs are not encoded because they are merely presentational artifacts that don't add any meaning to the text. --Plugwash 11:33, 6 March 2006 (UTC)[reply]

Can a code point in the private use area of Unicode become assigned for a digraph? --84.61.50.3 09:22, 7 March 2006 (UTC)[reply]

Of course. You can assign any code point in the private use area for anything you like. Whether it's sensible to do so is another matter. --Zundark 15:01, 7 March 2006 (UTC)[reply]
Code points in the private use area are not and never will be officially assigned to anything. You can use them for whatever you like in your own fonts, but you probably won't persuade many others to follow your assignments. --Plugwash 16:17, 7 March 2006 (UTC)[reply]
The private use area is guaranteed not to be assigned *by Unicode* to anything. When you use codes in the private use area, you have absolutely no protection against someone else using the same codepoint for something completely different. That's what "private use" means, after all. --Alvestrand 16:24, 7 March 2006 (UTC)[reply]
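The guarantee described above is visible in the character property data itself. A small Python illustration (U+E000 is just an arbitrary pick from the BMP Private Use Area, not anyone's actual assignment):

```python
import unicodedata

pua_char = "\uE000"  # first code point of the BMP Private Use Area

# General category "Co" means "Other, private use": reserved forever,
# never assigned a character by Unicode itself.
print(unicodedata.category(pua_char))  # Co

# Consequently there is no standard character name to look up.
try:
    unicodedata.name(pua_char)
except ValueError:
    print("no name assigned")
```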

Why has Unicode no code point for the Apple logo? --84.61.31.185 10:55, 4 March 2006 (UTC)[reply]

Because it doesn't cover corporate logos. Apple has assigned a code point for its logo in the private use area of Unicode. --Zundark 12:48, 4 March 2006 (UTC)[reply]

Greek and Cyrillic

Why has Unicode separate code points for Greek and Cyrillic? --84.61.48.129 10:45, 12 March 2006 (UTC)[reply]

Because they were encoded based on different sources, and never made a problem of size. The only real unification done in Unicode is the Han. --Alvestrand 13:25, 12 March 2006 (UTC)[reply]

Circuit components

Why has Unicode no code points for circuit components assigned? --84.61.37.190 14:29, 18 March 2006 (UTC)[reply]

Circuit components are not used in natural languages. Also, circuit component symbols, as far as I know, never existed in any historical character encodings (at least, none that the Unicode committees were interested in preserving compatibility with). —mjb 08:22, 19 March 2006 (UTC)[reply]
Otoh the musical symbols got in, and from what I remember of the rationale (I can't be bothered finding it right now) it could be pretty much word for word applied to electronic symbols. I guess it's just a case of no one being bothered to write up a proposal. Plugwash 12:18, 19 March 2006 (UTC)[reply]
I think I found one of the original docs: [3] - it has a section with examples of symbols used as text. I suspect the main reason for its encoding was to get someone (anyone!) to approve a canonical list of names for the symbols, though.... --Alvestrand 12:59, 19 March 2006 (UTC)[reply]


Number of bits

"Depending on the variant used, Unicode requires either 16 or 20 bits for each character represented." is completely wrong, and part of the reason I'm unhappy with these rewrites. Unicode never uses 20 bits for each character. In theory, it takes 20.1 bits to encode a Unicode character. In practice, the only constant length scheme uses 32 bits for each character.

Likewise, Unicode is not designed to encode all "human" languages. It's designed to encode all languages; it just happens that humans are the only group we know of that has a written language.--Prosfilaes 00:03, 25 March 2006 (UTC)[reply]
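For what it's worth, the "20.1 bits" figure above follows directly from the size of the code space; a quick arithmetic check in Python (nothing Unicode-specific beyond the 0x110000 constant):

```python
import math

# Code points run from U+0000 to U+10FFFF, so the code space
# holds 0x110000 = 1,114,112 values.
code_space = 0x110000
bits = math.log2(code_space)

print(round(bits, 2))   # 20.09, i.e. "just over 20 bits" in theory
print(math.ceil(bits))  # 21 whole bits for a fixed-width binary encoding
```

In practice, as noted above, the only constant-length encoding form (UTF-32) rounds this up to 32 bits per character.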

wtf?

what the crap, Unicode 5.0 was supposed to be released some time after February 2006, but it's been a while and it's still not out. Possibly they're running behind schedule? 66.169.1.14 20:03, 31 March 2006 (UTC)[reply]

It's coming soon... they're wrapping up the beta about now. Sukh | ਸੁਖ | Talk 20:57, 31 March 2006 (UTC)[reply]
thanks for the answer :) 66.169.1.14 23:17, 14 April 2006 (UTC)[reply]

Bold and italic accented letters

Why has Unicode no code points for bold or italic accented letters assigned? --84.61.56.17 09:58, 21 April 2006 (UTC)[reply]

The principle followed in Unicode is that of encoding abstract characters. Properties like weight and slant are considered to be details of the same sort as variations among typefaces, and so are not in general encoded. The exception is that there are bold and italic versions of some letters used as mathematical symbols, because in mathematics a different weight or slant may distinguish one character from another. Bill 10:17, 21 April 2006 (UTC)[reply]

Language of Han characters

Why has Unicode not separate code points for Japanese, Korean, simplified Chinese, and traditional Chinese Han characters? --84.61.56.107 10:49, 21 April 2006 (UTC)[reply]

See han unification; this is probably one of the most controversial decisions Unicode made. I suspect the main reason was the sheer number of characters needed (remember Unicode used to be 16-bit fixed width). Plugwash 11:02, 21 April 2006 (UTC)[reply]

Why has Unicode no code point for g-tilde? --84.61.62.20 12:35, 21 April 2006 (UTC)[reply]

From the Unicode principles:
Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposible character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin 1, which includes many precomposed characters such as "ü" and "ñ".
I guess no established standard encoded g-tilde, so g + combining tilde is The Way. --Alvestrand 19:15, 21 April 2006 (UTC)[reply]
What language uses g-tilde anyway? Afaict the only reason precomposed characters are there at all was to allow applications to transition to Unicode in stages (e.g. adopt Unicode as their text storage format without having to support combining characters). Plugwash 00:11, 22 April 2006 (UTC)[reply]
The clue was in the heading - Guaraní, widely spoken in Paraguay, uses g-tilde. AFAIK, it's the only language that does. All the same, it strikes me as an omission in Unicode, since there are all manner of precomposed characters included which are used only in languages with considerably fewer speakers than Guaraní has. Sure, one can always use a combining tilde, but that isn't going to be an easy thing to explain if one happens to be trying to teach IT in Paraguay (and what's more, I haven't yet found a font where it looks right on the upper case G). AndyofKent 16:31, 12 August 2006 (UTC)[reply]
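The precomposed-versus-combining distinction discussed above can be demonstrated with Python's standard `unicodedata` module. A sketch, not part of the original discussion:

```python
import unicodedata

# "ü" has a precomposed code point, so the two spellings are
# canonically equivalent: normalization maps one onto the other.
precomposed = "\u00FC"   # ü as a single code point
decomposed = "u\u0308"   # u + COMBINING DIAERESIS
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# g-tilde (as in Guaraní) has no precomposed form, so even NFC
# leaves it as a two-code-point combining sequence.
g_tilde = "g\u0303"      # g + COMBINING TILDE
assert unicodedata.normalize("NFC", g_tilde) == g_tilde
print(len(g_tilde))      # 2 code points for one user-perceived letter
```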

Hexadecimal Digits

Why has Unicode not separate code points for the Latin letters A-F and the hexadecimal digits A-F? --84.61.62.67 09:25, 22 April 2006 (UTC)[reply]

This is not the right place for such questions. Go ask on the Unicode mailing list or some other place for asking questions about Unicode. This is about writing an article.--Prosfilaes 12:49, 22 April 2006 (UTC)[reply]
And please don't delete other people's comments on talk pages.--Prosfilaes 17:47, 22 April 2006 (UTC)[reply]
Before asking questions anywhere about why such and such is not in Unicode, please read Where is my character? and Combining mark FAQ on the Unicode web site. Bill 19:49, 23 April 2006 (UTC)[reply]
It's simple: because no existing codepage had them separated. Unicode was designed to be a superset of all existing codepages: so, if some characters were separate in any one codepage, then they were separate in Unicode too. And if some characters were separate in no codepage, then, most often, they were not separate in Unicode, too. — Monedula 11:55, 24 April 2006 (UTC)[reply]

Web

I see some problems with the text in this section:

Web browsers have been supporting severals UTFs, especially UTF-8, for many years now.

That seems to mix up Unicode encodings with Unicode as a document character set.

Display problems result primerally from font related issues.

How is that different to any other use, such as word processing or email?

In particular Internet Explorer doesn't render many code points unless it is explicitly told to use a font that contains them.

Needs to say which version of IE, on which platform, and it seems to be incorrect anyway if referring to Windows IE6 or IE7

All W3C recommendations are using Unicode as their document character set, the encoding being variable, ever since HTML 4.0.

In fact Unicode has been the document character set ever since HTML 2.0.

Although syntax rules may affect the order in which characters are allowed to appear, both HTML 4 and XML (including XHTML) documents, by definition, comprise characters from most of the Unicode code points, with the exception of:

This varies between XML1.0 and 1.1

  • any code point above 10FFFF.

There are no Unicode code points above 10FFFF by definition. --Nantonos 18:10, 27 April 2006 (UTC)[reply]

Unicode has been the document character set ever since HTML 2.0. — that is completely incorrect. In HTML 2.0 and 3.2, it was ISO 646:1983 IRV (ASCII, basically) minus all control codes except tab, LF, and CR, plus ISO IR-001/ECMA 94 minus control codes (ISO/IEC 8859-1, basically).[4][5]. In HTML 4.0 the document character set is Unicode's entire range up to 10FFFF, minus the surrogate range and the same control characters that were disallowed in the previous versions. XML 1.0 doesn't use the term "document character set" but it does have the same concept: regardless of how the document is encoded (UTF-8, ASCII, whatever), it consists only of characters from a limited repertoire that is almost, but not quite the same as in HTML 4. This info is in the XML specs but really is not relevant to this article.—mjb 01:17, 1 May 2006 (UTC)[reply]

Alt+0### is not a Unicode input method

The "###" in "Alt+0###" is not a Unicode code point and therefore this is not a Unicode input method. On my US Win2K system it appears to be giving me Windows-1252 characters. For example, Alt+0128 gives me the Euro symbol, and anything above 255 is entered as if I had typed it modulo 256 (e.g., 256=000, 257=001, and so on). This misinformation about it being "Unicode" is repeated on Alt codes as well. Please research the actual behavior and fix the articles. Thanks!—mjb 00:48, 1 May 2006 (UTC)[reply]

The answer is that it depends on the app, and many people are confused about it. I've corrected it many times but errors have always been introduced.
In a plain Windows edit box with no special provisions made (as in Notepad), Alt+ is ANSI only.
In a plain Windows richedit box, and in many apps that make their own provisions (including MS Word and by the looks of things Firefox), a leading 0 or a code of more than 3 digits makes it enter Unicode. Plugwash 00:34, 2 May 2006 (UTC)
Actually, testing now, it seems even more complex than I thought. I'll need to have a better look into this and see if I can figure out the patterns. Plugwash 00:41, 2 May 2006 (UTC)
Thanks. I know it's not simple, and is poorly documented, which is why I didn't research it myself (we all pick our 'battles' around here)…—mjb 23:33, 3 May 2006 (UTC)[reply]
See if http://www.fileformat.info/tip/microsoft/enter_unicode.htm is of any help.—mjb 17:26, 13 May 2006 (UTC)[reply]

Fullwidth accented letters

Why has Unicode no code points for fullwidth accented letters assigned? --84.61.45.177 14:50, 3 May 2006 (UTC)[reply]

Please don't feed the troll.--Prosfilaes 17:49, 3 May 2006 (UTC)[reply]
Fullwidth forms exist only for compatibility with legacy encodings. Therefore they only contain characters which existed in such encodings. Plugwash 18:27, 3 May 2006 (UTC)[reply]
As Prosfilaes says, please don't feed this troll. This person asks over and over again questions that are not appropriate for this talk page, to which he or she could easily find the answers on the Unicode Consortium web site, to which he or she has been referred. Please just ignore these questions. Bill 18:36, 3 May 2006 (UTC)[reply]

Latin small letter dotless J

Is there a code point in Unicode for "Latin small letter dotless J" assigned? --84.61.54.142 10:16, 13 May 2006 (UTC)[reply]

Yes, it's U+0237. There's also an italic version for mathematics: U+1D6A5.—mjb 17:25, 13 May 2006 (UTC)[reply]
Note that this person, whose IP fluctuates but stays in the same range, has asked a long series of these questions, and promptly after you answered, he asked another. The questions are off-topic, since they don't go towards improving the article, and if they are honest, the writer needs to learn how to look for the answers himself, most of which are easy to find on the Unicode website.--Prosfilaes 21:28, 13 May 2006 (UTC)[reply]
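For reference, the answer above is easy to verify against the Unicode character database bundled with Python (a quick check, not part of the discussion):

```python
import unicodedata

print(unicodedata.name("\u0237"))      # LATIN SMALL LETTER DOTLESS J
print(unicodedata.name("\U0001D6A5"))  # the mathematical italic variant
```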

Latin capital letter J with dot above

Is there a code point in Unicode for "Latin capital letter J with dot above" assigned? --84.61.56.3 18:21, 13 May 2006 (UTC)[reply]

No. — Monedula 11:01, 31 May 2006 (UTC)[reply]

Why has Unicode no code point for "Latin capital letter J with dot above" assigned? --84.61.48.130 10:29, 3 June 2006 (UTC)[reply]

The code points are usually assigned from the question "why should we?" rather than "why not?" Is there a language that uses "Latin capital letter J with dot above"? If not, no need to add the code point. Mlewan 06:22, 5 June 2006 (UTC)[reply]
Even if there were a language using it, it wouldn't get added, because it can be formed with a combining sequence. Adding precomposed characters was a temporary concession to help existing software developers migrate to Unicode. Plugwash 15:32, 5 June 2006 (UTC)[reply]

Should we insert the Unicode logo on the page?

The following reply was received from the Consortium on May 30, 1996:

You have permission to display the Unicode logo in Wikipedia as long as:
1. The logo links to our website http://www.unicode.org
2. You include the following acknowledgement in the fine print:
The Unicode(R) Consortium is a registered trademark, and Unicode (TM) is a trademark of Unicode, Inc.

Monedula 11:15, 31 May 2006 (UTC)[reply]

Why has Unicode no code point for the Unicode logo assigned? --84.61.60.218 15:36, 8 June 2006 (UTC)[reply]

Because Unicode does not encode logos. Not even its own. Especially not its own, in fact. ;-> — Cherlin

Precomposed characters

Why has Unicode the encoding of new precomposed characters stopped? --84.61.71.139 19:13, 8 June 2006 (UTC)[reply]

Introducing a new representation (a precomposed character) for an already encodable (using combining sequences) character is a bad thing because it will break compatibility with those that can handle the character but don't understand the new representation. There is also the issue of normalisation rules that say the normalised form of a character must not change between revisions (so the new precomposed characters could not actually be used in normalised text). Unicode only had precomposed characters in the first place as a carrot to get users of legacy encodings to switch, nothing more. Plugwash 22:00, 8 June 2006 (UTC)[reply]

Why has Unicode many code points for precomposed characters assigned? --84.61.7.99 08:18, 9 June 2006 (UTC)[reply]

Ligatures

Why has Unicode no code points for ligatures assigned? --84.61.7.99 07:25, 9 June 2006 (UTC)[reply]

Double Byte Character Sets before Unicode: Only in East Asia?

Why are all Double Byte Character Sets, before Unicode was published, from East Asia? --84.61.7.99 07:31, 9 June 2006 (UTC)[reply]

Really, this is not the point of the talk page. Go somewhere else to ask these questions.--Prosfilaes 07:37, 9 June 2006 (UTC)[reply]

Confusing typo in Ligatures section

The phrase "the special marks are preceed the main letterform in the datastream" appears in the Ligatures section and seems to have a typo. Which of the following is it trying to say?

1: "the special marks preceed the main letterform in the datastream"
2: "the special marks are preceeded by the main letterform in the datastream"

These two phrases have opposite meanings, of course. --Silent Blue 13:46, 16 June 2006 (UTC)[reply]

Noncharacters U+FDD0 - U+FDEF

Are there any characters to the Unicode code points U+FDD0 - U+FDEF assigned? --84.61.23.172 09:47, 23 June 2006 (UTC)[reply]

Please, don't ask that kind of questions here. This page is about the content of the article and nothing else. Mlewan 12:53, 23 June 2006 (UTC)[reply]
I disagree with you about the purpose of talk pages. It is a very good place to ask questions about the topic (and I have seen it used much for this purpose in the past, to the benefit of those involved, and often of benefit to the article as well). If you don't want to answer them, you can ignore them. - Rainwarrior 19:19, 3 July 2006 (UTC)[reply]
You use "you" for two persons here. I added the comment above but never removed anything. You can check the history if you want to see who removed what. The initial question was lazy and does not add anything to the discussion. If 84.61.23.172 had simply read the article and followed the links, s/he would have found the answer. The question below is different. The answer is not obvious and it is at least possibly interesting. Mlewan 19:46, 3 July 2006 (UTC)[reply]
(Yes, the edit comments were addressed to the person who reverted the addition of the second question. The talk page comment was only addressed to you. Sorry if that was confusing.) I wouldn't have commented if you had just said "you can find easily this information in the article" to the person, but you said "this page is about the content of the article and nothing else", so I took it to mean just that. - Rainwarrior 19:54, 3 July 2006 (UTC)[reply]
It's not a very good place to ask questions about the topic. It's inappropriate and distracts people actually working on the article. If it were one or two questions, it wouldn't be a big deal, but this is needlessly bloating the talk page. You go anywhere and start asking a lot of questions that show you haven't done your homework, ignore the mission of that group of people, and ignore anyone who tries talking to you about better ways of doing things, you're going to be considered rude. Frankly, I think he's a troll, the way he always has more questions the instant the last one is answered and never gives any explanation. If not, he's just horribly rude and demanding.--Prosfilaes 00:45, 4 July 2006 (UTC)[reply]
Definitely a troll. I firmly approve of removing this user's comments when they're of no relevance to the content of the page. It'd be a different matter if the amount of pointless questions asked wasn't trolling. It's as if the user has a list of Top 100 Unicode questions and reels one off when there has been a large time gap or as soon as the last one has been answered. Sukh | ਸੁਖ | Talk 00:51, 4 July 2006 (UTC)[reply]
Ah, I hadn't looked at the rest of the page. I'm sorry. I retract any of my earlier statements. Treat this guy however you will. - Rainwarrior 01:45, 4 July 2006 (UTC)[reply]

Displaying Unicode Characters

To use one of the available Unicode fonts on your computer to display the special Unicode characters on web pages: if you are using a special character inside a table, chart, or box, specify class="Unicode" in the table's TR tag (or in each TD tag, but using it in each TR is easier than in each TD); in wiki table code, use it after the TR equivalent "|-" (like |- class="Unicode"). For individual characters, the template code {{Unicode|char}} can also be used; you may use HTML decimal or hexadecimal in place of char. If a paragraph with lots of special Unicode characters needs to be displayed, then <p class="Unicode"> ... </p> code can be used. Thanks. ~ Tarikash 22:42, 14 July 2006 (UTC).[reply]

Bibliography or Further Reading?

Could this article use a bibliography or a list of books for further reading? Seems to me there are some good texts in print regarding Unicode.

tamil?

"However, several issues remain as far as Indic language support is concerned. For instance, the Tamil language has been assigned only 127 blocks, which, while enabling correct text display, causes problems when text is copied onto a word processor. This problem can easily be rectified if an additional 130 blocks are allotted to Tamil."

wtf does blocks mean in this case? what is the problem with word processors and how will more "blocks" help? i've commented this out until it is explained. Plugwash 18:18, 13 August 2006 (UTC)[reply]

Hi

By blocks I was referring to code points allotted in the code chart for the Tamil language. The problem is not with word processors but with the allocation of spaces in the Unicode standard itself. The Tamil language has 12 vowels and 18 consonants. Simple math yields 216+12+18=246 characters. Tamil also has a special character called 'aytha ezhuthu'. Put together there are 247 letters of the alphabet. However, the powers that be at Unicode have decided that Tamil does not have to be allocated so many points. Instead they have allotted a few code points for joiners and modifiers. The problem arises when text is copied and pasted. The joiners are rendered as independent characters ('ku' is displayed as 'k'+'u', for instance). Illogical ordering of letters and modifiers is another problem.

Regards

C Ramesh —The preceding unsigned comment was added by 203.199.211.197 (talkcontribs) 12:57, 14 August 2006 (UTC)

Ramesh, this still isn't very clear. (For anyone who wonders where the 216 comes from, I assume it is all consonant/vowel pairs.) Are you saying that 'ku' is representable in the source application and destination application, but not in the clipboard? What representation are the source and destination applications using that allow them to represent the 'ku' character? Chovain 13:58, 14 August 2006 (UTC)[reply]

Hi Chovain

You are right. The 216 letters are vowel-consonant pairs, but they are all treated as individual letters, unlike in English.

I think the problem would be better understood with this illustration:

க = the letter ka

ு = the 'u' vowel sign

கு = the letter 'ku'

When I copy 'கு' from a Web page and paste it onto a word processor, it would appear as க ு (without the space between the two letters). The letter and the vowel sign must not appear as separate letters in Tamil. That's my biggest quibble with Unicode. It's perfect for Tamil text display but fails miserably when it comes to text representation in a word processor or text editor.

C Ramesh —The preceding unsigned comment was added by 203.199.211.197 (talkcontribs) 15:29, 14 August 2006 (UTC)

Ok - that makes much more sense. I've rewritten the paragraph in question. Let me know what you think. Chovain 23:48, 15 August 2006 (UTC)[reply]

I do not understand the recent addition that says that "TSCII does not suffer from this problem". The current article on TSCII says that it only uses 128 characters for "non-ASCII" - which would make it impossible to encode the 216 vowel-consonant combinations. So how does TSCII solve the problem? --Alvestrand 05:55, 16 August 2006 (UTC)[reply]
Hmm - that's a good point. C Ramesh (or anyone for that matter): Any chance you could shed some light on this? (Don't forget to sign your posts with "~~~~", too) Chovain 07:04, 16 August 2006 (UTC)[reply]
Actually, I think I've got it right now. Not 'all' uyirmei require a special glyph, so it 'can' be done in 128 slots, but Unicode doesn't even use all of the 128 slots allocated. Does anyone have a better reference for this stuff? This is feeling too much like original research. Chovain 07:16, 16 August 2006 (UTC)[reply]
The most official TSCII to Unicode conversion guide is Unicode technical note 15, referenced on the TSCII page. [6] - even though Unicode technical notes are not part of the standard, I don't think many people want to deviate from that. Note that this refers to Unicode version 4.0; 4.1 added another character. --Alvestrand 07:41, 17 August 2006 (UTC)[reply]

Hi Chovain

Thanks for the rewrite. It certainly provides a lot more clarity.

C Ramesh —The preceding unsigned comment was added by 203.199.211.197 (talkcontribs) 12:25, 16 August 2006 (UTC)

Ramesh - PLEASE sign your comments with ~~~~.  
See WP:SIG if you don't know what I am talking about. Chovain 03:22, 17 August 2006 (UTC)[reply]
So in summary it seems the real issue is that word processor authors haven't fully dealt with the fact that "user perceived character" is not the same as "code point", despite having had many years now to do so. Plugwash 16:17, 16 August 2006 (UTC)[reply]
That still doesn't stop this from being a valid issue. Requiring programmers to treat particular characters differently is a pretty serious issue in itself. If I were writing a word processor that I wanted to handle any language, I certainly wouldn't know to treat these cases any differently. If it took 2 characters to represent French's "ç" character (the "c" and the dangly bit - sorry, I don't actually know French :)), this would be considered a very serious problem. Chovain 03:13, 17 August 2006 (UTC)[reply]
There's no feasible way to handle the world's languages without intelligence. Many, many scripts need position-sensitive shaping to look right. And ç can be stored as 2 code points, a c and a combining cedilla, and many Latin languages use letters that must be stored as two or more code points.--Prosfilaes 04:18, 17 August 2006 (UTC)[reply]
Wait a bit. If the main problem is that கு becomes க ு in a word processor this is a non issue. It works perfectly fine for me when I copy from Safari and paste in TextEdit, Pages or a large number of other MacOS X word processors. Pasting in MS Word fails, exactly as described, but that is a shortcoming of MS Word for Mac - not a shortcoming of Unicode. Could someone confirm that this is indeed the problem and revert the article? Mlewan 05:36, 17 August 2006 (UTC)[reply]
No, that's not the problem as I understand it. You think you're seeing it correctly in your OSX apps because everything is displaying it wrong. To make matters worse, anyone with a Tamil-enabled browser is going to see 'கு' differently to the rest of us :). The character we are discussing (கு) is not meant to look like a க and a ு joined together; it's meant to look like க with the tail extending around like a backwards '@'. See 6th char along the top row of this image. Chovain 06:16, 17 August 2006 (UTC)[reply]
I see exactly the 6th character in both Safari and TextEdit exactly as you describe it and as the picture shows. Mlewan 06:55, 17 August 2006 (UTC)[reply]
Ok, great. But does the fact that many OSX apps have worked around this issue stop it from being a Unicode issue? Can TextEdit individually represent the characters 'ka' and 'u' without a space between them? There are 3 separate characters in Tamil, and Unicode can only represent 2 of them. This is not an issue with the number of code points allocated (as originally described). As I understand it, it is just a problem that Unicode does not define all of Tamil's characters as other encodings do.
Yes, ka and u can be written (displayed when they are pasted) next to each other without a space. I do have three different characters in front of me in a TextEdit document: 0BC1 (u), 0b95 (ka) and the mix of them as per your picture. Displayed in any order. No spaces. Or with.
This is not something "many OS X apps" have worked around. The solution is built into the OS. It doesn't work in MS Word, as Word uses its own text rendering engine. Mlewan 08:12, 17 August 2006 (UTC)[reply]
(unindenting) So you are able to display 4 different things: 'ka', 'u', the correct 'ku' glyph, and the incorrect 'ku' glyph. The hex values are: 'ka'=>0x0B95; 'u'=>0x0BC1; 'ku'=>0x0B95,0x0BC1; and how is the incorrect one (MS-style side-by-side representation) represented in non-MS apps? If it is also represented by a 0x0B95,0x0BC1 combination, then I'm betting it gets displayed correctly again when you save and reopen the file. Chovain 12:01, 17 August 2006 (UTC)[reply]
If I get your question right, you want to know what happens if I put க ு (0x0B95 and 0x0BC1) displayed as two separate characters in a TextEdit document, save it, close it and reopen it. The answer is that it displays exactly the same as when I saved it. If you want to know more about the options, I suggest you try a Mac out at your nearest dealer. To actually use Tamil input may not be trivial, but you have information about that at http://discussions.apple.com/message.jspa?messageID=1200527#1200527 . Mlewan 20:25, 17 August 2006 (UTC)[reply]
Some further test results: A UTF16 encoded text file from a Mac shows கு perfectly fine in Notepad on Windows XP. However, the trick MacOS uses to be able to have க and ு next to each other without a space, is to actually save it with a space. The consequence is that an additional space is displayed in Notepad. If you type a space between க and ு on MacOS, the file is saved with two spaces: க ு. Even on Windows you can paste கு successfully to both Notepad and OpenOffice, but MS Office fails. Mlewan 13:32, 22 August 2006 (UTC)[reply]
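For reference, the disputed combination really is two code points at the storage level, whatever any given renderer does with it. A quick check with Python's unicodedata module (a sketch; any Unicode-aware language shows the same thing):

```python
import unicodedata

ku = "\u0b95\u0bc1"  # க followed by the vowel sign ு

# Two code points, even though a correct renderer draws one glyph
print(len(ku))  # 2
for ch in ku:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

This prints TAMIL LETTER KA and TAMIL VOWEL SIGN U: the "one character" vs "two code points" mismatch discussed above.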

(Sorry for the indent mess below. I do not know what was an answer to what anymore. Mlewan 13:32, 22 August 2006 (UTC))[reply]

Unicode gives Latin 'a', 'e', and 'æ'. It does not rely on the operating system to look for all occurrences of 'ae' and display them as 'æ'. It gives French a 'c', '¸' and a 'ç'. (If it didn't, I couldn't write 'c¸' as 2 separate characters to illustrate this example). Chovain 07:31, 17 August 2006 (UTC)[reply]
This is simply for historic and compatibility reasons. Having separate characters for combined letters is redundant. Of course it would be possible to write c and ¸ as two separate characters, the word processor just needs to add an invisible blank. -- 80.156.42.129 13:20, 28 November 2006 (UTC)[reply]
To elaborate: TSCII is able to represent the க character (0xB8), and can apply ு as a modifier to a consonant (0xA4). It has a special (single) character for கு though (0xCC). See tscii_charlist for the full chart. The table listed on the TSCII page displays 0xCC incorrectly, as it is just sending us the Unicode. Either way, the paragraph as it stands needs improvement. I'll take a shot at it soon if no one beats me to it. Chovain 06:40, 17 August 2006 (UTC)[reply]
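To make the conversion concrete: using only the byte values quoted in this thread (0xB8, 0xA4, 0xCC — these are taken from the discussion above, not verified against the official TSCII chart), a TSCII-to-Unicode converter has to map some single bytes to multi-code-point strings. A hypothetical sketch:

```python
# Illustrative TSCII -> Unicode mapping, using only the byte values
# quoted in this discussion (not a complete or verified TSCII table).
TSCII_TO_UNICODE = {
    0xB8: "\u0b95",        # க (ka)
    0xA4: "\u0bc1",        # ு (vowel sign u)
    0xCC: "\u0b95\u0bc1",  # கு: one TSCII byte, two Unicode code points
}

def tscii_bytes_to_unicode(data: bytes) -> str:
    # Each TSCII byte may expand to several Unicode code points;
    # unmapped bytes are passed through unchanged for this sketch.
    return "".join(TSCII_TO_UNICODE.get(b, chr(b)) for b in data)

# The single TSCII character 0xCC and the sequence 0xB8 0xA4
# both come out as the same Unicode string
print(tscii_bytes_to_unicode(b"\xcc") == tscii_bytes_to_unicode(b"\xb8\xa4"))
```

The point being that the mapping is not one-to-one at the code-unit level, which is exactly why a conversion guide (like the Unicode technical note mentioned above) is needed.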

The Unicode support of Tamil is perfectly able to fulfill all user requirements (except perhaps some strange issues concerning text markup), but the software implementation needs somewhat more sophistication than visual-order encodings like TSCII. Note that Thai got its visual-order encoding grandfathered into Unicode, but most Unicode experts consider the Unicode Thai implementation an odd deviation, needing special-casing here and there (e.g. in UCA). --Pjacobi 12:11, 17 August 2006 (UTC)[reply]

See also: Wikipedia:Enabling complex text support for Indic scripts
You can also try my test pages at http://www.jodelpeter.de/i18n/tamil/index.htm to verify that identical display can be achieved using Unicode or TSCII, provided you've installed your OS and fonts correctly.
Pjacobi 19:10, 17 August 2006 (UTC)[reply]
"The table listed on the TSCII page displays 0xCC incorrectly, as it is just sending us the unicode" Sending the Unicode equivalents is all we can do in HTML to display a charset, and at least here the display in that box does look substantially similar. An image may be a better option for displaying minority (read: not supported by many people's systems) charsets, though. It's certainly an option to consider. Plugwash 11:43, 18 August 2006 (UTC)[reply]

Issues section - references are 404'd

In the Issues section, both the [1] and [2] references ('alternatives to Unicode' and 'Thai problems in collation') link to the same dead page at IBM.

WGL-4, MES-1 and MES-2 table

Can something be done to improve this table? The reader is left to figure out for themselves the correlation between the bolding and italicising and which codepoint ranges are included in the subsets. I presume that bold means it is in WGL-4, italics that it is in MES-1 (actually, there don't appear to be any examples of this), bold italics that it is in both WGL-4 and MES-1 and that all mentioned codepoint ranges are included in MES-2. Is this right? I'm still not sure why in the F0 line, 01-02 are given in parentheses. Perhaps there is another scheme (like using colours) to make all this clearer (and not quite so ugly)? Is there a particular reason the table was forced into a different font to the rest of the article? And finally, the notes [1] and [2] in the title don't seem to do anything (their content seems to only appear when you edit them). (I note from the history that the table was originally inserted by User:Crissov back in April.) Thylacoleo 03:06, 23 August 2006 (UTC)[reply]

I've tried to improve the lead-in text and the table heading. Comments, corrections and (especially) improvements are welcome. Cheers, CWC 08:43, 20 March 2007 (UTC)[reply]

Input methods???

I'm tempted to delete the entire section "Input methods". They are essentially unrelated to Unicode. --Pjacobi 22:33, 25 August 2006 (UTC)[reply]

OS List

Would a more comprehensive listing of operating systems be of benefit?

I don't think so. The question of what it means to support Unicode is hard, and all recent OSs support Unicode in some way, with the exception of some very low-level stuff.--Prosfilaes 23:59, 7 September 2006 (UTC)[reply]
It would definitely be of benefit. If someone could make a comparison between different OSs and how they support (or claim to support) unicode, that would definitely be interesting. However, I see the potential for a lot of details, so it may be better to dedicate a page to it. If there is someone prepared to collect the information, of course. Mlewan 04:32, 8 September 2006 (UTC)[reply]
I certainly wouldn't want to clutter up the article, but different OSes have varying degrees of being Unicode-enabled. Maybe a separate article that had a comparison and notes would be of some benefit.

Translating HP fonts to Unicode

Considering how HP is a party to the Consortium that attempts to be responsible for Unicode, maybe they can appear out of the blue, pipe up, and explain a simple way to translate my custom-designed HP laserfonts (originally composed in 1988 or 1989) to Unicode? (No, I don't use a PC and I don't use a Mac, and admit I am dealing with a fairly ordinary 68000 environment contained (and accessed) in a non-FAT, non-PC filesystem.)

The information here at Wikipedia is simply not explicit enough for me to translate my laserfonts to Unicode. There's got to be a way, but I need a lot more information than what is currently in the main article. And I shouldn't have to buy a PC running under Windows just to see the Unicodes.

Doesn't Unicode worry about the depths of margins into the bitmaps, or the heights and widths of the relevant data of the bitmaps?

Somebody should put together an article about the way Unicode stores its data. —The preceding unsigned comment was added by 198.177.27.18 (talkcontribs) 06:47, 11 October 2006 (UTC)

Thank you for your suggestion! When you feel an article needs improvement, please feel free to look up the details elsewhere, and make those changes. Wikipedia is a wiki, so anyone can edit almost any article by simply following the Edit this page link at the top. You don't even need to log in (although there are many reasons why you might want to). The Wikipedia community encourages you to be bold in updating pages. Don't worry too much about making honest mistakes — they're likely to be found and corrected quickly. If you're not sure how editing works, check out how to edit a page, or use the sandbox to try out your editing skills. New contributors are always welcome. Chovain 08:01, 11 October 2006 (UTC)[reply]
I think you have something of a disconnect.... a font is used to represent characters, and Unicode is used to identify characters. If your font is not very fancy/intelligent/complicated/convoluted, and you know which character set your font used to represent, you can probably find a conversion table from unicode.org and move the characters around until they make sense for Unicode input. But Unicode isn't a mechanism for representing fonts. --Alvestrand 10:40, 11 October 2006 (UTC)[reply]
A typical HP font from 1988 or 1989 allows dynamic re-editing of fonts by encouraging the user to specify particular, individual characters, and then rewriting the data (having first successfully identified the font), so there is a great similarity between locating a character in a set of HP fonts, and locating a character defined by the Unicode consortium. —The preceding unsigned comment was added by 198.177.27.20 (talkcontribs) 21:08, 11 October 2006 (UTC)
You were talking about bitmaps and 'the way Unicode stores its data'. Unicode, however, is not a font specification. It specifies a mapping of characters to codes, and a few other things, like character properties, case-folding algorithms, and the like. All this data should be findable at the Unicode website. You need to identify a font format that supports Unicode. JamesFox 22:43, 11 October 2006 (UTC)[reply]

I need a bit of help...

I've been searching all over on how to update my PC's unicode registry, as very few non-keyboard characters appear as anything other than boxes, but to no avail. Can anyone tell me how I do this? (Also, might be something to incorporate into the article) --Nintendorulez talk 20:58, 21 October 2006 (UTC)[reply]

I never heard of a PC's "unicode registry". Usually when unusual characters appear as boxes the problems are in the font or the application settings or the document itself. Which characters do you look for in particular? Chinese? Russian? Which application do they not appear in? Internet Explorer? MS Word? Firefox? What kind of documents have you tried? HTML? Text? Word? Which operating system do you have? Windows? MacOS X? Linux? Mlewan 22:02, 21 October 2006 (UTC)[reply]

Redrafted "Origin and development" section

I went to tidy up a recent edit to the "Origin and development" section of the article, and ended up rewriting the whole section. Here's what I came up with.

By the late 1980s, character encodings were available for many of the world's writing systems. However, those encodings are mutually incompatible and most are only useful in particular regions of the world. Moreover, writing software that can use multiple encodings is quite difficult. (For instance, in ISO/IEC 2022, the meaning of a byte depends on all the bytes preceding it.)
Unicode was developed as a solution to this problem: a single character encoding that includes all existing encodings, covers the whole world, and can encode texts in which arbitrary writing systems are mixed together.
The Unicode standard also includes a number of related items, such as character properties, text normalisation forms, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).
===Design Principles===
Unicode encodes the underlying characters — graphemes and grapheme-like units — rather than the variant glyphs (renderings) for such characters. It assigns a unique code point — a number, not a glyph — for each character. In other words, Unicode represents a character in an abstract way, and leaves the visual rendering (size, shape, font or style) to other software, such as a web browser or word processor. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification).
To encourage the use of Unicode, its designers also emphasised round-trip compatibility: any text in a standard encoding can be translated to Unicode and back with no loss of information. For instance, the first 256 code points were made identical to the content of ISO 8859-1, making it trivial to convert most existing western text.
The tension between these two goals has required some compromises. In many cases, essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings, for the sake of round-trip compatibility. For example, the "fullwidth forms" section of code points encompasses a full Latin alphabet that is separate from the main Latin alphabet section. In Chinese, Japanese, and Korean (CJK) fonts, these characters are rendered at the same width as CJK ideographs rather than at half the width. For other examples, see Duplicate characters in Unicode.
Also, while Unicode allows for combining characters, it also contains precomposed versions of most letter/diacritic combinations in normal use. These make conversion to and from legacy encodings simpler and allow applications to use Unicode as an internal text format without having to implement combining characters. For example é can be represented in Unicode as U+0065 (Latin small letter e) followed by U+0301 (combining acute) but it can also be represented as the precomposed character U+00E9 (Latin small letter e with acute).

The main changes I'm aware of (it's late here) are mentioning ISO/IEC 2022 and round-trip compatibility. Does anyone think this should go in the article? Cheers, CWC(talk) 17:50, 27 October 2006 (UTC)[reply]
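The combining-versus-precomposed point in the draft can be checked directly. A sketch in Python (the same check works in any Unicode-aware language): the precomposed é is a single code point whose value matches its ISO 8859-1 byte, while the combining-sequence form is a distinct two-code-point string.

```python
precomposed = "\u00e9"  # é as a single code point, U+00E9
combining = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Two different code-point sequences for the same visual result
print(precomposed == combining)          # False
print(len(precomposed), len(combining))  # 1 2

# Round-trip compatibility: U+00E9 maps straight to ISO 8859-1 byte 0xE9
print(precomposed.encode("latin-1"))     # b'\xe9'
```

Note that the combining form cannot be encoded to ISO 8859-1 directly; it has to be composed first, which is where normalization (mentioned in the draft) comes in.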

UCS-2 obsolete?

Can you really call UCS-2 "obsolete"? To me, "obsolete" means something that no longer is used. However MS SQL Server still uses UCS-2 internally, and that means that a lot of us use it indirectly every day. Our bank may use it, our HR System that pays our salary, some of our favourite web sites... Mlewan 06:28, 4 November 2006 (UTC)[reply]

No, what Microsoft and many others use is officially UTF-16, see that article and the UCS-2 article. --Red King 11:52, 4 November 2006 (UTC)[reply]
Obsolete doesn't mean it's no longer used. There are still programs written for systems so obsolete they are being emulated on an emulator written for a system that is itself obsolete and thus being run on an emulated system on real hardware. It means that a replacement has come out and that it is no longer being supported and use of it, especially in new programs, is discouraged. That applies to UCS-2; Unicode has been strongly discouraging use of it for years.--Prosfilaes 13:28, 4 November 2006 (UTC)[reply]
From wiktionary: "no longer in use; gone into disuse; disused or neglected (often by preference for something newer, which replaces the subject)."
You can say as much as you want that that is not what it means to you, but it does to me, and probably to a fair number of other people. If I read that something is obsolete, I assume that there is no need to learn anything about it. However, as MS uses it, I have to learn what it is and what restrictions come with it. Besides, I guess that they deliberately have not moved to UTF-16 for some reason - perhaps indexing and performance - and then UCS-2 even has some benefits over UTF-16. Mlewan 14:22, 4 November 2006 (UTC)[reply]
Red King, the linked-to article says nothing about UTF-16, and hopefully Microsoft could be trusted to make the distinction. Unfortunately, many programmers, apparently including the ones working on MS SQL server, don't feel the need to move from UCS-2 to UTF-16, which means that UTF-16 can usually be used masquerading as UCS-2, but the program won't properly handle the difference.--Prosfilaes 13:28, 4 November 2006 (UTC)[reply]
Remember that UCS-2 is a subset of UTF-16. Software that uses UCS-2 will automatically handle UTF-16 correctly as long as it doesn't do character-specific processing like normalization, collation, rendering, word-splitting etc. These days, support for characters outside the BMP (especially the Supplementary Ideographic Plane) is almost mandatory, so I'd be really surprised if MS SQL Server does not handle UTF-16. (I'd be a lot less surprised if some Microsoft technical writers weren't up to speed on the difference between UTF-16 and UCS-2.)
There is an important piece of software that fully supports UCS-2 but is clumsier with UTF-16: Java. Not the implementations, the language itself! Java was designed before surrogates were added to Unicode. Bad timing, that.
Regards, CWC(talk) 03:02, 5 November 2006 (UTC)[reply]
The article from MS, which I start this section with, is by no means the only reference to it. Also see blogs.msdn.com or just Google for it. I think everyone who hears about this thing for the first time is surprised, but it nevertheless seems a fact that MS Sql Server uses UCS-2. And MySQL recommends UCS-2 over UTF-8 in some situations, as their implementation of UTF-8 does not support the supplementary plane either.
I saw that thing with Java as well, and was equally surprised. So one of the world's most used programming language and (at least) two of the world's most used databases still choose UCS-2 over alternatives in some situations.
I think the word "obsolete" does not describe the situation correctly. Mlewan 07:25, 5 November 2006 (UTC)[reply]
Really the only bits of software that need to consider UTF-16 surrogates as a special case are those that deal with actually rendering the text and those that convert to/from other encodings. As far as everything else is concerned surrogates are just 16 bit words like any other. Is there any evidence that those data types can't store surrogates and if so where is that evidence? Plugwash 17:11, 5 November 2006 (UTC)[reply]
Things that count characters and anything that treats the text as more than a black box needs to understand surrogates.--Prosfilaes 13:29, 6 November 2006 (UTC)[reply]
I'm not sure I understand the purpose of Plugwash's question in this context, but in addition to Prosfilaes's answer, there is sorting and finding text - both things a database server is supposed to be able to do. Mlewan 13:56, 6 November 2006 (UTC)[reply]
Code written to store and find UCS-2 would work on UTF-16 if it didn't barf on the reserved codepoints that are now used for surrogates. OTOH, string lengths, collated sorts, code conversions, etc would all fail badly. In practice, characters outside the BMP are relatively rare at present. I guess that's why Microsoft and MySQL don't see a cost/benefit advantage in supporting them. CWC(talk) 18:08, 6 November 2006 (UTC)[reply]
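The subset relationship discussed above rests on surrogate-pair arithmetic, which maps each supplementary code point to two 16-bit units taken from ranges that plain UCS-2 merely reserves. A minimal sketch (the function name is mine, not from any particular API):

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into
    a UTF-16 high/low surrogate pair of 16-bit units."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)  # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low surrogate
    return high, low

# U+1D11E MUSICAL SYMBOL G CLEF, a character outside the BMP
print([hex(u) for u in to_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
```

Code written for UCS-2 that simply stores 16-bit units will carry these pairs through untouched, which is why storage and retrieval mostly work while counting, collation and conversion break.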

The Java situation: Character handling in J2SE 5 is based on version 4.0 of the Unicode standard. This includes support for supplementary characters, which has been specified by the JSR 204 expert group and implemented throughout the JDK. See the article Supplementary Characters in the Java Platform, the Java Specification Request 204 or the Character class documentation for more information.

http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp

The Microsoft OS situation

Windows 2000 introduced support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters. Also, supplementary characters are not supported in Windows 95/98/Me.

http://windowssdk.msdn.microsoft.com/en-us/library/ms776414.aspx

The MS SQL server situation

Since these characters’ surrogate pairs are considered two separate Unicode code points, the size of nvarchar(n) needs to be 2 to hold a single supplementary character (i.e. space for a surrogate pair)

  • String operations are not supplementary character aware. Thus operations such as Substring(nvarchar(2),1,1) will result in only the high surrogate of the supplementary characters surrogate pair. Also the Len operation will return the count of two characters for every supplementary character encountered – one for the high surrogate and one for the low surrogate.
  • In sorting and searching, all supplementary characters compare equal to all other supplementary characters

http://www.microsoft.com/globaldev/DrIntl/columns/021/default.mspx#EHD

Pjacobi 18:37, 6 November 2006 (UTC)[reply]
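The Len behaviour in the MS SQL quote above is what you get from counting 16-bit code units instead of code points; a quick sketch of the same discrepancy:

```python
clef = "\U0001D11E"  # one supplementary character (G clef)

code_points = len(clef)  # Python strings count code points: 1
# A UCS-2-era Len counts 16-bit units; the UTF-16 encoding shows why
utf16_units = len(clef.encode("utf-16-le")) // 2  # 2: a surrogate pair

print(code_points, utf16_units)  # 1 2
```

So "one character, length two" is exactly the surrogate-pair effect Pjacobi's quote describes.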


"OTOH, string lengths, collated sorts, code conversions, etc would all fail badly"
Let's go through these one at a time:
A simple concept of string length is dead in the water with Unicode anyway. Most of the time, unless you are writing a display engine, the number of units in memory is the main thing you need to be concerned with.
A sort that is based on 16-bit word values will, when applied to UTF-16, provide an order that is different from, but not in any obvious way worse than, say, a sort that sorts supplementary codepoints by codepoint number.
Code conversions in and out of the 16-bit format are indeed one of the main things that need to be changed (the other main one being the rendering engine) for workable support of supplementary characters. Plugwash 20:11, 6 November 2006 (UTC)[reply]
Good points, Plugwash. (Actually, sorting on 16-bit word values is exactly equivalent to sorting by codepoint. The surrogate stuff is very well designed. The only bad thing about it is the name "surrogate", IMO.) OTOH, sorting by raw codepoint is very user-hostile, and locale-specific collated sorts written for UCS-2 will mess up on non-BMP codepoints. (Aside: the variety of rules different cultures use for sorting is quite striking.)
Talking about simple concepts of strings, not only is the concept of string length dead, the concept of a character is on its deathbed as well. Good programmers should no longer write code that treats strings as sequences of characters; instead, strings should be treated as sequences of codepoints (the low-level view) or sequences of graphemes (the medium level view) or sequences of higher-level units (words, lines, etc).
This is why JSR 204 can get away with retaining char as a 16-bit type and storing non-BMP codepoints as a surrogate pair. Code that processes strings character by character has to be rewritten to use CodePointAt and similar methods which JSR 204 added to java.lang.String and java.lang.StringBuffer, but it's better to use a higher-level ICU4J facility such as BreakIterator. (See also the brief rationale for JSR 204 in Supplementary Characters in the Java Platform.) The days when any competent programmer could write production-quality text-processing tools from scratch are over.
Thanks also to Pjacobi for those very useful links above.
Going back to the original question, my answer is that UCS-2 is obsolete (or at least becoming obsolete), but many systems written to store and (to a lesser extent) process UCS-2 text are not. Of course, this is a fairly narrow distinction.
Cheers, CWC(talk) 09:25, 12 November 2006 (UTC)[reply]
"Actually, sorting on 16-bit word values is exactly equivalent to sorting by codepoint"
Incorrect: sorting on 16-bit word values will put the supplementary characters before the characters in the range U+E000-U+FFEF.
As for locale-specific collated sorts written for UCS-2, I presume they will treat surrogates like any other characters from outside their locale and thereby provide a sort which behaves consistently but not necessarily in a user-friendly way. Plugwash 11:31, 12 November 2006 (UTC)[reply]
Quite true. My mistake. (IIRC, UTR10 does use codepoint as the "sort of last resort".) Cheers, CWC(talk) 12:12, 12 November 2006 (UTC)[reply]
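Plugwash's correction can be demonstrated directly: by 16-bit word value, the high surrogates (0xD800-0xDBFF) of supplementary characters sort before BMP characters from U+E000 upward, while a code-point sort puts them after. A sketch:

```python
# U+E000 (first private-use BMP character) vs U+10000 (first supplementary)
bmp, supp = "\ue000", "\U00010000"

# Sort by code point value
by_code_point = sorted([bmp, supp], key=lambda s: [ord(c) for c in s])
# Sort by 16-bit UTF-16 word values (big-endian bytes compare like words)
by_utf16_word = sorted([bmp, supp], key=lambda s: s.encode("utf-16-be"))

print(by_code_point == [bmp, supp])  # True: U+E000 < U+10000
print(by_utf16_word == [supp, bmp])  # True: surrogate 0xD800 < 0xE000
```

The two orders disagree precisely on the U+E000 and above vs supplementary comparison, which is the case Plugwash identified.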

IPA and Unicode

Hi, Unicode 5.0 is now out. Does anyone know if the IPA character for the labiodental flap () has been incorporated into the latest version of the Unicode standard? Thank you. --Kjoonlee 18:10, 6 December 2006 (UTC)[reply]

It hasn't been incorporated yet. SIL Corporate PUA Assignments say it's to be included in a later version, and Proposed New Characters: Pipeline Table mentions it's still in the pipeline. --Kjoonlee 16:25, 7 December 2006 (UTC)[reply]

email and japanese

I don't know if this is a Unicode thing, but when somebody sends me Japanese characters I get stuff I can't read; yet if I send that email again to a system that uses that character set (e.g. a mobile phone, JP version) it still translates it right. It would be good to add a link on the Unicode page that leads to programs that translate these characters back to Japanese, and a web-based solution too. 124.102.32.2 04:58, 27 January 2007 (UTC)not my ip anyway[reply]

It's probably written in a standard your system isn't set up to read properly. 惑乱 分からん 16:10, 19 February 2007 (UTC)[reply]

Normalization?

The article mentions normalization, but it doesn't explain what normalization is in this context.— Preceding unsigned comment added by 217.85.157.177 (talkcontribs)

Ah, er, ... no, it doesn't, does it? And it should, shouldn't it?
I'll create a section on normalization in a few days, unless someone beats me to it. Cheers, CWC(talk) 17:09, 16 March 2007 (UTC)[reply]
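In the meantime, the core idea in one sketch: normalization rewrites canonically equivalent code-point sequences into a single chosen form, e.g. NFC (composed) and NFD (decomposed), using Python's unicodedata as an example implementation:

```python
import unicodedata

nfc_form = "\u00e9"   # é, precomposed single code point
nfd_form = "e\u0301"  # e + U+0301 COMBINING ACUTE ACCENT

# Distinct code-point sequences...
print(nfc_form == nfd_form)  # False

# ...that normalization maps onto each other
print(unicodedata.normalize("NFD", nfc_form) == nfd_form)  # True
print(unicodedata.normalize("NFC", nfd_form) == nfc_form)  # True
```

A section in the article would presumably also cover the compatibility forms (NFKC/NFKD), which additionally fold things like fullwidth letters mentioned elsewhere on this page.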


Weasel words Issues section?

Do some of the descriptions in the Issues section sound like weasel words to anyone else? Specifically, I mean the phrases like "Some Japanese computer programmers object to Unicode" and (especially) "Some have decried Unicode as a plot against Asian cultures perpetrated by Westerners..." I had added the weasel tag but it was quickly removed by someone and I was cited for vandalism - I swear, I'm not trying to mess around with anything. But that section definitely has quite a bit of "some X say" and "it is claimed that," etc.

I don't actually know anything about the debates surrounding those issues themselves (I was just browsing to learn about Unicode) so I don't know how those phrases should be corrected, but my impression is that one can add a tag there to signal for other people who might know better about how to clarify?

Yishan 03:07, 20 March 2007 (UTC)[reply]

I've just edited that section to add some references, which might be helpful. There was some opposition to Unicode 5-10 years ago, mostly from Japan, but not much was written about it in English. Those statements ("Some Japanese computer programmers object to Unicode", "Some have decried Unicode as a plot against Asian cultures perpetrated by Westerners...") are a bit 'weaselly', but they're also perfectly accurate as far as I know, and they're probably the best we can do with English-language sources. I hope this helps, CWC 09:37, 20 March 2007 (UTC)[reply]
Were not some of the Tron links very explicitly against Unicode from a Japanese perspective? If someone wants to unweasel the text, I think some of the links at Han_unification may help. Mlewan 11:50, 20 March 2007 (UTC)[reply]

Suggest merge

I suggested to merge Unicode roadmap into this article. Anyone oppose? -- Hello World! 09:47, 22 April 2007 (UTC)[reply]

Missing history

Unicode began with opposition to ISO 10646, and later the two parties finally reached a consensus and merged into one. Does anyone know about that history? — HenryLi (Talk) 18:55, 14 June 2007 (UTC)[reply]