Talk:Unicode equivalence

WikiProject Typography (Rated C-class, Mid-importance)
This article is within the scope of WikiProject Typography, a collaborative effort to improve the coverage of articles related to Typography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the quality scale.
This article has been rated as Mid-importance on the importance scale.
WikiProject Computing (Rated C-class, Mid-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as Mid-importance on the project's importance scale.

Technical tone

The tone of this article is really technical. babbage (talk) 04:23, 4 October 2009 (UTC)

Useful link

Is it OK to add a link to some software that I found useful? It's called charlint, and it's a Perl script that can be used for normalisation. It can be found at Wrecktaste (talk) 15:54, 21 June 2010 (UTC)


Glyph Composition / Decomposition redirects here, but the term glyph is not used in this article. — Christoph Päper 15:50, 27 August 2010 (UTC)


Mathematically speaking, the compatible forms are subsets of the canonical ones. But that sentence is a bit confusing and should probably be rewritten. (talk) 16:36, 11 March 2011 (UTC)

Then please do so. I prefer a readable Unicode description. -DePiep (talk) 22:46, 11 March 2011 (UTC)
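
One way to read the "subset" remark above: any two strings that are canonically equivalent are also compatibility equivalent, but not the other way round. A minimal Python sketch of that reading, using the standard `unicodedata` module; the helper names `canonically_equivalent` and `compatibility_equivalent` and the sample pairs are illustrative choices, not taken from the article:

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent when their NFD forms match.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a: str, b: str) -> bool:
    # Two strings are compatibility equivalent when their NFKD forms match.
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

# Illustrative sample pairs (assumed for this sketch).
pairs = [
    ("\u00C5", "A\u030A"),  # precomposed A-ring vs. A + combining ring
    ("\u212B", "\u00C5"),   # ANGSTROM SIGN vs. A WITH RING ABOVE
    ("\uFB01", "fi"),       # fi ligature vs. plain "fi"
    ("\u00B2", "2"),        # superscript two vs. digit two
    ("a", "b"),             # unrelated letters
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: canonical={canonically_equivalent(a, b)}, "
          f"compatibility={compatibility_equivalent(a, b)}")
```

In this sample, every pair that is canonically equivalent is also compatibility equivalent, while the ligature and superscript pairs are compatibility equivalent only.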

Rationale for equivalence

The following rationale was offered for why Unicode introduced the concept of equivalence:

it was desirable that two different strings in an existing encoding would translate to two different strings when translated to Unicode, therefore if any popular encoding had two ways of encoding the same character, Unicode needed to as well.

AFAIK, this is only part of the story. The main problem (duplicated chars and composed/decomposed ambiguity) was not inherited from any single prior standard, but from the merging of multiple standards with overlapping character sets.
One of the reasons was the desire to incorporate several preexisting character sets while preserving their encoding as much as possible, to simplify the migration to Unicode. Thus, for example, the ISO Latin-1 set is included exactly in the first 256 code positions, and several other national standards (Russian, Greek, Arabic, etc.) were included as well. Some attempt was made to eliminate duplication; so, for example, European punctuation is encoded only once (mostly in the Latin-1 segment). Still, some duplicates remained, such as the ANGSTROM SIGN (originating from a set of miscellaneous symbols) and the LETTER A WITH RING ABOVE (from Latin-1). Another reason was the necessary inclusion of combining diacritics: first, to allow for all possibly useful letter-accent combinations (such as the umlaut-n used by a certain rock band) without wasting an astronomical number of code points, and, second, because several preexisting standards used the decomposed form to represent accented letters. Yet another reason was to preserve traditional encoding distinctions between typographic forms of certain letters, for example the superscript and subscript digits of Latin-1, the ligatures of PostScript, Arabic, and other typographically oriented sets, and the circled digits, half-width katakana and double-width Latin letters which had their own codes in standard Japanese charsets.
All these features meant that Unicode would allow multiple encodings for identical or very similar characters, to a much greater degree than any previous standard, thus negating the main advantage of a standard and making text search a nightmare. Hence the need for the standard normal forms. Canonical equivalence was introduced to cope with the first two sources of ambiguity above, while compatibility was meant to address the last one. Jorge Stolfi (talk) 14:49, 16 June 2011 (UTC)
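
The examples in the comment above can be checked with Python's standard `unicodedata` module. This is only a sketch: the variable names and the choice of sample characters are assumptions made for illustration, and the results depend on the Unicode data shipped with the Python build.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

# Sources 1 and 2: duplicated characters and composed/decomposed spellings.
angstrom = "\u212B"     # ANGSTROM SIGN
a_ring = "\u00C5"       # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"  # A + COMBINING RING ABOVE

# Canonical normalization maps all three spellings to the same string.
canonical_same = nfc(angstrom) == nfc(a_ring) == nfc(decomposed)
print("canonical forms agree:", canonical_same)

# "Umlaut-n" has no precomposed code point, so NFC keeps two code points.
umlaut_n = nfc("n\u0308")
print("code points in NFC umlaut-n:", len(umlaut_n))

# Source 3: typographic variants, handled only by compatibility forms.
compat_samples = {
    "\u00B2": "2",       # SUPERSCRIPT TWO
    "\uFB01": "fi",      # LATIN SMALL LIGATURE FI
    "\u2460": "1",       # CIRCLED DIGIT ONE
    "\uFF76": "\u30AB",  # HALFWIDTH KATAKANA LETTER KA -> KATAKANA LETTER KA
    "\uFF21": "A",       # FULLWIDTH LATIN CAPITAL LETTER A
}
for variant, plain in compat_samples.items():
    # NFC leaves the variant alone; NFKC folds it to the plain form.
    print(f"U+{ord(variant):04X}: NFC unchanged={nfc(variant) == variant}, "
          f"NFKC folds={nfkc(variant) == plain}")
```

The first two checks correspond to canonical equivalence; the loop shows distinctions that only the compatibility forms (NFKC/NFKD) erase.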

I agree it would be nice to find a source that states the exact reasons. There are better quotes in some other Unicode articles on Wikipedia. However, except for the precomposed characters, all your reasons are the same as "an existing character set had N ways of encoding this character and thus Unicode needed N ways".
Precomposed characters were certainly mostly driven by the need to make it easy to convert existing encodings, and to make it easy to render readable output from most Unicode text. There may have been existing character sets with both precomposed characters and combining diacritics; if so, this would fall under the first explanation. But I doubt that would have led to the vast number of combined characters in Unicode. Spitzak (talk) 18:58, 16 June 2011 (UTC)

So Unicode equivalence is necessary. The question you want to answer is: why were NFD and NFC introduced? (talk) 21:31, 24 October 2012 (UTC)


This article says nothing about how Unicode equivalence is used. This means some text is missing.

I do not know of much software that relies on or supports Unicode equivalence, but there is at least one example: Wikipedia.

Unicode equivalence is recognized by the Wikipedia software in a way that allows users of both NFD and NFC systems to access the same article despite the internal difference in normalization form.

Some people might need a reference to prove this, but I do not have one, only this demonstration:

For instance those two pages are a single article:

The same occurs with Cancún: Cancún and Cancún (although the link colors might differ for some obscure and non-obvious reason):

I suggest using this information to improve the article, without turning the Wikipedia article into a «how to use Wikipedia» guide. (talk) 19:14, 11 October 2012 (UTC)
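
The behaviour described above can be imitated in a few lines of Python. This is a sketch under assumptions: the function name `title_key` is hypothetical, and normalizing titles to NFC is assumed here for illustration rather than quoted from the Wikipedia software.

```python
import unicodedata

def title_key(title: str) -> str:
    # Hypothetical lookup key: normalize the requested title to NFC
    # (an assumption for this sketch) before looking up the page.
    return unicodedata.normalize("NFC", title)

nfc_title = "Canc\u00FAn"   # precomposed u-acute, as an NFC system would send it
nfd_title = "Cancu\u0301n"  # u + COMBINING ACUTE ACCENT, as an NFD system would

# The raw strings differ code point by code point...
print("raw strings equal:", nfc_title == nfd_title)
# ...but both requests resolve to the same key, hence the same article.
print("keys equal:", title_key(nfc_title) == title_key(nfd_title))
```

Comparing the raw strings would treat the two spellings as different pages; comparing the normalized keys treats them as one.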


"Well-formedness" refers to whether the sequences of 8-bit, 16-bit or 32-bit storage units properly define a sequence of characters (technically, 'scalar values'). Having combining characters without base characters makes a string 'defective'. There are other faults in a well-formed string that have no name, such as broken Hangul syllable blocks, characters in the wrong order (not all scripts have been set up so that canonical equivalence will 'eliminate' ambiguities), and variation selectors in the wrong places. RichardW57 (talk) 00:49, 17 June 2014 (UTC)

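The distinction between "ill-formed" and "defective" drawn above can be illustrated in Python. This is a simplified sketch: the helper names are hypothetical, and the defectiveness check only looks at the very start of the string, which is narrower than the full definition in the standard.

```python
import unicodedata

def is_well_formed_utf8(data: bytes) -> bool:
    # Well-formedness is about the code units: do these bytes decode
    # to a sequence of scalar values at all?
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def starts_defective(text: str) -> bool:
    # Simplified check (assumption of this sketch): the string begins with
    # a combining mark, i.e. a mark with no base character before it.
    return bool(text) and unicodedata.category(text[0]).startswith("M")

ill_formed = b"\xc3\x28"            # lead byte followed by a non-continuation byte
lone_mark = "\u0301a".encode("utf-8")  # COMBINING ACUTE ACCENT with no base, then "a"

print("ill-formed bytes decode:", is_well_formed_utf8(ill_formed))
print("lone mark decodes:", is_well_formed_utf8(lone_mark))
print("lone mark is defective:", starts_defective(lone_mark.decode("utf-8")))
print("ordinary text is defective:", starts_defective("a\u0301"))
```

The second sample is well-formed at the encoding level yet still defective as text, which is the point made in the comment above.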
External links modified

Hello fellow Wikipedians,

I have just added archive links to one external link on Unicode equivalence. Please take a moment to review my edit. If necessary, add {{cbignore}} after the link to keep me from modifying it. Alternatively, you can add {{nobots|deny=InternetArchiveBot}} to keep me off the page altogether. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true to let others know.

You may set the |checked=, on this template, to true or failed to let other editors know you reviewed the change. If you find any errors, please use the tools below to fix them or call an editor by setting |needhelp= to your help request.

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

If you are unable to use these tools, you may set |needhelp=<your help request> on this template to request help from an experienced user. Please include details about your problem, to help other editors.

Cheers. cyberbot II (Talk to my owner: Online) 18:31, 18 January 2016 (UTC)