Talk:List of XML and HTML character entity references

WikiProject Lists (Rated List-class)
This article is within the scope of WikiProject Lists, an attempt to structure and organize all list pages on Wikipedia. If you wish to help, please visit the project page, where you can join the project and/or contribute to the discussion.
This article has been rated as List-Class on the project's quality scale.
This article has not yet received a rating on the project's importance scale.
WikiProject Computing (Rated List-class)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as List-Class on the project's quality scale.
This article has not yet received a rating on the project's importance scale.

Canonical Reference

The W3C standard XML Entity Definitions for Characters (April 1, 2010) is the final authority on entity names. The original ISO standards committee (ISO/IEC JTC 1/SC 34) invited the W3C MathML working group to take over the maintenance and development of entity names. The Unicode Consortium accepts the ISO recommendation. Since there is one defining document for all entity names, it should be referenced as the authoritative document. Other references for entity names should be shown for historical reasons, since some entity names have been associated with different characters over time (for example, 'lang' and 'rang' moved from U+2329 and U+232A to U+27E8 and U+27E9 respectively). —Preceding unsigned comment added by Joejava (talk • contribs) 17:04, 16 November 2012 (UTC)
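
The 'lang' and 'rang' remapping described above can be checked against any entity table that keeps the HTML 4 and HTML5 sets apart. A minimal Python sketch, assuming the standard library's html.entities tables (name2codepoint for HTML 4.01, html5 for the HTML5 set) are an acceptable stand-in for the W3C document; the helper name compare is hypothetical:

```python
from html.entities import html5, name2codepoint

def compare(name):
    """Return (HTML 4 code point, HTML5 code point) for an entity name."""
    old = name2codepoint[name]      # HTML 4.01 table, keys have no ';'
    new = ord(html5[name + ";"])    # HTML5 table, looked up with the ';'
    return old, new

for name in ("lang", "rang"):
    old, new = compare(name)
    print(f"{name}: HTML 4 U+{old:04X}, HTML5 U+{new:04X}")
```

This only compares the two tables that ship with Python; it says nothing about which mapping a given browser or XML toolchain applies.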

Octal anyone?

I don't think that the standard references octal numbers, but some of my text tools (e.g. Unix's od(1) command) output octal representations of data. It sure would be convenient to search on whatever we've got without having to convert to hex. Since &nbsp; was the one that triggered this thought, here's a proposal for an alternative.

| Name | Character | Unicode code point (decimal octal) | Standard | DTD[1] | Old ISO subset[2] | Description[3] |
|---|---|---|---|---|---|---|
| quot | " | U+0022 (34 042) | HTML 2.0 | HTMLspecial | ISOnum | quotation mark (= APL quote) |
| nbsp |   | U+00A0 (160 0240) | HTML 3.2 | HTMLlat1 | ISOnum | no-break space (= non-breaking space)[4] |

And... since I'm guessing that this is machine generated, here's a Perl snippet that prints the (augmented) cell.

foreach my $code_point (34, 160) {
    # Print the cell as "U+hex (decimal 0octal)"
    printf "| U+%04X (%d 0%o)\n", ($code_point) x 3;
}

MichaelRWolf (talk) 14:03, 28 November 2009 (UTC)

P.S. I'd be glad to flesh out this line of Perl to generate the entire table, should you like.
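
For comparison, the same augmented cell can be produced in Python. This is only a sketch of the proposed format "U+hex (decimal 0octal)"; the function name code_point_cell is hypothetical and nothing here reflects how the article's table is actually generated:

```python
def code_point_cell(code_point):
    """Format one table cell as 'U+XXXX (decimal 0octal)'."""
    return f"U+{code_point:04X} ({code_point:d} 0{code_point:o})"

# The two sample rows from the proposal: quot and nbsp.
for cp in (0x22, 0xA0):
    print("| " + code_point_cell(cp))
```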

&#nnn; or &#nnnn;

Some cheat sheets show 3-digit references, some show 4-digit references. If I'm correct, the 3-digit references refer to ISO-8859-1 and the 4-digit references refer to ISO 10646/Unicode.

For example, I'd like to use an en dash on my site, but I'm not sure whether to use &#150; or &#8211;

Which should I be using, or does it depend on my encoding (or something else)?

Wulf (2006-08-28T23:28:00Z)

Your encoding and the number of digits don't matter, but the range of numbers represented by those digits does. &#128; through &#159;, whether you write them like that or with any number of leading zeroes (or in hexadecimal form preceded by 'x'), are technically not allowed in HTML documents, and if they were, they'd be, according to the specs, referring to non-printing control codes.
Browsers that render some refs in that range as if they were references to Windows-1252 bytes, rather than UCS code points, are doing so only for backward compatibility with pre-HTML 4 browsers that were trying to accommodate authors who were using those refs in an attempt to put certain then-illegal characters (such as the Euro symbol, en dash, em dash, and curved quotation marks) in their documents. If you use the proper codes for the characters you want (most of which would indeed require 4 digits), you should see them in all modern browsers and environments. —mjb 05:36, 30 August 2006 (UTC)
Thanks :) –Wulf 03:30, 1 September 2006 (UTC)
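
The difference between the two readings of a reference in the 128 to 159 range can be seen in a short Python sketch. It assumes that Python's html.unescape, which follows the HTML5 parsing rules, is representative of the lenient browser behaviour described above; the variable names are illustrative only:

```python
import html

# Decimal 150 falls in the 128-159 range discussed above.
as_code_point = chr(150)                    # U+0096, a C1 control code
as_cp1252 = bytes([150]).decode("cp1252")   # byte 0x96 read as Windows-1252
lenient = html.unescape("&#150;")           # HTML5-style lenient decoding
proper = html.unescape("&#8211;")           # the en dash's own code point

print(repr(as_code_point), repr(as_cp1252), repr(lenient), repr(proper))
```

Read strictly as a code point, 150 is a control code; read as a Windows-1252 byte, it is the en dash that &#8211; names directly.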

need to add

ř is a Czech character that is used in the name of the composer Dvořák, but I don't know the rest of the information for that row. I just know it would be useful to list. Symphony Girl (talk) 00:43, 6 May 2008 (UTC)

character entity reference

I'd like to know what allowable names are for non-numeric entity references. a-z, numbers, dashes seem to be allowed, but what about underscores? Other characters? Case sensitivity? How long can a name be?

Also, it appears that, at least in SGML, entity values are not restricted to one character. Is there a length limit, and how does it compare to XML? (talk) 17:40, 8 December 2007 (UTC)
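
On the XML side, an entity reference is '&' Name ';', so the allowable names follow the Name production of XML 1.0. A rough Python sketch, assuming only the ASCII subset of that production matters (the real production also admits many non-ASCII characters); the helper is_ascii_entity_name is hypothetical and does not cover SGML or any length limit:

```python
import re

# ASCII-only simplification of the XML 1.0 "Name" production.
# First character: letter, underscore, or colon.
# Later characters: those plus digits, hyphen, and period.
ASCII_XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")

def is_ascii_entity_name(name):
    """True if name is a valid XML entity name within the ASCII subset."""
    return ASCII_XML_NAME.fullmatch(name) is not None

for candidate in ("nbsp", "my_entity", "a-b.c", "9lives", "-x"):
    print(candidate, is_ascii_entity_name(candidate))
```

Under this grammar underscores are allowed anywhere, while digits, hyphens, and periods may not start a name. XML names are case sensitive, so the pattern keeps upper and lower case distinct.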

Vertical bar

What is the code for "|"? Since the code for the broken vertical bar exists, shouldn't one exist for the "original", unbroken version? __meco (talk) 14:40, 9 June 2010 (UTC)

(U+007C). | . Dan 19:54, 10 June 2010 (UTC)
|. —Tamfang (talk) 20:17, 10 June 2010 (UTC)

In the article. __meco (talk) 21:14, 10 June 2010 (UTC)

We're funnin' ya. Since the common-or-garden pipe is not a special character in HTML, nor an extension to the "original" character set, it needs no code other than "|"; but any character can be specified by its Unicode number, as shown above. Same goes for the "original" unaccented 'e'. —Tamfang (talk) 02:34, 11 June 2010 (UTC)
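
As a sketch of "any character can be specified by its Unicode number", the decimal and hexadecimal references for a character can be derived from its code point. The helper name numeric_references is hypothetical:

```python
def numeric_references(char):
    """Return the decimal and hexadecimal references for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"

# The vertical bar and the unaccented 'e' from the discussion above.
for ch in ("|", "e"):
    print(ch, numeric_references(ch))
```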

Case sensitivity of named character entities

The article does not mention anywhere whether (XML and/or HTML) named entities are case sensitive or not.

I.e. do &apos;, &APOS;, &Apos; and &apoS; all signify the same apostrophe character, or is only the first of the preceding list valid?

For HTML character entities, there are separate definitions that differ only by case (e.g. &Oslash; and &oslash; for an upper-/lowercase letter "O" with a forward slash, Ø and ø). But does the standard allow "free case" where no ambiguity exists?

—Preceding unsigned comment added by Mortenhattesen (talk • contribs) 08:31, 6 December 2010 (UTC)

-- No idea how to reply but they are case-sensitive in both HTML and XML. —Preceding unsigned comment added by (talk) 17:57, 4 January 2011 (UTC)

Entity names have been case sensitive since HTML 2.0. See RFC 1866, section 3.2.3, which says "Element and attribute names are not case sensitive, but entity names are. For example, `<BLOCKQUOTE>', `<BlockQuote>', and `<blockquote>' are equivalent, whereas `&amp;' is different from `&AMP;'."
However, the OP's question asked about &apos; &APOS; &Apos; and &apoS;. None of those are valid entity names for HTML 2.0 through 4.01[5]. &apos; is part of the HTML 5.0[6] proposal and is in XHTML 1.0.[7] --Marc Kupper|talk 18:24, 5 September 2011 (UTC)
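
Both points can be illustrated with Python's standard tables. This is a sketch under the assumption that html.entities.name2codepoint (HTML 4.01) and html.unescape (HTML5 rules) mirror the specifications closely enough for the purpose; it is not a statement about any particular browser:

```python
import html
from html.entities import name2codepoint

# Oslash and oslash are distinct names for distinct characters.
upper = chr(name2codepoint["Oslash"])   # U+00D8
lower = chr(name2codepoint["oslash"])   # U+00F8

# 'apos' is absent from the HTML 4.01 table.
apos_in_html4 = "apos" in name2codepoint

# Under the HTML5 rules only the all-lowercase spelling is decoded;
# the other spellings are left untouched.
decoded = {ref: html.unescape(ref)
           for ref in ("&apos;", "&APOS;", "&Apos;", "&apoS;")}

print(upper, lower, apos_in_html4, decoded)
```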

Apos entity

HTML 4 doesn't include the "apos" entity. However, with "apos" included, the list consists of 253 items. — Preceding unsigned comment added by (talk) 14:55, 31 October 2011 (UTC)


As XML does not have "character entity references" but "predefined entities", is this the best title? Widefox (talk) 10:13, 13 June 2012 (UTC)


HTML5 adds a truckload of new named references, and changes a few from HTML 4.0 (like &lang; and &rang;). How should we handle this? -- Edokter (talk) 08:25, 15 October 2014 (UTC)

Perpendicular or bottom?

Unicode spec says:

22A4 ⊤ DOWN TACK
= top
→ 2E06 ⸆ raised interpolation marker
→ 1F768 🝨 alchemical symbol for crucible-4
22A5 ⊥ UP TACK
= base, bottom
→ 27C2 ⟂ perpendicular

So how is the XML perp defined? 22A5 would not make sense.

I'm sorry I don't have time to investigate now :( — Preceding unsigned comment added by (talk) 17:29, 2 December 2015 (UTC)
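
One quick way to look into this is to ask an existing entity table. The sketch below assumes Python's html.entities tables as a proxy for the HTML 4.01 and HTML5 definitions; it does not consult the W3C XML entity document itself:

```python
import unicodedata
from html.entities import html5, name2codepoint

perp_html4 = name2codepoint["perp"]     # HTML 4.01 table
perp_html5 = ord(html5["perp;"])        # HTML5 table

print(f"perp: HTML 4 U+{perp_html4:04X}, HTML5 U+{perp_html5:04X}")
print(unicodedata.name(chr(perp_html5)))   # Unicode name of that code point
print(unicodedata.name("\u27C2"))          # name of the other candidate
```

In both of these tables perp points at U+22A5 (UP TACK) rather than U+27C2 (PERPENDICULAR).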