Polyglot markup

Polyglot markup is HTML that has been written to conform to both the HTML and XHTML specifications.[1] A polyglot document can therefore be parsed either as HTML or as XML, and produces the same DOM structure either way. For example, an HTML5 document meets these criteria when it has an HTML5 doctype and is written as well-formed XHTML.[2] The same document can then be served as either HTML (text/html) or XHTML (application/xhtml+xml), depending on browser support.

The required elements of a polyglot markup document are html, head, title, and body. The most basic possible polyglot markup document would therefore look like this:[1]

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
  <head>
    <title>The title element must not be empty.</title>
  </head>
  <body>
  </body>
</html>
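
As a rough illustration of the "same structure either way" idea, the following Python sketch feeds the minimal document above to an XML parser and to an HTML tokenizer and compares the element names each one reports. It is only an approximation: Python's html.parser is a tokenizer, not a full HTML5 tree builder, so agreement here is a sanity check rather than proof of identical DOMs. The names TagCollector, tags_as_html and tags_as_xml are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

POLYGLOT_DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
  <head>
    <title>The title element must not be empty.</title>
  </head>
  <body>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Records element names in the order their start tags appear."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_as_html(source):
    """Element names seen when the source is tokenized as HTML."""
    collector = TagCollector()
    collector.feed(source)
    collector.close()
    return collector.tags


def tags_as_xml(source):
    """Element names seen when the source is parsed as XML."""
    root = ET.fromstring(source)  # raises ParseError if not well-formed
    # ElementTree prefixes each tag with "{namespace}"; keep the local name.
    return [el.tag.rsplit("}", 1)[-1] for el in root.iter()]
```

Calling tags_as_html(POLYGLOT_DOC) and tags_as_xml(POLYGLOT_DOC) and comparing the two lists gives a quick signal that both parsers see the same elements in the same order.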

In a polyglot markup document, non-void elements (such as script, p, div) cannot be self-closing even if they are empty, as this is not valid HTML.[3] For example, to add an empty textarea to a page, one cannot write <textarea/>, but has to write <textarea></textarea> instead.
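
A small lint for this rule can be sketched in Python. The example below relies on html.parser calling handle_startendtag for tags written with the "/>" syntax, and reports any such tag that is not a void element. The VOID_ELEMENTS set is an assumption based on the void elements defined by HTML (the polyglot markup specification has the authoritative list), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: void elements as defined by HTML, plus the obsolete "param".
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}


class SelfClosingLint(HTMLParser):
    """Collects non-void elements written with self-closing syntax."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_startendtag(self, tag, attrs):
        # Called only for tags written as <tag ... />.
        if tag not in VOID_ELEMENTS:
            self.problems.append(tag)


def find_self_closed_non_void(source):
    """Return the names of non-void elements written as <name/>."""
    lint = SelfClosingLint()
    lint.feed(source)
    lint.close()
    return lint.problems
```

For instance, find_self_closed_non_void('<p><textarea/><br/></p>') flags the textarea but not the br, since br is a void element and is expected to use the minimized form.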

Key Points for Creating Polyglot Documents

  • Do not use document.write() or document.writeln(), because they cannot be used in XML documents. Use the innerHTML property instead.
  • Do not use the noscript element because it cannot be used in XML documents.
  • Do not use XML processing instructions or an XML declaration.
  • Use UTF-8 encoding, and declare it in one of the ways listed in the W3C polyglot markup specification. A simple option is <meta charset="UTF-8"/>.
  • Use an acceptable DOCTYPE, like <!DOCTYPE html>. Do not use DOCTYPE declarations for HTML4 or previous versions of HTML.
  • To maintain XML compatibility, explicitly declare the default namespaces for "html", "math", and "svg" elements, like <html xmlns="http://www.w3.org/1999/xhtml">.
  • If using any attributes in the XLink namespace, then declare the namespace on the html element or once on the foreign element where it is used.
  • Every polyglot document must have at least these elements (they cannot be left out): html, head, title, and body.
  • Every tr element must be explicitly wrapped in a tbody, thead or tfoot element to keep the HTML and XML DOMs consistent.
  • Every col element in a table element must be explicitly wrapped in a colgroup element. Again, this is to keep the HTML and XML DOMs consistent.
  • Use the correct case for element names. HTML and MathML element names must be written in lowercase only. SVG element names must match the case defined for each element: some are all lowercase and some are mixed case.
  • Use the correct case for attribute names. Only lowercase letters may be used for HTML and MathML attribute names, with the exception of definitionURL. Some SVG attribute names must use only lowercase and some must use mixed case.
  • Maintain case consistency on attribute values. An easy way to do this is to only use lowercase, but this is not required.
  • Only certain elements can be void. These elements must use the minimized tag syntax like <br/> (no end tags allowed). Some of these void elements are: area, br, embed, hr, img, input, link, and meta.
  • If the HTTP Content-Language header specifies exactly one language tag, specify the language using both the lang and xml:lang attributes on the html element.
  • Do not begin the text inside of a textarea or pre element with a newline.
  • All attribute values must be surrounded by either single or double quotation marks.
  • Do not use newline characters within an attribute value.
  • Do not use the xml:space or xml:base attributes, except in foreign content like MathML and SVG. These attributes are not valid in documents served as text/html.
  • When specifying a language, use both the lang and xml:lang attributes. Do not use one attribute without the other, and both must have identical values.
  • Use only the following named entity references: amp, lt, gt, apos, quot. For others, use the decimal or hexadecimal values instead of named entities.
  • Always use character references for the less-than sign and the ampersand, except when used in a CDATA section.
  • Whenever possible (though this is not required), have script and style elements link to external files rather than carry inline content; this is good advice even for non-polyglot documents. When inline content is used, it should be "safe content" that contains no problematic less-than or ampersand characters; escaping them is not an option, because the HTML and XML parsers would then build different DOMs. It is also recommended to wrap inline script content in a CDATA section with the CDATA markers commented out: use //<![CDATA[ as the first line before the script and //]]> as the last line, where "//" comments out the markers. Using external files avoids these issues altogether.
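
Several of the structural rules above can be checked mechanically once the document parses as XML. The Python sketch below covers only four of them: the required elements, the tr and col wrapping rules, and matching lang and xml:lang values. It is a minimal illustration, not a complete polyglot validator, and the function name check_polyglot_structure and its messages are invented for this example.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ROW_GROUPS = {XHTML + "tbody", XHTML + "thead", XHTML + "tfoot"}


def check_polyglot_structure(source):
    """Return a list of problems found; an empty list means none found."""
    root = ET.fromstring(source)  # raises ParseError if not well-formed XML
    problems = []

    # Required elements: html, head, title, and body.
    if root.tag != XHTML + "html":
        problems.append("root must be html in the XHTML namespace")
    title = root.find(f"{XHTML}head/{XHTML}title")
    if title is None or not (title.text or "").strip():
        problems.append("head/title is missing or empty")
    if root.find(f"{XHTML}body") is None:
        problems.append("body is missing")

    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter():
        parent = parent_of.get(el)
        parent_tag = parent.tag if parent is not None else None
        if el.tag == XHTML + "tr" and parent_tag not in ROW_GROUPS:
            problems.append("tr is not inside tbody, thead or tfoot")
        if el.tag == XHTML + "col" and parent_tag != XHTML + "colgroup":
            problems.append("col is not inside colgroup")
        # Both attributes must be present together with identical values.
        if el.get("lang") != el.get(XML_LANG):
            name = el.tag.rsplit("}", 1)[-1]
            problems.append(f"lang and xml:lang differ on {name}")
    return problems
```

Rules about raw syntax, such as attribute quoting, named entity references, or a leading newline in pre and textarea, are not visible after XML parsing and would need a check on the source text instead.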
