In computer science, canonicalization (sometimes standardization or normalization) is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.
Canonicalization of filenames is important for computer security. For example, a web server may have a security rule stating "only execute files under the cgi directory (C:\inetpub\wwwroot\cgi-bin)". The rule is enforced by checking that the path starts with "C:\inetpub\wwwroot\cgi-bin\", and if it does, the file is executed.
Should file "C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe" be executed? No, because this trick path goes back up the directory hierarchy (through use of the '..' path specifier), not staying within cgi-bin. Accepting it at face value would be an error due to failure to canonicalize the filename to the unique (simplest) representation, namely: "C:\Windows\System32\cmd.exe", before doing the path check. This type of fault is called a directory traversal vulnerability.
In Unicode, many accented letters can be represented in more than way. For example é can be represented in Unicode as the Unicode character U+0065 (LATIN SMALL LETTER E) followed by the character U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). This makes string comparison more complicated, since every possible representation of a string containing such glyphs must be considered. To deal with this, Unicode provides the mechanism of canonical equivalence. In this context, canonicalization is Unicode normalization.
Variable-length encodings in the Unicode standard, in particular UTF-8, may cause an additional need for canonicalization in some situations. Namely, by the standard, in UTF-8 there is only one valid byte sequence for any Unicode character, but some byte sequences are invalid, i. e. cannot be obtained by encoding any string of Unicode characters into UTF-8. Some sloppy decoder implementations may accept invalid byte sequences as input and produce a valid Unicode character as output for such a sequence. If one uses such a decoder, some Unicode characters have effectively more than one corresponding byte sequence: the valid one and some invalid ones. This could lead to security issues similar to the one described in the previous section. Therefore, if one wants to apply some filter (e. g. a regular expression written in UTF-8) to UTF-8 strings that will later be passed to a decoder that allows invalid byte sequences, one should canonicalize the strings before passing them to the filter. In this context, canonicalization is the process of translating every string character to its single valid byte sequence. An alternative to canonicalization is to reject any strings containing invalid byte sequences.
Search engines and SEO
In web search and search engine optimization (SEO), URL canonicalization deals with web content that has more than one possible URL. Having multiple URLs for the same web content can cause problems for search engines - specifically in determining which URL should be shown in search results.
All of these URLs point to the homepage of Wikipedia, but a search engine will only consider one of them to be the canonical form of the URL.
A Canonical XML document is by definition an XML document that is in XML Canonical form, defined by The Canonical XML specification. Briefly, canonicalization removes whitespace within tags, uses particular character encodings, sorts namespace references and eliminates redundant ones, removes XML and DOCTYPE declarations, and transforms relative URIs into absolute URIs.
Simple example: Given two versions of the same XML:
- "<node1>Data</node1 > <node2>Data</node2>"
- "<node1>Data</node1> <node2>Data</node2>"
Note the extra spaces in the samples, the canonicalized version of these two might be:
- "<node1>Data</node1> <node2>Data</node2>"
Note that the spaces (and other whitespace) are removed only inside of tags! The whitespace between tags is not removed under W3C canonicalization.
A full summary of canonicalization changes is listed below:
- The document is encoded in UTF-8
- Line breaks normalized to #xA on input, before parsing
- Attribute values are normalized, as if by a validating processor
- Character and parsed entity references are replaced
- CDATA sections are replaced with their character content
- The XML declaration and document type declaration are removed
- Empty elements are converted to start-end tag pairs
- Whitespace outside of the document element and within start and end tags is normalized
- All whitespace in character content is retained (excluding characters removed during line feed normalization)
- Attribute value delimiters are set to quotation marks (double quotes)
- Special characters in attribute values and character content are replaced by character references
- Superfluous namespace declarations are removed from each element
- Default attributes are added to each element
- Fixup of
xml:baseattributes is performed
- Lexicographic order is imposed on the namespace declarations and attributes of each element