HTML sanitization

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 168.168.33.250 (talk) at 11:31, 8 January 2014 (The htmlspecialchars() function does not sanitize HTML, it merely escapes characters with a special meaning in HTML such as <, >, &, and quotes.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

HTML sanitization is the process of examining an HTML document and producing a new HTML document that preserves only the tags designated "safe". It can be used to protect against cross-site scripting (XSS) attacks by sanitizing any HTML code submitted by a user.

Basic text-formatting tags such as <b>, <i>, <u>, <em>, and <strong> are often allowed, while tags that can load or execute external content, such as <script>, <object>, <embed>, and <link>, are removed by the sanitization process.
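
The tag filtering described above can be sketched in Python with the standard library's html.parser module. This is a minimal illustration, not a production sanitizer: the names ALLOWED_TAGS, DROP_CONTENT_TAGS, AllowlistSanitizer, and sanitize are hypothetical, and the choice to discard all attributes and to drop the text inside <script> and <style> is an assumption made for the example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: the formatting tags named in the text above.
ALLOWED_TAGS = {"b", "i", "u", "em", "strong"}
# Assumption: for these elements the enclosed text is dropped as well.
DROP_CONTENT_TAGS = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags (without attributes) and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and self.skip_depth == 0:
            # Attributes are discarded, so event handlers cannot survive.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in ALLOWED_TAGS and self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text is re-escaped so it cannot be reinterpreted as markup.
            self.out.append(escape(data))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<b onclick="x()">Hi</b><script>alert(1)</script> <a href="x">link</a>'))
```

Tags that are not on the list, such as <a> here, are removed while their text is kept. A real application would normally rely on a maintained sanitizer library rather than a hand-written filter like this one.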

Sanitization is typically performed using either a whitelist or a blacklist approach. If a safe item is left off a whitelist, the sanitized output merely lacks that safe element. If an unsafe item is left off a blacklist, the sanitized output contains a vulnerability. A blacklist also becomes out of date whenever new unsafe HTML features are introduced after it was defined.
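
The difference in failure modes can be illustrated with a small Python sketch that filters tag names only. The lists, the function names, and the tag "newunsafetag" are hypothetical and stand in for whatever an application would actually use.

```python
# Hypothetical lists, based on the example tags mentioned earlier in the article.
WHITELIST = {"b", "i", "u", "em", "strong"}
BLACKLIST = {"script", "object", "embed", "link"}


def kept_by_whitelist(tags):
    """Keep a tag only if it is explicitly listed as safe."""
    return [t for t in tags if t in WHITELIST]


def kept_by_blacklist(tags):
    """Keep a tag unless it is explicitly listed as unsafe."""
    return [t for t in tags if t not in BLACKLIST]


if __name__ == "__main__":
    # "p" is safe but missing from the whitelist; "newunsafetag" stands for an
    # unsafe feature introduced after the blacklist was written.
    submitted = ["b", "p", "script", "newunsafetag"]
    print(kept_by_whitelist(submitted))  # the safe "p" is lost, nothing unsafe passes
    print(kept_by_blacklist(submitted))  # the unknown unsafe tag passes through
```

An omission from the whitelist costs functionality, while an omission from the blacklist costs security.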

In PHP, HTML sanitization can be performed using the strip_tags() function, at the risk of removing all textual content that follows an unclosed less-than symbol (opening angle bracket).[1] The HTML Purifier library is another popular option for PHP applications.[2]
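
The risk with unclosed angle brackets can be illustrated with a simplified Python analogue. This is not PHP's implementation of strip_tags(), which applies additional rules; naive_strip_tags is a hypothetical function that only mimics the general failure mode of treating everything after "<" as part of a tag until a ">" appears.

```python
def naive_strip_tags(text):
    """Drop every character between '<' and the next '>'.

    If a '<' is never closed, all remaining text is treated as part of
    the tag and is lost.
    """
    out = []
    in_tag = False
    for ch in text:
        if ch == "<":
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
        elif not in_tag:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(naive_strip_tags("<b>bold</b> text"))       # well-formed markup
    print(naive_strip_tags("if x <y then continue"))  # unclosed '<' swallows the rest
```

A sanitizer built on a real HTML parser avoids this class of problem because it distinguishes markup from stray characters in text.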

In Java (and .NET), sanitization can be achieved by using the OWASP Java HTML Sanitizer Project.[3]

In .NET, a number of sanitizers use the Html Agility Pack, an HTML parser.[4][5]

References

  1. ^ "strip_tags". PHP.net.
  2. ^ "HTML Purifier". http://www.htmlpurifier.org
  3. ^ "OWASP Java HTML Sanitizer Project". OWASP. https://www.owasp.org/index.php/OWASP_Java_HTML_Sanitizer_Project
  4. ^ "Html Agility Pack". CodePlex. http://htmlagilitypack.codeplex.com/
  5. ^ http://eksith.wordpress.com/2011/06/14/whitelist-santize-htmlagilitypack/