Comparison of HTML parsers

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:

  • HTML traversal: offer an interface for programmers to easily access and modify the "HTML string code". Canonical example: DOM parsers.
  • HTML clean: to fix invalid HTML and to improve the layout and indent style of the resulting markup. Canonical example: HTML Tidy.
Parser License Implementation language(s) Latest date* HTML parsing[1] HTML5-compliant parsing Clean HTML** Update HTML***
HTML Tidy W3C license ANSI C 2021-03-24[2] Yes[3] Yes Yes[3] Yes
HtmlUnit Apache License 2.0 Java 2021-05-16[4] Yes ? No No
libxml2 HTMLparser MIT License C 2021-05-13[5] Yes No ? ?
Parser License Implementation language(s) Latest date* HTML Parsing HTML5-compliant Parsing Clean HTML** Update HTML***
* Latest release (of significant changes) date.
** sanitize (generating standard-compatible web-page, reduce spam, etc.) and clean (strip out surplus presentational tags, remove XSS code, etc.) HTML code.
*** Updates HTML4.X to XHTML or to HTML5, converting deprecated tags (ex. CENTER) to valid ones (ex. DIV with style="text-align:center;").


6. Parse HTML codes [1]