Comparison of HTML parsers

From Wikipedia, the free encyclopedia

HTML parsing is an automated task performed by HTML parsers, which serve two main purposes:

  • HTML traversal: offering programmers an interface to easily access and modify the HTML code. Canonical example: DOM parsers.
  • HTML cleaning: fixing invalid HTML and improving the layout and indent style of the resulting markup. Canonical example: HTML Tidy.
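
As an illustration of the traversal purpose, the following is a minimal sketch that uses the `html.parser` module from the Python standard library, which is not one of the parsers compared in this article. It is an event-driven parser rather than a DOM parser, and the names `LinkCollector` and `collect_links` are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> element seen during parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Return the href values of all links in the given HTML string."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


print(collect_links('<p>See <A HREF="/a">one</A> and <a href="/b">two</a>.</p>'))
# prints ['/a', '/b']
```

A DOM-style parser would instead build a tree that can be queried and modified after parsing; the event-driven form is shown here only because it needs no third-party package.
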

Parser | License | Implementation language(s) | Latest date* | HTML Parsing[1] | Clean HTML** | Update HTML***
Html Agility Pack | Microsoft Public License | C# | 2012-08-07[2] | Yes | No | ?
Beautiful Soup (based on lxml and html5lib)[3] | Python S. F. L. | Python | 2013-10-02 | Yes | Yes | Yes
Gumbo | Apache License 2.0 | C | 2013-08-13 | Yes | ? | ?
html5lib | MIT License | Python (and PHP, six years ago) | 2013-12-23[4] | Yes | Yes | No
HTML::Parser | Perl license | Perl | 2013-03-28 | Yes[5] | ? | ?
htmlPurifier | GNU Lesser GPL | PHP | 2009-03-25[6] | No | Yes | Yes
HTML Tidy | W3C license | ANSI C | 2009-03-25[7] | Yes[8] | Yes | ?
HtmlCleaner | BSD License[9] | Java | 2013-09-05 | No | Yes | ?
Hubbub | MIT License | C | 2013-04-19 | Yes | ? | ?
Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | Yes | No
Jericho HTML Parser | Eclipse Public License | Java | 2012-10-30[10] | No?? | ? | ?
jsdom | MIT license | JavaScript | 2013-07-21 | No | ? | ?
jsoup | MIT license | Java | 2014-09-27[11] | Yes | Yes | Yes
JTidy | JTidy License | Java | 2012-10-09[12] | Yes | Yes | ?
libxml2 HTMLparser | MIT License | C | 2012-09-11[13] | Yes | ? | ?
NekoHTML | Apache License 2.0 | Java | 2013-02-27[14] | No | ? | ?
TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ?
HTML Parser | MIT License | Java | 2012-06-05 | Yes | ? | ?
PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | No | No
The PHP DOMDocument-class | PHP License | PHP | 2014-10-04 | Yes | No | No
Nokogiri | MIT License | Ruby | 2014-11-26 | Yes | No | No

* Date of the latest release with significant changes.
** Sanitizes (generates a standards-compliant web page, reduces spam, etc.) and cleans (strips out surplus presentational tags, removes XSS code, etc.) the HTML code.
*** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
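
The CENTER-to-DIV conversion mentioned in the last note can be sketched as follows. This is a minimal illustration built on Python's standard `html.parser` module, not the method used by any parser in the table; `CenterRewriter` and `update_center` are hypothetical names. The sketch assumes that CENTER elements carry no existing style attribute, and it drops comments and the doctype and does not treat script or style content specially.

```python
from html import escape
from html.parser import HTMLParser

CENTER_STYLE = "text-align:center;"


class CenterRewriter(HTMLParser):
    """Re-emits parsed HTML, replacing <center> with a styled <div>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _open(self, tag, attrs, closing=">"):
        # Assumption: <center> has no style attribute of its own.
        if tag == "center":
            tag = "div"
            attrs = list(attrs) + [("style", CENTER_STYLE)]
        parts = []
        for name, value in attrs:
            if value is None:
                parts.append(f" {name}")  # valueless attribute
            else:
                parts.append(f' {name}="{escape(value, quote=True)}"')
        return f"<{tag}{''.join(parts)}{closing}"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._open(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Keep self-closing tags such as <br/> self-closing.
        self.out.append(self._open(tag, attrs, closing=" />"))

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape text.
        self.out.append(escape(data, quote=False))


def update_center(markup):
    """Return markup with every <center> element rewritten as a <div>."""
    rewriter = CenterRewriter()
    rewriter.feed(markup)
    rewriter.close()
    return "".join(rewriter.out)


print(update_center("<center><b>Title</b></center>"))
# prints <div style="text-align:center;"><b>Title</b></div>
```

Dedicated tools handle many more deprecated elements and edge cases; this sketch covers only the single example given in the note.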