Comparison of HTML parsers

From Wikipedia, the free encyclopedia

HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:

  • HTML traversal: offer programmers an interface for easily accessing and modifying the HTML markup. Canonical example: DOM parsers.
  • HTML cleaning: fix invalid HTML and improve the layout and indent style of the resulting markup. Canonical example: HTML Tidy.
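
As a rough illustration of the traversal use case, the sketch below uses html.parser from the Python standard library (one of the parsers compared in this article) to collect link targets from a fragment of markup. Note that html.parser is an event-driven parser rather than a DOM parser, so the document is visited tag by tag instead of being loaded into a tree. The names LinkCollector and collect_links are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Return the href values of all anchors in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

Calling collect_links on a page's source returns a plain list of strings. A DOM-style library such as Beautiful Soup or jsoup would instead expose a tree that can be queried and modified in place.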

| Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
|---|---|---|---|---|---|---|---|
| Lambda Soup | BSD-2-Clause | OCaml | 2016-12-10[2] | Yes | Yes | ? | ? |
| html.parser | Python Software Foundation License | Python | 2016-06-27[3] | Yes | ? | No | No |
| Html Agility Pack | MIT License | C# | 2019-07-07[4] | Yes | ? | No | ? |
| HTML Monkey | Microsoft Public License | C# | 2018-12-14 | Yes | ? | ? | ? |
| Beautiful Soup | Python Software Foundation License | Python | 2019-01-07[5] | Yes | Partial[6] | Yes | Yes |
| Gumbo | Apache License 2.0 | C | 2015-05-01 | Yes | Yes | ? | ? |
| html5ever | Apache License 2.0 | Rust | 2016-02-23 | Yes | Yes | ? | ? |
| html5lib | MIT License | Python (and PHP, six years ago) | 2016-07-15[7] | Yes | Yes | Yes | No |
| HTML::Parser | Perl license | Perl | 2013-03-28 | Yes | No[8] | ? | ? |
| WebGear | GPL3 | Perl | 2017-03-10 | Yes | Yes | ? | ? |
| htmlPurifier | GNU Lesser GPL | PHP | 2019-07-14[9] | No | No | Yes | Yes |
| HTML Tidy | W3C license | ANSI C | 2017-03-01[10] | Yes[11] | Yes | Yes[11] | Yes |
| HtmlUnit | Apache License 2.0 | Java | 2016-05-27[12] | Yes | ? | No | No |
| HtmlCleaner | BSD License[13] | Java | 2015-08-24 | No | No | Yes | ? |
| Hubbub | MIT License | C | 2016-02-16 | Yes | Yes[14] | ? | ? |
| Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | ? | Yes | No |
| Jericho HTML Parser | Eclipse Public License | Java | 2015-10-24[15] | Yes | ? | ? | ? |
| jsdom | MIT license | JavaScript | 2018-08-19 | Yes | Yes | ? | ? |
| jsoup | MIT license | Java | 2019-05-12[16] | Yes | Yes[17] | Yes | Yes |
| JTidy | JTidy License | Java | 2012-10-09[18] | No | ? | Yes | ? |
| libxml2 HTMLparser | MIT License | C | 2017-11-02[19] | Yes | No | ? | ? |
| NekoHTML | Apache License 2.0 | Java | 2014-06-02[20] | Yes | ? | ? | ? |
| TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ? | ? |
| HTML Parser | MIT License | Java | 2012-06-05 | Yes | Yes | ? | ? |
| PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | ? | No | No |
| The PHP DOMDocument-class | PHP License | PHP | 2014-10-04 | Yes | ? | No | No |
| Nokogiri | MIT License | Ruby | 2016-10-03[21] | Yes | ? | No | No |
| AVHTML | AGPL | C++ | 2015-08-27[22] | Yes | ? | No | Yes |
| BrilliantHTML5Parser | Apache License 2.0 | Swift 3 | 2016-11-10 | Yes | ? | No | No |
| MyHTML | LGPL | C | 2018-09-06 | Yes | Yes | No | No |
| Aspose.HTML | Proprietary | C# | 2018-06-06 | Yes | Yes | ? | ? |
| Lexbor | Apache License 2.0 | C | - | Yes | Yes | No | No |
| tooska | LGPL | C++ | 2019-06-29 | Yes | ? | ? | ? |

* Date of the latest release that contained significant changes.
** Sanitizes HTML (generates a standards-compliant web page, reduces spam, etc.) and cleans it (strips surplus presentational tags, removes XSS code, etc.).
*** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
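
The kind of update described in the last footnote can be sketched with Python's standard html.parser. This is only a minimal illustration, not how any of the listed tools implement it: the parser re-emits the markup it reads and swaps CENTER for DIV with style="text-align:center;". The names CenterRewriter and upgrade_center are hypothetical. As simplifying assumptions, attributes on the original CENTER tag are dropped, end tags are written in lowercase, and processing instructions are not handled.

```python
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Hypothetical helper: re-serializes markup, replacing <center>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption: any attributes on <center> are discarded.
            self.out.append('<div style="text-align:center;">')
        else:
            # Re-emit other start tags exactly as they appeared.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> pass through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def upgrade_center(markup):
    """Return markup with every <center> element turned into a styled <div>."""
    rewriter = CenterRewriter()
    rewriter.feed(markup)
    rewriter.close()
    return "".join(rewriter.out)
```

Tools marked "Yes" in the Update HTML column cover many more deprecated constructs and also repair the surrounding document structure, which this sketch does not attempt.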