Comparison of HTML parsers
HTML parsers are software for the automated parsing of Hypertext Markup Language (HTML). They serve two main purposes:
- HTML traversal: offering an interface through which programmers can easily access and modify the HTML source. Canonical example: DOM parsers.
- HTML cleaning: fixing invalid HTML and improving the layout and indentation style of the resulting markup. Canonical example: HTML Tidy.
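
As a minimal sketch of the traversal purpose, the following uses `html.parser` from the Python standard library (one of the parsers listed in the table below) to collect the link targets in a document. Note that `html.parser` is event-driven rather than a DOM parser, so it only illustrates read access; the names `LinkCollector` and `collect_links` are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> element encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Return the href values of all links in the given HTML string."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

Calling `collect_links` on an HTML string returns the link targets in document order. A DOM-style parser would instead build a tree that can also be modified and serialized again.
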
| Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
|---|---|---|---|---|---|---|---|
| Lambda Soup | BSD-2-Clause | OCaml | 2016-12-10[2] | Yes | Yes | ? | ? |
| html.parser | Python Software Foundation License | Python | 2016-06-27[3] | Yes | ? | No | No |
| Html Agility Pack | Microsoft Public License | C# | 2016-07-14[4] | Yes | ? | No | ? |
| Beautiful Soup | MIT License | Python | 2016-08-02[5] | Yes | Partial[6] | Yes | Yes |
| Gumbo | Apache License 2.0 | C | 2015-05-01 | Yes | Yes | ? | ? |
| html5ever | Apache License 2.0 | Rust | 2016-02-23 | Yes | Yes | ? | ? |
| html5lib | MIT License | Python (and PHP, six years ago) | 2016-07-15[7] | Yes | Yes | Yes | No |
| HTML::Parser | Perl license | Perl | 2013-03-28 | Yes | No[8] | ? | ? |
| WebGear | GPL3 | Perl | 2017-03-10 | Yes | Yes | ? | ? |
| htmlPurifier | GNU Lesser GPL | PHP | 2009-03-25[9] | No | No | Yes | Yes |
| HTML Tidy | W3C license | ANSI C | 2017-03-01[10] | Yes[11] | Yes | Yes[11] | Yes |
| HtmlUnit | Apache License 2.0 | Java | 2016-05-27[12] | Yes | ? | No | No |
| HtmlCleaner | BSD License[13] | Java | 2015-08-24 | No | No | Yes | ? |
| Hubbub | MIT License | C | 2016-02-16 | Yes | Yes[14] | ? | ? |
| Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | ? | Yes | No |
| Jericho HTML Parser | Eclipse Public License | Java | 2015-10-24[15] | Yes | ? | ? | ? |
| jsdom | MIT license | JavaScript | 2013-07-21 | No | ? | ? | ? |
| jsoup | MIT license | Java | 2017-11-04[16] | Yes | Yes[17] | Yes | Yes |
| JTidy | JTidy License | Java | 2012-10-09[18] | No | ? | Yes | ? |
| libxml2 HTMLparser | MIT License | C | 2012-09-11[19] | Yes | No | ? | ? |
| NekoHTML | Apache License 2.0 | Java | 2014-06-02[20] | No | ? | ? | ? |
| TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ? | ? |
| Validator.nu HTML Parser | MIT License | Java | 2012-06-05 | Yes | Yes | ? | ? |
| PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | ? | No | No |
| The PHP DOMDocument-class | PHP License | PHP | 2014-10-04 | Yes | ? | No | No |
| Nokogiri | MIT License | Ruby | 2016-10-03[21] | Yes | ? | No | No |
| AVHTML | AGPL | C++ | 2015-08-27[22] | Yes | ? | No | Yes |
| BrilliantHTML5Parser | Apache License 2.0 | Swift 3 | 2016-11-10 | Yes | ? | No | No |
| MyHTML | LGPL | C | 2017-11-07 | Yes | Yes | No | No |
- * Date of the latest release containing significant changes.
- ** Sanitizes HTML code (generating standards-compliant web pages, reducing spam, etc.) and cleans it (stripping surplus presentational tags, removing XSS code, etc.).
- *** Updates HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
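
The "Update HTML" conversion described in the last note can be sketched roughly as follows. This is an illustrative example built on Python's standard `html.parser`, not the method used by any parser in the table; the names `CenterUpdater` and `update_center` are hypothetical. It assumes that only `CENTER` elements need converting, and it drops any attributes present on the original `CENTER` tag.

```python
from html.parser import HTMLParser


class CenterUpdater(HTMLParser):
    """Re-emits markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Simplification: attributes on <center> are discarded.
            self.out.append('<div style="text-align:center;">')
        else:
            # Re-emit the start tag exactly as it appeared in the source.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(markup):
    """Return markup with deprecated CENTER elements converted to DIVs."""
    updater = CenterUpdater()
    updater.feed(markup)
    updater.close()
    return "".join(updater.out)
```

End tags are re-emitted in lowercase, so the output is not a byte-for-byte copy of the unchanged parts of the input. A production tool would also need to handle the other deprecated elements and merge existing `style` attributes.
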
References
- ^ 12.2 Parsing HTML documents — HTML Standard Archived 2013-01-16 at the Wayback Machine.
- ^ Lambda Soup 0.6.1
- ^ Python 3.5.2
- ^ Nuget Html AgilityPack
- ^ Beautiful Soup 4.5.1
- ^ via html5lib
- ^ Releases · html5lib/html5lib-python
- ^ Bug #53300 for HTML-Parser: HTML 5
- ^ HTML Tidy for Windows
- ^ HTML Tidy release 5.4.0
- ^ a b What is Tidy?
- ^ HtmlUnit Release 2.22 Changes
- ^ HtmlCleaner is distributed under BSD License
- ^ According to the project's home page
- ^ Jericho HTML Parser - Browse /jericho-html/3.4 at SourceForge.net
- ^ jsoup release 1.11.1
- ^ Per the project homepage, https://jsoup.org/
- ^ JTidy - Browse /JTidy at SourceForge.net
- ^ libxml2 Releases
- ^ NekoHTML | Change History
- ^ Nokogiri release 1.6.8.1
- ^ Latest commit 8c0d99f on 27 Aug 2015