Semantic HTML

Semantic HTML is the use of HTML markup to reinforce the semantics, or meaning, of the information in webpages rather than merely to define its presentation (look). Semantic HTML is processed by regular web browsers as well as by many other user agents. CSS is used to suggest its presentation to human users.

As an example, recent HTML standards discourage use of the tag <i> (italic, a font style)^[1] in favour of more specific tags such as <em> (emphasis); the CSS stylesheet should then specify whether emphasis is denoted by an italic font, a bold font, underlining, slower or louder audible speech, etc. This is because italics are used for purposes other than emphasis, such as citing a source; for this, HTML 4 provides the tag <cite>.^[2] Another use for italics is foreign phrases or loanwords; web designers may use built-in XHTML language attributes^[3] or specify their own semantic markup by choosing appropriate names for the class attribute values of HTML elements (e.g. class="loanword"). Marking emphasis, citations and loanwords in different ways makes it easier for web agents such as search engines and other software to ascertain the significance of the text.
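
As a minimal sketch of the distinctions above, the following page marks up emphasis, a citation and a loanword separately and leaves their appearance to a style sheet. The sentences, the French phrase and the choice of italics for all three are illustrative assumptions; only <em>, <cite>, the language attribute and class="loanword" come from the discussion above.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Emphasis, citation and loanword</title>
  <style>
    /* Presentation lives here; each rule can be changed independently. */
    em        { font-style: italic; }
    cite      { font-style: italic; }
    .loanword { font-style: italic; }
  </style>
</head>
<body>
  <!-- Emphasis: stress placed on part of a sentence -->
  <p>Please <em>do not</em> reply to this message.</p>

  <!-- Citation: the title of a cited work -->
  <p>The history is told in <cite>Weaving the Web</cite>.</p>

  <!-- Loanword: a language attribute plus an author-chosen class name -->
  <p>The menu is <span class="loanword" lang="fr">à la carte</span>.</p>
</body>
</html>
```

All three phrases are rendered in italics here, yet a user agent can still tell them apart. Changing the em rule to font-weight: bold would restyle emphasis without touching citations or loanwords.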

History

HTML has included semantic markup since its inception.^[4] In an HTML document, the author may, among other things, "start with a title; add headings and paragraphs; add emphasis to [the] text; add images; add links to other pages; [and] use various kinds of lists".^[5] At one time, HTML also included presentational markup such as the <font>, <i> and <center> tags. There are also the semantically neutral span and div elements. Since the late 1990s, when Cascading Style Sheets were beginning to work in most browsers, web authors have been encouraged to avoid presentational HTML markup with a view to the separation of presentation and content.^[6]

In 2001 Tim Berners-Lee participated in a discussion of the Semantic Web, in which it was proposed that intelligent software 'agents' might one day automatically trawl the Web and find, filter and correlate previously unrelated, published facts for the benefit of human users.^[7] Such agents are not commonplace even now, but some of the ideas of Web 2.0, mashups and price comparison websites may be coming close. The main difference between these web application hybrids and Berners-Lee's semantic agents is that the current aggregation and hybridisation of information is usually designed in by web developers, who already know the web locations and the API semantics of the specific data they wish to mash, compare and combine.

An important type of web agent that does crawl and read web pages automatically, without prior knowledge of what it might find, is the web crawler or search-engine spider. These software agents depend on the semantic clarity of the web pages they find, as they use various techniques and algorithms to read and index millions of web pages a day. They provide web users with search facilities without which the World Wide Web would have only a fraction of its current usefulness.

For search-engine spiders to be able to rate the significance of the pieces of text they find in HTML documents, the semantic structures that exist in HTML need to be widely and uniformly applied to bring out the meaning of published text.^[8] The same applies to those creating mashups and other hybrids, and to more automated agents as they are developed.

While the true Semantic Web may depend on complex RDF ontologies and metadata, every HTML document makes its contribution to the meaningfulness of the Web by the correct use of headings, lists, titles and other semantic markup wherever possible. The correct use of Web 2.0 'tagging' creates folksonomies that may be equally or even more meaningful to many.^[8] HTML5 introduces several new semantic elements, such as section, article, footer, progress and nav.
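
The HTML5 elements named above might be combined roughly as follows. This is a sketch only: the page content, link targets and nesting are hypothetical, chosen to show each element in a plausible role.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example news page</title>
</head>
<body>
  <!-- nav: a block of navigation links -->
  <nav>
    <ul>
      <li><a href="/">Home</a></li>
      <li><a href="/archive">Archive</a></li>
    </ul>
  </nav>

  <!-- article: a self-contained piece of content -->
  <article>
    <h1>Article headline</h1>

    <!-- section: a thematic grouping, normally with its own heading -->
    <section>
      <h2>Background</h2>
      <p>First part of the story.</p>
    </section>
    <section>
      <h2>Reaction</h2>
      <p>Second part of the story.</p>
    </section>

    <!-- footer: information about the enclosing article -->
    <footer>
      <p>Posted by the site editor.</p>
    </footer>
  </article>

  <!-- progress: how far a task has got; the text inside is a fallback -->
  <p>Upload: <progress value="30" max="100">30%</progress></p>
</body>
</html>
```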

Not all presentational markup tags are deprecated in the HTML 4.01 and XHTML recommendations, but their use is discouraged in favour of style sheets. In HTML5 some of those elements, such as i^[9] and b,^[10] are still specified, as their meaning has been clearly defined "as to be stylistically offset from the normal prose without conveying any extra importance".
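
A short sketch of how i and b might be used under these HTML5 definitions. The sentences are invented for illustration; the two uses shown (a taxonomic name and product names in a review) are among the kinds of text the HTML5 specification gives as examples for these elements.

```html
<!-- i: text offset from the surrounding prose, here a taxonomic name -->
<p>The domestic cat, <i lang="la">Felis catus</i>, is a small carnivorous mammal.</p>

<!-- b: text made to stand out without being marked as more important -->
<p>This review compares the <b>Model A</b> and <b>Model B</b> kettles.</p>
```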

Considerations

In cases where a document requires more precise semantics than those expressed in HTML alone, fragments of the document may be enclosed within span or div elements with meaningful class names^[11] such as <span class="author"> and <div class="invoice">. Where these class names are also a fragment identifier within a schema or ontology, they may link to a more defined meaning. Microformats formalise this approach to semantics in HTML.

One important restriction of this approach is that markup based on element inclusion must meet the well-formedness conditions. As these documents are broadly tree-structured, only balanced fragments from a sub-tree can be marked up in this way.^[12] Marking up an arbitrary section of HTML would require a mechanism independent of the markup structure itself, such as XPointer.

Good semantic HTML also improves the accessibility of web documents (see also Web Content Accessibility Guidelines). For example, when a screen reader or audio browser can correctly ascertain the structure of a document, it need not waste the visually impaired user's time by reading out repeated or irrelevant information.
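
The class-name approach described in this section, and its restriction to balanced fragments, can be sketched as follows. The class names author and invoice are the ones mentioned above; the class name fragment, the person's name and the amount are hypothetical.

```html
<!-- A balanced fragment: every element opened inside the div also closes inside it -->
<div class="invoice">
  <p>Invoice prepared by <span class="author">A. N. Other</span>.</p>
  <p>Total due: 120.00</p>
</div>

<!--
  Not expressible by element inclusion: a fragment running from the middle
  of one paragraph to the middle of the next would need overlapping tags,
  which is not well-formed:

  <p>first <span class="fragment">part</p>
  <p>second</span> part</p>
-->
```

Marking such a span would need either one span per paragraph, which loses the fact that they form a single fragment, or an external pointer mechanism of the kind mentioned above.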

Google 'rich snippets'

In 2010, Google specified three forms of structured metadata that its systems would use to find structured semantic content within webpages. Such information, when related to reviews, people's profiles, business listings and events, is used by Google to enhance the 'snippet', the short piece of quoted text shown when the page appears in search listings. Google specifies that the data may be given using microdata, microformats or RDFa.^[13] Microdata is specified inside itemtype and itemprop attributes added to existing HTML elements; microformat keywords are added inside class attributes as discussed above; and RDFa relies on rel, typeof and property attributes added to existing elements.^[14]
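
To compare the three syntaxes, the sketch below marks up the same hypothetical person in each. Several details are assumptions rather than anything taken from Google's documentation: the person and job title are invented, the microdata and RDFa fragments use the schema.org vocabulary listed under External links, and the microformat is hCard. Microdata also needs an itemscope attribute, and the RDFa fragment uses the RDFa 1.1 vocab attribute; neither is mentioned in the text above.

```html
<!-- Microdata: itemscope, itemtype and itemprop attributes -->
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="name">Jane Doe</span>,
  <span itemprop="jobTitle">Editor</span>
</div>

<!-- Microformats (hCard): keywords inside class attributes -->
<div class="vcard">
  <span class="fn">Jane Doe</span>,
  <span class="title">Editor</span>
</div>

<!-- RDFa: typeof and property attributes, with vocab naming the vocabulary -->
<div vocab="https://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span>,
  <span property="jobTitle">Editor</span>
</div>
```

In all three cases the visible text is the same; only the attributes added to existing elements differ.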

See also

Microformats
Plain Old Semantic HTML
Semantic Web
XML

References

[1] "Alignment, font styles, and horizontal rules in HTML documents". W3C. 2000, revised 2002.
[2] "HTML 4.01 Specification: Phrase elements: EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE, ABBR, and ACRONYM". W3C. 1999. Retrieved 18 October 2009.
[3] "XHTML 1.0 The Extensible HyperText Markup Language (Second Edition): The lang and xml:lang Attributes". W3C. 2000, revised 2002. Retrieved 18 October 2009.
[4] Berners-Lee, Tim; Fischetti, Mark (2000). Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. San Francisco: Harper. ISBN 0-06-251587-X.
[5] Raggett, Dave (24 April 2005). "Getting started with HTML". World Wide Web Consortium. Retrieved 8 December 2010.
[6] Raggett, Dave (8 April 2002). "Adding a touch of style". World Wide Web Consortium. Retrieved 8 December 2010. This article notes that presentational HTML markup may be useful when targeting browsers "before Netscape 4.0 and Internet Explorer 4.0", which were both released in 1997.
[7] Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web". Scientific American. Retrieved 2 October 2009.
[8] Shadbolt, Nigel; Berners-Lee, Tim; Hall, Wendy (2006). "The Semantic Web Revisited" (PDF). IEEE Intelligent Systems. Retrieved 8 December 2010.
[9] "HTML5". World Wide Web Consortium.
[10] "HTML5". World Wide Web Consortium.
[11] These class names are at best suggestive rather than formally meaningful, unless they are previously shared between both creator and consumer of the content.
[12] "Well-Formed XML Documents". Extensible Markup Language (XML) 1.1. W3C.
[13] "Rich snippets". Webmaster Central. Google. Retrieved 26 May 2010.
[14] "Businesses and organizations - About organization information". Webmaster Central. Google. Retrieved 26 May 2010.

External links

schema.org: an initiative from Google, Bing and Yahoo! to create and support a common set of schemas for structured data markup on web pages.
