Overlapping markup

In markup languages and the digital humanities, overlap (also known as concurrent markup) occurs when a document has two or more structures that interact in a non-hierarchical manner. A document with overlapping markup cannot be represented as a tree. Overlap happens, for instance, in poetry, where there may be a metrical structure of feet and lines; a linguistic structure of sentences and quotations; and a physical structure of volumes and pages, together with editorial annotations.[1][2]

History

The structural differences between multiple editions of Frankenstein have been analysed with overlapping techniques.[3]

The problem of non-hierarchical structures in documents has been recognised since 1988; resolving this problem against the dominant paradigm of text as a single hierarchy (an ordered hierarchy of content objects or OHCO) was initially thought to be merely a technical issue, but has, in fact, proven much more difficult.[4] In 2008, Jeni Tennison identified markup overlap as "the main remaining problem area for markup technologists".[5]

Properties and types

A distinction exists between schemes that allow non-contiguous overlap, and those that allow only contiguous overlap. Often, 'markup overlap' strictly means the latter. Contiguous overlap can always be represented as a linear document with milestones (typically co-indexed start- and end-markers), without the need for fragmenting a (logical) component into multiple physical ones. Non-contiguous overlap may require document fragmentation. Another distinction in overlapping markup schemes is whether elements can overlap with other elements of the same kind (self-overlap).[2]
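
The contiguous case can be made concrete with character offsets. The following is a minimal sketch, assuming each component is a half-open range (start, end) over the text; the function name relation is illustrative and is not taken from any of the schemes discussed here. Only disjoint and nested pairs fit into a single tree; an overlapping pair does not.

```python
def relation(a, b):
    """Classify how two half-open (start, end) ranges relate to each other."""
    (a_start, a_end), (b_start, b_end) = a, b
    # No shared characters at all.
    if a_end <= b_start or b_end <= a_start:
        return "disjoint"
    # One range lies entirely inside the other (identical ranges count as nested).
    if (a_start <= b_start and b_end <= a_end) or (b_start <= a_start and a_end <= b_end):
        return "nested"
    # They share some characters, but neither contains the other.
    return "overlapping"


# A verse line and a sentence that starts mid-line and runs into the next line.
print(relation((0, 44), (17, 87)))  # overlapping
```

Non-contiguous components would need a list of ranges per component, which this sketch does not model.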

A scheme may have a privileged hierarchy. Some XML-based schemes, for example, represent one hierarchy directly in the XML document tree, and represent other, overlapping, structures by another means; these are said to be non-privileged.

Approaches and implementations

DeRose (2004, Evaluation criteria) identifies several criteria for judging solutions to the overlap problem: readability and maintainability, tool support and compatibility with XML, possible validation schemes, and ease of processing.

Tag soup is, strictly speaking, not overlapping markup—it is malformed HTML, which is a non-overlapping language, and may be ill-defined. Some web browsers attempted to represent overlapping start and end tags with non-hierarchical Document Object Models (DOM), but this was not standardised across all browsers and was incompatible with the innately hierarchical nature of the DOM.[6][7] HTML5 defines how processors should deal with such mis-nested markup in the HTML syntax and turn it into a single hierarchy.[8] With XHTML and SGML-based HTML, however, mis-nested markup is a strict error and makes processing by standards-compliant systems impossible.[9] The HTML standard defines a paragraph concept which can cause overlap with other elements and can be discontiguous.[10]
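
The strictness of XML-based syntaxes towards mis-nested markup can be observed with any conforming XML parser. The sketch below uses the XML parser from Python's standard library as one such processor; the helper name is_well_formed is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False if it rejects it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Mis-nested tags are a fatal well-formedness error in XML.
        return False


print(is_well_formed("<p><b><i>text</i></b></p>"))  # properly nested
print(is_well_formed("<p><b><i>text</b></i></p>"))  # mis-nested
```

The first call is accepted and the second is rejected, because the parser stops at the first mismatched end tag instead of attempting a repair.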

SGML, which early versions of HTML were based on, has a feature called CONCUR that allows multiple independent hierarchies to co-exist without privileging any. DTD validation is only defined for each individual hierarchy with CONCUR. Validation across hierarchies is not defined by the standard. CONCUR could not support self-overlap, and it interacted poorly with some of SGML's abbreviatory features. This feature was poorly supported by tools and saw very little actual use; using CONCUR to represent document overlap was not a recommended use case, according to a commentary by the standard's editor.[11][12]

Within hierarchical languages

There are several approaches to representing overlap in a non-overlapping language.[13] The Text Encoding Initiative, as an XML-based markup scheme, cannot directly represent overlapping markup. Its guidelines suggest all four of the approaches below.[1] The Open Scripture Information Standard is another XML-based scheme, designed to mark up the Bible. It uses empty milestone elements to encode non-privileged components.[14]

To illustrate these approaches, the running example marks up the sentences and lines of a fragment of Richard III by William Shakespeare. Where there is a privileged hierarchy, it is that of the lines.

Multiple documents

One approach is to maintain multiple documents, each providing a different, internally consistent hierarchy. The advantage of this approach is that each document is simple and can be processed with existing tools, but it requires maintenance of redundant content, and it can be difficult to cross-reference between different views.[15] With multiple documents, the overlap can be analysed with data comparison and delta encoding techniques, and, in an XML context, specific XML tree differencing algorithms are available.[16][17]

Example, with lines marked up:

  <line>I, by attorney, bless thee from thy mother,</line>
  <line>Who prays continually for Richmond's good.</line>
  <line>So much for that.—The silent hours steal on,</line>
  <line>And flaky darkness breaks within the east.</line>

With sentences marked up:

  <sentence>I, by attorney, bless thee from thy mother,
  Who prays continually for Richmond's good.</sentence>
  <sentence>So much for that.</sentence><sentence>—The silent hours steal on,
  And flaky darkness breaks within the east.</sentence>
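
Because the content is duplicated, the two documents can drift apart. A simple consistency check is to strip the markup from each and compare the remaining text. The following is a minimal sketch, assuming each fragment is wrapped in a root element (here called text) so that it parses as XML; the helper name plain_text is illustrative.

```python
import xml.etree.ElementTree as ET

LINES = """<text>
<line>I, by attorney, bless thee from thy mother,</line>
<line>Who prays continually for Richmond's good.</line>
<line>So much for that.—The silent hours steal on,</line>
<line>And flaky darkness breaks within the east.</line>
</text>"""

SENTENCES = """<text>
<sentence>I, by attorney, bless thee from thy mother,
Who prays continually for Richmond's good.</sentence>
<sentence>So much for that.</sentence><sentence>—The silent hours steal on,
And flaky darkness breaks within the east.</sentence>
</text>"""


def plain_text(xml):
    """Return the text content of a document with whitespace runs collapsed."""
    raw = "".join(ET.fromstring(xml).itertext())
    return " ".join(raw.split())


# Both views should carry exactly the same words in the same order.
print(plain_text(LINES) == plain_text(SENTENCES))
```

This only detects divergence in the shared content; locating where two hierarchies overlap still requires aligning them, for example with the differencing techniques mentioned above.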

Milestones

Milestones are empty elements that mark the beginning and end of a component. These can be used to embed a non-privileged structure within a hierarchical language, and can only represent contiguous overlap. Existing tools will also not understand the meaning of the milestone elements and so cannot easily process or validate the non-privileged structure.[18][19] The markup being near the content is an advantage for maintainability and readability.[20] CLIX (DeRose 2004) is an example of such an approach.

Example:

  <line><sentence-start />I, by attorney, bless thee from thy mother,</line>
  <line>Who prays continually for Richmond's good.<sentence-end /></line>
  <line><sentence-start />So much for that.<sentence-end /><sentence-start />—The silent hours steal on,</line>
  <line>And flaky darkness breaks within the east.<sentence-end /></line>
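
Since generic XML tools see the milestones only as empty elements, the non-privileged structure has to be reconstructed by custom code. The following is a minimal sketch, assuming the fragment is wrapped in a root element (here called text) and that sentences never nest; the function names events and sentences are illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<text>
<line><sentence-start />I, by attorney, bless thee from thy mother,</line>
<line>Who prays continually for Richmond's good.<sentence-end /></line>
<line><sentence-start />So much for that.<sentence-end /><sentence-start />—The silent hours steal on,</line>
<line>And flaky darkness breaks within the east.<sentence-end /></line>
</text>"""


def events(elem):
    """Yield ('tag', name) and ('text', string) pairs in document order."""
    yield ("tag", elem.tag)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from events(child)
        if child.tail:
            yield ("text", child.tail)


def sentences(xml):
    """Collect the text between each sentence-start and sentence-end milestone."""
    result, current = [], None
    for kind, value in events(ET.fromstring(xml)):
        if kind == "tag" and value == "sentence-start":
            current = []
        elif kind == "tag" and value == "sentence-end":
            result.append(" ".join("".join(current).split()))
            current = None
        elif kind == "text" and current is not None:
            current.append(value)
    return result


for sentence in sentences(DOC):
    print(sentence)
```

The sketch ignores line boundaries while a sentence is open, which is how a sentence spanning two line elements is recovered as a single unit.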

Punctuation and inter-word spaces have been identified as a type of milestone-style 'crypto-overlap' or 'pseudo-markup', as the boundaries of words, clauses, sentences and the like do not necessarily align with the formal markup boundaries hierarchically.[21][22]

Joins

Joins are pointers within a privileged hierarchy to other components of the privileged hierarchy, which may be used to reconstruct a non-privileged component akin to following a linked list. A single non-privileged element is segmented into several partial elements within the privileged hierarchy; the partial elements themselves do not represent a single unit in the non-privileged hierarchy, which can be misleading and make processing difficult.[23][24] While this approach can support some discontiguous structures, it is not able to re-order elements.[25] A slightly different approach can, however, express re-ordering by expressing the join away from the content, at the cost of directness and maintainability.[26]

Example:

  <line><sentence id="a">I, by attorney, bless thee from thy mother,</sentence></line>
  <line><sentence continues="a">Who prays continually for Richmond's good.</sentence></line>
  <line><sentence id="b">So much for that.</sentence><sentence id="c">—The silent hours steal on,</sentence></line>
  <line><sentence continues="c">And flaky darkness breaks within the east.</sentence></line>
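
Reassembling the non-privileged components means following the pointers between fragments. The following is a minimal sketch, assuming the fragment is wrapped in a root element (here called text) and that, as in the example above, every continuation points at the id of the first fragment of its sentence; the function name join_sentences is illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<text>
<line><sentence id="a">I, by attorney, bless thee from thy mother,</sentence></line>
<line><sentence continues="a">Who prays continually for Richmond's good.</sentence></line>
<line><sentence id="b">So much for that.</sentence><sentence id="c">—The silent hours steal on,</sentence></line>
<line><sentence continues="c">And flaky darkness breaks within the east.</sentence></line>
</text>"""


def join_sentences(xml):
    """Group sentence fragments by the id of the fragment that starts them."""
    fragments = {}  # id of first fragment -> fragment texts, in document order
    for frag in ET.fromstring(xml).iter("sentence"):
        key = frag.get("id") or frag.get("continues")
        fragments.setdefault(key, []).append(frag.text or "")
    return {key: " ".join(parts) for key, parts in fragments.items()}


for key, sentence in join_sentences(DOC).items():
    print(key, sentence)
```

A scheme in which each fragment points at its immediate predecessor would instead require walking the chain link by link.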

Stand-off markup

Stand-off markup is similar to using joins, except that there is no privileged hierarchy: each part of the document is given a label (or might be referred to by an offset), and the document is constructed by pointing to the content from markup that 'stands off' from the content (possibly in an entirely different file), and might contain no content itself. Validation of stand-off markup is very challenging.[27] In addition, maintenance is a problem.[12]

Example:

  <span id="a">I, by attorney, bless thee from thy mother,</span>
  <span id="b">Who prays continually for Richmond's good.</span>
  <span id="c">So much for that.</span><span id="d">—The silent hours steal on,</span>
  <span id="e">And flaky darkness breaks within the east.</span>
  ...
  <line contents="a" />
  <line contents="b" />
  <line contents="c d" />
  <line contents="e" />
  <sentence contents="a b" />
  <sentence contents="c" />
  <sentence contents="d e" />
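
Processing stand-off markup amounts to resolving each reference back to the labelled content. The following is a minimal sketch, assuming the content and the stand-off elements live in one document under a root element (here called doc); the function name resolve is illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
<span id="a">I, by attorney, bless thee from thy mother,</span>
<span id="b">Who prays continually for Richmond's good.</span>
<span id="c">So much for that.</span><span id="d">—The silent hours steal on,</span>
<span id="e">And flaky darkness breaks within the east.</span>
<line contents="a" />
<line contents="b" />
<line contents="c d" />
<line contents="e" />
<sentence contents="a b" />
<sentence contents="c" />
<sentence contents="d e" />
</doc>"""


def resolve(xml, tag):
    """Return, for each stand-off element named `tag`, the span texts it points to."""
    root = ET.fromstring(xml)
    spans = {span.get("id"): span.text for span in root.iter("span")}
    return [
        [spans[ref] for ref in element.get("contents").split()]
        for element in root.iter(tag)
    ]


print(resolve(DOC, "line")[2])      # the third line is built from spans c and d
print(resolve(DOC, "sentence")[2])  # the third sentence is built from spans d and e
```

Neither hierarchy is privileged here: lines and sentences are resolved by the same lookup, and span d is shared by both.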

New languages

Another approach is to design an entirely new markup language. Such languages forgo the tool support available for existing languages in exchange for a less complicated semantic model and more convenient syntax.

  • LMNL is a non-hierarchical markup language first described in 2002 by Jeni Tennison and Wendell Piez, annotating ranges of a document with properties and allowing self-overlap. CLIX, which originally stood for 'Canonical LMNL In XML', provides a method for representing any LMNL document in a milestone-style XML document.[28] LMNL also has another XML serialisation, xLMNL.[29]
  • MECS was developed by the University of Bergen's Wittgenstein Archive. However, it had several problems: it allowed some nonsensical documents of overlapping elements, it could not support self-overlap, and it did not have the capacity to define a DTD-like grammar.[30] The theory of General Ordered-Descendant Directed Acyclic Graphs (GODDAGs), while not strictly a markup language itself, is a general data model for non-hierarchical markup. Restricted GODDAGs were designed specifically to match the semantics of MECS; general GODDAGs may be non-contiguous and need a more powerful language.[31] TexMECS is a successor to MECS, which has a formal grammar and is designed to represent every GODDAG and nothing that is not a GODDAG.[32]
  • XCONCUR (previously MuLaX) is a melding-together of XML and SGML's CONCUR, and also contains a validation language, XCONCUR-CL, and a SAX-like API.[33][34][35]
  • Marinelli, Vitali and Zacchiroli provide algorithms to convert between restricted GODDAGs, ECLIX, LMNL, parallel documents in XML, contiguous stand-off markup and TexMECS.[36]
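
The idea of annotating ranges and serialising them as milestone-style XML can be sketched in a few lines. The following is an illustration only, assuming ranges are given as (name, start, end) character offsets with start before end; the function name to_milestones and the sID and eID attribute names are assumptions made for this example and should not be read as the exact syntax of CLIX or of any other serialisation listed above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_milestones(text, ranges):
    """Serialise (name, start, end) ranges over `text` as co-indexed empty elements."""
    markers = []
    for index, (name, start, end) in enumerate(ranges):
        # Sort key: offset first, then end markers (0) before start markers (1).
        markers.append((end, 0, index, f'<{name} eID="r{index}"/>'))
        markers.append((start, 1, index, f'<{name} sID="r{index}"/>'))
    markers.sort()

    out, position = [], 0
    for offset, _, _, tag in markers:
        out.append(escape(text[position:offset]))
        out.append(tag)
        position = offset
    out.append(escape(text[position:]))
    return "".join(out)


# Two ranges of the same kind that overlap each other (self-overlap).
xml = to_milestones("abcdef", [("q", 0, 4), ("q", 2, 6)])
print(xml)
ET.fromstring("<doc>" + xml + "</doc>")  # the result is well-formed XML
```

The shared index on each pair of markers is what allows two overlapping ranges of the same kind to be told apart when the document is read back.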

Graph-based formalisms

Rather than grounding markup information in a tree, standoff XML employs a data model based on directed graphs.[37] As an alternative to traditional markup, such graph-based data models can be represented with formalisms originally developed for generalised directed multigraphs, most notably the Resource Description Framework (RDF).[38][39] EARMARK is an early RDF/OWL representation that encompasses GODDAGs.[13]

RDF provides different linearizations, including an XML format that can be modeled to mirror conventional standoff XML, and a linearization that lets RDF be expressed in XML attributes (RDFa). But while it is semantically equivalent to standoff XML, it does not require special-purpose technology for storing, parsing and querying. Multiple interlinked RDF files representing a document or a corpus may constitute an example of Linguistic Linked Open Data.

References

  1. ^ a b Text Encoding Initiative.
  2. ^ a b DeRose 2004, The problem types.
  3. ^ Piez 2014.
  4. ^ Renear, Mylonas & Durand 1993.
  5. ^ Tennison 2008.
  6. ^ Ian Hickson (2002-11-21). "Tag Soup: How UAs handle <x> <y> </x> </y>". Retrieved 2017-11-05. 
  7. ^ Henri Sivonen (2003-08-16). "Tag Soup: How Mac IE 5 and Safari handle <x> <y> </x> </y>". Retrieved 2017-11-05. 
  8. ^ W3 Consortium (16 September 2014). "HTML5 (Proposed Recommendation) § 8.2.8 An introduction to error handling and strange cases in the parser". Retrieved 2014-10-14. 
  9. ^ Sperberg-McQueen & Huitfeldt 2000, 2.1. Non-SGML Notations.
  10. ^ "HTML Standard § 3.2.5.4 Paragraphs". Retrieved 2017-10-20. 
  11. ^ Sperberg-McQueen & Huitfeldt 2000, 2.2. CONCUR.
  12. ^ a b DeRose 2004.
  13. ^ a b Di Iorio, Peroni & Vitali 2009.
  14. ^ Durusau, Patrick (2006). OSIS Users Manual (OSIS Schema 2.1.1) (PDF). Retrieved 2014-10-14. 
  15. ^ Text Encoding Initiative, 20.1 Multiple Encodings of the Same Information.
  16. ^ Schmidt 2009.
  17. ^ La Fontaine 2016.
  18. ^ Text Encoding Initiative, 20.2 Boundary Marking with Empty Elements.
  19. ^ Sperberg-McQueen & Huitfeldt 2000, 2.4. Milestones.
  20. ^ DeRose 2004, TEI-style milestones.
  21. ^ Birnbaum & Thorsen 2015.
  22. ^ Haentjens Dekker & Birnbaum 2017.
  23. ^ Text Encoding Initiative, 20.3 Fragmentation and Reconstitution of Virtual Elements.
  24. ^ DeRose 2004, Segmentation.
  25. ^ Sperberg-McQueen & Huitfeldt 2000, 2.5. Fragmentation.
  26. ^ DeRose 2004, Joins.
  27. ^ Sperberg-McQueen & Huitfeldt 2000, 2.6. Standoff Markup.
  28. ^ DeRose 2004, CLIX and LMNL.
  29. ^ Piez, Wendell (August 2012). Luminescent: parsing LMNL by XSLT upconversion. Balisage: The Markup Conference 2012. Montréal. doi:10.4242/BalisageVol8.Piez01. Retrieved 2014-10-14. 
  30. ^ Sperberg-McQueen & Huitfeldt 2000, 2.7. MECS.
  31. ^ Sperberg-McQueen & Huitfeldt 2000.
  32. ^ Huitfeldt, Claus; Sperberg-McQueen, C M (2003). "TexMECS: An experimental markup meta-language for complex documents". Retrieved 2014-10-14. 
  33. ^ Hilbert, Schonefeld & Witt 2005.
  34. ^ Witt et al. 2007.
  35. ^ Schonefeld 2008.
  36. ^ Marinelli, Vitali & Zacchiroli 2008.
  37. ^ Ide & Suderman 2007.
  38. ^ Cassidy 2010.
  39. ^ Chiarcos 2012.