Treebank

Most syntactic treebanks annotate variants of either phrase structure (left) or dependency structure (right).

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefited from large-scale empirical data.[1] The exploitation of treebank data has been important ever since the first large-scale treebank, the Penn Treebank, was published. Although treebanks originated in computational linguistics, their value is becoming more widely appreciated in linguistics research as a whole. For example, annotated treebank data has been crucial in syntactic research to test linguistic theories of sentence structure against large quantities of naturally occurring examples.

Etymology

The term treebank was coined by linguist Geoffrey Leech in the 1980s, by analogy to other repositories such as a seed bank or blood bank.[2] The "tree" in the name reflects the fact that both syntactic and semantic structure are commonly represented compositionally as a tree structure. The term parsed corpus is often used interchangeably with the term treebank, with the emphasis on the primacy of sentences rather than trees.

Construction

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.

Example phrase structure tree for John loves Mary
Hybrid constituency/dependency tree from the Quranic Arabic Corpus

Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the BulTreeBank follows HPSG) but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure (for example the Penn Treebank or ICE-GB) and those that annotate dependency structure (for example the Prague Dependency Treebank or the Quranic Arabic Dependency Treebank).
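
The difference between the two groups can be illustrated with a minimal Python sketch that encodes John loves Mary in both styles. The data layout, the Token class and the relation labels (nsubj, obj, root) are assumptions chosen for illustration; they are not the annotation scheme of any treebank named above.

```python
from dataclasses import dataclass

# Phrase structure: nested (label, children) tuples; leaves are plain words.
phrase_structure = (
    "S",
    [("NP", ["John"]), ("VP", ["loves", ("NP", ["Mary"])])],
)


@dataclass(frozen=True)
class Token:
    """One word in a dependency analysis (hypothetical layout)."""
    index: int     # 1-based position in the sentence
    form: str      # the word itself
    head: int      # index of the governing word, 0 for the root
    relation: str  # label of the link to the head


# Dependency structure: one entry per word, no phrasal nodes.
dependency = [
    Token(1, "John", 2, "nsubj"),
    Token(2, "loves", 0, "root"),
    Token(3, "Mary", 2, "obj"),
]


def phrase_labels(tree):
    """List the non-terminal labels of a phrase structure tree, top down."""
    label, children = tree
    labels = [label]
    for child in children:
        if not isinstance(child, str):
            labels.extend(phrase_labels(child))
    return labels


def dependents(tokens, head_index):
    """Return the words directly governed by the word at head_index."""
    return [t.form for t in tokens if t.head == head_index]


print(phrase_labels(phrase_structure))  # ['S', 'NP', 'VP', 'NP']
print(dependents(dependency, 2))        # ['John', 'Mary']
```

The phrase structure encoding introduces nodes (S, NP, VP) that correspond to no single word, whereas the dependency encoding contains only the words and labelled links between them.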

It is important to clarify the distinction between the formal representation and the file format used to store the annotated data. Treebanks are necessarily constructed according to a particular grammar. The same grammar may be implemented by different file formats. For example, the syntactic analysis for John loves Mary, shown in the figure on the right, may be represented by simple labelled brackets in a text file, like this (following the Penn Treebank notation):

(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))

This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific XML schemes, numbered indentation and various types of standoff notation.
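
Labelled bracket notation is also simple to read by program. The following Python sketch parses a single bracketed tree into nested tuples and recovers the words of the sentence. The function names parse_brackets and leaves are hypothetical, and the reader performs only minimal validation; it does not handle the traces, function tags or escaped brackets found in real Penn Treebank files.

```python
def parse_brackets(text):
    """Parse one labelled-bracket tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        if pos >= len(tokens):
            raise ValueError("missing node label")
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree in sentence order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


EXAMPLE = """
(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))
"""

tree = parse_brackets(EXAMPLE)
print(tree[0])       # S
print(leaves(tree))  # ['John', 'loves', 'Mary', '.']
```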

Applications

From a computational perspective, treebanks have been used to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems.[3] Most computational systems utilize gold-standard treebank data. However, an automatically parsed corpus that is not corrected by human linguists can still be useful: a parser may be improved by applying it to large amounts of text and gathering evidence of rule frequencies. Nevertheless, only by correcting and completing a corpus by hand is it possible to identify rules absent from the parser's knowledge base. In addition, frequencies derived from a hand-corrected corpus are likely to be more accurate.
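
Gathering rule frequencies can be sketched in a few lines of Python. The example below counts the rewrite rules in a two-sentence toy treebank and converts the counts to relative frequencies (the count of a rule divided by the count of its left-hand side), which is one common way of estimating rule probabilities. The toy trees, the nested-tuple layout and the function names are assumptions made for illustration.

```python
from collections import Counter

# Toy treebank: nested (label, children) tuples; leaves are plain words.
TREES = [
    ("S", [("NP", [("NNP", ["John"])]),
           ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])]),
           (".", ["."])]),
    ("S", [("NP", [("NNP", ["Mary"])]),
           ("VP", [("VBZ", ["sleeps"])]),
           (".", ["."])]),
]


def productions(tree):
    """Yield every (lhs, rhs) rule used in a tree, lexical rules included."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def rule_probabilities(trees):
    """Estimate rule probabilities as count(rule) / count(left-hand side)."""
    rule_counts = Counter(rule for tree in trees for rule in productions(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


probs = rule_probabilities(TREES)
print(probs[("VP", ("VBZ", "NP"))])  # 0.5
print(probs[("NP", ("NNP",))])       # 1.0
```

A rule that never occurs in the trees receives no entry at all, which reflects the point made above: rules missing from an automatic parser's output cannot be discovered by counting that output.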

In corpus linguistics, treebanks are used to study syntactic phenomena (for example, diachronic corpora can be used to study the time course of syntactic change). Once parsed, a corpus will contain frequency evidence showing how common different grammatical structures are in use. Treebanks also provide evidence of coverage and support the discovery of new, unanticipated, grammatical phenomena.

Another use of treebanks in theoretical linguistics and psycholinguistics is to provide interaction evidence. A completed treebank can help linguists carry out experiments on how the decision to use one grammatical construction tends to influence the decision to form others, and to try to understand how speakers and writers make decisions as they form sentences. Interaction research is particularly fruitful as further layers of annotation (e.g. semantic and pragmatic) are added to a corpus. It is then possible to evaluate the impact of non-syntactic phenomena on grammatical choices.

Semantic treebanks

A semantic treebank is a collection of natural language sentences annotated with a meaning representation. These resources use a formal representation of each sentence's semantic structure. Semantic treebanks vary in the depth of their semantic representation. A notable example of deep semantic annotation is the Groningen Meaning Bank, developed at the University of Groningen and annotated using Discourse Representation Theory. An example of a shallow semantic treebank is PropBank, which provides annotation of verbal propositions and their arguments, without attempting to represent every word in the corpus in logical form.

Syntactic treebanks

Many syntactic treebanks have been developed for a wide variety of languages:

Language | Treebank | Syntactic Formalism | Distribution / License
Arabic | Penn Arabic Treebank | Phrase structure | Linguistic Data Consortium
Arabic | Prague Arabic Dependency Treebank (PADT) | Dependency | Linguistic Data Consortium
Arabic | Columbia Arabic Treebank (CATiB) | Dependency | Linguistic Data Consortium
Arabic (classical) | Quranic Arabic Dependency Treebank (QADT) | Dependency | Open source (GNU general public license)
Bulgarian | BulTreeBank | HPSG | Freely available for research
Catalan | Cat3LB | Phrase structure | Freely available for research
Chinese | Penn Chinese Treebank | Phrase structure | Linguistic Data Consortium
Chinese | Sinica Treebank | Case grammar | Not freely available
Chinese | Chinese Dependency Treebank | Dependency | Linguistic Data Consortium
Croatian | Croatian Dependency Treebank | Dependency | Open source (Creative Commons license)
Czech | Prague Dependency Treebank | Dependency | Linguistic Data Consortium
Danish | Danish Dependency Treebank | Dependency | Open source (GNU general public license)
Danish | Arboretum: A syntactic tree corpus of Danish | Phrase structure | License fee
Dutch | Spoken Dutch Corpus (CGN) | Phrase structure | License fee
Dutch | Alpino Treebank | Dependency | Open source (GNU general public license)
Dutch | LASSY Small and Large | Dependency | License fee
English | Penn Treebank | Phrase structure | Linguistic Data Consortium
English | CCGbank | Combinatory categorial grammar | Linguistic Data Consortium
English | Prague English Dependency Treebank | Dependency | Linguistic Data Consortium
English | BLLIP WSJ corpus | Phrase structure | Linguistic Data Consortium
English | British Component of the International Corpus of English (ICE-GB) | Phrase structure | License fee
English | Diachronic Corpus of Present-Day Spoken English (DCPSE) | Phrase structure | License fee
English | Lancaster Parsed Corpus | Phrase structure | ?
English | Susanne Corpus | Phrase structure | Freely available for research
English | Christine Corpus | Phrase structure | Freely available for research
English | Lucy Corpus | Phrase structure | Freely available for research
English | Tübingen Treebank of English / Spontaneous Speech (TüBa-E/S) | HPSG | Freely available for research
English | LinGO Redwoods | HPSG | ?
English | Multi-Treebank | Phrase structure | Available online for comparison purposes
English | The PARC 700 Dependency Bank | Dependency | ?
English | CHILDES Brown Eve corpus with dependency annotation | Dependency | Open source (Creative Commons license)
English | SMULTRON - Parallel Treebank EN-DE-SV | Phrase structure | Freely available for research
English (historical) | Penn Parsed Corpora of Historical English | Phrase structure | License fee
English (historical) | York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) | Phrase structure | Freely available for research
Estonian | Syntactically analyzed and disambiguated text corpus | ? | ?
Estonian | Arborest | Phrase structure | ?
Finnish | Turku Dependency Treebank (TDT) | Dependency | Open source (Creative Commons license)
French | Paris 7 | Phrase structure | Freely available for research
French (spoken) | Rhapsodie | Dependency and macrosyntactic annotation | Open source (Creative Commons license)
French | L'Arboratoire | Phrase structure | ?
French (historical) | Corpus MCVF | Phrase structure | Freely available for research
German | NEGRA | Phrase structure | Freely available for research
German | TIGER | Phrase structure | Freely available for research
German | Tübingen Treebank of Written German (TüBa-D/Z) | Phrase structure | Freely available for research
German | Tübingen Treebank of German / Spontaneous Speech (TüBa-D/S) | Phrase structure | Freely available for research
German | Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) | Phrase structure | License fee
German | SMULTRON - Parallel Treebank EN-DE-SV | Phrase structure | Freely available for research
Greek | Greek Dependency Treebank | Dependency | Not freely available
Greek (ancient) | Ancient Greek Dependency Treebank | Dependency | Open source (Creative Commons license)
Greek (ancient) | PROIEL Corpus | ? | ?
Hebrew | Hebrew Dependency Treebank | Dependency | Open source (GNU general public license)
Hindi | AnnCorra | Dependency | ?
Hungarian | Hungarian Treebank | Phrase structure | ?
Icelandic | IcePaHC - Icelandic Parsed Historical Corpus | Phrase structure | Open source (GNU Lesser General Public License)
Italian | TUT - Turin University Treebank | Dependency | Open source (Creative Commons license)
Italian | VIT - Venice Italian Treebank | Phrase structure and dependency | License fee
Italian | ISST - Italian Syntactic-Semantic Treebank | Phrase structure | License fee
Italian | SUT - Siena University Treebank | ? | ?
Japanese | ATR Dependency corpus | Dependency | ?
Japanese | Kyoto Text Corpus | ? | ?
Japanese | Tübingen Treebank of Japanese / Spontaneous Speech (TüBa-J/S) | Phrase structure | Freely available for research
Korean | Korean Treebank | Phrase structure | Linguistic Data Consortium
Latin | Latin Dependency Treebank | Dependency | Open source (Creative Commons license)
Latin | Index Thomisticus Treebank | Dependency | Open source (Creative Commons license)
Latin | PROIEL Corpus | ? | ?
Norwegian | INESS treebanking infrastructure | ? | ?
Persian | PerTreeBank | HPSG | Freely available for research
Persian | Persian Dependency Treebank (PerDT) | Dependency | Freely available for research
Polish | A Treebank / Test Suite for Polish | HPSG | ?
Polish | Składnica | Phrase structure and dependency | Open source (GNU general public license)
Portuguese | Projecto Floresta Sintá(c)tica | ? | ?
Portuguese (historical) | Tycho Brahe corpus | Phrase structure | ?
Romanian | Romanian Dependency Treebank | Dependency | ?
Russian | SynTagRus Dependency Treebank (Russian National Corpus) | Dependency | ?
Slovene | Slovene Dependency Treebank | Dependency | Freely available for research
Spanish | Cast3LB | Phrase structure and dependency | Freely available for research
Spanish | UAM Treebank of Spanish | Phrase structure | Freely available for research
Swedish | Talbanken05 | Phrase structure and dependency | Freely available for research
Swedish | Swedish Treebank | Phrase structure | Freely available for research
Swedish | SMULTRON - Parallel Treebank EN-DE-SV | Phrase structure | Freely available for research
Thai | NAiST Thai Treebank | Dependency | Open source (GNU general public license)
Turkish | METU-Sabanci Turkish Treebank | Dependency | Freely available for research
Urdu | NU-FAST Treebank | Phrase structure | ?
Vietnamese | Vietnamese Treebank | Phrase structure | Freely available for research

Search tools

One of the key ways to extract evidence from a treebank is through search tools. Search tools for parsed corpora typically depend on the annotation scheme that was applied to the corpus. User interfaces range in sophistication from expression-based query systems aimed at computer programmers to full exploration environments aimed at general linguists. Wallis (2008) discusses the principles of searching treebanks in detail and reviews the state of the art.[5]
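
As a rough illustration of an expression-based query, the following Python sketch searches a phrase structure tree for nodes of one category that are immediately dominated by a node of another category. The nested-tuple tree layout and the function names are assumptions made for this example; the sketch does not reproduce the query language of any existing treebank search tool.

```python
# A phrase structure tree as nested (label, children) tuples; leaves are words.
TREE = ("S", [("NP", [("NNP", ["John"])]),
              ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])]),
              (".", ["."])])


def subtrees(tree):
    """Yield the tree itself and every constituent below it, top down."""
    yield tree
    for child in tree[1]:
        if not isinstance(child, str):
            yield from subtrees(child)


def words(tree):
    """Return the words covered by a constituent, in sentence order."""
    result = []
    for child in tree[1]:
        if isinstance(child, str):
            result.append(child)
        else:
            result.extend(words(child))
    return result


def find_children(tree, parent_label, child_label):
    """Find child_label nodes immediately dominated by a parent_label node."""
    hits = []
    for node in subtrees(tree):
        if node[0] != parent_label:
            continue
        for child in node[1]:
            if not isinstance(child, str) and child[0] == child_label:
                hits.append(words(child))
    return hits


print(find_children(TREE, "VP", "NP"))  # [['Mary']]
print(find_children(TREE, "S", "NP"))   # [['John']]
```

Because the query refers to category labels such as NP and VP, it only works on corpora annotated with those labels, which is one reason search tools typically depend on the annotation scheme.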

References

  1. ^ Alexander Clark, Chris Fox and Shalom Lappin (2010). The handbook of computational linguistics and natural language processing. Wiley.
  2. ^ Sampson, G. (2003) ‘Reflections of a dendrographer.’ In A. Wilson, P. Rayson and T. McEnery (eds.) Corpus Linguistics by the Lune: A Festschrift for Geoffrey Leech, Frankfurt am Main: Peter Lang, pp. 157-184
  3. ^ Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li, and Ling Zhu (September 2013). "Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation". Proceedings of the GSCL 2013. LNCS Vol. 8105, pp. 119-131. Springer-Verlag Berlin Heidelberg. 
  4. ^ Kais Dukes (2013). Semantic Annotation of Robotic Spatial Commands. Language and Technology Conference (LTC). Poznan, Poland.
  5. ^ Wallis, Sean (2008). Searching treebanks and other structured corpora. Chapter 34 in Lüdeling, A. & Kytö, M. (ed.) Corpus Linguistics: An International Handbook. Handbücher zur Sprache und Kommunikationswissenschaft series. Berlin: Mouton de Gruyter.