Survey of English Usage
The Survey of English Usage was the first research centre in Europe to carry out research with corpora. The Survey is based in the Department of English Language and Literature at University College London.
The Survey of English Usage was founded in 1959 by Randolph (now Lord) Quirk. Many well-known linguists have spent time doing research at the Survey, including Bas Aarts, Valerie Adams, John Algeo, Dwight Bolinger, Noël Burton-Roberts, David Crystal, Derek Davy, Jan Firbas, Sidney Greenbaum, Liliane Haegeman, Robert Ilson, Ruth Kempson, Geoffrey Leech, Jan Rusiecki, Jan Svartvik, and Joe Taglicht.
The original Survey Corpus predated modern computing. It was recorded on reel-to-reel tapes, transcribed on paper, filed in filing cabinets, and indexed on paper cards. Transcriptions were annotated with a detailed prosodic and paralinguistic annotation developed by Crystal and Quirk (1964). Sets of paper cards were manually annotated for grammatical structures and filed, so, for example, all noun phrases could be found in the noun phrase filing cabinet in the Survey. Naturally, corpus searches required a visit to the Survey.
This corpus is now known more widely as the London-Lund Corpus (LLC), as it was the responsibility of co-workers in Lund, Sweden, to computerise the corpus. Thirty-four of the spoken texts were published in book form as Svartvik and Quirk (1980), and the corpus was used as the basis for the famous Comprehensive Grammar (Quirk et al. 1985).
In 1988 Sidney Greenbaum proposed a new project, ICE, the International Corpus of English. ICE was to be an international project, carried out at research centres around the world, to compile corpora of English varieties where English was the first or second official language. ICE texts would contain spoken and written English in a balanced sample of one million words per component so that these samples could be compared in a wide varieties of ways. The ICE project continues around the world to the present day.
ICE-GB, the British Component of ICE, was compiled at the Survey. ICE-GB was annotated to a very detailed level, including constructing a full grammatical analysis (parse) for every sentence in the corpus. The first release of ICE-GB took place in 1998. ICE-GB was distributed with software for searching and exploring the parsed corpus called ICECUP. Release 2 of ICE-GB has now been released and is available on CD.
As well as contrasting varieties of English, many researchers are interested in language development and change over time. A recent project at the Survey undertook the parsing of a large (400,000 word) selection of the spoken part of the LLC in a manner directly comparable with ICE-GB, forming a new, 800,000 word diachronic corpus, called the Diachronic Corpus of Present-Day Spoken English (DCPSE). DCPSE has now been released and is available on CD from the Survey.
These two corpora comprise the largest collection of parsed and corrected, orthographically transcribed spoken English language data in the world, with over one million words of spoken English in this form.
Parsed corpora are large databases containing detailed grammatical tree structures. One of the consequences of forming large collections of valuable linguistic data is a pressing need for methods and tools to help researchers and other users make the most of them. So in parallel with the parsing of natural language data, the Survey team have carried out research and development of software tools to help linguists use these corpora. The ICECUP research platform uses an intuitive grammatical query representation called Fuzzy Tree Fragments (FTFs) to search parsed corpora.
Linguistic research with corpora
As well as distributing corpora and tools to the corpus linguistics research community, the SEU carries out research into English language. Recent projects include research on the English Noun Phrase, Subordination in Spoken and Written English, and the English Verb Phrase. The Survey also provides support for PhD students who carry out research into English language corpora.
- Crystal, David, and Quirk, Randolph (1964). Systems of Prosodic and Paralinguistic Features in English. The Hague: Mouton.
- Svartvik, Jan and Quirk, Randolph (1980) (eds.). A Corpus of English Conversation Lund: CWK Gleerup.
- Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language London: Longman.