= Speech corpus =

A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions.
In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or speaker identification engine).
In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology, and other fields.

A corpus is a single such database; corpora, the plural of corpus, refers to several such databases.
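As an illustration of the definition above, a speech corpus can be modelled as a collection of entries that each pair an audio file with its text transcription. The following is a minimal sketch in Python, not the layout of any particular corpus: the class names `Utterance` and `SpeechCorpus`, the `speaker_id` field, and the file paths are all hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file paired with its transcription."""
    audio_path: Path      # hypothetical location of the recording
    transcript: str       # text transcription of that recording
    speaker_id: str = ""  # optional metadata (an assumption, not part of the definition)


@dataclass
class SpeechCorpus:
    """A named collection of utterances."""
    name: str
    utterances: list[Utterance] = field(default_factory=list)

    def add(self, audio_path: str, transcript: str, speaker_id: str = "") -> None:
        """Append one audio/transcript pair to the corpus."""
        self.utterances.append(Utterance(Path(audio_path), transcript, speaker_id))

    def vocabulary(self) -> set[str]:
        """Return the lower-cased, whitespace-separated tokens of all transcripts."""
        return {w for u in self.utterances for w in u.transcript.lower().split()}


# Build a tiny illustrative corpus; the paths do not need to exist.
corpus = SpeechCorpus("demo")
corpus.add("audio/0001.wav", "one two three", speaker_id="spk1")
corpus.add("audio/0002.wav", "Three four")
```

Real corpora usually store further information, such as time alignments or speaker demographics, which this sketch omits.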

There are two types of speech corpora:

1. Read Speech, which includes:
   * Book excerpts
   * Broadcast news
   * Lists of words
   * Sequences of numbers
2. Spontaneous Speech, which includes:
   * Dialogs – between two or more people (includes meetings; one such corpus is the KEC);
   * Narratives – a person telling a story (one such corpus is the Buckeye Corpus);
   * Map-tasks – one person explains a route on a map to another;
   * Appointment-tasks – two people try to find a common meeting time based on individual schedules.

A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.

== See also ==
- Arabic Speech Corpus
- Common Voice
- EXMARaLDA
- Lingua Libre, an online libre tool
- List of children's speech corpora
- Non-native speech database
- Praat
- Spoken English Corpus
- The BABEL Speech Corpus
- TIMIT
- Transcriber
- Transcription (linguistics)
