A non-native speech database is a speech database containing non-native pronunciations of a language, most often English. Such databases are essential for the ongoing development of multilingual automatic speech recognition systems, text-to-speech systems, pronunciation trainers, and even fully featured second-language learning systems. Because these databases are comparatively small, however, many of them are not available through the common distributors of speech databases. As a result, it is hard for researchers in speech recognition to keep track of which kinds of databases have already been collected and for which purposes no collections exist yet.[1]


The table of non-native databases uses abbreviations for language names, which are listed in Table 1. Table 2 gives the following information about each corpus: the name of the corpus; the institution where the corpus can be obtained, or where at least further information should be available; the language actually spoken by the speakers; the number of speakers; the native language of the speakers; the total number of non-native utterances the corpus contains; the duration in hours of the non-native part; the date of the first public reference to the corpus; free text highlighting special aspects of the database; and a reference to a publication. In most cases the reference points to the paper in which the original collectors describe the corpus. Where no such paper could be identified, a paper that uses the corpus is referenced instead.

Some entries are left blank and others are marked as unknown. Blank entries are attributes whose value is simply not known. Entries marked unknown, however, indicate that the database itself contains no information about the attribute. For example, the Jupiter weather database[2] gives no information about the origin of its speakers, so its data is less useful for verifying accent detection or similar tasks.
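
The distinction between blank and unknown entries can be made explicit when the table is processed by a program. The following is a minimal sketch, assuming each row has already been split into its ten cells; the names `Missing`, `CorpusEntry`, `parse_cell` and `has_speaker_origin` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Missing(Enum):
    """Two kinds of missing value, following the convention described above."""
    BLANK = "blank"      # the value is simply not known
    UNKNOWN = "unknown"  # the corpus itself holds no information on this attribute


Field = Union[str, Missing]


def parse_cell(cell: str) -> Field:
    """Map an empty cell to BLANK, the word 'unknown' to UNKNOWN, else keep the text."""
    text = cell.strip()
    if not text:
        return Missing.BLANK
    if text.lower() == "unknown":
        return Missing.UNKNOWN
    return text


@dataclass
class CorpusEntry:
    # Field order follows the column order of Table 2.
    corpus: str
    author: Field
    available_at: Field
    languages: Field
    speakers: Field
    native_language: Field
    utterances: Field
    duration: Field
    date: Field
    remarks: Field


def has_speaker_origin(entry: CorpusEntry) -> bool:
    """True only if a native language is actually recorded for the corpus."""
    return isinstance(entry.native_language, str)


# The Jupiter row of Table 2, split into cells (empty string = blank entry).
cells = ["Jupiter [26]", "Zue", "MIT", "E", "unknown", "unknown",
         "5146", "", "1999", "telephone speech"]
jupiter = CorpusEntry(cells[0], *[parse_cell(c) for c in cells[1:]])
```

Here `jupiter.native_language` is `Missing.UNKNOWN` while `jupiter.duration` is `Missing.BLANK`, so a filter such as `has_speaker_origin` can exclude corpora that are unsuitable for accent-related experiments.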

Where possible, the name given is the standard name of the corpus. Some of the smaller corpora, however, had no established name, so an identifier had to be created; in such cases a combination of the institution and the collector of the database is used.

Where a database contains both native and non-native speech, only the attributes of the non-native part of the corpus are listed. Most of the corpora are collections of read speech. If a corpus instead consists partly or completely of spontaneous utterances, this is noted in the Remarks column.

Overview of non-native databases

Table 1: Abbreviations for languages used in Table 2
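| Language | Abbreviation | Language | Abbreviation |
|---|---|---|---|
| Arabic | A | Japanese | J |
| Chinese | C | Korean | K |
| Czech | Cze | Malaysian | M |
| Danish | D | Norwegian | N |
| Dutch | Dut | Portuguese | P |
| English | E | Russian | R |
| French | F | Spanish | S |
| German | G | Swedish | Swe |
| Greek | Gre | Thai | T |
| Indonesian | Ind | Vietnamese | V |
| Italian | I | | |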
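
The abbreviations of Table 1 can be expanded mechanically when reading the language columns of Table 2. The following is a minimal sketch; the names `LANGUAGE_ABBREVIATIONS` and `expand_languages` are hypothetical, and tokens that do not appear in Table 1 (including free text such as "50 countries") are deliberately left unchanged rather than guessed.

```python
# Abbreviations exactly as listed in Table 1.
LANGUAGE_ABBREVIATIONS = {
    "A": "Arabic", "C": "Chinese", "Cze": "Czech", "D": "Danish",
    "Dut": "Dutch", "E": "English", "F": "French", "G": "German",
    "Gre": "Greek", "Ind": "Indonesian", "I": "Italian", "J": "Japanese",
    "K": "Korean", "M": "Malaysian", "N": "Norwegian", "P": "Portuguese",
    "R": "Russian", "S": "Spanish", "Swe": "Swedish", "T": "Thai",
    "V": "Vietnamese",
}


def expand_languages(cell: str) -> str:
    """Replace each whitespace-separated abbreviation with its full name.

    Tokens not found in Table 1 pass through unchanged.
    """
    return " ".join(LANGUAGE_ABBREVIATIONS.get(tok, tok) for tok in cell.split())


print(expand_languages("C G F J Ind"))
# Chinese German French Japanese Indonesian
```

Passing unrecognised tokens through keeps cells such as "22 countries" readable and makes any abbreviation that is absent from Table 1 easy to spot in the output.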

The actual table with information about the different databases is shown in Table 2.

Table 2: Overview of non-native Databases
| Corpus | Author | Available at | Languages | #Speakers | Native Language | #Utt. | Duration | Date | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| AMI [3] | | EU | E | | Dut and other | | 100h | | meeting recordings |
| ATR-Gruhn [4] | Gruhn | ATR | E | 96 | C G F J Ind | 15000 | | 2004 | proficiency rating |
| BAS Strange Corpus I+II [5] | | ELRA | G | 139 | 50 countries | 7500 | | 1998 | |
| Berkeley Restaurant [6] | | ICSI | E | 55 | G I H C F S J | 2500 | | 1994 | |
| Broadcast News [7] | | LDC | E | | | | | 1997 | |
| Cambridge-Witt [8] | Witt | U. Cambridge | E | 10 | J I K S | 1200 | | 1999 | |
| Cambridge-Ye [9] | Ye | U. Cambridge | E | 20 | C | 1600 | | 2005 | |
| Children News [10] | Tomokiyo | CMU | E | 62 | J C | 7500 | | 2000 | partly spontaneous |
| CLIPS-IMAG [11] | Tan | CLIPS-IMAG | F | 15 | C V | | 6h | 2006 | |
| CLSU [12] | | LDC | E | | 22 countries | 5000 | | 2007 | telephone, spontaneous |
| CMU [13] | | CMU | E | 64 | G | 452 | 0.9h | | not available |
| Cross Towns [14] | Schaden | U. Bochum | E F G I Cze Dut | 161 | E F G I S | 72000 | 133h | 2006 | city names |
| Duke-Arslan [15] | Arslan | Duke University | E | 93 | 15 countries | 2200 | | 1995 | partly telephone speech |
| ERJ [16] | Minematsu | U. Tokyo | E | 200 | J | 68000 | | 2002 | proficiency rating |
| Fischer [17] | | LDC | E | | many | | 200h | | telephone speech |
| Fitt [18] | Fitt | U. Edinburgh | F I N Gre | 10 | E | 700 | | 1995 | city names |
| Fraenki [19] | | U. Erlangen | E | 19 | G | 2148 | | | |
| Hispanic [20] | Byrne | | E | 22 | S | | 20h | 1998 | partly spontaneous |
| HLTC [21] | | HKUST | E | 44 | C | | 3h | 2010 | available on request |
| IBM-Fischer [22] | | IBM | E | 40 | S F G I | 2000 | | 2002 | digits |
| iCALL [23][24] | Chen | I2R, A*STAR | C | 305 | 24 countries | 90841 | 142h | 2015 | phonetic and tonal transcriptions (in Pinyin), proficiency ratings |
| ISLE [25] | Atwell | EU/ELDA | E | 46 | G I | 4000 | 18h | 2000 | |
| Jupiter [26] | Zue | MIT | E | unknown | unknown | 5146 | | 1999 | telephone speech |
| K-SEC [27] | Rhee | SiTEC | E | unknown | K | | | 2004 | |
| LDC WSJ1 [28] | | LDC | | 10 | | 800 | 1h | 1994 | |
| LeaP [29] | Gut | University of Münster | E G | 127 | 41 different ones | 73.941 words | 12h | 2003 | |
| MIST [30] | | ELRA | E F G | 75 | Dut | 2200 | | 1996 | |
| NATO HIWIRE [31] | | NATO | E | 81 | F Gre I S | 8100 | | 2007 | clean speech |
| NATO M-ATC [32] | Pigeon | NATO | E | 622 | F G I S | 9833 | 17h | 2007 | heavy background noise |
| NATO N4 [33] | | NATO | E | 115 | unknown | | 7.5h | 2006 | heavy background noise |
| Onomastica [34] | | | D Dut E F G Gre I N P S Swe | | | (121000) | | 1995 | only lexicon |
| PF-STAR [35] | | U. Erlangen | E | 57 | G | 4627 | 3.4h | 2005 | children speech |
| Sunstar [36] | | EU | E | 100 | G S I P D | 40000 | | 1992 | parliament speech |
| TC-STAR [37] | Heuvel | ELDA | E S | unknown | EU countries | | 13h | 2006 | multiple data sets |
| TED [38] | Lamel | ELDA | E | 40(188) | many | | 10h(47h) | 1994 | eurospeech 93 |
| TLTS [39] | | DARPA | A | | E | | 1h | 2004 | |
| Tokyo-Kikuko [40] | | U. Tokyo | J | 140 | 10 countries | 35000 | | 2004 | proficiency rating |
| Verbmobil [41] | | U. Munich | E | 44 | G | | 1.5h | 1994 | very spontaneous |
| VODIS [42] | | EU | F G | 178 | F G | 2500 | | 1998 | about car navigation |
| WP Arabic [43] | Rocca | LDC | A | 35 | E | 800 | 1h | 2002 | |
| WP Russian [44] | Rocca | LDC | R | 26 | E | 2500 | 2h | 2003 | |
| WP Spanish [45] | Morgan | LDC | S | | E | | | 2006 | |
| WSJ Spoke [46] | | | E | 10 | unknown | 800 | | 1993 | |


