Common Voice

Developer(s)	Mozilla Foundation
Initial release	19 June 2017
Repository	https://github.com/mozilla/voice-web
Available in	Multilingual
License	Creative Commons CC0
Website	voice.mozilla.org

Common Voice is a crowdsourcing project started by Mozilla to create a free database for speech recognition software. The project is supported by volunteers who record sample sentences with a microphone and review recordings made by other users. The transcribed sentences are collected in a voice database released under the CC0 public domain dedication. This license ensures that developers can use the database for voice-to-text applications without restrictions or costs.

Aims

Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents.^[1]

Voice database

The English Common Voice database is the second-largest freely accessible voice database after LibriSpeech. By the time the first data were published on 29 November 2017, more than 20,000 users worldwide had recorded 400,000 validated sentences, with a total length of 500 hours.^[2]

In February 2019, the first batch of languages was released for use. It covered 18 languages, including English, French, German and Mandarin Chinese, as well as less prevalent languages such as Welsh and Kabyle. In total, the release contained almost 1,400 hours of recorded voice data from more than 42,000 contributors.^[3]

References

[1] "Why do we gender AI? Voice tech firms move to be more inclusive". The Guardian. 11 January 2020. Retrieved 19 April 2020.
[2] "Announcing the Initial Release of Mozilla's Open Source Speech Recognition Model and Voice Dataset". blog.mozilla.org. 29 November 2017.
[3] "Mozilla updates Common Voice dataset with 1,400 hours of speech across 18 languages". VentureBeat. 28 February 2019.