This repository contains scripts to reproduce a merged version of multiple open-source german speech datasets. For german there is no large speech corpus for automatic speech recognition tasks, as in english with for example librispeech. Therefore this repository combines multiple german speech corpora into a single one. Check licenses in the list below or on the sites of the specific datasets, if you want use the data for any special purposes.
In order to recreate the same corpus as in this repository,
execute the commands in the scripts recreate.sh
.
The scripts does the following steps.
-
Download all corpora to
data/download
. Only the common-voice corpus has to be downloaded manually and placed insidedata/download/common_voice
. -
Merges all corpora into a single one. Furthermore creates specific subsets for train/dev/test.
-
Checks if the created corpus is equal to the given state of the repository. This is done by comparing hash values against the hash values in the file
data/state.json
. -
If needed the corpus can be converted to wave files only. This will make sure every utterance is in a separate wave file with a sampling rate of 16000.
The final corpus is stored in data/full
.
The format of the corpus is the default format of the audiomate library.
It is described in audiomate default format.
Audiomate also can be used to read the corpus:
import audiomate
corpus = audiomate.Corpus.load('data/full')
utt = corpus.utterances['utt-idx']
transcript = utt.label_lists[audiomate.corpus.LL_WORD_TRANSCRIPT].join()
samples = utt.read_samples(sr=16000)
Checkout https://github.com/ynop/audiomate for more information.
Part | h | Speakers |
---|---|---|
unfiltered | 1021.31 | not known due to the absence of info in M-Ailabs |
train | 536.90 | not known due to the absence of info in M-Ailabs |
dev | 17.75 | 1151 |
test | 18.22 | 2037 |
full_common_voice | 324.19 | 4852 |
train_common_voice | 10.20 | 552 |
dev_common_voice | 7.04 | 1010 |
test_common_voice | 7.71 | 1901 |
full_mailabs | 233.66 | - |
train_mailabs | 233.50 | - |
dev_mailabs | 0.00 | 0 |
test_mailabs | 0.00 | 0 |
full_swc | 248.47 | 569 |
train_swc | 238.01 | 527 |
dev_swc | 4.26 | 26 |
test_swc | 4.18 | 16 |
full_tuda | 183.30 | 179 |
train_tuda | 31.49 | 146 |
dev_tuda | 2.41 | 16 |
test_tuda | 2.38 | 17 |
full_voxforge | 31.69 | 328 |
train_voxforge | 23.70 | 126 |
dev_voxforge | 4.04 | 99 |
test_voxforge | 3.96 | 103 |
Name | URL | License |
---|---|---|
Common-Voice | https://voice.mozilla.org/en/datasets | CC-0 |
TuDa | https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/acoustic-models.html | CC-BY |
M-AILabs | https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/ | See Page |
VoxForge | http://www.voxforge.org/de | GPL |
SWC | https://nats.gitlab.io/swc/ | CC BY-SA 4.0 |
The scripts create.sh
contains the commands to create a new version of the corpus.
Version | Changes |
---|---|
v1 | Initial version |
v2 | Smaller test sets, Filter long utterances (> 25s) |