Commit defb953

committed
Added README and LICENSE
1 parent 8f277f7 commit defb953

5 files changed: +307 -0 lines changed

LICENSE

+21
@@ -0,0 +1,21 @@
MIT License

Copyright (c) [year] [fullname]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

+122
@@ -0,0 +1,122 @@
# Deepspeech Català

An ASR model created with the Mozilla [DeepSpeech](https://github.com/mozilla/DeepSpeech) engine. (Jump to [English](#deepspeech-catalan-asr-model))

You can download the latest version [here](https://github.com/ccoreilly/deepspeech-catala/releases).

You can try the model by sending a voice message to the Telegram bot [DeepSpeechCatalà](https://t.me/DeepSpeechCatalaBot).

## Motivation

The main motivation is to learn, so the model evolves constantly as I keep experimenting. I was also curious to find out what is possible with the current free [CommonVoice](https://voice.mozilla.org/ca/datasets) corpus (the answer should motivate everyone to contribute to it even more).

## Usage

Download the model and the scorer and use the deepspeech inference engine to transcribe an audio file (16 kHz mono WAV):

```
$ pip install [email protected]
$ deepspeech --model deepspeech-catala-0.6.0.pbmm --scorer kenlm.scorer --audio file.wav
```
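
The same transcription can also be done from Python. A minimal sketch, assuming the `deepspeech` 0.7+ Python package and `numpy` are installed and reusing the file names from the example above:

```
import wave

import numpy as np
from deepspeech import Model

ds = Model("deepspeech-catala-0.6.0.pbmm")   # acoustic model
ds.enableExternalScorer("kenlm.scorer")      # KenLM language model scorer

# The audio must be 16 kHz, 16-bit, mono PCM, as noted above.
with wave.open("file.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```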

## Model comparison

Below is a comparison of the different model versions, the corpus each one was trained on and the evaluation results.

Versions prior to 0.4.0 used an alphabet without accented vowels, so they are not considered representative.

### ParlamentParla evaluation corpus

Note: For version 0.6.0 of the model I combined the full CommonVoice corpus (train, dev and test) with [ParlamentParlaClean](https://collectivat.cat/asr), shuffled the result and split it into three sets: train (75%), dev (20%) and test (5%). This increased the amount of training data. Because the resulting test set contains CommonVoice data that could have been used to train the other models, all models were evaluated exclusively on 1713 sentences that no model has ever seen (all of them from the ParlamentParlaClean corpus).
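
A minimal sketch of that combine/shuffle/split step, assuming the corpora have already been exported to DeepSpeech-style CSVs (the file names are hypothetical; the actual data files are not part of this commit):

```
import random

rows = []
for path in ("commonvoice_all.csv", "parlamentparla_clean.csv"):
    with open(path, encoding="utf-8") as f:
        header = f.readline()          # wav_filename,wav_filesize,transcript
        rows.extend(f.readlines())

random.shuffle(rows)
n = len(rows)
sets = {
    "train.csv": rows[: int(n * 0.75)],               # 75%
    "dev.csv": rows[int(n * 0.75) : int(n * 0.95)],   # 20%
    "test.csv": rows[int(n * 0.95) :],                # 5%
}
for name, subset in sets.items():
    with open(name, "w", encoding="utf-8") as f:
        f.write(header)
        f.writelines(subset)
```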

| Model | Corpus | Augmented data? | WER | CER | Loss |
| ---------------------------------------------------------------------- | ------------------------------- | --------------- | ------ | ------ | ------ |
| [email protected] | CommonVoice | No | 30,16% | 13,79% | 112,96 |
| [email protected] | CommonVoice | | 29,66% | 13,84% | 108,52 |
| [email protected] | CommonVoice+ParlamentParlaClean | No | 13,85% | 5,62% | 50,49 |
| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 22,62% | 13,59% | 80,45 |

### [FestCat](http://festcat.talp.cat/devel.php) evaluation corpus

| Model | Corpus | Augmented data? | WER | CER | Loss |
| ---------------------------------------------------------------------- | ------------------------------- | --------------- | ------ | ------ | ------ |
| [email protected] | CommonVoice | No | 77,60% | 65,62% | 243,25 |
| [email protected] | CommonVoice | | 78,12% | 65,61% | 235,60 |
| [email protected] | CommonVoice+ParlamentParlaClean | No | 76,10% | 65,16% | 240,69 |
| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 80,58% | 66,82% | 180,81 |

This evaluation shows that the models do not generalize very well.

The FestCat corpus has greater variability in the number of words per sentence, with 90% of its sentences containing between 2 and 23 words, whereas most sentences in the CommonVoice corpus contain between 3 and 16 words. (The distribution can be inspected with utils/count_word_len.py, added in this commit.)

As expected, evaluating the models only on the sentences of the evaluation corpus that contain 4 or more words improves the results:

| Model | Corpus | Augmented data? | WER | CER | Loss |
| ---------------------------------------------------------------------- | ------------------------------- | --------------- | ------ | ------ | ------ |
| [email protected] | CommonVoice | No | 58,78% | 46,61% | 193,85 |
| [email protected] | CommonVoice | | 58,94% | 46,47% | 188,42 |
| [email protected] | CommonVoice+ParlamentParlaClean | No | 56,68% | 46,00% | 189,03 |
| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 61,11% | 48,16% | 144,78 |
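
Such a filtered evaluation set can be derived directly from the test CSV; utils/prune_sentences.py, added in this commit, does the same from the command line. A minimal sketch (the CSV paths are hypothetical):

```
with open("test.csv", encoding="utf-8") as src, open("test_min4.csv", "w", encoding="utf-8") as dst:
    dst.write(src.readline())                     # copy the wav_filename,wav_filesize,transcript header
    for line in src:
        transcript = line.rstrip("\n").split(",", 2)[2]
        if len(transcript.split()) >= 4:          # keep sentences with 4 or more words
            dst.write(line)
```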

## Possible next steps

- Expand the training data corpus
- Optimize the model parameters
- Evaluate the model with a more varied corpus (dialectal variants, noise, informal contexts)

# Deepspeech Catalan ASR Model

## Motivation

The main motivation of this project is to learn how to create ASR models using Mozilla's DeepSpeech engine, so the model is constantly evolving. I also wanted to see what was possible with the currently released [CommonVoice](https://voice.mozilla.org/ca/datasets) Catalan language dataset.

## Usage

Download the model and the scorer and use the deepspeech engine to infer text from an audio file (16 kHz mono WAV):

```
$ pip install [email protected]
$ deepspeech --model deepspeech-catala-0.6.0.pbmm --scorer kenlm.scorer --audio file.wav
```

## Model comparison

What follows is a comparison of the different published model versions, the dataset used and the accuracy of each model.

### Test corpus from the ParlamentParla dataset

Note: For version 0.6.0 the whole CommonVoice dataset (train, dev and test files) was combined with the clean dataset of ParlamentParla, shuffled and split into train/dev/test files using a 75/20/5 ratio. Because of this, a comparison between the models can only be made using 1713 sentences from the ParlamentParla dataset that no model has seen during training.

| Model | Corpus | Augmentation | WER | CER | Loss |
| ---------------------------------------------------------------------- | ------------------------------- | ------------ | ------ | ------ | ------ |
| [email protected] | CommonVoice | No | 30,16% | 13,79% | 112,96 |
| [email protected] | CommonVoice | | 29,66% | 13,84% | 108,52 |
| [email protected] | CommonVoice+ParlamentParlaClean | No | 13,85% | 5,62% | 50,49 |
| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 22,62% | 13,59% | 80,45 |

### Test corpus from the [FestCat](http://festcat.talp.cat/devel.php) dataset

| Model | Corpus | Augmentation | WER | CER | Loss |
| ---------------------------------------------------------------------- | ------------------------------- | ------------ | ------ | ------ | ------ |
| [email protected] | CommonVoice | No | 77,60% | 65,62% | 243,25 |
| [email protected] | CommonVoice | | 78,12% | 65,61% | 235,60 |
| [email protected] | CommonVoice+ParlamentParlaClean | No | 76,10% | 65,16% | 240,69 |
| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 80,58% | 66,82% | 180,81 |

Validating the models against the FestCat dataset shows that the models do not generalize well. This corpus has a higher variability in the word count of the test sentences, with 90% of the sentences containing an evenly distributed number of words between 2 and 23, whilst most of the sentences in the CommonVoice corpus contain between 3 and 16 words.

As expected, validating the models against a test set containing only sentences with 4 or more words improves accuracy:

| Model | Corpus | Augmentation | WER | CER | Loss |
| ---------------------------------------------------------------------- | ------------------------------- | ------------ | ------ | ------ | ------ |
| [email protected] | CommonVoice | No | 58,78% | 46,61% | 193,85 |
| [email protected] | CommonVoice | | 58,94% | 46,47% | 188,42 |
| [email protected] | CommonVoice+ParlamentParlaClean | No | 56,68% | 46,00% | 189,03 |
| [stashify@deepspeech_cat](https://github.com/stashify/deepspeech_cat) | CommonVoice? | | 61,11% | 48,16% | 144,78 |

## Possible next steps

- Expand the training data with other free datasets
- Tune the model parameters to improve performance
- Validate the models with more varied test datasets (dialects, noise)

utils/clean_training_data.py

+79
@@ -0,0 +1,79 @@
#!/usr/bin/env python

import argparse
import importlib
import os
import sys

from deepspeech_training.util.text import Alphabet


def get_validate_label(args):
    if 'validate_label_locale' not in args or (args.validate_label_locale is None):
        print('ERROR: Required --validate_label_locale not specified. Please check.')
        return None
    if not os.path.exists(os.path.abspath(args.validate_label_locale)):
        print('ERROR: Inexistent --validate_label_locale specified. Please check.')
        return None
    module_dir = os.path.abspath(os.path.dirname(args.validate_label_locale))
    sys.path.insert(1, module_dir)
    fname = os.path.basename(args.validate_label_locale).replace('.py', '')
    locale_module = importlib.import_module(fname, package=None)
    return locale_module.validate_label


def process_data(input, print_invalid):
    input_file = os.path.abspath(input)
    if os.path.isfile(input_file):
        with open(input_file, encoding="utf-8") as input_file_data:
            for line in input_file_data:
                if line == 'wav_filename,wav_filesize,transcript\n':
                    # Pass the CSV header through unchanged (the line already ends in '\n')
                    print(line, end='')
                    continue
                # Split only on the first two commas so transcripts containing commas stay intact
                data = line.split(',', 2)
                label = label_filter_fun(data[2])
                if label is not None and print_invalid is False:
                    print(f"{data[0]},{data[1]},{label}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Cleans labels for LM generation")
    parser.add_argument(
        "input",
        help="Text file containing the training data as expected by DeepSpeech",
    )

    parser.add_argument(
        "--validate_label_locale",
        help="Path to a Python file defining a |validate_label| function for your locale.",
    )

    parser.add_argument(
        "--filter_alphabet",
        help="Exclude samples with characters not in provided alphabet",
    )

    parser.add_argument(
        "--print_invalid",
        action="store_true",
        help="Prints invalid labels instead of valid ones",
    )

    PARAMS = parser.parse_args()
    validate_label = get_validate_label(PARAMS)

    ALPHABET = Alphabet(PARAMS.filter_alphabet) if PARAMS.filter_alphabet else None

    def label_filter_fun(label):
        validated_label = validate_label(label)
        if PARAMS.print_invalid and validated_label is None:
            print(label, end='')

        if ALPHABET and validated_label:
            try:
                ALPHABET.encode(validated_label)
            except KeyError:
                validated_label = None
            if PARAMS.print_invalid and validated_label is None:
                print(label, end='')
        return validated_label

    process_data(PARAMS.input, PARAMS.print_invalid)
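
The module passed via --validate_label_locale is not part of this commit. For reference, a hypothetical minimal locale file could look like the sketch below; the file name and the cleaning rules are illustrative, and only the validate_label symbol matters, since that is what the script imports:

```
# validate_label_ca.py -- hypothetical example of a --validate_label_locale module.
# Returning None tells the cleaning script to drop the sample.
import re


def validate_label(label):
    label = label.strip().lower()
    # Keep only characters plausible for Catalan transcripts (illustrative rule).
    label = re.sub(r"[^a-zàáèéíïòóúüç·'\- ]", "", label)
    return label if label else None
```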

utils/count_word_len.py

+45
@@ -0,0 +1,45 @@
#!/usr/bin/env python

import argparse
import os


def process_data(params):
    input_file = os.path.abspath(params.input)
    if os.path.isfile(input_file):
        with open(input_file, encoding="utf-8") as input_file_data:
            word_len_dict = {}
            max_word_len = 0
            total_occ = 0
            for line in input_file_data:
                if line == 'wav_filename,wav_filesize,transcript\n':
                    continue

                # Split only on the first two commas so transcripts containing commas stay intact
                data = line.split(',', 2)
                sentence = data[2].rstrip('\n')
                word_len = len(sentence.split())
                word_len_dict[word_len] = word_len_dict.get(word_len, 0) + 1
                max_word_len = word_len if word_len > max_word_len else max_word_len
                total_occ = total_occ + 1

            acc_occ = 0
            print('Count\tOccur.\t%\t%acc')
            # Include max_word_len itself; skip rows until the cumulative share exceeds 5%
            # and stop once it exceeds 95%.
            for x in range(max_word_len + 1):
                occ = word_len_dict.get(x, 0)
                acc_occ = acc_occ + occ
                if acc_occ / total_occ > 0.05:
                    print(f"{x}\t{occ}\t{occ/total_occ*100:.2f}\t{acc_occ/total_occ*100:.2f}")
                if acc_occ / total_occ > 0.95:
                    break


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Prints the distribution of words per sentence in a DeepSpeech training CSV")
    parser.add_argument(
        "input",
        help="CSV file containing the training data as expected by DeepSpeech",
    )

    PARAMS = parser.parse_args()

    process_data(PARAMS)
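
To make the output window concrete, here is a toy illustration with made-up counts that reproduces the script's 5%/95% cumulative thresholds:

```
# Made-up sentence-length histogram: {words per sentence: number of sentences}
word_len_counts = {2: 5, 3: 20, 4: 40, 5: 25, 6: 10}
total = sum(word_len_counts.values())

acc = 0
print('Count\tOccur.\t%\t%acc')
for n in sorted(word_len_counts):
    acc += word_len_counts[n]
    if acc / total > 0.05:
        print(f"{n}\t{word_len_counts[n]}\t{word_len_counts[n]/total*100:.2f}\t{acc/total*100:.2f}")
    if acc / total > 0.95:
        break
```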

utils/prune_sentences.py

+40
@@ -0,0 +1,40 @@
#!/usr/bin/env python

import argparse
import os


def process_data(params):
    input_file = os.path.abspath(params.input)
    min_words = int(params.min_words) if params.min_words is not None else 0
    max_words = int(params.max_words) if params.max_words is not None else float('inf')
    if os.path.isfile(input_file):
        with open(input_file, encoding="utf-8") as input_file_data:
            for line in input_file_data:
                if line == 'wav_filename,wav_filesize,transcript\n':
                    # Pass the CSV header through unchanged (the line already ends in '\n')
                    print(line, end='')
                    continue
                # Split only on the first two commas so transcripts containing commas stay intact
                data = line.split(',', 2)
                sentence = data[2].rstrip('\n')
                word_len = len(sentence.split())
                if min_words <= word_len <= max_words:
                    print(f"{data[0]},{data[1]},{sentence}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Keeps only training sentences whose word count is within the given bounds")
    parser.add_argument(
        "input",
        help="Text file containing the training data as expected by DeepSpeech",
    )

    parser.add_argument(
        "--min_words",
        help="Minimum number of words a sentence has to have in order to be kept. Default is 0.",
    )

    parser.add_argument(
        "--max_words",
        help="Maximum number of words a sentence has to have in order to be kept. Default is infinite.",
    )

    PARAMS = parser.parse_args()

    process_data(PARAMS)
