setup basic FR preprocessing #87
base: master
Conversation
Thanks for the commit!
However, before I review the code proper, could you change all the packaging code to adhere to the conventions of this repo? In other words, use setup.cfg as described here, for example.
Thanks @kdavis-mozilla for the review. I've made the requested changes.

```python
from corporacreator.utils import maybe_normalize, replace_numbers, FIND_PUNCTUATIONS_REG, FIND_MULTIPLE_SPACES_REG

FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")
```
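One thing worth noting about the pattern above: square brackets form a character class, not an alternation, so `[ème|éme|ieme|ier|iere]+` matches any run of the individual characters `è m e | é i r`. A quick sketch of the difference, using a hypothetical alternation-based variant (the exact suffix list is an assumption, not code from this PR):

```python
import re

# Pattern as written in the PR: the brackets are a character class
loose = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")
# Hypothetical tightened variant using a real alternation group
strict = re.compile(r"(\d+)(?:er|ere|ère|ème|éme|ieme|ier|iere)")

assert loose.search("1er") is not None
assert loose.search("1m") is not None   # 'm' alone is in the character class
assert strict.search("1er") is not None
assert strict.search("1m") is None      # 'm' is not one of the listed suffixes
```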
@nicolaspanel Maybe we should include potential spaces? I saw data like "1 er".
@lissyx Since there is no such case in clips.tsv, I suggest we wait.
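If spaced forms like "1 er" do show up later, one possible tweak is an optional whitespace between the number and the suffix; a sketch (the suffix list here is an assumption, not the PR's final pattern):

```python
import re

# Hypothetical variant of FIND_ORDINAL_REG tolerating "1 er" as well as "1er"
space_tolerant = re.compile(r"(\d+)\s?(er|ère|ème|ieme|ier|iere)")

assert space_tolerant.search("1 er") is not None
assert space_tolerant.search("1er") is not None
assert space_tolerant.search("10 ème") is not None
```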
@lissyx OK for you?
Yep.
There are several issues (see the comments) that I have questions on and/or that need to be addressed.
```diff
@@ -8,5 +32,9 @@ def fr(client_id, sentence):
     Returns:
         (str): Cleaned up sentence. Returning None or a `str` of whitespace flags the sentence as invalid.
     """
-    # TODO: Clean up fr data
-    return sentence
+    text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS)
```
Do these always make sense? (See comments above on FR_NORMALIZATIONS.)
I think so.
If not, then special cases should be handled using client_id.
@kdavis-mozilla can you think of an example?
```python
FR_NORMALIZATIONS = [
    [re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'],  # "123 000 …" => "123000 …"
```
We ideally want to not have digits. That said, I'm not sure I understand the motivation for this change.
For example, "123 000" might have been read "one hundred twenty three zero zero zero". However, now it's changed to "123000", which I doubt would be read as "one hundred twenty three zero zero zero". So we'd introduce a mismatch between the audio and the text.
Are you assuming that replace_numbers() fixes this?
If so, how can replace_numbers() do this accurately, as it does not know about splits like "123 000"? All it would see is "123000", which, due to the split, may have been pronounced "one hundred twenty three zero zero zero".
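For what it's worth, the mechanics of the rule itself are easy to check in isolation (this only demonstrates the substitution, not whether it matches the audio):

```python
import re

# The FR_NORMALIZATIONS rule quoted above: glue "123 000" into "123000"
rule = re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)')

assert rule.sub(r'\1\2\3\4', "123 000 euros") == "123000 euros"
# Groups whose second part is not exactly "000" are left untouched
assert rule.sub(r'\1\2\3\4', "123 456 euros") == "123 456 euros"
```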
Part of that is my code from the CommonVoice-fr repo, so it was accurate enough on a dataset like the one from the French parliament.
> Are you assuming that replace_numbers() fixes this?

Yes.

> For example "123 000" might have been read "one hundred twenty three zero zero zero".

I assume it is not the case and the user said "cent vingt trois mille".
> Part of that is my code from the CommonVoice-fr repo, so it was accurate enough on a dataset like the one from the French parliament.

Thanks @lissyx. It seems that some cases were still not properly handled. For example, clips.tsv contains sentences like:

- les trois,000 valeur du trésor de Loretto
- à ma fille, et dix.000 fr.
- Loretto contenait un trésor à peu près de trois,000 liv.
@nicolaspanel Yep, those are actually part of another dataset that was much less well formatted, and some errors slipped through :/
I know this is not the perfect place to discuss this, but...
I'm wondering if we could save a lot of time by simply having common.py mark as invalid any sentences with digits in them.
It's Draconian, but I think there are many problems like the ones we are thinking about here in multiple languages, and I don't think they will be solved soon in all the languages, and we want to get the data out the door as soon as possible.
I'd be interested in your opinions.
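A minimal sketch of what that Draconian check could look like (the function name and the None-means-invalid convention are borrowed from the fr() docstring above; this is not the actual common.py code):

```python
import re

HAS_DIGIT_REG = re.compile(r'\d')

def common(client_id, sentence):
    # Sketch: flag any sentence containing a digit as invalid by returning None
    if HAS_DIGIT_REG.search(sentence):
        return None
    return sentence

assert common('*', "les trois,000 valeur du trésor de Loretto") is None
assert common('*', "bonjour tout le monde") == "bonjour tout le monde"
```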
```shell
$ cut -f3 source/fr/validated.tsv | grep '[[:digit:]]' | wc -l
366
```

I guess we can skip numbers for now, fix it in the CV dataset, and hand-craft the current recordings; 366 is not impossible.
Looks like there are a few leftover digits being searched for here.
src/corporacreator/utils.py (outdated)

```python
try:
    ee = ''.join(e.split())
    if int(e) >= 0:
        newinp = num2words(int(ee), lang=locale)
```
How can this work in all cases?
For example, "Room 456" can validly be read as "Room four five six" or as "Room four hundred and fifty six". This code can't catch that.
It is for reasons exactly like this that the client_id is passed to fr(), so you can hear what the person said and provide the correct transcript.
Here we assume value is not ambiguous. Situations like "Room four five six" should have already been handled by the maybe_normalize step, producing "Room 4 5 6" instead of the original "Room 456".
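As an illustration of that kind of mapping, here is a hypothetical helper (not code from this PR) that spells a number out digit by digit before any number-to-words step runs:

```python
import re

def spell_out_digits(text):
    # Hypothetical helper: "Room 456" -> "Room 4 5 6", so each digit is later
    # converted to words separately instead of as one number
    return re.sub(r'\d+', lambda m: ' '.join(m.group()), text)

assert spell_out_digits("Room 456") == "Room 4 5 6"
```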
@kdavis-mozilla is it OK for you?
Are we allowed to assume we are in a non-ambiguous case?
I don't see how we can assume such without hearing the audio.
@kdavis-mozilla I think in French we should be fine regarding ambiguous cases, except for numbers > 1099 and <= 9999. Those might be (and, in the case of dates, often are) spelled by hundreds. But as I said to @nicolaspanel, if it's too much work for him and too tricky, with edge cases risking polluting the dataset, then I can dig into the clips and listen later.
@lissyx @kdavis-mozilla
@kdavis-mozilla @lissyx I think I've made all requested changes.
Not sure about the numbers thing.
```python
from corporacreator.utils import maybe_normalize, replace_numbers, FIND_PUNCTUATIONS_REG, FIND_MULTIPLE_SPACES_REG

FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")
```
Yep.
```diff
-    # TODO: Clean up fr data
-    return sentence
+    text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS + [REPLACE_SPELLED_ACRONYMS])
+    text = replace_numbers(text, locale='fr', ordinal_regex=FIND_ORDINAL_REG)
```
Just wondering if we should do that now or later: as you showed, my heuristics were not perfect, so maybe it'd be best if I listened to the recordings, adjusted with client_id, and fixed the Common Voice dataset at the same time?
As far as I can tell, it works just fine as is (I am also using it in the trainingspeech project).
It is a good idea to pick and listen to a few examples to check, but checking all the examples would take a lot of time...
Personally, I won't have such bandwidth anytime soon...
That's why I was suggesting that I do it :)
@lissyx why not do it in another PR?
@nicolaspanel That's what was implied :)
@lissyx @nicolaspanel This is related to my "invalidate all sentences with digits" comment above.
I'd be interested in your take on the Draconian idea to have common.py mark as invalid any sentences with digits in them.
@kdavis-mozilla I guess it's not such a bad idea, with a logging mode to help identify and fix the dataset as well.
```diff
@@ -0,0 +1,34 @@
+import pytest
```
👍
I've added a few comments on the code, but more than anything I wanted to ask everyone following this issue about the Draconian idea of having a separate PR that has common.py mark as invalid any sentences with digits in them.
The most complicated part of this PR, and of other PRs in other languages, is the digits. So my thought was to just solve the problem once and for all in all languages: throw out any sentence with digits.
That said, if a separate PR is made to have common.py mark as invalid any sentences with digits in them, then a lot of the code here is not needed.
@lissyx it'd be great to have your feedback too!
```python
FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")

SPELLED_ACRONYMS = {
```
If this contains all the acronyms in fr for the current clips.tsv, then we're fine.
```python
]

FR_NORMALIZATIONS = [
```
Fine here too
```python
FR_NORMALIZATIONS = [
    [re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'],  # "123 000 …" => "123000 …"
```
I know this is not the perfect place to discuss this, but...
I'm wondering if we could save a lot of time by simply having common.py mark as invalid any sentences with digits in them.
It's Draconian, but I think there are many problems like the ones we are thinking about here in multiple languages, and I don't think they will be solved soon in all the languages, and we want to get the data out the door as soon as possible.
I'd be interested in your opinions.
```diff
-    # TODO: Clean up fr data
-    return sentence
+    text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS + [REPLACE_SPELLED_ACRONYMS])
+    text = replace_numbers(text, locale='fr', ordinal_regex=FIND_ORDINAL_REG)
```
@lissyx @nicolaspanel This is related to my "invalidate all sentences with digits" comment above.
I'd be interested in your take on the Draconian idea to have common.py mark as invalid any sentences with digits in them.
src/corporacreator/utils.py (outdated)

```python
try:
    ee = ''.join(e.split())
    if int(e) >= 0:
        newinp = num2words(int(ee), lang=locale)
```
Are we allowed to assume we are in a non-ambiguous case?
I don't see how we can assume such without hearing the audio.
@lissyx @nicolaspanel Sorry for my review comment being more about starting a discussion, but I think it's a discussion that needs to happen, as complicated digit manipulation in many languages isn't going to happen on a timescale that's useful for a data release. (@nicolaspanel this is not a reflection on you, as you are the only person who's taken a real shot at doing the digits right!)
@nicolaspanel @lissyx I am going to introduce a PR later today to mark as invalid any sentence in any language with digits; see issue 89. As the release date looms and the number of languages that deal with digits properly is almost zero, I talked with other members of the project to establish that this is the most practical way forward to maintain a high-quality data set with a release in the foreseeable future. As a note, this would increase the number of invalid French sentences by 0.52%, which I think is acceptable.
LGTM now, let's not lose time on numbers.
@nicolaspanel I just merged code that marks all sentences in all languages that have digits as invalid. For French this decreases the number of valid sentences by 0.52%, which I think is acceptable. So could you remove all code in your PR that deals with digits?
@kdavis-mozilla Done. I marked the related unit tests as skipped, since we may want to support them later.
Thanks for removing most of the digits code, but it seems like there are still digit relics in the regular expressions that should be removed.
I think I commented on all of them with "Looks like there are a few leftover digits being searched for here.", but I might have missed one or two.
```python
FR_NORMALIZATIONS = [
    ['Jean-Paul II', 'Jean-Paul deux'],
    [re.compile(r'(^|\s)(\d+)T(\s|\.|,|\?|!|$)'), r'\1\2 tonnes\3'],
```
Looks like there are a few leftover digits being searched for here.
```python
FR_NORMALIZATIONS = [
    [re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'],  # "123 000 …" => "123000 …"
```
Looks like there are a few leftover digits being searched for here.
```python
    [re.compile(r'(^|\s)/an(\s|\.|,|\?|!|$)'), r'\1par an\2'],
    [re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'],  # "123 000 …" => "123000 …"
    [re.compile(r'(^|\s)km(\s|\.|,|\?|!|$)'), r'\1 kilomètres \2'],
    [re.compile(r'(^|\s)0(\d)(\s|\.|,|\?|!|$)'), r'\1zéro \2 \3'],
```
Looks like there are a few leftover digits being searched for here.
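For reference, the leading-zero rule quoted above is one of those digit-handling entries; its raw effect can be sketched in isolation (later steps, not shown, handle whitespace cleanup):

```python
import re

# The leading-zero rule from FR_NORMALIZATIONS above: "06" -> "zéro 6"
rule = re.compile(r'(^|\s)0(\d)(\s|\.|,|\?|!|$)')

# Note the extra space before the period: this is the rule's raw output
assert rule.sub(r'\1zéro \2 \3', "le 06.") == "le zéro 6 ."
```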
```python
    [re.compile(r'(^|\s)0(\d)(\s|\.|,|\?|!|$)'), r'\1zéro \2 \3'],
    ['%', ' pourcent'],
    [re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
    [re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
```
Looks like there are a few leftover digits being searched for here.
```python
    [re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
    [re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
    [re.compile(r'/\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r' par mètre carré\1'],
    [re.compile(r'(^|\s)(\d+),(\d{2})\s?€(\s|\.|,|\?|!|$)'), r'\1\2 euros \3 \4'],
```
Looks like there are a few leftover digits being searched for here.
```python
FIND_MULTIPLE_SPACES_REG = re.compile(r'\s{2,}')
FIND_PUNCTUATIONS_REG = re.compile(r"[/°\-,;!?.()\[\]*…—«»]")
```
In parallel with the removal of the commented-out code above, I'd ask that this also be removed, as it's not used.
```python
text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS + [REPLACE_SPELLED_ACRONYMS])
# TODO: restore this once we are clear on which punctuation marks should be kept or removed
# text = FIND_PUNCTUATIONS_REG.sub(' ', text)
text = FIND_MULTIPLE_SPACES_REG.sub(' ', text)
```
This has no effect, as multiple whitespaces are already removed here. So it seems like it should be removed.
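The no-op claim is easy to verify: once multiple spaces have been collapsed, running the same substitution again changes nothing.

```python
import re

FIND_MULTIPLE_SPACES_REG = re.compile(r'\s{2,}')

once = FIND_MULTIPLE_SPACES_REG.sub(' ', "un  deux   trois")
assert once == "un deux trois"
# A second pass has nothing left to collapse
assert FIND_MULTIPLE_SPACES_REG.sub(' ', once) == once
```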
```python
from corporacreator import preprocessors

@pytest.mark.parametrize('locale, client_id, sentence, expected', [
```
I'd suggest removing tests that no longer make sense in light of digits being banned. For example:

```python
('fr', '*', "donc, ce sera 299 € + 99 €", "donc, ce sera deux cent quatre-vingt-dix-neuf euros plus quatre-vingt-dix-neuf euros"),
```

Some of the tests here, for example

```python
('fr', '*', "Jean-Paul II.", "Jean-Paul deux."),
```

have nothing to do with digits and can actually be run independent of this comment.
```python
    ['%', ' pourcent'],
    [re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
    [re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
    [re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
```
Looks like there are a few leftover digits being searched for here.
```python
    [re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
    [re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
    [re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
    [re.compile(r'/\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r' par mètre carré\1'],
```
Looks like there are a few leftover digits being searched for here.
@kdavis-mozilla you're right, sorry I missed it. Fixed in 149e960.
Assuming my eyes are parsing the regexes correctly, it looks like there are still some regexes that deal with digits. For example:

```python
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2']
```

Also there is some "dead code":

```python
text = FIND_MULTIPLE_SPACES_REG.sub(' ', text)
```

that's "dead" as a result of code in common.py.