Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setup basic FR preprocessing #87

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

nicolaspanel
Copy link

No description provided.

Copy link
Contributor

@kdavis-mozilla kdavis-mozilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the commit!

However, before I review the code proper, could you change all the packaging code to adhere to the convention of this repo. In other words using setup.cfg as described here, for example.

requirements.txt Outdated Show resolved Hide resolved
README.rst Outdated Show resolved Hide resolved
README.rst Outdated Show resolved Hide resolved
@nicolaspanel
Copy link
Author

thanks @kdavis-mozilla for the review. I've made requested changes


from corporacreator.utils import maybe_normalize, replace_numbers, FIND_PUNCTUATIONS_REG, FIND_MULTIPLE_SPACES_REG

FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicolaspanel Maybe we shoud include potential spaces? I saw data like "1 er".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lissyx since their is no such case in clips.tsv I suggest to wait

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lissyx ok for you ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

src/corporacreator/preprocessors/fr.py Outdated Show resolved Hide resolved
src/corporacreator/preprocessors/fr.py Outdated Show resolved Hide resolved
Copy link
Contributor

@kdavis-mozilla kdavis-mozilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There as several issues, see the comments, that I've questions on and/or need to be addressed.

setup.cfg Outdated Show resolved Hide resolved
src/corporacreator/preprocessors/fr.py Outdated Show resolved Hide resolved
@@ -8,5 +32,9 @@ def fr(client_id, sentence):
Returns:
(str): Cleaned up sentence. Returning None or a `str` of whitespace flags the sentence as invalid.
"""
# TODO: Clean up fr data
return sentence
text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these always make sense? (See comments above on FR_NORMALIZATIONS.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so.
If not then special cases should be handled using client_id.
@kdavis-mozilla can you think of an example ?



FR_NORMALIZATIONS = [
[re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'], # "123 000 …" => "123000 …"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ideally want to not have digits. That said I'm not sure I understand the motivation for this change.

For example "123 000" might have been read "one hundred twenty three zero zero zero". However, now its changed to "123000" which I doubt would be read as "one hundred twenty three zero zero zero". So we'd introduce a miss-match between the audio and the text.

Are you assuming that replace_numbers() fixes this?

If so, how can replace_numbers() do this accurately as it does no know about splits like "123 000". All it would see is "123000", which, due to the split, may have been pronounced "one hundred twenty three zero zero zero".

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of that is my code from CommonVoice-fr repo, so it was accurate enough on a dataset like the one from French parliament.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you assuming that replace_numbers() fixes this?

yes

For example "123 000" might have been read "one hundred twenty three zero zero zero".

I assume it is not the case and user said cent vingt trois mille.

Copy link
Author

@nicolaspanel nicolaspanel Feb 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of that is my code from CommonVoice-fr repo, so it was accurate enough on a dataset like the one from French parliament.

thanks @lissyx
It seems that some case were still not properly handled. for example, clips.tsv contains sentences like:

  • les trois,000 valeur du trésor de Loretto
  • à ma fille, et dix.000 fr.
  • Loretto contenait un trésor à peu près de trois,000 liv.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicolaspanel Yep, those are actually part of another dataset, that was much less well formatted and some error slipped throuh :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not the perfect place to discuss this, but....

I'm wondering if we could save a lot of time by simply having common.py mark as invalid any sentences with digits in them.

It's Draconian, but I think there are many problems like the ones we are thinking about here in multiple languages and I don't think they will be solved soon in all the languages, and we want to get the data out the door as soon as possible.

I'd be interested in your opinions

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kdavis-mozilla @nicolaspanel

$ cut -f3 source/fr/validated.tsv | grep '[[:digit:]]' | wc -l
366

I guess we can skip numbers for now, fix it in the CV dataset, and hand-craft the current recording, 366 is not impossible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are a few leftover digits being searched for here.

src/corporacreator/preprocessors/fr.py Show resolved Hide resolved
src/corporacreator/preprocessors/fr.py Outdated Show resolved Hide resolved
src/corporacreator/preprocessors/fr.py Outdated Show resolved Hide resolved
src/corporacreator/utils.py Outdated Show resolved Hide resolved
try:
ee = ''.join(e.split())
if int(e) >= 0:
newinp = num2words(int(ee), lang=locale)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can this work in all cases?

For example, "Room 456" can validly be read as "Room four five six" or as "Room four hundred and fifty six" . This code can't catch that.

It is for reasons exactly like this that the client_id is passed to fr() so you can hear what the person said and provide the correct transcript.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we assume value is not ambiguous. situation like "Room four five six" should have already been handle by maybe_normalize step to produce "Room 4 5 6" instead of original "Room 456"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kdavis-mozilla is it ok for you ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we allowed to assume we are in a non-ambiguous case?

I don't see how we can assume such without hearing the audio.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kdavis-mozilla I think in French we should be fine regarding ambigous cases unless for numbers > 1099 <= 9999. Those might (and are often in the case of dates) be spelled by hundreds. But as I said to @nicolaspanel if it's too much work for him and too tricky / edge cases to risk polluting the dataset, then I can dig into clips and listen, later.

src/corporacreator/utils.py Outdated Show resolved Hide resolved
@nicolaspanel
Copy link
Author

nicolaspanel commented Feb 20, 2019

@lissyx @kdavis-mozilla
I won't have time to handle all specific situations.
The point with this PR was to share some basic normalisation code I have.
Lets focus then on remaining issues (uppercase, punctuation, etc). It will still be possible to improve later

@nicolaspanel
Copy link
Author

@kdavis-mozilla @lissyx I think I've made all requested changes.

Copy link

@lissyx lissyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the numbers thing.


from corporacreator.utils import maybe_normalize, replace_numbers, FIND_PUNCTUATIONS_REG, FIND_MULTIPLE_SPACES_REG

FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

# TODO: Clean up fr data
return sentence
text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS + [REPLACE_SPELLED_ACRONYMS])
text = replace_numbers(text, locale='fr', ordinal_regex=FIND_ORDINAL_REG)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if we should do that now or later: as you shown, my heuristics were not perfect, so maybe it'd be best if I listened to recording and adjust with client_id, and fix Common Voice dataset at the same time ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, it work just fine as is (I am also using it in trainingspeech projet).
It is a good idea to pick / listen a few examples to check but checking all the examples will take a lot of time...
Personally, I won't have such bandwidth anytime soon...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I was suggesting that I do it :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lissyx why not doing it in another PR ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicolaspanel That's what was implied :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lissyx @nicolaspanel This is related to my "invalidate all sentences with digits" comment above.

I'd be interested in your take on the Draconian idea to have common.py mark as invalid any sentences with digits in them.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kdavis-mozilla I guess it's not such a bad idea, with a logging mode to help identify and fix the dataset as well.

@@ -0,0 +1,34 @@
import pytest
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@kdavis-mozilla kdavis-mozilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added in a few comments on the code, but I think more than anything I wanted to ask everyone following on this issue about the Draconian idea of having a separate PR that has common.py mark as invalid any sentences with digits in them.

The most complicated part of this PR, and of other PR's in other languages, are the digits. So my thought was to just solve the problem once and for all in all languages. Throw out any sentence with digits.

That said, if a separate PR is made to have common.py mark as invalid any sentences with digits in, then a lot of the code here is not needed.

@lissyx it'd be great to have your feed back too!


FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")

SPELLED_ACRONYMS = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this contains all the acronyms in fr for the current clips.tsv, then we're fine.

]


FR_NORMALIZATIONS = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine here too



FR_NORMALIZATIONS = [
[re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'], # "123 000 …" => "123000 …"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not the perfect place to discuss this, but....

I'm wondering if we could save a lot of time by simply having common.py mark as invalid any sentences with digits in them.

It's Draconian, but I think there are many problems like the ones we are thinking about here in multiple languages and I don't think they will be solved soon in all the languages, and we want to get the data out the door as soon as possible.

I'd be interested in your opinions

# TODO: Clean up fr data
return sentence
text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS + [REPLACE_SPELLED_ACRONYMS])
text = replace_numbers(text, locale='fr', ordinal_regex=FIND_ORDINAL_REG)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lissyx @nicolaspanel This is related to my "invalidate all sentences with digits" comment above.

I'd be interested in your take on the Draconian idea to have common.py mark as invalid any sentences with digits in them.

try:
ee = ''.join(e.split())
if int(e) >= 0:
newinp = num2words(int(ee), lang=locale)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we allowed to assume we are in a non-ambiguous case?

I don't see how we can assume such without hearing the audio.

@kdavis-mozilla
Copy link
Contributor

kdavis-mozilla commented Feb 21, 2019

@lissyx @nicolaspanel Sorry for my review comment being more about starting a discussion, but I think it's a discussion that needs to happen as complicated digit manipulation in many languages isn't going to happen on a timescale that's useful for a data release. (@nicolaspanel this is not reflective on you as you are the only person who's taken a real shot at doing the digits right!)

@kdavis-mozilla
Copy link
Contributor

@nicolaspanel @lissyx I am going to introduce a PR later today to mark as invalid any sentence in any language with digits, see issue 89.

As the release date looms and the number of languages that deal with digits properly is almost zero, I talked with other members of the project to establish that this is the most practical way forward to maintain a high quality data set with a release in the foreseeable future.

As a note, this would increase the number of invalid French sentences by 0.52% which is think is acceptable.

Copy link

@lissyx lissyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, let's not loose time on numbers.

@kdavis-mozilla
Copy link
Contributor

@nicolaspanel I just merged code that marks all sentences in all languages that have digits as invalid. For French this decreases the number of valid sentences by 0.52% which is think is acceptable.

So could you can remove all code in your PR that deals with digits?

@nicolaspanel
Copy link
Author

So could you can remove all code in your PR that deals with digits?

@kdavis-mozilla done

I marked related unit tests as skipped since we may want to support them later

Copy link
Contributor

@kdavis-mozilla kdavis-mozilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for removing most of the digits code, but it seems like there are still digit relics in the regular expressions that should be removed.

I think I commented on all of them with "Looks like there are a few leftover digits being searched for here.", but I might have missed one or two.


FR_NORMALIZATIONS = [
['Jean-Paul II', 'Jean-Paul deux'],
[re.compile(r'(^|\s)(\d+)T(\s|\.|,|\?|!|$)'), r'\1\2 tonnes\3'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are a few leftover digits being searched for here.



FR_NORMALIZATIONS = [
[re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'], # "123 000 …" => "123000 …"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are a few leftover digits being searched for here.

[re.compile(r'(^|\s)/an(\s|\.|,|\?|!|$)'), r'\1par an\2'],
[re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'], # "123 000 …" => "123000 …"
[re.compile(r'(^|\s)km(\s|\.|,|\?|!|$)'), r'\1 kilomètres \2'],
[re.compile(r'(^|\s)0(\d)(\s|\.|,|\?|!|$)'), r'\1zéro \2 \3'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are a few leftover digits being searched for here.

[re.compile(r'(^|\s)0(\d)(\s|\.|,|\?|!|$)'), r'\1zéro \2 \3'],
['%', ' pourcent'],
[re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are a few leftover digits being searched for here.

[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
[re.compile(r'/\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r' par mètre carré\1'],
[re.compile(r'(^|\s)(\d+),(\d{2})\s?€(\s|\.|,|\?|!|$)'), r'\1\2 euros \3 \4'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are a few leftover digits being searched for here.



FIND_MULTIPLE_SPACES_REG = re.compile(r'\s{2,}')
FIND_PUNCTUATIONS_REG = re.compile(r"[/°\-,;!?.()\[\]*…—«»]")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In parallel with removal of the above commented out code, I'd ask that this is also removed as it's not used.

text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS + [REPLACE_SPELLED_ACRONYMS])
# TODO: restore this once we are clear on which punctuation marks should be kept or removed
# text = FIND_PUNCTUATIONS_REG.sub(' ', text)
text = FIND_MULTIPLE_SPACES_REG.sub(' ', text)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has no effect as multiple white spaces are already removed here. So it seems like it should be removed.

from corporacreator import preprocessors


@pytest.mark.parametrize('locale, client_id, sentence, expected', [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest removing tests that no longer make sense in like of digits being banned. For example

('fr', '*', "donc, ce sera 299 € + 99 €", "donc, ce sera deux cent quatre-vingt-dix-neuf euros plus quatre-vingt-dix-neuf euros"),

Some of the tests here, for example

('fr', '*', "Jean-Paul II.", "Jean-Paul deux.")

have nothing to do with digits and can actually be run independent of this comment.

['%', ' pourcent'],
[re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are a few leftover digits being searched for here.

[re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
[re.compile(r'/\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r' par mètre carré\1'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are a few leftover digits being searched for here.

@nicolaspanel
Copy link
Author

Looks like there are a few leftover digits being searched for here.

@kdavis-mozilla you're right, sorry I missed it. fixed in 149e960

Copy link
Contributor

@kdavis-mozilla kdavis-mozilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming my eyes are parsing the regex's correctly, it looks like there's still some regexs that deal with digits.

For example

 [re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2']

Also there is some "dead code"

text = FIND_MULTIPLE_SPACES_REG.sub(' ', text)

that's "dead" as a result of code in common.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants