Number interpretation in Recognizers-Text #2577

tellarin · 2021-04-30T07:33:04Z

tellarin
Apr 30, 2021
Collaborator

The Recognizers-Text project offers models per language with as wide language variant support as possible, without the need to specify fine-grained locales. For example, the English model should support as many variations of the English language as possible in a single model (e.g., en-US, en-AU, en-GB, en-HK, ...).

However, there may be cases where an entity interpretation (in this case, a number format) is ambiguous between two language variants.

By design, the recognizers consider one variant of a language as the default. For example, for English, the default variant is en-US. If the default behaviour of the model is not as desired, a user can request a variant-specific version, instead of the general model (i.e., request "en-GB" instead of "en-*").

The current design of the recognizers for Numbers can be summarized as:

For languages that accept only one number format (either comma or dot as the decimal separator, but not both), like Portuguese, only the language-specific number format (in pt-*, comma as decimal separator and dot as thousands separator) will be treated as a number. Other formats will not be recognized.
For languages where different variants have different rules, like Spanish-Spain (es-ES) and Spanish-Mexico (es-MX), the model first assumes a default variant. In the case of Spanish, the Spain variant.
All cases in formats of the default language variant will work.
For formats in non-default variants, all cases that are not ambiguous will also work.
Ambiguous format cases will fall back to the interpretation used by the default variant.

For example, in English, which supports variants and uses en-US as default variant:

"1,2" or "1.2" are interpreted the same ("1.2") as both are unambiguous.
"1,234.00" only has one meaning, "1234.00". And "1.234,00" only has one meaning, "1234.00".
Potentially ambiguous cases, like "1.234", will be treated with dot as decimal separator (as per the default variant), and will be interpreted as "1.234". "1,234" will be interpreted as "1234".
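The rules above can be illustrated with a minimal Python sketch. This is not the library's implementation: the function name `interpret_number` and the `default_decimal` parameter are hypothetical, and the rule that a single separator followed by exactly three digits counts as "ambiguous" is an assumption inferred from the examples in this post.

```python
from decimal import Decimal


def interpret_number(text: str, default_decimal: str = ".") -> Decimal:
    """Hypothetical sketch of the separator rules described in this post."""
    seps = [c for c in text if c in ".,"]
    if not seps:
        return Decimal(text)

    if len(set(seps)) == 2:
        # Both separators appear ("1,234.00", "1.234,00"):
        # the last one can only be the decimal separator.
        decimal_sep = seps[-1]
    else:
        sep = seps[0]
        tail = text.rsplit(sep, 1)[1]
        if len(seps) > 1:
            # Same separator repeated ("1,234,567"): it can only be grouping.
            decimal_sep = "," if sep == "." else "."
        elif len(tail) != 3:
            # Not followed by exactly three digits ("1,2"): it must be decimal.
            decimal_sep = sep
        else:
            # Ambiguous ("1.234", "1,234"): fall back to the default variant.
            decimal_sep = default_decimal

    group_sep = "," if decimal_sep == "." else "."
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


# en-US style default (dot as decimal separator)
print(interpret_number("1,2"))       # unambiguous decimal
print(interpret_number("1.234,00"))  # unambiguous, comma is decimal
print(interpret_number("1.234"))     # ambiguous, dot treated as decimal
print(interpret_number("1,234"))     # ambiguous, comma treated as grouping
```

Passing `default_decimal=","` changes only the ambiguous cases, which is the behaviour a variant-specific request is meant to give.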

As previously stated, if the behaviour for the ambiguous cases is not as desired, a user should be able to request a specific language variant.

There are two known issues in the current implementation:

A bug affects ambiguous cases that should be interpreted according to a non-default variant ([* Number] Improve number interpretation for potentially ambiguous formats when the default language variant interpretation is not the desired one #2578). A fix is on the way.
The "value" in the recognizers' output uses inconsistent formatting based on culture. Some languages use dot as decimal separator, while others don't. The current design also returns one form of separator per culture (not per variant). The output should instead be uniform, using a single canonical form with dot as decimal separator ([* Number] Output isn't consistent across cultures #774), which consumers can then reformat as needed. This has not been implemented yet, as consumers considered it a breaking change. The current planned design is to add a "cannonicalValue" output and a config flag to change the default "value" behaviour.
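As a rough illustration of the planned design for the second issue, the sketch below returns both a culture-formatted value and a canonical one, with a flag that switches the default "value" to the canonical form. Everything here is an assumption for illustration: the function `build_resolution`, the `value_is_canonical` flag, and the key name in `CANONICAL_KEY` are hypothetical and do not describe an existing API.

```python
from decimal import Decimal

# Hypothetical key name for the planned canonical output.
CANONICAL_KEY = "canonicalValue"


def build_resolution(number: Decimal,
                     culture_decimal_sep: str,
                     value_is_canonical: bool = False) -> dict:
    """Return a resolution dict with culture-specific and canonical values."""
    # Fixed-point formatting of a Decimal always uses dot as decimal separator.
    canonical = format(number, "f")
    # Current behaviour (as described above): one separator per culture.
    localized = canonical.replace(".", culture_decimal_sep)
    return {
        "value": canonical if value_is_canonical else localized,
        CANONICAL_KEY: canonical,
    }


# A culture that uses comma as decimal separator
print(build_resolution(Decimal("103.5"), ","))
# Same input with the hypothetical flag enabled
print(build_resolution(Decimal("103.5"), ",", value_is_canonical=True))
```

Keeping the existing "value" untouched by default is what would avoid the breaking change, while consumers that opt in get a uniform format.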

iMicknl · 2021-04-30T15:32:54Z

iMicknl
Apr 30, 2021
Collaborator

It would be good to take NumberRangeModel into account as well. Currently, within the same locale, you can get different notations due to the limitations of the value format, e.g. "value": "(0.25,0.5)".

Recognizers-Text/Specs/Number/Spanish/NumberModel.json

Line 1235 in 3cf716e

"value": "103,666666666667"

Recognizers-Text/Specs/Number/Spanish/NumberRangeModel.json

Line 285 in 3cf716e

"value": "(0.25,0.5)"

1 reply

tellarin May 6, 2021
Collaborator Author

NumberRange already uses the canonical form, which should be followed elsewhere at some point.
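A small sketch of why the range output needs the dot-decimal canonical form: the range value uses a comma to delimit its two bounds, so comma decimals inside it could not be split reliably. The helper `parse_range_value` is hypothetical and assumes the value always has the shape shown in the spec above (one opening bracket, two bounds, one closing bracket).

```python
from decimal import Decimal
from typing import Tuple


def parse_range_value(value: str) -> Tuple[Decimal, Decimal]:
    """Split a range value such as "(0.25,0.5)" into its two bounds."""
    inner = value.strip()[1:-1]    # drop the opening and closing bracket
    low, high = inner.split(",")   # comma is the bound delimiter
    return Decimal(low), Decimal(high)


print(parse_range_value("(0.25,0.5)"))

# With comma decimals the same split yields four pieces instead of two.
try:
    parse_range_value("(0,25,0,5)")
except ValueError as err:
    print("cannot parse:", err)
```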

tellarin · 2021-07-29T09:30:40Z

tellarin
Jul 29, 2021
Collaborator Author

#774 and #513 should probably be addressed together, towards making sure all output follows the same canonical forms.
Presentation issues should not be in scope for the recognizers.

0 replies
