Create langcheck.utils.detect_language() #67

Hello, I have a question about the following test code. Thank you in advance.

Comments
Hi @syamaco, thanks for the question!

I think it makes sense to include a langcheck.utils.detect_language() function for this.

Currently, English toxicity (detoxify) and Japanese toxicity (a fine-tuned line-distilbert-base-japanese) are completely different models, so it may not be optimal to set a single threshold for both. If you use OpenAI to compute toxicity, I believe it uses the exact same model and prompt for both languages, so you might be able to set a single threshold in that case. @liwii, if you have any suggestions, feel free to chime in!
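For illustration, a minimal sketch of what per-language thresholds could look like (the threshold values are placeholders, and the `metric_values` attribute is assumed from LangCheck's MetricValue interface; exact attribute names may differ):

```python
import langcheck

# The two languages use different underlying models, so each gets
# its own (placeholder) threshold rather than a single shared one.
EN_TOXICITY_THRESHOLD = 0.5
JA_TOXICITY_THRESHOLD = 0.7

en_result = langcheck.metrics.toxicity(["Some English output to check."])
ja_result = langcheck.metrics.ja.toxicity(["チェックしたい日本語の出力。"])

print(en_result.metric_values[0] > EN_TOXICITY_THRESHOLD)
print(ja_result.metric_values[0] > JA_TOXICITY_THRESHOLD)
```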
Hi @kennysong, thanks for your response! A language detection function like that would be helpful. I now understand that the toxicity threshold can vary depending on the language model. I'll try OpenAI's model at least once, referring to the sample code. Thank you.
Got it! What would be a useful output format for langcheck.utils.detect_language()?
This is a good point – I think it's a good idea to pin a specific version of each HuggingFace model in LangCheck where possible. Then we can control model upgrades across LangCheck versions. We can track this as a separate feature request.
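As a rough sketch of the pinning idea, Hugging Face's transformers already supports loading a model at a fixed revision (the model name is one of those mentioned above, and the commit hash is a placeholder):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder revision; LangCheck would pin the actual commit hash of
# each model it ships metrics for, so upgrades happen only deliberately.
MODEL_NAME = "line-corporation/line-distilbert-base-japanese"
REVISION = "0123abc"  # hypothetical commit hash

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, revision=REVISION)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, revision=REVISION
)
```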
@kennysong san, thank you for the suggestion. Would it be possible for langcheck.utils.detect_language() to output the detected languages and their probabilities, like {'en': 0.7, 'ja': 0.3}?
Yes, I think we can use https://github.com/pemistahl/lingua-py to output confidence scores for language detection.

I'm not quite sure how they handle confidence scores for input with multiple languages, though. We'll need to dig into that later. From your perspective, what do you expect the probabilities to be when a sentence contains both English and Japanese? {'en': 1.0, 'ja': 1.0} or {'en': 0.5, 'ja': 0.5}?
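For reference, a small sketch of lingua-py's confidence API (attribute names follow recent lingua-py versions):

```python
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.JAPANESE
).build()

# One confidence value per candidate language, highest first.
for confidence in detector.compute_language_confidence_values("languages are awesome"):
    print(confidence.language.name, confidence.value)
```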
@kennysong san, I think output similar to Lingua's detector.compute_language_confidence_values() would feel natural.

If langcheck.utils.detect_language() can identify the main language of the text along with its probability, it could serve as a basis for deciding whether to process the text with langcheck.metrics.ja.toxicity() or langcheck.metrics.en.toxicity(), as in the sketch below. Additionally, in cases where the detection values are close, we might opt not to process the text at all. Thank you.
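To make the proposed flow concrete, here is a hypothetical sketch (langcheck.utils.detect_language() does not exist yet; its {'en': 0.7, 'ja': 0.3}-style output is simply the format proposed above):

```python
import langcheck

def toxicity_for(text: str, min_confidence: float = 0.6):
    # Hypothetical API: returns e.g. {'en': 0.7, 'ja': 0.3}.
    probabilities = langcheck.utils.detect_language(text)
    language, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence < min_confidence:
        return None  # detection values too close: skip processing
    if language == "ja":
        return langcheck.metrics.ja.toxicity([text])
    return langcheck.metrics.en.toxicity([text])
```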
Sounds good, we can try the default compute_language_confidence_values() output first. I'm not sure that it'll actually return {"en": 0.5, "ja": 0.5} for a sentence with equal amounts of English and Japanese, so we should test it later. Other options are to use …
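A quick way to test the mixed-language behavior later might look like this (detect_multiple_languages_of() is lingua-py's documented API for splitting mixed input into per-language sections; the sample sentence is just an example):

```python
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.JAPANESE
).build()

mixed = "I like sushi. 寿司が好きです。"

# Whole-string confidence values: do they come out near 0.5 / 0.5?
for confidence in detector.compute_language_confidence_values(mixed):
    print(confidence.language.name, confidence.value)

# Per-section detection may be more informative for mixed input.
for result in detector.detect_multiple_languages_of(mixed):
    print(result.language.name, mixed[result.start_index:result.end_index])
```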