Create langcheck.utils.detect_language() #67

Hello, I have a question about the following test code. Thank you in advance.

Comments
Hi @syamaco, thanks for the question!

I think it makes sense to include a langcheck.utils.detect_language() function for this.

Currently, English toxicity (detoxify) and Japanese toxicity (a fine-tuned line-distilbert-base-japanese) are completely different models, so it may not be optimal to set a single threshold for both. If you use OpenAI to compute toxicity, I believe it uses the exact same model and prompt for both languages, so you might be able to set a single threshold in that case. @liwii, if you have any suggestions, feel free to chime in!
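For illustration, a minimal sketch of what per-language thresholds could look like (the threshold values are placeholders, and the `metric_values` attribute is assumed from LangCheck's MetricValue interface; exact attribute names may differ):

```python
import langcheck

# The two languages use different underlying models, so each gets
# its own (placeholder) threshold rather than a single shared one.
EN_TOXICITY_THRESHOLD = 0.5
JA_TOXICITY_THRESHOLD = 0.7

en_result = langcheck.metrics.toxicity(["Some English output to check."])
ja_result = langcheck.metrics.ja.toxicity(["チェックしたい日本語の出力。"])

print(en_result.metric_values[0] > EN_TOXICITY_THRESHOLD)
print(ja_result.metric_values[0] > JA_TOXICITY_THRESHOLD)
```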
Hi @kennysong, thanks for your response! A language detection function like that would be helpful. I now understand that the toxicity threshold can vary depending on the language model. I'll try OpenAI's model at least once, referring to the sample code. Thank you.
Got it! What would be a useful output format for langcheck.utils.detect_language()?
This is a good point – I think it's a good idea to pin a specific version of each HuggingFace model in LangCheck where possible. Then we can control model upgrades across LangCheck versions. We can track this as a separate feature request.
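As a rough sketch of the pinning idea, Hugging Face's transformers already supports loading a model at a fixed revision (the model name is one of those mentioned above, and the commit hash is a placeholder):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder revision; LangCheck would pin the actual commit hash of
# each model it ships metrics for, so upgrades happen only deliberately.
MODEL_NAME = "line-corporation/line-distilbert-base-japanese"
REVISION = "0123abc"  # hypothetical commit hash

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, revision=REVISION)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, revision=REVISION
)
```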
@kennysong san, thank you for the suggestion. Would it be possible for langcheck.utils.detect_language() to output the detected languages and their probabilities, like {'en': 0.7, 'ja': 0.3}?
Yes, I think we can use https://github.com/pemistahl/lingua-py to output confidence scores for language detection.

I'm not quite sure how they handle confidence scores for input with multiple languages, though. We'll need to dig into that later. From your perspective, what do you expect the probabilities to be when a sentence contains both English and Japanese? {'en': 1.0, 'ja': 1.0} or {'en': 0.5, 'ja': 0.5}?
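For reference, a small sketch of lingua-py's confidence API (attribute names follow recent lingua-py versions):

```python
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.JAPANESE
).build()

# One confidence value per candidate language, highest first.
for confidence in detector.compute_language_confidence_values("languages are awesome"):
    print(confidence.language.name, confidence.value)
```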
@kennysong san, I think output similar to Lingua's detector.compute_language_confidence_values() would feel natural.

If langcheck.utils.detect_language() can identify the main language of the text along with its probability, it could serve as a basis for deciding whether to process the text with langcheck.metrics.ja.toxicity() or langcheck.metrics.en.toxicity(), as in the sketch below. Additionally, in cases where the detection values are close, we might opt not to process the text at all. Thank you.
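To make the proposed flow concrete, here is a hypothetical sketch (langcheck.utils.detect_language() does not exist yet; its {'en': 0.7, 'ja': 0.3}-style output is simply the format proposed above):

```python
import langcheck

def toxicity_for(text: str, min_confidence: float = 0.6):
    # Hypothetical API: returns e.g. {'en': 0.7, 'ja': 0.3}.
    probabilities = langcheck.utils.detect_language(text)
    language, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence < min_confidence:
        return None  # detection values too close: skip processing
    if language == "ja":
        return langcheck.metrics.ja.toxicity([text])
    return langcheck.metrics.en.toxicity([text])
```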
Sounds good, we can try the default compute_language_confidence_values() output first. I'm not sure that it'll actually return {"en": 0.5, "ja": 0.5} for a sentence with equal amounts of English and Japanese, so we should test it later. Other options are to use …
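A quick way to test the mixed-language behavior later might look like this (detect_multiple_languages_of() is lingua-py's documented API for splitting mixed input into per-language sections; the sample sentence is just an example):

```python
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.JAPANESE
).build()

mixed = "I like sushi. 寿司が好きです。"

# Whole-string confidence values: do they come out near 0.5 / 0.5?
for confidence in detector.compute_language_confidence_values(mixed):
    print(confidence.language.name, confidence.value)

# Per-section detection may be more informative for mixed input.
for result in detector.detect_multiple_languages_of(mixed):
    print(result.language.name, mixed[result.start_index:result.end_index])
```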