Franc is providing wrong language #519

Zodiac1978 · 2023-08-04T14:12:58Z

Describe the bug
If I use the content from Line 962/F provided via our report form, franc reports sco (Scots) instead of English.

Therefore the language check marks the comment as spam. (Needs reproducing.)

Maybe we need to narrow the languages down.
See: https://github.com/wooorm/franc#options

First reported in the forums: https://wordpress.org/support/topic/what-to-do-about-false-positives/


Zodiac1978 · 2023-08-19T20:37:32Z

Maybe we should add a checkbox to our form asking whether the report is for a false positive and, if so, which spam reason was set. This would help us find the method that is the culprit.

Zodiac1978 · 2023-11-14T10:36:44Z

> Maybe we should add a checkbox to our form asking whether the report is for a false positive and, if so, which spam reason was set. This would help us find the method that is the culprit.

I have added this in the meantime:
https://docs.google.com/forms/d/e/1FAIpQLSeQlKVZZYsF1qkKz7U78B2wy_6s6I7aNSdQc-DGpjeqWx70-A/viewform?c=0&w=1

About narrowing down the languages, maybe @MatzeKitt can chime in here and help with the API side.

MatzeKitt · 2023-11-14T11:11:57Z

Since franc by default returns a list of potentially matching languages, the easiest way would be to just return a list of potential languages instead of a single language. That way we can determine on the ASB side whether the expected language is matched (we also receive a score for how well the string matches each language and can work with a threshold in ASB).

MatzeKitt · 2023-12-16T19:08:12Z

Would look like this:

franc -a "Das ist ein Test"
src 1
deu 0.9825403753819293
est 0.8485377564382366
glg 0.7952859013531209
fin 0.7206460061108686
tzm 0.6983849847228285
nld 0.6975120034919249
por 0.6900916630292449
nds 0.6700130947184635
ind 0.624181580096028
…
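
For experimenting with this output, the lines can be turned into (code, score) pairs. This is only a minimal sketch: `parse_franc_all` is a hypothetical helper, not part of franc, and it assumes every result line has the form "<code> <score>" as in the listing above. The API itself would presumably call the library directly rather than parse CLI output.

```python
def parse_franc_all(output: str) -> list[tuple[str, float]]:
    """Parse "<code> <score>" lines, as printed by `franc -a`, into pairs."""
    candidates: list[tuple[str, float]] = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and the truncation marker "…"
        code, raw_score = parts
        try:
            candidates.append((code, float(raw_score)))
        except ValueError:
            continue  # skip lines whose second field is not a number
    return candidates


# First lines of the output quoted above; the rest is elided there as well.
sample_output = """\
src 1
deu 0.9825403753819293
est 0.8485377564382366
…
"""
candidates = parse_franc_all(sample_output)
print(candidates)
```

Note that in this sample the top-ranked code is src, not deu, which is the same kind of near miss as the Scots/English case reported above.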

Zodiac1978 · 2024-02-13T09:24:10Z

Hey @MatzeKitt

as this is a different approach from narrowing down the languages on the API side (configured through the allowed languages in WP/ASB), why are you suggesting it? What is its advantage over the other approach?

And how complicated would it be to implement one of them?

MatzeKitt · 2024-02-13T10:11:50Z

It is more flexible, since you don’t need to manage a list of languages and test only against them; you simply need to set a threshold. This would also make the code less complex.

The complexity of my solution is not high:

Define a threshold
Adjust the API to allow returning multiple languages
Filter the languages by the defined threshold

Your proposed solution:

Define languages to check (editable? then we need an option for that)
Adjust the API to allow limiting languages
Send the additional settings to the API

The first point in particular can be more complex, since you need to either define it for all available languages or add an additional option, which makes it harder to explain to the user what to do (especially since there already is an option with languages).
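
The threshold variant listed above could look roughly like the following on the ASB side. This is a sketch under assumptions, not ASB code: the function names, the default threshold of 0.9, and the shape of `candidates` (pairs of language code and score, as in the `franc -a` output) are all hypothetical.

```python
def matching_languages(
    candidates: list[tuple[str, float]],
    allowed: set[str],
    threshold: float,
) -> list[str]:
    """Return the allowed language codes whose score reaches the threshold."""
    return [
        code
        for code, score in candidates
        if code in allowed and score >= threshold
    ]


def passes_language_check(
    candidates: list[tuple[str, float]],
    allowed: set[str],
    threshold: float = 0.9,  # hypothetical default, would need tuning
) -> bool:
    """True if at least one allowed language is a close enough match."""
    return len(matching_languages(candidates, allowed, threshold)) > 0


# Scores taken from the "Das ist ein Test" example in this thread.
candidates = [
    ("src", 1.0),
    ("deu", 0.9825403753819293),
    ("est", 0.8485377564382366),
]
print(passes_language_check(candidates, {"deu"}))        # True
print(passes_language_check(candidates, {"deu"}, 0.99))  # False
```

With a threshold of 0.9 the German comment passes even though deu is not the top result; how strict the threshold should be is exactly the open tuning question.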

Zodiac1978 · 2024-03-07T07:57:16Z

> Define languages to check (editable? then we need an option for that)

Isn't this already there?
https://antispambee.pluginkollektiv.org/documentation/#allow-comments-only-in-certain-language

> (especially since there already is an option with languages)

Why do you think this needs to be an additional setting?

> Adjust the API to allow limiting languages

There is an "only" option available:
https://github.com/wooorm/franc#options

MatzeKitt · 2024-03-08T20:29:32Z

A list of languages you want is completely different from the list of languages franc should be able to detect, at least from my point of view. I don't know how franc behaves if you, for example, define only English and German but submit a French text. Since you would only allow English or German, I assume that franc would only output these two languages, both with a relatively low score. That wouldn’t help at all, so you would also need to define similar languages (e.g. Scots should behave the same as English, etc.).

From my point of view, limiting franc in its detection would not help us here at all.

Zodiac1978 · 2024-11-19T11:29:37Z

My thought was to eliminate variants (Scots vs. English, Swiss vs. German, etc.), not typical widespread languages like French in your example. But I see the complexity of this idea.

My first step would be to check which franc package we are using. There are three, based on the number of speakers:

82, 187, or 419 languages

82 -> 8 million or more speakers
187 -> 1 million or more speakers
419 -> all possible languages

But I am fine with whatever works best. I hoped this would be a small change (franc to franc-min, for example) that could fix the reported issue.

> Define a threshold
> Adjust the API to allow returning multiple languages
> Filter the languages by the defined threshold

Going this route means, if I understand the workflow correctly, that we need a second version of the API to distinguish between v1 requests (only one language returned, as now) and v2 requests (multiple languages returned), and we need to build the corresponding part in ASB to react accordingly. Correct?
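
To make the v1/v2 distinction concrete, here is a rough sketch of how one handler could serve both versions. Everything in it is an assumption for illustration: the real API's routing and response format are not shown in this thread, and the field names "code", "languages", and "score" are hypothetical. The fallback "und" is the code franc uses for undetermined input.

```python
def build_response(
    candidates: list[tuple[str, float]],
    api_version: int,
) -> dict:
    """Build a response body for the given (hypothetical) API version."""
    if api_version == 1:
        # v1: only the single best match, as the API behaves today.
        top_code = candidates[0][0] if candidates else "und"
        return {"code": top_code}
    if api_version == 2:
        # v2: every candidate with its score, so ASB can apply a threshold.
        return {
            "languages": [
                {"code": code, "score": score} for code, score in candidates
            ]
        }
    raise ValueError(f"unsupported API version: {api_version}")


candidates = [("src", 1.0), ("deu", 0.9825403753819293)]
print(build_response(candidates, 1))  # {'code': 'src'}
print(build_response(candidates, 2))
```

Older ASB installs would keep calling v1 unchanged, while updated installs would request v2 and do the filtering themselves.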

MatzeKitt · 2024-11-20T15:08:10Z

I’m not a fan of managing the languages available in franc, especially since we don’t know all variants (of course, we know variants of European languages, but what about variants of e.g. African languages?). So it wouldn’t be a one-time job.

My alternative is a one-time change. And yes, we would need a second version for the API, but that’s a no-brainer and done in under an hour.

The main work needs to be done in ASB itself, but I also consider this relatively light work. I could also implement it myself if we agree on this solution.

stklcode · 2024-11-26T19:15:27Z

Language management or mapping can be a pain, or at least require a huge amount of maintenance at scale. Yes, you can probably cover a great majority of problems with a handful or two of languages (or language families), but definitely not all.

+1 for a language list (maybe capped at 3, 5, or whatever reasonable number of results) and a threshold.

Thinking further, one might be able to build a rule that explicitly covers mixed-language comments (e.g. badly auto-translated stuff) or "no language" (emojis or just random or encoded content, nothing with sufficient confidence), but that's just a quick thought.
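
A capped list combined with a threshold and a "no language" outcome could be sketched as below. The function name, the three result labels, and the defaults (cap of 3, threshold of 0.9) are hypothetical. The sketch treats franc's "und" (undetermined) code as "no language" and does not attempt the mixed-language rule.

```python
def classify_comment(
    candidates: list[tuple[str, float]],
    allowed: set[str],
    threshold: float = 0.9,  # hypothetical default
    max_results: int = 3,    # hypothetical cap on considered candidates
) -> str:
    """Return "allowed", "not-allowed", or "no-language" for one comment."""
    # Keep only the best few candidates, highest score first.
    top = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:max_results]
    # "und" means franc could not determine a language at all.
    confident = [
        code for code, score in top if code != "und" and score >= threshold
    ]
    if not confident:
        return "no-language"
    if any(code in allowed for code in confident):
        return "allowed"
    return "not-allowed"


candidates = [
    ("src", 1.0),
    ("deu", 0.9825403753819293),
    ("est", 0.8485377564382366),
    ("glg", 0.7952859013531209),
]
print(classify_comment(candidates, {"deu"}))      # allowed
print(classify_comment(candidates, {"eng"}))      # not-allowed
print(classify_comment([("und", 1.0)], {"eng"}))  # no-language
```

Whether a "no-language" comment should count as spam or be let through is a separate policy decision that this sketch leaves open.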

Zodiac1978 added the bug label Aug 4, 2023
