feature request: add multi-language antispam support #1
Comments
The bot is built for the Russian language. However, I'll consider that improvement.
I was thinking about how spammers could use the "write spam messages in English in a Russian-speaking chat" strategy :)
For homoglyphs there is another solution: look at switches between alphabets. Or, more simply, at jumps between Unicode values within a word: https://github.com/Kilagen/spam-removal-bot/blob/beta-dev/spam_detection/util.py#L39. The constants, like the idea itself, were chosen by trial and error. It uses few resources. On my validation set, False Positive = 0. It works well with Russian and filters out most spam messages. But with ideographic characters, as well as languages where using several alphabets is normal, it will run into trouble :/
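A minimal sketch of the "jumps between Unicode values within a word" idea described above. This is not the code from the linked `util.py`; the function names and the threshold are hypothetical and chosen only for illustration. It relies on the fact that Russian Cyrillic letters and basic Latin letters sit several hundred code points apart, while letters within either alphabet sit close together.

```python
# Hypothetical threshold for illustration; not the constant from the linked util.py.
JUMP_THRESHOLD = 500


def max_codepoint_jump(word: str) -> int:
    """Return the largest code point gap between adjacent letters of a word."""
    # Ignore punctuation and digits so that "привет!" is not flagged.
    letters = [ord(ch) for ch in word if ch.isalpha()]
    return max((abs(a - b) for a, b in zip(letters, letters[1:])), default=0)


def has_alphabet_jump(text: str, threshold: int = JUMP_THRESHOLD) -> bool:
    """Flag text in which any word mixes letters from distant Unicode ranges."""
    return any(max_codepoint_jump(word) > threshold for word in text.split())


# Latin "p" substituted for the Cyrillic letter that looks the same.
mixed = "п" + "p" + "ивет"
print(has_alphabet_jump("привет, мир"))  # pure Cyrillic
print(has_alphabet_jump(mixed))          # mixed alphabets inside one word
```

As the comment notes, a check like this would misfire on scripts whose code points are naturally spread out, or on languages that routinely mix alphabets within a word.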
@Lopa10ko My model has mostly been trained on Russian-language data, but I plan to add more English examples to the dataset in the future. In the meantime, using a translation API to process English data could be an option, though it's not a final solution. You can already use my model's API, though it will likely be slower since it first translates the text into English. Install the ruSpam library by running pip install ruSpam. Usage details are available on my GitHub or on PyPI. Once I have the opportunity, I will start gathering a more suitable dataset for the English language.
In the future I plan to generalize the code to accept any spam classification model as an input. It would make it a better tool, I suppose.
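One way such a generalization could look, as a rough sketch only: the bot depends on a small interface instead of a concrete model. `SpamClassifier`, `KeywordClassifier`, and `should_remove` are hypothetical names and do not come from the bot's code or from ruSpam.

```python
from typing import Iterable, Protocol


class SpamClassifier(Protocol):
    """Hypothetical interface: anything with predict(text) -> bool fits."""

    def predict(self, text: str) -> bool:
        ...


class KeywordClassifier:
    """Toy stand-in for a real model, used only to show the plug-in point."""

    def __init__(self, keywords: Iterable[str]) -> None:
        self.keywords = {k.lower() for k in keywords}

    def predict(self, text: str) -> bool:
        lowered = text.lower()
        return any(k in lowered for k in self.keywords)


def should_remove(message: str, classifier: SpamClassifier) -> bool:
    """The bot only sees the interface, so models can be swapped freely."""
    return classifier.predict(message)


print(should_remove("Free crypto giveaway", KeywordClassifier(["crypto"])))
```

With this shape, a Russian-only model, a translation-backed model, or an English model could each be wrapped in a class exposing `predict` and passed in without changing the bot itself.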
Problem
The detector cannot determine the prevailing language of a message.
Consequently, if a spam message is written in English, it is automatically converted into Cyrillic and is not flagged as spam.
Note
Homoglyph generation could be better informed by relying on an existing framework (e.g. https://github.com/life4/homoglyphs), starting with identifying the primary locale of the message.
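A rough sketch of the "identify the primary locale first" step, using only the Python standard library rather than the homoglyphs package linked above. It approximates a character's script by the first word of its Unicode name (for example `LATIN` or `CYRILLIC`), which is a heuristic and an assumption of this example; `dominant_script` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter
from typing import Optional


def dominant_script(text: str) -> Optional[str]:
    """Return the most common script among the letters of text, or None."""
    counts = Counter(
        # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("Buy cheap crypto now"))
print(dominant_script("Купи крипту"))
```

If the dominant script is Latin, the message could be routed past the Latin-to-Cyrillic homoglyph conversion, so that English spam is judged as English text instead of being rewritten first.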
Reproduction
For example, at this stage, the following test fails: