feature request: add multi-language antispam support #1

Open
Lopa10ko opened this issue Oct 3, 2024 · 5 comments

Comments

Lopa10ko commented Oct 3, 2024

Problem

The detector cannot determine the dominant language of a message.
Consequently, a spam message written in English is automatically converted into Cyrillic and is not flagged as spam.

Note

Homoglyph generation could be more informative with an existing library (e.g. https://github.com/life4/homoglyphs), starting by identifying the primary locale of the message.
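
To illustrate the "identify the primary locale first" step, here is a minimal sketch that guesses the dominant script of a message from Unicode character names. It uses only the standard library, not the homoglyphs package; the function name `dominant_script` and the three labels are assumptions made for this example, not part of the bot.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Guess the dominant script of a message: 'CYRILLIC', 'LATIN' or 'OTHER'.

    Hypothetical helper: counts alphabetic characters by the script prefix
    of their Unicode name and returns the most frequent one.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and whitespace
        name = unicodedata.name(ch, '')
        if name.startswith('CYRILLIC'):
            counts['CYRILLIC'] += 1
        elif name.startswith('LATIN'):
            counts['LATIN'] += 1
        else:
            counts['OTHER'] += 1
    if not counts:
        return 'OTHER'  # no letters at all
    return counts.most_common(1)[0][0]
```

With something like this, the Latin-to-Cyrillic conversion could be skipped when the message is predominantly Latin, so that English text reaches the classifier unchanged. Whether that is enough to flag English spam still depends on the model.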

Reproduction

For example, at this stage, the following test fails:

import pytest

from itmo_antispam_bot.rubert_bot import SpamDetector


@pytest.mark.parametrize('message, expected', [
    ('''Hello guys, Consider we a have time series with frequency of daily data.
    What is the minimum amount data required for fedot to forecast well?''', False),
    ('''Unlock the secrets to making millions with our exclusive Crypto Masterclass!
    Learn how to turn a small investment into life-changing wealth.
    Plus, get a FREE $500 bonus just for signing up today!
    Don't miss out on this limited-time opportunity—start your journey to financial freedom now!''', True),
    ('''Раскройте секреты заработка миллионов с нашим эксклюзивным курсом по криптовалюте!
    Узнайте, как превратить небольшие инвестиции в жизнеопределяющее богатство.
    А еще получите БОНУС $500 абсолютно бесплатно при регистрации сегодня!
    Не упустите шанс начать путь к финансовой свободе прямо сейчас!''', True)
])
def test_spam_classifier(message, expected):
    classifier = SpamDetector('NeuroSpaceX/ruSpamNS_v1')
    assert classifier.classify_message(message) == expected

jrzkaminski (Owner) commented

The bot is built for the Russian language. However, I'll consider that improvement.

Lopa10ko (Author) commented Oct 3, 2024

> The bot is built for the Russian language. However, I'll consider that improvement

I was thinking about how spammers could use the "write spam messages in English in a Russian-speaking chat" strategy :)

Kilagen commented Oct 8, 2024

There is another approach to homoglyphs: look at switches between alphabets. Or, more simply, at jumps between Unicode code point values within a word: https://github.com/Kilagen/spam-removal-bot/blob/beta-dev/spam_detection/util.py#L39.
You can then count the number of words that contain more than two such jumps: https://github.com/Kilagen/spam-removal-bot/blob/beta-dev/spam_detection/mixed_abc.py

The constants, like the idea itself, were chosen by trial and error. It uses few resources. On my validation set, False Positive = 0. It works well with Russian and filters out most spam messages. But with logographic scripts, as well as languages where using several alphabets is normal, it will run into trouble :/
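
A rough sketch of the jump-counting idea, for readers who do not want to open the links. This is not the code from the linked repository: the function names and the `min_gap` value of 128 are assumptions chosen for illustration, and only the "more than two jumps per word" rule comes from the comment above.

```python
def count_jumps(word: str, min_gap: int = 128) -> int:
    """Count adjacent character pairs whose code points differ by >= min_gap.

    min_gap is a hypothetical constant: large enough that letters inside the
    basic Russian or basic Latin ranges never trigger a jump, small enough
    that a Latin/Cyrillic switch always does.
    """
    return sum(
        1
        for left, right in zip(word, word[1:])
        if abs(ord(left) - ord(right)) >= min_gap
    )


def count_mixed_words(text: str, max_jumps: int = 2) -> int:
    """Count whitespace-separated words with more than max_jumps jumps."""
    return sum(1 for word in text.split() if count_jumps(word) > max_jumps)
```

A Cyrillic word with one trailing ASCII punctuation mark produces a single jump, which is presumably why the threshold sits above one or two rather than at zero. A message-level decision would still need a cutoff on `count_mixed_words`, which this sketch leaves open.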

SpaceNeuroX commented Oct 11, 2024

@Lopa10ko My model has mostly been trained on Russian-language data, but I plan to add more English examples to the dataset in the future. In the meantime, using translation APIs to process English data could be an option, though it's not a final solution. You can now use my model's API, though it will likely be slower since it first translates the text into English. To install the ruSpam library, run pip install ruSpam. Usage details are available on my GitHub or on PyPI. Once I have the opportunity, I will start gathering a more suitable dataset for English.

jrzkaminski (Owner) commented Oct 11, 2024

> @Lopa10ko My model has mostly been trained on Russian-language data, but I plan to add more English examples to the dataset in the future. In the meantime, using translation APIs to process English data could be an option, though it's not a final solution. You can now use my model's API, though it will likely be slower since it first translates the text into English. To install the ruSpam library, run pip install ruSpam. Usage details are available on my GitHub or on PyPI. Once I have the opportunity, I will start gathering a more suitable dataset for English.

In the future I plan to generalize the code to accept any spam classification model as an input. It would make it a better tool, I suppose.
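
One possible shape for that generalization, as a sketch only: the detector depends on a small interface rather than on a specific model. `SpamModel`, `KeywordModel` and `PluggableSpamDetector` are hypothetical names invented for this example; only `classify_message` mirrors the method used in the test above.

```python
from typing import Iterable, Protocol


class SpamModel(Protocol):
    """Hypothetical interface: anything with predict(text) -> bool."""

    def predict(self, text: str) -> bool: ...


class KeywordModel:
    """Toy stand-in model, used here only to show that models are swappable."""

    def __init__(self, keywords: Iterable[str]) -> None:
        self.keywords = [keyword.lower() for keyword in keywords]

    def predict(self, text: str) -> bool:
        lowered = text.lower()
        return any(keyword in lowered for keyword in self.keywords)


class PluggableSpamDetector:
    """Detector that delegates the spam decision to any SpamModel."""

    def __init__(self, model: SpamModel) -> None:
        self.model = model

    def classify_message(self, message: str) -> bool:
        return self.model.predict(message)


detector = PluggableSpamDetector(KeywordModel(['free $500', 'masterclass']))
is_spam = detector.classify_message('Get a FREE $500 bonus today!')
```

A wrapper around a Russian-trained model and another around an English-trained one could then both satisfy `SpamModel`, and the detector could pick between them based on the detected language.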
