feature request: add multi-language antispam support #1

Lopa10ko · 2024-10-03T15:09:26Z

Problem

The detector cannot determine the dominant language of a message.
Consequently, a spam message written in English is automatically converted into Cyrillic and is not flagged as spam.

Note

Homoglyph generation could be more informative if it used an existing library (e.g. https://github.com/life4/homoglyphs) and started by identifying the primary locale of the message.
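
As a rough illustration of the "identify the primary locale first" step, here is a minimal sketch that guesses the dominant script of a message from Unicode character names. It uses only the standard library and is not taken from the bot or from the homoglyphs package; the helper names `char_script` and `dominant_script` are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Optional


def char_script(ch: str) -> Optional[str]:
    """Return a coarse script label for a letter, or None for non-letters."""
    if not ch.isalpha():
        return None
    # Unicode names start with the script, e.g. 'CYRILLIC SMALL LETTER A'.
    name = unicodedata.name(ch, '')
    return name.split(' ', 1)[0] if name else None


def dominant_script(message: str) -> Optional[str]:
    """Return the most frequent script among the letters of the message."""
    counts = Counter(s for s in map(char_script, message) if s is not None)
    if not counts:
        return None  # no letters at all, e.g. digits or punctuation only
    return counts.most_common(1)[0][0]
```

With something like this in place, homoglyph replacement towards Cyrillic could be skipped when `dominant_script(message)` returns `'LATIN'`, so English text reaches the classifier unchanged.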

Reproduction

For example, at this stage, the following test fails:

import pytest

from itmo_antispam_bot.rubert_bot import SpamDetector


@pytest.mark.parametrize('message, expected', [
    ('''Hello guys, Consider we a have time series with frequency of daily data.
    What is the minimum amount data required for fedot to forecast well?''', False),
    ('''Unlock the secrets to making millions with our exclusive Crypto Masterclass!
    Learn how to turn a small investment into life-changing wealth.
    Plus, get a FREE $500 bonus just for signing up today!
    Don't miss out on this limited-time opportunity—start your journey to financial freedom now!''', True),
    ('''Раскройте секреты заработка миллионов с нашим эксклюзивным курсом по криптовалюте!
    Узнайте, как превратить небольшие инвестиции в жизнеопределяющее богатство.
    А еще получите БОНУС $500 абсолютно бесплатно при регистрации сегодня!
    Не упустите шанс начать путь к финансовой свободе прямо сейчас!''', True)
])
def test_spam_classifier(message, expected):
    classifier = SpamDetector('NeuroSpaceX/ruSpamNS_v1')
    assert classifier.classify_message(message) == expected

jrzkaminski · 2024-10-03T17:25:34Z

The bot is built for the Russian language. However, I'll consider that improvement.

Lopa10ko · 2024-10-03T18:08:19Z

> The bot is built for Russian language. However, I'll consider that improvement

I was thinking about how spammers could use the "write spam messages in English in a Russian-speaking chat" strategy :)

Kilagen · 2024-10-08T17:18:34Z

For homoglyphs there is another option: look at jumps between alphabets. Or, more simply, at jumps between Unicode code point values within a word: https://github.com/Kilagen/spam-removal-bot/blob/beta-dev/spam_detection/util.py#L39.
You can then count the words that contain more than two such jumps: https://github.com/Kilagen/spam-removal-bot/blob/beta-dev/spam_detection/mixed_abc.py

The constants, like the idea itself, were chosen by trial and error. It uses few resources. On my validation set, False Positive = 0. It works well with Russian and filters out most spam messages. But with logographic scripts, and with languages where using several alphabets is normal, it will run into trouble :/
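
The alphabet-jump idea can be sketched roughly as follows. This is not the code from the linked repository: it compares the script prefix of each letter's Unicode name instead of raw code point distances, and the function names and the default threshold argument are assumptions chosen to match the "more than two jumps" rule described above.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Coarse script label taken from the Unicode character name."""
    name = unicodedata.name(ch, '')
    return name.split(' ', 1)[0] if name else 'UNKNOWN'


def script_switches(word: str) -> int:
    """Count adjacent letter pairs that belong to different scripts."""
    scripts = [script_of(ch) for ch in word if ch.isalpha()]
    return sum(a != b for a, b in zip(scripts, scripts[1:]))


def count_suspicious_words(message: str, max_switches: int = 2) -> int:
    """Count whitespace-separated words with more than max_switches jumps."""
    return sum(script_switches(w) > max_switches for w in message.split())
```

A message could then be flagged when `count_suspicious_words` exceeds some tuned limit. As noted above, this kind of check is expected to misfire for languages that legitimately mix scripts.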

SpaceNeuroX · 2024-10-11T19:42:00Z

@Lopa10ko My model has mostly been trained on Russian-language data, but I plan to add more English examples to the dataset in the future. In the meantime, using translation APIs to process English data could be an option, though it's not a final solution. You can now use my model’s API, though it will likely be slower since it first translates the text into English. To install the ruSpam library, run pip install ruSpam. Usage details are available on my GitHub or in the PyPI repository. Once I have the opportunity, I will start gathering a more suitable dataset for the English language.

jrzkaminski · 2024-10-11T21:34:49Z

> @Lopa10ko My model has mostly been trained on Russian-language data, but I plan to add more English examples to the dataset in the future. In the meantime, using translation APIs for English data processing could be an option, though it's not a final solution. You can now use my model’s API, though it will likely be slower since it first translates the text into English. Download the ruSpam library by running pip install ruSpam. Usage details are available on my GitHub or in the PyPi repository. Once I have the opportunity, I will start gathering a more suitable dataset for the English language.

In the future I plan to generalize the code to accept any spam classification model as an input. It would make it a better tool, I suppose.
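
One possible shape for such a generalization, sketched under assumptions: the bot depends only on an interface with a `classify_message` method (the method name used in the test above), and any model that provides it can be plugged in. `SpamClassifier`, `KeywordClassifier`, and `AntispamFilter` are hypothetical names for illustration, not part of the existing code.

```python
from typing import Iterable, Protocol


class SpamClassifier(Protocol):
    """Anything with a classify_message(str) -> bool method qualifies."""

    def classify_message(self, message: str) -> bool: ...


class KeywordClassifier:
    """Toy stand-in model, used here only to show the interface."""

    def __init__(self, keywords: Iterable[str]) -> None:
        self.keywords = [k.lower() for k in keywords]

    def classify_message(self, message: str) -> bool:
        text = message.lower()
        return any(k in text for k in self.keywords)


class AntispamFilter:
    """Wraps whichever classifier it is given; knows nothing about the model."""

    def __init__(self, classifier: SpamClassifier) -> None:
        self.classifier = classifier

    def is_spam(self, message: str) -> bool:
        return self.classifier.classify_message(message)
```

For example, `AntispamFilter(KeywordClassifier(['free $500']))` and a filter built around a transformer-based detector would be used identically by the rest of the bot.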
