Chinese language misclassified #98

Open
johnbumgarner opened this issue Jan 24, 2022 · 4 comments
@johnbumgarner commented Jan 24, 2022

I use langdetect to classify the language of a website when the site does not have a lang attribute in its HTML. Occasionally langdetect misclassifies a website written in Chinese. For example, this website:

https://news.sina.com.cn/c/xl/2022-01-23/doc-ikyamrmz6973062.shtml

is classified by langdetect as Korean rather than Chinese.

This is the title of the article -- 相约北京 习近平邀世界“共同见证”_手机新浪网 (roughly, "Meet in Beijing: Xi Jinping invites the world to 'jointly witness'", followed by the Sina mobile site suffix):

import langdetect

lang_code = langdetect.detect('相约北京 习近平邀世界“共同见证”_手机新浪网')
print(lang_code)
# ko

Why does langdetect classify the language of this website as Korean and not Chinese?
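For what it's worth, langdetect's algorithm is probabilistic, so short strings can get different labels from run to run unless the random seed is pinned. A minimal sketch (assuming only that langdetect is installed) that fixes the seed and prints the full candidate list rather than a single label:

from langdetect import DetectorFactory, detect_langs

# langdetect is non-deterministic by default; pinning the seed makes
# repeated calls on the same input return the same result.
DetectorFactory.seed = 0

title = '相约北京 习近平邀世界“共同见证”_手机新浪网'
print(detect_langs(title))  # prints the candidate languages with their probabilities

On headline-length text like this the Korean profile can end up outscoring the Chinese one, which would explain the 'ko' result.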

@johnbumgarner (author) commented Jan 25, 2022

I see that this is a known issue with langdetect -- https://github.com/Mimino666/langdetect/issues?q=Chinese

Why has this issue not been resolved after 7 years?

@myfingerhurt commented Feb 6, 2023

Not even remotely close:

text = "你可以使用开源的 Python库 Requests,通过Telegram Bot发送MP3音频文件"
# "You can use the open-source Python library Requests to send MP3 audio files via a Telegram Bot"
detect(text)        # 'ca'
detect_langs(text)  # [ca:0.7142840022485835, en:0.14285863189692477, vi:0.1428560781690179]

I found a partial solution from ChatGPT; before using it you have to fix the ko profile.

import jieba
from langdetect import detect, LangDetectException


def detect_mixed_language(text):
    # Segment with jieba so CJK text is split into words, then vote by
    # detecting each word individually and returning the most common language.
    lang_count = {
        'zh-cn': 0,
        'en': 0,
        'fr': 0,
        'ja': 0,
        'ko': 0,
        'ru': 0,
        'es': 0,
    }
    for word in jieba.cut(text):
        try:
            lang = detect(word)
        except LangDetectException:
            # e.g. whitespace or punctuation that langdetect cannot score
            continue
        if lang in lang_count:
            lang_count[lang] += 1

    # Return the language with the most votes; ties go to whichever key
    # comes first in the dict (zh-cn).
    return max(lang_count, key=lang_count.get)
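For example, on the sentence above (illustrative only; the vote counts depend on the installed profiles, and the ko profile still needs the fix mentioned earlier):

text = "你可以使用开源的 Python库 Requests,通过Telegram Bot发送MP3音频文件"
print(detect_mixed_language(text))  # should lean towards 'zh-cn' for mostly-Chinese input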

@Dobatymo commented Feb 7, 2023

I suggest using pycld2 instead. It has some issues as well, but none as grave as langdetect's, IMO.
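For reference, a minimal sketch of pycld2 on the same sentence (assuming pycld2 is installed; its detect() returns a reliability flag, the number of bytes examined, and the top language guesses):

import pycld2 as cld2

text = "你可以使用开源的 Python库 Requests,通过Telegram Bot发送MP3音频文件"
is_reliable, bytes_found, details = cld2.detect(text)
# details is a tuple of (languageName, languageCode, percent, score) entries,
# ordered by how much of the text each candidate covers.
print(is_reliable, details[0])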

@myfingerhurt

Thank you @Dobatymo.
Actually, I did try pycld2, but I got stuck resolving its dependencies, so I came back to this one.
