Chinese language misclassified #98

Open
johnbumgarner opened this issue Jan 24, 2022 · 4 comments
@johnbumgarner commented Jan 24, 2022

I use langdetect to classify the language of a website when the site does not have a lang attribute in its HTML. Occasionally langdetect misclassifies a website written in Chinese. For example, this website:

https://news.sina.com.cn/c/xl/2022-01-23/doc-ikyamrmz6973062.shtml

is classified by langdetect as Korean rather than Chinese.

This is the title of the article -- 相约北京 习近平邀世界“共同见证”_手机新浪网 (roughly, "Meet in Beijing: Xi Jinping invites the world to 'jointly witness'", followed by the Sina mobile site suffix):

import langdetect

lang_code = langdetect.detect('相约北京 习近平邀世界“共同见证”_手机新浪网')
print(lang_code)
# ko

Why does langdetect classify the language of this website as Korean and not Chinese?
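For what it's worth, langdetect's algorithm is probabilistic, so short strings can get different labels from run to run unless the random seed is pinned. A minimal sketch (assuming only that langdetect is installed) that fixes the seed and prints the full candidate list rather than a single label:

from langdetect import DetectorFactory, detect_langs

# langdetect is non-deterministic by default; pinning the seed makes
# repeated calls on the same input return the same result.
DetectorFactory.seed = 0

title = '相约北京 习近平邀世界“共同见证”_手机新浪网'
print(detect_langs(title))  # prints the candidate languages with their probabilities

On headline-length text like this the Korean profile can end up outscoring the Chinese one, which would explain the 'ko' result.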

@johnbumgarner (author) commented Jan 25, 2022

I see that this is a known issue with langdetect -- https://github.com/Mimino666/langdetect/issues?q=Chinese

Why has this issue not been resolved after 7 years?

@myfingerhurt commented Feb 6, 2023

Not even remotely close:

text = "你可以使用开源的 Python库 Requests,通过Telegram Bot发送MP3音频文件"
# "You can use the open-source Python library Requests to send MP3 audio files via a Telegram Bot"
detect(text)        # 'ca'
detect_langs(text)  # [ca:0.7142840022485835, en:0.14285863189692477, vi:0.1428560781690179]

I found a partial solution from ChatGPT; before using it you have to fix the ko profile.

import jieba
from langdetect import detect, LangDetectException


def detect_mixed_language(text):
    # Segment with jieba so CJK text is split into words, then vote by
    # detecting each word individually and returning the most common language.
    lang_count = {
        'zh-cn': 0,
        'en': 0,
        'fr': 0,
        'ja': 0,
        'ko': 0,
        'ru': 0,
        'es': 0,
    }
    for word in jieba.cut(text):
        try:
            lang = detect(word)
        except LangDetectException:
            # e.g. whitespace or punctuation that langdetect cannot score
            continue
        if lang in lang_count:
            lang_count[lang] += 1

    # Return the language with the most votes; ties go to whichever key
    # comes first in the dict (zh-cn).
    return max(lang_count, key=lang_count.get)
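For example, on the sentence above (illustrative only; the vote counts depend on the installed profiles, and the ko profile still needs the fix mentioned earlier):

text = "你可以使用开源的 Python库 Requests,通过Telegram Bot发送MP3音频文件"
print(detect_mixed_language(text))  # should lean towards 'zh-cn' for mostly-Chinese input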

@Dobatymo commented Feb 7, 2023

I suggest using pycld2 instead. It has some issues as well, but none as grave as langdetect's, IMO.
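For reference, a minimal sketch of pycld2 on the same sentence (assuming pycld2 is installed; its detect() returns a reliability flag, the number of bytes examined, and the top language guesses):

import pycld2 as cld2

text = "你可以使用开源的 Python库 Requests,通过Telegram Bot发送MP3音频文件"
is_reliable, bytes_found, details = cld2.detect(text)
# details is a tuple of (languageName, languageCode, percent, score) entries,
# ordered by how much of the text each candidate covers.
print(is_reliable, details[0])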

@myfingerhurt

Thank you @Dobatymo.
Actually, I did try pycld2, but I got stuck resolving its dependencies, so I came back to this one.
