Releases: hyunwoongko/kss

v6.0.4

30 Apr 18:54
  • Reimplement the hanja module because it could not be installed in the Colab environment.
  • Fix information about the morpheme analyzer backend in the README and docs.

v6.0.2

28 Apr 14:11
  • Add alias() function and fix some docs.

v6.0.1

28 Apr 13:44
  • [hotfix] Rename idiom.txt in MANIFEST.in to idioms.txt

v6.0.0

27 Apr 14:23
91355e2

KSS: Korean String processing Suite

KSS is a Korean string processing suite that provides a range of functions for processing Korean strings. It is designed to be simple to use and is intended for fields such as natural language processing, data preprocessing, and data analysis.

Usage

1. Basic Usage

All functions can be used by creating an instance of the Kss class and calling the instance with the inputs.

from kss import Kss

module = Kss("MODULE_NAME")
output = module("YOUR_INPUT_STRING", **kwargs)

2. Available Modules

If you want to check the available modules, you can use the available() function.

from kss import Kss

Kss.available()
['augment', 'collocate', 'g2p', 'hangulize', 'split_hanja', 'is_hanja', 'hanja2hangul', 'h2j', 'h2hcj', 'j2h', 'j2hcj', 'hcj2h', 'hcj2j', 'is_jamo', 'is_jamo_modern', 'is_hcj', 'is_hcj_modern', 'is_hangul_char', 'select_josa', 'combine_josa', 'extract_keywords', 'split_morphemes', 'paradigm', 'anonymize', 'clean_news', 'is_completed_form', 'get_all_completed_form_hangul_chars', 'get_all_incompleted_form_hangul_chars', 'filter_out', 'half2full', 'reduce_char_repeats', 'reduce_emoticon_repeats', 'remove_invisible_chars', 'normalize', 'preprocess', 'qwerty', 'romanize', 'is_unsafe', 'split_sentences', 'correct_spacing', 'summarize_sentences']

3. Checking the usage of each module

If you want to check the usage of each module, you can use the help() function.

from kss import Kss

module = Kss("split_sentences")
module.help()
Split texts into sentences.

Args:
    text (Union[str, List[str], Tuple[str]]): single text or list/tuple of texts
    backend (str): morpheme analyzer backend. 'mecab', 'pecab', 'punct' are supported
    num_workers (Union[int, str]): the number of multiprocessing workers
    strip (bool): strip all sentences or not
    return_morphemes (bool): whether to return morphemes or not
    ignores (List[str]): list of strings to ignore

Returns:
    Union[List[str], List[List[str]]]: outputs of sentence splitting

Examples:
    >>> from kss import Kss
    >>> split_sentences = Kss("split_sentences")
    >>> text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
    >>> split_sentences(text)
    ['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요', '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다', '강남역 맛집 토끼정의 외부 모습.']

4. Multiprocessing

If you pass a list of strings, Kss automatically uses multiprocessing to process them in parallel.
You can set the number of processes with the num_workers parameter.
If num_workers is less than 2, Kss does not use multiprocessing.

from kss import Kss

module = Kss("MODULE_NAME")

# using all cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], **kwargs)
# using 4 cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=4, **kwargs)
# using 1 core (no multiprocessing)
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=1, **kwargs)
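The dispatch rule described above (parallel for lists, sequential when num_workers is below 2) can be illustrated with a small standalone sketch. This is not Kss's implementation: the function names `shout` and `process` and the `"auto"` default are assumptions made for the example, and it uses only the standard library.

```python
from multiprocessing import Pool, cpu_count
from typing import Callable, List, Sequence, Union


def shout(text: str) -> str:
    """Toy per-string task standing in for a real module (hypothetical)."""
    return text.upper()


def process(
    texts: Union[str, Sequence[str]],
    fn: Callable[[str], str] = shout,
    num_workers: Union[int, str] = "auto",  # "auto" is an assumed default
) -> Union[str, List[str]]:
    # A single string is processed directly, with no worker pool.
    if isinstance(texts, str):
        return fn(texts)

    # Resolve the worker count: all cores for "auto", otherwise the given number.
    workers = cpu_count() if num_workers == "auto" else int(num_workers)

    # Fewer than 2 workers (or fewer than 2 items) means plain sequential processing.
    if workers < 2 or len(texts) < 2:
        return [fn(t) for t in texts]

    # Otherwise spread the items across a pool of worker processes.
    with Pool(processes=workers) as pool:
        return pool.map(fn, list(texts))
```

On platforms that start worker processes by spawning a fresh interpreter, calls that reach the pool should be made under an `if __name__ == "__main__":` guard.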

5. Backward Compatibility

Older versions of Kss exposed a function-based API. Kss still supports it for backward compatibility.

from kss import split_sentences

output = split_sentences("YOUR_INPUT_STRING", **kwargs)
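One way a class-based entry point and a function-based one can coexist is to have the function delegate to the class. The sketch below shows that pattern in isolation. It is a hypothetical illustration, not Kss's code: `Suite`, `register`, and `reverse_text` are names invented for the example.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping module names to plain functions.
_REGISTRY: Dict[str, Callable[..., object]] = {}


def register(name: str) -> Callable[[Callable[..., object]], Callable[..., object]]:
    """Decorator that stores a function in the registry under `name`."""
    def decorator(fn: Callable[..., object]) -> Callable[..., object]:
        _REGISTRY[name] = fn
        return fn
    return decorator


class Suite:
    """Stand-in for a name-based callable wrapper (not the real Kss class)."""

    def __init__(self, name: str) -> None:
        if name not in _REGISTRY:
            raise ValueError(f"unknown module: {name!r}")
        self._fn = _REGISTRY[name]

    def __call__(self, text: str, **kwargs: object) -> object:
        # Calling the instance forwards the input to the registered function.
        return self._fn(text, **kwargs)

    @staticmethod
    def available() -> List[str]:
        return list(_REGISTRY)


@register("reverse_text")
def _reverse_text(text: str) -> str:
    return text[::-1]


def reverse_text(text: str, **kwargs: object) -> object:
    """Function-style alias that delegates to the class-based entry point."""
    return Suite("reverse_text")(text, **kwargs)
```

With this layout both call styles go through the same code path, so behavior cannot drift between them.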

Supported Modules

See here for more details.

v5.2.0

02 Apr 00:34
  • Add is_compilable() function to check whether the Cython implementation can be built in the user's environment.
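# new_compiler and customize_compiler are assumed to be the distutils helpers;
# get_extra_compile_args is a project helper that is not shown here.
from distutils.ccompiler import new_compiler
from distutils.sysconfig import customize_compiler

def is_compilable():
    try:
        # 1. Try to compile csrc/sentence_splitter.cpp
        extra_compile_args, extra_link_args = get_extra_compile_args()
        compiler = new_compiler()
        customize_compiler(compiler)
        compiler.compile(['csrc/sentence_splitter.cpp'], extra_postargs=extra_compile_args)
        return True
    except Exception:
        # 2. Cannot compile csrc/sentence_splitter.cpp
        return False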

v5.1.0

31 Mar 21:23

The fast backend

If you want to split sentences quickly, you can use the split_sentences function with the backend='fast' option, available from Kss 5.0.0. This method is based on the fast algorithm used in Kss versions prior to 3.0. It is significantly faster than the mecab backend but less accurate, so it is useful when you need to split sentences very quickly and do not need high accuracy. The fast backend is implemented in both Python and Cython.

  • If your environment supports the installation of Cython, Kss will use the Cython implementation, which is the fastest (about 600x faster than mecab).
  • Otherwise, it will use the Python implementation, which is slower than the Cython version but still faster than the mecab backend (about 4x faster than mecab).

Given the substantial speed advantage of the Cython implementation, it is strongly recommended over the Python alternative. Kss automatically detects whether Cython is available in your environment and installs it when possible, so you don't need to worry about Cython and C++ dependencies.

Accuracy (Normalized F1)

Backend        blogs_ko  blogs_lee  nested  sample  tweets  v_ending  wikipedia
mecab          0.8860    0.8887     0.9206  0.9682  0.8137  0.4815    1.0000
fast (Python)  0.6281    0.7899     0.6899  0.7482  0.5315  0.1596    0.7358
fast (Cython)  0.6545    0.8132     0.6372  0.8407  0.5892  0.1596    0.9566

Speed (msec)

Backend        blogs_ko  blogs_lee  nested  sample  tweets  v_ending  wikipedia
mecab          538.10    293.31     225.05  56.35   184.91  20.55     899.99
fast (Python)  146.75    70.94      52.84   12.11   37.80   4.69      255.90
fast (Cython)  0.91      0.55       0.46    0.09    0.40    0.05      1.12
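The speed-up factors quoted for the fast backend can be checked against the timings in the table above. The short script below only divides the mecab time by each fast-backend time per dataset. The numbers are copied from the table, while the variable and function names are chosen for this example.

```python
# Timings in msec, copied from the speed table.
DATASETS = ["blogs_ko", "blogs_lee", "nested", "sample", "tweets", "v_ending", "wikipedia"]
TIMINGS = {
    "mecab": [538.10, 293.31, 225.05, 56.35, 184.91, 20.55, 899.99],
    "fast (Python)": [146.75, 70.94, 52.84, 12.11, 37.80, 4.69, 255.90],
    "fast (Cython)": [0.91, 0.55, 0.46, 0.09, 0.40, 0.05, 1.12],
}


def speedups(baseline: str, candidate: str) -> dict:
    """Per-dataset ratio of baseline time to candidate time."""
    return {
        name: base / cand
        for name, base, cand in zip(DATASETS, TIMINGS[baseline], TIMINGS[candidate])
    }


python_speedup = speedups("mecab", "fast (Python)")
cython_speedup = speedups("mecab", "fast (Cython)")

for name in DATASETS:
    print(f"{name:10s} python x{python_speedup[name]:.1f}  cython x{cython_speedup[name]:.0f}")
```

With these timings the Python implementation is roughly 3.5x to 4.9x faster than mecab and the Cython implementation roughly 410x to 800x faster, depending on the dataset, which is in line with the rounded "x4" and "x600" figures.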

Please note that while the core algorithm in the fast backend mirrors that of Kss C++ 1.3.1, several bugs identified in the original implementation have been rectified in Kss 5.0.0.

v4.5.4

14 Jul 06:18
  • Fix multiprocessing by #67

v4.5.3

17 May 03:09
4b79728
  • Add return_pos parameter to split_morphemes function. (#64)

v4.5.2

16 May 18:11

Fix a bug reported in #60.

v4.5.1

25 Jan 12:57
  • Hotfix for some bugs in 4.5.0