Support CJK on Full Text Search #21

Open
1l0 opened this issue Dec 11, 2019 · 4 comments


1l0 commented Dec 11, 2019

Words in CJK sentences are not separated by spaces. For now, EliasDB's full text search can't find a specific word inside a CJK sentence. It would be great to be able to do that.

Owner

krotik commented Dec 11, 2019

Hey, I don't have any experience with CJK sentences. Do you have any suggestions on how EliasDB could support this? Maybe a config option in eliasdb.config.json which lets you define a list of "separator" characters?

@krotik krotik self-assigned this Dec 11, 2019

beoran commented Mar 9, 2020

If we look at the introduction of Ruby in Japanese here: https://www.ruby-lang.org/ja/, we see this:

オープンソースの動的なプログラミング言語で、 シンプルさと高い生産性を備えています。 エレガントな文法を持ち、自然に読み書きができます。

Neither spaces nor anything else is used to separate the words. We only have the comma 、 and the end-of-sentence mark 。. In CJK languages the reader has to find the word boundaries based on grammar or dictionaries, so defining a list of separator characters will not solve this. Rather, EliasDB should be extended to make it possible to search for non-delimited substrings, something which is generally useful.


beoran commented Mar 12, 2020

Another solution is to use a CJK text segmentation library. I just found one for Go:

https://github.com/go-ego/gse


gedw99 commented May 18, 2021

Supporting CJK requires stemming.

bleve has some of these.
gse also looks good.
