Non-English tokenizers #464

Open
dzcpy opened this issue Jun 7, 2021 · 2 comments
Labels
enhancement New feature or request help wanted Extra attention is needed question Further information is requested

Comments


dzcpy commented Jun 7, 2021

Describe the solution you'd like
In CJK languages such as Chinese, words are not separated by spaces, so a dedicated tokenizer is usually needed to split sentences into words. One example is https://github.com/yanyiwu/cppjieba
Is this currently doable in PISA? If not, is there any plan to add this feature in the future?
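
To illustrate the problem, here is a minimal Python sketch of why whitespace splitting fails on Chinese text, next to a naive dictionary-based segmenter (forward maximum matching). This is not how PISA or cppjieba work internally; the function names, the tiny vocabulary, and the `max_word_len` parameter are all hypothetical and chosen only for illustration.

```python
def whitespace_tokenize(text):
    """Split on whitespace, as a tokenizer for space-delimited languages might."""
    return text.split()


def max_match_tokenize(text, vocabulary, max_word_len=4):
    """Naive forward maximum matching over a known vocabulary.

    At each position, take the longest vocabulary entry that matches;
    if nothing matches, emit a single character and move on.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy vocabulary; a real segmenter ships a large dictionary.
vocabulary = {"我们", "喜欢", "搜索", "引擎", "搜索引擎"}
sentence = "我们喜欢搜索引擎"

print(whitespace_tokenize(sentence))             # ['我们喜欢搜索引擎']
print(max_match_tokenize(sentence, vocabulary))  # ['我们', '喜欢', '搜索引擎']
```

Whitespace splitting returns the whole sentence as one term, which is useless for indexing. Libraries such as cppjieba use considerably more sophisticated methods than this greedy lookup, but the input and output have the same shape: a string in, a list of word tokens out.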


dzcpy added the enhancement label Jun 7, 2021
Member

amallia commented Jun 7, 2021

Yes, it is doable. If you want to see this implemented you can send a PR and we will review it.
Thanks

elshize added the help wanted and question labels Feb 27, 2022
Member

elshize commented Feb 27, 2022

Unfortunately, none of us regular contributors know these languages well,
so we'll need someone with more knowledge to step up in order to
properly implement and test it.

If anyone wants to help out with that, we can definitely provide
guidance on how parsing and tokenizing work within PISA.

elshize changed the title from "Does it support custom tokenizers?" to "Non-English tokenizers" Nov 10, 2022
3 participants