"two-career" is being shown as "two career" #183

Open
arademaker opened this issue Dec 17, 2019 · 1 comment
@arademaker
Member

arademaker commented Dec 17, 2019

Losing the original text? Is it the right thing to do?

((kind "wf")
   (form . "two")
   (lemmas "two")
   (tag . "ignore")
   (meta
    (sep . "-")
    (type . "num")))

We do have the sep, which is enough to reproduce the original text. The questions are:

  1. is it easier to have the text tokenized in the buffer?
  2. should we not distinguish between spaces and other separators?

Remember that the default sep is a space, so a token without a sep is assumed to have sep=" ". See the confusing explanation in https://github.com/own-pt/glosstag/blob/princeton/dtd/glosstag.dtd#L158-L161 for the glosstag corpus!
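
To make the default-sep rule concrete, here is a minimal Python sketch of rebuilding the surface text from tokens. It is not code from this project: the dict layout and the name `render_tokens` are hypothetical, and it assumes that sep is the string that follows a token and that the sep of the last token is not printed.

```python
def render_tokens(tokens, default_sep=" "):
    """Join token forms, following each token with its own sep.

    Assumption: a token without a "sep" key gets `default_sep`,
    and the separator after the final token is dropped.
    """
    pieces = []
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        pieces.append(tok["form"])
        if i < last:
            # An explicit empty sep ("") is respected; only a missing key falls back.
            pieces.append(tok.get("sep", default_sep))
    return "".join(pieces)


# Hypothetical tokens mirroring the example above; "family" is made up for illustration.
tokens = [
    {"form": "two", "sep": "-"},
    {"form": "career"},
    {"form": "family"},
]
print(render_tokens(tokens))  # two-career family
```

With this rule, "two" followed by sep "-" renders as "two-career" rather than "two career".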

@odanoburu
Contributor

> Losing the original text?

not really losing, but not showing it properly indeed.

the proper tokenization (using sep as separator when available and a space as default) could be implemented, but then I'm not sure if any other corpus will have sep attributes to make it worthwhile… how is tokenization described by other tokenizers? could touch.py produce something akin to sep? would it be useful to do so?
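
As a rough idea of how a tokenizer could emit something akin to sep, the Python sketch below derives it from the raw text and the token forms. This is only an assumption about one possible approach; it does not describe what touch.py does, and `derive_seps` is a hypothetical name.

```python
def derive_seps(text, forms, default_sep=" "):
    """Attach a "sep" to each token whose following separator is not the default.

    Assumption: the forms occur in `text` in order and verbatim.
    """
    spans = []
    pos = 0
    for form in forms:
        start = text.find(form, pos)
        if start < 0:
            raise ValueError(f"form {form!r} not found after offset {pos}")
        end = start + len(form)
        spans.append((start, end))
        pos = end

    tokens = []
    for i, (form, (_, end)) in enumerate(zip(forms, spans)):
        token = {"form": form}
        if i + 1 < len(spans):
            # Whatever lies between this token and the next one is its separator.
            sep = text[end:spans[i + 1][0]]
            if sep != default_sep:
                token["sep"] = sep
        tokens.append(token)
    return tokens


print(derive_seps("two-career family", ["two", "career", "family"]))
# [{'form': 'two', 'sep': '-'}, {'form': 'career'}, {'form': 'family'}]
```

Only non-default separators are stored, so a corpus whose tokens are all space-separated would carry no sep attributes at all.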
