"two-career" is being shown as "two career" #183

arademaker · 2019-12-17T15:05:27Z

Losing the original text? Is it the right thing to do?

((kind "wf")
   (form . "two")
   (lemmas "two")
   (tag . "ignore")
   (meta
    (sep . "-")
    (type . "num")))

We do have the sep to reproduce the original text. The question is:

is it easier to have the text tokenized in the buffer?
should we not distinguish between spaces and other separators?

Remember that the default sep is a space, so when a token doesn't have a sep it is assumed that sep=" ". See the confusing explanation in https://github.com/own-pt/glosstag/blob/princeton/dtd/glosstag.dtd#L158-L161 for the glosstag corpus!

odanoburu · 2020-01-02T20:10:56Z

Losing the original text?

not really losing, but not showing it properly indeed.

the proper tokenization (using sep as separator when available and a space as default) could be implemented, but then I'm not sure if any other corpus will have sep attributes to make it worthwhile… how is tokenization described by other tokenizers? could touch.py produce something akin to sep? would it be useful to do so?
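for reference, here is a rough sketch of what that reconstruction could look like. it is only an illustration in Python, not the project's code: the `detokenize` function, the `DEFAULT_SEP` constant and the dict-based token shape are all made up for the example, and it assumes (going by the "two" token above) that sep is the string that follows a token.

```python
from typing import Mapping, Sequence

# Assumption: a token without an explicit sep is followed by a single space.
DEFAULT_SEP = " "


def detokenize(tokens: Sequence[Mapping[str, str]]) -> str:
    """Rebuild the surface text from tokens (hypothetical helper).

    Each token is a mapping with a "form" key and an optional "sep" key
    holding the string that follows the token in the original text.
    """
    parts = []
    last = len(tokens) - 1
    for i, token in enumerate(tokens):
        parts.append(token["form"])
        # No separator is emitted after the final token.
        if i < last:
            parts.append(token.get("sep", DEFAULT_SEP))
    return "".join(parts)


# Made-up sentence fragment containing the token from this issue.
tokens = [
    {"form": "a"},
    {"form": "two", "sep": "-"},
    {"form": "career"},
    {"form": "family"},
]
print(detokenize(tokens))  # a two-career family
```

a corpus with no sep attributes at all would just fall back to joining on spaces, so this would not change anything for it.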
