lemma of compound words contains only the headword #3

Open
pekoli opened this issue Jun 8, 2020 · 3 comments

Comments

@pekoli

pekoli commented Jun 8, 2020

I've noticed that for most compound words only the headword is stored in the lemma. This mainly concerns nouns as in the following examples:

# sent_id = hdt-s10009
7       Leitungsinfrastruktur   Infrastruktur   NOUN    NN      Gender=Fem|Number=Sing|Person=3 2       obj     _       _

# sent_id = hdt-s10011
6       Stellenstreichungen     Streichung      NOUN    NN      Gender=Fem|Number=Plur|Person=3 4       conj    _       _

# sent_id = hdt-s10015
17      Vorstandvorsitzender    Vorsitzender    NOUN    NN      Case=Nom|Gender=Masc|Number=Sing|Person=3       16      nsubj   _       

but also adjectives:

# sent_id = hdt-s10005
2       US-amerikanische        amerikanisch    ADJ     ADJA  Degree=Pos|Gender=Neut|Number=Sing      3       amod    _       _

However, there are examples where the whole compound is given in the lemma:

# sent_id = hdt-s10012
14      Geschäftsjahres Geschäftsjahr   NOUN    NN      Case=Gen|Gender=Neut|Number=Sing|Person=3       11      nmod:poss       _       _

Is it an artifact of converting the original treebank to UD format?

@akoehn
Member

akoehn commented Jun 8, 2020

Yes, the lemma column is a copy of the "base" annotation in the original HDT annotation. I thought we had documented this somewhere, but I don't remember where.

@pekoli
Author

pekoli commented Jun 8, 2020

Thanks for the quick reply!
The papers linked in the README don't mention it explicitly, unless I've missed it.

I think it would be possible to restore the complete lemma from the word form and the headword using a script. Would you consider merging if I did a PR on this? Or should I just create a fork?
(Background is training neural lemmatizers - currently, they're forced to learn compound splitting in addition to lemmatisation which doesn't make it easier...)
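To illustrate the idea, here is a minimal sketch of how the full lemma might be restored from the word form and the headword. It is not the script referred to elsewhere in this thread; the function names `restore_compound_lemma` and `fix_conllu_line` and the `upos_filter` parameter are hypothetical. It assumes that the head lemma occurs unchanged (ignoring case) inside the word form, that token lines are tab-separated with ten columns, and that only NOUN and ADJ tokens should be touched. Forms where the head is altered inside the compound (for example by an umlaut in the plural) are left as they are.

```python
def restore_compound_lemma(form: str, head_lemma: str) -> str:
    """Prepend the modifier part of `form` to `head_lemma`, if there is one."""
    # Case-insensitive search for the last occurrence of the head in the form.
    # Assumption: lower() does not change string length (true for German text).
    start = form.lower().rfind(head_lemma.lower())
    if start <= 0:
        # Head not found (-1) or form already starts with it (0): keep lemma.
        return head_lemma
    modifier = form[:start]
    if modifier.endswith("-"):
        # Hyphenated compound: the head keeps its own capitalisation.
        return modifier + head_lemma
    # Closed compound: the head is written in lower case inside the word.
    return modifier + head_lemma.lower()


def fix_conllu_line(line: str, upos_filter=frozenset({"NOUN", "ADJ"})) -> str:
    """Rewrite the LEMMA column (third field) of one CoNLL-U token line."""
    if not line.strip() or line.startswith("#"):
        return line  # blank line or comment such as "# sent_id = ..."
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10 or cols[3] not in upos_filter:
        return line  # malformed line or part of speech we do not touch
    cols[2] = restore_compound_lemma(cols[1], cols[2])
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + newline


if __name__ == "__main__":
    pairs = [
        ("Leitungsinfrastruktur", "Infrastruktur"),
        ("Stellenstreichungen", "Streichung"),
        ("US-amerikanische", "amerikanisch"),
        ("Geschäftsjahres", "Geschäftsjahr"),
    ]
    for form, head in pairs:
        print(form, "->", restore_compound_lemma(form, head))
```

The part-of-speech filter is a precaution against accidental substring matches in other word classes; whether it is needed depends on how the "base" annotation behaves for verbs and function words.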

This was referenced Sep 10, 2020
@akoehn
Member

akoehn commented Sep 16, 2020

Sorry, I forgot this issue.

I think creating the lemma from the word form and the base annotation is a sensible idea, and I will have a closer look at the effect of the script in #6. If the script works well enough, we could also use it in the publication pipeline. I made a TODO to look into it next week.

In any case, can you add a proper header to the scripts, including the license (i.e. Apache 2.0, or GPLv3 or later) and a copyright notice with yourself as the author?
