rroot relation in model predictions #30

Open
LuDuerlich opened this issue Sep 3, 2021 · 5 comments

@LuDuerlich

I have been training parsers for multiple languages and observed a small number of instances where the parser predicts rroot instead of root on the dev set.

At first I thought this could be due to typos in the training data, but I could not find any instances in any of the UD treebanks (version 2.8). Instead, I found that rroot is introduced as part of a dummy root node in read_conll in utils.py.
I suppose this is not really a typo in the code, but a dummy value that is meant to be overwritten by the parser and, in most cases, is.

The options I set were
--dynet-mem 6000 --epochs 50 --k=2 --pos-emb-size 0 --char-emb-size 100 --disable-rlmost

and I observed it in some of the dev predictions, starting at epoch 22 for Basque-BDT (random seed of 2) and at the first epoch for Hindi-HDTB (random seed of 5).

@mdelhoneux
Member

Mmmh, this is strange. rroot is indeed used as a dummy dependency relation for the dummy root token; it should never be used for any other token and should never be printed. This is quite hard to debug if it's that infrequent :/ It probably won't help, but can you show me a sample of conllu output where this happens?

@LuDuerlich
Author

Here is some output for Basque:

# sent_id = dev-s1144
# text = Kroaziarraren kasuan, normaltzat jo behar da hori, orain artean oso gutxi jokatu baitu.
1       Kroaziarraren   kroaziar        NOUN    _       Case=Gen|Definite=Def|Number=Sing       2       nmod    _       _
2       kasuan  kasu    NOUN    _       Animacy=Inan|Case=Ine|Definite=Def|Number=Sing  0       obl     _       SpaceAfter=No
3       ,       ,       PUNCT   _       _       2       punct   _       _
4       normaltzat      normal  ADJ     _       Case=Ess|Definite=Ind   5       obl     _       _
5       jo      jo      VERB    _       VerbForm=Part   3       xcomp   _       _
6       behar   behar   NOUN    _       Case=Abs|Definite=Ind   7       compound        _       _
7       da      izan    VERB    _       Aspect=Prog|Mood=Ind|Number[abs]=Sing|Person[abs]=3     14      rroot   _       _
8       hori    hori    DET     _       Case=Abs|Definite=Def|Number=Sing       14      nsubj   _       SpaceAfter=No
9       ,       ,       PUNCT   _       _       7       punct   _       _
10      orain   orain   ADV     _       Case=Ine        14      advmod  _       _
11      artean  arte    ADP     _       Case=Ine        10      case    _       _
12      oso     oso     ADV     _       _       13      advmod  _       _
13      gutxi   gutxi   ADV     _       _       14      obl     _       _
14      jokatu  jokatu  VERB    _       Aspect=Perf|VerbForm=Part       5       advcl   _       _
15      baitu   *edun   AUX     _       Mood=Ind|Number[abs]=Sing|Number[erg]=Sing|Person[abs]=3|Person[erg]=3  14      aux     _       SpaceAfter=No
16      .       .       PUNCT   _       _       7       punct   _       _

# sent_id = dev-s1366
# text = "Araudia ikusita, jendea orain baino lehenago irten beharko da etxetik anbientea sortzeko...".
1       "       "       PUNCT   _       _       0       punct   _       SpaceAfter=No
2       Araudia araudi  NOUN    _       Animacy=Inan|Case=Abs|Definite=Def|Number=Sing  3       obj     _       _
3       ikusita ikusi   VERB    _       VerbForm=Part   1       advcl   _       SpaceAfter=No
4       ,       ,       PUNCT   _       _       3       punct   _       _
5       jendea  jende   NOUN    _       Case=Abs|Definite=Def|Number=Sing       9       nsubj   _       _
6       orain   orain   ADV     _       _       7       advmod  _       _
7       baino   baino   X       _       _       9       advmod  _       _
8       lehenago        lehenago        ADV     _       _       9       advmod  _       _
9       irten   irten   VERB    _       VerbForm=Part   4       xcomp   _       _
10      beharko behar_izan      VERB    _       _       9       rroot   _       _
11      da      izan    AUX     _       Mood=Ind|Number[abs]=Sing|Person[abs]=3 10      aux     _       _
12      etxetik etxe    NOUN    _       Animacy=Inan|Case=Abl|Definite=Def|Number=Sing  14      obl     _       _
13      anbientea       anbiente        NOUN    _       Case=Abs|Definite=Def|Number=Sing       14      obj     _       _
14      sortzeko        sortu   VERB    _       Case=Abs|Definite=Ind   10      advcl   _       SpaceAfter=No
15      ...     ...     PUNCT   _       _       10      punct   _       SpaceAfter=No
16      "       "       PUNCT   _       _       10      punct   _       SpaceAfter=No
17      .       .       PUNCT   _       _       10      punct   _       _

From what I could tell there are only about 4 sentences in the Basque dev set across all training epochs where rroot has been predicted, but per epoch, it gets predicted at most twice, so there is some variation.

And Hindi:

# sent_id = dev-s139
# text = लोकसभा में पेश की गई अपनी रिपोर्ट में कमेटी का कहना है कि रेलवे को केंद्रीय मदद अब ५० फीसदी से भी अधिक मिलने लगी है ।
1       लोकसभा  लोकसभा  NOUN    NN      Case=Acc|Gender=Fem|Number=Sing|Person=3        4       obl     _       Vib=0_में|Tam=0|ChunkId=NP|ChunkType=head|Translit=lokasabhā
2       में       में       ADP     PSP     AdpType=Post    1       case    _       ChunkId=NP|ChunkType=child|Translit=meṁ
3       पेश      पेश      ADJ     JJ      _       4       compound        _       ChunkId=JJP|ChunkType=head|Translit=peśa
4       की      कर      VERB    VM      Aspect=Perf|Gender=Fem|Number=Sing|VerbForm=Part        7       acl     _       Vib=या_जा+या१|Tam=yA|ChunkId=VGNF|ChunkType=head|Translit=kī
5       गई      जा      AUX     VAUX    Aspect=Perf|Gender=Fem|Number=Sing|VerbForm=Part        4       aux:pass        _       Vib=या१|Tam=yA1|ChunkId=VGNF|ChunkType=child|Translit=gaī
6       अपनी    अपना    PRON    PRP     Case=Acc|Gender=Fem|PronType=Prs        7       nmod    _       Vib=0|Tam=0|ChunkId=NP2|ChunkType=head|Translit=apanī
7       रिपोर्ट  रिपोर्ट  NOUN    NN      Case=Acc|Gender=Fem|Number=Sing|Person=3        0       obl     _       Vib=0_में|Tam=0|ChunkId=NP3|ChunkType=head|Translit=riporṭa
8       में       में       ADP     PSP     AdpType=Post    7       case    _       ChunkId=NP3|ChunkType=child|Translit=meṁ
9       कमेटी    कमेटी    NOUN    NN      Case=Acc|Gender=Fem|Number=Sing|Person=3        11      nsubj   _       Vib=0_का|Tam=0|ChunkId=NP4|ChunkType=head|Translit=kameṭī
10      का      का      ADP     PSP     AdpType=Post|Case=Nom|Gender=Masc|Number=Sing   9       case    _       ChunkId=NP4|ChunkType=child|Translit=kā
11      कहना    कह      VERB    VM      Case=Nom|VerbForm=Inf   7       amod    _       Vib=ना|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=kahanā
12      है       है       VERB    VM      Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 24      rroot   _       Vib=है|Tam=hE|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=hai
13      कि      कि      SCONJ   CC      _       24      mark    _       AltTag=SCONJ-CONJ|ChunkId=CCP|ChunkType=head|Translit=ki
14      रेलवे     रेलवे     NOUN    NN      Case=Acc|Gender=Masc|Number=Sing|Person=3       24      nsubj   _       Vib=0_को|Tam=0|ChunkId=NP5|ChunkType=head|Translit=relave
15      को      को      ADP     PSP     AdpType=Post    14      case    _       ChunkId=NP5|ChunkType=child|Translit=ko
16      केंद्रीय   केंद्रीय   ADJ     JJ      Case=Nom        17      compound        _       ChunkId=NP6|ChunkType=child|Translit=keṁdrīya
17      मदद     मदद     NOUN    NN      Case=Nom|Gender=Fem|Number=Sing|Person=3        24      nsubj   _       Vib=0|Tam=0|ChunkId=NP6|ChunkType=head|Translit=madada
18      अब      अब      PRON    PRP     Case=Nom|PronType=Prs   24      obl     _       ChunkId=NP7|ChunkType=head|Translit=aba
19      ५०      ५०      NUM     QC      NumType=Card    20      nummod  _       ChunkId=NP8|ChunkType=child|Translit=50
20      फीसदी   फीसदी   NOUN    NN      Case=Acc|Gender=Fem|Number=Sing|Person=3        24      obl     _       Vib=0_से|Tam=0|ChunkId=NP8|ChunkType=head|Translit=phīsadī
21      से       से       ADP     PSP     AdpType=Post    20      case    _       ChunkId=NP8|ChunkType=child|Translit=se
22      भी      भी      PART    RP      _       20      dep     _       ChunkId=NP8|ChunkType=child|Translit=bhī
23      अधिक    अधिक    DET     QF      PronType=Ind    24      nsubj   _       AltTag=ADJ-DET|ChunkId=JJP2|ChunkType=head|Translit=adhika
24      मिलने    मिल     VERB    VM      Gender=Fem|Number=Sing|Person=3|VerbForm=Inf|Voice=Act  11      obj     _       Vib=ना_लग+या_है|Tam=nA|ChunkId=VGF2|ChunkType=head|Stype=declarative|Translit=milane
25      लगी     लग      AUX     VAUX    Aspect=Perf|Gender=Fem|Number=Sing|VerbForm=Part        24      aux     _       Vib=या|Tam=yA|ChunkId=VGF2|ChunkType=child|Translit=lagī
26      है       है       AUX     VAUX    Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   24      aux:pass        _       Vib=है|Tam=hE|ChunkId=VGF2|ChunkType=child|Translit=hai
27      ।       ।       PUNCT   SYM     _       12      punct   _       ChunkId=BLK|ChunkType=head|Translit=.



# sent_id = dev-s177
# text = लेकिन हम लोगों का मानना है कि राष्ट्रपति, प्रधानमंत्री और मुख्य न्यायाधीश को कम से कम इससे बाहर होना चाहिए ।
1       लेकिन    लेकिन    CCONJ   CC      _       0       cc      _       ChunkId=CCP|ChunkType=head|Translit=lekina
2       हम      हम      DET     DEM     Case=Nom|Number=Plur|Person=1|PronType=Dem      3       det     _       ChunkId=NP|ChunkType=child|Translit=hama
3       लोगों    लोग     NOUN    NN      Case=Acc|Gender=Masc|Number=Plur|Person=3       5       nsubj   _       Vib=0_का|Tam=0|ChunkId=NP|ChunkType=head|Translit=logoṁ
4       का      का      ADP     PSP     AdpType=Post|Case=Nom|Gender=Masc|Number=Sing   3       case    _       ChunkId=NP|ChunkType=child|Translit=kā
5       मानना   मान     VERB    VM      Case=Nom|VerbForm=Inf   1       mark    _       Vib=ना|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=mānanā
6       है       है       VERB    VM      Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 20      rroot   _       Vib=है|Tam=hE|ChunkId=VGF|ChunkType=head|Stype=declarative|Translit=hai
7       कि      कि      SCONJ   CC      _       20      mark    _       AltTag=SCONJ-CONJ|ChunkId=CCP2|ChunkType=head|Translit=ki
8       राष्ट्रपति        राष्ट्रपति        PROPN   NNP     Case=Acc|Gender=Masc|Number=Sing|Person=3       20      nsubj   _       SpaceAfter=No|Vib=0|Tam=0|ChunkId=NP2|ChunkType=head|Translit=rāṣṭrapati
9       ,       ,       PUNCT   SYM     _       10      punct   _       ChunkId=NP2|ChunkType=child|Translit=,
10      प्रधानमंत्री       प्रधानमंत्री       PROPN   NNP     Case=Acc|Gender=Masc|Number=Sing|Person=3       8       conj    _       Vib=0|Tam=0|ChunkId=NP3|ChunkType=head|Translit=pradhānamaṁtrī
11      और      और      CCONJ   CC      _       13      cc      _       ChunkId=CCP3|ChunkType=head|Translit=aura
12      मुख्य     मुख्य     NOUN    NNC     Case=Nom|Gender=Masc|Number=Sing|Person=3       13      amod    _       Vib=0|Tam=0|ChunkId=NP4|ChunkType=child|Translit=mukhya
13      न्यायाधीश        न्यायाधीश        NOUN    NN      Case=Acc|Gender=Masc|Number=Sing|Person=3       8       conj    _       Vib=0_को|Tam=0|ChunkId=NP4|ChunkType=head|Translit=nyāyādhīśa
14      को      को      ADP     PSP     AdpType=Post    13      case    _       ChunkId=NP4|ChunkType=child|Translit=ko
15      कम      कम      DET     QF      PronType=Ind    18      det     _       ChunkId=NP5|ChunkType=child|Translit=kama
16      से       से       PART    RP      _       15      dep     _       ChunkId=NP5|ChunkType=child|Translit=se
17      कम      कम      DET     QF      PronType=Ind    18      det     _       AltTag=ADJ-DET|ChunkId=NP5|ChunkType=head|Translit=kama
18      इससे     यह      PRON    PRP     Case=Acc,Ins|Number=Sing|Person=3|PronType=Prs  20      obl     _       Vib=से|Tam=se|ChunkId=NP6|ChunkType=head|Translit=isase
19      बाहर    बाहर    ADV     NST     AdpType=Post|Case=Nom|Gender=Masc|Number=Sing|Person=3  18      case    _       AltTag=ADV-NOUN|ChunkId=NP7|ChunkType=head|Translit=bāhara
20      होना    हो      VERB    VM      Gender=Masc|VerbForm=Inf|Voice=Act      5       obj     _       Vib=ना_चाहिए|Tam=nA|ChunkId=VGF2|ChunkType=head|Stype=declarative|Translit=honā
21      चाहिए   चाहिए   AUX     VAUX    _       20      aux     _       Vib=0|Tam=0|ChunkId=VGF2|ChunkType=child|Translit=cāhie
22      ।       ।       PUNCT   SYM     _       6       punct   _       ChunkId=BLK|ChunkType=head|Translit=.

Here, there appear to be more instances. In some epochs, rroot gets predicted as many as 17 times.
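
Counts like these can be reproduced with a small script. The following is a minimal sketch and not part of uuparser: it assumes tab-separated CoNLL-U with ten columns and the dependency relation in the eighth column, and `find_deprel` and `count_per_sentence` are hypothetical helper names.

```python
from collections import Counter


def find_deprel(conllu_lines, label="rroot"):
    """Return (sent_id, token_id, form) for every token whose DEPREL equals label.

    Assumes tab-separated CoNLL-U lines with ten columns (DEPREL is column 8).
    """
    hits = []
    sent_id = None
    for line in conllu_lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 10 and cols[7] == label:
                hits.append((sent_id, cols[0], cols[1]))
    return hits


def count_per_sentence(hits):
    """Count how many matching tokens each sentence contains."""
    return Counter(sent_id for sent_id, _, _ in hits)
```

Passing an open file handle for one epoch's predicted dev file as `conllu_lines` gives the affected sentences for that epoch; running it over every epoch shows whether the same sentences recur.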

@mdelhoneux mdelhoneux self-assigned this Sep 15, 2021
@mdelhoneux
Member

Thanks! These two sentences are non-projective. My suspicion is that it might be due to max_swap in Predict, in uuparser/arc_hybrid.py, which should not actually be necessary. I used this in early debugging days but never went back to change it. Could you try setting max_swap to inf or len(sentence)*len(sentence) in this line:

max_swap = 2*len(sentence)
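
To check which dev sentences are non-projective, one option is to look for crossing arcs in the heads of the gold tree. This is a minimal sketch and not code from uuparser: `is_projective` is a hypothetical helper, and it assumes `heads[i]` holds the 1-based head index of token `i + 1`, with 0 standing for the dummy root.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the 1-based head of token i + 1; 0 marks the dummy root,
    which is treated as sitting at position 0, left of the sentence.
    """
    # Represent every arc by its left and right endpoint.
    arcs = [(min(head, dep), max(head, dep))
            for dep, head in enumerate(heads, start=1)]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            # Two arcs cross when exactly one endpoint of the second
            # lies strictly inside the span of the first.
            if left1 < left2 < right1 < right2:
                return False
    return True
```

The quadratic loop is fine for single sentences; for a whole treebank a faster check may be preferable.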

@LuDuerlich
Author

I tried both versions:

  • len(sentence)**2 changes less: the predictions for Basque are still the same, and there are fewer occurrences of rroot in Hindi across all training epochs, but overall the same sentences still appear to be affected.
  • with inf, there is only one instance of rroot in a single epoch; the epochs where it gets predicted for Hindi are reduced from all 50 to only 9, but it still affects the same sentences.

@mdelhoneux
Member

Ok, thanks! I still think it must have something to do with non-projectivity and the use of swap, but I have no idea what specifically at this point. I will take a look, but it probably won't be this week, sorry :/
In theory, there should actually be no difference between len(sentence)**2 and inf. This is because any pair of two words can only be swapped once. So it probably has something to do with the conditions for swap on lines 174 to 182. There might be an edge case we did not cover?
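
The counting argument can be checked on a toy example. The sketch below is not the parser's transition system: it only counts the adjacent swaps needed to reorder a sequence, where each pair of items is swapped at most once, so the total never exceeds n(n-1)/2, which is below n**2. Both function names are hypothetical.

```python
def count_adjacent_swaps(order):
    """Sort a copy of order with adjacent swaps and return how many were needed.

    Each out-of-order pair is swapped exactly once, so the result equals
    the number of inversions in the input.
    """
    items = list(order)
    swaps = 0
    changed = True
    while changed:
        changed = False
        for i in range(len(items) - 1):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swaps += 1
                changed = True
    return swaps


def max_pairwise_swaps(n):
    """Upper bound on swaps when every pair of n items is swapped at most once."""
    return n * (n - 1) // 2
```

A fully reversed sequence reaches the bound. Since n(n-1)/2 exceeds 2*n once n is greater than 5, a limit of 2*len(sentence) can in principle sit below the worst case, whereas len(sentence)**2 cannot.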
