Encoding USAS in TEI #827

Open
TomazErjavec opened this issue Nov 12, 2023 · 1 comment

@TomazErjavec
Collaborator

This issue discusses the unresolved problems from #202 and #204. The current encoding of USAS in TEI is given in the guidelines and is arguably OK, even though other possibilities exist (in particular stand-off markup, which has no problems with crossing XML tags, but where resolving the references then gets complicated). Also, it is not yet clear whether retaining per-word USAS tags is sensible in the context of MWEs. These dilemmas should be resolved here.

The conversion of CoNLL-U with USAS tags into TEI is done by the conllu2tei.pl script. This script is badly written: it first just inserts <name> and <phr> into a temporary TEI and only afterwards tries to resolve conflicts, but does so badly, i.e. it removes <phr> elements even in cases where it shouldn't, in particular phr/name, (arguably) name/phr, and when a phr is adjacent to a name, which is a definite bug. Again, how to make the script better should be discussed here.
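As a starting point for that discussion, here is a minimal Python sketch of the conflict check the script arguably needs. It is not taken from conllu2tei.pl (which is Perl); the names `relation` and `keep_phr` and the representation of spans as inclusive token index pairs are assumptions made for illustration. It also assumes that "phr/name" above means a name nested inside a phr, and "name/phr" the reverse. The idea is that only genuinely crossing spans force a <phr> to be dropped, while nested and adjacent ones can be kept.

```python
from typing import Tuple

# Hypothetical representation: (first token index, last token index), inclusive.
Span = Tuple[int, int]


def relation(phr: Span, name: Span) -> str:
    """Classify how a <phr> token span relates to a <name> token span."""
    p_start, p_end = phr
    n_start, n_end = name
    if p_end < n_start or n_end < p_start:
        # No shared tokens: adjacent if the spans touch, otherwise disjoint.
        touching = p_end + 1 == n_start or n_end + 1 == p_start
        return "adjacent" if touching else "disjoint"
    if n_start <= p_start and p_end <= n_end:
        # phr lies inside name (identical spans also end up here).
        return "phr_in_name"
    if p_start <= n_start and n_end <= p_end:
        # name lies inside phr.
        return "name_in_phr"
    # Partial overlap: cannot be expressed as nested XML elements.
    return "crossing"


def keep_phr(phr: Span, name: Span) -> bool:
    """Keep the <phr> unless it truly crosses the <name>."""
    return relation(phr, name) != "crossing"
```

With such a check, a phr over tokens 0 to 1 followed by a name over tokens 2 to 3 is classified as adjacent and kept, which is the case described above as a definite bug in the current script.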

@TomazErjavec TomazErjavec added bug Something isn't working enhancement New feature or request labels Nov 12, 2023
@TomazErjavec TomazErjavec added this to the Future milestone Nov 12, 2023
This was referenced Nov 12, 2023
@TomazErjavec
Collaborator Author

Here is the alternative proposal on how to encode USAS in TEI:

<seg xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1" xml:lang="en" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1">
  <s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1" n="1" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1">
    <w xml:id="tok01" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Mr.">Mr.</w>
    <w xml:id="tok02" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="President" join="right">President</w>
    <pc xml:id="tok03" pos="Z" msd="UPosTag=PUNCT" join="right">-</pc>
    <!-- ... -->
    <spanGrp type="sem">
      <span target="#tok01 #tok02" type="Z1mf,Z3c" ana="sem:Z1"/>
      <span target="#tok03" type="Z9" ana="sem:Z9"/>
    </spanGrp>
  </s>
</seg>

My objections would be that:

  • it introduces a completely new construct for linguistic analysis, so far not used in ParlaMint or Parla-CLARIN
  • it just postpones the problem of conflict with <name> to whatever application or conversion tries to use both

So, I would advocate sticking to the current encoding but fixing the conversion script to do a better job.
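To make the second objection concrete, here is a rough Python sketch of what any consumer of the stand-off variant would have to do before it could even compare the semantic spans with <name>. It uses only the standard library; the function name `resolve_spans` and the shortened, namespace-free sample are assumptions for illustration (real ParlaMint files are in the TEI namespace and would need namespace-aware tag names).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened, namespace-free version of the proposal above (an assumption).
SAMPLE = """<s xml:id="s1">
  <w xml:id="tok01">Mr.</w>
  <w xml:id="tok02">President</w>
  <pc xml:id="tok03">-</pc>
  <spanGrp type="sem">
    <span target="#tok01 #tok02" type="Z1mf,Z3c" ana="sem:Z1"/>
    <span target="#tok03" type="Z9" ana="sem:Z9"/>
  </spanGrp>
</s>"""


def resolve_spans(sentence):
    """Return (ana, [token texts]) for each stand-off <span> in a sentence."""
    # Index the direct token children of the sentence by their xml:id.
    tokens = {el.get(XML_ID): el.text for el in sentence if el.tag in ("w", "pc")}
    resolved = []
    for span in sentence.iter("span"):
        # Each target is a whitespace-separated list of "#id" references.
        ids = [ref.lstrip("#") for ref in span.get("target").split()]
        resolved.append((span.get("ana"), [tokens[i] for i in ids]))
    return resolved


spans = resolve_spans(ET.fromstring(SAMPLE))
```

Every tool that wants both the semantic spans and the names would need a step like this, and only then could it detect where the two annotations cross.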
