tagUsage calculation in AT corpus #662

Closed
matyaskopp opened this issue May 16, 2023 · 3 comments

@matyaskopp
Collaborator

The AT corpus has wrong tagUsage counts in the following files in the /project/corpora/Parla/ParlaMint/ParlaMint-full/Data/Corpora folder:

  • ParlaMint-AT.TEI.ana/ParlaMint-AT.ana.xml
  • ParlaMint-AT.TEI/ParlaMint-AT.xml
  • Sample-ParlaMint-AT.TEI.ana/ParlaMint-AT.ana.xml
  • Sample-ParlaMint-AT.TEI.ana/ParlaMint-AT.ana.xml

All corpus files look like this:

<tagsDecl><!--These numbers do not reflect the size of the sample!-->
<namespace name="http://www.tei-c.org/ns/1.0">
<tagUsage gi="body" occurs="1197"/>
<tagUsage gi="desc" occurs="1197"/>
<tagUsage gi="div" occurs="1197"/>
<tagUsage gi="gap" occurs="1197"/>
<tagUsage gi="incident" occurs="1197"/>
<tagUsage gi="kinesic" occurs="1197"/>
<tagUsage gi="note" occurs="1197"/>
<tagUsage gi="pb" occurs="1197"/>
<tagUsage gi="seg" occurs="1197"/>
<tagUsage gi="text" occurs="1197"/>
<tagUsage gi="time" occurs="1197"/>
<tagUsage gi="u" occurs="1197"/>
<tagUsage gi="vocal" occurs="1197"/>
</namespace>
</tagsDecl>

And component files:

<tagsDecl><!--These numbers do not reflect the size of the sample!-->
<namespace name="http://www.tei-c.org/ns/1.0">
<tagUsage gi="text" occurs="1"/>
<tagUsage gi="body" occurs="1"/>
<tagUsage gi="div" occurs="1"/>
<tagUsage gi="note" occurs="1"/>
<tagUsage gi="pb" occurs="1"/>
<tagUsage gi="u" occurs="1"/>
<tagUsage gi="seg" occurs="1"/>
<tagUsage gi="kinesic" occurs="1"/>
<tagUsage gi="vocal" occurs="1"/>
<tagUsage gi="incident" occurs="1"/>
<tagUsage gi="gap" occurs="1"/>
<tagUsage gi="desc" occurs="1"/>
<tagUsage gi="time" occurs="1"/>
</namespace>
</tagsDecl>

I guess that the finalization script does not calculate these numbers, and that only AT sets every count to 1 in its component files.
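A quick way to spot this symptom in other corpora might be to check whether every tagUsage in a tagsDecl carries the same occurs value. The following is only a sketch: the function name `suspicious_tag_usage` is hypothetical and not part of the ParlaMint scripts, and it ignores namespaces so that it also works on fragments pasted without the TEI namespace declaration, like the ones above.

```python
import xml.etree.ElementTree as ET


def suspicious_tag_usage(tags_decl_xml: str) -> bool:
    """Return True if all tagUsage elements share one identical occurs value.

    Hypothetical helper: identical counts for every element (e.g. all 1 or
    all 1197) suggest the numbers were never actually calculated.
    """
    root = ET.fromstring(tags_decl_xml)
    counts = [
        int(el.get("occurs"))
        for el in root.iter()
        # Compare local names only, so namespaced and bare fragments both work.
        if el.tag.split("}")[-1] == "tagUsage"
    ]
    return len(counts) > 1 and len(set(counts)) == 1
```

Run on either of the two listings above, this check would return True, since every element there has the same count.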

@matyaskopp added the bug label May 16, 2023
@matyaskopp
Collaborator Author

Now I see:

$scriptFinal = "$Bin/parlamint2final.xsl";

- inserts tagCounts in root (taken from component files and not changed there!)

parlamint2final does not calculate tagUsage


tagUsage calculation is implemented in


which is not used in the finalization
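For illustration, here is a rough sketch of what a tagUsage calculation involves. It is not the project's implementation (the finalization is done in XSLT) and all function names are hypothetical. It assumes, based on the listings above, that only TEI-namespace elements inside text (including text itself) are counted, and that the root counts are the sums over the component files.

```python
from collections import Counter
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
PREFIX = "{" + TEI_NS + "}"


def count_tags(component_xml: str) -> Counter:
    """Count TEI elements in the first <text> of a component file."""
    root = ET.fromstring(component_xml)
    text = root.find(".//" + PREFIX + "text")
    counts = Counter()
    if text is None:
        return counts
    for el in text.iter():  # iter() includes the <text> element itself
        if isinstance(el.tag, str) and el.tag.startswith(PREFIX):
            counts[el.tag[len(PREFIX):]] += 1
    return counts


def root_counts(component_counts) -> Counter:
    """Sum per-component counts to get the corpus root counts."""
    total = Counter()
    for counts in component_counts:
        total.update(counts)
    return total


def tags_decl(counts: Counter) -> str:
    """Serialize counts as a tagsDecl, sorted by element name."""
    lines = ["<tagsDecl>", f'<namespace name="{TEI_NS}">']
    for gi in sorted(counts):
        lines.append(f'<tagUsage gi="{gi}" occurs="{counts[gi]}"/>')
    lines += ["</namespace>", "</tagsDecl>"]
    return "\n".join(lines)
```

With this shape, the component counts have to be recalculated before the root counts are summed. Otherwise wrong component values (such as 1 for every element) simply propagate to the root, which matches the 1197 seen above.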

@TomazErjavec
Collaborator

I thought everybody computes their own tagUsages, but I noticed the AT problem a couple of days ago myself.
I have now inserted your calculation into finalize, but it is a doomed effort, because I change the countable markup for ES-GA and now also IS (names without words, a bug which floated to the top only in the MTed corpus), hm. I guess we should do my fixes first, and then just use add-common (although my version of add-common does things yours doesn't :).
Would you dare to try it, or is that too much to hope for? I'm afraid of introducing even more confusion!
Or maybe we live with the fact that the tagUsages will be slightly off for 3.0, and hope to do better in 3.1?

@TomazErjavec
Collaborator

Discussion on this continues in #675, closing this one.
