Refactorise parlamint-add-common-content and parlamint2final #675

Closed
TomazErjavec opened this issue Jun 1, 2023 · 9 comments
@TomazErjavec
Collaborator

TomazErjavec commented Jun 1, 2023

We now have two somewhat similar scripts, parlamint-add-common-content and parlamint2final, i.e. they both:

  • fix some known and fixable bugs
  • add common content for a release

It is not good practice to have similar code in two places anyway, but parlamint2final also has the bug (cf. #662 (comment)) that it calculates tagUsages on the original data and then changes various tags in the bodies of components, leading to incorrect final tagUsages.
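A minimal Python sketch of the ordering problem, not the project's XSLT: it assumes tagUsage simply counts elements under the TEI text element, and `rename_elements` is a hypothetical stand-in for any fix that changes tags in the body.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"

def tag_usage(root):
    """Count element occurrences (by local name) under the TEI <text> element."""
    text = root.find(f"{{{TEI_NS}}}text")
    if text is None:
        return Counter()
    return Counter(el.tag.split("}")[-1] for el in text.iter())

def rename_elements(root, old, new):
    """Hypothetical fix step: rename every TEI element `old` to `new`."""
    for el in list(root.iter(f"{{{TEI_NS}}}{old}")):
        el.tag = f"{{{TEI_NS}}}{new}"

sample = f"""<TEI xmlns="{TEI_NS}"><text><body>
<u><seg>Hello</seg><vocal/><vocal/></u>
</body></text></TEI>"""

root = ET.fromstring(sample)
stale = tag_usage(root)   # counted before the fix: no longer matches the output
rename_elements(root, "vocal", "incident")
fresh = tag_usage(root)   # counted after the fix: matches what is written out
```

Here `stale` still reports two `vocal` elements that no longer exist in the output, which is the kind of mismatch described above; counting only after all fixes have run avoids it.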

I propose that we change this so that:

  • parlamint2final does all the fixes to the corpus
  • parlamint-add-common-content does exactly what its name says: sets the handle, date, extents and tagUsage, maybe fixes the title
  • (this will also mean that I will have to change parlamint2distro so that it first runs 2final, stores the corpus in a tmp folder, and then runs add-common-content on that and outputs the final corpus)
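The two-stage flow proposed above could look roughly like this in Python. It is only a sketch under assumptions: `fix_stage` and `common_stage` are hypothetical callables (text in, text out) standing in for the two scripts, and the real parlamint2distro is a Perl script that calls Saxon.

```python
import tempfile
from pathlib import Path

def run_pipeline(in_dir, out_dir, fix_stage, common_stage):
    """Run fix_stage into a tmp folder, then common_stage on that output."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        # Stage 1: apply all fixes and park the result in the tmp folder.
        for src in sorted(in_dir.glob("*.xml")):
            fixed = fix_stage(src.read_text(encoding="utf-8"))
            (tmp_dir / src.name).write_text(fixed, encoding="utf-8")
        # Stage 2: add common content to the already fixed files only.
        for src in sorted(tmp_dir.glob("*.xml")):
            final = common_stage(src.read_text(encoding="utf-8"))
            (out_dir / src.name).write_text(final, encoding="utf-8")
            written.append(src.name)
    return written
```

Because stage 2 only ever reads the tmp folder, anything it computes (extents, tagUsage) is based on the fixed data rather than the submitted data.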

@matyaskopp, do you agree with this change? I can start implementing it, if so.

@TomazErjavec TomazErjavec added the enhancement New feature or request label Jun 1, 2023
@TomazErjavec TomazErjavec added this to the ParlaMint 3.0 release milestone Jun 1, 2023
@TomazErjavec TomazErjavec self-assigned this Jun 1, 2023
@matyaskopp
Collaborator

My suggestions:

  • parlamint2final should be renamed to something like parlamint-fix-known-bugs, to avoid the temptation to add features that belong in a different script
  • factorization can be done as part of the fixing process
    • in the near future, we will not support embedded taxonomies, listPerson and listOrg; only XInclude will be supported
  • parlamint-add-common-content can be extended with

@TomazErjavec
Collaborator Author

Nice ones; I'd call it parlamint4v3 :)
And I just found #157, will look at it tomorrow.

@TomazErjavec
Collaborator Author

Here is my proposal for what parlamint-add-common-content should set or correct (both in root and component files, and taking into account number and date formatting, where applicable):

  • release date (should be a parameter, default is today)
  • version number (parameter)
  • handle (parameter)
  • top level ID to be the same as filename
  • ParlaMint stamp in main title
  • speech and word extents
  • tag usages
  • project description in English
  • maybe langUsage, as you suggest, need to look into this

Anything else @matyaskopp?
Any chance of you implementing this, or shall I? I think parlamint2final does some of the above things better than add-common, so it might be worth comparing the templates.

@matyaskopp matyaskopp assigned matyaskopp and unassigned TomazErjavec Jun 2, 2023
@TomazErjavec
Collaborator Author

Any chance of you implementing this, or shall I?

As next week is pretty busy, and there is time pressure to finalise the distro scripts (it will take 3 weeks to reprocess all the corpora), I did this myself, with the final dev commit being cc7adde.

I think add-common-content works fine (except for some doubts about BE commission sessions), cf.

<!-- This template deserves another think, it should be extended to any parliamentary body of the meeting, not just house.
- WHAT ABOUT BE, WHO HAVE ALSO COMMITTEES AND USE #parla.meeting.committee INSTEAD OF #parla.committee
- DO ANY OTHERS HAVE COMMITTEES BUT ARE NOT MARKED IN MEETINGS?
- WHAT ABOUT UNICAMERAL, THEY SHOULD HAVE THIS INFO TOO IN MEETINGS?
-->

Also, parlamint2release might be ok too. The idea was also to leave the "fixing" of metadata (like the removal of GB "special" speakers) in add-common-content (even though it isn't really common content), but to move everything to do with annotated words into parlamint2release.xsl.

What doesn't work is factorisation, this might be my fault (and, to an extent probably is), but right now the parlamint2distro script has factorisation commented out:

#Doesn't work!
#$Saxon noAna=\"$factoriseFiles\" $teiRootTaxonomies outDir=$tmpOutDir -xsl:$scriptFactor $Root`;
#`cp $tmpOutDir/*.xml $Dir`;

The reason is, it produces a mess. Options:

  • @matyaskopp tries to fix it
  • I use the factorisation script already on the submitted corpora, and then run the distro script
  • I reimplement factorisation, as I can't seem to really understand the current code
  • I modify parlamint2release.xsl to defactorise the corpora, and then it will be easier, as the script only has to deal with non-factorised files (problems are esp. with already factorised files)

Thoughts?

@matyaskopp
Collaborator

What doesn't work is factorisation, this might be my fault (and, to an extent probably is), but right now the parlamint2distro script has factorisation commented out:

It never worked in a partially factorized corpus, because it checks whether any of the types listOrg, listPerson, taxonomy is factorized:

if (-e $inListOrg) {$factorised = 1}
elsif (not $procFactor) {print STDERR "WARN: $inListOrg not found\n"}
if (-e $inListPerson) {$factorised = 1}
elsif (not $procFactor) {print STDERR "WARN: $inListPerson not found\n"}
if (@inTaxonomies) {$factorised = 1}
elsif (not $procFactor) {print STDERR "WARN: $inTaxonomies not found\n"}

If any of these types is factorized, then factorization is skipped:
if ($factorised) {print STDERR "INFO: $Dir already factorised\n"}
else {
    print STDERR "INFO: Factorising $Root\n";
    $tmpOutDir = "$tmpDir/factorise";
    #Doesn't work!
    #$Saxon noAna=\"$factoriseFiles\" $teiRootTaxonomies outDir=$tmpOutDir -xsl:$scriptFactor $Root`;
    #`cp $tmpOutDir/*.xml $Dir`;
}

I believe the factorization script is working on partially factorized files, but it does not copy files in xi:include...
Should I add it there?

@TomazErjavec
Collaborator Author

because it checks whether any of the types listOrg, listPerson, taxonomy is factorized
...
believe the factorization script is working on partially factorized files

So maybe the fix here would be to check if all the above are factorised?
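The difference between the current "any" test and an "all" test can be sketched in Python as follows. This is an illustration only: the distro script is Perl, and both `factorisation_state` and the list of expected file names are hypothetical.

```python
from pathlib import Path

def factorisation_state(corpus_dir, expected_files):
    """Classify a corpus folder as 'full', 'partial' or 'none'.

    expected_files: names of the factorised files (listOrg, listPerson,
    taxonomies) that a fully factorised corpus is assumed to contain.
    """
    present = [(Path(corpus_dir) / name).is_file() for name in expected_files]
    if present and all(present):
        return "full"      # safe to skip factorisation
    if any(present):
        return "partial"   # the case the existing 'any' check treats as done
    return "none"
```

Skipping only on "full" would mean a partially factorised corpus still goes through the factorisation step.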

but it does not copy files in xi:include... Should I add it there?

Yes please!

matyaskopp added a commit that referenced this issue Jun 4, 2023
@matyaskopp
Collaborator

because it checks whether any of the types listOrg, listPerson, taxonomy is factorized
...
believe the factorization script is working on partially factorized files

So maybe the fix here would be to check if all the above are factorised?

I changed the distro script to factorize every time - even if it is already factorized. Not tested !!!

if (-e $inListOrg) {$factorised = 1}
elsif (not $procFactor) {print STDERR "WARN: $inListOrg not found\n"}
if (-e $inListPerson) {$factorised = 1}
elsif (not $procFactor) {print STDERR "WARN: $inListPerson not found\n"}
if (@inTaxonomies) {$factorised = 1}
elsif (not $procFactor) {print STDERR "WARN: $inTaxonomies not found\n"}
if ($procFactor or $procCommon) {
    if ($factorised) {print STDERR "INFO: $Dir already (fully/partially) factorised\n"}
    print STDERR "INFO: Factorising $Root\n";
    $tmpOutDir = "$tmpDir/factorise";
    `$Saxon noAna=\"$factoriseFiles\" $teiRootTaxonomies outDir=$tmpOutDir -xsl:$scriptFactor $Root`;
    `cp $tmpOutDir/*.xml $Dir`;
    if ($procCommon) {
        foreach my $taxonomy (sort keys %taxonomy) {
            #Eventually we will need an XSLT to extract from common taxonomies catDesc with relevant @xml:lang(s)!
            `cp $taxonomy{$taxonomy} $Dir/$taxonomy.xml`
        }
    }
}

but it does not copy files in xi:include... Should I add it there?

Yes please!

Done and tested (the copy first parses the XML, so the indentation in the copied file can change)
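For illustration, copying the files referenced by xi:include through a parse step could be sketched in Python like this. It is not the code from the commit; `copy_included_files` is a hypothetical name, and hrefs are assumed to be relative to the root file.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

def copy_included_files(root_file, out_dir):
    """Copy every xi:include target of root_file into out_dir.

    Each target is parsed and re-serialised rather than copied byte for
    byte, so whitespace, comments and namespace prefixes may differ.
    Returns (copied, missing) lists of href values.
    """
    root_file, out_dir = Path(root_file), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for inc in ET.parse(str(root_file)).getroot().iter(XI_INCLUDE):
        href = inc.get("href")
        if not href:
            continue
        src = root_file.parent / href
        if not src.is_file():
            missing.append(href)   # report instead of failing later
            continue
        dst = out_dir / href
        dst.parent.mkdir(parents=True, exist_ok=True)
        ET.parse(str(src)).write(str(dst), encoding="utf-8", xml_declaration=True)
        copied.append(href)
    return copied, missing
```

Returning the missing hrefs makes it possible to warn about an absent include target before a later XSLT step tries to read it.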

@TomazErjavec
Collaborator Author

Alas, after another 3 hours of trying, parlamint2distro.pl still doesn't work. It seems I don't really understand the factorisation stuff, and the whole distro with add-common, parlamint2release and factorisation has gotten so complicated that my head hurts.

The distro script now dies with

INFO: using (TEI+TEI.ana)-shared taxonomies from /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/In/ParlaMint-LV.TEI
INFO: Factorising /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-LV.ana.xml
INFO: Starting to process ParlaMint-LV.ana
INFO: processing root
INFO: Copying ParlaMint-taxonomy-parla.legislature.xml to /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Scripts/tmp/eZRJEbv_Cx/factorise/ParlaMint-taxonomy-parla.legislature.xml
Error at char 14 in expression in xsl:apply-templates/@select on line 127 column 70 of parlamint-factorize-teiHeader.xsl:
  FODC0002  I/O error reported by XML parser processing
  file:/home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml: /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml (No such file or directory). Caused by java.io.FileNotFoundException: /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml (No such file or directory)

I would of course be grateful for any help; it can be done directly on tantra.

@TomazErjavec
Collaborator Author

This has all now been fixed; closing.
