Finalize ParlaMint script #157

Closed
6 tasks done
matyaskopp opened this issue Feb 16, 2022 · 10 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@matyaskopp
Collaborator

matyaskopp commented Feb 16, 2022

This issue collects ideas on what the finalization script should (and probably shouldn't) do.

  • count tagUsage numbers
  • count cumulative numbers for extent/measure. (Numbers of speeches in component files should be provided by partners)
  • set release date
  • set version
  • set handle
  • add correct subcorpus flag reference/covid
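
As a rough illustration of the first task, counting tagUsage numbers for a single component file could look like the sketch below. This is an XSLT 2.0 example written under assumptions, not the project's actual script: it counts every TEI element at or below `tei:text` and emits a `namespace` element holding one `tagUsage` per element name. How the real script selects elements and where it writes the result may differ.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: the input is one component file with a tei:TEI root -->
  <xsl:template match="/">
    <namespace name="http://www.tei-c.org/ns/1.0">
      <!-- Group all TEI elements in the text by name and count each group -->
      <xsl:for-each-group select="//tei:text/descendant-or-self::tei:*"
                          group-by="local-name()">
        <xsl:sort select="current-grouping-key()"/>
        <tagUsage gi="{current-grouping-key()}"
                  occurs="{count(current-group())}"/>
      </xsl:for-each-group>
    </namespace>
  </xsl:template>
</xsl:stylesheet>
```

Cumulative numbers for the root file would then be sums of these per-component counts for each `gi` value.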
@matyaskopp matyaskopp added enhancement New feature or request help wanted Extra attention is needed labels Feb 16, 2022
@matyaskopp
Collaborator Author

matyaskopp commented Feb 25, 2022

The finalization script can be split into two tasks:

  1. To pass the validation (counting numbers, croak when affiliations or relations (coalition and opposition) overlap).
    • parlamint-add-common-content.xsl
  2. To make it ready for release (version, release date, handle). There would be no data propagation from component files to the root.
    • parlamint2release.xsl

Do not use the parlamint2final.xsl script; it should be replaced by the two scripts above.

@matyaskopp
Collaborator Author

The script currently contains country-specific (ParlaMint I corpus) modifications:

  • fixing _ lemma value:
    <!-- Bug in STANZA, sometimes a word lemma is set to "_" -->
    <!-- We set lemma to @norm, if it exists, else to text() of the word -->
    <xsl:template mode="comp" match="tei:w/@lemma[. = '_']">
      <xsl:attribute name="lemma">
        <xsl:choose>
          <xsl:when test="../@norm">
            <xsl:message select="concat('WARN ', /tei:TEI/@xml:id,
                                 ': changing _ lemma to @norm ', ../@norm, ' in ', ../@xml:id)"/>
            <xsl:value-of select="../@norm"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:message select="concat('WARN ', /tei:TEI/@xml:id,
                                 ': changing _ lemma to token ', ../text(), ' in ', ../@xml:id)"/>
            <xsl:value-of select="../text()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:attribute>
    </xsl:template>
  • adding textClass if missing
    <textClass>
      <catRef scheme="{$house-refs/tei:ref[. = 'Legislature']/@target}">
        <xsl:attribute name="target">
          <xsl:variable name="targets">
            <xsl:for-each select="$house-refs/tei:ref">
              <xsl:if test=". != 'Legislature'">
                <xsl:value-of select="@target"/>
                <xsl:text>&#32;</xsl:text>
              </xsl:if>
            </xsl:for-each>
          </xsl:variable>
          <xsl:value-of select="normalize-space($targets)"/>
        </xsl:attribute>
      </catRef>
    </textClass>
  • adding parla.lower or parla.upper if missing and the parliament is bicameral
    <xsl:when test="normalize-space($house)">
      <xsl:variable name="refs">
        <xsl:for-each select="$house/tei:ref">
          <xsl:value-of select="@target"/>
          <xsl:text>&#32;</xsl:text>
        </xsl:for-each>
      </xsl:variable>
      <xsl:message select="concat('INFO ', /tei:TEI/@xml:id,
                           ': inserting ', $refs, 'into meeting/@ana')"/>
      <xsl:attribute name="ana" select="concat($refs, @ana)"/>
    </xsl:when>
    <xsl:otherwise>

I think all these changes can be moved to fixings/v2tov3, and the validate-parlamint.xsl validation should be extended to cover these known issues. New partners should add this content themselves.
@TomazErjavec, do you agree?

@TomazErjavec
Collaborator

I have now made a new script, parlamint2release, in 084d3ec.
As suggested, it only performs fixes for a release and does not duplicate the add-common-content tasks.
In particular:

Changes to root file:
- sort XIncluded component files
- give correct type and subtype to idno
- delete old and now redundant pubPlace
- insert textClass if missing
- fix spurious spaces in text content (multiple, leading and trailing spaces)
Changes to component files:
- set references to subcorpora ('reference', 'COVID', 'War')
- add reference to parliamentary body of the meeting, if missing
- change div/@type for divs without utterances
- remove empty notes
- assign IDs to segments without them
- in .ana remove body name tag if name contains no words
- in .ana change tag from <w> to <pc> for punctuation
- in .ana change UPoS tag from - to X
- in .ana change lemma tag from _ to normalised form or wordform
- in .ana change root syntactic dependency to dep, if node is not sentence root
- in .ana change <PAD> syntactic dependency to dep
- fix spurious spaces in text content (multiple, leading and trailing spaces)
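
Two of the component-file fixes listed above (removing empty notes and assigning IDs to segments without them) can be sketched as overrides on an identity transform. This is a minimal XSLT 2.0 sketch under assumptions, not the code from parlamint2release: the test for an "empty" note and the ID scheme based on `generate-id()` are both stand-ins chosen for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Identity transform: copy everything not matched by a rule below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: a note is "empty" if it has no child elements
       and no non-whitespace text; such notes are dropped -->
  <xsl:template match="tei:note[not(*) and not(normalize-space(.))]"/>

  <!-- Segments without an ID get one; the 'seg-' + generate-id()
       scheme is hypothetical and only guarantees uniqueness per run -->
  <xsl:template match="tei:seg[not(@xml:id)]">
    <xsl:copy>
      <xsl:attribute name="xml:id" select="concat('seg-', generate-id())"/>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The more specific templates take precedence over the identity rule because patterns with predicates have a higher default priority.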

@matyaskopp
Collaborator Author

@TomazErjavec I am now checking old issues, and I discovered a suggestion about deleting and reintroducing brackets in notes: #195

  • the final suggestion was to add it to parlamint2final, which is now parlamint2release

Would you like me to do it?

@TomazErjavec
Collaborator

We have to be a bit careful now, as the notes have already been translated to English, and I match them to the originals based on the form of the original note. But if the transformation is deterministic and commented, I guess I can apply the same transformation in the matching process, so yes, please do it. I guess parlamint2release is the right script for this.

@matyaskopp
Collaborator Author

We have to be a bit careful now, as the notes have already been translated to English, and I match them to the originals based on the form of the original note. But if the transformation is deterministic and commented, I guess I can apply the same transformation in the matching process, so yes, please do it. I guess parlamint2release is the right script for this.

Understood. I will implement a function mk:normalize-note() in parlamint-lib.xsl so that you can use this normalization instead of normalize-space().

On second thought: do we want to normalize all notes by removing and reintroducing parentheses, as suggested in #195 (comment)?

At the time of that discussion, we had no experience with this kind of long sequence of notes:

<div type="commentSection">
<head xml:id="i-83YmkLiKduSU4EqjazDjf8">§ 1 Justering av protokoll</head>
<note xml:id="i-TPBwzzivUQYT7QYA1pckNH">Protokollet för den 21 november justerades.</note>
</div>
<div type="commentSection">
<head xml:id="i-9XFC5jmFcsZ5xTSkupreer">§ 2 Ärenden för hänvisning till utskott</head>
<note xml:id="i-PQk44B9Cd5Z1nPHf7ssZV9">Följande dokument hänvisades till utskott:</note>
<note xml:id="i-5b8D4vm9Zd75VyfmeM9GDq">EU-dokument</note>
<note xml:id="i-6c7V56bHPt4iGcuH3XYbb2">KOM(2017) 825 och KOM(2017) 827 till finansutskottet</note>
<note xml:id="i-4seoNNS6y1j51wyDV7t2sP">Åttaveckorsfristen för att avge ett motiverat yttrande skulle gå ut den 2 februari 2018 .</note>
</div>
<div type="commentSection">
<head xml:id="i-ArE5Y1t6cW7ASWxxu8GyUG">§ 3 Ärenden för bordläggning</head>
<note xml:id="i-5vmuQMV1AfWRZCtQs24R6a">Följande dokument anmäldes och bordlades:</note>
<note xml:id="i-FAKVxsBqM1L11FjyidgWJs">Arbetsmarknadsutskottets betänkanden</note>
<note xml:id="i-x7GjSXWKADuGAxpjrpfd9">2017/18:AU2 Utgiftsområde 14 Arbetsmarknad och arbetsliv</note>
<note xml:id="i-wNDasVVBqbT9N6vds6rM1">2017/18:AU4 Arbetsmarknadspolitik och arbetslöshetsförsäkringen</note>
<note xml:id="i-NToNxZYwo4fhiyEr35Y65g">Finansutskottets betänkanden</note>
<note xml:id="i-MBjAN5U8ScuJBLu1jVvQJH">2017/18:FiU3 Utgiftsområde 25 Allmänna bidrag till kommuner</note>
<note xml:id="i-C1bBRxdhvh4FcVzmkrwioa">2017/18:FiU19 Ytterligare verktyg för makrotillsyn</note>
<note xml:id="i-Hvz7sRzbjzDJf64Y1WejRw">Socialutskottets betänkande</note>
<note xml:id="i-4aVmrJGwVfg5DDunQLhuLS">2017/18:SoU1 Utgiftsområde 9 Hälsovård, sjukvård och social omsorg</note>
<note xml:id="i-2mgmCrQKWeHKKgGtsmU6CA">Utbildningsutskottets betänkande</note>
<note xml:id="i-W7p3jmCXXNiX4oRWoVNjor">2017/18:UbU2 Utgiftsområde 15 Studiestöd</note>
<note xml:id="i-HrSGiZFkQu2sfSNfURcrS4">Civilutskottets betänkande</note>
<note xml:id="i-M6KGu4JWkWYkpJBKeM68Zp">2017/18:CU7 Associationsrätt</note>
</div>
<div type="debateSection">
<head xml:id="i-JuDDRgvRVUqWtKqLsJXiGv">§ 4 Ekonomisk trygghet vid ålderdom</head>
<note xml:id="i-H2goS2xHLoErpPQuZdu37">Socialförsäkringsutskottets betänkande 2017/18:SfU2</note>
<note xml:id="i-MEawvhS9akdehveaSN3bhK">Utgiftsområde 11 Ekonomisk trygghet vid ålderdom (prop. 2017/18:1 delvis)</note>
<note xml:id="i-EK3jW47gNQaSSUvMMNjJK3">föredrogs.</note>

I suggest normalizing spaces based on context:

  • if ancestor::tei:u, then "(text of note)" or "text of note" becomes "[[text of note]]"
  • otherwise (?): no change, or remove parentheses and do not introduce new ones

@TomazErjavec
Collaborator

Hm, good points. Thinking about this further, maybe we should:

  • leave everything as it is, so as not to complicate an already complicated process
  • just remove any brackets

Namely, I am a bit frightened of having complicated and context-dependent rules for the transformation, as we are bound to overlook something in some of the corpora, i.e. make a mess.

@matyaskopp
Collaborator Author

OK, then removing paired boundary brackets is the best way. I will implement that.
It was your very first suggestion (I should probably not challenge your opinion/wisdom so often :-) ) and the TEI-recommended solution.
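
A minimal sketch of what such a bracket-removing mk:normalize-note() could look like in XSLT 2.0 is given below. It is an illustration under assumptions, not the function as committed: the `mk` namespace URI is a placeholder, only round and square brackets are handled, and a pair is stripped only when the text between the outer brackets contains no further bracket of the same kind, which is a deliberately conservative simplification.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns:mk="http://example.org/mk-placeholder"
    exclude-result-prefixes="xs mk">

  <!-- Hypothetical implementation: normalize spaces, then drop one
       pair of boundary brackets if it encloses the whole note -->
  <xsl:function name="mk:normalize-note" as="xs:string">
    <xsl:param name="note" as="xs:string?"/>
    <xsl:variable name="text" select="normalize-space($note)"/>
    <!-- "(text)" becomes "text"; "(a) b (c)" is left unchanged -->
    <xsl:variable name="no-round"
                  select="replace($text, '^\(([^()]*)\)$', '$1')"/>
    <!-- "[text]" becomes "text" -->
    <xsl:variable name="no-square"
                  select="replace($no-round, '^\[([^\[\]]*)\]$', '$1')"/>
    <xsl:sequence select="normalize-space($no-square)"/>
  </xsl:function>

  <!-- Identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Apply the normalization to notes that contain only text -->
  <xsl:template match="tei:note[not(*)]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:value-of select="mk:normalize-note(.)"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Because the function is deterministic and free of context, the same call could be reused when matching translated notes to their originals. Under this sketch a note such as "KOM(2017) 825 och KOM(2017) 827 till finansutskottet" is left untouched, since its brackets are not at both boundaries.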

@TomazErjavec
Collaborator

OK, then removing paired boundary brackets is the best way. I will implement that.

OK, great, thanks.

It was your very first suggestion (I should probably not challenge your opinion/wisdom so often :-) ) and the TEI-recommended solution.

My suggestions are far from always right, but nice of you to say so :)

@TomazErjavec
Collaborator

I think we are done here (37d6946) so, closing.
If new issues crop up, we can open another issue :)
