Finalize ParlaMint script #157

matyaskopp · 2022-02-16T09:01:57Z

This issue collects ideas on what the finalization script should (and probably shouldn't) do.

- count tagUsage numbers
- count cumulative numbers for extent/measure (the numbers of speeches in component files should be provided by partners)
- set release date
- set version
- set handle
- add correct subcorpus flag (reference/covid)
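
To make the first task concrete, here is a minimal Python sketch of counting tagUsage numbers, that is, how often each element occurs inside the TEI `text` element. It is only an illustration of the counting logic, not the XSLT used in the repository; the function name `count_tag_usage` and the sample document are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_tag_usage(tei_xml: str) -> Counter:
    """Count element occurrences inside <text> (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    text = root.find(f"{{{TEI_NS}}}text")
    counts = Counter()
    if text is None:
        return counts
    # iter() visits <text> itself and all of its descendants.
    for elem in text.iter():
        # Strip the '{namespace}' prefix to get the local element name.
        counts[elem.tag.split("}", 1)[-1]] += 1
    return counts


# Made-up miniature component file, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader/>
  <text><body><div>
    <note>Opening</note>
    <u><seg>First.</seg><seg>Second.</seg></u>
  </div></body></text>
</TEI>"""

counts = count_tag_usage(SAMPLE)
for name in sorted(counts):
    print(name, counts[name])
```

Cumulative numbers for the root file could then be obtained by summing the per-component counters, assuming every component is counted the same way.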

matyaskopp · 2022-02-25T11:06:26Z

The finalization script can be split into two tasks:

To pass the validation (counting numbers, croaking when affiliations or relations (coalition and opposition) overlap).
- parlamint-add-common-content.xsl
To make it ready for release (version, release date, handle); this step would not propagate any data from component files to the root.
- parlamint2release.xsl
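
As an illustration of the overlap check mentioned for the validation step, here is a small Python sketch. It assumes each affiliation or relation is reduced to a label with an inclusive start date and an optional end date; the function name, the data layout and the sample values are hypothetical, and which overlaps actually count as errors is for the validation rules to decide.

```python
from datetime import date
from itertools import combinations


def overlapping_pairs(intervals):
    """Return label pairs whose (inclusive) date ranges share at least one day.

    Each interval is (label, start, end); end may be None for "still ongoing".
    """
    def end_or_max(end):
        return end if end is not None else date.max

    clashes = []
    for a, b in combinations(intervals, 2):
        # Two ranges overlap when each one starts before the other one ends.
        if a[1] <= end_or_max(b[2]) and b[1] <= end_or_max(a[2]):
            clashes.append((a[0], b[0]))
    return clashes


# Made-up affiliations of a single person, for illustration only.
affiliations = [
    ("party.A", date(2018, 1, 1), date(2019, 12, 31)),
    ("party.B", date(2019, 6, 1), None),
    ("party.C", date(2016, 1, 1), date(2017, 12, 31)),
]
clashes = overlapping_pairs(affiliations)
print(clashes)
```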

Do not use the parlamint2final.xsl script; it should be replaced by the two scripts above.

matyaskopp · 2022-06-23T08:11:21Z

The script currently contains country-specific (or ParlaMint I corpus-specific) modifications:

fixing _ lemma value:

ParlaMint/Scripts/parlamint-add-common-content.xsl

Lines 443 to 460 in da1bf73

    
    <!-- Bug in STANZA, sometimes a word lemma is set to "_" -->
    <!-- We set lemma to @norm, if it exists, else to text() of the word -->
    <xsl:template mode="comp" match="tei:w/@lemma[. = '_']">
      <xsl:attribute name="lemma">
        <xsl:choose>
          <xsl:when test="../@norm">
            <xsl:message select="concat('WARN ', /tei:TEI/@xml:id,
                                 ': changing _ lemma to @norm ', ../@norm, ' in ', ../@xml:id)"/>
            <xsl:value-of select="../@norm"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:message select="concat('WARN ', /tei:TEI/@xml:id,
                                 ': changing _ lemma to token ', ../text(), ' in ', ../@xml:id)"/>
            <xsl:value-of select="../text()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:attribute>
    </xsl:template>

adding textClass if missing:

ParlaMint/Scripts/parlamint-add-common-content.xsl

Lines 577 to 591 in da1bf73

    
    <textClass>
      <catRef scheme="{$house-refs/tei:ref[. = 'Legislature']/@target}">
        <xsl:attribute name="target">
          <xsl:variable name="targets">
            <xsl:for-each select="$house-refs/tei:ref">
              <xsl:if test=". != 'Legislature'">
                <xsl:value-of select="@target"/>
                <xsl:text>&#32;</xsl:text>
              </xsl:if>
            </xsl:for-each>
          </xsl:variable>
          <xsl:value-of select="normalize-space($targets)"/>
        </xsl:attribute>
      </catRef>
    </textClass>

adding parla.lower or parla.upper if missing and the parliament is bicameral:

ParlaMint/Scripts/parlamint-add-common-content.xsl

Lines 649 to 661 in da1bf73

    
    <xsl:when test="normalize-space($house)">
      <xsl:variable name="refs">
        <xsl:for-each select="$house/tei:ref">
          <xsl:value-of select="@target"/>
          <xsl:text>&#32;</xsl:text>
        </xsl:for-each>
      </xsl:variable>
      <xsl:message select="concat('INFO ', /tei:TEI/@xml:id,
                           ': inserting ', $refs, 'into meeting/@ana')"/>
      <xsl:attribute name="ana" select="concat($refs, @ana)"/>
    </xsl:when>
    <xsl:otherwise>

I think all these changes can be moved to fixings/v2tov3, and the validate-parlamint.xsl validation should be extended to cover these known issues. New partners should add this content themselves.
@TomazErjavec, do you agree?

TomazErjavec · 2023-06-02T15:15:59Z

I have now made a new script, parlamint2release, in 084d3ec.
As suggested, it only performs fixes for a release and does not duplicate the add-common-content tasks.
In particular:

ParlaMint/Scripts/parlamint2release.xsl

Lines 10 to 29 in 084d3ec

    
    Changes to root file:
    - sort XIncluded component files
    - give correct type and subtype to idno
    - delete old and now redundant pubPlace
    - insert textClass if missing
    - fix spurious spaces in text content (multiple, leading and trailing spaces)
    Changes to component files:
    - set references to subcorpora ('reference' 'COVID', 'War')
    - add reference to parliamentary body of the meeting, if missing
    - change div/@type for divs without utterances
    - remove empty notes
    - assign IDs to segments without them
    - in .ana remove body name tag if name contains no words
    - in .ana change tag from <w> to <pc> for punctuation
    - in .ana change UPoS tag from - to X
    - in .ana change lemma tag from _ to normalised form or wordform
    - in .ana change root syntactic dependency to dep, if node is not sentence root
    - in .ana change <PAD> syntactic dependency to dep
    - fix spurious spaces in text content (multiple, leading and trailing spaces)

matyaskopp · 2023-06-07T06:03:26Z

@TomazErjavec I am now checking old issues, and I discovered a suggestion about deleting and reintroducing brackets in notes: #195

The final suggestion was to add it to parlamint2final, which is now parlamint2release.

Would you like me to do it?

TomazErjavec · 2023-06-07T07:58:33Z

We have to be a bit careful now, as the notes have already been translated to English, and I match them to the originals based on the form of the original note. But if the transformation is deterministic and commented, I guess I can apply the same transformation in the matching process, so, yes, pls. do it. I guess parlamint2release is the right script for this.

matyaskopp · 2023-06-07T08:43:03Z

> We have to be a bit careful now, as the notes have already been translated to English, and I match them to the originals based on the form of the original note. But if the transformation is deterministic and commented, I guess I can apply the same transformation in the matching process, so, yes, pls. do it. I guess parlamint2release is the right script for this.

Understood. I will implement a function mk:normalize-note() in parlamint-lib.xsl so that you can use this normalization instead of normalize-space().

On second thought: do we want to normalize all notes by removing and reintroducing parentheses, as suggested here: #195 (comment)?

When we discussed this, we had no experience with this kind of long sequence of notes:

ParlaMint/Data/ParlaMint-SE/ParlaMint-SE_2017-12-12-prot-201718--48.xml

Lines 85 to 116 in 53c4c19

    
    <div type="commentSection">
       <head xml:id="i-83YmkLiKduSU4EqjazDjf8">§ 1 Justering av protokoll</head>
       <note xml:id="i-TPBwzzivUQYT7QYA1pckNH">Protokollet för den 21 november justerades.</note>
    </div>
    <div type="commentSection">
       <head xml:id="i-9XFC5jmFcsZ5xTSkupreer">§ 2 Ärenden för hänvisning till utskott</head>
       <note xml:id="i-PQk44B9Cd5Z1nPHf7ssZV9">Följande dokument hänvisades till utskott:</note>
       <note xml:id="i-5b8D4vm9Zd75VyfmeM9GDq">EU-dokument</note>
       <note xml:id="i-6c7V56bHPt4iGcuH3XYbb2">KOM(2017) 825 och KOM(2017) 827 till finansutskottet</note>
       <note xml:id="i-4seoNNS6y1j51wyDV7t2sP">Åttaveckorsfristen för att avge ett motiverat yttrande skulle gå ut den 2 februari 2018 .</note>
    </div>
    <div type="commentSection">
       <head xml:id="i-ArE5Y1t6cW7ASWxxu8GyUG">§ 3 Ärenden för bordläggning</head>
       <note xml:id="i-5vmuQMV1AfWRZCtQs24R6a">Följande dokument anmäldes och bordlades:</note>
       <note xml:id="i-FAKVxsBqM1L11FjyidgWJs">Arbetsmarknadsutskottets betänkanden</note>
       <note xml:id="i-x7GjSXWKADuGAxpjrpfd9">2017/18:AU2 Utgiftsområde 14 Arbetsmarknad och arbetsliv</note>
       <note xml:id="i-wNDasVVBqbT9N6vds6rM1">2017/18:AU4 Arbetsmarknadspolitik och arbetslöshetsförsäkringen</note>
       <note xml:id="i-NToNxZYwo4fhiyEr35Y65g">Finansutskottets betänkanden</note>
       <note xml:id="i-MBjAN5U8ScuJBLu1jVvQJH">2017/18:FiU3 Utgiftsområde 25 Allmänna bidrag till kommuner</note>
       <note xml:id="i-C1bBRxdhvh4FcVzmkrwioa">2017/18:FiU19 Ytterligare verktyg för makrotillsyn</note>
       <note xml:id="i-Hvz7sRzbjzDJf64Y1WejRw">Socialutskottets betänkande</note>
       <note xml:id="i-4aVmrJGwVfg5DDunQLhuLS">2017/18:SoU1 Utgiftsområde 9 Hälsovård, sjukvård och social omsorg</note>
       <note xml:id="i-2mgmCrQKWeHKKgGtsmU6CA">Utbildningsutskottets betänkande</note>
       <note xml:id="i-W7p3jmCXXNiX4oRWoVNjor">2017/18:UbU2 Utgiftsområde 15 Studiestöd</note>
       <note xml:id="i-HrSGiZFkQu2sfSNfURcrS4">Civilutskottets betänkande</note>
       <note xml:id="i-M6KGu4JWkWYkpJBKeM68Zp">2017/18:CU7 Associationsrätt</note>
    </div>
    <div type="debateSection">
       <head xml:id="i-JuDDRgvRVUqWtKqLsJXiGv">§ 4 Ekonomisk trygghet vid ålderdom</head>
       <note xml:id="i-H2goS2xHLoErpPQuZdu37">Socialförsäkringsutskottets betänkande 2017/18:SfU2</note>
       <note xml:id="i-MEawvhS9akdehveaSN3bhK">Utgiftsområde 11 Ekonomisk trygghet vid ålderdom (prop. 2017/18:1 delvis)</note>
       <note xml:id="i-EK3jW47gNQaSSUvMMNjJK3">föredrogs.</note>

I suggest normalizing spaces based on context:

- if ancestor::tei:u, then "(text of note)" or "text of note" becomes "[[text of note]]"
- otherwise (??): either no change, or remove parentheses and do not introduce new ones

TomazErjavec · 2023-06-07T08:56:49Z

Hm, good points. Thinking about this further, maybe we should:

- leave everything as it is, so as not to complicate an already complicated process, or
- just remove any brackets

Namely, I am a bit frightened of having complicated and context-dependent rules for the transformation; we are bound to overlook something in some of the corpora, i.e. make a mess.

matyaskopp · 2023-06-07T09:17:16Z

OK, then removing paired boundary brackets is the best way. I will implement that.
It was your very first suggestion (I should probably not challenge your opinion/wisdom so often :-) ) and the solution TEI recommends.
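
For the record, the rule can be sketched in a few lines of Python. This is only an illustration of "remove paired boundary brackets, repeatedly, so that normalizing twice gives the same result"; it is not the XSLT behind mk:normalize-note(), and the bracket inventory and the Python function names are assumptions made for this example.

```python
import re

# Assumed bracket inventory; the real script may treat other characters.
PAIRS = {"(": ")", "[": "]", "{": "}"}


def _wraps_whole_text(text: str) -> bool:
    """True if the opening bracket at position 0 is closed by the last character."""
    closer = PAIRS.get(text[0])
    if closer is None or text[-1] != closer:
        return False
    depth = 0
    for i, ch in enumerate(text):
        if ch == text[0]:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:
                # The first bracket closes here; it wraps the whole text
                # only if this is the final character.
                return i == len(text) - 1
    return False


def normalize_note(text: str) -> str:
    """Collapse whitespace, then strip brackets that enclose the whole note."""
    text = re.sub(r"\s+", " ", text).strip()
    while len(text) >= 2 and _wraps_whole_text(text):
        text = text[1:-1].strip()
    return text


print(normalize_note("( Applause )"))
print(normalize_note("[[text of note]]"))
print(normalize_note("(a) and (b)"))
print(normalize_note("KOM(2017) 825 och KOM(2017) 827"))
```

Brackets inside a note, as in the Swedish KOM(2017) example, are left alone because only a pair that encloses the whole note is removed.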

TomazErjavec · 2023-06-07T09:18:50Z

> OK, then removing paired boundary brackets is the best way. I will implement that.

OK, great, thanks.

> It was your very first suggestion (I should probably not challenge your opinion/wisdom so often :-) ) and the solution TEI recommends.

My suggestions are far from always right, but it is nice of you to say so :)

TomazErjavec · 2023-06-08T17:43:21Z

I think we are done here (37d6946) so, closing.
If new issues crop up, we can open another issue :)

- matyaskopp added the enhancement and help wanted labels on Feb 16, 2022
- matyaskopp mentioned this issue on Feb 23, 2022: DK ES FR IT TR missing text element in tagUsage #105 (closed)
- matyaskopp added a commit that referenced this issue on Jun 10, 2022: da1bf73, script for calculating common content (not finished) #157
- matyaskopp added a commit that referenced this issue on Jun 24, 2022: 16b9977, refactorization - inserting tagDecl with separate template - common for both root and component files (#157)
- matyaskopp mentioned this issue on Jun 24, 2022: Trascriber notes containing brackets #195 (closed)
- matyaskopp added a commit that referenced this issue on Jun 27, 2022: cb2346b, print tagUsage changes (#157)
- TomazErjavec mentioned this issue on Jun 1, 2023: Refactorise parlamint-add-common-content and parlamint2final #675 (closed)
- matyaskopp added a commit that referenced this issue on Jun 8, 2023: 0763803, simple notes and incidents normalization - no nested brackets, no nested elements (#157 #195)
- matyaskopp added a commit that referenced this issue on Jun 8, 2023: db9eaf5, make note normalization recursive (=double normalization has the same result) (#195 #157)
- TomazErjavec closed this as completed on Jun 8, 2023
