Make script converting tokenized corpus contents to non-tokenized versions #112

iljackb · 2021-10-06T20:43:31Z

Make a standing XSLT script that converts tokenized corpus documents to non-tokenized ones, copying only the contents.

This can then be the basis for a third main variety of every corpus document, in which we will store and edit the IGT glosses.

The following can be the other outputs:

1) Basic structure with the original Mixtec (non-tokenized) plus English and Spanish sentence translations

<seg xml:id="d1e140a" n="3" xml:lang="mix" resp="#TS" type="S">Nikitsi Shanty ka tsi mee ncha aueroperto S.F </seg>
<spanGrp type="annotations">
   <span ana="#S" target="#d1e140a" xml:lang="en" type="translation">Shanty came with me to the S.F airport.</span>
   <span ana="#S" target="#d1e140a" xml:lang="es" type="translation">Shanty vino conmigo al aeropuerto de S.F.</span>
</spanGrp>

This can then be copied (in a slightly modified XSLT) and (mostly) manually edited to become:

2) IGT-centered data structure

<seg xml:id="d1e140igt" n="3" xml:lang="mix" resp="#TS" type="IGT">Ni-kits-i Shanty=ka tsi mee ncha aueroperto S.F.</seg>
<spanGrp type="annotations">
   <span ana="#S" target="#d1e140igt" xml:lang="en" type="IGT">PFV-come-3s Shanty=TPC with PRON-EMPH.1s ADPOS.until S.F airport</span> <!-- DECIDE ON TYPOLOGY -->
   <span ana="#S" target="#d1e140igt" xml:lang="en" type="translation">Shanty came with me to the S.F airport.</span>
   <span ana="#S" target="#d1e140igt" xml:lang="es" type="translation">Shanty vino conmigo al aeropuerto de S.F </span>
</spanGrp>
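A rough sketch of the "copy, then edit manually" step, written in Python only to illustrate the logic (the real script is meant to be XSLT). It copies a basic `<seg>`/`<spanGrp>` pair, retargets the spans, and adds an empty gloss `<span>` to fill in by hand. The `<wrapper>` element, the `to_igt` name, and the ID rule (drop a trailing `a`, append `igt`) are assumptions taken from the two examples above, not a settled convention.

```python
import copy
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"
XML_ID = XML_NS + "id"
XML_LANG = XML_NS + "lang"

# <wrapper> is a hypothetical root so the fragment parses on its own.
# Real TEI files are namespaced; tags would then need the TEI namespace prefix.
BASIC = """<wrapper>
<seg xml:id="d1e140a" n="3" xml:lang="mix" resp="#TS" type="S">Nikitsi Shanty ka tsi mee ncha aueroperto S.F </seg>
<spanGrp type="annotations">
   <span ana="#S" target="#d1e140a" xml:lang="en" type="translation">Shanty came with me to the S.F airport.</span>
   <span ana="#S" target="#d1e140a" xml:lang="es" type="translation">Shanty vino conmigo al aeropuerto de S.F.</span>
</spanGrp>
</wrapper>"""


def to_igt(seg, span_grp):
    """Return copies of seg and span_grp prepared for manual IGT editing."""
    old_id = seg.get(XML_ID)
    # Assumed ID rule: d1e140a -> d1e140igt
    stem = old_id[:-1] if old_id.endswith("a") else old_id
    new_id = stem + "igt"

    igt_seg = copy.deepcopy(seg)
    igt_seg.set(XML_ID, new_id)
    igt_seg.set("type", "IGT")

    igt_grp = copy.deepcopy(span_grp)
    for span in igt_grp.iter("span"):
        if span.get("target") == "#" + old_id:
            span.set("target", "#" + new_id)

    # Empty placeholder for the gloss line, to be filled in manually.
    gloss = ET.Element(
        "span",
        {"ana": "#S", "target": "#" + new_id, XML_LANG: "en", "type": "IGT"},
    )
    gloss.tail = igt_grp.text  # keep the existing indentation pattern
    igt_grp.insert(0, gloss)
    return igt_seg, igt_grp


root = ET.fromstring(BASIC)
seg = root.find("seg")
grp = root.find("spanGrp")
igt_seg, igt_grp = to_igt(seg, grp)
print(ET.tostring(igt_seg, encoding="unicode"))
print(ET.tostring(igt_grp, encoding="unicode"))
```

The morpheme segmentation of the `<seg>` text and the gloss content still have to be entered by hand; the sketch only prepares the skeleton.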

Important note: I will need to create a proper typology for two things: the values on //seg, to express that it is both the sentence (#S) and segmented as an interlinear glossed text (#IGT); and the value on the //span that contains the interlinear glosses corresponding to that //seg.


iljackb · 2021-10-06T21:44:41Z

In doing this, beware of nested <w> elements such as:

                  <foreign xml:lang="es">
                     <w xml:id="d1e11698">
                        <w xml:id="d1e11699">perro</w>
                        <w xml:id="d1e11701">caliente</w>
                     </w>
                  </foreign>

and

                     <w xml:id="d1e11698">
                        <w xml:id="d1e11699">perro</w>
                        <w xml:id="d1e11701">caliente</w>
                     </w>
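To show why the nesting matters, here is a small Python sketch of the flattening logic (again only an illustration, not the XSLT). Collecting the text of every `<w>` counts the inner words twice, while collecting only the `<w>` elements that contain no further `<w>` gives each word once. The function names are made up for the example, and it assumes un-namespaced elements as in the snippets above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<foreign xml:lang="es">
   <w xml:id="d1e11698">
      <w xml:id="d1e11699">perro</w>
      <w xml:id="d1e11701">caliente</w>
   </w>
</foreign>"""


def naive_flatten(elem):
    """Join the full text of every <w>, including wrapper <w> elements."""
    parts = []
    for w in elem.iter("w"):
        parts.append(" ".join("".join(w.itertext()).split()))
    return " ".join(parts)


def leaf_flatten(elem):
    """Join the text of only those <w> elements with no <w> descendants."""
    parts = []
    for w in elem.iter("w"):
        if w.find(".//w") is None:  # skip wrappers around other tokens
            text = "".join(w.itertext()).strip()
            if text:
                parts.append(text)
    return " ".join(parts)


root = ET.fromstring(SAMPLE)
print(naive_flatten(root))  # perro caliente perro caliente
print(leaf_flatten(root))   # perro caliente
```

In XSLT the equivalent idea would be to match only `w[not(.//w)]` when emitting text, whether or not the tokens sit inside `<foreign>`.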

iljackb added to-do scripts-stylesheets labels Oct 6, 2021
