Make script converting tokenized corpus contents to non-tokenized versions #112

Open
iljackb opened this issue Oct 6, 2021 · 1 comment


iljackb commented Oct 6, 2021

Make a standing XSLT script that converts tokenized corpus documents to non-tokenized ones, copying only the contents.

This can then be the basis for a third main variety of every corpus document, in which we will store and edit IGT (interlinear glossed text) glosses.

The following can be the other outputs:

1) Basic structure with the original Mixtec sentence (non-tokenized) and its English and Spanish sentence translations

<seg xml:id="d1e140a" n="3" xml:lang="mix" resp="#TS" type="S">Nikitsi Shanty ka tsi mee ncha aueroperto S.F </seg>
<spanGrp type="annotations">
   <span ana="#S" target="#d1e140a" xml:lang="en" type="translation">Shanty came with me to the S.F airport.</span>
   <span ana="#S" target="#d1e140a" xml:lang="es" type="translation">Shanty vino conmigo al aeropuerto de S.F.</span>
</spanGrp>

This can then be copied (in a slightly modified XSLT) and (mostly) manually edited to become:

2) IGT-centered data structure

<seg xml:id="d1e140igt" n="3" xml:lang="mix" resp="#TS" type="IGT">Ni-kits-i Shanty=ka tsi mee ncha aueroperto S.F.</seg>
<spanGrp type="annotations">
   <span ana="#S" target="#d1e140igt" xml:lang="en" type="IGT">PFV-come-3s Shanty=TPC with PRON-EMPH.1s ADPOS.until S.F airport</span> <!-- DECIDE ON TYPOLOGY -->
   <span ana="#S" target="#d1e140igt" xml:lang="en" type="translation">Shanty came with me to the S.F airport.</span>
   <span ana="#S" target="#d1e140igt" xml:lang="es" type="translation">Shanty vino conmigo al aeropuerto de S.F.</span>
</spanGrp>
  • Important note: I will need to create a proper typology for two things: the values of //seg, to express that it is both a sentence (#S) and segmented as interlinear glossed text (#IGT), and the value of the //span that contains the interlinear glosses corresponding to that //seg.
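
The copy step from structure 1 to structure 2 could be sketched roughly as follows. This is a Python illustration of the logic only, not the XSLT itself. The function name `make_igt_skeleton`, the wrapping `<div>`, and the rule for choosing the new id are all assumptions, and the `type="IGT"` values are placeholders until the typology above is decided. The morpheme segmentation and the gloss line are left for manual editing.

```python
import copy
import xml.etree.ElementTree as ET

# ElementTree stores xml:id / xml:lang under the XML namespace in Clark notation.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

def make_igt_skeleton(seg, span_grp, igt_id):
    """Copy a sentence <seg> and its <spanGrp> into an IGT skeleton.

    Hypothetical helper: retargets every copied <span> to the new id and
    prepends an empty gloss <span> to be filled in by hand.
    """
    igt_seg = copy.deepcopy(seg)
    igt_seg.set(XML_NS + "id", igt_id)
    igt_seg.set("type", "IGT")  # placeholder value, typology still undecided

    igt_grp = copy.deepcopy(span_grp)
    for span in igt_grp.iter("span"):
        span.set("target", "#" + igt_id)

    gloss = ET.Element("span", {
        "ana": "#S",
        "target": "#" + igt_id,
        XML_NS + "lang": "en",
        "type": "IGT",  # placeholder value, typology still undecided
    })
    igt_grp.insert(0, gloss)
    return igt_seg, igt_grp

# The <div> wrapper is only there so the fragment parses as one document.
sample = """<div>
<seg xml:id="d1e140a" n="3" xml:lang="mix" resp="#TS" type="S">Nikitsi Shanty ka tsi mee ncha aueroperto S.F </seg>
<spanGrp type="annotations">
   <span ana="#S" target="#d1e140a" xml:lang="en" type="translation">Shanty came with me to the S.F airport.</span>
   <span ana="#S" target="#d1e140a" xml:lang="es" type="translation">Shanty vino conmigo al aeropuerto de S.F.</span>
</spanGrp>
</div>"""

root = ET.fromstring(sample)
igt_seg, igt_grp = make_igt_skeleton(root.find("seg"), root.find("spanGrp"), "d1e140igt")
```

Serializing `igt_seg` and `igt_grp` with `ET.tostring(..., encoding="unicode")` gives the starting point for manual glossing. Real corpus files are presumably in the TEI namespace, in which case the tag names passed to `find` and `iter` would need the namespace prefix.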

iljackb commented Oct 6, 2021

In doing this, beware of nested <w> elements such as:

                  <foreign xml:lang="es">
                     <w xml:id="d1e11698">
                        <w xml:id="d1e11699">perro</w>
                        <w xml:id="d1e11701">caliente</w>
                     </w>
                  </foreign>

and

                     <w xml:id="d1e11698">
                        <w xml:id="d1e11699">perro</w>
                        <w xml:id="d1e11701">caliente</w>
                     </w>
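
One way to handle both cases is to collect text once from the containing element rather than once per <w>, since a loop over every <w> would emit the words inside a nested <w> twice. A minimal Python sketch of that idea follows (an illustration of the logic, not the XSLT). The function name `detokenize` is hypothetical, and the sketch assumes that tokens are separated by whitespace in the source and that the elements carry no namespace, as in the snippets above.

```python
import xml.etree.ElementTree as ET

def detokenize(elem):
    """Return the text content of elem with all markup dropped.

    itertext() visits each text node exactly once in document order, so
    nested <w> elements are not duplicated. split()/join collapses the
    indentation whitespace between tokens into single spaces.
    """
    return " ".join("".join(elem.itertext()).split())

wrapped = ET.fromstring("""
<foreign xml:lang="es">
   <w xml:id="d1e11698">
      <w xml:id="d1e11699">perro</w>
      <w xml:id="d1e11701">caliente</w>
   </w>
</foreign>
""")

bare = ET.fromstring("""
<w xml:id="d1e11698">
   <w xml:id="d1e11699">perro</w>
   <w xml:id="d1e11701">caliente</w>
</w>
""")

wrapped_text = detokenize(wrapped)  # "perro caliente"
bare_text = detokenize(bare)        # "perro caliente"
```

This drops the <foreign> wrapper along with the tokens, which matches "copying only contents" but loses the language marking. If punctuation is tokenized separately, joining on spaces would also put a space before it, so that case would need its own rule.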
