Controlling the tokenizing? / Order of replacements #2

Open
ebeshero opened this issue Apr 23, 2022 · 9 comments

@ebeshero
Member

@Arithmeticus Can you offer a little guidance on how to control the tokenization?

Context:
I'm getting some problematic collation output when I'm collating with tags. Here's a sample around a simple name with some bumpy normalized tags in my manuscript witness. We begin with a <p/> marker in all but the MS witness, which has some highlighting going on.

 <u>
         <txt>&lt;p/&gt; &lt;p</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </u>
      <u>
         <txt> &amp;gt;M&lt;shi rend="sup"&gt;r&lt;</txt>
         <wit ref="msColl_C27" pos="3024"/>
      </u>
      <c>
         <txt>/</txt>
         <wit ref="1818_fullFlat_C27" pos="3540"/>
         <wit ref="Thomas_fullFlat_C27" pos="3552"/>
         <wit ref="1823_fullFlat_C27" pos="3538"/>
         <wit ref="1831_fullFlat_C27" pos="3542"/>
         <wit ref="msColl_C27" pos="3048"/>
      </c>
      <u>
         <txt>&gt;Mr</txt>
         <wit ref="1818_fullFlat_C27" pos="3541"/>
         <wit ref="Thomas_fullFlat_C27" pos="3553"/>
         <wit ref="1823_fullFlat_C27" pos="3539"/>
         <wit ref="1831_fullFlat_C27" pos="3543"/>
      </u>
      <u>
         <txt>shi&gt;</txt>
         <wit ref="msColl_C27" pos="3049"/>
      </u>
      <c>
         <txt>. </txt>
         <wit ref="1818_fullFlat_C27" pos="3544"/>
         <wit ref="Thomas_fullFlat_C27" pos="3556"/>
         <wit ref="1823_fullFlat_C27" pos="3542"/>
         <wit ref="1831_fullFlat_C27" pos="3546"/>
         <wit ref="msColl_C27" pos="3053"/>
      </c>
      <u>
         <txt>Kirwin,</txt>
         <wit ref="1818_fullFlat_C27" pos="3546"/>
         <wit ref="Thomas_fullFlat_C27" pos="3558"/>
         <wit ref="1823_fullFlat_C27" pos="3544"/>
         <wit ref="1831_fullFlat_C27" pos="3548"/>
      </u>
      <u>
         <txt>Kirwin</txt>
         <wit ref="msColl_C27" pos="3055"/>
      </u>
@ebeshero
Member Author

I'm reworking this because I realize I don't actually care about <shi> tags in the collation (they just mark superscripts / subscripts in the MS witness). Screening those out with a replacement pattern generated a new wrinkle, this time, I think, because of the order in which the replacements are made:

There's a lot being normalized away ("munched") even from the MS witness in the middle of this sequence:

<c>
         <txt>corpse.</txt>
         <wit ref="1818_fullFlat_C27" pos="3526"/>
         <wit ref="Thomas_fullFlat_C27" pos="3538"/>
         <wit ref="1823_fullFlat_C27" pos="3524"/>
         <wit ref="1831_fullFlat_C27" pos="3528"/>
         <wit ref="msColl_C27" pos="2972"/>
      </c>
      <u>
         <txt>&lt;p/&gt; &lt;p/&gt;Mr. Kirwin, on hearing this evidence, desired that I should be taken into the room where the body lay for interment, that it might be observed what effect the sight of it would produce</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </u>
      <u>
         <txt> &amp;gt;Mrd&lt;del/&gt;</txt>
         <wit ref="msColl_C27" pos="2979"/>
      </u>
      <c>
         <txt> upon me. This idea was probably suggested by the extreme agitation I had exhibited </txt>
         <wit ref="1818_fullFlat_C27" pos="3726"/>
         <wit ref="Thomas_fullFlat_C27" pos="3738"/>
         <wit ref="1823_fullFlat_C27" pos="3724"/>
         <wit ref="1831_fullFlat_C27" pos="3728"/>
         <wit ref="msColl_C27" pos="2993"/>
      </c>

Problem passage:

 <u>
         <txt> &amp;gt;Mrd&lt;del/&gt;</txt>
         <wit ref="msColl_C27" pos="2979"/>
      </u>

@ebeshero
Member Author

Trying to fix this by changing the order of the replacements, to do the little <shi> adjustment before I process <del>s...

@ebeshero
Member Author

...And discovered the unhappy cause in the source document, a stray right angle bracket, which can throw everything off. Fortunately, we can normalize it away as a pattern, early on.

<lb n="c57-0119__main__18"/> &gt;M<shi rend="sup">r</shi>. Kirwin 
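
For anyone following along, here is a minimal Python sketch of the idea of normalizing the stray bracket away early, before the other replacements run. The rule list and the `normalize` function are hypothetical illustrations, not the project's actual replacement patterns; the sketch only shows that ordered replacements are applied in sequence, so an early rule can clear debris before later rules see it.

```python
import re

# Hypothetical ordered normalization rules (pattern, replacement).
# These are illustrative assumptions, not the project's real patterns.
RULES = [
    # 1. Drop a stray escaped right angle bracket that directly precedes a word character.
    (re.compile(r"&gt;(?=\w)"), ""),
    # 2. Drop <shi> start and end tags but keep their text content.
    (re.compile(r"</?shi\b[^>]*>"), ""),
    # 3. Drop empty line-break milestones.
    (re.compile(r"<lb\b[^>]*/>"), ""),
]

def normalize(text: str) -> str:
    """Apply each rule in list order; earlier rules change what later rules see."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

source = '<lb n="c57-0119__main__18"/> &gt;M<shi rend="sup">r</shi>. Kirwin '
result = normalize(source)
print(repr(result))
```

With the bracket rule first, the MS reading reduces to plain text that can line up with the print witnesses; moving that rule later would let the bracket survive into whatever the intermediate rules produce.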

@ebeshero
Member Author

Corrected! Good collation attained!

 <u>
         <txt>&lt;p/&gt; &lt;p/&gt;Mr</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </u>
      <u>
         <txt> Mr</txt>
         <wit ref="msColl_C27" pos="3004"/>
      </u>
      <c>
         <txt>. </txt>
         <wit ref="1818_fullFlat_C27" pos="3544"/>
         <wit ref="Thomas_fullFlat_C27" pos="3556"/>
         <wit ref="1823_fullFlat_C27" pos="3542"/>
         <wit ref="1831_fullFlat_C27" pos="3546"/>
         <wit ref="msColl_C27" pos="3007"/>
      </c>
      <u>
         <txt>Kirwin,</txt>
         <wit ref="1818_fullFlat_C27" pos="3546"/>
         <wit ref="Thomas_fullFlat_C27" pos="3558"/>
         <wit ref="1823_fullFlat_C27" pos="3544"/>
         <wit ref="1831_fullFlat_C27" pos="3548"/>
      </u>
      <u>
         <txt>Kirwin</txt>
         <wit ref="msColl_C27" pos="3009"/>
      </u>
      <c>
         <txt> on hearing this </txt>
         <wit ref="1818_fullFlat_C27" pos="3553"/>
         <wit ref="Thomas_fullFlat_C27" pos="3565"/>
         <wit ref="1823_fullFlat_C27" pos="3551"/>
         <wit ref="1831_fullFlat_C27" pos="3555"/>
         <wit ref="msColl_C27" pos="3015"/>
      </c>

@ebeshero
Member Author

@Arithmeticus Standing question: Can we / should we be able to define a tag: &lt;/?.+?/>&gt; as an unbreakable token, not to be divided up? I'm watching for this as a sign of trouble...

@Arithmeticus
Collaborator

@ebeshero

I am experimenting with this basic token definition, which treats serialized tags as tokens on par with "standard" word tokens:

<token-definition pattern="&lt;/?\i\c*.*?&gt;|[\w&#xad;​&#x200b;&#x200d;-[&lt;&gt;]]+" flags=""/>

Is this the sort of basic raw output you're hoping for, when you snap to word?:

<diff xmlns="tag:textalign.net,2015:ns">
   <common>&lt;xml xml:lang="en"&gt;
   &lt;anchor type="collate" xml:id="C11"/&gt;
        </common>
   <a>&lt;milestone unit="chapter" type="start" n="5"/&gt;</a>
   <b>&lt;milestone unit="chapter" type="start" n="6"/&gt;</b>
   <common>
          </common>
   <a>&lt;head sID="novel1_letter4_chapter5_div4_div5_head1"/&gt;</a>
   <b>&lt;head sID="novel1_letter4_chapter6_div4_div6_head1"/&gt;</b>
   <common>CHAPTER </common>
   <a>V</a>
   <b>VI</b>
   <common>.</common>
   <a>&lt;head eID="novel1_letter4_chapter5_div4_div5_head1"/&gt;</a>
   <b>&lt;head eID="novel1_letter4_chapter6_div4_div6_head1"/&gt;</b>
   <common>
          </common>
   <a>&lt;p sID="novel1_letter4_chapter5_div4_div5_p1"/&gt;</a>
   <b>&lt;p sID="novel1_letter4_chapter6_div4_div6_p1"/&gt; </b>
   <common>C</common>
   <a>&lt;hi sID="novel1_letter4_chapter5_div4_div5_p1_hi1"/&gt;</a>
   <b>&lt;hi sID="novel1_letter4_chapter6_div4_div6_p1_hi1"/&gt;</b>
   <common>LERVAL</common>
   <a>&lt;hi eID="novel1_letter4_chapter5_div4_div5_p1_hi1"/&gt;</a>
   <!--Trimming next 603 nodes (deep skip)-->
</diff>
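
To make the behavior of that token definition easier to try outside XSLT, here is a rough Python approximation. It is a sketch under stated assumptions: XPath's `\i\c*` (an XML name) is narrowed to an ASCII-style name, and the character-class subtraction `-[<>]` is dropped because Python's `\w` never matches angle brackets. The `tokenize` function and the sample string are made up for illustration, and characters between tokens are simply skipped here.

```python
import re

# Python approximation of the XPath pattern above (not identical semantics):
#   alternative 1: a whole serialized tag, kept as one unbreakable token
#   alternative 2: a run of word characters, soft hyphens, or zero-width characters
TOKEN = re.compile(r"</?[A-Za-z_:][\w:.\-]*.*?>|[\w\u00ad\u200b\u200d]+")

def tokenize(text: str) -> list[str]:
    """Return tag tokens and word tokens in document order."""
    return TOKEN.findall(text)

sample = 'CHAPTER <hi sID="h1"/>V<hi eID="h1"/>.'
tokens = tokenize(sample)
print(tokens)
```

Because the tag alternative comes first, a tag is matched whole before the word alternative can pick out pieces such as `hi` or `sID` from inside it.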

@ebeshero
Member Author

@Arithmeticus @Yuying-Jin Yes indeed, that's the output we were hoping for! (I am sorry I didn't reply to this when you posted, but we're returning to work on it now!)

@ebeshero
Member Author

@Yuying-Jin Welcome to the TAN DIFF XSLT experiment on Frankenstein! Let's see what we can do here... :-)

@Arithmeticus
Collaborator

Just FYI, there is a global parameter, $tan:token-definition-default, under parameters/param-application.xsl, that sets the default definition of a token. That can be configured as you like.

Because you're comparing serialized XML, that definition isn't cutting it. Strange thing is, I have run across the same need at work. I've done some experimentation, and I'm getting better results when I redefine the global parameter as follows:

<xsl:param name="tan:token-definition-default" as="element()">
      <token-definition pattern="&#x3c;[^&#x3e;]+&#x3e;|[\w&#xad;​&#x200b;&#x200d;]+|[^\w&#xad;​&#x200b;&#x200d;\s]" flags=""/>
</xsl:param>
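
The redefined pattern has three alternatives: a whole tag, a run of word characters, or a single non-word, non-space character. A hedged Python sketch of the same idea follows; Python's `\w` is narrower than XPath's, so treat it as an approximation rather than an exact port, and the sample string is only an illustration modeled on the MS reading discussed earlier in this thread.

```python
import re

# Approximate Python rendering of the three alternatives:
#   1. <[^>]+>                      a serialized tag as one token
#   2. [\w\u00ad\u200b\u200d]+      a run of word characters (plus soft hyphen, ZWSP, ZWJ)
#   3. [^\w\u00ad\u200b\u200d\s]    any single remaining non-space character
TOKEN = re.compile(r"<[^>]+>|[\w\u00ad\u200b\u200d]+|[^\w\u00ad\u200b\u200d\s]")

sample = '>M<shi rend="sup">r</shi>. Kirwin,'
tokens = TOKEN.findall(sample)
print(tokens)
```

Under this definition a stray `>` or a trailing comma becomes its own one-character token instead of attaching to a neighboring word, which should make a difference like `Kirwin,` versus `Kirwin` come down to the comma alone.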
