Controlling the tokenizing? / Order of replacements #2

ebeshero · 2022-04-23T23:59:55Z

@Arithmeticus Can you offer a little guidance on how to control the tokenization?

Context:
I'm getting some problematic collation output when collating with tags. Here's a sample around a simple name with some bumpy normalized tags in my manuscript witness. Every witness except the MS begins with a <p/> marker; the MS witness has some highlighting going on instead.

      <u>
         <txt>&lt;p/&gt; &lt;p</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </u>
      <u>
         <txt> &amp;gt;M&lt;shi rend="sup"&gt;r&lt;</txt>
         <wit ref="msColl_C27" pos="3024"/>
      </u>
      <c>
         <txt>/</txt>
         <wit ref="1818_fullFlat_C27" pos="3540"/>
         <wit ref="Thomas_fullFlat_C27" pos="3552"/>
         <wit ref="1823_fullFlat_C27" pos="3538"/>
         <wit ref="1831_fullFlat_C27" pos="3542"/>
         <wit ref="msColl_C27" pos="3048"/>
      </c>
      <u>
         <txt>&gt;Mr</txt>
         <wit ref="1818_fullFlat_C27" pos="3541"/>
         <wit ref="Thomas_fullFlat_C27" pos="3553"/>
         <wit ref="1823_fullFlat_C27" pos="3539"/>
         <wit ref="1831_fullFlat_C27" pos="3543"/>
      </u>
      <u>
         <txt>shi&gt;</txt>
         <wit ref="msColl_C27" pos="3049"/>
      </u>
      <c>
         <txt>. </txt>
         <wit ref="1818_fullFlat_C27" pos="3544"/>
         <wit ref="Thomas_fullFlat_C27" pos="3556"/>
         <wit ref="1823_fullFlat_C27" pos="3542"/>
         <wit ref="1831_fullFlat_C27" pos="3546"/>
         <wit ref="msColl_C27" pos="3053"/>
      </c>
      <u>
         <txt>Kirwin,</txt>
         <wit ref="1818_fullFlat_C27" pos="3546"/>
         <wit ref="Thomas_fullFlat_C27" pos="3558"/>
         <wit ref="1823_fullFlat_C27" pos="3544"/>
         <wit ref="1831_fullFlat_C27" pos="3548"/>
      </u>
      <u>
         <txt>Kirwin</txt>
         <wit ref="msColl_C27" pos="3055"/>
      </u>

ebeshero · 2022-04-24T00:02:02Z

I'm reworking this b/c I realize I don't actually care about <shi> tags in the collation (just superscripts / subscripts marked in the ms witness). Screening those out in a replacement pattern generated a new wrinkle, this time (I think) due to the order in which the replacements are made:

There's a lot being normalized away ("munched") even from the MS witness in the middle of this sequence:

      <c>
         <txt>corpse.</txt>
         <wit ref="1818_fullFlat_C27" pos="3526"/>
         <wit ref="Thomas_fullFlat_C27" pos="3538"/>
         <wit ref="1823_fullFlat_C27" pos="3524"/>
         <wit ref="1831_fullFlat_C27" pos="3528"/>
         <wit ref="msColl_C27" pos="2972"/>
      </c>
      <u>
         <txt>&lt;p/&gt; &lt;p/&gt;Mr. Kirwin, on hearing this evidence, desired that I should be taken into the room where the body lay for interment, that it might be observed what effect the sight of it would produce</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </u>
      <u>
         <txt> &amp;gt;Mrd&lt;del/&gt;</txt>
         <wit ref="msColl_C27" pos="2979"/>
      </u>
      <c>
         <txt> upon me. This idea was probably suggested by the extreme agitation I had exhibited </txt>
         <wit ref="1818_fullFlat_C27" pos="3726"/>
         <wit ref="Thomas_fullFlat_C27" pos="3738"/>
         <wit ref="1823_fullFlat_C27" pos="3724"/>
         <wit ref="1831_fullFlat_C27" pos="3728"/>
         <wit ref="msColl_C27" pos="2993"/>
      </c>

Problem passage:

      <u>
         <txt> &amp;gt;Mrd&lt;del/&gt;</txt>
         <wit ref="msColl_C27" pos="2979"/>
      </u>

ebeshero · 2022-04-24T00:02:33Z

Trying to fix this by changing the order of the replacements, to do the little <shi> adjustment before I process <del>s...

ebeshero · 2022-04-24T00:18:26Z

...And discovered the unhappy cause in the source document, a stray right angle bracket, which can throw everything off. Fortunately, we can normalize it away as a pattern, early on.

<lb n="c57-0119__main__18"/> &gt;M<shi rend="sup">r</shi>. Kirwin
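The early clean-up described above can be sketched in Python as an ordered list of regex replacements applied to the flattened witness string. This is only an illustration, not the project's actual normalization: the rule list, the `normalize` helper, and the decision to drop `<lb/>` milestones are assumptions made for the example.

```python
import re

# Ordered (pattern, replacement) pairs; earlier rules run first.
# Hypothetical rules for illustration, not the project's real list.
REPLACEMENTS = [
    (r"&gt;", ""),            # stray escaped right angle bracket in the text
    (r"</?shi\b[^>]*>", ""),  # drop <shi> start and end tags, keep their content
    (r"<lb\b[^>]*/>", ""),    # drop line-break milestones (assumed irrelevant here)
]

def normalize(text, rules=REPLACEMENTS):
    """Apply each replacement in order and return the normalized string."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

sample = '<lb n="c57-0119__main__18"/> &gt;M<shi rend="sup">r</shi>. Kirwin'
normalized = normalize(sample)
print(repr(normalized))
```

Putting the stray-bracket rule first means that the later tag rules only ever see well-formed `<...>` sequences, which is the point of normalizing it away early.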

ebeshero · 2022-04-24T00:28:27Z

Corrected! Good collation attained!

      <u>
         <txt>&lt;p/&gt; &lt;p/&gt;Mr</txt>
         <wit ref="1818_fullFlat_C27" pos="3533"/>
         <wit ref="Thomas_fullFlat_C27" pos="3545"/>
         <wit ref="1823_fullFlat_C27" pos="3531"/>
         <wit ref="1831_fullFlat_C27" pos="3535"/>
      </u>
      <u>
         <txt> Mr</txt>
         <wit ref="msColl_C27" pos="3004"/>
      </u>
      <c>
         <txt>. </txt>
         <wit ref="1818_fullFlat_C27" pos="3544"/>
         <wit ref="Thomas_fullFlat_C27" pos="3556"/>
         <wit ref="1823_fullFlat_C27" pos="3542"/>
         <wit ref="1831_fullFlat_C27" pos="3546"/>
         <wit ref="msColl_C27" pos="3007"/>
      </c>
      <u>
         <txt>Kirwin,</txt>
         <wit ref="1818_fullFlat_C27" pos="3546"/>
         <wit ref="Thomas_fullFlat_C27" pos="3558"/>
         <wit ref="1823_fullFlat_C27" pos="3544"/>
         <wit ref="1831_fullFlat_C27" pos="3548"/>
      </u>
      <u>
         <txt>Kirwin</txt>
         <wit ref="msColl_C27" pos="3009"/>
      </u>
      <c>
         <txt> on hearing this </txt>
         <wit ref="1818_fullFlat_C27" pos="3553"/>
         <wit ref="Thomas_fullFlat_C27" pos="3565"/>
         <wit ref="1823_fullFlat_C27" pos="3551"/>
         <wit ref="1831_fullFlat_C27" pos="3555"/>
         <wit ref="msColl_C27" pos="3015"/>
      </c>

ebeshero · 2022-04-24T00:30:28Z

@Arithmeticus Standing question: Can we / should we be able to define a tag: </?.+?/?> as an unbreakable token, not to be divided up? I'm watching for this as a sign of trouble...
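To make the question concrete, here is a small Python sketch contrasting a word-and-punctuation tokenizer with one that tries a whole-tag alternative first. Both patterns are hypothetical stand-ins written for this example; neither is TAN's actual token definition.

```python
import re

# Hypothetical naive definition: runs of word characters, or single punctuation marks.
NAIVE = re.compile(r"\w+|[^\w\s]")
# The same definition with a whole-tag alternative tried first.
TAG_FIRST = re.compile(r"<[^<>]+>|\w+|[^\w\s]")

sample = "<p/> <p/>Mr. Kirwin,"
naive_tokens = NAIVE.findall(sample)    # tags are split into <, p, /, >
tag_tokens = TAG_FIRST.findall(sample)  # each tag survives as one token

# A stray bracket cannot pair up into a tag, so it surfaces as its own token.
stray_tokens = TAG_FIRST.findall(' >M<shi rend="sup">r</shi>')

print(naive_tokens)
print(tag_tokens)
print(stray_tokens)
```

With the tag alternative in place, an isolated `<` or `>` token in the output is a useful warning sign that the source contains an unpaired bracket.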

Arithmeticus · 2022-04-27T18:57:08Z

@ebeshero

I am experimenting with this basic token definition, which treats serialized tags as tokens on par with "standard" word tokens:

<token-definition pattern="</?\i\c*.*?>|[\w‍-[<>]]+" flags=""/>

Is this the sort of basic raw output you're hoping for when you snap to word?

<diff xmlns="tag:textalign.net,2015:ns">
   <common>&lt;xml xml:lang="en"&gt;
   &lt;anchor type="collate" xml:id="C11"/&gt;
        </common>
   <a>&lt;milestone unit="chapter" type="start" n="5"/&gt;</a>
   <b>&lt;milestone unit="chapter" type="start" n="6"/&gt;</b>
   <common>
          </common>
   <a>&lt;head sID="novel1_letter4_chapter5_div4_div5_head1"/&gt;</a>
   <b>&lt;head sID="novel1_letter4_chapter6_div4_div6_head1"/&gt;</b>
   <common>CHAPTER </common>
   <a>V</a>
   <b>VI</b>
   <common>.</common>
   <a>&lt;head eID="novel1_letter4_chapter5_div4_div5_head1"/&gt;</a>
   <b>&lt;head eID="novel1_letter4_chapter6_div4_div6_head1"/&gt;</b>
   <common>
          </common>
   <a>&lt;p sID="novel1_letter4_chapter5_div4_div5_p1"/&gt;</a>
   <b>&lt;p sID="novel1_letter4_chapter6_div4_div6_p1"/&gt; </b>
   <common>C</common>
   <a>&lt;hi sID="novel1_letter4_chapter5_div4_div5_p1_hi1"/&gt;</a>
   <b>&lt;hi sID="novel1_letter4_chapter6_div4_div6_p1_hi1"/&gt;</b>
   <common>LERVAL</common>
   <a>&lt;hi eID="novel1_letter4_chapter5_div4_div5_p1_hi1"/&gt;</a>
   <!--Trimming next 603 nodes (deep skip)-->
</diff>

ebeshero · 2022-07-15T21:35:51Z

@Arithmeticus @Yuying-Jin Yes indeed, that's the output we were hoping for! (I am sorry I didn't reply to this when you posted, but we're returning to work on it now!)

ebeshero · 2022-07-15T21:36:57Z

@Yuying-Jin Welcome to the TAN DIFF XSLT experiment on Frankenstein! Let's see what we can do here... :-)

Arithmeticus · 2022-07-24T04:03:00Z

Just FYI, there is a global parameter, $tan:token-definition-default under parameters/param-application.xsl that will change the base default value of the definition of token. That can be configured as you like.

Because you're comparing serialized XML, that definition isn't cutting it. Strange thing is, I have run across the same need at work. I've done some experimentation, and I'm getting better results when I redefine the global parameter as follows:

<xsl:param name="tan:token-definition-default" as="element()">
      <token-definition pattern="&#x3c;[^&#x3e;]+&#x3e;|[\w&#xad;&#x200b;&#x200d;]+|[^\w&#xad;&#x200b;&#x200d;\s]" flags=""/>
</xsl:param>
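For readers who want to try the pattern outside XSLT, here is a rough Python rendering of it. Treat it as an approximation: the numeric character references are rewritten as `\u` escapes, and Python's `\w` is not identical to the `\w` of XPath regular expressions, so edge cases may tokenize differently. The `TOKEN` name and the sample string are made up for the example.

```python
import re

# Approximate Python rendering of the pattern in the parameter above.
# U+00AD soft hyphen, U+200B zero width space, U+200D zero width joiner.
TOKEN = re.compile(
    r"<[^>]+>"                       # a whole serialized tag
    r"|[\w\u00ad\u200b\u200d]+"      # a word, including invisible joiners
    r"|[^\w\u00ad\u200b\u200d\s]"    # any other single non-space character
)

# "interment" written with a soft hyphen inside it, after a serialized tag.
sample = "<del/>inter\u00adment, that"
tokens = TOKEN.findall(sample)
print(tokens)
```

The three alternatives are tried left to right, so a serialized tag is always consumed whole before the word and punctuation branches get a chance to split it. Listing the soft hyphen and zero-width characters explicitly keeps a word containing them in one piece, since they are not word characters on their own.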

ebeshero mentioned this issue Jul 16, 2022

How to deliver normalized and original tokens in XML output? #6
