Controlling the tokenizing? / Order of replacements #2
Comments
I'm reworking this b/c I realize I don't actually care about There's a lot being normalized away ("munched") even from the MS witness in the middle of this sequence: <c>
<txt>corpse.</txt>
<wit ref="1818_fullFlat_C27" pos="3526"/>
<wit ref="Thomas_fullFlat_C27" pos="3538"/>
<wit ref="1823_fullFlat_C27" pos="3524"/>
<wit ref="1831_fullFlat_C27" pos="3528"/>
<wit ref="msColl_C27" pos="2972"/>
</c>
<u>
<txt><p/> <p/>Mr. Kirwin, on hearing this evidence, desired that I should be taken into the room where the body lay for interment, that it might be observed what effect the sight of it would produce</txt>
<wit ref="1818_fullFlat_C27" pos="3533"/>
<wit ref="Thomas_fullFlat_C27" pos="3545"/>
<wit ref="1823_fullFlat_C27" pos="3531"/>
<wit ref="1831_fullFlat_C27" pos="3535"/>
</u>
<u>
<txt> &gt;Mrd<del/></txt>
<wit ref="msColl_C27" pos="2979"/>
</u>
<c>
<txt> upon me. This idea was probably suggested by the extreme agitation I had exhibited </txt>
<wit ref="1818_fullFlat_C27" pos="3726"/>
<wit ref="Thomas_fullFlat_C27" pos="3738"/>
<wit ref="1823_fullFlat_C27" pos="3724"/>
<wit ref="1831_fullFlat_C27" pos="3728"/>
<wit ref="msColl_C27" pos="2993"/>
</c> Problem passage: <u>
<txt> &gt;Mrd<del/></txt>
<wit ref="msColl_C27" pos="2979"/>
</u> |
Trying to fix this by changing the order of the replacements, to do the little |
...And discovered the unhappy cause in the source document: a stray right angle bracket, which can throw everything off. Fortunately, we can normalize it away as a pattern, early on:
<lb n="c57-0119__main__18"/> >M<shi rend="sup">r</shi>. Kirwin |
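For anyone following along outside XSLT, here is a rough Python sketch of the "normalize it away as a pattern, early on" idea. It is only an illustration, not the replacement actually used in this pipeline: the helper name `drop_stray_gt` is hypothetical, and it assumes the witness is handled as a raw serialized string in which no attribute value contains `<` or `>`.

```python
import re

# Hypothetical helper, not part of TAN Diff or the collation scripts.
# The pattern matches either a complete serialized tag (captured in group 1)
# or a lone '>' that is sitting in text content.
_TAG_OR_STRAY_GT = re.compile(r"(<[^<>]*>)|>")


def drop_stray_gt(serialized: str) -> str:
    """Remove '>' characters that do not close a tag; keep real tags as-is."""
    # If group 1 matched, we are looking at a tag and put it back unchanged.
    # Otherwise the match is a stray '>' and is replaced with nothing.
    return _TAG_OR_STRAY_GT.sub(lambda m: m.group(1) or "", serialized)


line = '<lb n="c57-0119__main__18"/> >M<shi rend="sup">r</shi>. Kirwin'
print(drop_stray_gt(line))
# prints: <lb n="c57-0119__main__18"/> M<shi rend="sup">r</shi>. Kirwin
```

Running a step like this before any tag-level normalization means later patterns never see the unbalanced bracket, which is presumably why the order of replacements matters here.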
Corrected! Good collation attained! <u>
<txt><p/> <p/>Mr</txt>
<wit ref="1818_fullFlat_C27" pos="3533"/>
<wit ref="Thomas_fullFlat_C27" pos="3545"/>
<wit ref="1823_fullFlat_C27" pos="3531"/>
<wit ref="1831_fullFlat_C27" pos="3535"/>
</u>
<u>
<txt> Mr</txt>
<wit ref="msColl_C27" pos="3004"/>
</u>
<c>
<txt>. </txt>
<wit ref="1818_fullFlat_C27" pos="3544"/>
<wit ref="Thomas_fullFlat_C27" pos="3556"/>
<wit ref="1823_fullFlat_C27" pos="3542"/>
<wit ref="1831_fullFlat_C27" pos="3546"/>
<wit ref="msColl_C27" pos="3007"/>
</c>
<u>
<txt>Kirwin,</txt>
<wit ref="1818_fullFlat_C27" pos="3546"/>
<wit ref="Thomas_fullFlat_C27" pos="3558"/>
<wit ref="1823_fullFlat_C27" pos="3544"/>
<wit ref="1831_fullFlat_C27" pos="3548"/>
</u>
<u>
<txt>Kirwin</txt>
<wit ref="msColl_C27" pos="3009"/>
</u>
<c>
<txt> on hearing this </txt>
<wit ref="1818_fullFlat_C27" pos="3553"/>
<wit ref="Thomas_fullFlat_C27" pos="3565"/>
<wit ref="1823_fullFlat_C27" pos="3551"/>
<wit ref="1831_fullFlat_C27" pos="3555"/>
<wit ref="msColl_C27" pos="3015"/>
</c>
|
@Arithmeticus Standing question: Can we / should we be able to define a tag: |
I am experimenting with this basic token definition, which treats serialized tags as tokens on par with "standard" word tokens:
Is this the sort of basic raw output you're hoping for, when you snap to word?:
|
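To make the idea of "serialized tags as tokens on par with word tokens" concrete, here is a small Python sketch. The regular expression is an assumption made for illustration, not the token definition discussed in this thread, and `tokenize` is a hypothetical name. It treats each whole tag as one token, each run of word characters as one token, and each remaining punctuation character as its own token.

```python
import re

# Assumed token pattern (illustrative only):
#   <[^<>]+>   a whole serialized tag such as <p/> or <del/>
#   \w+        a run of word characters
#   [^\w\s]    any single non-space, non-word character (punctuation)
TOKEN = re.compile(r"<[^<>]+>|\w+|[^\w\s]")


def tokenize(serialized: str) -> list[str]:
    """Split serialized XML text into tag, word, and punctuation tokens."""
    return TOKEN.findall(serialized)


print(tokenize("<p/> <p/>Mr. Kirwin, on hearing"))
# prints: ['<p/>', '<p/>', 'Mr', '.', 'Kirwin', ',', 'on', 'hearing']
```

Because the tag alternative comes first, a tag is never broken into `<`, `p`, `/`, `>` pieces, so a diff that snaps to these tokens should align `<p/>` as a unit rather than character by character.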
@Arithmeticus @Yuying-Jin Yes indeed, that's the output we were hoping for! (I am sorry I didn't reply to this when you posted, but we're returning to work on it now!) |
@Yuying-Jin Welcome to the TAN DIFF XSLT experiment on Frankenstein! Let's see what we can do here... :-) |
Just FYI, there is a global parameter, Because you're comparing serialized XML, that definition isn't cutting it. Strange thing is, I have run across the same need at work. I've done some experimentation, and I'm getting better results when I redefine the global parameter as follows:
|
@Arithmeticus Can you offer a little guidance on how to control the tokenization?
Context:
I'm getting some problematic collation output when I'm collating with tags. Here's a sample around a simple name with some bumpy normalized tags in my manuscript witness. We begin with a <p/> marker in all but the MS witness, which has some highlighting going on.