@Yuying-Jin and I discovered that many collation errors are caused by a tokenization problem: tags matching `<.+?>` are getting mashed into tokens with neighboring text, and the spaces between them are removed. Adjusting the extract() function in the Python script doesn't help: we can insert spaces as we wish, but the tokens still mash markup and text together.
Example of a "smashed token":
'<add/>some', 'text'
We can add a space, but the tokenization is unaffected:
'<add/> some', 'text'
We need to determine how to change the tokenization to definitively split around XML tags.
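
One possible direction, as a minimal sketch rather than a tested fix: tokenize with a regular expression that treats each tag as its own token, so the result no longer depends on whitespace around the markup. The function name `tokenize_around_tags` and the pattern are hypothetical and not part of our script, and the sketch assumes well-formed markup in which any literal `<` in the text is escaped.

```python
import re

# Hypothetical pattern (an assumption, not taken from the existing script):
# either a whole XML tag, or a run of characters that are neither
# whitespace nor the start of a tag.
TOKEN_PATTERN = re.compile(r"<[^>]+>|[^\s<]+")


def tokenize_around_tags(text):
    """Split text into tokens, keeping every XML tag as a separate token."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    # Both inputs produce the same tokens, with or without the space.
    print(tokenize_around_tags("<add/>some text"))
    print(tokenize_around_tags("<add/> some text"))
```

How these tokens would be handed to the collation step is left open here. If the collation tool runs its own tokenizer on the extracted string, we would probably need to pass it pre-tokenized input instead of plain text.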