Problem with tokenizing of elements and text #1

ebeshero · 2022-06-21T23:17:37Z

@Yuying-Jin and I discovered that many collation errors are generated by a problem with tokenization: XML tags (anything matching <.+?>) are getting mashed into tokens with the neighboring text, and the spaces between them are removed. Adjusting the extract() function in the Python script doesn't help: we can insert spaces as we wish, but the tokens still mash markup and text together.

Example of a "smashed token":

'<add/>some', 'text'

We can add a space after the tag, but the tokenization is unaffected:

'<add/> some', 'text'

We need to determine how to change the tokenization to definitively split around XML tags.
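One possible direction, as a minimal sketch only: split the string on the tag pattern first, then split the remaining text on whitespace, so that a tag can never share a token with its neighbors. The function name `split_around_tags` is hypothetical and is not part of our script or of the collation library; it only illustrates the idea using Python's standard `re` module and the same <.+?> pattern mentioned above.

```python
import re

# Same non-greedy tag pattern as in the issue text. The capturing group makes
# re.split() keep the tags in its output. Note that "." does not match
# newlines, so a tag broken across lines would not be caught by this pattern.
TAG_PATTERN = re.compile(r'(<.+?>)')


def split_around_tags(text):
    """Return tokens in which every XML tag stands alone.

    Hypothetical helper for illustration; not the project's extract().
    """
    tokens = []
    # With one capturing group, re.split() alternates: text, tag, text, ...
    # so odd-indexed pieces are always tags.
    for index, piece in enumerate(TAG_PATTERN.split(text)):
        if index % 2 == 1:
            tokens.append(piece)          # a tag: keep it as its own token
        else:
            tokens.extend(piece.split())  # plain text: split on whitespace
    return tokens


print(split_around_tags('<add/>some text'))
# ['<add/>', 'some', 'text']
```

With this approach the result is the same whether or not a space follows the tag, which is the behavior we want. Whether it can be plugged in before the collation step depends on how the script hands tokens to the collator, which this sketch does not address.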


ebeshero · 2022-10-11T21:47:56Z

Is this resolved now with our longToken algorithm? @Yuying-Jin

ebeshero assigned ebeshero and Yuying-Jin Jun 21, 2022

ebeshero added the bug label Jun 22, 2022
