NLP error in CS Gitdox #125

ctschroeder · 2019-02-12T22:39:17Z

Hi. I'm unable to run the NLP tools on "johannes.canons.FA205-252". In the first spreadsheet cell I get a message about unable to find the "norm" for something (and no other text is in the spreadsheet). When I validate the XML I get the message "No applicable XML schemas."

Not sure what's going on (NLP pipeline/API? doc too large? XML validation problem? other...?). I'd appreciate your help. Thanks so much!

ctschroeder · 2019-02-14T05:15:31Z

Hi. FYI @amir-zeldes @lgessler I also couldn't get NLP to run on a shorter document, johannes.canons.FA253-255. Either I'm doing it all wrong or something is wonky with the NLP API or Pipeline. Thanks in advance.

lgessler · 2019-02-14T05:47:44Z

Oh hey, sorry just saw this. Will take a look tomorrow

ctschroeder · 2019-02-14T06:05:25Z

No need to apologize! Thank you!

lgessler · 2019-02-14T20:48:30Z

Bug's been fixed for FA253-255, still need to find out what's going wrong with the other one.

lgessler · 2019-02-14T21:42:08Z

Actually, might have spoken too soon. @ctschroeder, when I take FA253-255 and remove all occurrences of "-", the document parses correctly. A few questions:

What are those "-" characters representing?
Have you used "-" in the past in documents that parsed successfully?
Are there other characters you could use instead of "-"?

@amir-zeldes I think the NLP engine is assuming that the "-" are segmentation markers.

ctschroeder · 2019-02-14T21:58:31Z

Ah ok. Those are from the contributing scholar, Dr. Atanassova. I will take a look and find a substitute. Thank you for figuring out the problem, and I'm sorry I didn't realize this was the hitch.

lgessler · 2019-02-14T22:01:38Z

no problem! Yeah, the engine just assumes that the meaning of "-" is always to mark a morpheme boundary, so it unfortunately can't be used to represent other things.

I'm guessing that these dashes are representing a long horizontal stroke in the original document? One of these characters (esp. en dash and em dash) might be a good substitute.
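
As a rough illustration of swapping the character, here is a minimal Python sketch that replaces "-" in the text between tags and leaves the tags themselves alone. The function name, the tag pattern, and the default substitute (en dash, U+2013) are assumptions made for this example, not part of GitDox or the NLP pipeline, and it is only safe on documents where "-" never marks a real morpheme boundary.

```python
import re

# Assumed, simplified notion of a tag: "<" or "</", a letter, then anything up to ">".
TAG = re.compile(r"(</?[A-Za-z][^<>]*>)")

def replace_hyphens(line: str, substitute: str = "\u2013") -> str:
    """Replace '-' in text content only; tags and their attributes are untouched."""
    parts = TAG.split(line)  # with a capture group, odd indexes hold the tags
    return "".join(
        part if i % 2 else part.replace("-", substitute)
        for i, part in enumerate(parts)
    )

print(replace_hyphens('<lb n="1-2">ⲁ-ⲃ</lb>'))
```

The hyphen inside the attribute value stays as it is; only the one in the text is swapped.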

lgessler · 2019-02-14T22:44:08Z

Ok, for FA205-252, on line 51 we have:

<lb>ⲧⲱⲛ_·>_ϫⲉ|ⲟⲩ|ⲣⲱ</lb>

That is not legal SGML, since > would need to be written as &gt;. This looks like a typo, though--you didn't really want a > there, right?
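
For illustration, a minimal Python sketch of the escaping: it turns a stray > (or <) in the text into &gt; (or &lt;) and leaves real tags alone. The function name and the simplified tag pattern are assumptions for this example, not how GitDox or the NLP engine handles it.

```python
import re

# Assumed, simplified notion of a tag: "<" or "</", a letter, then anything up to ">".
TAG = re.compile(r"(</?[A-Za-z][^<>]*>)")

def escape_text_brackets(line: str) -> str:
    """Escape '<' and '>' that occur in text content rather than in a tag."""
    parts = TAG.split(line)  # with a capture group, odd indexes hold the tags
    return "".join(
        part if i % 2 else part.replace("<", "&lt;").replace(">", "&gt;")
        for i, part in enumerate(parts)
    )

print(escape_text_brackets("<lb>ⲧⲱⲛ_·>_ϫⲉ|ⲟⲩ|ⲣⲱ</lb>"))
# prints <lb>ⲧⲱⲛ_·&gt;_ϫⲉ|ⲟⲩ|ⲣⲱ</lb>
```

This is a heuristic: a stray < that happens to be followed by a letter and a later > would still be read as a tag, so the result is worth checking by eye.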

ctschroeder · 2019-02-14T23:05:32Z

It probably is a typo. I’ll take a look tonight. Thanks. I did not see it coming up as an error in the code.

amir-zeldes · 2019-02-15T14:55:34Z

Hm, these are quite subtle and insidious errors. The problem is that, since we work with SGML, we can't use standard XML validators. It might be worth building a custom JS validator to point out format errors of this kind. I'll make an issue so we can think about dealing with these.
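
To make the idea concrete, here is a rough sketch of what such a check might do, written in Python rather than JS purely for illustration. The function name, the messages, and the simplified tag pattern are all hypothetical; it only flags the two problems seen in this thread (an unescaped < or > in the text, and a "-" in the text that the engine would read as a morpheme boundary).

```python
import re

# Assumed, simplified notion of a tag: "<" or "</", a letter, then anything up to ">".
TAG = re.compile(r"</?[A-Za-z][^<>]*>")

def find_format_problems(document: str) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for lines that need a second look."""
    problems = []
    for lineno, line in enumerate(document.splitlines(), start=1):
        text_only = TAG.sub("", line)  # drop the tags, keep the text content
        if "<" in text_only or ">" in text_only:
            problems.append((lineno, "unescaped '<' or '>' in text"))
        if "-" in text_only:
            problems.append((lineno, "'-' in text (read as a morpheme boundary)"))
    return problems

sample = "<lb>ⲧⲱⲛ_·>_ϫⲉ|ⲟⲩ|ⲣⲱ</lb>\n<lb>ⲁ-ⲃ</lb>\n<lb>ⲁⲃ</lb>"
for lineno, message in find_format_problems(sample):
    print(lineno, message)
```

The hyphen check is a warning rather than an error, since "-" is legitimate wherever it really does mark a morpheme boundary.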

amir-zeldes · 2019-02-15T14:59:33Z

#126

ctschroeder · 2019-02-16T00:05:08Z

Hey, so ·> is actually Dr. Atanassova's attempt to represent some punctuation in the manuscript. There is a little character that looks like a > with a dot in the middle. I can't figure out how to represent it (there doesn't seem to be a good Unicode character for this in Coptic or anywhere else). If either of you has an idea, let me know. Otherwise I will make it a raised dot and put in a note.
For the dashes I will choose one of the squiggly dashes. FYI, both of these examples come from text in a digital document contributed by a colleague (not transcribed by CS folks), which is interesting.

lgessler · 2019-02-16T00:36:06Z

You can write it as ·&gt; and it should work, so I think that's worth a try. I'll be the first to admit that it's disappointing we need to tell annotators to remember to use &gt; and &lt; for > and <, but at the moment we don't have a better alternative. I think Amir and I are going to keep thinking about solutions, even if it's just making it a validation error if either of the two errors we've mentioned in this thread occurs.

ctschroeder · 2019-02-16T00:38:07Z

This is something digitized on a totally different project, and so we need to take what we get and adjust. Thanks for this suggestion!

lgessler · 2019-02-16T00:44:07Z

Oh, that kind of CS! Ok, hm, yeah, for those texts we'll just have to hunt these special cases down, then.

ctschroeder · 2019-02-16T01:01:37Z

And also I am now noticing that not all of those dashes in the other doc were dashes in the ms. There's some decoration over the letters. Also, some of the dashes are little bracket-type decorations. I'm going to use ❮ (hex 0x276E) if no one has any objections.

ctschroeder · 2019-02-16T01:15:37Z

Ok both have gone through NLP now. Will close

ctschroeder closed this as completed Feb 16, 2019
