NLP error in CS Gitdox #125
Comments
Hi. FYI @amir-zeldes @lgessler I also couldn't get NLP to run on a shorter document, johannes.canons.FA253-255. Either I'm doing it all wrong or something is wonky with the NLP API or Pipeline. Thanks in advance.
Oh hey, sorry just saw this. Will take a look tomorrow.
No need to apologize! Thank you!
Bug's been fixed for FA253-255, still need to find out what's going wrong with the other one.
Actually, might have spoken too soon. @ctschroeder, when I take FA253-255 and remove all occurrences of "-", the document parses correctly. A few questions:
@amir-zeldes I think the NLP engine is assuming that the "-" are segmentation markers.
Ah ok. Those are from the contributing scholar, Dr. Atanassova. I will take a look and find a substitute. Thank you for figuring out the problem and I’m sorry I didn’t realize this was the hitch.
No problem! Yeah, the engine just assumes that "-" always marks a morpheme boundary, so it unfortunately can't be used to represent anything else. I'm guessing that these dashes represent a long horizontal stroke in the original document? One of these characters (esp. en dash and em dash) might be a good substitute.
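For hunting down the hyphens before choosing a substitute, here is a minimal sketch in Python. It is not part of GitDox or the NLP pipeline; the function name `find_hyphens` is hypothetical, and it assumes tags never span more than one line. It only reports where an ASCII "-" appears outside of tags, so each one can be reviewed by hand rather than replaced blindly.

```python
import re

# Matches a complete tag on a single line (assumption: tags do not span lines).
TAG = re.compile(r"<[^<>]*>")

def find_hyphens(text):
    """Return (line_number, column, context) for each ASCII hyphen outside tags."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Blank out tags with spaces so column offsets still match the original line.
        masked = TAG.sub(lambda m: " " * len(m.group()), line)
        for match in re.finditer("-", masked):
            start = match.start()
            context = line[max(0, start - 5): start + 6]
            hits.append((line_no, start + 1, context))
    return hits

if __name__ == "__main__":
    sample = '<pb xml_id="FA-253">\nⲁⲃ-ⲅ\n'
    for line_no, col, context in find_hyphens(sample):
        print(f"line {line_no}, col {col}: {context}")
```

Hyphens inside attribute values (like the one in the sample's `xml_id`) are skipped, since the engine's morpheme-boundary assumption concerns the text itself.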
Ok, for FA205-252, on line 51 we have:
That is not legal SGML, since
It probably is a typo. I’ll take a look tonight. Thanks. I did not see it coming up as an error in the code.
Hm, these are quite subtle and insidious errors. The problem is that, since we work with SGML, we can't use standard XML validators. It might be worth building a custom JS validator to point out format errors of this kind. I'll make an issue so we can think about dealing with these.
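As a rough illustration of what such a check could look like, here is a sketch in Python rather than JS. The function name `find_stray_brackets` is hypothetical, and the sketch assumes that every tag opens and closes on the same line and that attribute values never contain "<" or ">". It flags angle brackets that do not pair up into a tag, such as a lone ">" in the running text.

```python
def find_stray_brackets(text):
    """Return (line_number, column, message) for angle brackets that do not form a tag."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        in_tag = False
        for col, ch in enumerate(line, start=1):
            if ch == "<":
                if in_tag:
                    # A second '<' before the previous tag was closed.
                    problems.append((line_no, col, "'<' inside an open tag"))
                in_tag = True
            elif ch == ">":
                if not in_tag:
                    # A '>' in running text, with no tag open.
                    problems.append((line_no, col, "'>' with no opening '<'"))
                in_tag = False
        if in_tag:
            problems.append((line_no, len(line) + 1, "tag not closed on this line"))
    return problems

if __name__ == "__main__":
    sample = '<hi rend="x">ⲁⲃ·>\n<lb n="1">\n'
    for line_no, col, message in find_stray_brackets(sample):
        print(f"line {line_no}, col {col}: {message}")
```

This is a character-level scan, not an SGML parser, so it will not catch mismatched element names or bad nesting; it is only meant to surface the kind of stray bracket discussed here.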
Hey, so ·> is actually Dr. Atanassova's attempt to represent some punctuation in the manuscript. There is a little character that looks like a > with a dot in the middle. I can't figure out how to represent it (there doesn't seem to be a good Unicode character for this in Coptic or anywhere else). If either of you has an idea, let me know. Otherwise I will make it a raised dot and put in a note.
You can write it as
This is something digitized on a totally different project, and so we need to take what we get and adjust. Thanks for this suggestion!
Oh, that kind of CS! Ok, hm, yeah, for those texts we'll just have to hunt these special cases down, then.
And also I am now noticing that not all of those dashes in the other doc were dashes in the ms. There's some decoration over the letters. Also, some of the dashes are little bracket-type decorations. I'm going to use ❮ (U+276E) if no one has any objections.
Ok both have gone through NLP now. Will close |
Hi. I'm unable to run the NLP tools on "johannes.canons.FA205-252". In the first spreadsheet cell I get a message about being unable to find the "norm" for something (and no other text is in the spreadsheet). When I validate the XML I get the message "No applicable XML schemas."
Not sure what's going on (NLP pipeline/API? doc too large? XML validation problem? other...?). I'd appreciate your help. Thanks so much!