NLP error in CS Gitdox #125

Closed
ctschroeder opened this issue Feb 12, 2019 · 17 comments

Comments

@ctschroeder
Collaborator

Hi. I'm unable to run the NLP tools on "johannes.canons.FA205-252". In the first spreadsheet cell I get a message about being unable to find the "norm" for something (and there is no other text in the spreadsheet). When I validate the XML I get the message "No applicable XML schemas."

Not sure what's going on (NLP pipeline/API? doc too large? XML validation problem? other...?). I'd appreciate your help. Thanks so much!

@ctschroeder
Collaborator Author

Hi. FYI @amir-zeldes @lgessler I also couldn't get NLP to run on a shorter document, johannes.canons.FA253-255. Either I'm doing it all wrong or something is wonky with the NLP API or Pipeline. Thanks in advance.

@lgessler
Collaborator

Oh hey, sorry just saw this. Will take a look tomorrow

@ctschroeder
Collaborator Author

No need to apologize! Thank you!

@lgessler
Collaborator

Bug's been fixed for FA253-255, still need to find out what's going wrong with the other one.

@lgessler
Collaborator

Actually, might have spoken too soon. @ctschroeder, when I take FA253-255 and remove all occurrences of "-", the document parses correctly. A few questions:

  1. What are those "-" characters representing?
  2. Have you used "-" in the past in documents that parsed successfully?
  3. Are there other characters you could use instead of "-"?

@amir-zeldes I think the NLP engine is assuming that the "-" are segmentation markers.

@ctschroeder
Collaborator Author

Ah ok. Those are from the contributing scholar Dr. Atanassova. I will take a look and find a substitute. Thank you for figuring out the problem and I’m sorry I didn’t realize this was the hitch.

@lgessler
Collaborator

no problem! Yeah, the engine just assumes that the meaning of "-" is always to mark a morpheme boundary, so it unfortunately can't be used to represent other things.

I'm guessing that these dashes are representing a long horizontal stroke in the original document? One of these characters (esp. en dash and em dash) might be a good substitute.
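
To make the hyphen issue concrete, here is a minimal Python sketch that lists ASCII hyphens found outside tags and can swap them for another dash. The function names, the crude tag regex, and the choice of en dash (U+2013) are assumptions for illustration; this is not the NLP engine's code. It also can't tell an intended morpheme boundary from a decorative stroke, so any hits would still need a human to review them.

```python
import re

HYPHEN = "-"        # ASCII hyphen-minus, which the engine reads as a morpheme boundary
EN_DASH = "\u2013"  # assumed substitute; other non-hyphen dashes would work the same way

# Crude tag matcher (assumption): good enough for simple lines like <lb>...</lb>.
TAG = re.compile(r"<[^<>]*>")


def text_only(line: str) -> str:
    """Blank out tags with spaces so the result has the same length as the input."""
    return TAG.sub(lambda m: " " * len(m.group()), line)


def find_hyphens(doc: str) -> list[tuple[int, int]]:
    """Return 1-based (line, column) pairs for each hyphen outside a tag."""
    hits = []
    for lineno, line in enumerate(doc.splitlines(), start=1):
        for col, ch in enumerate(text_only(line), start=1):
            if ch == HYPHEN:
                hits.append((lineno, col))
    return hits


def replace_hyphens(doc: str, substitute: str = EN_DASH) -> str:
    """Replace hyphens outside tags; hyphens inside tags are left untouched."""
    out = []
    for line in doc.splitlines():
        mask = text_only(line)
        out.append("".join(substitute if m == HYPHEN else c for c, m in zip(line, mask)))
    return "\n".join(out)
```

Running `find_hyphens` first and only then `replace_hyphens` keeps the review step separate from the rewrite, which matters when some hyphens in a document really are segmentation markers.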

@lgessler
Collaborator

Ok, for FA205-252, on line 51 we have:

<lb>ⲧⲱⲛ_·>_ϫⲉ|ⲟⲩ|ⲣⲱ</lb>

That is not legal SGML, since > would need to be &gt;. This looks like a typo, though--you didn't really want a > there, right?
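
A stray bracket like this can be hunted down mechanically. The following is a rough Python sketch, assuming tags are simple and contain no angle brackets of their own; `stray_brackets` is a hypothetical helper, not part of Gitdox or the NLP pipeline.

```python
import re

# Assumption: tags are simple and well formed, with no < or > inside them.
TAG = re.compile(r"<[^<>]*>")


def stray_brackets(doc: str) -> list[tuple[int, int, str]]:
    """Return 1-based (line, column, char) for every < or > that is not part of a tag."""
    problems = []
    for lineno, line in enumerate(doc.splitlines(), start=1):
        # Blank out complete tags, keeping the length so column numbers stay correct.
        masked = TAG.sub(lambda m: " " * len(m.group()), line)
        for col, ch in enumerate(masked, start=1):
            if ch in "<>":
                problems.append((lineno, col, ch))
    return problems


# The line quoted above, checked on its own (so it is reported as line 1 here).
line_51 = "<lb>ⲧⲱⲛ_·>_ϫⲉ|ⲟⲩ|ⲣⲱ</lb>"
for lineno, col, ch in stray_brackets(line_51):
    print(f"line {lineno}, column {col}: unescaped {ch!r}")
```

One known blind spot: a stray `<` followed later on the same line by a stray `>` would look like a tag to this regex and go unreported.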

@ctschroeder
Collaborator Author

ctschroeder commented Feb 14, 2019 via email

@amir-zeldes
Contributor

Hm, these are quite subtle and insidious errors. The problem is that, since we work with SGML, we can't use standard XML validators. It might be worth building a custom JS validator to point out format errors of this kind. I'll make an issue so we can think about dealing with these.

@amir-zeldes
Contributor

#126

@ctschroeder
Collaborator Author

Hey, so ·> is actually Dr. Atanassova's attempt to represent some punctuation in the manuscript. There is a little character that looks like a > with a dot in the middle. I can't figure out how to represent it (there doesn't seem to be a good Unicode character for this in Coptic or anywhere else). If either of you has an idea, let me know. Otherwise I will make it a raised dot and put in a note.
For the dashes I will choose one of the squiggly dashes. FYI, both of these examples come from text in a digital document contributed by a colleague (not transcribed by CS folks), which is interesting.

@lgessler
Collaborator

You can write it as ·&gt; and it should work, so I think that's worth a try. I'll be the first to admit that it's disappointing we need to tell annotators to remember to use &gt; and &lt; for > and <, but at the moment we don't have a better alternative. I think Amir and I are going to keep thinking about solutions, even if it's just making it a validation error if either of the two errors we've mentioned in this thread occur.
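
For texts that arrive with many such characters, the escaping could be done in bulk rather than by hand. This is only a sketch in Python under the assumption that tags are simple and never contain angle brackets; `escape_text_brackets` is a hypothetical name and nothing like it is confirmed to exist in Gitdox.

```python
import re

# Capturing group so that re.split keeps the tags in its result.
# Assumption: tags are simple and contain no < or > of their own.
TAG = re.compile(r"(<[^<>]*>)")


def escape_text_brackets(line: str) -> str:
    """Replace < and > with &lt; and &gt; in text content, leaving tags alone."""
    parts = TAG.split(line)
    # With one capturing group, odd indices hold the tags and even indices the text.
    return "".join(
        part if i % 2 else part.replace("<", "&lt;").replace(">", "&gt;")
        for i, part in enumerate(parts)
    )


print(escape_text_brackets("<lb>ⲧⲱⲛ_·>_ϫⲉ|ⲟⲩ|ⲣⲱ</lb>"))
```

The ampersand is deliberately left alone so that text which already contains `&gt;` is not escaped a second time; running the function twice gives the same result as running it once.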

@ctschroeder
Collaborator Author

This is something digitized on a totally different project, and so we need to take what we get and adjust. Thanks for this suggestion!

@lgessler
Collaborator

Oh, that kind of CS! Ok, hm, yeah, for those texts we'll just have to hunt these special cases down, then.

@ctschroeder
Collaborator Author

Also, I am now noticing that not all of those dashes in the other doc were dashes in the ms. There's some decoration over the letters, and some of the dashes are little bracket-type decorations. I'm going to use ❮ (hex 0x276E) if no one has any objections.
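
A quick way to double-check a candidate substitute is to look up its code point and name, and to confirm it is not one of the characters that caused trouble in this thread. A small Python sketch; the `RESERVED` set is an assumption that only collects the characters discussed above, not a list taken from the NLP engine.

```python
import unicodedata

candidate = chr(0x276E)

# Assumption: just the characters this thread found problematic in text content.
RESERVED = {"-", "<", ">"}

print(f"U+{ord(candidate):04X} {unicodedata.name(candidate)}")
print("clashes with a reserved character" if candidate in RESERVED else "no clash")
```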

@ctschroeder
Collaborator Author

Ok both have gone through NLP now. Will close
