Fixes concerning MUC6 and MUC7 coreference resolution input files #114

Closed
wants to merge 6 commits into from

@peschue peschue commented Dec 16, 2015

These changes make it possible to use dcoref with MUC6 and MUC7 files for coreference resolution as distributed by LDC. The existing code had problems with both files.

(I declare that this contribution is in the public domain.)

in MUC7 as distributed by LDC, <p> are not ended with </p>
instead, a new <p> starts
…GML tokens)
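
A minimal sketch of the `<p>` repair described above, not the code from this pull request: it inserts a missing `</p>` before each new `<p>`. The function name `close_open_paragraphs` is hypothetical, and the sketch assumes the input is the body of a single document's text section, so that closing the last paragraph at the end of the string is acceptable.

```python
import re

# Matches opening and closing paragraph tags, case-insensitively.
P_TAG = re.compile(r"(</?p>)", re.IGNORECASE)

def close_open_paragraphs(sgml: str) -> str:
    """Insert a missing </p> before each <p> that starts while one is open."""
    out = []
    paragraph_open = False
    # re.split with a capturing group keeps the tags as separate tokens.
    for token in P_TAG.split(sgml):
        lowered = token.lower()
        if lowered == "<p>":
            if paragraph_open:
                out.append("</p>")  # close the previous, unterminated paragraph
            paragraph_open = True
        elif lowered == "</p>":
            paragraph_open = False
        out.append(token)
    if paragraph_open:
        out.append("</p>")  # assumption: input ends where the last paragraph ends
    return "".join(out)
```

Input that is already well formed passes through unchanged, so the same step could be applied to both MUC6 and MUC7 files.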

some <COREF > tokens contain " escaped with \ instead of &quot;
therefore the tokenizer produces unexpected output
(e.g., "<" becomes a separate word)
some annotations in MUC6 as distributed by LDC contain swapped IDs
(REF and ID are swapped)
…ng mentions)
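
For the backslash-escaped quotes in `<COREF>` tags mentioned above, one possible normalization step is sketched here. It is an illustration under assumptions, not the change made in this pull request: the name `normalize_escaped_quotes` is hypothetical, and the sketch assumes that attribute values never contain a literal `>`.

```python
import re

# Opening COREF tags only; assumes no literal '>' inside attribute values.
COREF_OPEN_TAG = re.compile(r"<COREF\b[^>]*>")

def normalize_escaped_quotes(sgml: str) -> str:
    """Replace \\" with &quot; inside opening <COREF ...> tags only."""
    def fix(match: re.Match) -> str:
        return match.group(0).replace('\\"', "&quot;")
    return COREF_OPEN_TAG.sub(fix, sgml)
```

Restricting the replacement to the tag itself leaves backslashes in the running text untouched, so the tokenizer would see the same words as before.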

removes pointers to non-existing mentions
(which would cause crash of dcoref)
this circumvents a crash with MUC6 dryrun data with a sentence
consisting of only spaces.
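
The last two fixes above can be sketched as simple filters. This is a hedged illustration, not dcoref's actual data model: it assumes mentions are held as a dictionary from mention ID to antecedent ID (or `None`), and both function names are hypothetical.

```python
from typing import Dict, List, Optional

def drop_dangling_refs(mentions: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Clear every REF that points to a mention ID that does not exist."""
    return {
        mention_id: (ref if ref in mentions else None)
        for mention_id, ref in mentions.items()
    }

def drop_blank_sentences(sentences: List[str]) -> List[str]:
    """Discard sentences that are empty or consist only of whitespace."""
    return [sentence for sentence in sentences if sentence.strip()]
```

For example, `drop_dangling_refs({"1": None, "2": "1", "3": "9"})` keeps the link from `2` to `1` and clears the link from `3`, because no mention `9` exists.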

peschue commented Dec 23, 2015

I found that these fixes circumvent the crashes, but some results are still unexpected. I will investigate and open another pull request once I am confident that everything is really fixed with respect to MUC6/MUC7.

@peschue peschue closed this Dec 23, 2015