Fuzzy anchoring to compute document similarity? #115

Open
Daniel-Mietchen opened this issue Dec 2, 2015 · 1 comment

Comments

@Daniel-Mietchen

We have the https://github.com/wpoa/JATS-to-Mediawiki converter, but it sometimes fails to recognize an unusual way of tagging things, so a reference, an infobox, or something in a table may be missing from the output or otherwise deviate from the original.

It seems to me that fuzzy anchoring (as per https://hypothes.is/blog/fuzzy-anchoring/ ) has to solve a closely related problem, so I imagine that we could use it, perhaps with some tweaks, to score how well the Wikisource copy fits the PMC original. Is that assumption correct?
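
To make the idea concrete, here is a rough sketch of what such a score could look like. It does not use any Hypothesis code: Python's standard-library `difflib.SequenceMatcher` stands in for the fuzzy matcher, and the function names, the naive sentence splitting, and the 0.8 threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Collapse whitespace and lowercase so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_passages(text):
    """Naive sentence split; a real pipeline would need something sturdier."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def best_match_ratio(passage, candidates):
    """Highest similarity ratio (0..1) between one passage and any candidate."""
    p = normalize(passage)
    return max(
        (SequenceMatcher(None, p, normalize(c)).ratio() for c in candidates),
        default=0.0,
    )


def fit_score(original, copy, threshold=0.8):
    """Return (mean best-match ratio, passages of the original that found no
    sufficiently close match in the copy)."""
    copy_passages = split_passages(copy)
    scores = [(p, best_match_ratio(p, copy_passages)) for p in split_passages(original)]
    missing = [p for p, s in scores if s < threshold]
    mean = sum(s for _, s in scores) / len(scores) if scores else 1.0
    return mean, missing


if __name__ == "__main__":
    pmc = "The reference list is complete. Table two reports dosage. The infobox gives the journal name."
    wikisource = "The reference list is complete. The infobox gives the journal name."
    mean, missing = fit_score(pmc, wikisource)
    print(round(mean, 2), missing)
```

The list of unmatched passages is probably more useful than the single number, since it points at what the converter dropped.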

@nickstenning

Daniel asked me to provide a bit more context on how the anchoring code in Hypothesis works. The linked blog post provides a passable overview, but much of the detail has changed, and the code itself is a fair bit easier to understand than it was when that post was written.

The basic approach is this: for each annotation target we store several different "selectors" -- that is, serialisable data that describes the original location of the annotation in the document. We can then use these selectors either individually or in combination when reanchoring annotations to the page.

We do this through the use of a series of DOM anchoring libraries:

https://github.com/hypothesis/dom-anchor-text-quote
https://github.com/hypothesis/dom-anchor-text-position
https://github.com/hypothesis/dom-anchor-fragment

You can see how these are used to reanchor annotations in this part of our code:

https://github.com/hypothesis/h/blob/a71b6a8/h/static/scripts/annotator/anchoring/html.coffee
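
As a rough illustration of the selector idea (not the Hypothesis implementation, which works on DOM ranges rather than plain strings), the sketch below stores two selectors for a span of text and tries them in turn when reanchoring. All class and function names here are hypothetical, and the exact-match search is a simplification: a fuzzy matcher would also tolerate edits inside the quoted text itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TextPositionSelector:
    """Character offsets of the target in the document text."""
    start: int
    end: int


@dataclass
class TextQuoteSelector:
    """The target text plus a little surrounding context."""
    exact: str
    prefix: str = ""
    suffix: str = ""


def describe(text, start, end, context=16):
    """Build both selectors for text[start:end]."""
    position = TextPositionSelector(start, end)
    quote = TextQuoteSelector(
        exact=text[start:end],
        prefix=text[max(0, start - context):start],
        suffix=text[end:end + context],
    )
    return position, quote


def anchor(text, position, quote) -> Optional[Tuple[int, int]]:
    """Reanchor in (possibly changed) text; returns (start, end) or None."""
    # 1. Cheap path: the stored offsets still point at the quoted text.
    if text[position.start:position.end] == quote.exact:
        return position.start, position.end
    # 2. Fallback: search for the quote and use the stored context to pick
    #    between multiple occurrences.
    best, best_score = None, -1
    i = text.find(quote.exact)
    while i != -1:
        j = i + len(quote.exact)
        score = 0
        if quote.prefix and text[:i].endswith(quote.prefix):
            score += 1
        if quote.suffix and text[j:].startswith(quote.suffix):
            score += 1
        if score > best_score:
            best, best_score = (i, j), score
        i = text.find(quote.exact, i + 1)
    return best


if __name__ == "__main__":
    original = "The cat sat. The cat ran away."
    position, quote = describe(original, 17, 20)  # the second "cat"
    edited = "Intro. " + original  # offsets shift by 7 characters
    print(anchor(edited, position, quote))
```

Storing several selectors is what makes the approach robust: the position selector is fast while the document is unchanged, and the quote selector with its context still finds the right occurrence after the text has shifted.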

The anchoring code differs by document type. PDFs, for example, have their own anchoring code.
