-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ISLANDORA-2297 Book reader malformed search result preview on pin hover #127
base: 7.x
Are you sure you want to change the base?
Conversation
Strip HTML tags before removing non-alphanumeric characters.
Thanks @alxp! This looks good to me, but per Islandora rules I can't approve since we work together. |
@rosiel you can approve. You just can't merge. |
@alxp Can you add some testing instructions? I'd like to try this out, but your PR doesn't have any of the template changed/filled in. |
@alxp Just a quick prod since we're hitting code freeze... If you can provide some testing instructions, I can run through this quick and perhaps get it approved before the deadline. |
when this pr was changed to use the right branch, the description in the template of the pull request were lost. They are here: #126 |
Thanks, @rosiel. For reviewers/testers then, here's the detail from the original PR: JIRA Ticket: https://jira.duraspace.org/browse/ISLANDORA-2297 ** What does this Pull Request do? ** Strip out HTML tags from search result previews that show when you hover over a pin in the book reader. This solves an issue where HTML tag names were appearing in the search highlights. I decided to strip out tags because keeping the wanted tags while stripping out any characters that might break layout seems unnecessarily complicated. ** What's new? ** ** How should this be tested? ** On a working instance of the Internet Archive Book Reader with a book that has an OCR text stream, and searching working correctly, the problem will be reproducible if Solr sends back tags to surround the searched-for keyword. E.g., If you search for "mountain" the hover text over a search result pin will show "enmountainem". |
@rosiel @bondjimbond max version of jquery is 1.7 for display it. In jQuery Update at http://localhost:8000/admin/config/development/jquery_update it is version 1.10. |
Is |
It should be "emmountainem" i.e., the tag coming from the search engine to show highlights. |
I'm finally catching up on this. Had to set JQuery to 1.7 -- yes, we need a ticket for that. But @alxp I'm not reproducing the issue. My search results contain no tags (7.x-1.12 release candidate machine). I even added HTML tags to the OCR datastream (added Can you provide clearer testing instructions? What exactly do you have in the OCR that makes this issue appear? |
We provide different wrapping tokens that Solr should use, instead of the default ... Alternatively... looks like it might use |
I found this is SolrConfig.xml <!-- Configure the standard formatter -->
<formatter name="html"
default="true"
class="solr.highlight.HtmlFormatter">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter> This is what would be adding the tags on the results, this looks like a default setting so while we can remove it might still be a good idea to not have the UI render HTML tags if it is going to show them as tags, regardless of how a user's Sole result formatting is configured. |
I'm... not sure it makes sense, to try to strip markup there: anything in the results that is XML-like should be escaped already (otherwise, you could end up with snippets starting in the middle of an element, and breaking the entire markup of the page, anywhere you attempted to use the snippet)... seems like the only place unescaped markup could come from in snippets would be from these points of configuration... as in: The entire objective of these points of configuration is to add these pre-/suffix bits, for inclusion in the page... ... if it is indeed just because you're not using the "FastVectorHighlighter" (or rather, are using the "Original Highlighter"), then yeah... we just need to make it so it sets |
@adam-vessey I'm trying to follow developments here... It sounds like you might have found an issue with the particular Solr config being used that's causing the tags to appear in the first place -- is that right? Given that the issue is not reproducible unless you have that particular Solr configuration, should we close this pull? |
@bondjimbond: I've not gone so far as to reproduce it (having no content on hand at the moment to test with), I believe that if you disable the "Enable Solr Fast Vector Highlighting" bit at ... The fix would be changing the existing code to be something like: $component = variable_get('islandora_ocr_solr_hocr_highlighting_use_fast', TRUE) ?
'tag' :
'simple';
$results = islandora_paged_content_perform_solr_highlighting_query($term, array(
'fq' => array(format_string('!field:("info:fedora/!value" OR "!value")', array(
'!field' => variable_get('islandora_internet_archive_bookreader_ocr_filter_field', 'RELS_EXT_isMemberOf_uri_ms'),
'!value' => $object_id,
))),
"hl.$component.pre" => '{{{',
"hl.$component.post" => '}}}',
'defType' => 'dismax',
)); Would want to double-check that the core Islandora Solr doesn't stomp on our "simple" values... but given that it only adds if they're not already there, then... it's probably fine in that respect? |
THANK YOU, that did the job. I can now see the "em" attached to the start and end of the term. So the issue arises only when you have turned off Solr Fast Vector Highlighting. This pull resolves the problem, then... but @adam-vessey - are you suggesting that it should take a different approach? |
@bondjimbond: Yes, a different approach, with the goal of making the snippet match delimiters consistent independent of when "Enable Solr Fast Vector Highlighting" is enabled or disabled... as in: When it is enabled, what does the markup for it look like? Do we actually display the braces surrounding the matches? Are they replaced with markup somewhere else? |
@bondjimbond: Naive search (may also be stuff happening elsewhere), but appears to be relevant, replacing the |
@adam-vessey @bondjimbond It is shown em after disabling Solr Fast Vector Highlighting, try the PR and em gone... |
@bondjimbond Also on line https://github.com/Islandora/internet_archive_bookreader/blob/e645cd172c983b453f6ebcd38901cd7d1f1290a3/BookReader/BookReader.js#L3483 |
@DigitLib A space would certainly be nice. |
@bondjimbond it is easy to do I can pull PR for that? also can add ... before and after search result? |
@alxp @rosiel Thoughts on @adam-vessey's comments? |
@rosiel @alxp Just a prod... Do you have any thoughts on the comments from @adam-vessey? I missed the last Committers' Call, so I don't know if it was discussed there. |
Yes, it's reproducible if you set jquery update to 1.7 and disable fastVector highlighting in the OCR tool. Based on the Solr Highlighting documentation, fastVector highlighting is one method of doing highlighting, of four available (in solr 7 at least). Maybe when this module was written it was considered good, but now "in order of general recommendation" it ranks 3rd of 4. So turning off "fastVector highlighting" seems to fall back on the Original highlighter, which is 'ranked' second. Highlighters have (slightly) different requirements and different arguments. I don't know why we default to using fastVector, or if it gives us other useful features. But when using the Original highlighter, the Maybe instead of stripping tags, along with
That way we would get highlighted results both with fastVector checkbox enabled and without. |
Updated to point against 7.x branch
Strip HTML tags before removing non-alphanumeric characters.
JIRA Ticket: (https://jira.duraspace.org/browse/ISLANDORA-2297)
What does this Pull Request do?
A brief description of what the intended result of the PR will be and/or what problem it solves.
What's new?
A in-depth description of the changes made by this PR. Technical details and possible side effects.
Example:
How should this be tested?
A description of what steps someone could take to:
Additional Notes:
Any additional information that you think would be helpful when reviewing this PR.
Example:
Interested parties
Tag (@ mention) interested parties or, if unsure, @Islandora/7-x-1-x-committers