Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISLANDORA-2297 Book reader malformed search result preview on pin hover #127

Open
wants to merge 1 commit into
base: 7.x
Choose a base branch
from

Conversation

alxp
Copy link

@alxp alxp commented Aug 29, 2018

Updated to point against 7.x branch

Strip HTML tags before removing non-alphanumeric characters.

JIRA Ticket: (https://jira.duraspace.org/browse/ISLANDORA-2297)

  • Other Relevant Links (Google Groups discussion, related pull requests, Release pull requests, etc.)

What does this Pull Request do?

A brief description of what the intended result of the PR will be and/or what problem it solves.

What's new?

A in-depth description of the changes made by this PR. Technical details and possible side effects.

Example:

  • Changes x feature to such that y
  • Added x
  • Removed y

How should this be tested?

A description of what steps someone could take to:

  • Reproduce the problem you are fixing (if applicable)
  • Test that the Pull Request does what is intended.
  • Please be as detailed as possible.
  • Good testing instructions help get your PR completed faster.

Additional Notes:

Any additional information that you think would be helpful when reviewing this PR.

Example:

  • Does this change require documentation to be updated?
  • Does this change add any new dependencies?
  • Does this change require any other modifications to be made to the repository (ie. Regeneration activity, etc.)?
  • Could this change impact execution of existing code?

Interested parties

Tag (@ mention) interested parties or, if unsure, @Islandora/7-x-1-x-committers

Strip HTML tags before removing non-alphanumeric characters.
@rosiel
Copy link
Member

rosiel commented Aug 29, 2018

Thanks @alxp! This looks good to me, but per Islandora rules I can't approve since we work together.

@DiegoPino
Copy link

@rosiel you can approve. You just can't merge.

@bondjimbond
Copy link

@alxp Can you add some testing instructions? I'd like to try this out, but your PR doesn't have any of the template changed/filled in.

@bondjimbond
Copy link

@alxp Just a quick prod since we're hitting code freeze... If you can provide some testing instructions, I can run through this quick and perhaps get it approved before the deadline.

@rosiel
Copy link
Member

rosiel commented Oct 25, 2018

when this pr was changed to use the right branch, the description in the template of the pull request were lost. They are here: #126

@bondjimbond
Copy link

Thanks, @rosiel. For reviewers/testers then, here's the detail from the original PR:

JIRA Ticket: https://jira.duraspace.org/browse/ISLANDORA-2297

** What does this Pull Request do? **

Strip out HTML tags from search result previews that show when you hover over a pin in the book reader. This solves an issue where HTML tag names were appearing in the search highlights.

I decided to strip out tags because keeping the wanted tags while stripping out any characters that might break layout seems unnecessarily complicated.

** What's new? **
Added a call to the PHP strip_tags() function.

** How should this be tested? **

On a working instance of the Internet Archive Book Reader with a book that has an OCR text stream, and searching working correctly, the problem will be reproducible if Solr sends back tags to surround the searched-for keyword.

E.g., If you search for "mountain" the hover text over a search result pin will show "enmountainem".

@bondjimbond
Copy link

@rosiel @alxp Trying to test this, but the hover text in my Book Reader (7.x-1.12 VM) just says "Search Result". Any special configuration I need to make it display the key words?

@DigitLib
Copy link

DigitLib commented Nov 4, 2018

@rosiel @bondjimbond max version of jquery is 1.7 for display it. In jQuery Update at http://localhost:8000/admin/config/development/jquery_update it is version 1.10.
@DiegoPino @rosiel Should we put this in JIRA?
Tested with changed file and show normal text, also show with unchanged show normal searched word

@willtp87
Copy link
Member

willtp87 commented Nov 8, 2018

Is "enmountainem" a typo or a case of mismatched tags?

@alxp
Copy link
Author

alxp commented Nov 8, 2018

It should be "emmountainem" i.e., the tag coming from the search engine to show highlights.

@bondjimbond
Copy link

I'm finally catching up on this. Had to set JQuery to 1.7 -- yes, we need a ticket for that.

But @alxp I'm not reproducing the issue. My search results contain no tags (7.x-1.12 release candidate machine). I even added HTML tags to the OCR datastream (added <b></b> around a term and searched for it); the search preview strips it.

Can you provide clearer testing instructions? What exactly do you have in the OCR that makes this issue appear?

@adam-vessey
Copy link

adam-vessey commented Nov 20, 2018

We provide different wrapping tokens that Solr should use, instead of the default <em>/</em> tag business... is hl.tag.pre/hl.tag.post set invariant/appended in your Solr request handlers?

... Alternatively... looks like it might use hl.simple.pre/hl.simple.post when not using the "FastVectorHighlighter" business... Possibly just have change which pre/post business we do, conditional on whether or not we're using "FastVectorHightlighter"?... though looks like the Islandora Solr core code itself may set these to other tags... might be an edge case left uncovered, but yeah... more info particular to configurations would help track it down, I think... stripping out the tags (as this PR does presently) shouldn't be required.

@alxp
Copy link
Author

alxp commented Nov 20, 2018

I found this is SolrConfig.xml

<!-- Configure the standard formatter -->
      <formatter name="html"
                 default="true"
                 class="solr.highlight.HtmlFormatter">
        <lst name="defaults">
          <str name="hl.simple.pre"><![CDATA[<em>]]></str>
          <str name="hl.simple.post"><![CDATA[</em>]]></str>
        </lst>
      </formatter>

This is what would be adding the tags on the results, this looks like a default setting so while we can remove it might still be a good idea to not have the UI render HTML tags if it is going to show them as tags, regardless of how a user's Sole result formatting is configured.

@adam-vessey
Copy link

I'm... not sure it makes sense, to try to strip markup there: anything in the results that is XML-like should be escaped already (otherwise, you could end up with snippets starting in the middle of an element, and breaking the entire markup of the page, anywhere you attempted to use the snippet)... seems like the only place unescaped markup could come from in snippets would be from these points of configuration... as in: The entire objective of these points of configuration is to add these pre-/suffix bits, for inclusion in the page...

... if it is indeed just because you're not using the "FastVectorHighlighter" (or rather, are using the "Original Highlighter"), then yeah... we just need to make it so it sets hl.simple.pre/.post such that it doesn't do its default thing, the same way we do the hl.tag.pre/.post... then... I'm kind of assuming something picks up on those {{{/}}} delimiters to render some form of marker?... bolding or italicizing? Would make it consistent, in any case.

@bondjimbond
Copy link

@adam-vessey I'm trying to follow developments here... It sounds like you might have found an issue with the particular Solr config being used that's causing the tags to appear in the first place -- is that right?

Given that the issue is not reproducible unless you have that particular Solr configuration, should we close this pull?

@adam-vessey
Copy link

@bondjimbond: I've not gone so far as to reproduce it (having no content on hand at the moment to test with), I believe that if you disable the "Enable Solr Fast Vector Highlighting" bit at admin/islandora/tools/ocr (we default to having it enabled), then you should be able to reproduce this...

... The fix would be changing the existing code to be something like:

  $component = variable_get('islandora_ocr_solr_hocr_highlighting_use_fast', TRUE) ?
    'tag' :
    'simple';
  $results = islandora_paged_content_perform_solr_highlighting_query($term, array(
    'fq' => array(format_string('!field:("info:fedora/!value" OR "!value")', array(
      '!field' => variable_get('islandora_internet_archive_bookreader_ocr_filter_field', 'RELS_EXT_isMemberOf_uri_ms'),
      '!value' => $object_id,
    ))),
    "hl.$component.pre" => '{{{',
    "hl.$component.post" => '}}}',
    'defType' => 'dismax',
  ));

Would want to double-check that the core Islandora Solr doesn't stomp on our "simple" values... but given that it only adds if they're not already there, then... it's probably fine in that respect?

@bondjimbond
Copy link

I believe that if you disable the "Enable Solr Fast Vector Highlighting" bit at admin/islandora/tools/ocr ... then you should be able to reproduce this...

THANK YOU, that did the job. I can now see the "em" attached to the start and end of the term.

So the issue arises only when you have turned off Solr Fast Vector Highlighting.

This pull resolves the problem, then... but @adam-vessey - are you suggesting that it should take a different approach?

@adam-vessey
Copy link

adam-vessey commented Nov 21, 2018

@bondjimbond: Yes, a different approach, with the goal of making the snippet match delimiters consistent independent of when "Enable Solr Fast Vector Highlighting" is enabled or disabled... as in: When it is enabled, what does the markup for it look like? Do we actually display the braces surrounding the matches? Are they replaced with markup somewhere else?

@adam-vessey
Copy link

@bondjimbond: Naive search (may also be stuff happening elsewhere), but appears to be relevant, replacing the {{{/}}} delimiters with markup: https://github.com/Islandora/internet_archive_bookreader/blob/e645cd172c983b453f6ebcd38901cd7d1f1290a3/BookReader/BookReader.js#L3479

@DigitLib
Copy link

@adam-vessey @bondjimbond It is shown em after disabling Solr Fast Vector Highlighting, try the PR and em gone...

@DigitLib
Copy link

DigitLib commented Nov 21, 2018

@bondjimbond Also on line https://github.com/Islandora/internet_archive_bookreader/blob/e645cd172c983b453f6ebcd38901cd7d1f1290a3/BookReader/BookReader.js#L3483
to add a space to divide a text and a PageNum? it is an aesthetic issue but give better view

@bondjimbond
Copy link

@DigitLib A space would certainly be nice.

@DigitLib
Copy link

@bondjimbond it is easy to do I can pull PR for that? also can add ... before and after search result?

@bondjimbond
Copy link

Yes, a different approach, with the goal of making the snippet match delimiters consistent independent of when "Enable Solr Fast Vector Highlighting" is enabled or disabled... as in: When it is enabled, what does the markup for it look like? Do we actually display the braces surrounding the matches? Are they replaced with markup somewhere else?

Naive search (may also be stuff happening elsewhere), but appears to be relevant, replacing the {{{/}}} delimiters with markup: https://github.com/Islandora/internet_archive_bookreader/blob/e645cd172c983b453f6ebcd38901cd7d1f1290a3/BookReader/BookReader.js#L3479

@alxp @rosiel Thoughts on @adam-vessey's comments?

@bondjimbond
Copy link

@rosiel @alxp Just a prod... Do you have any thoughts on the comments from @adam-vessey? I missed the last Committers' Call, so I don't know if it was discussed there.

@rosiel
Copy link
Member

rosiel commented Dec 18, 2018

Yes, it's reproducible if you set jquery update to 1.7 and disable fastVector highlighting in the OCR tool.

Based on the Solr Highlighting documentation, fastVector highlighting is one method of doing highlighting, of four available (in solr 7 at least). Maybe when this module was written it was considered good, but now "in order of general recommendation" it ranks 3rd of 4. So turning off "fastVector highlighting" seems to fall back on the Original highlighter, which is 'ranked' second. Highlighters have (slightly) different requirements and different arguments.

I don't know why we default to using fastVector, or if it gives us other useful features.

But when using the Original highlighter, the hl.tag.pre parameter is called hl.simple.pre (same for .post).

Maybe instead of stripping tags, along with h1.tag.pre/h1.tag.post we could just throw in

'hl.simple.pre' => '{{{',
'hl.simple.post' => '}}}',

That way we would get highlighted results both with fastVector checkbox enabled and without.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants