ISLANDORA-2297 Book reader malformed search result preview on pin hover #127

alxp · 2018-08-29T16:39:16Z

Updated to point against 7.x branch

Strip HTML tags before removing non-alphanumeric characters.

JIRA Ticket: (https://jira.duraspace.org/browse/ISLANDORA-2297)

Other Relevant Links (Google Groups discussion, related pull requests, Release pull requests, etc.)

What does this Pull Request do?

A brief description of what the intended result of the PR will be and/or what problem it solves.

What's new?

An in-depth description of the changes made by this PR. Technical details and possible side effects.

Example:

Changes x feature to such that y
Added x
Removed y

How should this be tested?

A description of what steps someone could take to:

Reproduce the problem you are fixing (if applicable)
Test that the Pull Request does what is intended.
Please be as detailed as possible.
Good testing instructions help get your PR completed faster.

Additional Notes:

Any additional information that you think would be helpful when reviewing this PR.

Example:

Does this change require documentation to be updated?
Does this change add any new dependencies?
Does this change require any other modifications to be made to the repository (ie. Regeneration activity, etc.)?
Could this change impact execution of existing code?

Interested parties

Tag (@ mention) interested parties or, if unsure, @Islandora/7-x-1-x-committers


rosiel · 2018-08-29T22:43:29Z

Thanks @alxp! This looks good to me, but per Islandora rules I can't approve since we work together.

DiegoPino · 2018-08-29T22:46:29Z

@rosiel you can approve. You just can't merge.

bondjimbond · 2018-10-15T18:55:25Z

@alxp Can you add some testing instructions? I'd like to try this out, but your PR doesn't have any of the template changed/filled in.

bondjimbond · 2018-10-19T19:29:29Z

@alxp Just a quick prod since we're hitting code freeze... If you can provide some testing instructions, I can run through this quick and perhaps get it approved before the deadline.

rosiel · 2018-10-25T18:47:46Z

When this PR was changed to use the right branch, the description from the pull request template was lost. It is here: #126

bondjimbond · 2018-10-26T14:00:29Z

Thanks, @rosiel. For reviewers/testers then, here's the detail from the original PR:

JIRA Ticket: https://jira.duraspace.org/browse/ISLANDORA-2297

**What does this Pull Request do?**

Strip out HTML tags from search result previews that show when you hover over a pin in the book reader. This solves an issue where HTML tag names were appearing in the search highlights.

I decided to strip out tags because keeping the wanted tags while stripping out any characters that might break layout seems unnecessarily complicated.

**What's new?**
Added a call to the PHP strip_tags() function.

**How should this be tested?**

On a working instance of the Internet Archive Book Reader with a book that has an OCR text stream, and searching working correctly, the problem will be reproducible if Solr sends back tags to surround the searched-for keyword.

E.g., If you search for "mountain" the hover text over a search result pin will show "enmountainem".
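
To illustrate why the tag names end up glued to the keyword, here is a minimal sketch. The helper name `clean_snippet` and the regular expression are assumptions made for illustration, not the module's actual code; only `strip_tags()` is taken from the description above.

```php
<?php
// Hypothetical helper (not the module's real function): reduce a Solr
// highlight snippet to plain text for the pin hover preview.
function clean_snippet($snippet, $strip_tags_first) {
  if ($strip_tags_first) {
    // Remove markup such as <em>...</em> while it is still recognizable.
    $snippet = strip_tags($snippet);
  }
  // Assumed cleanup step: drop everything except letters, digits and
  // whitespace. Without the step above, "<em>" loses its brackets and
  // leaves the letters "em" behind.
  return preg_replace('/[^A-Za-z0-9\s]/', '', $snippet);
}

$raw = 'the <em>mountain</em> pass';
echo clean_snippet($raw, FALSE), "\n"; // the emmountainem pass
echo clean_snippet($raw, TRUE), "\n";  // the mountain pass
```

The order matters: once the angle brackets are gone, `strip_tags()` has nothing left to recognize.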

bondjimbond · 2018-11-01T15:24:27Z

@rosiel @alxp Trying to test this, but the hover text in my Book Reader (7.x-1.12 VM) just says "Search Result". Any special configuration I need to make it display the key words?

DigitLib · 2018-11-04T12:25:37Z

@rosiel @bondjimbond The maximum jQuery version that displays it is 1.7. In jQuery Update at http://localhost:8000/admin/config/development/jquery_update the version is set to 1.10.
@DiegoPino @rosiel Should we put this in JIRA?
Tested with the changed file and it shows normal text; with the unchanged file it also shows the normal searched word.

willtp87 · 2018-11-08T19:36:47Z

Is "enmountainem" a typo or a case of mismatched tags?

alxp · 2018-11-08T19:42:58Z

It should be "emmountainem" i.e., the tag coming from the search engine to show highlights.

bondjimbond · 2018-11-20T15:17:50Z

I'm finally catching up on this. Had to set JQuery to 1.7 -- yes, we need a ticket for that.

But @alxp I'm not reproducing the issue. My search results contain no tags (7.x-1.12 release candidate machine). I even added HTML tags to the OCR datastream (added <b></b> around a term and searched for it); the search preview strips it.

Can you provide clearer testing instructions? What exactly do you have in the OCR that makes this issue appear?

adam-vessey · 2018-11-20T18:27:53Z

We provide different wrapping tokens that Solr should use, instead of the default <em>/</em> tag business... is hl.tag.pre/hl.tag.post set invariant/appended in your Solr request handlers?

... Alternatively... looks like it might use hl.simple.pre/hl.simple.post when not using the "FastVectorHighlighter" business... Possibly just have to change which pre/post business we do, conditional on whether or not we're using "FastVectorHighlighter"?... though looks like the Islandora Solr core code itself may set these to other tags... might be an edge case left uncovered, but yeah... more info particular to configurations would help track it down, I think... stripping out the tags (as this PR does presently) shouldn't be required.

alxp · 2018-11-20T19:16:06Z

I found this in SolrConfig.xml

<!-- Configure the standard formatter -->
<formatter name="html"
           default="true"
           class="solr.highlight.HtmlFormatter">
  <lst name="defaults">
    <str name="hl.simple.pre"><![CDATA[<em>]]></str>
    <str name="hl.simple.post"><![CDATA[</em>]]></str>
  </lst>
</formatter>

This is what would be adding the tags to the results. It looks like a default setting, so while we can remove it, it might still be a good idea not to have the UI render HTML tags if it is going to show them as tags, regardless of how a user's Solr result formatting is configured.

adam-vessey · 2018-11-20T19:34:18Z

I'm... not sure it makes sense, to try to strip markup there: anything in the results that is XML-like should be escaped already (otherwise, you could end up with snippets starting in the middle of an element, and breaking the entire markup of the page, anywhere you attempted to use the snippet)... seems like the only place unescaped markup could come from in snippets would be from these points of configuration... as in: The entire objective of these points of configuration is to add these pre-/suffix bits, for inclusion in the page...

... if it is indeed just because you're not using the "FastVectorHighlighter" (or rather, are using the "Original Highlighter"), then yeah... we just need to make it so it sets hl.simple.pre/.post such that it doesn't do its default thing, the same way we do the hl.tag.pre/.post... then... I'm kind of assuming something picks up on those {{{/}}} delimiters to render some form of marker?... bolding or italicizing? Would make it consistent, in any case.

bondjimbond · 2018-11-21T17:56:35Z

@adam-vessey I'm trying to follow developments here... It sounds like you might have found an issue with the particular Solr config being used that's causing the tags to appear in the first place -- is that right?

Given that the issue is not reproducible unless you have that particular Solr configuration, should we close this pull?

adam-vessey · 2018-11-21T18:42:11Z

@bondjimbond: I've not gone so far as to reproduce it (having no content on hand at the moment to test with), but I believe that if you disable the "Enable Solr Fast Vector Highlighting" bit at admin/islandora/tools/ocr (we default to having it enabled), then you should be able to reproduce this...

... The fix would be changing the existing code to be something like:

  $component = variable_get('islandora_ocr_solr_hocr_highlighting_use_fast', TRUE) ?
    'tag' :
    'simple';
  $results = islandora_paged_content_perform_solr_highlighting_query($term, array(
    'fq' => array(format_string('!field:("info:fedora/!value" OR "!value")', array(
      '!field' => variable_get('islandora_internet_archive_bookreader_ocr_filter_field', 'RELS_EXT_isMemberOf_uri_ms'),
      '!value' => $object_id,
    ))),
    "hl.$component.pre" => '{{{',
    "hl.$component.post" => '}}}',
    'defType' => 'dismax',
  ));

Would want to double-check that the core Islandora Solr doesn't stomp on our "simple" values... but given that it only adds if they're not already there, then... it's probably fine in that respect?

bondjimbond · 2018-11-21T19:25:22Z

> I believe that if you disable the "Enable Solr Fast Vector Highlighting" bit at admin/islandora/tools/ocr ... then you should be able to reproduce this...

THANK YOU, that did the job. I can now see the "em" attached to the start and end of the term.

So the issue arises only when you have turned off Solr Fast Vector Highlighting.

This pull resolves the problem, then... but @adam-vessey - are you suggesting that it should take a different approach?

adam-vessey · 2018-11-21T19:35:02Z

@bondjimbond: Yes, a different approach, with the goal of making the snippet match delimiters consistent independent of when "Enable Solr Fast Vector Highlighting" is enabled or disabled... as in: When it is enabled, what does the markup for it look like? Do we actually display the braces surrounding the matches? Are they replaced with markup somewhere else?

adam-vessey · 2018-11-21T19:43:22Z

@bondjimbond: Naive search (may also be stuff happening elsewhere), but appears to be relevant, replacing the {{{/}}} delimiters with markup: https://github.com/Islandora/internet_archive_bookreader/blob/e645cd172c983b453f6ebcd38901cd7d1f1290a3/BookReader/BookReader.js#L3479
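
As a rough sketch of the idea discussed here (snippet text arrives escaped, and only the `{{{`/`}}}` delimiters are turned into markup), something like the following could work. This is PHP for illustration only: the linked BookReader.js does its own replacement in JavaScript, and the helper name `render_snippet` and the choice of `<strong>` are assumptions.

```php
<?php
// Hypothetical helper: escape the snippet, then turn the neutral
// delimiters into markup. Not the code used by BookReader.js.
function render_snippet($snippet) {
  // Escape anything XML-like in the OCR text so it cannot break the page.
  $escaped = htmlspecialchars($snippet, ENT_QUOTES, 'UTF-8');
  // The braces survive escaping untouched, so they can be swapped for
  // markup afterwards. <strong> is an assumed choice of highlight tag.
  return str_replace(
    array('{{{', '}}}'),
    array('<strong>', '</strong>'),
    $escaped
  );
}

echo render_snippet('a <b>steep</b> {{{mountain}}} pass'), "\n";
// a &lt;b&gt;steep&lt;/b&gt; <strong>mountain</strong> pass
```

With delimiters that are not HTML, no tag stripping is needed, whichever highlighter produced the snippet.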

DigitLib · 2018-11-21T20:09:36Z

@adam-vessey @bondjimbond The "em" is shown after disabling Solr Fast Vector Highlighting; try the PR and the "em" is gone...

DigitLib · 2018-11-21T20:31:59Z

@bondjimbond Also, on line https://github.com/Islandora/internet_archive_bookreader/blob/e645cd172c983b453f6ebcd38901cd7d1f1290a3/BookReader/BookReader.js#L3483
could we add a space to separate the text and the PageNum? It is an aesthetic issue, but it gives a better view.

bondjimbond · 2018-11-21T20:58:02Z

@DigitLib A space would certainly be nice.

DigitLib · 2018-11-21T21:02:07Z

@bondjimbond It is easy to do. Can I open a PR for that? I can also add "..." before and after the search result?

bondjimbond · 2018-11-23T13:49:34Z

> Yes, a different approach, with the goal of making the snippet match delimiters consistent independent of when "Enable Solr Fast Vector Highlighting" is enabled or disabled... as in: When it is enabled, what does the markup for it look like? Do we actually display the braces surrounding the matches? Are they replaced with markup somewhere else?

> Naive search (may also be stuff happening elsewhere), but appears to be relevant, replacing the {{{/}}} delimiters with markup: https://github.com/Islandora/internet_archive_bookreader/blob/e645cd172c983b453f6ebcd38901cd7d1f1290a3/BookReader/BookReader.js#L3479

@alxp @rosiel Thoughts on @adam-vessey's comments?

bondjimbond · 2018-12-12T20:22:19Z

@rosiel @alxp Just a prod... Do you have any thoughts on the comments from @adam-vessey? I missed the last Committers' Call, so I don't know if it was discussed there.

rosiel · 2018-12-18T22:20:30Z

Yes, it's reproducible if you set jquery update to 1.7 and disable fastVector highlighting in the OCR tool.

Based on the Solr Highlighting documentation, fastVector highlighting is one method of doing highlighting, of four available (in solr 7 at least). Maybe when this module was written it was considered good, but now "in order of general recommendation" it ranks 3rd of 4. So turning off "fastVector highlighting" seems to fall back on the Original highlighter, which is 'ranked' second. Highlighters have (slightly) different requirements and different arguments.

I don't know why we default to using fastVector, or if it gives us other useful features.

But when using the Original highlighter, the hl.tag.pre parameter is called hl.simple.pre (same for .post).

Maybe instead of stripping tags, along with hl.tag.pre/hl.tag.post we could just throw in

'hl.simple.pre' => '{{{',
'hl.simple.post' => '}}}',

That way we would get highlighted results both with fastVector checkbox enabled and without.
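
A minimal sketch of that idea, assuming a hypothetical helper named `build_highlight_params` (it is not part of any Islandora module, and the real query is built elsewhere): both highlighter families get the same neutral delimiters, so the result looks the same whether the checkbox is on or off.

```php
<?php
// Hypothetical helper: add the same delimiters for both the
// FastVectorHighlighter (hl.tag.*) and the Original highlighter
// (hl.simple.*) parameters.
function build_highlight_params(array $base = array()) {
  $delimiters = array(
    'hl.tag.pre' => '{{{',
    'hl.tag.post' => '}}}',
    'hl.simple.pre' => '{{{',
    'hl.simple.post' => '}}}',
  );
  // Array union keeps any value the caller already supplied.
  return $base + $delimiters;
}

print_r(build_highlight_params(array('defType' => 'dismax')));
```

Solr should simply ignore the pair that does not apply to the active highlighter, but that is worth confirming against the Solr version in use.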

ISLANDORA-2297 Book reader malformed search result preview on pin hover

f3d9e59

Strip HTML tags before removing non-alphanumeric characters.
