Increase relevance of search results within ZIM by refining the indexed content #952

wsdookadr · 2025-02-10T15:43:55Z

Xapian's full-text search currently indexes the entire text of each webpage (with accents removed), but not all of that text is relevant, which leads to suboptimal results when searching inside a ZIM. There is a library called readability whose algorithm scores all DOM nodes and selects the one most likely to hold the actual webpage content ("actual" meaning it excludes nodes such as "Similar products" or "See also"). It would therefore be possible to index only the relevant content, or even to replace the original content with what readability believes is the relevant content. Either way, the search results should improve.

It would even be possible to keep both indexes side by side: the current index, unchanged, and a new one built with the proposed method.
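
To make the node-scoring idea concrete, here is a minimal sketch in Python using only the standard library. It is a much-simplified stand-in, not the readability library's actual algorithm: each container element is scored by the length of the text it directly holds, and text inside links counts against it. The names `BlockScorer` and `extract_main_text`, the list of candidate tags, and the scoring rule are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as possible "main content" containers.
CANDIDATES = {"div", "article", "section", "main", "td"}
# Void elements never get a closing tag, so they are kept off the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class BlockScorer(HTMLParser):
    """Scores candidate containers by text length, penalising link text."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open elements as (tag, block index or None)
        self.blocks = []      # one {"score", "text"} dict per candidate
        self.link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        index = None
        if tag in CANDIDATES:
            index = len(self.blocks)
            self.blocks.append({"score": 0, "text": []})
        if tag == "a":
            self.link_depth += 1
        self.stack.append((tag, index))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for pos in range(len(self.stack) - 1, -1, -1):
            if self.stack[pos][0] == tag:
                for closed_tag, _ in self.stack[pos:]:
                    if closed_tag == "a":
                        self.link_depth -= 1
                del self.stack[pos:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        if any(tag in ("script", "style") for tag, _ in self.stack):
            return
        # Credit (or penalise) the nearest enclosing candidate container.
        for _, index in reversed(self.stack):
            if index is not None:
                block = self.blocks[index]
                block["text"].append(text)
                block["score"] += -len(text) if self.link_depth else len(text)
                break


def extract_main_text(html):
    """Returns the text of the highest-scoring container, or "" if none."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.blocks:
        return ""
    best = max(scorer.blocks, key=lambda block: block["score"])
    return " ".join(best["text"])


sample = (
    '<html><body>'
    '<div id="nav"><a href="/a">Home</a> <a href="/b">See also</a></div>'
    '<div id="content"><p>Xapian builds a full text index of every page.</p>'
    '<p>Only the article body is relevant.</p></div>'
    '<div class="related"><a href="/x">Similar products</a></div>'
    '</body></html>'
)
print(extract_main_text(sample))
```

The text returned by `extract_main_text` is what would be handed to the indexer in place of the full page text. A real implementation would need a richer scoring rule (class names, paragraph counts, nesting) than this sketch uses.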

kelson42 · 2025-02-10T15:52:48Z

FYI, there is actually a way to specify, in the HTML itself, that part of the content should be ignored. It is probably an undocumented feature, though...

mgautierfr · 2025-02-10T20:37:19Z

This is possible and documented: https://libzim.readthedocs.io/en/latest/api/classzim_1_1writer_1_1IndexData.html#_CPPv4NK3zim6writer9IndexData10getContentEv

It is up to the scraper to implement this method, as how to provide curated content depends on the content itself (and so, on the scraper).
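
As a rough illustration of what "the scraper implements this method" could look like, here is a plain Python sketch of an object that hands curated text to the indexer. It does not import or inherit from libzim. Only `getContent` is confirmed by the link above; the class name `CuratedIndexData` and the other accessors (`has_indexdata`, `get_title`, `get_wordcount`) are assumptions modelled on that interface, and wiring such an object into a real writer item is not shown.

```python
class CuratedIndexData:
    """Hypothetical stand-in for an index-data provider.

    It holds text that the scraper has already curated, so the indexer
    never sees navigation, footers, or other boilerplate.
    """

    def __init__(self, title, curated_text):
        self._title = title
        self._content = curated_text

    def has_indexdata(self):
        # Nothing to index if the curated text is empty or whitespace only.
        return bool(self._content.strip())

    def get_title(self):
        return self._title

    def get_content(self):
        # Python counterpart of the getContent() accessor linked above:
        # returns the text to index instead of the full page text.
        return self._content

    def get_wordcount(self):
        return len(self._content.split())


data = CuratedIndexData("Xapian", "Only the article body")
print(data.get_content(), data.get_wordcount())
```

How the curated text is produced is left to each scraper, since it depends on the structure of the scraped site.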

kelson42 · 2025-02-11T03:19:20Z

@mgautierfr Yes, although this is not what I meant. I meant without having to pass custom content to be indexed.

mgautierfr · 2025-02-11T08:03:33Z

This is not possible. A long time ago (before the "new" cpp api and the no namespace change), I experimented with our parser (https://github.com/openzim/libzim/blob/main/src/xapian/myhtmlparse.cc) to remove things like footnotes or tags with a cite_note* id (from memory).
But I quickly stopped, as we agreed that it was not our job to do this, since it was really specific to Wikipedia.
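
For illustration only, the kind of filtering described here could look like the following Python sketch, which drops every element whose `id` starts with a given prefix while collecting text. It is not the code from `myhtmlparse.cc`; the class name `TextWithoutNotes` and the default prefix are assumptions, and the sketch assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextWithoutNotes(HTMLParser):
    """Collects page text, skipping subtrees whose id has a given prefix."""

    def __init__(self, id_prefix="cite_note"):
        super().__init__()
        self.id_prefix = id_prefix
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already skipping: track nesting so we know when to stop.
            self.skip_depth += 1
        elif (dict(attrs).get("id") or "").startswith(self.id_prefix):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


page = (
    '<p>Kiwix serves ZIM files.'
    '<sup id="cite_note-1"><a href="#n1">[1]</a></sup> They work offline.</p>'
    '<ol><li id="cite_note-2"><span>Some reference</span></li></ol>'
)
parser = TextWithoutNotes()
parser.feed(page)
parser.close()
print(" ".join(parser.chunks))
```

The hard-coded prefix shows why this kind of rule is site-specific: it only makes sense for pages that follow Wikipedia's markup conventions.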

wsdookadr mentioned this issue Feb 10, 2025

Possible enhancements to ZIM #951

Open
