Increase relevance of search results within ZIM by refining the indexed content #952

Open
wsdookadr opened this issue Feb 10, 2025 · 4 comments

Comments

@wsdookadr

wsdookadr commented Feb 10, 2025

The Xapian full-text search (FTS) index currently uses the entire text of each webpage, with accents removed. Not all of that text is relevant, which leads to suboptimal results when searching inside the ZIM. There is a library called readability whose algorithm scores all DOM nodes and selects the one most likely to hold the actual webpage content ("actual" meaning not things like "Similar products" or "See also" nodes). It would therefore be possible to index only the relevant content, or even to replace the original content with what readability believes is the relevant content. Either way, search results should improve.

It would even be possible to keep both indexes side by side: the current index, unchanged, and a new one built with the proposed method.
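
To make the idea concrete, here is a minimal sketch of a readability-style scoring pass. It is not the readability library itself, only a simplified version of the same heuristic: each paragraph earns points for its length and commas, the points go to its parent (and half to its grandparent), and containers dominated by link text are penalized. The names `Node`, `TreeBuilder` and `extract_main_text`, the 25-character threshold, and the weights are assumptions made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
IGNORED_TEXT_TAGS = {"script", "style"}


class Node:
    """A minimal DOM node: children are either Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.score = 0.0

    def text(self):
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)

    def link_text_length(self):
        if self.tag == "a":
            return len(self.text())
        return sum(c.link_text_length() for c in self.children if isinstance(c, Node))


class TreeBuilder(HTMLParser):
    """Builds a rough tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node
        if tag == "p":
            self.paragraphs.append(node)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in IGNORED_TEXT_TAGS:
            self.current.children.append(data)


def extract_main_text(html):
    """Return the text of the container that looks most like the main content."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()

    candidates = []
    for paragraph in builder.paragraphs:
        text = paragraph.text().strip()
        if len(text) < 25:  # illustrative threshold: skip very short paragraphs
            continue
        points = 1 + text.count(",") + min(len(text) // 100, 3)
        parent = paragraph.parent
        grandparent = parent.parent if parent is not None else None
        for node, weight in ((parent, 1.0), (grandparent, 0.5)):
            if node is None:
                continue
            if node not in candidates:
                candidates.append(node)
            node.score += points * weight

    def final_score(node):
        # Penalize containers whose text is mostly links (menus, "See also").
        total = max(len(node.text()), 1)
        return node.score * (1 - node.link_text_length() / total)

    # Fall back to the whole document when nothing scored.
    best = max(candidates, key=final_score, default=builder.root)
    return " ".join(best.text().split())
```

Calling `extract_main_text(page_html)` returns whitespace-normalized text that could be indexed instead of the full page text. A real scraper would more likely call an existing readability implementation than reimplement the scoring.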

@kelson42
Contributor

FYI, there is actually a way to specify, in the HTML itself, that part of the content should be ignored. It is probably an undocumented feature, though...

@mgautierfr
Collaborator

This is possible and documented : https://libzim.readthedocs.io/en/latest/api/classzim_1_1writer_1_1IndexData.html#_CPPv4NK3zim6writer9IndexData10getContentEv

It is up to the scraper to implement this method, since how to provide curated content depends on the content itself (and therefore on the scraper).
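
As a rough illustration of what a scraper would supply, the sketch below mirrors the virtual methods of the C++ `zim::writer::IndexData` interface (`hasIndexData`, `getTitle`, `getContent`, `getKeywords`, `getWordCount`, `getGeoPosition`) in a plain Python class. It does not import libzim or any binding; the class name `CuratedIndexData` and the snake_case method names are hypothetical stand-ins, and the curated text is assumed to have been produced by the scraper beforehand.

```python
class CuratedIndexData:
    """Hypothetical stand-in shaped like zim::writer::IndexData (not the real binding)."""

    def __init__(self, title, curated_text, keywords=""):
        self._title = title
        self._content = curated_text  # only the text the scraper wants indexed
        self._keywords = keywords

    def has_index_data(self):
        # Nothing to index if the curated text is empty or whitespace only.
        return bool(self._content.strip())

    def get_title(self):
        return self._title

    def get_content(self):
        # The indexer would use this text instead of parsing the full HTML.
        return self._content

    def get_keywords(self):
        return self._keywords

    def get_word_count(self):
        return len(self._content.split())

    def get_geo_position(self):
        # (has_position, latitude, longitude); this sketch has no geo data.
        return (False, 0.0, 0.0)
```

The point of the design is that the library never has to guess which parts of a page matter: whatever the scraper returns from the content getter is what ends up in the index.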

@kelson42
Contributor

@mgautierfr yes, although this is not what I meant. I meant without having to pass custom content to index.

@mgautierfr
Collaborator

This is not possible. A long time ago (before the "new" cpp API and the no-namespace change), I experimented with our parser (https://github.com/openzim/libzim/blob/main/src/xapian/myhtmlparse.cc) to remove things like footnotes or tags with a cite_note* id (from memory).
But I quickly stopped, as we agreed that it was not up to us to do this, since it was really specific to Wikipedia.
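
For reference, the kind of filtering described here can be sketched in a few lines. This is not the code from myhtmlparse.cc, only a minimal Python illustration using the standard `html.parser` module: it collects text but skips every element whose `id` starts with a given prefix, together with everything nested inside it. The names `FilteredTextExtractor` and `strip_by_id_prefix` are made up for the example, and it assumes well-formed markup with balanced tags.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FilteredTextExtractor(HTMLParser):
    """Collects text, skipping subtrees whose id starts with a given prefix."""

    def __init__(self, skipped_id_prefixes=("cite_note",)):
        super().__init__()
        self._prefixes = tuple(skipped_id_prefixes)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no closing tag, so they never nest
        if self._skip_depth:
            self._skip_depth += 1
            return
        element_id = dict(attrs).get("id") or ""
        if element_id.startswith(self._prefixes):
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return " ".join("".join(self._chunks).split())


def strip_by_id_prefix(html, prefixes=("cite_note",)):
    """Return the page text without elements whose id matches one of the prefixes."""
    parser = FilteredTextExtractor(prefixes)
    parser.feed(html)
    parser.close()
    return parser.text()
```

The hard-coded `cite_note` default shows why this was judged too Wikipedia-specific for the library: the prefix list only makes sense for one family of sites, so it fits more naturally in a scraper.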


3 participants