The Xapian full-text search (FTS) index currently covers the entire text of each webpage, with accents removed. Not all of that text is relevant, which leads to suboptimal results when searching inside a ZIM file. There is a library called readability that scores every DOM node and selects the one most likely to hold the actual page content ("actual" meaning the main content, as opposed to nodes such as "Similar products" or "See also"). It would therefore be possible to index only the relevant content, or even to replace the original content with what readability identifies as relevant. Either way, search results should improve.
It would even be possible to keep both indexes side by side: the current index, unchanged, and a new one built with the proposed method.
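To make the idea concrete, here is a minimal sketch of a readability-style heuristic, written in Python with only the standard library. It is not the readability library's actual algorithm and not libzim code: the scoring rule (one point per paragraph, plus one per comma, plus up to three for length, credited to the parent and half to the grandparent) is a simplified assumption, and the names `TreeBuilder` and `main_content_text` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of Node objects and plain strings


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", None)
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.children.append(data)


def text_of(node):
    if node.tag in ("script", "style"):
        return ""
    return "".join(c if isinstance(c, str) else text_of(c) for c in node.children)


def walk(node):
    yield node
    for child in node.children:
        if not isinstance(child, str):
            yield from walk(child)


def main_content_text(html):
    """Return the text of the highest-scoring container, or all text as a fallback."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    scores = {}
    for node in walk(builder.root):
        if node.tag != "p":
            continue
        text = " ".join(text_of(node).split())
        if len(text) < 25:  # very short paragraphs carry little signal
            continue
        score = 1 + text.count(",") + min(len(text) // 100, 3)
        parent = node.parent
        scores[parent] = scores.get(parent, 0) + score
        if parent.parent is not None:
            scores[parent.parent] = scores.get(parent.parent, 0) + score / 2
    best = max(scores, key=scores.get) if scores else builder.root
    return " ".join(text_of(best).split())
```

The returned string is what would be handed to the indexer in place of the full page text; keeping both indexes side by side would simply mean indexing the full text and this reduced text separately.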
This is not possible. A long time ago (before the "new" C++ API and the no-namespace change) I experimented with our parser (https://github.com/openzim/libzim/blob/main/src/xapian/myhtmlparse.cc) to remove things like footnotes or tags with a cite_note* id (from memory).
But I quickly stopped, as we agreed that it was not up to us to do this, since it was really specific to Wikipedia.
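For illustration, the kind of filtering described above could look roughly like the sketch below. It is written in Python with the standard library rather than in the C++ of myhtmlparse.cc, it is not the code that was actually tried, and it assumes well-formed nesting. The class name `IndexableText`, the function `indexable_text`, and the `skip_id_prefixes` parameter are hypothetical.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not affect the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collects page text, skipping any element whose id starts with a given prefix."""

    def __init__(self, skip_id_prefixes=("cite_note",)):
        super().__init__()
        self.skip_id_prefixes = tuple(skip_id_prefixes)
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1
            return
        elem_id = dict(attrs).get("id") or ""
        if elem_id.startswith(self.skip_id_prefixes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def indexable_text(html, skip_id_prefixes=("cite_note",)):
    parser = IndexableText(skip_id_prefixes)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

The hard-coded `cite_note` prefix shows why this approach was considered too Wikipedia-specific: every other site would need its own list of ids or classes to drop.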