Search only on a subset of articles - add faceted search to the libzim #943

Open
benoit74 opened this issue Jan 7, 2025 · 0 comments

#734 already highlights that users might want to search for documents in a specific language, but it focuses mostly on the technical issues of using the proper stemmer / stop words.

I think that we should take a step back and consider that there is an overall bigger issue.

I will illustrate this issue with the DevDocs scraper, because this is where it comes from.

With the DevDocs scraper, we intend (see openzim/zim-requests#1230) to create one ZIM with all documentation for a given programming language (e.g. all versions of Python, all versions of Node.js, ...). We want to do this because, in our experience, developers often switch between versions. For instance at Kiwix, we regularly switch from one Python version to another, depending on which Python version our project currently uses. And we are pretty sure this is a common pattern. So we do not want to create one ZIM per Python version, but preferably one ZIM with all Python versions. This is what we already have from online documentation ZIMed with Zimit: https://library.kiwix.org/viewer#docs.python.org_en

Due to the nature of this documentation, we will have many articles which are very close to one another (e.g. the same page repeated for every version). As such, suggestions and full-text search become very hard to use. Being able to select a subset of articles to search in would definitely help a lot. For instance, the scraper could attach "tags" to the articles, and the user could then request (via the reader) to search only for documents with a given tag (or set of tags).
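
To make the tag idea concrete, here is a minimal sketch in plain Python. It is not libzim code and does not reflect any existing libzim API: the `Article` class, the `search` function, the tag names and the article paths are all hypothetical, and a simple title match stands in for real full-text search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical article record; tags would be attached by the scraper."""
    path: str
    title: str
    tags: frozenset = frozenset()


def search(articles, query, required_tags=()):
    """Return paths of articles whose title contains `query`
    and which carry every tag in `required_tags`."""
    required = set(required_tags)
    needle = query.lower()
    return [
        a.path
        for a in articles
        if required <= a.tags and needle in a.title.lower()
    ]


# Illustrative data only: near-identical articles from two Python versions.
articles = [
    Article("python-3.11/library/asyncio", "asyncio", frozenset({"python", "python-3.11"})),
    Article("python-3.12/library/asyncio", "asyncio", frozenset({"python", "python-3.12"})),
    Article("node-20/api/fs", "File system", frozenset({"node", "node-20"})),
]

# Without a tag filter, both near-duplicates come back.
unfiltered = search(articles, "asyncio")

# With a tag filter, only the requested version is searched.
filtered = search(articles, "asyncio", required_tags=["python-3.12"])
```

In this sketch the reader would only have to pass the user-selected tags along with the query; how tags would actually be stored in the ZIM and matched by the search index is exactly the open question.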

This specific problem of DevDocs is in fact generic to many scrapers / ZIMs:

  • search only for articles in a given language
  • search only for articles in a given format (e.g. only ePub, only PDF in Gutenberg)
  • search only for articles on a given topic (here I spoke about programming languages in DevDocs, but we also have topics in TED, and similar things in multiple other scrapers)

All this looks a lot like faceting capabilities for the libzim.
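
As a rough illustration of what faceting means here, the sketch below (plain Python, not libzim code) treats language and format as facets: it counts how many documents fall under each facet value and narrows the document set to a selection. The field names, the `facet_counts` and `restrict` helpers and the sample documents are assumptions made up for this example.

```python
from collections import Counter

# Illustrative documents only; a real ZIM would hold this metadata per entry.
documents = [
    {"title": "Pride and Prejudice", "language": "en", "format": "epub"},
    {"title": "Pride and Prejudice", "language": "en", "format": "pdf"},
    {"title": "Candide", "language": "fr", "format": "epub"},
]


def facet_counts(docs, facet):
    """Count documents per value of one facet (e.g. 'language')."""
    return Counter(d[facet] for d in docs if facet in d)


def restrict(docs, **selected):
    """Keep only documents matching every selected facet value."""
    return [d for d in docs if all(d.get(k) == v for k, v in selected.items())]


# What a reader could display next to the search box.
by_format = facet_counts(documents, "format")

# The subset that a subsequent search would run against.
english_pdfs = restrict(documents, language="en", format="pdf")
```

Full faceting would mean exposing both operations (counts per value and restriction); plain tag filtering only needs the second.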

While this could be implemented by the scraper in the HTML / JS embedded inside the ZIM (with JS-based faceting, for instance), I am starting to think that this is in fact a feature that must be provided by the libzim and implemented in readers. I don't know if we need full faceting capabilities, but this is a question we need to ask ourselves and answer.
