Search only on a subset of articles - add faceted search to the libzim #943

benoit74 · 2025-01-07T09:10:07Z

#734 already highlights that users might want to search for documents in a specific language, but it focuses mostly on the technical issues of using the proper stemmer / stop words.

I think that we should take a step back and consider that there is an overall bigger issue.

I will illustrate this issue with the DevDocs scraper, because this is where it comes from.

With the DevDocs scraper, we intend (see openzim/zim-requests#1230) to create one ZIM with all documentation for a given programming language (e.g. all versions of Python, all versions of Node.js, ...). We want to do this because, in our experience, developers often switch between versions. For instance at Kiwix, we regularly switch from one Python version to another, depending on which Python version our current project uses. And we are pretty sure this is a common pattern. So we do not want to create one ZIM per Python version, but preferably one ZIM with all Python versions. This is what we already have from online documentation ZIMed with Zimit: https://library.kiwix.org/viewer#docs.python.org_en

Due to the nature of this documentation, we will have many articles which are very close to one another. As a result, suggestions and full-text search become very hard to use. Being able to select a subset of articles to search in would definitely help a lot. For instance, the scraper could attach "tags" to the articles, and the user could then request (via the reader) to search only for documents with a given tag (or set of tags).
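To make the tag idea concrete, here is a minimal sketch in plain Python. It is not libzim code and does not reflect any existing libzim API: the `Article` class, the `search` function, the `version:` / `lang:` tag convention and the article paths are all hypothetical, chosen only to show a search restricted to articles carrying a given set of tags.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Article:
    # Hypothetical article record; a real ZIM entry looks different.
    path: str
    title: str
    tags: frozenset = field(default_factory=frozenset)


def search(articles, query, required_tags=()):
    """Return articles whose title contains `query` and that carry every required tag."""
    required = frozenset(required_tags)
    needle = query.lower()
    return [
        a for a in articles
        if required <= a.tags and needle in a.title.lower()
    ]


# Made-up sample data: the same page exists once per Python version.
ARTICLES = [
    Article("python-3.11/library/asyncio", "asyncio", frozenset({"version:3.11", "lang:en"})),
    Article("python-3.12/library/asyncio", "asyncio", frozenset({"version:3.12", "lang:en"})),
    Article("python-3.12/library/json", "json", frozenset({"version:3.12", "lang:en"})),
]

# Without a filter, near-duplicate pages from every version are returned.
all_hits = search(ARTICLES, "asyncio")

# With a tag filter, only the requested version remains.
filtered_hits = search(ARTICLES, "asyncio", required_tags=["version:3.12"])
```

In this sketch the unfiltered query returns both `asyncio` pages, while the filtered one returns only the 3.12 page, which is the behaviour a reader would need to expose.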

This specific problem of DevDocs is in fact generic to many scrapers / ZIMs:

- search only for articles in a given language
- search only for articles in a given format (e.g. only ePub, only PDF in Gutenberg)
- search only for articles in a given topic (here I spoke about programming languages in DevDocs, but we also have topics in TED, and similar things in multiple other scrapers)

All this looks a lot like faceting capabilities for the libzim.
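For reference, "full" faceting usually means more than filtering: the reader also shows, for a given result set, how many hits fall under each value of each facet. A minimal sketch of that counting step, again in plain Python and under assumptions: the `facet:value` tag convention, the `facet_counts` helper and the sample paths are hypothetical and not part of libzim.

```python
from collections import Counter

# Hypothetical search hits as (path, tags) pairs.
HITS = [
    ("gutenberg/1342.epub", {"format:epub", "lang:en"}),
    ("gutenberg/1342.pdf", {"format:pdf", "lang:en"}),
    ("gutenberg/4650.epub", {"format:epub", "lang:fr"}),
]


def facet_counts(hits):
    """Group tags by facet name and count how many hits carry each value."""
    counts = {}
    for _path, tags in hits:
        for tag in tags:
            # "format:epub" -> facet "format", value "epub"
            facet, _, value = tag.partition(":")
            counts.setdefault(facet, Counter())[value] += 1
    return counts


counts = facet_counts(HITS)
```

A reader could use such counts to offer choices like "format: epub (2), pdf (1)" next to the results, whereas simple tag filtering would not need them.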

While this could be implemented by the scraper in the HTML / JS embedded inside the ZIM (with JS-based faceting, for instance), I am starting to think this is in fact a feature that must be provided by the libzim and implemented in readers. I don't know if we need full faceting capabilities, but this is a question we need to ask ourselves and answer.

benoit74 added enhancement question labels Jan 7, 2025

benoit74 mentioned this issue Jan 7, 2025

Create a ZIM with all versions of a given project openzim/devdocs#42

Open

kelson42 self-assigned this Jan 7, 2025

kelson42 added this to the 9.3.0 milestone Jan 7, 2025

kelson42 modified the milestones: 9.3.0, 10.0.0 Feb 14, 2025
