#734 already highlights that users might want to search for documents in a specific language, but it focuses mostly on the technical issues of using the proper stemmer / stop words.
I think that we should take a step back and consider that there is a bigger overall issue.
I will illustrate this issue with the DevDocs scraper as an example, because this is where it comes from.
With the DevDocs scraper, we intend (see openzim/zim-requests#1230) to create one ZIM with all documentation for a given programming language (e.g. all versions of Python, all versions of Node.JS, ...). We want to do this because, in our experience, developers often switch between versions. For instance at Kiwix, we regularly switch from one Python version to another, depending on which Python version the project at hand currently uses. And we are pretty sure this is a common pattern. So we do not want to create one ZIM per Python version, but preferably one ZIM with all Python versions. This is what we already have from online documentation ZIMed with Zimit: https://library.kiwix.org/viewer#docs.python.org_en
Due to the nature of this documentation, we will have many articles which are very close to one another. As such, suggestions and full-text search become very hard to use. Being able to select a subset of articles to search in would help a lot. For instance, the scraper could attach "tags" to the articles, and the user could then request (via the reader) to search only for documents with a given tag (or set of tags).
This problem is not specific to DevDocs; it is in fact common to many scrapers / ZIMs:
- search only for articles in a given language
- search only for articles in a given format (e.g. only ePub, only PDF in Gutenberg)
- search only for articles on a given topic (here I spoke about programming languages in DevDocs, but we also have topics in TED, and similar things in multiple other scrapers)
All this looks a lot like faceting capabilities for libzim.
While this could be implemented by the scraper in the HTML / JS embedded inside the ZIM, with JS-based faceting for instance, I am starting to think that this is in fact a feature that must be provided by libzim and implemented in readers. I don't know if we need full faceting capabilities, but this is a question we need to ask ourselves and answer.
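To make the idea more concrete, here is a minimal sketch in plain Python of what tag-restricted search and simple facet counts could look like. It does not use libzim or any existing API: the `Entry` class, the `search` and `facet_counts` functions, the `lang:` / `version:` tag naming scheme and the sample paths are all hypothetical and only there for illustration. Title matching is a naive substring check standing in for real suggestion / full-text search.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a ZIM entry; this is not a libzim type."""
    path: str
    title: str
    tags: frozenset = field(default_factory=frozenset)


def search(entries, query, required_tags=()):
    """Return entries whose title contains `query` and that carry every required tag."""
    required = frozenset(required_tags)
    needle = query.lower()
    return [
        e for e in entries
        if needle in e.title.lower() and required <= e.tags
    ]


def facet_counts(entries, prefix):
    """Count how many entries carry each tag starting with `prefix` (e.g. 'version:')."""
    return Counter(t for e in entries for t in e.tags if t.startswith(prefix))


# Hypothetical sample data: the same article present in two Python versions.
ENTRIES = [
    Entry("python-3.11/library/asyncio", "asyncio", frozenset({"lang:en", "version:3.11"})),
    Entry("python-3.12/library/asyncio", "asyncio", frozenset({"lang:en", "version:3.12"})),
    Entry("python-3.12/library/pathlib", "pathlib", frozenset({"lang:en", "version:3.12"})),
]

if __name__ == "__main__":
    # Without a tag filter, near-duplicate articles from every version come back.
    print([e.path for e in search(ENTRIES, "asyncio")])
    # With a tag filter, only the requested version remains.
    print([e.path for e in search(ENTRIES, "asyncio", ["version:3.12"])])
    # Facet counts tell the reader UI which filters it could offer for this query.
    print(facet_counts(search(ENTRIES, "asyncio"), "version:"))
```

In this sketch the scraper would be responsible for attaching tags at creation time, and the reader would only pass the selected tags along with the query. Whether tags alone are enough, or whether readers also need counts per tag as in `facet_counts`, is exactly the "full faceting or not" question raised above.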