Use browser-generated text as IndexData #194

Open
rgaudin opened this issue May 31, 2023 · 3 comments

@rgaudin
Member

rgaudin commented May 31, 2023

WACZ includes a pages.jsonl file that contains a text property for every page (roughly, every HTML entry). This text is extracted from the fully rendered DOM.

Using this as the source for getIndexData() can be a huge boost in index quality for dynamic websites (those that build their DOM in JS), compared with the current situation, in which the text is extracted solely from the HTML source code.

This is controlled by the --text option of the crawler.

From: openzim/warc2zim#81
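
As a rough illustration of what reading that file could look like, here is a minimal Python sketch. It assumes that pages.jsonl holds one JSON object per line, that the first line is a header without a url, and that page entries carry url and text keys; the function name load_page_texts and the sample records are made up for this example and are not warc2zim code.

```python
import io
import json
from typing import Dict, IO


def load_page_texts(lines: IO[str]) -> Dict[str, str]:
    """Map each page URL to its browser-extracted text (hypothetical helper)."""
    texts: Dict[str, str] = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        record = json.loads(raw)
        url = record.get("url")
        text = record.get("text")
        # Assumption: the header line has no "url", and pages crawled
        # without the --text option have no "text" property.
        if url and text:
            texts[url] = text
    return texts


# Made-up sample content standing in for a real pages.jsonl file.
sample = io.StringIO(
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}\n'
    '{"id": "1", "url": "https://example.com/", "title": "Home", "text": "Rendered text"}\n'
    '{"id": "2", "url": "https://example.com/about", "title": "About"}\n'
)
page_texts = load_page_texts(sample)
```

Pages without a text property are simply left out of the mapping, so a caller can tell "no rendered text available" apart from "empty page".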

@benoit74
Collaborator

Is this really mandatory for 2.0?

@kelson42 kelson42 modified the milestones: 2.0.0, 2.1.0 May 28, 2024
@benoit74
Collaborator

We must still keep a fallback to indexing the HTML source code, since we cannot expect pages.jsonl to always be available. warc2zim must work from a WARC file alone; pages.jsonl is only available when warc2zim is used in conjunction with Browsertrix crawler, e.g. in the zimit scraper.

@benoit74 benoit74 modified the milestones: 2.1.0, 2.2.0 Jun 18, 2024
@rgaudin
Member Author

rgaudin commented Jun 18, 2024

I believe this is transparent: if you have index data in pages.jsonl, you set getIndexData(); if you don't, it's not set and libzim will index as it currently does.
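
A minimal sketch of that fallback, in Python and under stated assumptions: PageIndexData and index_data_for are hypothetical names that only mimic the general shape of an index-data object, and the mapping from URL to rendered text is assumed to have been built from pages.jsonl beforehand. This is not the actual warc2zim or libzim API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PageIndexData:
    """Hypothetical stand-in for the index data object handed to libzim."""

    title: str
    content: str

    def has_indexdata(self) -> bool:
        return bool(self.content)

    def get_wordcount(self) -> int:
        return len(self.content.split())


def index_data_for(
    url: str, title: str, page_texts: Dict[str, str]
) -> Optional[PageIndexData]:
    """Return index data when rendered text exists for this URL, else None."""
    text = page_texts.get(url)
    if not text:
        # No pages.jsonl text: the caller leaves index data unset,
        # so libzim keeps indexing the HTML source as it does today.
        return None
    return PageIndexData(title=title, content=text)


# Made-up data for illustration.
rendered = {"https://example.com/": "Text produced by the browser"}
with_text = index_data_for("https://example.com/", "Home", rendered)
without_text = index_data_for("https://example.com/other", "Other", rendered)
```

Returning None rather than an empty object keeps the WARC-only case unchanged: nothing is overridden unless rendered text is actually present.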

@benoit74 benoit74 modified the milestones: 2.2.0, later Aug 7, 2024