Use browser-generated text as IndexData #194

rgaudin · 2023-05-31T15:31:57Z

WACZ includes a pages.jsonl file that contains a text property for every page (~HTML entries). That text is extracted from the fully rendered DOM.

Using this as the source for getIndexData() can be a huge boost in quality for dynamic websites (those building the DOM in JS) versus the current situation, in which the text is extracted solely from the HTML source code.

This is controlled by the --text option of the crawler.
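To illustrate the idea, here is a minimal sketch of reading pages.jsonl into a URL-to-text mapping. It assumes each line is a JSON object carrying `url` and, when the crawler extracted text, a `text` property, and that lines without a `url` (such as a header line) can be skipped. The function name `load_page_texts` is hypothetical and not part of warc2zim.

```python
import json
from typing import Dict, Iterable


def load_page_texts(lines: Iterable[str]) -> Dict[str, str]:
    """Map page URL -> browser-extracted text from pages.jsonl lines.

    Assumption: one JSON object per line, with "url" and optional "text".
    """
    texts: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Ignore malformed lines rather than failing the whole conversion.
            continue
        if not isinstance(entry, dict):
            continue
        url = entry.get("url")
        text = entry.get("text")
        # Lines without a URL or without text contribute nothing.
        if url and text:
            texts[url] = text
    return texts
```

With a real file, this could be called as `load_page_texts(open(path, encoding="utf-8"))`, where `path` points at the pages.jsonl extracted from the WACZ.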

From: openzim/warc2zim#81


benoit74 · 2024-05-28T12:58:18Z

Is this really mandatory for 2.0?

benoit74 · 2024-06-18T09:42:29Z

We must still keep a fallback to indexing the HTML source code, since we cannot expect pages.jsonl to always be available: warc2zim must work from a WARC file alone, and pages.jsonl is only available when warc2zim is used in conjunction with browsertrix crawler, e.g. in the zimit scraper.

rgaudin · 2024-06-18T10:31:34Z

I believe this is transparent: if you have index data in pages.jsonl, you provide it through getIndexData(); if you don't, it is simply not set and libzim indexes as it currently does.
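A rough sketch of that decision, under stated assumptions: `page_texts` is a hypothetical URL-to-text mapping built from pages.jsonl (empty when the file is absent), and returning `None` stands for "do not set custom index data, let libzim index the HTML as it does today". The helper name `index_text_for` is illustrative only and is not an existing warc2zim or libzim function.

```python
from typing import Mapping, Optional


def index_text_for(url: str, page_texts: Mapping[str, str]) -> Optional[str]:
    """Return browser-generated text for url, or None to use default indexing."""
    text = page_texts.get(url)
    # Treat missing or whitespace-only text as "no index data available".
    if text and text.strip():
        return text
    return None


# pages.jsonl available: the rendered text is used for this page.
page_texts = {"https://example.com/": "Rendered page text"}
with_text = index_text_for("https://example.com/", page_texts)

# WARC-only input: no mapping, so the caller falls back to default indexing.
without_text = index_text_for("https://example.com/", {})
```

The point of returning `None` is that the WARC-only case needs no special handling: nothing is set for the entry, so the existing behaviour is preserved.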

rgaudin added the enhancement label May 31, 2023

rgaudin added this to the 2.0.0 milestone May 31, 2023

rgaudin mentioned this issue May 31, 2023

Alternative idea: store WACZ files in ZIM (wacz2zim) openzim/warc2zim#81

Closed

kelson42 modified the milestones: 2.0.0, 2.1.0 May 28, 2024

benoit74 modified the milestones: 2.1.0, 2.2.0 Jun 18, 2024

benoit74 modified the milestones: 2.2.0, later Aug 7, 2024
