Add index generation system that uses offsets into the WACZ itself. #38

anjackson · 2023-09-13T07:27:43Z

This proposed new py-wacz command allows you to generate a CDXJ file where the filenames and offsets refer to the WACZ itself rather than the WARC files within.

The idea is that we should be able to store the WACZ files directly without unpacking them, and use this modified CDXJ data to ingest into OutbackCDX. This should mean we can play back WACZ files as easily as we can WARCs.
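To make the offset arithmetic concrete, here is a minimal sketch (not the code from this patch). It assumes the WARC is stored inside the WACZ without ZIP compression, so a record's offset in the WACZ is the start of the member's data plus the record's offset within the WARC. The helper names `member_data_offset` and `rewrite_cdxj_line`, the demo file names, and the CDXJ field layout are illustrative assumptions.

```python
import io
import json
import struct
import zipfile

LOCAL_HEADER_FIXED_SIZE = 30  # fixed-size part of a ZIP local file header


def member_data_offset(zip_bytes: bytes, member_name: str) -> int:
    """Return the byte offset in the ZIP at which a stored member's data begins."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        info = zf.getinfo(member_name)
        if info.compress_type != zipfile.ZIP_STORED:
            raise ValueError("member must be stored without ZIP compression")
        header_offset = info.header_offset
    # File name length and extra field length sit at bytes 26..29 of the
    # local header; the member's data starts right after those two fields.
    name_len, extra_len = struct.unpack_from("<HH", zip_bytes, header_offset + 26)
    return header_offset + LOCAL_HEADER_FIXED_SIZE + name_len + extra_len


def rewrite_cdxj_line(line: str, wacz_name: str, base_offset: int) -> str:
    """Point a CDXJ line at the WACZ instead of the WARC inside it (hypothetical helper)."""
    prefix, _, payload = line.partition(" {")
    fields = json.loads("{" + payload)
    fields["offset"] = str(int(fields["offset"]) + base_offset)
    fields["filename"] = wacz_name
    return prefix + " " + json.dumps(fields)


def build_demo_wacz() -> bytes:
    """Build a tiny stand-in for a WACZ: a ZIP with one stored 'WARC' member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("datapackage.json", "{}")
        # Fake WARC content: the 8-byte "record" starts at offset 4.
        zf.writestr("archive/data.warc.gz", b"AAAA" + b"RECORD-2" + b"BBBB")
    return buf.getvalue()


demo = build_demo_wacz()
base = member_data_offset(demo, "archive/data.warc.gz")

warc_line = (
    'com,example)/ 20230913072743 '
    '{"url": "http://example.com/", "filename": "data.warc.gz", '
    '"offset": "4", "length": "8"}'
)
wacz_line = rewrite_cdxj_line(warc_line, "demo.wacz", base)

# Read the record straight out of the WACZ bytes using the rewritten index entry.
entry = json.loads(wacz_line.partition(" ")[2].partition(" ")[2])
start, length = int(entry["offset"]), int(entry["length"])
record = demo[start:start + length]
```

With the rewritten line, a replay tool only needs a byte-range read on the WACZ itself, which is the same access pattern it already uses for WARCs.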

Currently DRAFT:

- In case I've missed something and this is a terrible idea?
- Because perhaps the operation should be given a different name than just 'index'? Not sure it's the best name.
- ~~Because it needs a way to override the path prefix, and possibly even the actual WACZ name, depending on the use case.~~
- And it should probably have some tests!

edsu · 2023-09-19T13:55:16Z

This is a pretty ingenious idea!

anjackson · 2023-10-05T07:08:54Z

The CI errors don't seem to be related to my patch?

pydantic.errors.PydanticUserError: A non-annotated attribute was detected: `version = '0.1.0'`. All model fields require a type annotation; if `version` is not meant to be a field, you may be able to resolve this error by annotating it as a `ClassVar` or updating `model_config['ignored_types']`.

Not sure what to do about that.

Add index generation system that uses offsets into the WACZ itself.

153fb61

anjackson mentioned this pull request Sep 13, 2023

Allow direct indexing of WACZ files ukwa/ukwa-manage#113

Open

anjackson added 3 commits September 14, 2023 11:05

Fix formatting with Black.

4125787

Allow output and WACZ prefix to be set.

35f4a23

Tweak code ordering for clarity.

2280a52
