[Bug]: Downloaded WARC files don't have a .gz extension but are actually gzipped (results in playback issues in SolrWayback) #211

Open
Klindten opened this issue May 3, 2024 · 4 comments

Klindten commented May 3, 2024

ArchiveWeb.page Version

v0.11.3

What did you expect to happen? What happened instead?

When downloading WARC 1.1 files and ingesting them into SolrWayback via the UKWA warc-indexer, I expected to be able to replay them, but replay did not work, although the files appeared to be indexed.

We figured out that the downloaded WARC files, with names like:
webrec_boersen_braender.warc

were actually:
webrec_boersen_braender.warc.gz (gzipped!)

When I added .gz to the end of the filename and re-indexed, everything worked again.

Solution:
Could gzipped files be given a .gz extension? SolrWayback can't guess whether a file is gzipped or not.

Step-by-step reproduction instructions

  1. Download and install SolrWayback https://github.com/netarchivesuite/solrwayback/releases
  2. Crawl content via ArchiveWeb.page Extension
  3. Download WARC 1.1 files
  4. Index via SolrWayback
  5. Replay results
  6. Replay doesn't work

Additional details

No response

@Klindten Klindten added the bug Something isn't working label May 3, 2024

thomasegense commented May 3, 2024

We do not want every playback resource load to have to detect whether the file is gzipped or not. It would slow down playback too much.

The warc-indexer can index them but gives a warning. The warc-indexer accepts them because it falls back to trying gzip, but this is only done once for the whole file.

Section 12.3 GZIP WARC file name suffix
https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
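
For illustration, a one-time, per-file content check can be sketched as below. This is a minimal sketch, not how the warc-indexer or SolrWayback actually implement their fallback; the function names `is_gzipped` and `open_warc` are hypothetical. It relies only on the fact that a gzip stream starts with the two bytes 0x1f 0x8b.

```python
import gzip

# Every gzip member starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def open_warc(path):
    """Open a WARC for reading based on its content, not its file name.

    Hypothetical helper: the check runs once when the file is opened,
    not once per record.
    """
    if is_gzipped(path):
        return gzip.open(path, "rb")
    return open(path, "rb")
```

A check like this costs one extra two-byte read per opened file; whether that is acceptable for per-resource playback is a separate question from whether it is acceptable for indexing.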

@tw4l tw4l self-assigned this May 7, 2024
@tw4l tw4l moved this from Triage to Todo in Webrecorder Projects May 7, 2024
ikreymer (Member) commented May 7, 2024

> We do not want every playback resource load to have to detect whether the file is gzipped or not. It would slow down playback too much.
>
> The warc-indexer can index them but gives a warning. The warc-indexer accepts them because it falls back to trying gzip, but this is only done once for the whole file.

Yes, it should only need to do that once for the whole file - they are all compressed, so it should treat them all as compressed, right?

The reason it was done this way was to make it easier for users who might double-click on a WARC file, and to make it possible to associate the .warc extension with a replay tool. Unfortunately, a file with a .warc.gz extension is opened by the system's gzip utility, resulting in a worse experience for the end user.

But if you are ingesting WARCs into SolrWayback, could you just rename them to match your naming needs? I assume there is some automated process involved that ingests them, is there not?
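
As a rough sketch of what such a rename step could look like in an ingest pipeline (the function name `add_gz_suffix` and the directory layout are assumptions, not part of ArchiveWeb.page or SolrWayback), the script below appends .gz only to .warc files whose content is actually gzipped:

```python
from pathlib import Path

# Every gzip member starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def add_gz_suffix(directory):
    """Rename gzipped *.warc files in `directory` to *.warc.gz.

    Hypothetical helper; returns the list of new paths.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).glob("*.warc")):
        with path.open("rb") as f:
            if f.read(2) != GZIP_MAGIC:
                continue  # genuinely uncompressed WARC: keep its name
        target = path.with_name(path.name + ".gz")
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling `add_gz_suffix("downloads")` before indexing would leave old uncompressed .warc files untouched and only rename the gzipped ones.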

Klindten (Author) commented May 8, 2024

For our normal use case, ingesting files into the big web archive, the UKWA warc-indexer will take care of it.

For manually indexing WARC files, we'll just rename the files as suggested. It would be great to update the download section with info on the .gz files, perhaps like paragraph 2 above, so that people making their own crawls and using e.g. SolrWayback know they are gzipped files.

We will also write this in our SolrWayback readme.

I think the reason we asked is that files from Browsertrix have the .gz extension...

@thomasegense

The issue happened for at least 2 participants at the SolrWayback workshop in Paris when they tried to index WARC files they had created themselves.

It is correct that we can rename the files to warc.gz before we store them in our repository and the problem is solved, but it is still a manual process that all national libraries using SolrWayback have to remember, as does every researcher using SolrWayback for their collections.

I don't know what would be a good solution. I can see the inconvenience of the wrong default file association for .gz files.
But breaking the spec, and the many open source tools that work with arc/arc.gz/warc/warc.gz files, is also unfortunate.
No one creates warc (without gz) files anymore, but large national collections still have many old warc files.
