[Bug]: Downloaded WARC files don't have a .gz extension but are actually gzipped (results in issue with playback in SolrWayback) #211

Klindten · 2024-05-03T10:43:08Z

ArchiveWeb.page Version

v0.11.3

What did you expect to happen? What happened instead?

When downloading WARC 1.1 files and ingesting them into SolrWayback via the UKWA warc-indexer, I expected to be able to replay the WARC files, but this wasn't the case, although they seemed to be indexed.

We figured out that the downloaded WARC-files with names like:
webrec_boersen_braender.warc

were actually:
webrec_boersen_braender.warc.gz (gzipped!)

When I added .gz to the end of the filename and re-indexed, everything worked again.

Solution:
Could gzipped files be given the .gz extension? SolrWayback can't guess from the filename whether a file is gzipped or not.

Step-by-step reproduction instructions

1. Download and install SolrWayback: https://github.com/netarchivesuite/solrwayback/releases
2. Crawl content via the ArchiveWeb.page extension
3. Download WARC 1.1 files
4. Index via SolrWayback
5. Replay results
6. Replay doesn't work

Additional details

No response

thomasegense · 2024-05-03T11:08:36Z

We do not want every resource loaded during playback to have to detect whether it is gzipped or not. That would slow down playback too much.

The warc-indexer can index them but gives a warning. The reason the warc-indexer accepts them is that it has a fallback and tries gzip. But this is only done once for the whole file.

Section 12.3 GZIP WARC file name suffix
https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
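For illustration, a once-per-file fallback of the kind described above could look roughly like the sketch below. This is not the warc-indexer's actual code; `open_warc` is a hypothetical helper name, and the only assumption it relies on is that every gzip stream starts with the two magic bytes `1f 8b`.

```python
import gzip
from typing import BinaryIO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def open_warc(path: str) -> BinaryIO:
    """Open a WARC file for reading, sniffing compression once per file.

    Hypothetical helper: the file extension is ignored, only the first
    two bytes decide whether the content is treated as gzipped.
    """
    with open(path, "rb") as probe:
        is_gzipped = probe.read(2) == GZIP_MAGIC
    # gzip.open reads concatenated (multi-member) gzip data transparently,
    # which is how per-record-compressed WARC files are laid out.
    return gzip.open(path, "rb") if is_gzipped else open(path, "rb")
```

The check costs one extra two-byte read per file, but a playback path that seeks directly to a record offset would still need to know in advance which kind of file it is reading.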

ikreymer · 2024-05-07T17:36:37Z

> We do not want every resource loaded during playback to have to detect whether it is gzipped or not. That would slow down playback too much.
>
> The warc-indexer can index them but gives a warning. The reason the warc-indexer accepts them is that it has a fallback and tries gzip. But this is only done once for the whole file.

Yes, it should only need to do that once for the whole file. The records are all compressed, so it should treat them all as compressed, right?

The reason it was done this way was to make it easier for users who might double-click on a WARC file, and to make it possible to associate the .warc extension with a replay tool. Unfortunately, a .warc.gz file gets opened by the system's gzip utility instead, resulting in a worse experience for the end user.

But if you are ingesting WARCs into SolrWayback, could you just rename them to match your naming needs? I assume there is some automated process involved that ingests them, is there not?
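As a rough sketch of what such a rename step could look like in an ingest pipeline (the function name `rename_gzipped_warcs` and the flat directory layout are assumptions for illustration, not part of ArchiveWeb.page or SolrWayback):

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def rename_gzipped_warcs(directory: str) -> list[Path]:
    """Append .gz to every *.warc file in `directory` that is really gzipped.

    Hypothetical helper: plain (uncompressed) WARC files are left untouched,
    and an existing target file is never overwritten.
    """
    renamed = []
    # sorted() materialises the listing before any file is renamed.
    for path in sorted(Path(directory).glob("*.warc")):
        with path.open("rb") as f:
            if f.read(2) != GZIP_MAGIC:
                continue  # not gzipped, keep the .warc name
        target = path.with_name(path.name + ".gz")
        if target.exists():
            continue  # do not clobber an existing .warc.gz
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_gzipped_warcs("downloads")` before indexing would turn `webrec_boersen_braender.warc` into `webrec_boersen_braender.warc.gz` only if its content is actually gzipped.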

Klindten · 2024-05-08T10:32:38Z

For our normal use case, ingesting files to the big web archive, the UKWA WARC-indexer will take care of it.

For manual indexing of WARC files we'll just rename the files as suggested. It would be great to update the download section with info on the .gz files, maybe along the lines of paragraph 2 above, so that people making their own crawls and using e.g. SolrWayback know they are .gz files.

We will also write this in our SolrWayback readme.

I think the reason we asked is that files from Browsertrix do have the .gz extension...

thomasegense · 2024-05-13T07:46:28Z

The issue happened for at least 2 participants at the SolrWayback workshop in Paris when they tried to index WARC files they had created themselves.

It is correct that we can rename the files to warc.gz before we store them in our repository, and the problem is solved. But it is still a manual step that all national libraries using SolrWayback have to remember, and so does every researcher using SolrWayback for their collections.

I don't know what would be a good solution. I can see the inconvenience caused by the wrong default file association for .gz files.
But breaking the spec, and the many open source tools that work with arc/arc.gz/warc/warc.gz files, is also unfortunate.
No one creates warc (without gz) files anymore, but large national collections still have many old warc files in their collections.

Klindten added the bug ("Something isn't working") label May 3, 2024

tw4l self-assigned this May 7, 2024

tw4l added this to Webrecorder Projects May 7, 2024

github-project-automation bot moved this to Triage in Webrecorder Projects May 7, 2024

tw4l moved this from Triage to Todo in Webrecorder Projects May 7, 2024
