[Bug]: Downloaded WARC files don't have a .gz extension but are actually gzipped (results in issue with playback in SolrWayback) #211
Comments
We do not want every playback resource load to have to detect whether the file is gzipped or not. It would slow playback down too much. The warc-indexer can index these files but gives a warning. The warc-indexer accepts them because it falls back to trying gzip, but this is only done once for the whole file. Section 12.3 GZIP WARC file name suffix
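As an illustration of the once-per-file check described above, here is a minimal sketch in Python. It is not code from SolrWayback or the warc-indexer; the names `is_gzipped` and `open_warc` are hypothetical. It relies only on the fact that a gzip stream starts with the two bytes `0x1f 0x8b`, so the check reads two bytes when the file is opened rather than once per record.

```python
import gzip

# Every gzip stream begins with these two magic bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def open_warc(path):
    """Open a WARC file as a binary stream, whatever its extension.

    The compression check runs once, when the file is opened,
    not once per record or per playback request.
    """
    if is_gzipped(path):
        return gzip.open(path, "rb")
    return open(path, "rb")
```

A caller would use `open_warc("webrec_boersen_braender.warc")` and read uncompressed bytes either way. Whether a check like this is cheap enough for a given playback path is a separate question that this sketch does not answer.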
Yes, it should only need to do that once for the whole file: the records are all compressed, so it should treat them all as compressed, right? The reason it was done this way was to make it easier for users who might double-click on a WARC file, and to make it possible to associate the .warc extension with a replay tool. Unfortunately, a .warc.gz extension opens with the system's gzip utility, resulting in a worse experience for the end user. But if you are ingesting WARCs into SolrWayback, could you just rename them to match your naming needs? I assume there is some automated process involved that ingests them, is there not?
For our normal use case, ingesting files into the big web archive, the UKWA warc-indexer will take care of it. For manual indexing of WARC files we'll just rename the files as suggested. It would be great to update the download section with info on the .gz files, maybe like paragraph 2 above, so people making their own crawls and using e.g. SolrWayback know that they are .gz files. We will also write this in our SolrWayback readme. I think the reason we asked is that files from Browsertrix come with the .gz extension...
The issue happened for at least 2 participants at the SolrWayback workshop in Paris when they tried to index WARC files. It is correct that we can rename the files to .warc.gz before we store them in our repository and the problem is solved, but it is still a manual step that all national libraries using SolrWayback have to remember, as does every researcher using SolrWayback for their collections. I don't know what would be a good solution. I can see the inconvenience of the wrong default file association for .gz files.
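The renaming workaround discussed in this thread could be automated along these lines. This is a minimal sketch under stated assumptions, not part of ArchiveWeb.page or SolrWayback: the function name `fix_warc_extensions` and its `dry_run` parameter are hypothetical, and it assumes all files to check sit in one directory. It renames a `*.warc` file to `*.warc.gz` only when the file actually starts with the gzip magic bytes, and it never overwrites an existing file.

```python
from pathlib import Path

# Every gzip stream begins with these two magic bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def fix_warc_extensions(directory, dry_run=True):
    """Rename gzipped *.warc files in `directory` to *.warc.gz.

    Returns a list of (old_path, new_path) pairs. With dry_run=True
    (the default) nothing is renamed; the pairs show what would change.
    """
    renamed = []
    for path in sorted(Path(directory).glob("*.warc")):
        with open(path, "rb") as f:
            if f.read(2) != GZIP_MAGIC:
                continue  # genuinely uncompressed WARC: leave it alone
        target = path.with_name(path.name + ".gz")
        if target.exists():
            continue  # do not overwrite an existing .warc.gz
        if not dry_run:
            path.rename(target)
        renamed.append((path, target))
    return renamed
```

Running `fix_warc_extensions("downloads")` first lists the planned renames, and `fix_warc_extensions("downloads", dry_run=False)` applies them before the files are handed to the indexer. The directory name here is only an example.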
ArchiveWeb.page Version
v0.11.3
What did you expect to happen? What happened instead?
When downloading WARC 1.1 files and ingesting them into SolrWayback via the UKWA warc-indexer, I expected to be able to replay the WARC files, but this wasn't the case, although they seemed to be indexed.
We figured out that the downloaded WARC files with names like:
webrec_boersen_braender.warc
were actually:
webrec_boersen_braender.warc.gz (gzipped!)
When I added .gz to the end of the filename and re-indexed, everything worked again.
Solution:
Could gzipped files be named with a .gz extension? SolrWayback can't guess whether a file is gzipped or not.
Step-by-step reproduction instructions
Additional details
No response