URL agnostic deduplication of WARC #13

Open
Arkiver2 opened this issue Aug 25, 2016 · 0 comments
Arkiver2 commented Aug 25, 2016

This would be useful for grabs where the exact same images are fetched under different URLs. Each duplicate should be written as a revisit record that points from its URL to the URL of the first copy. Duplicated URLs are best discovered by comparing content hashes.
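To make the idea concrete, here is a rough sketch of the hash comparison, not warcat code. It assumes the records have already been read into `(url, payload)` pairs; the names `payload_digest` and `plan_dedup` are made up for illustration, and SHA-1 with Base32 is assumed only because that is the usual form of `WARC-Payload-Digest`.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a digest label in the usual WARC form: 'sha1:' + Base32 SHA-1."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def plan_dedup(records):
    """Decide for each (url, payload) pair whether to keep it or mark it a revisit.

    The URL is never compared, only the payload hash, so the same image
    served under two different URLs is detected as a duplicate.
    """
    first_seen = {}  # digest -> URL of the first record with that payload
    plan = []
    for url, payload in records:
        digest = payload_digest(payload)
        original = first_seen.get(digest)
        if original is None:
            # First time this payload appears: keep the full response record.
            first_seen[digest] = url
            plan.append({"url": url, "action": "keep", "digest": digest})
        else:
            # Same payload seen before: replace with a revisit to the original URL.
            plan.append({
                "url": url,
                "action": "revisit",
                "refers_to": original,
                "digest": digest,
            })
    return plan
```

Each `revisit` entry carries what a revisit record would need: the duplicate's own URL, the URL it refers to, and the shared digest. Keeping the first occurrence as the original is an arbitrary choice in this sketch.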

This would be used for the Flickr Archive Team project, where the WARCs would be postprocessed with warcat deduplication.

edit: better explanation of what this would be used for.
