Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clueweb22 #213

Closed
wants to merge 114 commits into from
Closed

Conversation

janheinrichmerker
Copy link
Contributor

@janheinrichmerker janheinrichmerker commented Oct 19, 2022

I'd like to keep this PR as a way of tracking progress of the ir_datasets integration for ClueWeb22.
Of course, the implementation is far from finished (as you can see by the numerous todo's 😆).
But I figure that keeping the process open to other contributors might encourage valuable feedback and discussion.

And of course, this PR would close #210 😉

@seanmacavaney
Copy link
Collaborator

Wow-- thanks! Seems to be coming along nicely. The vdom structure is a bit complicated, but I guess it needs to be in order to properly represent the data.

@janheinrichmerker
Copy link
Contributor Author

Yep, I haven't started with the VDOM type yet, but will as soon as the documentation is up.

@seanmacavaney
Copy link
Collaborator

Thanks!

Looks like there are still some py36 incompatibilities: ImportError: cannot import name 'Final' from 'typing'.

My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks. You make some reasonable counter-points, though, and I think I'm inclined to agree on the current path forward. But maybe it's worth getting some additional input before committing to it.

@janheinrichmerker
Copy link
Contributor Author

My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks.

Well, with the current approach it is already easy (just use clueweb22/b/text instead of clueweb22/b) and optimized (for clueweb22/b/text we would only look at the text files, no WARC is touched).

So why is it a problem to have users explicitly choose clueweb22/b/text if they only care about the text?

I'm now going to test everything with a Python 3.6 interpreter, just to be sure.

@janheinrichmerker
Copy link
Contributor Author

I'd like to add that there are also datasets in ir_datasets where the derived datasets are a suffix to the original dataset:

So I don't see a general pattern for preferring shorter IDs for the "only text"-version.

@janheinrichmerker
Copy link
Contributor Author

That should have been the last few 3.7-incompatible things.

@seanmacavaney
Copy link
Collaborator

Awesome, thanks!

@seanmacavaney
Copy link
Collaborator

Maybe I'd feel a bit more comfortable if we had some performance benchmarks. E.g., how fast is it to iterate the first 100k documents for the combined vs text-only versions?

@janheinrichmerker
Copy link
Contributor Author

These might not be too accurate as I'm accessing the files remotely via CephFS but here you go:

[INFO] [starting] first 100k docs, just text
100000it [00:07, 12524.57it/s]
[INFO] [finished] first 100k docs, just text [8.06s]
[INFO] [starting] first 100k docs, with html, txt, vdom, inlink, outlink
[WARNING] URL hash mismatch for clueweb22-de0000-00-13406: txt URL hash was 9D5A53C6ACCB07B2C2319A4E5E44AB76 but html URL hash was B6956297B5EBBDFEAABF458F2FA5EADC
[WARNING] URL mismatch for clueweb22-de0000-00-13406: outlink URL was https://www.jovanovic.com/quotidien.htm but html URL was https://www.jovanna.de/
[WARNING] URL hash mismatch for clueweb22-de0000-00-13406: outlink URL hash was 9D5A53C6ACCB07B2C2319A4E5E44AB76 but html URL hash was B6956297B5EBBDFEAABF458F2FA5EADC
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: txt URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
[WARNING] URL mismatch for clueweb22-de0000-01-14834: inlink URL was https://simon.ccbcmd.edu/pls/PROD/bwskalog.p_disploginnew?in_id=&cpbl=&newid= but html URL was https://simon-transporte.com/
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: inlink URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
[WARNING] URL mismatch for clueweb22-de0000-01-14834: outlink URL was https://simon.ccbcmd.edu/pls/PROD/bwskalog.p_disploginnew?in_id=&cpbl=&newid= but html URL was https://simon-transporte.com/
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: outlink URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
100000it [03:04, 541.70it/s]
[INFO] [finished] first 100k docs, with html, txt, vdom, inlink, outlink [03:05]

As expected parsing the WARC files is 22x slower than just reading the JSONL file.

@seanmacavaney
Copy link
Collaborator

Great news, my copy of the CW22 drive arrived.

@janheinrichmerker
Copy link
Contributor Author

Great to hear that!

@janheinrichmerker
Copy link
Contributor Author

I've updated the branch to reflect upstream changes and added default_text() implementations.

@janheinrichmerker
Copy link
Contributor Author

Is anything still blocking the merge?

@seanmacavaney
Copy link
Collaborator

Sorry -- the only thing blocking is finding the time to run through the tests on my end.

@janheinrichmerker
Copy link
Contributor Author

Hey @seanmacavaney, have you found time to run the tests? Now that the ClueWeb22 is used in a number of research papers, I really think it would be worth it to add it to ir_datasets. If there is anything I can help with, please let me know.

@janheinrichmerker
Copy link
Contributor Author

Closing this PR in favor of the new ir-datasets-clueweb22 extension.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ClueWeb22
2 participants