ClueWeb22 #210

Open
11 of 12 tasks
janheinrichmerker opened this issue Oct 5, 2022 · 9 comments

@janheinrichmerker
Contributor

janheinrichmerker commented Oct 5, 2022

Dataset Information:

ClueWeb22 is the newest in the Lemur Project's ClueWeb line of datasets that support research on information retrieval, natural language processing and related human language technologies. This new dataset is being developed by the Lemur Project with significant assistance and support from Microsoft Corporation.

The ClueWeb22 dataset has several novel characteristics compared with earlier ClueWeb datasets.

  • It is much larger.
  • Documents are of higher quality.
  • Documents are provided in several formats (HTML, clean text, screen shots).
  • Document page analyses are provided that reveal where on a page text was displayed, and what was near it.
  • The dataset includes a large set of crowdsourced queries and shallow relevance assessments (a pseudo search log).

Authors: Arnold Overwijk, Chenyan Xiong (@xiongchenyan), Jamie Callan (@jamiecallan), Cameron VandenBerg, Xiao Lucy Liu

Links to Resources:

Dataset ID(s) & supported entities:

  • clueweb22/a: 200M docs, queries, qrels, scoreddocs?
  • clueweb22/b: 2B docs, queries?, qrels?, scoreddocs?
  • clueweb22/l: 10B docs, queries?, qrels?, scoreddocs?

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • Dataset definition (in ir_datasets/datasets/clueweb22.py)
  • Tests (in tests/integration/clueweb22.py)
  • Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • Documentation (in ir_datasets/etc/clueweb22.yaml)
  • Downloadable content (in ir_datasets/etc/downloads.json) Manual download required.
    • Download instructions added
    • Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

The dataset is planned to be used for shared tasks in the near future.
I also personally think it is of very high value to have this in ir_datasets.

Open Questions

  • Where to get the topic tag mentioned in the paper?
  • Is VDOM-Paragraph the same as VDOM-Passage in the WARC headers?
  • What does the ? mean in the inlink format's anchor type description?

@seanmacavaney
Collaborator

Excellent, thanks @heinrichreimer!

A while back I requested that they include offset files to facilitate random lookups, and it looks like it made it into the final spec! This will make adding the datasets much easier, since we won't need to save zlib states and release our own checkpoint files.
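To illustrate why offset files make random lookups cheap, here is a minimal sketch. It assumes that each record is stored as its own gzip member and that the offset file lists the byte position where each member starts, plus a final end position. The actual ClueWeb22 file layout is not described in this thread, so the helper names and the record format below are hypothetical.

```python
import gzip
import io


def build_sample(records):
    """Write each record as a separate gzip member and collect byte offsets.

    Hypothetical stand-in for a compressed data file plus its offset file.
    """
    buffer = io.BytesIO()
    offsets = []
    for record in records:
        offsets.append(buffer.tell())  # start of this record's gzip member
        buffer.write(gzip.compress(record.encode("utf-8")))
    offsets.append(buffer.tell())  # end sentinel, marks the end of the last member
    return buffer.getvalue(), offsets


def read_record(stream, offsets, index):
    """Seek straight to one record and decompress only that member."""
    start, end = offsets[index], offsets[index + 1]
    stream.seek(start)
    return gzip.decompress(stream.read(end - start)).decode("utf-8")


records = ['{"id": "doc-0"}', '{"id": "doc-1"}', '{"id": "doc-2"}']
data, offsets = build_sample(records)
print(read_record(io.BytesIO(data), offsets, 1))
```

With offsets like these, a lookup never has to decompress the records that precede the requested one, which is what saved zlib states would otherwise be needed for.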

@jamiecallan

jamiecallan commented Oct 11, 2022 via email

@janheinrichmerker
Contributor Author

Thanks for explaining the file structure!
A small data sample would indeed be nice. Then we can start implementing the required parsers for ir_datasets even before our full copy arrives.
@seanmacavaney do you want to do it or should I take some time to implement the parser then?

@seanmacavaney
Collaborator

We're still in the process of requesting the data here. A sample would indeed be helpful for getting started.

@heinrichreimer -- I've got a pretty busy couple of weeks coming up, would you be able to take a stab at the implementation?

@janheinrichmerker
Contributor Author

Sure, I'll try my best. I guess most of the code can be "recycled" from ClueWeb12 anyway.

@seanmacavaney
Collaborator

Awesome, thanks! The most challenging bit is doing lookups, but with the offset file that's included, this should be much easier.

Feel free to reach out if you have problems/questions/etc. Thanks!

@janheinrichmerker
Contributor Author

As ClueWeb22 also features language tags and is structured in a way to efficiently filter by language, I'll also include subsets like this:

  • clueweb22/a/en
  • clueweb22/a/de
  • clueweb22/a/zn
  • ...
  • clueweb22/a/other-languages
  • clueweb22/b/en
  • ...
  • clueweb22/l/en
  • ...
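
The naming scheme above could be enumerated roughly as follows. This is only a sketch: the language tuple is deliberately truncated to a few tags named in this thread, and `language_subset_ids` is a hypothetical helper, not part of ir_datasets.

```python
from itertools import product

# Categories come from the issue description.
CATEGORIES = ("a", "b", "l")
# Truncated on purpose: the full list of language tags is not given here.
LANGUAGES = ("en", "de", "other-languages")


def language_subset_ids(categories=CATEGORIES, languages=LANGUAGES):
    """Return one dataset ID per (category, language) pair."""
    return [
        f"clueweb22/{category}/{language}"
        for category, language in product(categories, languages)
    ]


for dataset_id in language_subset_ids():
    print(dataset_id)
```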

@seanmacavaney
Collaborator

Great, thanks. This is aligned with clueweb09/[lang]

@janheinrichmerker
Contributor Author

Because the categories are subsets of the larger ones, I've now also added "views" that can be used, for example, to parse only the plain text from the B category. The keys would be clueweb22/b/as-l, clueweb22/b/as-a, clueweb22/b/as-l/en, clueweb22/b/as-a/en and so on.
To avoid cluttering the list of dataset IDs, we could also skip the language-specific versions of the "views".
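
For illustration, the key scheme with optional view and language parts could be taken apart as in the sketch below. The `DatasetId` tuple and `parse_dataset_id` function are hypothetical names used only here. They assume a view segment always starts with `as-` and comes before the language segment, as in the keys listed above.

```python
from typing import NamedTuple, Optional


class DatasetId(NamedTuple):
    category: str            # e.g. "b"
    view: Optional[str]      # e.g. "l" for "as-l", None if no view is given
    language: Optional[str]  # e.g. "en", None if no language is given


def parse_dataset_id(dataset_id: str) -> DatasetId:
    """Split a key such as clueweb22/b/as-l/en into its parts."""
    parts = dataset_id.split("/")
    if len(parts) < 2 or parts[0] != "clueweb22":
        raise ValueError(f"not a ClueWeb22 dataset ID: {dataset_id!r}")
    category, rest = parts[1], parts[2:]
    view = None
    if rest and rest[0].startswith("as-"):
        view = rest[0][len("as-"):]
        rest = rest[1:]
    if len(rest) > 1:
        raise ValueError(f"unexpected trailing segments: {dataset_id!r}")
    language = rest[0] if rest else None
    return DatasetId(category, view, language)


print(parse_dataset_id("clueweb22/b/as-l/en"))
print(parse_dataset_id("clueweb22/b/as-a"))
```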
