
harvest resource files behind DOI #56

Open
pvgenuchten opened this issue Oct 3, 2024 · 0 comments

For several purposes we want to index the contents of the resource file that sits behind a DOI. This applies to reports (doc/pdf/html), not to datasets or software:

  • Full-text search optimisation
  • Input for the chatbot
  • Knowledge/context extraction using NLP to augment metadata

In theory, resolving the DOI leads directly to the resource. In many cases, however, the DOI resolves to a metadata (landing) page, which has one or more links to download the resource.
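
One way to handle the landing-page case is to scan the resolved HTML for candidate download links. The sketch below is an illustration only: the names `DownloadLinkParser`, `find_download_links` and `REPORT_EXTENSIONS` are hypothetical, the extension list is an assumption, and it assumes the page exposes the file either as a plain `<a href>` or through a `citation_pdf_url` meta tag, which not every repository provides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed file extensions for report-like resources (not specified in the issue).
REPORT_EXTENSIONS = (".pdf", ".doc", ".docx")


class DownloadLinkParser(HTMLParser):
    """Collect report-like links from <a href> and citation_pdf_url meta tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._add(attrs["href"])
        elif (tag == "meta" and attrs.get("name") == "citation_pdf_url"
              and attrs.get("content")):
            self._add(attrs["content"])

    def _add(self, href):
        # Resolve relative links against the landing page URL.
        url = urljoin(self.base_url, href)
        # Ignore the query string when checking the file extension.
        path = url.split("?", 1)[0].lower()
        if path.endswith(REPORT_EXTENSIONS) and url not in self.links:
            self.links.append(url)


def find_download_links(html, base_url):
    """Return candidate download URLs found in a landing page, in document order."""
    parser = DownloadLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

A harvester could call `find_download_links` only when the resolved response has mimetype `text/html`, and store the response body directly otherwise.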

Results of the harvest are stored in a table:

| identifier | content | mimetype | error | hash | date |
|---|---|---|---|---|---|
| 10.5281/zenodo.8012910 | binary | application/pdf | place error here, if an error occurred | 3AD4EF42 | 2024-09-01 |
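
As a rough sketch, the table could be modelled as follows. SQLite is used here only to keep the example self-contained; the table name `harvest` and the helpers `content_hash` and `store_result` are hypothetical. The issue does not name the hash algorithm: the example value has eight hex digits, which matches the shape of a CRC32, so that is what this sketch assumes.

```python
import sqlite3
import zlib
from datetime import date

# Columns follow the table above; names and types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS harvest (
    identifier TEXT PRIMARY KEY,
    content    BLOB,
    mimetype   TEXT,
    error      TEXT,
    hash       TEXT,
    date       TEXT
)
"""


def content_hash(content: bytes) -> str:
    """Eight uppercase hex digits (CRC32 is an assumption, see lead-in)."""
    return format(zlib.crc32(content), "08X")


def store_result(conn, identifier, content=None, mimetype=None,
                 error=None, today=None):
    """Insert or overwrite the harvest result for one identifier."""
    today = today or date.today().isoformat()
    digest = content_hash(content) if content is not None else None
    conn.execute(
        "INSERT OR REPLACE INTO harvest VALUES (?, ?, ?, ?, ?, ?)",
        (identifier, content, mimetype, error, digest, today),
    )
    conn.commit()
```

A failed harvest would be recorded with `store_result(conn, identifier, error="...")`, leaving `content`, `mimetype` and `hash` empty.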

The task first retrieves all identifiers for which no content exists yet and no error has occurred.
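
The selection step could look roughly like the query below. It assumes a hypothetical `records` table that holds all known identifiers and a `harvest` table shaped like the results table; both names are placeholders, and SQLite is used only for illustration.

```python
import sqlite3

# Identifiers with no harvest row at all, or with a row that has
# neither content nor an error, are still pending.
PENDING_SQL = """
SELECT r.identifier
FROM records AS r
LEFT JOIN harvest AS h ON h.identifier = r.identifier
WHERE h.content IS NULL AND h.error IS NULL
ORDER BY r.identifier
"""


def pending_identifiers(conn):
    """Return identifiers that have neither content nor a recorded error."""
    return [row[0] for row in conn.execute(PENDING_SQL)]
```

Because rows with a recorded error are skipped, a failed identifier is only retried after its `error` value has been cleared.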

If a resource consists of multiple files, zip the files into a single archive and use the mimetype application/zip.
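
A minimal sketch of the bundling rule, using Python's standard `zipfile` module. The function name `bundle_files` and the `(filename, data, mimetype)` tuple layout are assumptions made for this example.

```python
import io
import zipfile


def bundle_files(files):
    """Return (content, mimetype) for a list of (filename, data, mimetype) tuples.

    A single file is passed through unchanged; several files are zipped
    in memory and reported as application/zip.
    """
    if not files:
        raise ValueError("no files to bundle")
    if len(files) == 1:
        _, data, mimetype = files[0]
        return data, mimetype
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data, _ in files:
            archive.writestr(name, data)
    return buffer.getvalue(), "application/zip"
```

The returned pair maps directly onto the `content` and `mimetype` columns of the results table.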
