harvest resource files behind DOI #56

pvgenuchten · 2024-10-03T07:32:07Z

We want to index the contents of the resource file that sits behind a DOI. This applies to doc/pdf/html reports, not to datasets or software. The purposes are:

Full text search optimisation
Input for the chatbot
Knowledge/context extraction using NLP to augment metadata

In theory you can resolve the DOI to arrive at the resource. However, in many cases the DOI resolves to a metadata page, which has one or more links to download the resource.

For example, in Zenodo the DOI https://doi.org/10.5281/zenodo.8012910 opens the metadata page, and https://zenodo.org/api/records/8012910/files-archive downloads the dataset.
In ScienceDirect, http://doi.org/10.1016/j.biteb.2022.100975 is the actual article in HTML, and the PDF is available through https://www.sciencedirect.com/science/article/pii/S2589014X22000329/pdfft
In BonaRes, the DOI https://doi.org/10.20387/bonares-dykr-eh37 points to a generic web application which preloads the selected dataset using JavaScript.
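
For the case where the DOI lands on a static HTML metadata page, the download links could be collected along these lines. This is only a sketch under assumptions: the names `DownloadLinkParser` and `find_download_links` are hypothetical, and it assumes the landing page either exposes a `citation_pdf_url` meta tag or links directly to `.pdf` files. It would not cover pages that load their content with JavaScript, such as the BonaRes example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DownloadLinkParser(HTMLParser):
    """Collect candidate document URLs from a landing page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: the publisher exposes the PDF via a citation_pdf_url meta tag.
        if tag == "meta" and attrs.get("name") == "citation_pdf_url":
            self._add(attrs.get("content"))
        # Fallback heuristic: any anchor whose path ends in .pdf.
        elif tag == "a":
            href = attrs.get("href") or ""
            if href.lower().split("?")[0].endswith(".pdf"):
                self._add(href)

    def _add(self, url):
        if url:
            # Resolve relative links against the landing page URL.
            absolute = urljoin(self.base_url, url)
            if absolute not in self.candidates:
                self.candidates.append(absolute)


def find_download_links(html, base_url):
    """Return absolute candidate URLs found in the landing page HTML."""
    parser = DownloadLinkParser(base_url)
    parser.feed(html)
    return parser.candidates
```

The function only proposes candidates; the harvester would still need to fetch each one and check the returned content type before storing it.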

Results of the harvest are stored in a table:

identifier	content	mimetype	error	hash	date
10.5281/zenodo.8012910	binary	application/pdf	error message, if an error occurred	3AD4EF42	2024-09-01
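
One possible shape for this table, sketched with SQLite. The table name `harvest_result`, the column types, and the functions `create_table` and `store_result` are assumptions for illustration. The issue does not say which hash is used; CRC32 is chosen here only because the example value has 8 hex digits.

```python
import sqlite3
import zlib
from datetime import date


def create_table(conn):
    """Create the results table with the six columns listed in the issue."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS harvest_result (
            identifier TEXT PRIMARY KEY,
            content    BLOB,
            mimetype   TEXT,
            error      TEXT,
            hash       TEXT,
            date       TEXT
        )
        """
    )


def store_result(conn, identifier, content=None, mimetype=None, error=None):
    """Store either the harvested content or the error for one identifier."""
    # Assumption: CRC32 as an 8-digit hex string; the issue does not name the hash.
    digest = format(zlib.crc32(content), "08X") if content is not None else None
    conn.execute(
        "INSERT OR REPLACE INTO harvest_result VALUES (?, ?, ?, ?, ?, ?)",
        (identifier, content, mimetype, error, digest, date.today().isoformat()),
    )
    conn.commit()
```

Storing the hash next to the content would let a later run detect whether the file behind a DOI has changed.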

The task first retrieves all identifiers for which no content exists yet and no error has occurred.
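
That selection step might look roughly like the query below. It assumes a hypothetical `records` table holding all known identifiers and a hypothetical `harvest_result` table with the columns listed above; neither name comes from the issue.

```python
import sqlite3


def pending_identifiers(conn):
    """Return identifiers that have neither content nor an error recorded."""
    rows = conn.execute(
        """
        SELECT r.identifier
        FROM records AS r
        LEFT JOIN harvest_result AS h ON h.identifier = r.identifier
        WHERE h.identifier IS NULL                       -- never attempted
           OR (h.content IS NULL AND h.error IS NULL)    -- row exists but is empty
        ORDER BY r.identifier
        """
    ).fetchall()
    return [row[0] for row in rows]
```

Identifiers with a recorded error are skipped on purpose, so a failing DOI is not retried on every run until someone clears the error.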

If a resource consists of multiple files, zip the files into an archive and use the mimetype application/zip.
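
A minimal sketch of the zipping step, using Python's standard `zipfile` module. The function name `zip_files` and the input shape (a dict mapping file name to bytes) are assumptions.

```python
import io
import zipfile


def zip_files(files):
    """Bundle a dict of {file name: bytes} into one in-memory zip archive.

    Returns the archive bytes together with the mimetype to store.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue(), "application/zip"
```

The returned bytes can go straight into the content column, so a multi-file resource is stored the same way as a single-file one.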
