Open PID with Xarray #139

Open
Marco-DKRZ opened this issue Dec 7, 2023 · 12 comments

@Marco-DKRZ

Is it possible to implement a feature that enables intake-xarray to open a file based on its PID?

For example:

intake.open('hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df')

In this example hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df is a PID handle of a CMIP6 precipitation data set.

@martindurant
Member

Could you please explain what a PID is, and how you map it to the actual asset/file?

@Marco-DKRZ
Author

A PID is a persistent identifier, a long-lasting reference to a digital object (https://en.wikipedia.org/wiki/Persistent_identifier). PIDs can be resolved via a handle server: https://hdl.handle.net/

In the example above, the request would look like this: https://hdl.handle.net/api/handles/21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df, where 21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df is the PID.
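As a minimal sketch of that mapping (the helper name `pid_to_api_url` is made up for illustration and is not part of intake or intake-xarray), the PID string can be turned into the request URL like this. It assumes the `hdl:` prefix is optional and also tolerates an `hdl:/` spelling:

```python
HANDLE_API = "https://hdl.handle.net/api/handles/"


def pid_to_api_url(pid: str) -> str:
    """Map 'hdl:<prefix>/<suffix>' (or a bare '<prefix>/<suffix>') to the REST URL.

    Hypothetical helper, for illustration only.
    """
    handle = pid.strip()
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]
    # Tolerate the 'hdl:/<prefix>/<suffix>' spelling as well.
    handle = handle.lstrip("/")
    if "/" not in handle:
        raise ValueError(f"not a handle of the form prefix/suffix: {pid!r}")
    return HANDLE_API + handle


print(pid_to_api_url("hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df"))
```

The printed URL should be the same request URL as quoted above.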

The json response has an entry with the file location:

values[9]:
    index: 10
    type: "URL_ORIGINAL_DATA"
    data:
        format: "string"
        value: '<locations><location href="http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2021-12-21T13:09:16.983+00:00" host="esgf3.dkrz.de" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /></locations>'
    ttl: 86400
    timestamp: "2021-12-21T13:09:20Z"

This file can be downloaded or opened directly with xarray. An example of this workflow can be found in the following notebook: https://gitlab.dkrz.de/data-infrastructure-services/fdo/-/blob/master/automated_data_access_improved.ipynb?ref_type=heads
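A rough sketch of that lookup, using only the standard library. The function names and the sample record are illustrative assumptions, and the `example.org` location is a placeholder rather than a real file. It assumes the JSON response carries its entries in a top-level `values` list, each shaped like the entry shown above:

```python
import json
import urllib.request
import xml.etree.ElementTree as ET


def fetch_record(api_url: str) -> dict:
    """Download and decode a handle record (needs network access)."""
    with urllib.request.urlopen(api_url, timeout=30) as resp:
        return json.load(resp)


def original_data_urls(record: dict) -> list:
    """Collect every location href listed under URL_ORIGINAL_DATA."""
    urls = []
    for entry in record.get("values", []):
        if entry.get("type") != "URL_ORIGINAL_DATA":
            continue
        # The value is a small XML document: <locations><location href=... /></locations>
        root = ET.fromstring(entry["data"]["value"])
        urls.extend(loc.get("href") for loc in root.iter("location") if loc.get("href"))
    return urls


# Offline sample with a placeholder location, mimicking the entry above.
sample = {
    "values": [
        {
            "index": 10,
            "type": "URL_ORIGINAL_DATA",
            "data": {
                "format": "string",
                "value": '<locations><location href="http://example.org/pr_day.nc" '
                         'host="example.org" /></locations>',
            },
        }
    ]
}
print(original_data_urls(sample))
```

With a real record from `fetch_record(...)`, the first returned URL could then be downloaded or handed to xarray.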

PIDs for this kind of climate data (CMIP6, https://en.wikipedia.org/wiki/Coupled_Model_Intercomparison_Project) are standardized and always use the same keywords.

My question is: would it be possible to implement a function that allows xarray to open a file by simply passing its PID?

@martindurant
Member

Interesting! I see that the HDL server also knows about the "dataset" that this is part of (which links, in turn, to a DOI).

> is it possible to implement a function that allows xarray to open a file by simply passing its PID

Certainly. It would be easy to add to intake-xarray, but I would like to add it to Intake Take2, as this process, "transform URL of known form to other URL of known type", is just the kind of thing it's designed for.

Question:
since "hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df" is essentially URL/file like, would this actually be an fsspec-like operation rather than intake?

@martindurant
Member

Was this closed in error?

With scratch code in Intake 2, I have

In [1]: import intake

In [2]: h = intake.readers.datatypes.Handle("hdl:/21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df")

In [3]: h.to_reader().read()
Out[3]: HDF5, {'url': 'http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc', 'storage_options': None, 'path': '', 'metadata': {'URL': {'format': 'string', 'value': 'https://handle-esgf.dkrz.de/lp/21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df'}, 'AGGREGATION_LEVEL': {'format': 'string', 'value': 'FILE'}, 'FIXED_CONTENT': {'format': 'string', 'value': 'TRUE'}, 'FILE_NAME': {'format': 'string', 'value': 'pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc'}, 'FILE_SIZE': {'format': 'string', 'value': '374071932'}, 'IS_PART_OF': {'format': 'string', 'value': 'hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace'}, 'FILE_VERSION': {'format': 'string', 'value': '1'}, 'CHECKSUM': {'format': 'string', 'value': '76d11477fbb4acbd2d0db1595a9ef16309f53eb6c2874078bfb122167241d2f5'}, 'CHECKSUM_METHOD': {'format': 'string', 'value': 'SHA256'}, 'URL_ORIGINAL_DATA': {'format': 'string', 'value': '<locations><location href="http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2021-12-21T13:09:16.983+00:00" host="esgf3.dkrz.de" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /></locations>'}, 'URL_REPLICA': {'format': 'string', 'value': '<locations><location href="http://esgf-data1.llnl.gov/thredds/fileServer/https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2021-12-27T06:23:20.561+00:00" host="esgf-data1.llnl.gov" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /><location 
href="http://eagle.alcf.anl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2023-11-15T20:37:07.249+00:00" host="eagle.alcf.anl.gov" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /></locations>'}, 'PROBLEM': {'format': 'string', 'value': '500_R_N (2021-12-27T06:23:20.561+00:00);500_R_N (2023-11-15T20:37:07.249+00:00)'}, 'HS_ADMIN': {'format': 'admin', 'value': {'handle': '21.14100/ADMINLIST', 'index': 200, 'permissions': '111111111111'}}}}

In [4]: _.to_reader("xarray").read()
Out[4]:
<xarray.Dataset>
Dimensions:    (time: 7305, bnds: 2, lat: 96, lon: 192)
Coordinates:
  * time       (time) datetime64[ns] 1870-01-01T12:00:00 ... 1889-12-31T12:00:00
  * lat        (lat) float64 -88.57 -86.72 -84.86 -83.0 ... 84.86 86.72 88.57
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
...

@Marco-DKRZ
Author

> Was this closed in error?

Yes, it has been closed in error.

> With scratch code in Intake 2, I have
>
> [quoted code and output from the previous comment]

Perfect, this looks like the workflow I imagined.

Marco-DKRZ reopened this Dec 8, 2023

@Marco-DKRZ
Author

> Interesting! I see that the HDL server also knows about the "dataset" that this is part of (which links, in turn, to a DOI).
>
> > is it possible to implement a function that allows xarray to open a file by simply passing its PID
>
> Certainly. It would be easy to add to intake-xarray, but I would like to add it to Intake Take2, as this process, "transform URL of known form to other URL of known type", is just the kind of thing it's designed for.
>
> Question: since "hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df" is essentially URL/file like, would this actually be an fsspec-like operation rather than intake?

That is actually a good question. The example provided was a single file. However, we also have dataset PIDs, e.g. 21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace. For those, intake might be the better choice.

@martindurant
Member

Using the HAS_PARTS value?

@Marco-DKRZ
Author

What exactly do you mean?
Yes, the aggregated dataset PIDs (e.g. https://hdl.handle.net/api/handles/21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace) show the file PIDs under HAS_PARTS.
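A hedged sketch of reading those file PIDs out of a dataset record. It assumes, without confirmation in this thread, that the `HAS_PARTS` value is a single semicolon-separated string of `hdl:` identifiers and that the record has the same `values` layout as a file record. The function name and the sample PIDs are made up for illustration:

```python
def part_pids(record: dict) -> list:
    """Return the file PIDs listed under HAS_PARTS in a dataset record.

    Assumption: the value is one string with PIDs separated by ';'.
    """
    for entry in record.get("values", []):
        if entry.get("type") == "HAS_PARTS":
            raw = entry["data"]["value"]
            return [pid.strip() for pid in raw.split(";") if pid.strip()]
    return []


# Offline sample with made-up PIDs, not a real dataset record.
dataset_record = {
    "values": [
        {"type": "AGGREGATION_LEVEL", "data": {"format": "string", "value": "DATASET"}},
        {
            "type": "HAS_PARTS",
            "data": {"format": "string", "value": "hdl:21.14100/aaaa;hdl:21.14100/bbbb"},
        },
    ]
}
print(part_pids(dataset_record))
```

Each returned PID could then be resolved individually, in the same way as a single-file PID.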

@martindurant
Member

OK, this class implements it for V2, although some questions remain. It could also be included in this repo for V1.

@Marco-DKRZ
Author

Thanks a lot for implementing it. :-)
Which questions remain?

@martindurant
Member

There are some comments in the code.

It's a little awkward to return data instances, which you then have to do something with; so maybe it would be better to return Xarray readers or even the final xarray instances.

@Marco-DKRZ
Author

That is a valid point. Anyway, it is a start!
