Default Data Sources #286

Open
lesserwhirls opened this issue Oct 7, 2019 · 1 comment
@lesserwhirls
Collaborator

This is a high-level issue to capture an extension to Siphon. The idea is that users of Siphon would have a way to access data without needing to supply a specific source (via class name or URL). Because I'm an unimaginative hack at the best of times, I'll call it "Default Data Sources" for now.

Consider model output. As it is now, you need to know a data source to use Siphon. It would be nice to be able to do something like:

dataset = GFS("0.25", <run date>).gimme()

or

dataset = GFS("0.25", "latest").gimme()

and at that point, you'd have a netCDF4-compatible Dataset object hooked up to the OPeNDAP or cdmremote endpoint for a specific run, or the latest available run, of the 0.25 degree GFS. Depending on the requested run time (or the presence of a bounding box), Siphon may try thredds.ucar.edu, thredds-test.unidata.ucar.edu, or www.ncei.noaa.gov. Running on Jetstream? thredds-jetstream.unidata.ucar.edu bumps up in priority.
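To make the endpoint-priority idea concrete, here is a minimal sketch of how a default-source layer might order the candidate TDS servers. The endpoint list and the `candidate_endpoints` function are invented for illustration; nothing here exists in Siphon today.

```python
# Hypothetical sketch (not Siphon API): order candidate TDS endpoints by
# priority, bumping a "locally close" server to the front when we detect
# we're running in the same environment (e.g. on Jetstream).

DEFAULT_ENDPOINTS = [
    "https://thredds.ucar.edu",
    "https://thredds-test.unidata.ucar.edu",
    "https://www.ncei.noaa.gov",
]
JETSTREAM_ENDPOINT = "https://thredds-jetstream.unidata.ucar.edu"


def candidate_endpoints(on_jetstream=False):
    """Return endpoints in the order a default-source lookup would try them."""
    endpoints = list(DEFAULT_ENDPOINTS)
    if on_jetstream:
        # Same cloud as the Jetstream TDS: try it first.
        endpoints.insert(0, JETSTREAM_ENDPOINT)
    return endpoints
```

A real implementation would also fold in the requested run time and bounding box when ranking servers (e.g. preferring NCEI for archived runs).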

Now, consider a Simple Web Service, such as one of the Upper Air data sources. Currently, siphon requires choosing a specific provider to grab Upper Air data (i.e. WyomingUpperAir or IGRAUpperAir). What if, similar to GFS above, users could simply use:

dataset = UpperAir(sites="all", level="500mb", area="CONUS", <date>).gimme()

and Siphon would pick a default source based on the user-supplied parameters and/or what data are available "locally close" (e.g. in the same cloud).
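The selection logic could start out as a simple rule table. In this sketch, `WyomingUpperAir` and `IGRAUpperAir` are the existing Siphon classes, but the function and its selection rules are purely illustrative assumptions:

```python
# Hypothetical sketch (not Siphon API): pick a default upper-air provider
# from the request parameters. The rules here are invented: bulk or archive
# requests go to IGRA, everything else to the Wyoming archive.

def pick_upper_air_source(sites, date_is_archive=False):
    """Return the name of the provider class a default lookup would use."""
    if sites == "all" or date_is_archive:
        return "IGRAUpperAir"
    return "WyomingUpperAir"
```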

Of course, we'd always want to have a way for the user to determine the actual source for any of these requests. For example:

print(dataset.source_name)
>> "Integrated Global Radiosonde Archive version 2."
print(dataset.source_publishers)
>> "NOAA National Centers for Environmental Information."
print(dataset.source_about)
>> "https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00975"
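One way to carry that provenance is a small record attached to every returned dataset. The attribute names below mirror the example above; the `SourceInfo` class itself is a hypothetical sketch:

```python
from dataclasses import dataclass


# Hypothetical sketch: a provenance record a default-source layer could
# attach to each returned dataset, so users can always see where the data
# actually came from.
@dataclass(frozen=True)
class SourceInfo:
    source_name: str
    source_publishers: str
    source_about: str


# Example values from the IGRA 2 case above:
igra2 = SourceInfo(
    source_name="Integrated Global Radiosonde Archive version 2.",
    source_publishers="NOAA National Centers for Environmental Information.",
    source_about="https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00975",
)
```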

Note: In these examples I've used a method called gimme(). I don't actually suggest that name, but we don't currently have a consistent "give me the data" method across different types of data or different access types. For simplewebservices, we have:

  • request_data, request_all_data, latest_observations, return pandas.DataFrame
  • acis_request returns a dict
  • raw_buoy_data, realtime_observations return str.

For NCSS:

  • get_data returns whatever was in the query
  • get_data_raw returns bytes

For RadarServer:

  • get_catalog returns a TDSCatalog containing the datasets that match a query

Some work by giving you all the data for the request at once, some provide a "remote" view into the data and only pull things as the variables are sliced.

I think what we want at this level of functionality would be pandas.DataFrame for point type data (things that live in, say, siphon.defaultdatasources.point), and xarray.Dataset for gridded type data (things that live in, say, siphon.defaultdatasources.grids) based on a single request/response loop (so no OPeNDAP or cdmremote kinds of access, for consistency).
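That convention could be captured in one dispatch table. In this sketch the real implementations would build a pandas.DataFrame (point data) or an xarray.Dataset (gridded data); strings stand in for the types so the sketch stays dependency-free, and the module paths are the hypothetical ones suggested above:

```python
# Hypothetical sketch: one consistent return type per data kind for the
# default-data-sources layer, based on a single request/response loop.

RETURN_TYPES = {
    "point": "pandas.DataFrame",  # e.g. siphon.defaultdatasources.point
    "grid": "xarray.Dataset",     # e.g. siphon.defaultdatasources.grids
}


def return_type_for(kind):
    """Map a data kind to the single return type users should expect."""
    if kind not in RETURN_TYPES:
        raise ValueError(f"unknown data kind: {kind!r}")
    return RETURN_TYPES[kind]
```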

@lesserwhirls
Collaborator Author

Could tackle #131 at the same time.
