Default Data Sources #286

Open
lesserwhirls opened this issue Oct 7, 2019 · 1 comment
@lesserwhirls
Collaborator

This is a high-level issue to capture an extension to Siphon. The idea is that users of Siphon would have a way to access data without needing to supply a specific source (via class name or URL). Because I'm an unimaginative hack at the best of times, I'll call it "Default Data Sources" for now.

Consider model output. As it is now, you need to know a data source to use Siphon. It would be nice to be able to do something like:

dataset = GFS("0.25", <run date>).gimme()

or

dataset = GFS("0.25", "latest").gimme()

and at that point, you'd have a netCDF4-compatible Dataset object hooked up to the OPeNDAP or cdmremote endpoint for a specific run, or the latest available run, of the 0.25 degree GFS. Depending on the requested run time (or the presence of a bounding box), Siphon may try thredds.ucar.edu, thredds-test.unidata.ucar.edu, or www.ncei.noaa.gov. Running on Jetstream? thredds-jetstream.unidata.ucar.edu bumps up in priority.
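To make the endpoint-priority idea concrete, here is a minimal sketch of how a default-source layer might order the candidate TDS servers. The endpoint list and the `candidate_endpoints` function are invented for illustration; nothing here exists in Siphon today.

```python
# Hypothetical sketch (not Siphon API): order candidate TDS endpoints by
# priority, bumping a "locally close" server to the front when we detect
# we're running in the same environment (e.g. on Jetstream).

DEFAULT_ENDPOINTS = [
    "https://thredds.ucar.edu",
    "https://thredds-test.unidata.ucar.edu",
    "https://www.ncei.noaa.gov",
]
JETSTREAM_ENDPOINT = "https://thredds-jetstream.unidata.ucar.edu"


def candidate_endpoints(on_jetstream=False):
    """Return endpoints in the order a default-source lookup would try them."""
    endpoints = list(DEFAULT_ENDPOINTS)
    if on_jetstream:
        # Same cloud as the Jetstream TDS: try it first.
        endpoints.insert(0, JETSTREAM_ENDPOINT)
    return endpoints
```

A real implementation would also fold in the requested run time and bounding box when ranking servers (e.g. preferring NCEI for archived runs).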

Now, consider a Simple Web Service, such as one of the Upper Air data sources. Currently, siphon requires choosing a specific provider to grab Upper Air data (i.e. WyomingUpperAir or IGRAUpperAir). What if, similar to GFS above, users could simply use:

dataset = UpperAir(sites="all", level="500mb", area="CONUS", <date>).gimme()

and Siphon would pick a default source based on the user-supplied parameters and/or what data are available "locally close" (e.g. in the same cloud).
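The selection logic could start out as a simple rule table. In this sketch, `WyomingUpperAir` and `IGRAUpperAir` are the existing Siphon classes, but the function and its selection rules are purely illustrative assumptions:

```python
# Hypothetical sketch (not Siphon API): pick a default upper-air provider
# from the request parameters. The rules here are invented: bulk or archive
# requests go to IGRA, everything else to the Wyoming archive.

def pick_upper_air_source(sites, date_is_archive=False):
    """Return the name of the provider class a default lookup would use."""
    if sites == "all" or date_is_archive:
        return "IGRAUpperAir"
    return "WyomingUpperAir"
```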

Of course, we'd always want to have a way for the user to determine the actual source for any of these requests. For example:

print(dataset.source_name)
>> "Integrated Global Radiosonde Archive version 2."
print(dataset.source_publishers)
>> "NOAA National Centers for Environmental Information."
print(dataset.source_about)
>> "https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00975"
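One way to carry that provenance is a small record attached to every returned dataset. The attribute names below mirror the example above; the `SourceInfo` class itself is a hypothetical sketch:

```python
from dataclasses import dataclass


# Hypothetical sketch: a provenance record a default-source layer could
# attach to each returned dataset, so users can always see where the data
# actually came from.
@dataclass(frozen=True)
class SourceInfo:
    source_name: str
    source_publishers: str
    source_about: str


# Example values from the IGRA 2 case above:
igra2 = SourceInfo(
    source_name="Integrated Global Radiosonde Archive version 2.",
    source_publishers="NOAA National Centers for Environmental Information.",
    source_about="https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00975",
)
```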

Note: In these examples I've used a method called gimme(). I don't actually suggest that name, but we don't currently have a consistent "give me the data" method across different types of data or different access types. For simplewebservices, we have:

  • request_data, request_all_data, latest_observations, return pandas.DataFrame
  • acis_request returns a dict
  • raw_buoy_data, realtime_observations return str.

For NCSS:

  • get_data returns whatever was in the query
  • get_data_raw returns bytes

For RadarServer:

  • get_catalog returns a TDSCatalog containing the datasets that match a query

Some work by giving you all the data for the request at once, some provide a "remote" view into the data and only pull things as the variables are sliced.

I think what we want at this level of functionality would be pandas.DataFrame for point type data (things that live in, say, siphon.defaultdatasources.point), and xarray.Dataset for gridded type data (things that live in, say, siphon.defaultdatasources.grids) based on a single request/response loop (so no OPeNDAP or cdmremote kinds of access, for consistency).
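That convention could be captured in one dispatch table. In this sketch the real implementations would build a pandas.DataFrame (point data) or an xarray.Dataset (gridded data); strings stand in for the types so the sketch stays dependency-free, and the module paths are the hypothetical ones suggested above:

```python
# Hypothetical sketch: one consistent return type per data kind for the
# default-data-sources layer, based on a single request/response loop.

RETURN_TYPES = {
    "point": "pandas.DataFrame",  # e.g. siphon.defaultdatasources.point
    "grid": "xarray.Dataset",     # e.g. siphon.defaultdatasources.grids
}


def return_type_for(kind):
    """Map a data kind to the single return type users should expect."""
    if kind not in RETURN_TYPES:
        raise ValueError(f"unknown data kind: {kind!r}")
    return RETURN_TYPES[kind]
```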

@lesserwhirls
Collaborator Author

Could tackle #131 at the same time.
