Thredds harvest #86

Open
wants to merge 17 commits into base: master

jyucsiro
Contributor

This PR features a tool called threddsnc2rdf.py.

threddsnc2rdf.py takes a THREDDS endpoint or THREDDS catalog endpoint, crawls the netCDF files it finds there, and outputs BALD RDF or Schema.org JSON-LD descriptions of them.

Examples:

# Output BALD RDF
$ python threddsnc2rdf.py https://data.nodc.noaa.gov/thredds/catalog/ncei/gocd/a0000068/catalog.xml

# Output Schema.org JSON-LD
$ python threddsnc2rdf.py --schema-org https://data.nodc.noaa.gov/thredds/catalog/ncei/gocd/a0000068/catalog.xml
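
For readers unfamiliar with THREDDS catalogs, the crawl step can be pictured roughly as follows. This is a minimal sketch using only the standard library, not the code in threddsnc2rdf.py: the function names and the `SAMPLE` catalog are made up for illustration, and it assumes the catalog follows the THREDDS InvCatalog 1.0 XML layout, where `dataset` elements carry a `urlPath` attribute and `catalogRef` elements point to nested catalogs.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by THREDDS InvCatalog 1.0 documents.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def list_netcdf_url_paths(catalog_xml):
    """Return the urlPath of every dataset whose path ends in .nc."""
    root = ET.fromstring(catalog_xml)
    paths = []
    for dataset in root.iter("{%s}dataset" % THREDDS_NS):
        url_path = dataset.get("urlPath")
        if url_path and url_path.endswith(".nc"):
            paths.append(url_path)
    return paths


def list_child_catalogs(catalog_xml):
    """Return the xlink:href of every nested catalogRef (to crawl next)."""
    root = ET.fromstring(catalog_xml)
    href_attr = "{%s}href" % XLINK_NS
    refs = root.iter("{%s}catalogRef" % THREDDS_NS)
    return [ref.get(href_attr) for ref in refs if ref.get(href_attr)]


# Hand-written sample catalog for illustration only (not fetched from NCEI).
SAMPLE = """<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <service name="dap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="a0000068">
    <dataset name="gocd_a0000068_46023_199909.nc"
             urlPath="ncei/gocd/a0000068/gocd_a0000068_46023_199909.nc"/>
    <dataset name="readme.txt" urlPath="ncei/gocd/a0000068/readme.txt"/>
    <catalogRef xlink:href="sub/catalog.xml" xlink:title="sub" name=""/>
  </dataset>
</catalog>"""

print(list_netcdf_url_paths(SAMPLE))
print(list_child_catalogs(SAMPLE))
```

A real crawler would fetch each catalog over HTTP and recurse into the child catalogs; that part is left out here so the sketch runs offline.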

Example BALD RDF output

@prefix bald: <http://binary-array-ld.net/latest/> .
@prefix ns1: <https://data.nodc.noaa.gov/thredds/dodsC/ncei/gocd/a0000068/gocd_a0000068_46023_199909.nc/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://data.nodc.noaa.gov/thredds/dodsC/ncei/gocd/a0000068/gocd_a0000068_46023_199909.nc> a bald:Container ;
    bald:contains ns1:crs,
        ns1:depth,
        ns1:depth_quality_flag,
        ns1:latitude,
        ns1:latitude_quality_flag,
        ns1:longitude,
        ns1:longitude_quality_flag,
        ns1:time,
        ns1:time_quality_flag,
        ns1:u,
        ns1:u_quality_flag,
        ns1:u_quality_flag_time_ref,
        ns1:u_time_ref,
        ns1:v,
        ns1:v_quality_flag,
        ns1:v_quality_flag_time_ref,
        ns1:v_time_ref ;
    ns1:Conventions "CF-1.6, ACDD-1.3" ;
    ns1:QC_Manual "Reference Manual for Quality Control Subsurface Currents Data, Version 1.0" ;
    ns1:QC_Software "qcscd.R, 1.4, 2017-06-08" ;
    ns1:QC_Version "1.0" ;
    ns1:QC_indicator "1" ;
    ns1:QC_test_codes "1, 1, 1, 1, 1, 1, 1, 1, 1" ;
    ns1:QC_test_names "Platform_Identification, Impossible_Date/Time, Impossible_Location, Position_on_Land, Global_Impossible_Parameter_Values, Spike, Constant_Speed, Constant_Direction, Rate_of_Change_in_Time" ;
    ns1:QC_test_results "1, 1, 1, -9, 1, 1, 1, 1, 1" ;
    ns1:acknowledgment "The most important contributors are the collectors of the original data. Without their efforts, this compilation of data would not have been possible. These data were acquired from the US NOAA National Centers for Environmental Information (NCEI). This work was supported by the NCEI management." ;
    ns1:cdm_data_type "Profile" ;
    ns1:comment " " ;
    ns1:contributor_name "NDBC" ;
    ns1:contributor_role "data originator" ;
    ns1:creator_email "[email protected]" ;
    ns1:creator_institution "NOAA National Centers for Environmental Information " ;
    ns1:creator_name "Charles Sun" ;
    ns1:creator_type "person" ;
    ns1:creator_url <https://www.nodc.noaa.gov/gocd/> ;
    ns1:date_created "2017-06-08T16:47:04Z" ;
    ns1:date_issued "2017-06-08T16:47:09Z" ;
    ns1:date_metadata_modified "2017-06-08T16:47:04Z" ;
    ns1:date_modified "2017-06-08T16:47:09Z" ;
    ns1:featureType "timeSeriesProfile" ;
    ns1:geospatial_bounds "POLYGON((34.71389 -120.96667, 34.71389 -120.96667, 34.71389 -120.96667, 34.71389 -120.96667, 34.71389 -120.96667))" ;
    ns1:geospatial_bounds_crs "EPSG:4326" ;
    ns1:geospatial_bounds_vertical_crs "EPSG:4326" ;
    ns1:geospatial_lat_max 34.71389 ;
    ns1:geospatial_lat_min 34.71389 ;
    ns1:geospatial_lat_resolution "point" ;
    ns1:geospatial_lat_units "degrees_north" ;
    ns1:geospatial_lon_max -120.96667 ;
    ns1:geospatial_lon_min -120.96667 ;
    ns1:geospatial_lon_resolution "point" ;
    ns1:geospatial_lon_units "degrees_east" ;
    ns1:geospatial_vertical_max 328.0 ;
    ns1:geospatial_vertical_min 24.0 ;
    ns1:geospatial_vertical_positive "down" ;
    ns1:geospatial_vertical_resolution "point" ;
    ns1:geospatial_vertical_units "meters" ;
    ns1:gocd_format_version "GOCD-3.0" ;
    ns1:gocd_id "gocd_a0000068_46023_199909.nc" ;
    ns1:history "2017-06-08T16:47:04Z csun writeF291ADCPnc.R 1.0" ;
    ns1:id "0093183" ;
    ns1:institution "NDBC" ;
    ns1:instrument "ADCP" ;
    ns1:instrument_vocabulary "NCEI Ocean Archive System database" ;
    ns1:keywords "EARTH SCIENCE,OCEANS,OCEAN CIRCULATION,OCEAN CURRENTS" ;
    ns1:keywords_vocabulary "GCMD Science Keywords" ;
    ns1:license "These data are openly available to the public Please acknowledge the use of these data with the text given in the acknowledgment attribute." ;
    ns1:metadata_link <https://www.nodc.noaa.gov/cgi-bin/OAS/prd/text/query> ;
    ns1:naming_authority "gov.noaa.nodc" ;
    ns1:ncei_template_version "NCEI_NetCDF_TimeSeriesProfile_Orthogonal_Template_v1.1" ;
    ns1:platform "FIXED PLATFORM" ;
    ns1:platform_vocabulary "NCEI Ocean Archive System database" ;
    ns1:processing_level "not applicable" ;
    ns1:product_version "GOCD 3.0" ;
    ns1:program "NDBC C-MAN stations and moored buoys program" ;
    ns1:project "NDBC C-MAN stations and moored buoys program" ;
    ns1:publisher_email "[email protected]" ;
    ns1:publisher_institution "US DOC; NESDIS; NATIONAL CENTERS FOR ENVIRONMENTAL INFORMATION - IN295" ;
    ns1:publisher_name "US DOC; NESDIS; NATIONAL CENTERS FOR ENVIRONMENTAL INFORMATION - IN295" ;
    ns1:publisher_type "institution" ;
    ns1:publisher_url <https://www.nodc.noaa.gov> ;
    ns1:references <https://www.nodc.noaa.gov/> ;
    ns1:source "global ocean currents in the NCEI archive holdings" ;
    ns1:standard_name_vocabulary "NetCDF Climate and Forecast (CF) Metadata Convention" ;
    ns1:summary "global ocean currents in the NCEI archive holdings" ;
    ns1:time_coverage_duration "P0Y029DT23H00M00S" ;
    ns1:time_coverage_end "1999-09-30T23:49:59Z" ;
    ns1:time_coverage_resolution "R000001/1999-09-01T00:49:59Z/P0Y029DT23H00M00S" ;
    ns1:time_coverage_start "1999-09-01T00:49:59Z" ;
    ns1:title "Global Ocean Currents Database - gocd_a0000068_46023_199909.nc" ;
    ns1:uuid "9fb84bea-612d-4838-bc1c-4393269585bc" .

ns1:latitude_quality_flag a bald:Subject ;
    ns1:_FillValue "-9" ;
    ns1:flag_meanings "good_value probably_good probably_bad bad_value modified_value not_used not_used not_used missing_value" ;
    ns1:flag_values ( 1 2 3 4 5 6 7 8 9 ) ;
    ns1:long_name "Latitude Quality Flags" .

ns1:longitude_quality_flag a bald:Subject ;
    ns1:_FillValue "-9" ;
    ns1:flag_meanings "good_value probably_good probably_bad bad_value modified_value not_used not_used not_used missing_value" ;
    ns1:flag_values ( 1 2 3 4 5 6 7 8 9 ) ;
    ns1:long_name "Longitude Quality Flags" .

ns1:u a bald:Array ;
    bald:references ns1:u_time_ref ;
    bald:shape "(719, 20)" ;
    ns1:C_format "%7.4f" ;
    ns1:FORTRAN_format "F7.4" ;
    ns1:_FillValue "9999.9" ;
    ns1:ancillary_variables ns1:u_quality_flag ;
    ns1:cell_methods "time: point depth:point" ;
    ns1:coordinates ns1:depth,
        ns1:latitude,
        ns1:longitude,
        ns1:time ;
    ns1:coverage_content_type "physicalMeasurement" ;
    ns1:data_max 0.0305 ;
    ns1:data_min -0.031 ;
    ns1:grid_mapping ns1:crs ;
    ns1:long_name "Eastward Velocity Component" ;
    ns1:standard_name "eastward_sea_water_velocity" ;
    ns1:units "m s-1" ;
    ns1:valid_max 5.0 ;
    ns1:valid_min -5.0 .

ns1:v a bald:Array ;
    bald:references ns1:v_time_ref ;
    bald:shape "(719, 20)" ;
    ns1:C_format "%7.4f" ;
    ns1:FORTRAN_format "F7.4" ;
    ns1:_FillValue "9999.9" ;
    ns1:ancillary_variables ns1:v_quality_flag ;
    ns1:cell_methods "time: point depth:point" ;
    ns1:coordinates ns1:depth,
        ns1:latitude,
        ns1:longitude,
        ns1:time ;
    ns1:coverage_content_type "physicalMeasurement" ;
    ns1:data_max 0.045 ;
    ns1:data_min -0.0205 ;
    ns1:grid_mapping ns1:crs ;
    ns1:long_name "Northward Velocity Component" ;
    ns1:standard_name "northward_sea_water_velocity" ;
    ns1:units "m s-1" ;
    ns1:valid_max 5.0 ;
    ns1:valid_min -5.0 .

ns1:depth_quality_flag a bald:Array ;
    bald:shape "(20,)" ;
    ns1:_FillValue "-9" ;
    ns1:flag_meanings "good_value probably_good probably_bad bad_value modified_value not_used not_used not_used missing_value" ;
    ns1:flag_values ( 1 2 3 4 5 6 7 8 9 ) ;
    ns1:long_name "Depth Quality Flags" .

ns1:time_quality_flag a bald:Array ;
    bald:references ns1:time ;
    bald:shape "(719,)" ;
    ns1:_FillValue "-9" ;
    ns1:flag_meanings "good_value probably_good probably_bad bad_value modified_value not_used not_used not_used missing_value" ;
    ns1:flag_values ( 1 2 3 4 5 6 7 8 9 ) ;
    ns1:long_name "Time Quality Flags" .

ns1:u_quality_flag a bald:Array ;
    bald:references ns1:u_quality_flag_time_ref ;
    bald:shape "(719, 20)" ;
    ns1:_FillValue "-9" ;
    ns1:flag_meanings "good_value probably_good probably_bad bad_value modified_value not_used not_used not_used missing_value" ;
    ns1:flag_values ( 1 2 3 4 5 6 7 8 9 ) ;
    ns1:long_name "Eastward Velocity component QC Flags" .

ns1:u_quality_flag_time_ref a bald:Reference,
        bald:Subject ;
    bald:array ns1:time ;
    bald:childBroadcast "(719, 1)" .

ns1:u_time_ref a bald:Reference,
        bald:Subject ;
    bald:array ns1:time ;
    bald:childBroadcast "(719, 1)" .

ns1:v_quality_flag a bald:Array ;
    bald:references ns1:v_quality_flag_time_ref ;
    bald:shape "(719, 20)" ;
    ns1:_FillValue "-9" ;
    ns1:flag_meanings "good_value probably_good probably_bad bad_value modified_value not_used not_used not_used missing_value" ;
    ns1:flag_values ( 1 2 3 4 5 6 7 8 9 ) ;
    ns1:long_name "Northward Velocity component QC Flags" .

ns1:v_quality_flag_time_ref a bald:Reference,
        bald:Subject ;
    bald:array ns1:time ;
    bald:childBroadcast "(719, 1)" .

ns1:v_time_ref a bald:Reference,
        bald:Subject ;
    bald:array ns1:time ;
    bald:childBroadcast "(719, 1)" .

ns1:depth a bald:Array ;
    bald:shape "(20,)" ;
    ns1:C_format "%7.2f" ;
    ns1:FORTRAN_format "F7.2" ;
    ns1:_FillValue "9999.9" ;
    ns1:ancillary_variables ns1:depth_quality_flag ;
    ns1:coverage_content_type "coordinate" ;
    ns1:data_max 328.0 ;
    ns1:data_min 24.0 ;
    ns1:long_name "Depth" ;
    ns1:positive "down" ;
    ns1:standard_name ns1:depth ;
    ns1:units "meters" .

ns1:latitude a bald:Subject ;
    ns1:_FillValue "9999.9" ;
    ns1:axis "Y" ;
    ns1:coverage_content_type "coordinate" ;
    ns1:data_max 34.71389 ;
    ns1:data_min 34.71389 ;
    ns1:grid_mapping ns1:crs ;
    ns1:long_name "Latitude" ;
    ns1:standard_name ns1:latitude ;
    ns1:units "degrees_north" ;
    ns1:valid_max 90.0 ;
    ns1:valid_min -90.0 .

ns1:longitude a bald:Subject ;
    ns1:_FillValue "9999.9" ;
    ns1:axis "X" ;
    ns1:coverage_content_type "coordinate" ;
    ns1:data_max -120.96667 ;
    ns1:data_min -120.96667 ;
    ns1:grid_mapping ns1:crs ;
    ns1:long_name "Longitude" ;
    ns1:standard_name ns1:longitude ;
    ns1:units "degrees_east" ;
    ns1:valid_max 180.0 ;
    ns1:valid_min -180.0 .

ns1:crs a bald:Subject ;
    ns1:coverage_content_type "auxiliaryInformation" ;
    ns1:epsg_code "EPSG:4326" ;
    ns1:grid_mapping_name "latitude_longitude" ;
    ns1:inverse_flattening "298.257223563" ;
    ns1:long_name "Coordinate Reference System" ;
    ns1:longitude_of_prime_meridian "0.0" ;
    ns1:semi_major_axis "6378137.0" .

ns1:time a bald:Array,
        bald:Reference ;
    bald:array ns1:time ;
    bald:first_value 36402.034722222015 ;
    bald:last_value 36431.993055555504 ;
    bald:shape "(719,)" ;
    ns1:C_format "%9.4f" ;
    ns1:FORTRAN_format "F9.4" ;
    ns1:ancillary_variables ns1:time_quality_flag ;
    ns1:axis "T" ;
    ns1:calendar "julian" ;
    ns1:cf_role "timeseries_id" ;
    ns1:coverage_content_type "coordinate" ;
    ns1:data_max 36431.99306 ;
    ns1:data_min 36402.03472 ;
    ns1:long_name ns1:time ;
    ns1:standard_name ns1:time ;
    ns1:units "days since 1900-01-01 00:00:00" .

Example Schema.org JSON-LD output

{
    "@context": "http://schema.org/",
    "description": "global ocean currents in the NCEI archive holdings",
    "http://schema.org/identifier": "0093183",
    "http://schema.org/license": "These data are openly available to the public Please acknowledge the use of these data with the text given in the acknowledgment attribute.",
    "id": "https://data.nodc.noaa.gov/thredds/dodsC/ncei/gocd/a0000068/gocd_a0000068_46054_199909.nc",
    "keywords": [
        "OCEANS",
        "OCEAN CIRCULATION",
        "OCEAN CURRENTS",
        "EARTH SCIENCE"
    ],
    "name": "Global Ocean Currents Database - gocd_a0000068_46054_199909.nc",
    "type": "Dataset",
    "url": "https://data.nodc.noaa.gov/thredds/dodsC/ncei/gocd/a0000068/gocd_a0000068_46054_199909.nc"
}
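
The JSON-LD above appears to be derived from the netCDF global attributes shown in the BALD RDF example (`title`, `summary`, `id`, `license`, `keywords`). A minimal sketch of such a mapping is below. The mapping table and the function name are assumptions inferred from the two example outputs, not the PR's actual implementation, and the sketch uses the standard JSON-LD `@id`/`@type` keywords and short property names rather than reproducing the exact keys above.

```python
import json

# Hypothetical mapping from ACDD global attribute names to schema.org
# property names, inferred from the example outputs in this PR.
ACDD_TO_SCHEMA_ORG = {
    "title": "name",
    "summary": "description",
    "id": "identifier",
    "license": "license",
}


def attrs_to_schema_org(dataset_url, attrs):
    """Build a schema.org Dataset description from netCDF global attributes."""
    doc = {
        "@context": "http://schema.org/",
        "@type": "Dataset",
        "@id": dataset_url,
        "url": dataset_url,
    }
    for acdd_name, schema_name in ACDD_TO_SCHEMA_ORG.items():
        value = attrs.get(acdd_name)
        if value:
            doc[schema_name] = value
    # ACDD stores keywords as one comma-separated string; split into a list.
    keywords = attrs.get("keywords", "")
    if keywords:
        doc["keywords"] = [k.strip() for k in keywords.split(",") if k.strip()]
    return doc


# Attribute values copied from the BALD RDF example above.
example_attrs = {
    "title": "Global Ocean Currents Database - gocd_a0000068_46023_199909.nc",
    "summary": "global ocean currents in the NCEI archive holdings",
    "id": "0093183",
    "keywords": "EARTH SCIENCE,OCEANS,OCEAN CIRCULATION,OCEAN CURRENTS",
}
example_url = (
    "https://data.nodc.noaa.gov/thredds/dodsC/ncei/gocd/a0000068/"
    "gocd_a0000068_46023_199909.nc"
)
example_doc = attrs_to_schema_org(example_url, example_attrs)
print(json.dumps(example_doc, indent=4, sort_keys=True))
```

Attributes that are missing or empty are simply skipped, so the sketch also works for files with sparse metadata.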

Member

@marqh marqh left a comment

Some of the dependencies listed here don't appear to be used.

For example, I'd like to avoid urllib3 and route all URL interactions through requests.

Please confirm whether these dependencies are needed.

pydap
requests-futures
owslib
urllib3
Member

This dependency doesn't appear to be used anywhere.

Contributor Author

threddsnc2rdf.py uses the urljoin/split/parse functions of urllib3/urllib.
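
As a point of reference for this thread: in Python 3, `urljoin`, `urlsplit` and `urlparse` live in the standard library module `urllib.parse`, which is separate from the third-party `urllib3` package (on Python 2 they were in the `urlparse` module). The sketch below shows how those stdlib functions could cover this kind of URL handling. The helper name and the `/thredds/dodsC/` service base are assumptions for illustration, chosen to match the dataset URLs in the example output; this is not the script's actual code.

```python
from urllib.parse import urljoin, urlsplit

CATALOG_URL = (
    "https://data.nodc.noaa.gov/thredds/catalog/ncei/gocd/a0000068/catalog.xml"
)


def opendap_url(catalog_url, service_base, url_path):
    """Resolve a dataset's OPeNDAP URL against the catalog it was listed in.

    service_base is the catalog's OPeNDAP service base (assumed here to be
    an absolute path such as "/thredds/dodsC/"); url_path is the dataset's
    urlPath attribute.
    """
    return urljoin(catalog_url, service_base + url_path)


dataset = opendap_url(
    CATALOG_URL,
    "/thredds/dodsC/",
    "ncei/gocd/a0000068/gocd_a0000068_46023_199909.nc",
)
print(dataset)

# Relative catalog references resolve against the current catalog's directory.
child = urljoin(CATALOG_URL, "sub/catalog.xml")
print(child)

# urlsplit exposes the individual components of a URL.
print(urlsplit(CATALOG_URL).netloc)
```

If the script only needs these functions, the `urllib3` entry in the requirements file may be unnecessary, but that depends on what the code actually imports.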

requests-futures
owslib
urllib3
python-dateutil
Member

This dependency doesn't appear to be used anywhere.

Contributor Author

The code used to use the parser function, but it no longer seems to, so this dependency can go.

@@ -0,0 +1,6 @@
lxml
pydap
requests-futures
Member

This dependency doesn't appear to be used anywhere.

Contributor Author

requests-futures could probably go
