Updates to data source information #36
I've added an |
Hmm, interesting approach! I'm a little skeptical that all the CKAN instances around the world expose exactly the same APIs with the same information in them. My past experience, even just with CKAN instances in Australia, turned up plenty of weird edge cases. |
While it's true that various CKAN portals use add-ons to augment the metadata fields and data structures for their data, the license fields are standard to CKAN, so they are likely to be consistent across installations (leaving the possibilities as either "present" or "missing" rather than "in some other obscure field that we don't know the name of"). I acknowledge, though, that metadata on when data was last updated is sometimes put in other odd fields. |
How fussy do you want to be about license names? Below is the variety of information from all CKAN data sources regarding CC licenses, and I'm not sure how best to represent the nuance in them all in a single "license" field.
I wrote this script, which got me 95% of the way to putting the license information into the source JSON file using the CKAN field names. (I ran it through
#!/usr/bin/env python3
import json
import ssl
import urllib.request

# Don't do SSL verification because it didn't work right away and I can't be bothered debugging it
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

with open('../sources-out.json') as json_data:
    d = json.load(json_data)

for x in d:
    # Only sources with a known CKAN API endpoint can be looked up
    if 'ckan_api' not in x:
        continue
    try:
        # Fetch the JSON API object for this dataset
        with urllib.request.urlopen(x['ckan_api'], context=ctx) as url:
            data = json.loads(url.read().decode())
        if 'result' not in data:
            continue
        result = data['result']
        print(x['id'])
        # Copy across whichever license fields are present
        for l in ('license_id', 'license_title', 'license_url'):
            if l in result:
                x[l] = result[l]
    except Exception:
        # Covers network and parsing failures as well as unexpected result shapes
        print('BROKEN', x['id'])
        continue

with open('data.json', 'w') as outfile:
    json.dump(d, outfile)
|
Yeah, agreed. I've already started using a
|
Easy. Will make some tweaks. |
Done. Have just made up license names for licenses that aren't listed in SPDX. Ref: |
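For the Creative Commons variants discussed in this thread, one option is to derive an SPDX-style name from the license URL rather than from the free-text title. The sketch below is only an illustration: the function name `spdx_style_name` is hypothetical, it assumes the CKAN `license_url` field holds a standard creativecommons.org URL, and the names it produces (especially for jurisdiction ports) are not guaranteed to appear in the SPDX list, so they would still need checking against it.

```python
import re
from typing import Optional

# Matches URLs such as https://creativecommons.org/licenses/by-sa/3.0/au/
# The optional two-letter segment is the jurisdiction port (e.g. "au").
_CC_LICENSE = re.compile(
    r"creativecommons\.org/licenses/(?P<code>[a-z-]+)/(?P<version>\d+\.\d+)"
    r"(?:/(?P<port>[a-z]{2})(?:/|$))?",
    re.IGNORECASE,
)
# CC0 lives under /publicdomain/ rather than /licenses/
_CC_ZERO = re.compile(
    r"creativecommons\.org/publicdomain/zero/(?P<version>\d+\.\d+)",
    re.IGNORECASE,
)


def spdx_style_name(license_url: Optional[str]) -> Optional[str]:
    """Return an SPDX-style name for a CC license URL, or None if not recognised."""
    if not license_url:
        return None
    match = _CC_ZERO.search(license_url)
    if match:
        return f"CC0-{match.group('version')}"
    match = _CC_LICENSE.search(license_url)
    if match is None:
        return None
    parts = ["CC", match.group("code").upper(), match.group("version")]
    if match.group("port"):
        parts.append(match.group("port").upper())
    return "-".join(parts)
```

Anything that returns None (a missing URL or a non-CC license) would still need a manually chosen name, as was done here.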
I had this grand plan to programmatically determine which data sources were from CKAN portals, and then pull license information from the API, but I don't know enough JavaScript to do that. Here's where I got to in my attempts:
In lieu of that, I used a mixture of programmatic and manual methods to find CKAN API endpoints for all the datasets I could, and added a new field to each dataset called "ckan_api", which can be used to retrieve the JSON API object that should contain the license_title, license_id, and license_url, as well as other information if needed in future.
Perhaps you're able to fairly trivially write a script to populate each dataset's license with the content from its JSON API information? This should partially address #34.
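On the idea of programmatically working out which sources are CKAN portals, a rough Python sketch is to probe the portal's `status_show` action and check the shape of the reply. This assumes the portal exposes the standard CKAN action API at `/api/3/action/` under its base URL, which may not hold for every installation; the function names and the example base URL are hypothetical.

```python
import json
import urllib.request
from collections.abc import Mapping
from typing import Any


def status_url(portal_base: str) -> str:
    """Build the status_show URL, assuming the API sits at the portal root."""
    return portal_base.rstrip("/") + "/api/3/action/status_show"


def looks_like_ckan(payload: Any) -> bool:
    """Check that a decoded JSON reply has the shape of a CKAN status response."""
    if not isinstance(payload, Mapping) or payload.get("success") is not True:
        return False
    result = payload.get("result")
    return isinstance(result, Mapping) and "ckan_version" in result


def is_ckan_portal(portal_base: str, timeout: float = 10.0) -> bool:
    """Probe a portal over the network; any failure counts as 'not CKAN'."""
    try:
        with urllib.request.urlopen(status_url(portal_base), timeout=timeout) as resp:
            return looks_like_ckan(json.loads(resp.read().decode()))
    except (OSError, ValueError):
        # OSError covers URL and HTTP errors; ValueError covers bad JSON or bad URLs
        return False
```

Calling `is_ckan_portal("https://data.example.org")` would return True only if that (made-up) host answered like a CKAN site. A False result does not prove the portal is not CKAN, since the API may be mounted elsewhere or blocked.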
The JSON objects returned by these API endpoints may also include the last-updated date for each dataset, thereby also potentially addressing #14.
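A minimal sketch of pulling a last-updated date out of an already-fetched API object follows. It assumes the response has the usual CKAN shape, with `metadata_modified` on the dataset and `last_modified` on each resource, and that the timestamps are ISO 8601 strings in a consistent format so they can be compared as text. The function name `last_updated` is hypothetical. Note that `metadata_modified` records when the metadata changed, which is not necessarily when the data itself changed, and some portals keep the real update date in other fields.

```python
from collections.abc import Mapping
from typing import Any, Optional


def last_updated(api_response: Mapping) -> Optional[str]:
    """Return the most recent timestamp found in a CKAN dataset response, if any."""
    result: Any = api_response.get("result")
    if not isinstance(result, Mapping):
        return None
    # Start with the dataset-level timestamp, then add per-resource ones
    candidates = [result.get("metadata_modified")]
    for resource in result.get("resources") or []:
        if isinstance(resource, Mapping):
            candidates.append(resource.get("last_modified"))
    # Keep only non-empty strings; ISO 8601 strings in one format sort chronologically
    stamps = [c for c in candidates if isinstance(c, str) and c]
    return max(stamps) if stamps else None
```

The returned string could then be stored on the source entry alongside the license fields.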
I'm trying to see if I can find something similar for ArcGIS, but I think it's less consistent with its API endpoints. Will see how I go.