Updates to data source information #36

Open
wants to merge 5 commits into
base: master

Conversation


@mattcen mattcen commented Mar 18, 2020

I had this grand plan to programmatically determine which data sources were from CKAN portals, and then pull license information from the API, but I don't know enough JavaScript to do that. Here's where I got to in my attempts:

#!/usr/bin/env node
'use strict';

const fs = require('fs');

// Load the list of data sources.
const sources = JSON.parse(fs.readFileSync('sources-out.json', 'utf8'));

for (const d of Object.values(sources)) {
  const dl = d['download'];
  // Only handle sources hosted on data.gov.au (a CKAN portal).
  if (typeof dl !== 'string' || !/data\.gov\.au/.test(dl)) continue;

  // The dataset slug follows either "geoserver/" or "dataset/" in the URL.
  const marker = /\/geoserver\//.test(dl) ? /.*geoserver\// : /.*dataset\//;
  const ds = dl.replace(marker, '').replace(/\/.*/, '');

  // SET info HERE
  d['info'] = `https://data.gov.au/data/dataset/${ds}`;

  // Print the CKAN API URL for this dataset, one per line.
  console.log(`https://data.gov.au/api/3/action/package_show?id=${ds}`);
  /* Then, in a shell:
     ./get_ckan_url.js | while read -r ds; do echo "$ds"; curl -s "$ds" | jq .result.license_title; done
  */
}
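
As a worked illustration of the slug extraction in the script above, here is a minimal Python sketch of the same idea. The function name `ckan_slug` and both sample URLs are hypothetical, made up for illustration; only the `geoserver/` and `dataset/` markers and the `package_show` URL pattern come from the script.

```python
def ckan_slug(download_url):
    """Return the dataset slug from a data.gov.au download URL, or None."""
    if "data.gov.au" not in download_url:
        return None
    # Same markers as the Node script: geoserver URLs first, else dataset URLs.
    marker = "geoserver/" if "/geoserver/" in download_url else "dataset/"
    # rpartition splits on the last occurrence, like the greedy regex.
    _, found, rest = download_url.rpartition(marker)
    if not found:
        return None
    # The slug is everything up to the next slash.
    return rest.split("/", 1)[0] or None


def package_show_url(slug):
    """Build the CKAN package_show URL used in the script above."""
    return f"https://data.gov.au/api/3/action/package_show?id={slug}"


# Hypothetical URLs, for illustration only.
examples = [
    "https://data.gov.au/geoserver/example-trees/wfs?request=GetFeature",
    "https://data.gov.au/dataset/example-bins/resource/abc/download/bins.zip",
]
for url in examples:
    print(package_show_url(ckan_slug(url)))
```

Unlike the original, this sketch returns `None` when neither marker is present instead of producing a mangled slug.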

In lieu of that, I used a mixture of programmatic and manual methods to find CKAN API endpoints for all the datasets I could, and added a new field to each dataset called "ckan_api", which can be used to retrieve the JSON API object that should contain the license_title, license_id, and license_url, as well as other information if needed in future.

Perhaps you're able to fairly trivially write a script to populate each dataset's license with the content from its JSON API information? This should partially address #34.

Given the API endpoints, these may also point to the last updated date for each dataset, thereby also potentially addressing #14.
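
A minimal sketch of what reading those fields out of a `package_show` response could look like. The helper name `summarise_package` and the sample response are hypothetical. `metadata_modified` is the standard CKAN package field recording when the dataset's metadata last changed, which is an assumption about what "last updated" should mean here and may not match when the data itself changed.

```python
LICENSE_FIELDS = ("license_id", "license_title", "license_url")


def summarise_package(response):
    """Pull license fields and the modified timestamp from a package_show response."""
    result = response.get("result") if isinstance(response, dict) else None
    if not isinstance(result, dict):
        return {}
    # Keep only fields that are present and non-empty.
    summary = {f: result[f] for f in LICENSE_FIELDS if result.get(f)}
    if result.get("metadata_modified"):
        summary["metadata_modified"] = result["metadata_modified"]
    return summary


# Hypothetical response, trimmed to the keys of interest.
sample = {
    "success": True,
    "result": {
        "license_id": "cc-by",
        "license_title": "Creative Commons Attribution 3.0 Australia",
        "license_url": "http://creativecommons.org/licenses/by/3.0/au/",
        "metadata_modified": "2020-03-01T00:00:00.000000",
    },
}
print(summarise_package(sample))
```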

I'm trying to see if I can find something similar for ArcGIS, but I think it's less consistent with its API endpoints. Will see how I go.

Author

mattcen commented Mar 18, 2020

I've added an arcgis_page field to appropriate datasets that links to a consistently formatted web page. I haven't yet worked out if there's a consistent way to determine what, if any, license a given dataset has from here; work in progress.

Owner

stevage commented Mar 18, 2020

Hmm, interesting approach! I'm a little bit skeptical that all the "CKAN" instances around the world expose exactly the same APIs with the same information in them. My past experience, even just with CKAN portals in Australia, turned up plenty of weird edge cases.

Author

mattcen commented Mar 18, 2020

While it's true that various CKAN portals use add-ons to augment the metadata fields and data structures for their data, the license fields are standard to CKAN, so they are likely to be consistent across installations (leaving the possibilities as either "present" or "missing" rather than "in some other obscure field that we don't know the name of"). I acknowledge, though, that metadata on when data was last updated is sometimes put in other odd fields.

Author

mattcen commented Mar 19, 2020

How fussy do you want to be about license names?

Below is the variety of information from all CKAN data sources regarding CC licenses, and I'm not sure how best to represent the nuance in them all in a single "license" field.

| license_id | license_title | license_url |
| --- | --- | --- |
| cc-0 | CC-0 | http://creativecommons.org/publicdomain/zero/1.0/deed.nl |
| cc-by | Attribution (BY 4.0) | https://www.donneesquebec.ca/fr/licence/#cc-by |
| cc-by | CC-BY | http://creativecommons.org/licenses/by/4.0/deed.nl |
| cc-by | Creative Commons Attribution 3.0 Australia | http://creativecommons.org/licenses/by/3.0/au/ |
| cc-by | Creative Commons Attribution | http://creativecommons.org/licenses/by/4.0 |
| cc-by | Creative Commons Attribution | http://www.opendefinition.org/licenses/cc-by |
| cc-by-2.5 | Creative Commons Attribution 2.5 Australia | http://creativecommons.org/licenses/by/2.5/au/ |
| CC-BY-4.0 | Creative Commons Attribution 4.0 | https://creativecommons.org/licenses/by/4.0/ |

I wrote this script which got me 95% of the way there to putting the license information into the source JSON file using the CKAN field names. (I ran it through jq . afterwards to format it to look like the original file, and needed to hand-hold santiago and koeln.)

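#!/usr/bin/env python3

import json
import ssl
import urllib.error
import urllib.request

# Don't do SSL verification because it didn't work right away and I can't be bothered debugging it.
# Note: this is insecure and only tolerable for a one-off metadata fetch.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

LICENSE_FIELDS = ('license_id', 'license_title', 'license_url')

with open('../sources-out.json') as json_data:
    sources = json.load(json_data)

for source in sources:
    if 'ckan_api' not in source:
        continue
    try:
        with urllib.request.urlopen(source['ckan_api'], context=ctx) as response:
            data = json.loads(response.read().decode())
    except (urllib.error.URLError, ValueError) as err:
        # Report network failures and non-JSON responses, then move on.
        print('BROKEN', source['id'], err)
        continue
    result = data.get('result') if isinstance(data, dict) else None
    if not isinstance(result, dict):
        continue
    print(source['id'])
    # Copy across whichever license fields the portal provides.
    for field in LICENSE_FIELDS:
        if field in result:
            source[field] = result[field]

with open('data.json', 'w') as outfile:
    json.dump(sources, outfile)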

Owner

stevage commented Mar 19, 2020

> Below is the variety of information from all CKAN data sources regarding CC licenses, and I'm not sure how best to represent the nuance in them all in a single "license" field.

Yeah, agreed. I've already started using a licenseUrl field too, so I'd suggest:

license: a short code, preferably SPDX
licenseUrl: link to full text
licenseName: a longer name. I'd primarily use this when there just isn't anything that would work as an ID.

Author

mattcen commented Mar 19, 2020

Easy. Will make some tweaks.

Author

mattcen commented Mar 19, 2020

Done. Have just made up license names for licenses that aren't listed in SPDX. Ref: OGL-Surrey, OGL-Toronto, other-open, and CC-BY (where no version is listed).
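
A rough sketch of how the CKAN fields could be mapped onto the license, licenseUrl and licenseName fields suggested above. It is not the script actually used in this pull request. The function names are hypothetical, and the assumption is that a creativecommons.org URL carries enough information (kind, version, jurisdiction) to build an SPDX-style code; any generated code should still be checked against the SPDX license list. When the URL gives no version, the CKAN license_id is kept unchanged.

```python
import re
from urllib.parse import urlparse

CC_HOSTS = ("creativecommons.org", "www.creativecommons.org")


def spdx_from_cc_url(url):
    """Guess an SPDX-style code from a creativecommons.org license URL."""
    parsed = urlparse(url or "")
    if parsed.netloc not in CC_HOSTS:
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Public domain dedication: /publicdomain/zero/<version>/
    if parts[:2] == ["publicdomain", "zero"] and len(parts) >= 3:
        return f"CC0-{parts[2]}"
    # Licenses: /licenses/<kind>/<version>[/<jurisdiction>]
    if len(parts) >= 3 and parts[0] == "licenses":
        code = f"CC-{parts[1].upper()}-{parts[2]}"
        # A two-letter segment is treated as a jurisdiction (e.g. "au").
        if len(parts) >= 4 and re.fullmatch(r"[a-z]{2}", parts[3]):
            code += f"-{parts[3].upper()}"
        return code
    return None


def normalise_license(ckan):
    """Map CKAN license fields to license / licenseUrl / licenseName."""
    url = ckan.get("license_url")
    code = spdx_from_cc_url(url) or ckan.get("license_id")
    out = {}
    if code:
        out["license"] = code
    if url:
        out["licenseUrl"] = url
    if ckan.get("license_title"):
        out["licenseName"] = ckan["license_title"]
    return out


print(normalise_license({
    "license_id": "cc-by",
    "license_title": "Creative Commons Attribution 3.0 Australia",
    "license_url": "http://creativecommons.org/licenses/by/3.0/au/",
}))
```

Deriving the code from the URL rather than from license_id sidesteps the problem shown in the table, where the same cc-by id covers several different versions and jurisdictions.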

2 participants