CKAN commands
This page documents common CKAN commands used for both inventory.data.gov and catalog.data.gov.
Site administrator (sysadmin) accounts are handled via a "user provided service" in cloud.gov. We use the `ckanext-saml2auth` extension, where we set the `sysadmins_list` config variable to control who has access. You can see our implementation in catalog, and the update process in the cloud.gov notes.
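
As a rough illustration, the allow-list is an ordinary CKAN config option. A minimal sketch of what the relevant `ckan.ini` lines might look like (the email addresses are placeholders; in our deployment the real values come from the cloud.gov user-provided service, not a hard-coded file):

```ini
# Minimal sketch only; production values are injected from the
# cloud.gov user-provided service rather than hard-coded here.

# saml2auth must be enabled alongside the site's other plugins:
ckan.plugins = saml2auth

# Space-separated emails of SSO users to grant sysadmin access
# (placeholder addresses):
ckanext.saml2auth.sysadmins_list = first.admin@example.gov second.admin@example.gov
```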
- `ckan geodatagov sitemap-to-s3`: triggers generation of a new sitemap and uploads it to S3. Intended to be run on a schedule with an action.
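
On cloud.gov, one way to run a command like this on demand or on a schedule is as a Cloud Foundry task against the running app. A minimal sketch, assuming a CF CLI v7+ client and a hypothetical app name `catalog-web`:

```bash
# Run sitemap generation as a one-off task against the CKAN app.
# "catalog-web" is a placeholder app name; substitute the real one.
cf run-task catalog-web \
  --command "ckan geodatagov sitemap-to-s3" \
  --name sitemap-to-s3

# Check on the task afterwards:
cf tasks catalog-web
```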
None of these are in cloud.gov production; they need to be re-evaluated.
Supervisor Name | Full Command | Description | Current Status | Future Plans |
---|---|---|---|---|
ckan-clean-deleted | `ckan --plugin=ckanext-geodatagov geodatagov clean-deleted` | Deletes packages and associated information directly from the DB via a long query. It is unclear whether this can be run regularly without also running db-solr-sync. | Not Automated | Add back to automation |
ckan-combine-feeds | `ckan --plugin=ckanext-geodatagov geodatagov combine-feeds` | Takes the standard CKAN feeds at https://catalog.data.gov/feeds/dataset.atom?page=1 and combines the first 21 pages into one record in the database, which is then served at /usasearch-custom-feed.xml to feed USAsearch. USAsearch uses the Bing index as a backend, which does not understand pagination in Atom feeds. | Not Automated | Research whether this can be deprecated |
ckan-db-solr-sync | `ckan --plugin=ckanext-geodatagov geodatagov db_solr_sync` | Compares every Solr item against the DB, noting items missing from the DB (these are eventually removed, marked `notfound`) and items with a different metadata_modified timestamp (the index is rebuilt for each of these packages via search.rebuild, marked `outsync`). Marks all Solr packages that match the DB as `insync`. Then compares every DB item that was not matched with a Solr item and tries to rebuild the search index for that package/dataset. | Not Automated | Re-integrate (the current system is too large/bloated to automate) |
ckan-export-csv | `ckan --plugin=ckanext-geodatagov geodatagov export-csv` | Builds a CSV file of the current state of datasets with their organization, harvest source, and topic (CKAN group) names. Useful for syncing up various dataset objects, such as topics. | Not Automated | Automate into S3? Low priority |
ckan-harvest-job-cleanup | `ckan --plugin=ckanext-geodatagov geodatagov harvest-job-cleanup` | First checks the harvest_system_info table (updated every time the harvest run job runs); if it doesn't exist or has no recent entries (within 1 hour), the job exits. Then it searches for any harvest jobs that have been running for at least 12 hours (or have not started) and have objects that have not finished within 6 hours; these objects are marked `STUCK`. It also finds jobs that have not started within 12 hours and/or have no harvest objects and marks them `Finished`. It then validates that packages are re-linked to the correct harvest object. Finally, it updates all harvest jobs that were forced to complete with a truncated timestamp so they are easy to see/review. | Not Automated | Should automate so less manual intervention is needed with the harvester |
ckan-jsonl-export | `ckan --plugin=ckanext-geodatagov geodatagov jsonl-export` | Exports a complete list of all datasets in raw CKAN JSONL form and pushes it to S3. | Not Automated | current ticket |
ckan-metrics-csv | `ckan --plugin=ckanext-geodatagov geodatagov metrics-csv` | Builds a list of all datasets and the number of views for a given month, written to a CSV and uploaded to S3. | Automated monthly | Can be run only after ckan-tracking-update is run (see the sketch after this table) |
ckan-report | `ckan --plugin=ckanext-report report generate` | Generates any registered reports (in catalog's case, the broken-links report). | Not Automated | Remove the report, or automate it so agencies can use it reliably |
ckan-sitemap | `ckan --plugin=ckanext-geodatagov geodatagov sitemap-to-s3` | Creates a sitemap for scanning services, made up of each dataset page and broken up into multiple pages. Pushed to S3. | Automation running | Will continue |
ckan-tracking-update | `ckan tracking update` | Pulls the most recent data from tracking_raw, updates tracking_summary, and then syncs this information into Solr. The tracking_summary table has not been updated since August 2019. It should contain a list of datasets with their running totals and most recent view counts. Native to upstream CKAN. | Not Automated | Should keep only one metrics dataset: ckan-metrics-csv or this |
pycsw-keywords-all | `ckan --plugin=ckanext-spatial ckan-pycsw set_keywords -p /etc/ckan/pycsw-all.cfg` | Grabs the top 20 tags from CKAN and puts them into /etc/ckan/pycsw-all.cfg as CSW service metadata keywords. | Automated | Unknown |
pycsw-keywords-collection | `ckan --plugin=ckanext-spatial ckan-pycsw set_keywords -p /etc/ckan/pycsw-collection.cfg` | Grabs the top 20 tags from CKAN and puts them into /etc/ckan/pycsw-collection.cfg as CSW service metadata keywords. | Automated | Unknown |
pycsw-load | `ckan --plugin=ckanext-spatial ckan-pycsw load -p /etc/ckan/pycsw-all.cfg` | Accesses the CKAN API to load CKAN datasets into the pycsw database. | Automated | Unknown |
(none) | `/usr/lib/ckan/bin/python /usr/lib/ckan/bin/pycsw-db-admin.py vacuumdb /etc/ckan/pycsw-all.cfg` | Runs a vacuumdb job on the pycsw database. | Not in supervisor | Unknown |
pycsw-reindex-fts | `/usr/lib/ckan/bin/python /usr/lib/ckan/bin/pycsw-db-admin.py reindex_fts /etc/ckan/pycsw-all.cfg` | Rebuilds the GIN index on the pycsw records table to speed up full-text search. | Automated | Unknown |
qa-update-sel | `ckan --plugin=ckanext-qa qa update_sel` | This command no longer exists. It was built to update datasets that had been recently modified and was meant to run daily. | Not Automated | Re-add, re-integrate, and automate |
qa-update | `ckan --plugin=ckanext-qa qa update` | Pulls all datasets, creates a QA job for each in a queue (using Celery), and runs through them slowly. Each QA job runs analysis on the dataset and its resources, storing the information and scoring them on various attributes. | Not Automated | Need to re-integrate and automate |
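
Because ckan-metrics-csv depends on fresh tracking data, the two commands have to run in order. A minimal sketch of a manual run, assuming shell access inside the CKAN app environment (on cloud.gov this could be wrapped in `cf run-task` as shown above):

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Refresh tracking_summary from tracking_raw and sync it into Solr.
#    This is the upstream CKAN command from the table above.
ckan tracking update

# 2. Only then build the monthly views CSV and upload it to S3.
ckan --plugin=ckanext-geodatagov geodatagov metrics-csv
```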