
CKAN commands

Robert Bryson edited this page Oct 14, 2022 · 18 revisions

Documents common CKAN commands that are used for both inventory.data.gov and catalog.data.gov.

System administrator accounts

Site administrator accounts are handled via a "user provided service" in cloud.gov. We use the ckanext-saml2auth extension and set its sysadmins_list config variable to control who has access. You can see our implementation in catalog, and the update process in the cloud.gov notes.
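As a rough illustration of how a sysadmin list could be read from a user provided service: Cloud Foundry exposes bound services to the app as JSON in the VCAP_SERVICES environment variable, with user provided services under the "user-provided" key. The sketch below is not the catalog implementation. The service name `example-secrets`, the credential key `SYSADMINS_LIST`, and the whitespace-separated email format are all assumptions made for the example.

```python
import json


def sysadmins_from_vcap(vcap_json, service_name, key):
    """Return the sysadmin emails stored in a user provided service credential.

    vcap_json is the raw VCAP_SERVICES string; service_name and key are
    hypothetical names chosen for this sketch.
    """
    services = json.loads(vcap_json)
    for service in services.get("user-provided", []):
        if service.get("name") == service_name:
            raw = service.get("credentials", {}).get(key, "")
            # Assumed format: whitespace-separated email addresses.
            return raw.split()
    return []


# Hypothetical VCAP_SERVICES payload; names and emails are placeholders.
example = json.dumps({
    "user-provided": [
        {
            "name": "example-secrets",
            "credentials": {"SYSADMINS_LIST": "a@example.gov b@example.gov"},
        }
    ]
})

print(sysadmins_from_vcap(example, "example-secrets", "SYSADMINS_LIST"))
```

In a running app the first argument would come from `os.environ["VCAP_SERVICES"]`, and the resulting list would be written into the CKAN config before startup.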

Catalog commands

  • ckan geodatagov sitemap-to-s3 - triggers generation of a new sitemap and uploads it to s3. Intended to be run on a schedule with an action.
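The ckan-sitemap entry on this page describes the sitemap as made up of each dataset page, broken up into multiple pages. A minimal sketch of that kind of paging follows. It is not the ckanext-geodatagov code: the function names are hypothetical, the dataset URLs are placeholders, and the 50,000 limit is the per-file maximum from the sitemaps.org protocol rather than a value taken from the extension.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50000  # per-file limit in the sitemaps.org protocol


def chunk(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive pages of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def sitemap_page(urls):
    """Render one sitemap page as an XML string."""
    entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        f"{entries}</urlset>"
    )


# Placeholder dataset URLs for illustration only.
urls = [f"https://catalog.data.gov/dataset/example-{n}" for n in range(5)]
pages = [sitemap_page(page) for page in chunk(urls, size=2)]
print(len(pages))
```

Each rendered page would then be uploaded as its own object, with a sitemap index file pointing at all of them.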

None of the commands in the table below are in cloud.gov production; they need to be re-evaluated.

| Supervisor Name | Full Command | Description | Current Status | Future Plans |
| --- | --- | --- | --- | --- |
| ckan-clean-deleted | ckan --plugin=ckanext-geodatagov geodatagov clean-deleted | Deletes packages and associated information directly from the DB via a long query. Not sure if this can be run regularly without running db-solr-sync. | Not Automated | Add back to automation |
| ckan-combine-feeds | ckan --plugin=ckanext-geodatagov geodatagov combine-feeds | Takes the standard CKAN feeds at https://catalog.data.gov/feeds/dataset.atom?page=1 and combines the first 21 pages into 1 record in the database. It is then served at /usasearch-custom-feed.xml to feed USAsearch. USAsearch uses a Bing index as its backend, which does not understand pagination in atom feeds. | Not Automated | Research whether it can be deprecated |
| ckan-db-solr-sync | ckan --plugin=ckanext-geodatagov geodatagov db_solr_sync | Compares every Solr item against the DB and notes items missing from the DB (marked notfound and eventually removed) and items with a different metadata_modified timestamp (marked outsync; uses search.rebuild to rebuild the index for each of these packages). Marks all Solr packages that match the DB as ‘insync’. Then takes every DB item that was not matched with a Solr item and tries to rebuild the search index for that package/dataset. | Not Automated | Re-integrate (current system too large/bloated to implement automated) |
| ckan-export-csv | ckan --plugin=ckanext-geodatagov geodatagov export-csv | Builds a csv file of the current state of datasets, their organizations, harvest source, and topic (CKAN group) names. Useful for syncing up various dataset objects, such as topics. | Not Automated | Automate into s3? Low priority |
| ckan-havest-job-cleanup | ckan --plugin=ckanext-geodatagov geodatagov harvest-job-cleanup | First checks the harvest_system_info table (updated every time the harvest run job is run); if it doesn’t exist or doesn’t have recent entries (within 1 hour), the job quits. Then it searches for harvest jobs that have been running for at least 12 hours (or have not started) and have objects that have not finished within 6 hours. These objects are marked as STUCK. It also finds jobs that have not ‘started’ within 12 hours and/or have no harvest objects and marks those jobs as Finished. Then it validates that the packages are re-linked to the correct harvest object. Finally, it updates all harvest jobs that were forced to complete to a truncated timestamp so that they are easy to see/review. | Not Automated | Should automate so less manual intervention is necessary with the harvester |
| ckan-jsonl-export | ckan --plugin=ckanext-geodatagov geodatagov jsonl-export | A complete list of all datasets, saved in raw CKAN jsonl form and pushed to s3. | Not Automated | Current ticket |
| ckan-metrics-csv | ckan --plugin=ckanext-geodatagov geodatagov metrics-csv | A list of all datasets and the number of views for a given month, placed into a csv and loaded into S3. | Automated monthly | Can be run only after ckan-tracking-update is run |
| ckan-report | ckan --plugin=ckanext-report report generate | Generates any registered reports (in the catalog case, the broken-links report). | Not Automated | Remove report or automate so agencies can use it reliably |
| ckan-sitemap | ckan --plugin=ckanext-geodatagov geodatagov sitemap-to-s3 | Creates a sitemap for scanning services. It is made up of each dataset page, broken up into multiple pages, and pushed to s3. | Automation running | Will continue |
| ckan-tracking-update | ckan tracking update | Pulls the most recent data from tracking_raw and updates tracking_summary, then syncs this information to Solr. The tracking_summary table has not been updated since August 2019. It should contain a list of datasets with their counts, running totals, and most recent views. Native to upstream CKAN. | Not Automated | Should only keep 1 metrics dataset, ckan-metrics-csv or this |
| pycsw-keywords-all | ckan --plugin=ckanext-spatial ckan-pycsw set_keywords -p /etc/ckan/pycsw-all.cfg* | Grabs the top 20 tags from CKAN and puts them into /etc/ckan/pycsw-all.cfg as CSW service metadata keywords. | Automated | Unknown |
| pycsw-keywords-collection | ckan --plugin=ckanext-spatial ckan-pycsw set_keywords -p /etc/ckan/pycsw-collection.cfg* | Grabs the top 20 tags from CKAN and puts them into /etc/ckan/pycsw-collection.cfg as CSW service metadata keywords. | Automated | Unknown |
| pycsw-load | ckan --plugin=ckanext-spatial ckan-pycsw load -p /etc/ckan/pycsw-all.cfg | Accesses the CKAN API to load CKAN datasets into the pycsw database. | Automated | Unknown |
| | /usr/lib/ckan/bin/python /usr/lib/ckan/bin/pycsw-db-admin.py vacuumdb /etc/ckan/pycsw-all.cfg | Runs a vacuumdb job on the pycsw database. | Not in supervisor | Unknown |
| pycsw-reindex-fts | /usr/lib/ckan/bin/python /usr/lib/ckan/bin/pycsw-db-admin.py reindex_fts /etc/ckan/pycsw-all.cfg | Rebuilds the GIN index on the pycsw records table to speed up full text search. | Automated | Unknown |
| qa-update-sel | ckan --plugin=ckanext-qa qa update_sel | This command no longer exists. It was built to update datasets that had been recently modified and was meant to be run daily. | Not Automated | Re-add, re-integrate and automate |
| qa-update | ckan --plugin=ckanext-qa qa update | Pulls all datasets and creates a QA job for each in a queue (using celery), then runs through them slowly. Each QA job runs analysis and stores information on each dataset and resource, giving them a score on various attributes. | Not Automated | Need to re-integrate and automate |
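The comparison described for ckan-db-solr-sync can be sketched as follows. This is a simplified model built only from that description, not the extension's code: the function name `classify` is hypothetical, and both inputs are assumed to be plain dictionaries mapping a package id to its metadata_modified value.

```python
def classify(solr_index, db_packages):
    """Compare Solr entries against the DB.

    Both arguments map package id -> metadata_modified (assumed shape).
    Returns the status assigned to each Solr entry and the DB packages
    that had no Solr entry at all.
    """
    status = {}
    for pkg_id, solr_modified in solr_index.items():
        if pkg_id not in db_packages:
            status[pkg_id] = "notfound"   # in Solr but gone from the DB
        elif db_packages[pkg_id] != solr_modified:
            status[pkg_id] = "outsync"    # timestamps differ, needs reindex
        else:
            status[pkg_id] = "insync"
    # DB packages never matched with a Solr entry also need indexing.
    missing_from_solr = sorted(set(db_packages) - set(solr_index))
    return status, missing_from_solr


# Made-up sample data for illustration.
solr = {"a": "2022-01-01", "b": "2022-01-02", "c": "2022-01-03"}
db = {"a": "2022-01-01", "b": "2022-02-02", "d": "2022-01-04"}

status, to_index = classify(solr, db)
print(status)
print(to_index)
```

In this model, packages marked outsync and those returned in the second list are the ones that would be passed to a search index rebuild, while notfound entries would be removed from Solr.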