CKAN commands
This page documents common CKAN commands used for both inventory.data.gov and catalog.data.gov.
Site administrator (sysadmin) accounts are handled via a "user provided service" in cloud.gov. We use the `ckanext-saml2auth` extension, where we set the `sysadmins_list` config variable to control who has access. You can see our implementation in catalog, and the update process in the cloud.gov notes.
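
As a rough illustration, the allow-list is an ordinary CKAN config option. A minimal sketch of what the relevant `ckan.ini` lines might look like (the email addresses are placeholders; in our deployment the real values come from the cloud.gov user-provided service, not a hard-coded file):

```ini
# Minimal sketch only; production values are injected from the
# cloud.gov user-provided service rather than hard-coded here.

# saml2auth must be enabled alongside the site's other plugins:
ckan.plugins = saml2auth

# Space-separated emails of SSO users to grant sysadmin access
# (placeholder addresses):
ckanext.saml2auth.sysadmins_list = first.admin@example.gov second.admin@example.gov
```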
- `ckan geodatagov sitemap-to-s3`: triggers generation of a new sitemap and uploads it to S3. Intended to be run on a schedule with an action.
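
On cloud.gov, one way to run a command like this on demand or on a schedule is as a Cloud Foundry task against the running app. A minimal sketch, assuming a CF CLI v7+ client and a hypothetical app name `catalog-web`:

```bash
# Run sitemap generation as a one-off task against the CKAN app.
# "catalog-web" is a placeholder app name; substitute the real one.
cf run-task catalog-web \
  --command "ckan geodatagov sitemap-to-s3" \
  --name sitemap-to-s3

# Check on the task afterwards:
cf tasks catalog-web
```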
None of these are in cloud.gov production; they need to be re-evaluated.
Supervisor Name | Full Command | Description | Current Status | Future Plans |
---|---|---|---|---|
ckan-clean-deleted | `ckan --plugin=ckanext-geodatagov geodatagov clean-deleted` | Deletes packages and associated information directly from the DB via a long query. It is unclear whether this can be run regularly without also running db-solr-sync. | Not Automated | Add back to automation |
ckan-combine-feeds | `ckan --plugin=ckanext-geodatagov geodatagov combine-feeds` | Takes the standard CKAN feeds at https://catalog.data.gov/feeds/dataset.atom?page=1 and combines the first 21 pages into one record in the database, which is then served at /usasearch-custom-feed.xml to feed USAsearch. USAsearch uses the Bing index as a backend, which does not understand pagination in Atom feeds. | Not Automated | Research whether this can be deprecated |
ckan-db-solr-sync | `ckan --plugin=ckanext-geodatagov geodatagov db_solr_sync` | Compares every Solr item against the DB, noting items missing from the DB (these are eventually removed, marked `notfound`) and items with a different metadata_modified timestamp (the index is rebuilt for each of these packages via search.rebuild, marked `outsync`). Marks all Solr packages that match the DB as `insync`. Then compares every DB item that was not matched with a Solr item and tries to rebuild the search index for that package/dataset. | Not Automated | Re-integrate (the current system is too large/bloated to automate) |
ckan-export-csv | `ckan --plugin=ckanext-geodatagov geodatagov export-csv` | Builds a CSV file of the current state of datasets with their organization, harvest source, and topic (CKAN group) names. Useful for syncing up various dataset objects, such as topics. | Not Automated | Automate into S3? Low priority |
ckan-harvest-job-cleanup | `ckan --plugin=ckanext-geodatagov geodatagov harvest-job-cleanup` | First checks the harvest_system_info table (updated every time the harvest run job runs); if it doesn't exist or has no recent entries (within 1 hour), the job exits. Then it searches for any harvest jobs that have been running for at least 12 hours (or have not started) and have objects that have not finished within 6 hours; these objects are marked `STUCK`. It also finds jobs that have not started within 12 hours and/or have no harvest objects and marks them `Finished`. It then validates that packages are re-linked to the correct harvest object. Finally, it updates all harvest jobs that were forced to complete with a truncated timestamp so they are easy to see/review. | Not Automated | Should automate so less manual intervention is needed with the harvester |
ckan-jsonl-export | `ckan --plugin=ckanext-geodatagov geodatagov jsonl-export` | Exports a complete list of all datasets in raw CKAN JSONL form and pushes it to S3. | Not Automated | current ticket |
ckan-metrics-csv | `ckan --plugin=ckanext-geodatagov geodatagov metrics-csv` | Builds a list of all datasets and the number of views for a given month, written to a CSV and uploaded to S3. | Automated monthly | Can be run only after ckan-tracking-update is run (see the sketch after this table) |
ckan-report | `ckan --plugin=ckanext-report report generate` | Generates any registered reports (in catalog's case, the broken-links report). | Not Automated | Remove the report, or automate it so agencies can use it reliably |
ckan-sitemap | `ckan --plugin=ckanext-geodatagov geodatagov sitemap-to-s3` | Creates a sitemap for scanning services, made up of each dataset page and broken up into multiple pages. Pushed to S3. | Automation running | Will continue |
ckan-tracking-update | `ckan tracking update` | Pulls the most recent data from tracking_raw, updates tracking_summary, and then syncs this information into Solr. The tracking_summary table has not been updated since August 2019. It should contain a list of datasets with their running totals and most recent view counts. Native to upstream CKAN. | Not Automated | Should keep only one metrics dataset: ckan-metrics-csv or this |
pycsw-keywords-all | `ckan --plugin=ckanext-spatial ckan-pycsw set_keywords -p /etc/ckan/pycsw-all.cfg` | Grabs the top 20 tags from CKAN and puts them into /etc/ckan/pycsw-all.cfg as CSW service metadata keywords. | Automated | Unknown |
pycsw-keywords-collection | `ckan --plugin=ckanext-spatial ckan-pycsw set_keywords -p /etc/ckan/pycsw-collection.cfg` | Grabs the top 20 tags from CKAN and puts them into /etc/ckan/pycsw-collection.cfg as CSW service metadata keywords. | Automated | Unknown |
pycsw-load | `ckan --plugin=ckanext-spatial ckan-pycsw load -p /etc/ckan/pycsw-all.cfg` | Accesses the CKAN API to load CKAN datasets into the pycsw database. | Automated | Unknown |
(none) | `/usr/lib/ckan/bin/python /usr/lib/ckan/bin/pycsw-db-admin.py vacuumdb /etc/ckan/pycsw-all.cfg` | Runs a vacuumdb job on the pycsw database. | Not in supervisor | Unknown |
pycsw-reindex-fts | `/usr/lib/ckan/bin/python /usr/lib/ckan/bin/pycsw-db-admin.py reindex_fts /etc/ckan/pycsw-all.cfg` | Rebuilds the GIN index on the pycsw records table to speed up full-text search. | Automated | Unknown |
qa-update-sel | `ckan --plugin=ckanext-qa qa update_sel` | This command no longer exists. It was built to update datasets that had been recently modified and was meant to run daily. | Not Automated | Re-add, re-integrate, and automate |
qa-update | `ckan --plugin=ckanext-qa qa update` | Pulls all datasets, creates a QA job for each in a queue (using Celery), and runs through them slowly. Each QA job runs analysis on the dataset and its resources, storing the information and scoring them on various attributes. | Not Automated | Need to re-integrate and automate |
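
Because ckan-metrics-csv depends on fresh tracking data, the two commands have to run in order. A minimal sketch of a manual run, assuming shell access inside the CKAN app environment (on cloud.gov this could be wrapped in `cf run-task` as shown above):

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Refresh tracking_summary from tracking_raw and sync it into Solr.
#    This is the upstream CKAN command from the table above.
ckan tracking update

# 2. Only then build the monthly views CSV and upload it to S3.
ckan --plugin=ckanext-geodatagov geodatagov metrics-csv
```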