-
Notifications
You must be signed in to change notification settings - Fork 14
DB Performance
Here are some SQL scripts that dramatically improve db performance.
We noticed that there was increasingly more performance issues with the CKAN catalog for IOOS. Initially during our first few months of testing, we didn't notice any problems. Eventually it got to the point where we couldn't purge without waiting more than 5 minutes for the server to respond and we couldn't clear a harvest without waiting about half as long.
We hit a critical point when the server (t2.medium EC-2 Instance) was incapable of performing a regular harvest. We tried some remedial repairs by shutting down all but one harvester and stopping all the jobs but one at a time. Even with this approach, we weren't able to bring the system to a working state.
We reached out to the community for guidance. I received some input from the CKAN IRC Channel and from the data.gov folks. Their recommendations let me make some database changes that immediately helped us.
We ended up having to wipe all the datasets out of the database because even after we cleaned up as much as we could the UI was still non-functional. We restored from a backup all of the user accounts and organizations and I recreated the harvests by hand.
After a few days of investigations, we learned that there are several problems.
-
There are a few dozen foreign key constraints in the database. PostgreSQL checks each constraint rigorously which causes some performance slow-downs. The folks at data.gov provided some excellent solutions for this.
-
Revisions. CKAN stores a copy of every change ever made to a dataset. So, every single harvest that we do, all the metadata gets stored as a revision if it was changed in the slightest. So, we're storing several gigabytes of old XML files that we don't need to be retaining as rows in our
harvest_object
table. -
Dangling records. We found several hundred thousand records in the database that weren't attached to any package or anything relevant. These were things like harvest objects, package extents, package extras etc. CKAN apparently lacks the internal mechanics to clean up these. I assume that these are created and left dangling by failed harvests or something that doesn't complete.
-
Critical indexes missing. We found that very important columns in several tables lacked indexes which makes general access slow.
-
Object Relational Mapping ORMs are great for programmers by the way they abstract the interface of a relational database as an object in an object-oriented environment (like Python). Unfortunately they may not always take full advantage of the relational database's ability to make fast performant queries. We noticed that several large transactional queries made by SQLAlchemy (CKAN's ORM of choice) were hugely inefficient.
The following SQL script drops the foreign key constraints that severely impacts performance.
ALTER TABLE "public"."activity_detail" DROP CONSTRAINT "activity_detail_activity_id_fkey";
ALTER TABLE "public"."authorization_group_user" DROP CONSTRAINT "authorization_group_user_authorization_group_id_fkey";
ALTER TABLE "public"."authorization_group_user" DROP CONSTRAINT "authorization_group_user_user_id_fkey";
ALTER TABLE "public"."dashboard" DROP CONSTRAINT "dashboard_user_id_fkey";
ALTER TABLE "public"."group" DROP CONSTRAINT "group_revision_id_fkey";
ALTER TABLE "public"."group_extra" DROP CONSTRAINT "group_extra_group_id_fkey";
ALTER TABLE "public"."group_extra" DROP CONSTRAINT "group_extra_revision_id_fkey";
ALTER TABLE "public"."group_extra_revision" DROP CONSTRAINT "group_extra_revision_continuity_id_fkey";
ALTER TABLE "public"."group_extra_revision" DROP CONSTRAINT "group_extra_revision_group_id_fkey";
ALTER TABLE "public"."group_extra_revision" DROP CONSTRAINT "group_extra_revision_revision_id_fkey";
ALTER TABLE "public"."group_revision" DROP CONSTRAINT "group_revision_continuity_id_fkey";
ALTER TABLE "public"."group_revision" DROP CONSTRAINT "group_revision_revision_id_fkey";
ALTER TABLE "public"."harvest_gather_error" DROP CONSTRAINT "harvest_gather_error_harvest_job_id_fkey";
ALTER TABLE "public"."harvest_job" DROP CONSTRAINT "harvest_job_source_id_fkey";
ALTER TABLE "public"."harvest_object" DROP CONSTRAINT "harvest_object_harvest_job_id_fkey";
ALTER TABLE "public"."harvest_object" DROP CONSTRAINT "harvest_object_harvest_source_id_fkey";
ALTER TABLE "public"."harvest_object_error" DROP CONSTRAINT "harvest_object_error_harvest_object_id_fkey";
ALTER TABLE "public"."harvest_object_extra" DROP CONSTRAINT "harvest_object_extra_harvest_object_id_fkey";
ALTER TABLE "public"."member" DROP CONSTRAINT "member_group_id_fkey";
ALTER TABLE "public"."member" DROP CONSTRAINT "member_revision_id_fkey";
ALTER TABLE "public"."member_revision" DROP CONSTRAINT "member_revision_continuity_id_fkey";
ALTER TABLE "public"."member_revision" DROP CONSTRAINT "member_revision_group_id_fkey";
ALTER TABLE "public"."member_revision" DROP CONSTRAINT "member_revision_revision_id_fkey";
ALTER TABLE "public"."package" DROP CONSTRAINT "package_revision_id_fkey";
ALTER TABLE "public"."package_extra" DROP CONSTRAINT "package_extra_package_id_fkey";
ALTER TABLE "public"."package_extra" DROP CONSTRAINT "package_extra_revision_id_fkey";
ALTER TABLE "public"."package_extra_revision" DROP CONSTRAINT "package_extra_revision_continuity_id_fkey";
ALTER TABLE "public"."package_extra_revision" DROP CONSTRAINT "package_extra_revision_package_id_fkey";
ALTER TABLE "public"."package_extra_revision" DROP CONSTRAINT "package_extra_revision_revision_id_fkey";
ALTER TABLE "public"."package_relationship" DROP CONSTRAINT "package_relationship_object_package_id_fkey";
ALTER TABLE "public"."package_relationship" DROP CONSTRAINT "package_relationship_revision_id_fkey";
ALTER TABLE "public"."package_relationship" DROP CONSTRAINT "package_relationship_subject_package_id_fkey";
ALTER TABLE "public"."package_relationship_revision" DROP CONSTRAINT "package_relationship_revision_continuity_id_fkey";
ALTER TABLE "public"."package_relationship_revision" DROP CONSTRAINT "package_relationship_revision_object_package_id_fkey";
ALTER TABLE "public"."package_relationship_revision" DROP CONSTRAINT "package_relationship_revision_revision_id_fkey";
ALTER TABLE "public"."package_relationship_revision" DROP CONSTRAINT "package_relationship_revision_subject_package_id_fkey";
ALTER TABLE "public"."package_revision" DROP CONSTRAINT "package_revision_continuity_id_fkey";
ALTER TABLE "public"."package_revision" DROP CONSTRAINT "package_revision_revision_id_fkey";
ALTER TABLE "public"."package_tag" DROP CONSTRAINT "package_tag_package_id_fkey";
ALTER TABLE "public"."package_tag" DROP CONSTRAINT "package_tag_revision_id_fkey";
ALTER TABLE "public"."package_tag" DROP CONSTRAINT "package_tag_tag_id_fkey";
ALTER TABLE "public"."package_tag_revision" DROP CONSTRAINT "package_tag_revision_continuity_id_fkey";
ALTER TABLE "public"."package_tag_revision" DROP CONSTRAINT "package_tag_revision_package_id_fkey";
ALTER TABLE "public"."package_tag_revision" DROP CONSTRAINT "package_tag_revision_revision_id_fkey";
ALTER TABLE "public"."package_tag_revision" DROP CONSTRAINT "package_tag_revision_tag_id_fkey";
ALTER TABLE "public"."rating" DROP CONSTRAINT "rating_package_id_fkey";
ALTER TABLE "public"."rating" DROP CONSTRAINT "rating_user_id_fkey";
ALTER TABLE "public"."resource" DROP CONSTRAINT "resource_revision_id_fkey";
ALTER TABLE "public"."resource_revision" DROP CONSTRAINT "resource_revision_continuity_id_fkey";
ALTER TABLE "public"."resource_revision" DROP CONSTRAINT "resource_revision_revision_id_fkey";
ALTER TABLE "public"."system_info" DROP CONSTRAINT "system_info_revision_id_fkey";
ALTER TABLE "public"."system_info_revision" DROP CONSTRAINT "system_info_revision_continuity_id_fkey";
ALTER TABLE "public"."system_info_revision" DROP CONSTRAINT "system_info_revision_revision_id_fkey";
ALTER TABLE "public"."tag" DROP CONSTRAINT "tag_vocabulary_id_fkey";
ALTER TABLE "public"."user_following_dataset" DROP CONSTRAINT "user_following_dataset_follower_id_fkey";
ALTER TABLE "public"."user_following_dataset" DROP CONSTRAINT "user_following_dataset_object_id_fkey";
ALTER TABLE "public"."user_following_group" DROP CONSTRAINT "user_following_group_group_id_fkey";
ALTER TABLE "public"."user_following_group" DROP CONSTRAINT "user_following_group_user_id_fkey";
ALTER TABLE "public"."user_following_user" DROP CONSTRAINT "user_following_user_follower_id_fkey";
ALTER TABLE "public"."user_following_user" DROP CONSTRAINT "user_following_user_object_id_fkey";
ALTER TABLE "public"."activity" DROP CONSTRAINT "activity_pkey";
ALTER TABLE "public"."activity_detail" DROP CONSTRAINT "activity_detail_pkey";
ALTER TABLE "public"."group_extra_revision" DROP CONSTRAINT "group_extra_revision_pkey";
ALTER TABLE "public"."group_revision" DROP CONSTRAINT "group_revision_pkey";
ALTER TABLE "public"."harvest_gather_error" DROP CONSTRAINT "harvest_gather_error_pkey";
ALTER TABLE "public"."harvest_job" DROP CONSTRAINT "harvest_job_pkey";
ALTER TABLE "public"."harvest_object" DROP CONSTRAINT "harvest_object_pkey";
ALTER TABLE "public"."harvest_object_error" DROP CONSTRAINT "harvest_object_error_pkey";
ALTER TABLE "public"."harvest_object_extra" DROP CONSTRAINT "harvest_object_extra_pkey";
ALTER TABLE "public"."harvest_source" DROP CONSTRAINT "harvest_source_pkey";
ALTER TABLE "public"."member" DROP CONSTRAINT "member_pkey";
ALTER TABLE "public"."member_revision" DROP CONSTRAINT "member_revision_pkey";
ALTER TABLE "public"."package_extra" DROP CONSTRAINT "package_extra_pkey";
ALTER TABLE "public"."package_extra_revision" DROP CONSTRAINT "package_extra_revision_pkey";
ALTER TABLE "public"."package_relationship_revision" DROP CONSTRAINT "package_relationship_revision_pkey";
ALTER TABLE "public"."package_revision" DROP CONSTRAINT "package_revision_pkey";
ALTER TABLE "public"."package_tag" DROP CONSTRAINT "package_tag_pkey";
ALTER TABLE "public"."package_tag_revision" DROP CONSTRAINT "package_tag_revision_pkey";
ALTER TABLE "public"."resource_revision" DROP CONSTRAINT "resource_revision_pkey";
ALTER TABLE "public"."revision" DROP CONSTRAINT "revision_pkey";
ALTER TABLE "public"."system_info_revision" DROP CONSTRAINT "system_info_revision_pkey";
BWA: I can't agree with dropping foreign key constraints entirely, if they break, they will likely cause the application to malfunction
The following SQL script creates indexes on several columns that are frequently accessed and used in JOIN-based queries.
create index idx_harvest_object_guid on harvest_object(guid);
create index idx_harvest_object_pkg_id on harvest_object(package_id);
create index idx_harvest_object_id on harvest_object_extra(harvest_object_id);
create index idx_harvest_object_err on harvest_object_error(harvest_object_id);
create index idx_package_extend_pkg_id on package_extent(package_id);
create index idx_package_extra_revision_pkg_id on package_extra_revision(package_id);
create index idx_package_extra_revision on package_extra_revision(id);
--special
create index idx_revision_id on revision(id);
--drop index idx_package_resource_pkg_id_resource_id;
create index idx_resource_name on resource(name);
--create index idx_resource_group_pkg_id on resource_group(package_id);
--create index idx_resource_group_revision_pkg_id on resource_group_revision(package_id);
--create index idx_resource_group_revision_rev_id on resource_group_revision(revision_id);
--create index idx_resource_group_revision on resource_group_revision(id);
create index idx_resource_revision on resource_revision(package_id);
--create index idx_resource_revision_res_grp_id on resource_revision(resource_group_id);
create index idx_member_revision_id on member_revision(id);
create index idx_member_revision_group_id on member_revision(group_id);
drop INDEX idx_package_extra_current;
drop INDEX idx_package_extra_period;
drop INDEX idx_package_extra_period_package;
drop index idx_extra_id_pkg_id;
drop INDEX idx_package_tag_id ;
drop INDEX idx_package_tag_current ;
drop INDEX idx_package_tag_revision_pkg_id_tag_id ;
drop INDEX idx_period_package_tag ;
--drop INDEX idx_resource_group_period ;
--drop INDEX idx_resource_group_period_package ;
--drop INDEX idx_resource_group_current ;
drop INDEX idx_resource_period;
drop INDEX idx_resource_current;
--drop INDEX idx_resource_period_resource_group;
drop index idx_package_group_period_package_group;
drop index "idx_package_group_current";
drop index idx_pkg_id;
drop index idx_pkg_name;
drop index idx_pkg_rev_id;
drop index idx_pkg_sid;
drop index idx_pkg_slname;
drop index idx_pkg_sname;
drop index idx_pkg_srev_id;
drop index idx_pkg_stitle;
drop index idx_pkg_suname;
drop index idx_pkg_title;
drop index idx_pkg_uname;
BWA: Consider using CREATE INDEX CONCURRENTLY
if running on a production server, as it will not block insertion and deletion of rows, although it is more resource intensive
The following SQL script can be used to purge the database of packages that are marked for deletion.
begin;
update package set state = 'to_delete' where state <> 'active';
update package set state = 'to_delete' where owner_org is null;
delete from resource_revision where package_id in (select id from package where state = 'to_delete' );
delete from package_tag_revision where package_id in (select id from package where state = 'to_delete');
delete from member_revision where table_id in (select id from package where state = 'to_delete');
delete from package_extra_revision where package_id in (select id from package where state = 'to_delete');
delete from package_revision where id in (select id from package where state = 'to_delete');
delete from package_tag where package_id in (select id from package where state = 'to_delete');
delete from resource_view where resource_id in (select id from resource where package_id in (select id from package where state = 'to_delete'));
delete from resource where package_id in (select id from package where state = 'to_delete');
delete from package_extra where package_id in (select id from package where state = 'to_delete');
delete from member where table_id in (select id from package where state = 'to_delete');
delete from harvest_object_error hoe using harvest_object ho where ho.id = hoe.harvest_object_id and package_id in (select id from package where state = 'to_delete');
delete from harvest_object_extra hoe using harvest_object ho where ho.id = hoe.harvest_object_id and package_id in (select id from package where state = 'to_delete');
delete from harvest_object where package_id in (select id from package where state = 'to_delete');
delete from package_extent where package_id in (select id from package where state = 'to_delete');
delete from package where id in (select id from package where state = 'to_delete');
commit;
BWA: USING IN
like a JOIN
creates is slower query plans in many cases, especially if there are many rows present
The following SQL script removes all package_extent
records that don't
actively belong to any packages.
begin;
delete from package_extent where package_id not in (select id from package);
commit;
BWA: similar to above comment, but NOT EXISTS
is usually the preferred form.
The following SQL script removes all activity records from the database. Activity records are the results of users clicking around in the UI.
begin;
delete from activity_detail cascade;
delete from activity;
commit;
BWA: Consider using TRUNCATE if concurrency/MVCC issues are not a problem. It runs much faster than DELETE. The cascade here is an alias of table activity_detail -- it isn't an actual keyword in the DELETE grammar specification
The following SQL script was critical in reducing the overall size of the database and removing the revision records for harvest objects.
begin;
delete from harvest_object where id not in (
select distinct on(ho.package_id) ho.id from harvest_object ho
join package p on p.id = ho.package_id
order by ho.package_id, ho.import_started desc
);
delete from harvest_object_error where harvest_object_id not in (select id from harvest_object);
delete from harvest_object_extra where harvest_object_id not in (select id from harvest_object);
commit;
The following script removes all old revisions of package_extras
which are the
metadata attributes to packages. These get bloated quickly because we keep a
copy of every metadata attribute change over time, these are simple things like
the temporal extents of the dataset. For packages that correspond to realtime
datasets, these would get 10-100 rows per harvest.
begin;
delete from revision where id in (select revision_id from package_extra_revision where current is not true);
delete from package_extra_revision where current is not true;
commit;
The following script removes all old resources
, which are the links to
external resources like OPeNDAP, HTML, ERDDAP etc. As mentioned with the
dangling records bullet, we found that several hundred thousand resource records
had no attached package.
begin;
delete from revision where id in (select revision_id from resource_revision where current is not true);
delete from resource_revision where current is not true;
commit;
The following SQL script removes old revisions of packages.
begin;
delete from revision where id in (select revision_id from package_revision where current is not true);
delete from package_revision where current is not true;
commit;
The following SQL script removes tags that aren't currently being referenced.
begin;
delete from tag where id in (select tag_id from package_tag_revision where current is not true);
delete from package_tag where tag_id in (select tag_id from package_tag_revision where current is not true);
delete from revision where id in (select revision_id from package_tag_revision where current is not true);
delete from package_tag_revision where current is not true;
commit;
The database was tuned with various memory. Server settings were updated to log long running queries and any failed statements. It is important to note that due to PostgreSQL's MVCC model, old records are not immediately cleared out when updated or deleted. These old row versions require VACUUM
ing. I (BWA) noticed there were quite a few of these old row versions removed when running VACUUM ANALYZE VERBOSE
, so it may be necessary to tune the autovacuum daemon settings in postgresql.conf