Commit
Merge pull request #29 from qonto/group-alerts-for-long-running-queries-to-avoid-alert-spam

feat: raise alert on long running queries per user instead of single pid
dcupif authored May 21, 2024
2 parents 438aa6f + 3716026 commit 0aff185
Showing 7 changed files with 57 additions and 52 deletions.
@@ -5,22 +5,21 @@ evaluation_interval: 1m

 tests:

-  - name: PostgreSQLLongRunningQuery
+  - name: PostgreSQLLongRunningQueries
     interval: 1m
     input_series:
       - series: 'pg_active_backend_duration_minutes{target="db1",datname="unittest",usename="test",pid="1234"}'
         values: 40+1x10
     alert_rule_test:
-      - alertname: PostgreSQLLongRunningQuery
+      - alertname: PostgreSQLLongRunningQueries
         eval_time: 1m
         exp_alerts:
           - exp_labels:
               target: db1
               datname: unittest
               usename: test
               severity: warning
-              pid: 1234
             exp_annotations:
-              summary: "Long running query on unittest of db1"
-              description: "test is running a long query on unittest of db1 with pid 1234"
-              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLLongRunningQuery"
+              summary: "Long running queries on unittest of db1 initiated by test"
+              description: "test is running long queries on unittest of db1"
+              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLLongRunningQueries"
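The test feeds the renamed rule a single series written in Prometheus's expanding notation. In that notation, `a+bxn` stands for n+1 samples that start at `a` and grow by `b` per `interval`. The Python sketch below expands `40+1x10` and compares it with the 30-minute threshold from the rule expression. It is a rough illustration only: the helper `expand_series` is hypothetical and handles a single `a+bxn` or `a-bxn` token, not the full notation. The series is already above 30 at 0m and at 1m. That is consistent with the test expecting the alert at `eval_time: 1m`, assuming the rule's `for: 1m` clause from values.yaml applies.

```python
import re

# Matches a single "a+bxn" or "a-bxn" token from Prometheus unit-test input_series.
_TOKEN = re.compile(r"(-?\d+(?:\.\d+)?)([+-])(\d+(?:\.\d+)?)x(\d+)")


def expand_series(token: str) -> list[float]:
    """Expand one 'a+bxn' / 'a-bxn' token into its n+1 samples (hypothetical helper)."""
    match = _TOKEN.fullmatch(token.strip())
    if match is None:
        raise ValueError(f"unsupported token: {token!r}")
    start = float(match.group(1))
    step = float(match.group(3))
    if match.group(2) == "-":
        step = -step
    count = int(match.group(4))
    return [start + i * step for i in range(count + 1)]


THRESHOLD_MINUTES = 30  # the "> 30" in the alert expression

samples = expand_series("40+1x10")  # one sample per 1m interval, starting at 0m
print(samples[0], samples[1], samples[-1], len(samples))  # 40.0 41.0 50.0 11

# Above the threshold at both 0m and 1m, so a 1m "for" clause has elapsed by 1m.
print(samples[0] > THRESHOLD_MINUTES and samples[1] > THRESHOLD_MINUTES)  # True
```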
8 changes: 4 additions & 4 deletions charts/prometheus-postgresql-alerts/values.yaml
@@ -77,14 +77,14 @@ rules:
       summary: "Physical replication slot is inactive"
       description: "{{ $labels.slot_name }} on {{ $labels.target }} is inactive"

-  PostgreSQLLongRunningQuery:
-    expr: max by (target, datname, usename, pid) (pg_active_backend_duration_minutes{usename!=""}) > 30
+  PostgreSQLLongRunningQueries:
+    expr: max by (target, datname, usename) (pg_active_backend_duration_minutes{usename!=""}) > 30
     for: 1m
     labels:
       severity: warning
     annotations:
-      summary: "Long running query on {{ $labels.datname }} of {{ $labels.target }}"
-      description: "{{ $labels.usename }} is running a long query on {{ $labels.datname }} of {{ $labels.target }} with pid {{ $labels.pid }}"
+      summary: "Long running queries on {{ $labels.datname }} of {{ $labels.target }} initiated by {{ $labels.usename }}"
+      description: "{{ $labels.usename }} is running long queries on {{ $labels.datname }} of {{ $labels.target }}"
     pintComments:
       - disable promql/series
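Removing `pid` from the `max by` clause turns one alert per backend into one alert per `(target, datname, usename)` group. The aggregation also drops the `pid` label, which is why the annotations and the unit test no longer reference it. The Python sketch below mimics that aggregation to show three per-pid alerts collapsing into a single per-user alert. The sample values and the `max_by_over_threshold` helper are illustrative assumptions, and the sketch ignores the `for: 1m` pending period.

```python
# Hypothetical samples of pg_active_backend_duration_minutes, one series per backend pid.
SAMPLES = [
    {"target": "db1", "datname": "app", "usename": "batch", "pid": "101", "value": 45.0},
    {"target": "db1", "datname": "app", "usename": "batch", "pid": "102", "value": 38.0},
    {"target": "db1", "datname": "app", "usename": "batch", "pid": "103", "value": 31.0},
    {"target": "db1", "datname": "app", "usename": "web", "pid": "201", "value": 5.0},
]
THRESHOLD_MINUTES = 30


def max_by_over_threshold(samples, by, threshold):
    """Rough model of `max by (<by>) (metric{usename!=""}) > threshold`."""
    grouped = {}
    for sample in samples:
        if sample["usename"] == "":
            continue  # mirrors the usename!="" label matcher
        key = tuple(sample[label] for label in by)
        grouped[key] = max(grouped.get(key, float("-inf")), sample["value"])
    # Keep only groups whose maximum exceeds the threshold: each one is an alert.
    return {key: value for key, value in grouped.items() if value > threshold}


per_pid = max_by_over_threshold(SAMPLES, ("target", "datname", "usename", "pid"), THRESHOLD_MINUTES)
per_user = max_by_over_threshold(SAMPLES, ("target", "datname", "usename"), THRESHOLD_MINUTES)
print(len(per_pid))  # 3
print(per_user)  # {('db1', 'app', 'batch'): 45.0}
```

The trade-off is visible in the output. The per-user result keeps only the worst duration for the group, so the pids involved have to be looked up separately when responding.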

45 changes: 45 additions & 0 deletions content/runbooks/postgresql/PostgreSQLLongRunningQueries.md
@@ -0,0 +1,45 @@
---
title: Long running queries
---

# PostgreSQLLongRunningQueries

## Meaning

Alert is triggered when SQL queries run for an extended period.

## Impact

- Block WAL file rotation

- Could block vacuum operations

- Could block other queries due to locks

- Could lead to replication lag on replica

## Diagnosis

1. Open `PostgreSQL server live` dashboard

1. Click on the queries to get details

## Mitigation

1. Identify the PIDs of the long running queries

{{< details title="SQL" open=false >}}
{{% sql "../postgresql/sql/list-long-running-transactions.sql" %}}
{{< /details >}}

1. Cancel the queries

{{% sql "sql/cancel_backend.sql" %}}

1. If queries do not get cancelled, kill them

{{% sql "sql/terminate_backend.sql" %}}

## Additional resources

n/a
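The alert no longer carries a `pid` label. Before cancelling anything, the responder therefore has to list the backends for the alerting `datname` and `usename`. This commit does not show the contents of `list-long-running-transactions.sql`, so the Python below is only a sketch of that filtering step over hypothetical `pg_stat_activity`-style rows. The `Backend` class, the use of `query_start` as the start time, and the 30-minute default are all assumptions, with the default chosen to match the alert threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Backend:
    """Hypothetical subset of a pg_stat_activity row."""
    pid: int
    datname: str
    usename: str
    state: str
    query_start: datetime


def long_running_pids(backends, datname, usename, now, threshold=timedelta(minutes=30)):
    """Pids of active backends for one (datname, usename) pair running longer than threshold."""
    return sorted(
        b.pid
        for b in backends
        if b.state == "active"
        and b.datname == datname
        and b.usename == usename
        and now - b.query_start > threshold
    )


now = datetime(2024, 5, 21, 12, 0, tzinfo=timezone.utc)
backends = [
    Backend(101, "unittest", "test", "active", now - timedelta(minutes=45)),
    Backend(102, "unittest", "test", "active", now - timedelta(minutes=35)),
    Backend(103, "unittest", "test", "idle", now - timedelta(minutes=90)),    # not active
    Backend(104, "unittest", "other", "active", now - timedelta(minutes=50)),  # other user
]
print(long_running_pids(backends, "unittest", "test", now))  # [101, 102]
```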
39 changes: 0 additions & 39 deletions content/runbooks/postgresql/PostgreSQLLongRunningQuery.md

This file was deleted.

2 changes: 1 addition & 1 deletion content/runbooks/postgresql/SQLExporterScrapingLimit.md
@@ -27,7 +27,7 @@ The monitoring system is degraded. SQL exporter does not collect SQL metrics, al
 1. Identify and kill heavy queries

 <details>
-<summary>How terminate a query?</summary>
+<summary>How to terminate queries?</summary>

 {{% sql "sql/terminate_backend.sql" %}}

2 changes: 1 addition & 1 deletion content/runbooks/postgresql/sql/cancel_backend.sql
@@ -11,4 +11,4 @@ SELECT
 state_change,
 query
 FROM pg_stat_activity
-WHERE pid = <replace_with_pid>;
+WHERE pid in ('<replace_with_pids>');
2 changes: 1 addition & 1 deletion content/runbooks/postgresql/sql/terminate_backend.sql
@@ -11,4 +11,4 @@ SELECT
 state_change,
 query
 FROM pg_stat_activity
-WHERE pid = <replace_with_pid>;
+WHERE pid in ('<replace_with_pids>');
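Both templates now quote the placeholder. A single pid, as in `('1234')`, still works because PostgreSQL resolves the quoted literal as an integer. Several pids inside one pair of quotes, as in `('1234, 5678')`, form one string that cannot be cast to an integer, so the query fails. Each pid has to be its own literal or a bare integer. The Python below is a minimal sketch of filling the placeholder with bare integers. The `render_pid_filter` helper is hypothetical and not part of the repository.

```python
PLACEHOLDER = "'<replace_with_pids>'"


def render_pid_filter(template: str, pids) -> str:
    """Swap the quoted placeholder for a comma-separated list of bare integer pids (hypothetical helper)."""
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the expected placeholder")
    # int() raises ValueError for strings such as "1; DROP TABLE x", so no extra SQL gets through.
    rendered = ", ".join(str(int(pid)) for pid in pids)
    if not rendered:
        raise ValueError("at least one pid is required")
    return template.replace(PLACEHOLDER, rendered)


template = "WHERE pid in ('<replace_with_pids>');"
print(render_pid_filter(template, [1234, "5678"]))  # WHERE pid in (1234, 5678);
```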
