From 0384ef270900a53702fa47d2ff3c10afff07b297 Mon Sep 17 00:00:00 2001
From: Aolin
Date: Fri, 1 Mar 2024 17:45:31 +0800
Subject: [PATCH 1/4] This is an automated cherry-pick of #16380

Signed-off-by: ti-chi-bot
---
 ticdc/ticdc-alert-rules.md | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/ticdc/ticdc-alert-rules.md b/ticdc/ticdc-alert-rules.md
index 44d586fbce8f0..d3fb94142d6bd 100644
--- a/ticdc/ticdc-alert-rules.md
+++ b/ticdc/ticdc-alert-rules.md
@@ -16,7 +16,11 @@ For critical alerts, you need to pay close attention to abnormal monitoring metr
 
 - Alert rule:
 
+<<<<<<< HEAD
     (time() - ticdc_processor_checkpoint_ts / 1000) > 600
+=======
+    `(time() - ticdc_owner_checkpoint_ts / 1000) > 600`
+>>>>>>> 5f3a210d15 (ticdc: add ticdc_changefeed_failed alert rule (#16380))
 
 - Description:
 
@@ -24,13 +28,17 @@ For critical alerts, you need to pay close attention to abnormal monitoring metr
 
 - Solution:
 
-    See [TiCDC Handle Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
+    See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
 
 ## `cdc_resolvedts_high_delay`
 
 - Alert rule:
 
+<<<<<<< HEAD
     (time() - ticdc_processor_resolved_ts / 1000) > 300
+=======
+    `(time() - ticdc_owner_resolved_ts / 1000) > 300`
+>>>>>>> 5f3a210d15 (ticdc: add ticdc_changefeed_failed alert rule (#16380))
 
 - Description:
 
@@ -38,7 +46,21 @@ For critical alerts, you need to pay close attention to abnormal monitoring metr
 
 - Solution:
 
-    See [TiCDC Handle Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
+    See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
+
+### `ticdc_changefeed_failed`
+
+- Alert rule:
+
+    `(max_over_time(ticdc_owner_status[1m]) == 2) > 0`
+
+- Description:
+
+    A replication task encounters an unrecoverable error and enters the failed state.
+
+- Solution:
+
+    This alert is similar to replication interruption. See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
 
 ### `ticdc_processor_exit_with_error_count`
 
@@ -52,7 +74,7 @@ For critical alerts, you need to pay close attention to abnormal monitoring metr
 
 - Solution:
 
-    See [TiCDC Handle Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
+    See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
 
 ## Warning alerts
 
@@ -112,7 +134,7 @@ Warning alerts are a reminder for an issue or error.
 
 - Solution:
 
-    See [TiCDC Handle Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
+    See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
 
 ### `ticdc_puller_entry_sorter_sort_bucket`
 
@@ -146,7 +168,7 @@ Warning alerts are a reminder for an issue or error.
 
 - Alert rule:
 
-    `changes(tikv_cdc_min_resolved_ts[1m]) < 1 and ON (instance) tikv_cdc_region_resolve_status{status="resolved"} > 0`
+    `changes(tikv_cdc_min_resolved_ts[1m]) < 1 and ON (instance) tikv_cdc_region_resolve_status{status="resolved"} > 0 and ON (instance) tikv_cdc_captured_region_total > 0`
 
 - Description:
 
From bdc7645cea9777a66ae542f8712650ee82c7712c Mon Sep 17 00:00:00 2001
From: Aolin
Date: Mon, 4 Mar 2024 13:56:11 +0800
Subject: [PATCH 2/4] resolve conflicts

---
 ticdc/ticdc-alert-rules.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/ticdc/ticdc-alert-rules.md b/ticdc/ticdc-alert-rules.md
index d3fb94142d6bd..b4317e0cc8261 100644
--- a/ticdc/ticdc-alert-rules.md
+++ b/ticdc/ticdc-alert-rules.md
@@ -34,11 +34,7 @@ For critical alerts, you need to pay close attention to abnormal monitoring metr
 
 - Alert rule:
 
-<<<<<<< HEAD
-    (time() - ticdc_processor_resolved_ts / 1000) > 300
-=======
     `(time() - ticdc_owner_resolved_ts / 1000) > 300`
->>>>>>> 5f3a210d15 (ticdc: add ticdc_changefeed_failed alert rule (#16380))
 
 - Description:
 
From 27b536adb350d1a57867a707392ac15bf3239b98 Mon Sep 17 00:00:00 2001
From: Aolin
Date: Mon, 4 Mar 2024 13:57:01 +0800
Subject: [PATCH 3/4] Apply suggestions from code review

---
 ticdc/ticdc-alert-rules.md | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/ticdc/ticdc-alert-rules.md b/ticdc/ticdc-alert-rules.md
index b4317e0cc8261..187d5f35835b8 100644
--- a/ticdc/ticdc-alert-rules.md
+++ b/ticdc/ticdc-alert-rules.md
@@ -16,11 +16,7 @@ For critical alerts, you need to pay close attention to abnormal monitoring metr
 
 - Alert rule:
 
-<<<<<<< HEAD
-    (time() - ticdc_processor_checkpoint_ts / 1000) > 600
-=======
-    `(time() - ticdc_owner_checkpoint_ts / 1000) > 600`
->>>>>>> 5f3a210d15 (ticdc: add ticdc_changefeed_failed alert rule (#16380))
+    `(time() - ticdc_processor_checkpoint_ts / 1000) > 600`
 
 - Description:
 
@@ -34,7 +30,7 @@ For critical alerts, you need to pay close attention to abnormal monitoring metr
 
 - Alert rule:
 
-    `(time() - ticdc_owner_resolved_ts / 1000) > 300`
+    `(time() - ticdc_processor_checkpoint_ts / 1000) > 300`
 
 - Description:
 
From 5a4dc087ac21d10c4bf13657ca1bf62fd31f7eb6 Mon Sep 17 00:00:00 2001
From: Aolin
Date: Mon, 4 Mar 2024 13:57:23 +0800
Subject: [PATCH 4/4] Apply suggestions from code review

---
 ticdc/ticdc-alert-rules.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/ticdc-alert-rules.md b/ticdc/ticdc-alert-rules.md
index 187d5f35835b8..d0452a4ebb587 100644
--- a/ticdc/ticdc-alert-rules.md
+++ b/ticdc/ticdc-alert-rules.md
@@ -30,7 +30,7 @@ For critical alerts, you need to pay close attention to abnormal monitoring metr
 
 - Alert rule:
 
-    `(time() - ticdc_processor_checkpoint_ts / 1000) > 300`
+    `(time() - ticdc_processor_resolved_ts / 1000) > 300`
 
 - Description: