Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC-8878 Enhance Essential Metrics with Alert guidance #18537

Merged
merged 44 commits into from
Sep 16, 2024
Merged

Conversation

florence-crl
Copy link
Contributor

@florence-crl florence-crl commented May 10, 2024

Fixes DOC-8878

(1) Added essential-alerts.md include file, a compilation of alerts from (a) a-entin's repo, (b) alertmanager alerts in public docs, and (c) Skills Taxonomy private doc.
(2) In essential-metrics.md, (a) added anchors to metrics used in essential-alerts.md, (b) added link to essential alerts.
(3) Added essential-alerts-self-hosted.md to display all the essential alerts.
(4) Added essential-alerts-advanced.md to display a subset of the essential alerts applicable to advanced clusters.
(5) In self-hosted-deployments.json and cloud-deployments.json, added links to the corresponding essential alerts pages.

Rendered preview:

Copy link

github-actions bot commented May 10, 2024

Files changed:

Copy link

netlify bot commented May 10, 2024

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit a759d83
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-api-docs/deploys/667ef97bf7c5030008bd544e

Copy link

netlify bot commented May 10, 2024

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit a759d83
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/667ef97bb37a340008dda37e

Copy link

netlify bot commented May 10, 2024

Netlify Preview

Name Link
🔨 Latest commit d92a8e4
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/66e880d2c72ba100082da6bb
😎 Deploy Preview https://deploy-preview-18537--cockroachdb-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@florence-crl florence-crl changed the title DOC-8878 DOC-8878 Enhance Essential Metrics with Alert guidance Jun 10, 2024
@florence-crl
Copy link
Contributor Author

@andyyang890 OR @wenyihu6: please review the alert for Changefeed experiencing high latency.

@dikshant: please review the alerts for:

…alerts on storage.write-stalls and schedules.BACKUP.failed.

(b) fixed case of headings to sentence case.
Copy link
Contributor Author

@florence-crl florence-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
@florence-crl florence-crl requested review from andyyang890 and removed request for wenyihu6 June 28, 2024 18:07
Copy link

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changefeed experiencing high latency section LGTM! Will defer to others for final approval.

Copy link

@kevin-v-ngo kevin-v-ngo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great Florence! Excited to get this published.

Comment on lines 336 to 383
## SQL

### Node not executing SQL

Send an alert when a node is not executing SQL despite having connections. `sql.conns` shows the number of connections as well as the distribution, or balancing, of connections across cluster nodes. An imbalance can lead to nodes becoming overloaded.

**Metric**
<br>[`sql.conns`]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#sql-conns)
<br>`sql.query.count`

**Rule**
<br>Set alerts for each node:
<br>WARNING: `sql.conns` greater than `0` while `sql.query.count` equals `0`

**Action**

- Refer to [Connection Pooling]({% link {{ page.version.version }}/connection-pooling.md %}).

### SQL query failure

Send an alert when the query failure count exceeds a user-determined threshold based on their application's SLA.

**Metric**
<br>[`sql.failure.count`]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#sql-failure-count)

**Rule**
<br>WARNING: `sql.failure.count` is greater than a threshold (based on the user’s application SLA)

**Action**

- Use the [**Insights** page]({% link {{ page.version.version }}/ui-insights-page.md %}) to find failed executions with their error code to troubleshoot or use application-level logs, if instrumented, to determine the cause of error.

### SQL queries experiencing high latency

Send an alert when the query latency exceeds a user-determined threshold based on their application’s SLA.

**Metric**
<br>[`sql.service.latency`]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#sql-service-latency)
<br>[`sql.conn.latency`]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#sql-conn-latency)

**Rule**
<br>WARNING: (p99 or p90 of `sql.service.latency` plus average of `sql.conn.latency`) is greater than a threshold (based on the user’s application SLA)

**Action**

- Apply the time range of the alert to the [**SQL Activity** pages]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#sql-activity-pages) to investigate. Use the [**Statements** page]({% link {{ page.version.version }}/ui-statements-page.md %}) P90 Latency and P99 latency columns to correlate [statement fingerprints]({% link {{ page.version.version }}/ui-statements-page.md %}#sql-statement-fingerprints) with this alert.

{% if include.deployment == 'self-hosted' %}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @mgartner do you have any thoughts/concerns on highlighting these as essential metrics to monitor for SQL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mgartner , when you have a chance, would you be able to review this section: SQL queries experiencing high latency?

@florence-crl florence-crl changed the base branch from main to cloud-2.0 September 10, 2024 14:19
In essential-metrics.md, took cloud-2.0 version and manually added links to metrics used by essential-alerts.md.

Renamed essential-alerts-dedicated.md to essential-alerts-advanced.md.

In essential-alerts.md, (a) replaced dedicated with advanced, (b) replaced links to essential-metrics-self-hosted.md with essential-metrics-{{ include.deployment }}.md.

In cloud-deployments.json, added link to essential-alerts-advanced.md.
…th link to Essential Alerts.

copied v24.1 changed files to v24.2
@florence-crl
Copy link
Contributor Author

Hi @kathancox, I have made the cloud-2.0 and v24.2 changes I had planned. Please review this PR at your earliest convenience. The cloud-2.0 docs release is Sept. 25, so please keep this date in mind as this PR needs to be merged before then.

Copy link
Contributor

@kathancox kathancox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great Florence! Pending your review of the comments, it looks good to me.

src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@florence-crl florence-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
src/current/_includes/v24.1/essential-alerts.md Outdated Show resolved Hide resolved
@florence-crl florence-crl merged commit 1c06a8b into cloud-2.0 Sep 16, 2024
4 checks passed
@florence-crl florence-crl deleted the DOC-8878 branch September 16, 2024 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants