Merge branch 'main' into DOC-10686_innovation

cockroachdb · Aug 6, 2024 · eaf7f88
2 parents fcd9463 + 82e2e1f
commit eaf7f88
Showing 40 changed files with 95 additions and 11 deletions.
diff --git a/src/current/_includes/releases/v24.1/v24.1.0-alpha.5.md b/src/current/_includes/releases/v24.1/v24.1.0-alpha.5.md
@@ -6,7 +6,7 @@ Release Date: April 1, 2024
 
 <h3 id="v24-1-0-alpha-5-{{-site.data.products.enterprise-}}-edition-changes">{{ site.data.products.enterprise }} edition changes</h3>
 
-- [Changefeeds]({% link v24.1/change-data-capture-overview.md %}) now default to evenly distributing their work across all replicas, including followers, regardless of leaseholder placement. To disable this behavior, set the [cluster setting]({% link v24.1/cluster-settings.md %}) `changefeed.random_replica_selection.enabled ` to `false`. If disabled, changefeed planning reverts to its previous behavior for distributing work. [#120077][#120077]
+- [Changefeeds]({% link v24.1/change-data-capture-overview.md %}) now default to evenly distributing their work across all replicas, including followers, regardless of leaseholder placement. On upgrade to v24.1, running changefeed jobs will be restarted automatically as part of the upgrade process and will default to distributing work across replicas. To disable this behavior, set the [cluster setting]({% link v24.1/cluster-settings.md %}) `changefeed.random_replica_selection.enabled` to `false`. If disabled, changefeed planning reverts to its previous behavior for distributing work. [#120077][#120077]
 - When [physical cluster replication]({% link v24.1/physical-cluster-replication-overview.md %}) is enabled, the output of the `SHOW VIRTUAL CLUSTER ... WITH REPLICATION STATUS` command now displays replication lag. [#120782][#120782]
 - When [physical cluster replication]({% link v24.1/physical-cluster-replication-overview.md %}) is enabled, the output of the `SHOW VIRTUAL CLUSTER WITH REPLICATION STATUS to 1` command has changed:
     - The output no longer displays `replication_job_id` or `service_mode` return fields.

diff --git a/src/current/_includes/v23.1/cdc/work-distribution-setting.md b/src/current/_includes/v23.1/cdc/work-distribution-setting.md
@@ -0,0 +1 @@
+{% include_cached new-in.html version="v23.1.23" %} You can enable the `changefeed.random_replica_selection.enabled` cluster setting to change how a changefeed distributes work across the cluster. With `changefeed.random_replica_selection.enabled` set to `true`, the job will evenly distribute changefeed work across the cluster by assigning it to any [replica]({% link {{ page.version.version }}/architecture/replication-layer.md %}) for a particular range. For `changefeed.random_replica_selection.enabled` to take effect on existing changefeed jobs, enable the cluster setting and then [pause]({% link {{ page.version.version }}/pause-job.md %}) and [resume]({% link {{ page.version.version }}/resume-job.md %}) those changefeeds.
diff --git a/src/current/_includes/v23.1/known-limitations/cdc-queries-column-families.md b/src/current/_includes/v23.1/known-limitations/cdc-queries-column-families.md
@@ -0,0 +1 @@
+Creating a changefeed with {% if page.name == "cdc-queries.md" %} CDC queries {% else %} [CDC queries]({% link {{ page.version.version }}/cdc-queries.md %}) {% endif %} on tables with more than one {% if page.name == "changefeeds-on-tables-with-column-families.md" %} column family {% else %} [column family]({% link {{ page.version.version }}/changefeeds-on-tables-with-column-families.md %}) {% endif %} is not supported. [Tracking GitHub Issue](https://github.com/cockroachdb/cockroach/issues/127761)
diff --git a/src/current/_includes/v23.1/known-limitations/changefeed-column-family-message.md b/src/current/_includes/v23.1/known-limitations/changefeed-column-family-message.md
@@ -0,0 +1 @@
+When you create a changefeed on a table with more than one {% if page.name == "changefeeds-on-tables-with-column-families.md" %} column family {% else %} [column family]({% link {{ page.version.version }}/changefeeds-on-tables-with-column-families.md %}) {% endif %}, the changefeed will emit messages per column family in separate streams. As a result, [changefeed messages]({% link {{ page.version.version }}/changefeed-messages.md %}) for different column families will arrive at the [sink]({% link {{ page.version.version }}/changefeed-sinks.md %}) under separate topics. [Tracking GitHub Issue](https://github.com/cockroachdb/cockroach/issues/127736)
diff --git a/src/current/_includes/v23.2/cdc/work-distribution-setting.md b/src/current/_includes/v23.2/cdc/work-distribution-setting.md
@@ -0,0 +1 @@
+{% include_cached new-in.html version="v23.2.7" %} You can enable the `changefeed.random_replica_selection.enabled` cluster setting to change how a changefeed distributes work across the cluster. With `changefeed.random_replica_selection.enabled` set to `true`, the job will evenly distribute changefeed work across the cluster by assigning it to any [replica]({% link {{ page.version.version }}/architecture/replication-layer.md %}) for a particular range. For `changefeed.random_replica_selection.enabled` to take effect on existing changefeed jobs, enable the cluster setting and then [pause]({% link {{ page.version.version }}/pause-job.md %}) and [resume]({% link {{ page.version.version }}/resume-job.md %}) those changefeeds.
diff --git a/src/current/_includes/v23.2/known-limitations/cdc-queries-column-families.md b/src/current/_includes/v23.2/known-limitations/cdc-queries-column-families.md
@@ -0,0 +1 @@
+Creating a changefeed with {% if page.name == "cdc-queries.md" %} CDC queries {% else %} [CDC queries]({% link {{ page.version.version }}/cdc-queries.md %}) {% endif %} on tables with more than one {% if page.name == "changefeeds-on-tables-with-column-families.md" %} column family {% else %} [column family]({% link {{ page.version.version }}/changefeeds-on-tables-with-column-families.md %}) {% endif %} is not supported. [Tracking GitHub Issue](https://github.com/cockroachdb/cockroach/issues/127761)
diff --git a/src/current/_includes/v23.2/known-limitations/changefeed-column-family-message.md b/src/current/_includes/v23.2/known-limitations/changefeed-column-family-message.md
@@ -0,0 +1 @@
+When you create a changefeed on a table with more than one {% if page.name == "changefeeds-on-tables-with-column-families.md" %} column family {% else %} [column family]({% link {{ page.version.version }}/changefeeds-on-tables-with-column-families.md %}) {% endif %}, the changefeed will emit messages per column family in separate streams. As a result, [changefeed messages]({% link {{ page.version.version }}/changefeed-messages.md %}) for different column families will arrive at the [sink]({% link {{ page.version.version }}/changefeed-sinks.md %}) under separate topics. [Tracking GitHub Issue](https://github.com/cockroachdb/cockroach/issues/127736)
diff --git a/src/current/_includes/v24.1/essential-metrics.md b/src/current/_includes/v24.1/essential-metrics.md
@@ -50,6 +50,20 @@ The **Usage** column explains why each metric is important to visualize in a cus
 | admission.wait_durations.kv-stores-p75              | {% if include.deployment == 'self-hosted' %}admission.wait.durations.kv_stores |{% elsif include.deployment == 'dedicated' %}NOT AVAILABLE |{% endif %} Wait time durations for requests that waited                 | This metric shows if I/O utilization-based admission control feature is working effectively or potentially overaggressive. This is a latency histogram of how much delay was added to the workload due to throttling by I/O control. If observing over 100ms waits for over 5 seconds while there was excess I/O capacity available, then the admission control is overly aggressive. |
 | sys.runnable.goroutines.per.cpu                     | {% if include.deployment == 'self-hosted' %}sys.runnable.goroutines.per_cpu |{% elsif include.deployment == 'dedicated' %}NOT AVAILABLE |{% endif %} Average number of goroutines that are waiting to run, normalized by number of cores | If this metric has a value over 30, it indicates a CPU overload. If the condition lasts a short period of time (a few seconds), the database users are likely to experience inconsistent response times. If the condition persists for an extended period of time (tens of seconds, or minutes) the cluster may start developing stability issues. Review [CPU planning]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#cpu).
 
+{% if include.deployment == 'self-hosted' %}
+## Network
+
+| <div style="width:225px">CockroachDB Metric Name</div> | <div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |<div style="width:150px">Description</div>| Usage |
+| ------------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| rpc.connection.avg_round_trip_latency                  | rpc.connection.avg_round_trip_latency                     | Sum of exponentially weighted moving average of round-trip latencies, as measured through a gRPC RPC. Dividing this gauge by `rpc.connection.healthy` gives an approximation of average latency, but the top-level round-trip-latency histogram is more useful. Instead, users should consult the label families of this metric if they are available (which requires Prometheus and the cluster setting `server.child_metrics.enabled`); these provide per-peer moving averages. This metric does not track failed connections. A failed connection's contribution is reset to zero. | This metric is helpful in understanding general network issues outside of CockroachDB that could be impacting the user’s workload. |
+| rpc.connection.failures                                | rpc.connection.failures.count                             | Counter of failed connections. This includes both the event in which a healthy connection terminates as well as unsuccessful reconnection attempts. Connections that are terminated as part of local node shutdown are excluded. Decommissioned peers are excluded. | See Description. |
+| rpc.connection.healthy                                 | rpc.connection.healthy                                    | Gauge of current connections in a healthy state (i.e., bidirectionally connected and heartbeating). | See Description. |
+| rpc.connection.healthy_nanos                           | rpc.connection.healthy_nanos                              | Gauge of nanoseconds of healthy connection time. On the Prometheus endpoint scraped when the cluster setting `server.child_metrics.enabled` is set, this gauge allows you to see the duration for which a given peer has been connected in a healthy state. | This can be useful for monitoring the stability and health of connections within your CockroachDB cluster. |
+| rpc.connection.heartbeats                              | rpc.connection.heartbeats.count                           | Counter of successful heartbeats. | See Description. |
+| rpc.connection.unhealthy                               | rpc.connection.unhealthy                                  | Gauge of current connections in an unhealthy state (not bidirectionally connected or heartbeating). | If the value of this metric is greater than 0, this could indicate a network partition. |
+| rpc.connection.unhealthy_nanos                         | rpc.connection.unhealthy_nanos                            | Gauge of nanoseconds of unhealthy connection time. On the Prometheus endpoint scraped when the cluster setting `server.child_metrics.enabled` is set, this gauge allows you to see the duration for which a given peer has been unreachable. | If this duration is greater than 0, this could indicate how long a network partition has been occurring. |
+{% endif %}
+
 {% if include.deployment == 'self-hosted' %}
 ## Expiration of license and certificates
 

diff --git a/src/current/_includes/v24.1/known-limitations/cdc-queries-column-families.md b/src/current/_includes/v24.1/known-limitations/cdc-queries-column-families.md
@@ -0,0 +1 @@
+Creating a changefeed with {% if page.name == "cdc-queries.md" %} CDC queries {% else %} [CDC queries]({% link {{ page.version.version }}/cdc-queries.md %}) {% endif %} on tables with more than one {% if page.name == "changefeeds-on-tables-with-column-families.md" %} column family {% else %} [column family]({% link {{ page.version.version }}/changefeeds-on-tables-with-column-families.md %}) {% endif %} is not supported. [#127761](https://github.com/cockroachdb/cockroach/issues/127761)
diff --git a/src/current/_includes/v24.1/known-limitations/changefeed-column-family-message.md b/src/current/_includes/v24.1/known-limitations/changefeed-column-family-message.md
@@ -0,0 +1 @@
+When you create a changefeed on a table with more than one {% if page.name == "changefeeds-on-tables-with-column-families.md" %} column family {% else %} [column family]({% link {{ page.version.version }}/changefeeds-on-tables-with-column-families.md %}) {% endif %}, the changefeed will emit messages per column family in separate streams. As a result, [changefeed messages]({% link {{ page.version.version }}/changefeed-messages.md %}) for different column families will arrive at the [sink]({% link {{ page.version.version }}/changefeed-sinks.md %}) under separate topics. [#127736](https://github.com/cockroachdb/cockroach/issues/127736)
diff --git a/src/current/_includes/v24.2/essential-metrics.md b/src/current/_includes/v24.2/essential-metrics.md
@@ -50,6 +50,20 @@ The **Usage** column explains why each metric is important to visualize in a cus
 | admission.wait_durations.kv-stores-p75              | {% if include.deployment == 'self-hosted' %}admission.wait.durations.kv_stores |{% elsif include.deployment == 'dedicated' %}NOT AVAILABLE |{% endif %} Wait time durations for requests that waited                 | This metric shows if I/O utilization-based admission control feature is working effectively or potentially overaggressive. This is a latency histogram of how much delay was added to the workload due to throttling by I/O control. If observing over 100ms waits for over 5 seconds while there was excess I/O capacity available, then the admission control is overly aggressive. |
 | sys.runnable.goroutines.per.cpu                     | {% if include.deployment == 'self-hosted' %}sys.runnable.goroutines.per_cpu |{% elsif include.deployment == 'dedicated' %}NOT AVAILABLE |{% endif %} Average number of goroutines that are waiting to run, normalized by number of cores | If this metric has a value over 30, it indicates a CPU overload. If the condition lasts a short period of time (a few seconds), the database users are likely to experience inconsistent response times. If the condition persists for an extended period of time (tens of seconds, or minutes) the cluster may start developing stability issues. Review [CPU planning]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#cpu).
 
+{% if include.deployment == 'self-hosted' %}
+## Network
+
+| <div style="width:225px">CockroachDB Metric Name</div> | <div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |<div style="width:150px">Description</div>| Usage |
+| ------------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| rpc.connection.avg_round_trip_latency                  | rpc.connection.avg_round_trip_latency                     | Sum of exponentially weighted moving average of round-trip latencies, as measured through a gRPC RPC. Dividing this gauge by `rpc.connection.healthy` gives an approximation of average latency, but the top-level round-trip-latency histogram is more useful. Instead, users should consult the label families of this metric if they are available (which requires Prometheus and the cluster setting `server.child_metrics.enabled`); these provide per-peer moving averages. This metric does not track failed connections. A failed connection's contribution is reset to zero. | This metric is helpful in understanding general network issues outside of CockroachDB that could be impacting the user’s workload. |
+| rpc.connection.failures                                | rpc.connection.failures.count                             | Counter of failed connections. This includes both the event in which a healthy connection terminates as well as unsuccessful reconnection attempts. Connections that are terminated as part of local node shutdown are excluded. Decommissioned peers are excluded. | See Description. |
+| rpc.connection.healthy                                 | rpc.connection.healthy                                    | Gauge of current connections in a healthy state (i.e., bidirectionally connected and heartbeating). | See Description. |
+| rpc.connection.healthy_nanos                           | rpc.connection.healthy_nanos                              | Gauge of nanoseconds of healthy connection time. On the Prometheus endpoint scraped when the cluster setting `server.child_metrics.enabled` is set, this gauge allows you to see the duration for which a given peer has been connected in a healthy state. | This can be useful for monitoring the stability and health of connections within your CockroachDB cluster. |
+| rpc.connection.heartbeats                              | rpc.connection.heartbeats.count                           | Counter of successful heartbeats. | See Description. |
+| rpc.connection.unhealthy                               | rpc.connection.unhealthy                                  | Gauge of current connections in an unhealthy state (not bidirectionally connected or heartbeating). | If the value of this metric is greater than 0, this could indicate a network partition. |
+| rpc.connection.unhealthy_nanos                         | rpc.connection.unhealthy_nanos                            | Gauge of nanoseconds of unhealthy connection time. On the Prometheus endpoint scraped when the cluster setting `server.child_metrics.enabled` is set, this gauge allows you to see the duration for which a given peer has been unreachable. | If this duration is greater than 0, this could indicate how long a network partition has been occurring. |
+{% endif %}
+
 {% if include.deployment == 'self-hosted' %}
 ## Expiration of license and certificates
 

diff --git a/src/current/_includes/v24.2/known-limitations/cdc-queries-column-families.md b/src/current/_includes/v24.2/known-limitations/cdc-queries-column-families.md
@@ -0,0 +1 @@
+Creating a changefeed with {% if page.name == "cdc-queries.md" %} CDC queries {% else %} [CDC queries]({% link {{ page.version.version }}/cdc-queries.md %}) {% endif %} on tables with more than one {% if page.name == "changefeeds-on-tables-with-column-families.md" %} column family {% else %} [column family]({% link {{ page.version.version }}/changefeeds-on-tables-with-column-families.md %}) {% endif %} is not supported. [#127761](https://github.com/cockroachdb/cockroach/issues/127761)
diff --git a/src/current/_includes/v24.2/known-limitations/changefeed-column-family-message.md b/src/current/_includes/v24.2/known-limitations/changefeed-column-family-message.md
@@ -0,0 +1 @@
+When you create a changefeed on a table with more than one {% if page.name == "changefeeds-on-tables-with-column-families.md" %} column family {% else %} [column family]({% link {{ page.version.version }}/changefeeds-on-tables-with-column-families.md %}) {% endif %}, the changefeed will emit messages per column family in separate streams. As a result, [changefeed messages]({% link {{ page.version.version }}/changefeed-messages.md %}) for different column families will arrive at the [sink]({% link {{ page.version.version }}/changefeed-sinks.md %}) under separate topics. [#127736](https://github.com/cockroachdb/cockroach/issues/127736)
diff --git a/src/current/images/v24.1/changefeed-structure.png b/src/current/images/v24.1/changefeed-structure.png
diff --git a/src/current/images/v24.2/changefeed-structure.png b/src/current/images/v24.2/changefeed-structure.png
diff --git a/src/current/v23.1/cdc-queries.md b/src/current/v23.1/cdc-queries.md
@@ -55,6 +55,7 @@ For a SQL diagram of the CDC query syntax, see the [`CREATE CHANGEFEED`]({% link
 
 {% include {{ page.version.version }}/known-limitations/cdc-queries.md %}
 - {% include {{ page.version.version }}/known-limitations/alter-changefeed-cdc-queries.md %}
+- {% include {{ page.version.version }}/known-limitations/cdc-queries-column-families.md %}
 
 ## CDC query function support
 

diff --git a/src/current/v23.1/changefeeds-in-multi-region-deployments.md b/src/current/v23.1/changefeeds-in-multi-region-deployments.md
@@ -53,6 +53,10 @@ Once the coordinating node is determined, nodes that match the locality requirem
 - If the [leaseholder]({% link {{ page.version.version }}/architecture/reads-and-writes-overview.md %}#architecture-leaseholder) for the change data matches the filter, it will emit the changefeed messages.
 - If the leaseholder does not match the locality filter, a node will be selected matching the locality filter with a preference for nodes with localities that are more similar to the leaseholder.
 
+{{site.data.alerts.callout_info}}
+{% include {{ page.version.version }}/cdc/work-distribution-setting.md %}
+{{site.data.alerts.end}}
+
 When a node matching the locality filter takes part in the changefeed job, that node will read from the closest [replica]({% link {{ page.version.version }}/architecture/reads-and-writes-overview.md %}#architecture-replica). If the node is the leaseholder, or is itself a replica, it can read from itself. In the scenario where no replicas are available in the region of the assigned node, it may then read from a replica in a different region. As a result, you may want to consider [placing replicas]({% link {{ page.version.version }}/configure-replication-zones.md %}), including potentially [non-voting replicas]({% link {{ page.version.version }}/architecture/replication-layer.md %}#non-voting-replicas) that will have less impact on read latency, in the locality or region that you plan on pinning for changefeed job execution.
 
 For an overview of how a changefeed job works, see the [How does an Enterprise changefeed work?]({% link {{ page.version.version }}/how-does-an-enterprise-changefeed-work.md %}) section.