-
Notifications
You must be signed in to change notification settings - Fork 689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
design:add outlier detection design doc #5460
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,275 @@ | ||
# Design for Supporting Outlier detection in Contour | ||
|
||
Status: Draft | ||
|
||
## Abstract | ||
This document recommends adding passive health checks to services to enhance routing reliability. | ||
|
||
## Background | ||
|
||
Envoy supports two types of health checks, namely active health check and [passive health check][1]. Passive and active health checks can be enabled simultaneously or independently, and form the basis of the overall upstream health check solution. Contour has implemented active health checks. If Passive health checks are configured to better monitor service status and route traffic to healthy instances. | ||
|
||
## Goals | ||
- Add outlier detection related configuration detection for services. | ||
- Support global configuration of OutlierDetection, service level can be overridden or disabled. | ||
- Support configuration based on route and TCPRoute. | ||
- Support configuration for consecutive-5 xx and consecutive-local-origin-failure. | ||
|
||
## Non-Goals | ||
- Not supported Outlier Detection configuration for consecutive-gateway-failure, success-rate, and failure-percentage. | ||
|
||
## Detailed Design | ||
|
||
### Global Configuration | ||
The `OutlierDetection` section of the Contour config file will look like: | ||
|
||
```yaml | ||
outlierDetection: | ||
# ConsecutiveServerErrors defines The number of consecutive server-side error responses before a consecutive 5xx ejection occurs. | ||
# When the backend host encounters consecutive | ||
# errors greater than or equal to ConsecutiveServerErrors, it will be | ||
# ejected from the load balancing pool. | ||
# for HTTP services, a 5xx counts as an error and for TCP services | ||
# connection failures and connection timeouts count as an error. | ||
# It can be disabled by setting the value to 0. | ||
# Defaults to 5. | ||
consecutiveServerErrors: 5 | ||
# Interval is the interval at which host status is evaluated. | ||
# Defaults to 10s. | ||
interval: 10s | ||
# BaseEjectionTime is the base time that a host is ejected for. | ||
# A host will remain ejected for a period of time equal to the | ||
# product of the ejection base duration and the number of times the host has been ejected. | ||
# Defaults to 30s. | ||
baseEjectionTime: 30s | ||
# MaxEjectionTime is the maximum time a host will be ejected for. | ||
# After this amount of time, a host will be returned to normal operation. | ||
# If not specified, the default value (300s) or BaseEjectionTime value is applied, whatever is larger. | ||
# Defaults to 300s. | ||
maxEjectionTime: 300s | ||
# SplitExternalLocalOriginErrors defines whether to split the local origin errors from the external origin errors. | ||
# Defaults to false. | ||
splitExternalLocalOriginErrors: false | ||
# ConsecutiveLocalOriginFailure defines the number of consecutive local origin failures before a consecutive local origin ejection occurs. | ||
# Parameters take effect only when SplitExternalLocalOriginErrors is true. | ||
# Defaults to 5. | ||
consecutiveLocalOriginFailure: 5 | ||
# MaxEjectionPercent is the max percentage of hosts in the load balancing pool for the upstream service that can be ejected. | ||
# But will eject at least one host regardless of the value here. | ||
# Defaults to 10%. | ||
maxEjectionPercent: 10 | ||
# MaxEjectionTimeJitter is The maximum amount of jitter to add to the ejection time, | ||
# in order to prevent a ‘thundering herd’ effect where all proxies try to reconnect to host at the same time. | ||
# Defaults to 0s. | ||
maxEjectionTimeJitter: 0s | ||
``` | ||
|
||
|
||
### Httpproxy Configuration | ||
Add a new field in the CRD of httpproxy for configuring Outlier Detection of the service. The field is defined as follows: | ||
```go | ||
// OutlierDetection defines the configuration for outlier detection on a service. | ||
// If not specified, global configuration will be used. | ||
// If specified, it will override the global configuration. | ||
type OutlierDetection struct { | ||
|
||
// Disabled defines whether to disable outlier detection for the service. | ||
// Defaults to false. | ||
Disabled *bool `json:"disabled,omitempty"` | ||
// ConsecutiveServerErrors defines The number of consecutive server-side error responses before a consecutive 5xx ejection occurs. | ||
// When the backend host encounters consecutive | ||
// errors greater than or equal to ConsecutiveServerErrors, it will be | ||
// ejected from the load balancing pool. | ||
// for HTTP services, a 5xx counts as an error and for TCP services | ||
// connection failures and connection timeouts count as an error. | ||
// It can be disabled by setting the value to 0. | ||
// Defaults to 5. | ||
// +optional | ||
ConsecutiveServerErrors *uint32 `json:"consecutiveServerErrors,omitempty"` | ||
|
||
// Interval is the interval at which host status is evaluated. | ||
// Defaults to 10s. | ||
// +optional | ||
// +kubebuilder:validation:Pattern=`^(((\d*(\.\d*)?h)|(\d*(\.\d*)?m)|(\d*(\.\d*)?s)|(\d*(\.\d*)?ms))+)$` | ||
Interval *string `json:"interval,omitempty"` | ||
|
||
// BaseEjectionTime is the base time that a host is ejected for. | ||
// A host will remain ejected for a period of time equal to the | ||
// product of the ejection base duration and the number of times the host has been ejected. | ||
// Defaults to 30s. | ||
// +optional | ||
// +kubebuilder:validation:Pattern=`^(((\d*(\.\d*)?h)|(\d*(\.\d*)?m)|(\d*(\.\d*)?s)|(\d*(\.\d*)?ms))+)$` | ||
BaseEjectionTime *string `json:"baseEjectionTime,omitempty"` | ||
|
||
// MaxEjectionTime is the maximum time a host will be ejected for. | ||
// After this amount of time, a host will be returned to normal operation. | ||
// If not specified, the default value (300s) or BaseEjectionTime value is applied, whatever is larger. | ||
// Defaults to 300s. | ||
// +optional | ||
// +kubebuilder:validation:Pattern=`^(((\d*(\.\d*)?h)|(\d*(\.\d*)?m)|(\d*(\.\d*)?s)|(\d*(\.\d*)?ms))+)$` | ||
MaxEjectionTime *string `json:"maxEjectionTime,omitempty"` | ||
|
||
// SplitExternalLocalOriginErrors defines whether to split the local origin errors from the external origin errors. | ||
// Defaults to false. | ||
// +optional | ||
// +kubebuilder:default=false | ||
SplitExternalLocalOriginErrors bool `json:"splitExternalLocalOriginErrors"` | ||
|
||
// ConsecutiveLocalOriginFailure defines the number of consecutive local origin failures before a consecutive local origin ejection occurs. | ||
// Parameters take effect only when SplitExternalLocalOriginErrors is true. | ||
// Defaults to 5. | ||
ConsecutiveLocalOriginFailure *uint32 `json:"consecutiveLocalOriginFailure,omitempty"` | ||
|
||
// MaxEjectionPercent is the max percentage of hosts in the load balancing pool for the upstream service that can be ejected. | ||
// But will eject at least one host regardless of the value here. | ||
// Defaults to 10%. | ||
// +optional | ||
// +kubebuilder:validation:Maximum=100 | ||
MaxEjectionPercent *uint32 `json:"maxEjectionPercent,omitempty"` | ||
|
||
// MaxEjectionTimeJitter is The maximum amount of jitter to add to the ejection time, | ||
// in order to prevent a ‘thundering herd’ effect where all proxies try to reconnect to host at the same time. | ||
// Defaults to 0s. | ||
// +optional | ||
// +kubebuilder:validation:Pattern=`^(((\d*(\.\d*)?s)|(\d*(\.\d*)?ms))+)$` | ||
MaxEjectionTimeJitter *string `json:"maxEjectionTimeJitter,omitempty"` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we can just set a default for this and not expose this field to start with, this config is pretty big so limiting what users need to think about would be great |
||
} | ||
yangyy93 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
``` | ||
|
||
## Example | ||
|
||
If the global configuration is not configured,Below are some configuration examples for outlier detection. | ||
|
||
### Example 1 | ||
```yaml | ||
apiVersion: projectcontour.io/v1alpha1 | ||
kind: HTTPProxy | ||
metadata: | ||
name: simple-5xx | ||
namespace: projectcontour | ||
spec: | ||
virtualhost: | ||
fqdn: outlierDetection.projectcontour.io | ||
routes: | ||
- conditions: | ||
- prefix: / | ||
services: | ||
- name: s1 | ||
port: 80 | ||
outlierDetection: | ||
consecutiveServerErrors: 5 | ||
maxEjectionPercent: 100 | ||
``` | ||
In this example, when s1 service experiences 5 consecutive 5xx errors (including locally originated and externally originated (transaction) errors), s1 service will be ejected, with a maximum ejection rate of 100%. Panic is disabled, which is also the default value. | ||
|
||
Below is the generated envoy configuration: | ||
```yaml | ||
cluster: | ||
common_lb_config: | ||
healthy_panic_threshold: | ||
value: 0.0 | ||
outlier_detection: | ||
enforcing_success_rate: 0 | ||
consecutive_5xx: 5 | ||
enforcing_consecutive_5xx: 100 | ||
max_ejection_percent: 100 | ||
enforcing_consecutive_gateway_failure: 0 | ||
``` | ||
|
||
### Example 2 | ||
```yaml | ||
apiVersion: projectcontour.io/v1alpha1 | ||
kind: HTTPProxy | ||
metadata: | ||
name: only-local-external-origin-errors | ||
namespace: projectcontour | ||
spec: | ||
virtualhost: | ||
fqdn: outlierDetection.projectcontour.io | ||
routes: | ||
- conditions: | ||
- prefix: / | ||
services: | ||
- name: s1 | ||
port: 80 | ||
outlierDetection: | ||
consecutiveServerErrors: 0 | ||
maxEjectionPercent: 100 | ||
splitExternalLocalOriginErrors: true | ||
consecutiveLocalOriginFailure: 5 | ||
``` | ||
In this example, when the s1 service has 5 consecutive local origin errors, the s1 service will be ejected, and the 5 xx errors returned by the upstream service will not be counted in the error statistics. The maximum eviction ratio is 100%. Panic is disabled, which is also the default value. | ||
|
||
Below is the generated envoy configuration: | ||
```yaml | ||
cluster: | ||
common_lb_config: | ||
healthy_panic_threshold: | ||
value: 0.0 | ||
outlier_detection: | ||
enforcing_success_rate: 0 | ||
enforcing_consecutive_5xx: 0 | ||
max_ejection_percent: 100 | ||
split_external_local_origin_errors: true | ||
consecutive_local_origin_failure: 5 | ||
#Default to 100% if not specified | ||
#enforcing_consecutive_local_origin_failure: 100 | ||
enforcing_local_origin_success_rate: 0 | ||
enforcing_consecutive_gateway_failure: 0 | ||
``` | ||
|
||
### Example 3 | ||
```yaml | ||
apiVersion: projectcontour.io/v1alpha1 | ||
kind: HTTPProxy | ||
metadata: | ||
name: local-external-origin-errors-and-consecutive-5xx | ||
namespace: projectcontour | ||
spec: | ||
virtualhost: | ||
fqdn: outlierDetection.projectcontour.io | ||
routes: | ||
- conditions: | ||
- prefix: / | ||
services: | ||
- name: s1 | ||
port: 80 | ||
outlierDetection: | ||
consecutiveServerErrors: 10 | ||
maxEjectionPercent: 100 | ||
splitExternalLocalOriginErrors: true | ||
consecutiveLocalOriginFailure: 5 | ||
``` | ||
In this example, when the s1 service continuously encounters 5 local origin errors or 10 external errors, the s1 service will be ejected. The maximum eviction rate is 100%. Panic is disabled, which is also the default value. | ||
|
||
Below is the generated envoy configuration: | ||
```yaml | ||
cluster: | ||
common_lb_config: | ||
healthy_panic_threshold: | ||
value: 0.0 | ||
outlier_detection: | ||
enforcing_success_rate: 0 | ||
consecutive_5xx: 10 | ||
enforcing_consecutive_5xx: 100 | ||
max_ejection_percent: 100 | ||
split_external_local_origin_errors: true | ||
consecutive_local_origin_failure: 5 | ||
#Default to 100% if not specified | ||
#enforcing_consecutive_local_origin_failure: 100 | ||
enforcing_local_origin_success_rate: 0 | ||
enforcing_consecutive_gateway_failure: 0 | ||
``` | ||
|
||
Notes: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we should probably add some discussion/documentation of the |
||
- If consecutiveServerErrors is specified as 0, and splitExternalLocalOriginErrors is true, then local errors will be ignored.This is especially useful when the upstream service explicitly returns a 5xx for some requests and you want to ignore those responses from upstream service while determining the outlier detection status of a host. | ||
- When accessing the upstream host through an opaque TCP connection, connection timeouts, connection errors, and request failures are all considered 5xx errors, and therefore these events are included in the 5xx error statistics. | ||
- Please refer to the [official documentation of Envoy][1] for more instructions. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we should also include some discussion/documentation on the different behavior between http router and tcp filter when this is in the mix |
||
|
||
## Open Issues | ||
- Whether the existing supported configuration options meet the needs of the user. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm curious for people who have used this in the past, particularly at scale, whether the "consecutive errors" mechanism for configuring outlier detection is actually desirable vs. the failure percentage mechanism one could foresee a scenario in which a host in a cluster should be detected as an outlier however it does not consistently error consecutively but does still have a high failure percentage (say the consecutive errors was configured at 5, and you consistently get 3 out of 5 requests that fail), you may not detect that such a host should be ejected as quickly and also may not consistently eject it for as long as it needs to be ejected: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#ejection-algorithm |
||
- Should a global switch be provided to record [ejection event logs][2]? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm in favor of adding some discussion and exploration on this, I think with our active healthchecking configuration, it isn't ideal as-is but (IMO) people find it generally easier to explain and understand why their upstream is unhealthy (it returned an unhealthy status code from their healthcheck endpoint) with passive healthchecking I can foresee some support requests etc. where users are confused as to why their services are deemed unhealthy, so having logs etc. to explain things would be great we could also take advantage of the active healthchecking event log configuration as well I think a great outcome of implementing this feature could be a debugging guide, including some mention of what stats to monitor (e.g. from https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats#config-cluster-manager-cluster-stats-outlier-detection) we would have to likely add a bootstrap flag to change the bootstrap There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, adding a bootstrap flag to handle whether to enable logging can better help people confirm the cause of the failure. If it is considered feasible, I will create a new PR to implement this function. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can also add an active health check flag to record health check failure events |
||
|
||
[1]: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier | ||
[2]: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/bootstrap/v3/bootstrap.proto#envoy-v3-api-field-config-bootstrap-v3-clustermanager-outlier-detection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nice advantage of this passive healthcheck config is we can get more granular failure details like this vs. our existing active http healthcheck configuration, I would err on the side of always enabling this to separate server errors from possible network level errors, but curious to see what others think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I would lean on the side of enabling this by default. Regarding granularity, we should also allow users to tune
consecutive_gateway_failure
andconsecutive_local_origin_failure
; otherwise, this toggle by itself feels too rigid.We should also make sure to set sane defaults and that the accompanying documentation is clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your review.The setting of
consecutive_local_origin_failure
is already in the design document. Consecutive_gateway_failure was not intended to be configured at the beginning. Of course, if this is a good design, I will update this design document to supportconsecutive_gateway_failure
.@clayton-gonsalves @sunjayBhatia
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my 2 cents here is to remove the
SplitExternalLocalOriginErrors
field and set it to always true in the generated Envoy configin addition, we should choose one of the 5xx error codes or gateway errors to expose as a configurable parameter to users for upstream originated error control, my 2 cents here again is that the
consecutive_5xx
(existingConsecutiveServerErrors
field in this config) is sufficient, I don't see too much utility in limiting the classified errors to 502, 503, and 504