Commit f2f8bff

jiangzho authored and morhidi committed
[FLINK-29939] Add metrics for Kubernetes Client Response 5xx count and rate
1 parent 3c5ec6c commit f2f8bff

6 files changed, +263 -19 lines changed

docs/content/docs/operations/metrics-logging.md

+29 -11
@@ -62,17 +62,35 @@ In addition to the simple counts we further track a few selected state transitio
 
 The Operator gathers various metrics related to Kubernetes API server access.
 
-| Scope | Metrics | Description | Type |
-|--------|----------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|
-| System | KubeClient.HttpRequest.Count | Number of HTTP request sent to the Kubernetes API Server | Counter |
-| System | KubeClient.HttpRequest.<RequestMethod>.Count | Number of HTTP request sent to the Kubernetes API Server per request method. <RequestMethod> can take values from: GET, POST, PUT, PATCH, DELETE, etc. | Counter |
-| System | KubeClient.HttpRequest.Failed.Count | Number of failed HTTP requests that has no response from the Kubernetes API Server | Counter |
-| System | KubeClient.HttpResponse.Count | Number of HTTP responses received from the Kubernetes API Server | Counter |
-| System | KubeClient.HttpResponse.<ResponseCode>.Count | Number of HTTP responses received from the Kubernetes API Server per response code. <ResponseCode> can take values from: 200, 404, 503, etc. | Counter |
-| System | KubeClient.HttpRequest.NumPerSecond | Number of HTTP requests sent to the Kubernetes API Server per second | Meter |
-| System | KubeClient.HttpRequest.Failed.NumPerSecond | Number of failed HTTP requests sent to the Kubernetes API Server per second | Meter |
-| System | KubeClient.HttpResponse.NumPerSecond | Number of HTTP responses received from the Kubernetes API Server per second | Meter |
-| System | KubeClient.HttpResponse.TimeNanos | Latency statistics obtained from the HTTP responses received from the Kubernetes API Server | Histogram |
+| Scope | Metrics | Description | Type |
+|--------|-----------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|
+| System | KubeClient.HttpRequest.Count | Number of HTTP requests sent to the Kubernetes API Server | Counter |
+| System | KubeClient.HttpRequest.<RequestMethod>.Count | Number of HTTP requests sent to the Kubernetes API Server per request method. <RequestMethod> can take values from: GET, POST, PUT, PATCH, DELETE, etc. | Counter |
+| System | KubeClient.HttpRequest.Failed.Count | Number of failed HTTP requests that have no response from the Kubernetes API Server | Counter |
+| System | KubeClient.HttpResponse.Count | Number of HTTP responses received from the Kubernetes API Server | Counter |
+| System | KubeClient.HttpResponse.<ResponseCode>.Count | Number of HTTP responses received from the Kubernetes API Server per response code. <ResponseCode> can take values from: 200, 404, 503, etc. | Counter |
+| System | KubeClient.HttpResponse.<ResponseCode>.NumPerSecond | Number of HTTP responses received from the Kubernetes API Server per response code per second. <ResponseCode> can take values from: 200, 404, 503, etc. | Meter |
+| System | KubeClient.HttpRequest.NumPerSecond | Number of HTTP requests sent to the Kubernetes API Server per second | Meter |
+| System | KubeClient.HttpRequest.Failed.NumPerSecond | Number of failed HTTP requests sent to the Kubernetes API Server per second | Meter |
+| System | KubeClient.HttpResponse.NumPerSecond | Number of HTTP responses received from the Kubernetes API Server per second | Meter |
+| System | KubeClient.HttpResponse.TimeNanos | Latency statistics obtained from the HTTP responses received from the Kubernetes API Server | Histogram |
+
+#### Kubernetes client metrics by HTTP Response Code
+
+Additional metrics, grouped by the class of HTTP response code received from the API server, can be published by setting `kubernetes.operator.kubernetes.client.metrics.http.response.code.groups.enabled` to `true`.
+
+| Scope | Metrics | Description | Type |
+|--------|-----------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|
+| System | KubeClient.HttpResponse.1xx.Count | Number of HTTP 1xx (informational) responses received from the Kubernetes API Server | Counter |
+| System | KubeClient.HttpResponse.2xx.Count | Number of HTTP 2xx (success) responses received from the Kubernetes API Server | Counter |
+| System | KubeClient.HttpResponse.3xx.Count | Number of HTTP 3xx (redirection) responses received from the Kubernetes API Server | Counter |
+| System | KubeClient.HttpResponse.4xx.Count | Number of HTTP 4xx (client error) responses received from the Kubernetes API Server | Counter |
+| System | KubeClient.HttpResponse.5xx.Count | Number of HTTP 5xx (server error) responses received from the Kubernetes API Server | Counter |
+| System | KubeClient.HttpResponse.1xx.NumPerSecond | Number of HTTP 1xx (informational) responses received from the Kubernetes API Server per second | Meter |
+| System | KubeClient.HttpResponse.2xx.NumPerSecond | Number of HTTP 2xx (success) responses received from the Kubernetes API Server per second | Meter |
+| System | KubeClient.HttpResponse.3xx.NumPerSecond | Number of HTTP 3xx (redirection) responses received from the Kubernetes API Server per second | Meter |
+| System | KubeClient.HttpResponse.4xx.NumPerSecond | Number of HTTP 4xx (client error) responses received from the Kubernetes API Server per second | Meter |
+| System | KubeClient.HttpResponse.5xx.NumPerSecond | Number of HTTP 5xx (server error) responses received from the Kubernetes API Server per second | Meter |
 
 ### JVM Metrics

docs/layouts/shortcodes/generated/kubernetes_operator_metric_configuration.html

+6
@@ -26,6 +26,12 @@
             <td>Boolean</td>
             <td>Enable KubernetesClient metrics for measuring the HTTP traffic to the Kubernetes API Server.</td>
         </tr>
+        <tr>
+            <td><h5>kubernetes.operator.kubernetes.client.metrics.http.response.code.groups.enabled</h5></td>
+            <td style="word-wrap: break-word;">false</td>
+            <td>Boolean</td>
+            <td>Enable KubernetesClient metrics for measuring the HTTP traffic to the Kubernetes API Server by response code group, e.g. 1xx, 2xx.</td>
+        </tr>
         <tr>
             <td><h5>kubernetes.operator.metrics.histogram.sample.size</h5></td>
             <td style="word-wrap: break-word;">1000</td>

flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/config/FlinkOperatorConfiguration.java

+7
@@ -54,6 +54,7 @@ public class FlinkOperatorConfiguration {
     boolean josdkMetricsEnabled;
     int metricsHistogramSampleSize;
     boolean kubernetesClientMetricsEnabled;
+    boolean kubernetesClientMetricsHttpResponseCodeGroupsEnabled;
     Duration flinkCancelJobTimeout;
     Duration flinkShutdownClusterTimeout;
     String artifactsBaseDir;
@@ -157,6 +158,11 @@ public static FlinkOperatorConfiguration fromConfiguration(Configuration operato
                 operatorConfig.get(
                         KubernetesOperatorMetricOptions.OPERATOR_KUBERNETES_CLIENT_METRICS_ENABLED);
 
+        boolean kubernetesClientMetricsHttpResponseCodeGroupsEnabled =
+                operatorConfig.get(
+                        KubernetesOperatorMetricOptions
+                                .OPERATOR_KUBERNETES_CLIENT_METRICS_HTTP_RESPONSE_CODE_GROUPS_ENABLED);
+
         int metricsHistogramSampleSize =
                 operatorConfig.get(
                         KubernetesOperatorMetricOptions.OPERATOR_METRICS_HISTOGRAM_SAMPLE_SIZE);
@@ -178,6 +184,7 @@ public static FlinkOperatorConfiguration fromConfiguration(Configuration operato
                 josdkMetricsEnabled,
                 metricsHistogramSampleSize,
                 kubernetesClientMetricsEnabled,
+                kubernetesClientMetricsHttpResponseCodeGroupsEnabled,
                 flinkCancelJobTimeout,
                 flinkShutdownClusterTimeout,
                 artifactsBaseDir,

flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/metrics/KubernetesClientMetrics.java

+47 -8
@@ -30,6 +30,8 @@
 import okhttp3.Response;
 
 import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
 import java.util.Map;
 import java.util.concurrent.ConcurrentHashMap;
 
@@ -40,6 +42,11 @@ public class KubernetesClientMetrics implements Interceptor {
     public static final String HTTP_REQUEST_GROUP = "HttpRequest";
     public static final String HTTP_REQUEST_FAILED_GROUP = "Failed";
     public static final String HTTP_RESPONSE_GROUP = "HttpResponse";
+    public static final String HTTP_RESPONSE_1XX = "1xx";
+    public static final String HTTP_RESPONSE_2XX = "2xx";
+    public static final String HTTP_RESPONSE_3XX = "3xx";
+    public static final String HTTP_RESPONSE_4XX = "4xx";
+    public static final String HTTP_RESPONSE_5XX = "5xx";
     public static final String COUNTER = "Count";
     public static final String METER = "NumPerSecond";
     public static final String HISTO = "TimeNanos";
@@ -57,7 +64,10 @@ public class KubernetesClientMetrics implements Interceptor {
     private final SynchronizedMeterView requestFailedRateMeter;
     private final SynchronizedMeterView responseRateMeter;
 
-    private final Map<Integer, Counter> responseCodeCounters = new ConcurrentHashMap<>();
+    private final boolean httpResponseCodeGroupsEnabled;
+    private final List<SynchronizedMeterView> responseCodeGroupMeters = new ArrayList<>(5);
+    private final Map<Integer, SynchronizedMeterView> responseCodeMeters =
+            new ConcurrentHashMap<>();
     private final Map<String, Counter> requestMethodCounter = new ConcurrentHashMap<>();
 
     public KubernetesClientMetrics(
@@ -88,6 +98,26 @@ public KubernetesClientMetrics(
         this.responseLatency =
                 responseMetricGroup.histogram(
                         HISTO, OperatorMetricUtils.createHistogram(flinkOperatorConfiguration));
+
+        this.httpResponseCodeGroupsEnabled =
+                flinkOperatorConfiguration.isKubernetesClientMetricsHttpResponseCodeGroupsEnabled();
+        if (this.httpResponseCodeGroupsEnabled) {
+            this.responseCodeGroupMeters.add(
+                    createMeterViewForMetricsGroup(
+                            responseMetricGroup.addGroup(HTTP_RESPONSE_1XX)));
+            this.responseCodeGroupMeters.add(
+                    createMeterViewForMetricsGroup(
+                            responseMetricGroup.addGroup(HTTP_RESPONSE_2XX)));
+            this.responseCodeGroupMeters.add(
+                    createMeterViewForMetricsGroup(
+                            responseMetricGroup.addGroup(HTTP_RESPONSE_3XX)));
+            this.responseCodeGroupMeters.add(
+                    createMeterViewForMetricsGroup(
+                            responseMetricGroup.addGroup(HTTP_RESPONSE_4XX)));
+            this.responseCodeGroupMeters.add(
+                    createMeterViewForMetricsGroup(
+                            responseMetricGroup.addGroup(HTTP_RESPONSE_5XX)));
+        }
     }
 
     @Override
@@ -114,7 +144,10 @@ private void updateResponseMetrics(Response response, long startTimeNanos) {
         if (response != null) {
             this.responseRateMeter.markEvent();
             this.responseLatency.update(latency);
-            getCounterByResponseCode(response.code()).inc();
+            getMeterViewByResponseCode(response.code()).markEvent();
+            if (this.httpResponseCodeGroupsEnabled) {
+                responseCodeGroupMeters.get(response.code() / 100 - 1).markEvent();
+            }
         } else {
             this.requestFailedRateMeter.markEvent();
         }
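
Reviewer note: the group lookup in this hunk indexes the five-element meter list with `response.code() / 100 - 1`, which assumes every status code lies between 100 and 599. A minimal sketch of a bounds-checked mapping follows. `ResponseCodeGroups`, `groupIndex`, and `groupName` are hypothetical names that do not appear in the commit, and the sketch illustrates one possible guard rather than the operator's actual behavior.

```java
import java.util.Optional;

// Hypothetical helper, not part of the commit: maps an HTTP status code
// to the position of its group ("1xx" .. "5xx") in a five-element list.
final class ResponseCodeGroups {
    private static final String[] GROUP_NAMES = {"1xx", "2xx", "3xx", "4xx", "5xx"};

    private ResponseCodeGroups() {}

    // Returns 0..4 for codes 100..599, or -1 when the code is outside that range.
    static int groupIndex(int statusCode) {
        if (statusCode < 100 || statusCode > 599) {
            return -1;
        }
        return statusCode / 100 - 1;
    }

    // Returns the group label, or an empty Optional for out-of-range codes.
    static Optional<String> groupName(int statusCode) {
        int index = groupIndex(statusCode);
        return index < 0 ? Optional.empty() : Optional.of(GROUP_NAMES[index]);
    }
}
```

With a guard like this, a caller could skip the group meter update when `groupIndex` returns -1 instead of risking an `IndexOutOfBoundsException` from `List.get`.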
@@ -128,11 +161,17 @@ private Counter getCounterByRequestMethod(String method) {
                         requestMetricGroup.addGroup(key).counter(COUNTER)));
     }
 
-    private Counter getCounterByResponseCode(int code) {
-        return responseCodeCounters.computeIfAbsent(
-                code,
-                key ->
-                        OperatorMetricUtils.synchronizedCounter(
-                                responseMetricGroup.addGroup(key).counter(COUNTER)));
+    private SynchronizedMeterView getMeterViewByResponseCode(int code) {
+        return responseCodeMeters.computeIfAbsent(
+                code, key -> createMeterViewForMetricsGroup(responseMetricGroup.addGroup(key)));
+    }
+
+    private SynchronizedMeterView createMeterViewForMetricsGroup(MetricGroup metricGroup) {
+        return OperatorMetricUtils.synchronizedMeterView(
+                metricGroup.meter(
+                        METER,
+                        new MeterView(
+                                OperatorMetricUtils.synchronizedCounter(
+                                        metricGroup.counter(COUNTER)))));
     }
 }
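
Reviewer note: `createMeterViewForMetricsGroup` registers a counter under `Count` and wraps that same counter in a meter registered under `NumPerSecond`, so a single `markEvent()` call feeds both metrics. The sketch below illustrates the idea with plain JDK types. `CounterBackedMeter` and its `update(elapsedSeconds)` method are hypothetical simplifications and do not reproduce Flink's `MeterView`, which derives its rate over a time window from periodic updates.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified stand-in for a meter that is backed by a counter.
final class CounterBackedMeter {
    private final AtomicLong counter = new AtomicLong();
    private long countAtLastUpdate;
    private double ratePerSecond;

    // One call advances the shared counter; the rate is derived from it later.
    void markEvent() {
        counter.incrementAndGet();
    }

    // Total number of events seen so far (the "Count" view).
    long getCount() {
        return counter.get();
    }

    // Intended to be called periodically; elapsedSeconds is the time since the previous call.
    synchronized void update(double elapsedSeconds) {
        long current = counter.get();
        ratePerSecond = (current - countAtLastUpdate) / elapsedSeconds;
        countAtLastUpdate = current;
    }

    // Events per second over the most recent update interval (the "NumPerSecond" view).
    synchronized double getRate() {
        return ratePerSecond;
    }
}
```

Sharing one counter keeps the two views consistent: the rate can never disagree with the count it was derived from.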

flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/metrics/KubernetesOperatorMetricOptions.java

+8
@@ -44,6 +44,14 @@ public class KubernetesOperatorMetricOptions {
                     .withDescription(
                             "Enable KubernetesClient metrics for measuring the HTTP traffic to the Kubernetes API Server.");
 
+    public static final ConfigOption<Boolean>
+            OPERATOR_KUBERNETES_CLIENT_METRICS_HTTP_RESPONSE_CODE_GROUPS_ENABLED =
+                    operatorConfig("kubernetes.client.metrics.http.response.code.groups.enabled")
+                            .booleanType()
+                            .defaultValue(false)
+                            .withDescription(
+                                    "Enable KubernetesClient metrics for measuring the HTTP traffic to the Kubernetes API Server by response code group, e.g. 1xx, 2xx.");
+
     public static final ConfigOption<Boolean> OPERATOR_RESOURCE_METRICS_ENABLED =
             operatorConfig("resource.metrics.enabled")
                     .booleanType()
