CLDSRV-573: Fix crash because of prom-client timeout #5694

BourgoisMickael · 2024-11-06T11:51:49Z

Instead of crashing, the server now stays alive, returns a 500 with body {"message":"Error: Operation timed out."}, and logs:

{"name":"S3","clientIP":"::1","clientPort":53518,"httpMethod":"GET","httpURL":"/metrics","err":{"message":"Operation timed out."},"time":1730901149499,"req_id":"31b0c58cb14cad5c9583","elapsed_ms":5002.625237,"level":"warn","message":"monitoring error","hostname":"MDM-RING-46789-store-1","pid":115}

For other Arsenal errors, the log will also carry the message field: "err":{"MethodNotAllowed":true,"message":"The specified method is not allowed against this resource."}

These changes will not go into ZENKO, as they don't use the cluster module with prom-client.
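The behaviour described above (stay alive, answer 500, log a warning) can be sketched roughly as follows. This is a minimal illustration and not the actual code of lib/utilities/monitoringHandler.js: `handleMetricsRequest`, `collectClusterMetrics`, and the stub `res` and `log` objects are hypothetical names introduced for the example, and the logger is only assumed to expose a `warn(message, fields)` method.

```javascript
// Sketch only: `collectClusterMetrics` is a hypothetical stand-in for the call
// that aggregates metrics from the cluster workers and may reject on timeout.
async function handleMetricsRequest(collectClusterMetrics, res, log) {
    try {
        const body = await collectClusterMetrics();
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end(body);
    } catch (err) {
        // Log a warning and answer 500 instead of letting the rejection
        // propagate and take the primary process down.
        log.warn('monitoring error', { err: { message: err.message } });
        res.writeHead(500, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({ message: String(err) }));
    }
}

// Demo with stub objects in place of a real HTTP response and logger.
const res = {
    writeHead(status) { console.log('status', status); },
    end(body) { console.log('body', body); },
};
const log = {
    warn: (message, fields) => console.log(message, JSON.stringify(fields)),
};
handleMetricsRequest(
    () => Promise.reject(new Error('Operation timed out.')),
    res,
    log
);
```

The point of the sketch is only that the rejection is handled at the request boundary, so a slow metrics collection costs one failed scrape rather than a process restart.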

Important

This problem happens often on low-resource platforms (like CI).
This will fix many flaky CI runs on Federation that fail because the step "Check if s3 Prometheus exporter is active" retries 3 times with a small delay, crashing s3 multiple times.

bert-e · 2024-11-06T11:51:53Z

Hello bourgoismickael,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

name	description	privileged	authored
`/after_pull_request`	Wait for the given pull request id to be merged before continuing with the current one.
`/bypass_author_approval`	Bypass the pull request author's approval	⭐
`/bypass_build_status`	Bypass the build and test status	⭐
`/bypass_commit_size`	Bypass the check on the size of the changeset `TBA`	⭐
`/bypass_incompatible_branch`	Bypass the check on the source branch prefix	⭐
`/bypass_jira_check`	Bypass the Jira issue check	⭐
`/bypass_peer_approval`	Bypass the pull request peers' approval	⭐
`/bypass_leader_approval`	Bypass the pull request leaders' approval	⭐
`/approve`	Instruct Bert-E that the author has approved the pull request.		✍️
`/create_pull_requests`	Allow the creation of integration pull requests.
`/create_integration_branches`	Allow the creation of integration branches.
`/no_octopus`	Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
`/unanimity`	Change review acceptance criteria from `one reviewer at least` to `all reviewers`
`/wait`	Instruct Bert-E not to run until further notice.

Available commands

name	description	privileged
`/help`	Print Bert-E's manual in the pull request.
`/status`	Print Bert-E's current status in the pull request `TBA`
`/clear`	Remove all comments from Bert-E from the history `TBA`
`/retry`	Re-start a fresh build `TBA`
`/build`	Re-start a fresh build `TBA`
`/force_reset`	Delete integration branches & pull requests, and restart merge process from the beginning.
`/reset`	Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

dvasilas

minor (very minor if this only happens in the CI):
"err":{},"results":{"message":"Error: Operation timed out."}

The results field is a bit unintuitive I think (I wouldn't grep for results in the logs).

Could we maybe have something like "error":{"message":"Error: Operation timed out."} ?

bert-e · 2024-11-06T12:32:51Z

Integration data created

I have created the integration data for the additional destination branches.

this pull request will merge bugfix/CLDSRV-573-prom-client into
development/7.70
w/8.6/bugfix/CLDSRV-573-prom-client will be merged into development/8.6
w/8.7/bugfix/CLDSRV-573-prom-client will be merged into development/8.7
w/8.8/bugfix/CLDSRV-573-prom-client will be merged into development/8.8

The following branches will NOT be impacted:

development/7.10
development/7.4

You can set option create_pull_requests if you need me to create
integration pull requests in addition to integration branches, with:

@bert-e create_pull_requests

The following options are set: approve

anurag4DSB · 2024-11-06T12:36:58Z

minor (very minor if this only happens in the CI): "err":{},"results":{"message":"Error: Operation timed out."}

The results field is a bit unintuitive I think (I wouldn't grep for results in the logs).

Could we maybe have something like "error":{"message":"Error: Operation timed out."} ?

I haven't seen the issue in production labs but I have in pre-production labs, one of them was FreePro.
Edit: It happens during CS startup time.

anurag4DSB

We need better errors in logs
I want to approve, but as the /approve command is there, it will be merged automatically without control.

lib/utilities/monitoringHandler.js

anurag4DSB · 2024-11-06T12:41:08Z

@@ -48,7 +56,7 @@ function monitoringHandler(clientIP, req, res, log) {
    function monitoringEndHandler(err, results) {


Worth renaming this method to handleMonitoringResponse
Probably out of scope of this PR.

BourgoisMickael · 2024-11-06T13:57:07Z

/create_integration_branches

BourgoisMickael · 2024-11-06T13:59:21Z

/approve

bert-e · 2024-11-06T13:59:33Z

Integration data created

I have created the integration data for the additional destination branches.

this pull request will merge bugfix/CLDSRV-573-prom-client into
development/7.70
w/8.6/bugfix/CLDSRV-573-prom-client will be merged into development/8.6
w/8.7/bugfix/CLDSRV-573-prom-client will be merged into development/8.7
w/8.8/bugfix/CLDSRV-573-prom-client will be merged into development/8.8

The following branches will NOT be impacted:

development/7.10
development/7.4

You can set option create_pull_requests if you need me to create
integration pull requests in addition to integration branches, with:

@bert-e create_pull_requests

The following options are set: approve, create_integration_branches

bert-e · 2024-11-06T14:14:07Z

I have successfully merged the changeset of this pull request
into targeted development branches:

✔️ development/7.70
✔️ development/8.6
✔️ development/8.7
✔️ development/8.8

The following branches have NOT changed:

development/7.10
development/7.4

Please check the status of the associated issue CLDSRV-573.

Goodbye bourgoismickael.

The following options are set: approve, create_integration_branches

BourgoisMickael requested review from anurag4DSB, nicolas2bert, dvasilas, fredmnl, williamlardier and a team November 6, 2024 11:56

scality deleted a comment from bert-e Nov 6, 2024

dvasilas approved these changes Nov 6, 2024

BourgoisMickael force-pushed the bugfix/CLDSRV-573-prom-client branch from 7a490b8 to cbaa5d6 Compare November 6, 2024 12:27

scality deleted a comment from bert-e Nov 6, 2024

anurag4DSB reviewed Nov 6, 2024

anurag4DSB approved these changes Nov 6, 2024

BourgoisMickael added 2 commits November 6, 2024 14:56

CLDSRV-573: Catch prom cluster timeout

bdb4f23

Fix crashes of the primary caused by the prom-client 5s timeout. These mostly happen at startup, when workers are not ready. This should also fix the write EPIPE error in workers by preventing the primary from crashing.

CLDSRV-573: Bump version

9981e50
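To illustrate the failure mode named in the first commit message, here is a rough sketch of how a time-limited wait turns an unresponsive worker into a rejected promise. It is not prom-client's implementation: `withTimeout` and `neverReady` are hypothetical names, and the delay is shortened from 5s to 50ms so the example finishes quickly.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms`.
function withTimeout(promise, ms) {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('Operation timed out.')), ms);
    });
    // Whichever settles first wins; always clear the timer afterwards.
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for a worker that is not ready yet and never answers.
const neverReady = new Promise(() => {});

withTimeout(neverReady, 50)
    .then(() => console.log('metrics collected'))
    // Without a handler here, the rejection would be unhandled, which is the
    // kind of condition that can terminate a Node.js process.
    .catch(err => console.log(`handled: ${err.message}`));
```

If nothing catches the rejection, recent Node.js versions treat it as a fatal unhandled rejection by default, which is consistent with the crash this pull request addresses.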

BourgoisMickael force-pushed the bugfix/CLDSRV-573-prom-client branch from cbaa5d6 to 9981e50 Compare November 6, 2024 13:56

scality deleted a comment from bert-e Nov 6, 2024

bert-e merged commit 9981e50 into development/7.70 Nov 6, 2024
11 checks passed

bert-e deleted the bugfix/CLDSRV-573-prom-client branch November 6, 2024 14:14

This was referenced Nov 6, 2024

[HF] CLDSRV-573: Fix crash because of prom-client timeout (S3C 9.4) #5695

Closed

[HF] CLDSRV-573: Fix crash because of prom-client timeout (S3C 9.4) #5696

Merged
