node_exporter may hang on a DB node with the error encoding and sending metric family: write tcp %IP%:9100 error #8692

Open
1 of 2 tasks
vponomaryov opened this issue Sep 13, 2024 · 2 comments
@vponomaryov
Contributor

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

While setting up Scylla version 2023.1.11, one of the nodes hung with the following errors:

2024-09-13T11:33:57.764+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1     !INFO | scylla[15426]:  \
    [shard  0] stream_session - [Stream #10657cf0-71c4-11ef-830a-21e3b321ba22] Streaming plan for Bootstrap-system_distributed-index-10 succeeded, peers={10.142.0.14}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
2024-09-13T11:34:01.006+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1     !INFO | node_exporter[14047]: \
    ts=2024-09-13T11:34:00.709Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 10.142.0.10:9100" msg="->10.142.0.22:60390: write: broken pipe"
2024-09-13T11:34:01.017+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1     !INFO | node_exporter[14047]: \
    ts=2024-09-13T11:34:00.728Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 10.142.0.10:9100" msg="->10.142.0.22:60390: write: broken pipe"
...
2024-09-13T12:31:19.282+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1     !INFO | node_exporter[14047]: \
    ts=2024-09-13T12:31:19.031Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 10.142.0.10:9100" msg="->10.142.0.22:33436: write: broken pipe"

The CI job was later aborted.
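
To see how long the errors lasted and which peer they involve, the node_exporter lines above can be summarized with a small script. This is only a sketch: `summarize_broken_pipes` and the regular expression are hypothetical helpers written against the log format quoted in this issue, not part of SCT or node_exporter. A `write: broken pipe` generally means the remote side (here 10.142.0.22, presumably the scraper) closed the connection before the response was fully written.

```python
import re
from collections import Counter

# Matches the node_exporter error lines quoted in this issue (format assumed from them).
BROKEN_PIPE_RE = re.compile(
    r'ts=(?P<ts>\S+) .*?write tcp (?P<local>[\d.]+:\d+)" '
    r'msg="->(?P<peer>[\d.]+):(?P<peer_port>\d+): write: broken pipe"'
)


def summarize_broken_pipes(lines):
    """Return (errors per peer IP, first timestamp, last timestamp)."""
    per_peer = Counter()
    first_ts = last_ts = None
    for line in lines:
        match = BROKEN_PIPE_RE.search(line)
        if match is None:
            continue  # not a broken-pipe line
        per_peer[match["peer"]] += 1
        ts = match["ts"]
        # ISO-8601 timestamps in the same format compare correctly as strings.
        first_ts = ts if first_ts is None else min(first_ts, ts)
        last_ts = ts if last_ts is None else max(last_ts, ts)
    return per_peer, first_ts, last_ts
```

Feeding it the node's log lines gives the error count per scraping peer and the time window over which the errors occurred.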

Steps to Reproduce

  1. Set up a custom_d1 (with special disk config) 3-node DB cluster
  2. See error

Expected behavior: node_exporter should always work correctly.

Actual behavior: node_exporter hangs intermittently.
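
One way to tell a hung node_exporter from a healthy one is to request its `/metrics` endpoint with a timeout. The following is a minimal sketch, not SCT code: `probe_metrics` is a hypothetical helper, port 9100 is taken from the log above, and the 5-second timeout is an arbitrary assumption.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe_metrics(host, port=9100, timeout=5.0):
    """GET http://host:port/metrics once; return (ok, elapsed_seconds, detail)."""
    url = f"http://{host}:{port}/metrics"
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # a hang while streaming the body surfaces as a timeout here
            ok = resp.status == 200
            return ok, time.monotonic() - start, f"HTTP {resp.status}"
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # Covers connection refused, timeouts, HTTP errors and truncated responses.
        return False, time.monotonic() - start, repr(exc)


# Example (hypothetical address taken from the log above):
# ok, elapsed, detail = probe_metrics("10.142.0.10")
```

A probe that fails only after the full timeout points to a hang, while an immediate failure points to a process that is not listening at all.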

Impact

Setup of a DB node hangs, which spoils the test run.

How frequently does it reproduce?

In ~3 of 11 test runs, which is too frequent.

Installation details

SCT Version: master
Scylla version (or git commit hash): 2023.1.11-0.20240729.5a79e79a0320 with build-id 4daf2e1487b1ab784ff564a6c8fd75f9ddd8a9ac

Logs

@fruch
Contributor

fruch commented Sep 15, 2024

@vponomaryov

if it's the node_exporter on the DB node, I think it's something that needs to be reported on scylla core...

@vponomaryov
Contributor Author

> @vponomaryov
>
> if it's the node_exporter on the DB node, I think it's something that needs to be reported on scylla core...

We have a lot of configuration code for it in SCT.
