Missed attestations, beacon node increased CPU, increased disk reads + writes #5105
Comments
Could be related to this issue, which will be fixed in v4.6.0: #4918. Please try upgrading once the release is available (soon) and let us know if it's resolved.
Will be happy to do so.
No, there is a single validator connected to the beacon node and nothing else doing API queries to the beacon node.
@vogelito Could you please DM me debug logs and metrics from one of these nodes on Discord? (I am …)
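A minimal sketch of how these might be collected, assuming a default mainnet datadir and the default beacon node metrics port of 5054 (both assumptions; adjust paths and ports if the setup differs):

```bash
# Debug logs: the beacon node writes debug-level logs under its datadir by
# default (the path below assumes a default mainnet datadir).
tar czf lighthouse-debug-logs.tar.gz ~/.lighthouse/mainnet/beacon/logs/

# Metrics: only available if the beacon node was started with --metrics;
# 5054 is the default beacon node metrics port.
curl -s http://localhost:5054/metrics > lighthouse-metrics.txt
```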
Sent!
Hi - we're seeing another big instance of this happening today after a few quiet days.
Since increasing the memory of the node to 32GB (on …
Sounds good! Let's see how it goes after the new release…
Some more logs from before our memory upgrade that are worth sharing:
This is what the last event looks like:
@vogelito I think #5270 should help quite a bit. Lighthouse is not using particularly excessive amounts of memory (which would be indicative of a bug). All your OOM logs show it getting killed at around 8GB:
On our infrastructure Lighthouse uses 4-8GB regularly, so I expect the spikes to 8GB were too much for your node with the 16GB hard limit. Geth is using the rest of the memory by the looks of things; its RSS shown in the kernel backtrace is …
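For reference, a sketch of how the OOM kill events and RSS figures mentioned above can be pulled out of the kernel log on a typical systemd-based Linux host:

```bash
# List recent OOM killer events; each one includes a per-process memory
# table where the kernel reports RSS (in 4 KiB pages on most systems).
sudo journalctl -k --since "2 days ago" | grep -iE "out of memory|oom-killer|killed process"

# Equivalent without journald:
sudo dmesg -T | grep -iE "out of memory|oom"
```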
Sounds like a plan to me....
Upgraded to …
Will report back!
Another crash:
@vogelito Please go back to 32GB or try …
If you can DM me debug logs again too, we can take a look. Lighthouse shouldn't really be hitting 9GB.
Looking progressively worse....
I've restarted with … and DM'ed you logs on Discord.
The issue continues after the restart, though at a lesser frequency
@vogelito Can you please DM me a dump of …? We don't have any imminent memory improvements coming, so Lighthouse's memory usage is expected to remain spiky until we merge …
tree-states looks promising :) I've sent you the output of … and the log of restarts. Still running 16GB; will restart to 32GB now until …
I updated to the Tree People release last Thursday (June 13). I see a smaller and less choppy memory footprint, but with a clear upward trend - see the chart below. Do we want to give 16GB RAM a try?
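A rough way to capture that kind of memory trend without a full metrics stack is to sample the beacon node's RSS periodically - a minimal sketch, assuming the process was started as `lighthouse bn` (adjust the pgrep pattern if a wrapper renames it):

```bash
# Append one timestamped RSS sample per minute for the beacon node.
while true; do
  pid=$(pgrep -f 'lighthouse bn' | head -n1)
  rss_kib=$(ps -o rss= -p "$pid")
  echo "$(date -Is) $((rss_kib / 1024)) MiB" >> lighthouse-rss.log
  sleep 60
done
```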
@vogelito It should plateau. It won't grow indefinitely.
Try the 16GB RAM, I think it should be OK now.
Alright, back to 16GB RAM. Will report any findings....
No restarts in the last 28 hours running 5.2.0 on 16GB RAM. Will report again in a week.
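One way to double-check that, assuming the beacon node runs under systemd (the unit name below is hypothetical):

```bash
# Show when the service last entered the active state and how many times
# systemd has restarted it, then scan the last 28 hours of its journal.
systemctl show -p ActiveEnterTimestamp,NRestarts lighthousebeacon.service
journalctl -u lighthousebeacon.service --since "28 hours ago" | grep -i "started"
```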
I can confirm that everything looks very nice and stable on a 16GB machine for the first time in 7 months :) Thanks for the hard work!! Let me know if you need anything from me or if I can be helpful in any way!
Great to hear @vogelito! Do you mind if we close this issue?
Of course! Well earned close!!
Closing the issue as discussed above.
Although this is definitely no longer an issue, I wanted to give you visibility into a couple of OOM kills that have taken place since I downgraded to a 16GB machine:
As always, happy to help debug.
Hi, the OOM kills are back with a vengeance for the last 2 days. Nothing changed in our setup.
Same behavior as before....
It's been running 5.3.0 since Aug 14, but kills just took place over the weekend:
How much memory is on the node again, was it 16GB or 32GB? If it is 16GB, it is recommended to upgrade to 32GB. Which execution client are you using?
Hi @vogelito. Looking at our own nodes I'm also seeing some occasional memory spikes around 7-8GB. I'm investigating now. It may not be an easy fix however, so CK's suggestion of 32GB is probably the way to go in the short term.
Using geth and running on 16GB. It had run without issues since mid-August, but I'll go back to 32GB for now. Happy, as always, to help debug if I can be useful!
Description
5 weeks ago I started noticing my validator node missing attestations. These missed attestations matched moments where CPU usage on the node was high, so I added a probe (sketched below) to see which process was spiking the CPU and found that it was mostly the beacon node. These periods also matched periods of higher than usual disk reads and writes.
This is new behavior that started roughly 5 weeks ago.
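For context, a sketch of the kind of probe described above - this version logs per-process CPU and disk I/O once a minute using pidstat from the sysstat package (an assumption about tooling, not necessarily what was actually used):

```bash
# Sample per-process CPU (-u) and disk I/O (-d) every 60 seconds, one row
# per process with a timestamp (-h), and append to a log so spikes can be
# correlated with missed attestations later. Requires the sysstat package.
pidstat -u -d -h 60 >> per-process-usage.log
```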
Version
I'm downloading the release binaries from GitHub and currently running:
Present Behaviour
I have missed attestations during each of these periods of high CPU usage:
Expected Behaviour
I shouldn't be missing attestations.
Steps to resolve
I'm unaware of what steps I can take to resolve this, but I'm happy to work with the devs on resolving it :)