High memory usage in public nodes #486

scottyeager · 2024-10-25T02:41:33Z

I'm not sure if this is related to #423, but the graphs from the relevant time periods look fairly different so this felt worth a new issue. Both of the public nodes in Finland appear to be in an OOM reaping loop, getting restarted every 10-15 minutes.

The situation for other nodes is generally better but many have similar issues. Singapore has a similar sawtooth pattern on memory usage and has currently been up for about 1.5 hours. The nodes in Germany have both settled into a steady state of seemingly high memory usage after periods of memory use spikes and what look like OOM kills. The node in India has been up for almost two days but was showing a similar pattern before that.

Perhaps of particular interest, the nodes in Belgium both appear to have stopped forwarding packets altogether with strong correlation to their own memory spikes over the last 10 days.

I started investigating after having trouble connecting to remote hosts over Mycelium today. Not sure if this is directly related, but I noticed a large number of messages in my laptop's Mycelium logs (hundreds per second) indicating routes lost and acquired from the nodes in Finland. That seems to have subsided now and connectivity from my laptop has improved.

LeeSmet · 2024-10-25T14:59:14Z

The high memory usage is indeed new. It seems that memory usage doesn't drop to expected levels after the queue of inbound messages is cleared, which will need some debugging. Note that when 3.15 is released on mainnet, the zos nodes should update to a new mycelium version which reduces protocol traffic and should significantly improve the situation.

The Belgian nodes are running a modified binary which rate limits inbound connections, to allow a steady build-up over time. Unfortunately it seems certain connections are unstable, so the nodes now lose connections at roughly the same rate as they accept them. In the current situation, it can take some time before you manage to connect to these nodes. This rate limiting still needs some tuning (that's also why it's only these nodes that have it).

As a side note, these crappy connections which get reset all the time due to (presumably) the lower network being unstable are also pretty bad, since they generate additional protocol traffic and waste some CPU time on the peer to retract the node and then add it again after the reconnect.
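
For readers unfamiliar with the idea, the inbound rate limiting mentioned above for the Belgian nodes can be pictured as a token bucket in front of the accept loop. The sketch below is only an illustration under assumptions: `AcceptLimiter`, its fields, and the capacity and refill values are hypothetical names and numbers, not the modified binary's actual code or Mycelium's API. Time is passed in explicitly so the behaviour is easy to reason about.

```rust
use std::time::Duration;

/// Hypothetical token bucket deciding whether an inbound connection may be
/// accepted. Not taken from Mycelium; names and parameters are assumptions.
pub struct AcceptLimiter {
    capacity: f64,       // maximum burst of accepts
    tokens: f64,         // accepts currently available
    refill_per_sec: f64, // steady-state accepts per second
    last: Duration,      // time of the last refill, measured from an arbitrary start
}

impl AcceptLimiter {
    /// Starts with a full bucket so an initial burst of `capacity` is allowed.
    pub fn new(capacity: u32, refill_per_sec: f64) -> Self {
        Self {
            capacity: capacity as f64,
            tokens: capacity as f64,
            refill_per_sec,
            last: Duration::ZERO,
        }
    }

    /// Returns true if a connection arriving at `now` may be accepted.
    /// Rejected connections would be dropped (or delayed) by the caller.
    pub fn try_accept(&mut self, now: Duration) -> bool {
        // Refill in proportion to elapsed time, capped at the bucket capacity.
        let elapsed = now.saturating_sub(self.last).as_secs_f64();
        self.last = self.last.max(now);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With a limiter like this, a node builds up its peer set at no more than `refill_per_sec` connections per second after the initial burst. That also illustrates the tuning problem described above: if unstable peers drop at about the same rate, the peer count stops growing.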
