Postmortem #5738 showed that a node can crash and restart if a runtime API call hangs. The danger is that when one API hangs or takes a long time, the behaviour is the same on all nodes; in this case, all nodes crashed and restarted at the same time.
That's not good for the network, so we should explore ideas for reducing the blast radius. One possible method is to time out runtime API calls and make sure the subsystems gracefully handle this type of error.
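A minimal sketch of the timeout idea, using only the Rust standard library. The names `call_with_timeout` and `RuntimeApiError` are hypothetical and not taken from the codebase, and the closure stands in for a real runtime API call. The point is that the caller gets an error value it can handle instead of blocking forever:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type for a runtime API request.
#[derive(Debug, PartialEq)]
pub enum RuntimeApiError {
    /// The call did not finish within the deadline.
    Timeout,
    /// The worker ended without sending a result (for example, it panicked).
    WorkerFailed,
}

/// Runs `call` on a worker thread and waits at most `deadline` for its result.
pub fn call_with_timeout<T, F>(call: F, deadline: Duration) -> Result<T, RuntimeApiError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out,
        // so a failed send is ignored.
        let _ = tx.send(call());
    });
    match rx.recv_timeout(deadline) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(RuntimeApiError::Timeout),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(RuntimeApiError::WorkerFailed),
    }
}
```

A subsystem could match on `Err(RuntimeApiError::Timeout)` and skip or retry the request rather than stall. Note that this sketch does not stop the worker on timeout: the thread keeps running until `call` returns.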
One thing to take into consideration is that even if the subsystem call timed out, the runtime could still have that API call running in the background and burning CPU time, so we need to make sure we gracefully cancel or kill tasks that are no longer needed.
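One way to picture the cancellation side is a cooperative flag that the long-running work checks between steps. This is only a sketch under assumptions: `cancellable_work` and `run_with_cancellation` are hypothetical names, the sleep stands in for real work, and whether the actual runtime executor can be interrupted this way is not something this issue establishes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Hypothetical long-running job that polls a cancellation flag between steps.
/// Returns the number of steps completed before finishing or being cancelled.
fn cancellable_work(cancel: &AtomicBool, max_steps: u64) -> u64 {
    let mut done = 0;
    while done < max_steps && !cancel.load(Ordering::Relaxed) {
        thread::sleep(Duration::from_millis(5)); // stand-in for one unit of work
        done += 1;
    }
    done
}

/// Starts the job, waits up to `deadline`, and on timeout asks it to stop.
/// Returns `Some(steps)` if the job finished in time, `None` if it timed out.
pub fn run_with_cancellation(max_steps: u64, deadline: Duration) -> Option<u64> {
    let cancel = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel();
    let worker = {
        let cancel = Arc::clone(&cancel);
        thread::spawn(move || {
            // The send fails only if the receiver is gone; ignore that case.
            let _ = tx.send(cancellable_work(&cancel, max_steps));
        })
    };
    let result = rx.recv_timeout(deadline).ok();
    if result.is_none() {
        // Timed out: signal the worker so it stops burning CPU.
        cancel.store(true, Ordering::Relaxed);
    }
    // The worker exits at most one step after the flag is set, so this join is short.
    let _ = worker.join();
    result
}
```

Cooperative cancellation only works if the task reaches a check point. Work that cannot poll a flag would need a harder mechanism, such as running it in a separate process that can be killed.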