Content Node Always Down #32432
Comments
Version 8.270.8
Quick correction: the pod itself does not restart; it is the vespa-proton indexing service that keeps starting again and again. From what I understand, this is expected behavior rather than an issue. I tried stopping and starting the services again, but the node continues to show a "Connection reset" error on the cluster controller page. The restart penalty is now up to 1800 seconds.
The document store data is corrupt for some reason (corruption, incomplete write, bug). We would be interested in looking at it, but I think that will be hard for non-technical reasons, and you are also on quite an old version. Unless you have configured redundancy 1, the data will already be restored in secondary copies on the other nodes, so you can get out of this situation by deleting the data on this node.
In Vespa 8.413.11 we have extended the chunk exception with more details (#32452), which will be logged if something similar happens again. Please upgrade to the newest version and report back.
Sounds great, thank you!
I have a content node that is constantly down (it keeps restarting every 30 minutes or so). The logs look mostly fine, but I did notice the message quoted below.
Steps to reproduce:
Nothing specific here: I created a cluster, ingested documents, and now find that one node is struggling.
Any ideas on how to debug or proceed here? I also tried replacing the node (no data loss since the data is persisted on a mount) but the problem still exists.
"terminate called after throwing an instance of search::chunkException
terminate called recursively
incremented restart penalty to 14 seconds"
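For anyone debugging a similar restart loop, it can help to pull the restart penalty values out of the log to see how quickly they grow. The following is only a minimal sketch: it assumes the log contains the message text quoted above, and the function name `restart_penalties` and the sample lines are hypothetical. It matches only the message itself, since the layout of the rest of the log line is not shown in this issue.

```python
import re

# Matches the message text quoted in this issue. The layout of the rest of
# the log line is an assumption, so only the message itself is matched.
PENALTY_RE = re.compile(r"incremented restart penalty to (\d+(?:\.\d+)?) seconds")


def restart_penalties(lines):
    """Return the restart penalty values (in seconds) found in the given log lines."""
    penalties = []
    for line in lines:
        match = PENALTY_RE.search(line)
        if match:
            penalties.append(float(match.group(1)))
    return penalties


if __name__ == "__main__":
    # Hypothetical sample, modelled on the excerpt quoted in this issue.
    sample = [
        "terminate called after throwing an instance of search::chunkException",
        "terminate called recursively",
        "incremented restart penalty to 14 seconds",
    ]
    print(restart_penalties(sample))
```

To use it on a real log, the lines of the file can be passed in directly, for example `restart_penalties(open(path))`, where `path` is wherever the node's log happens to live.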