Content Node Always Down #32432

Closed
aryamanvinchhi opened this issue Sep 19, 2024 · 5 comments

Comments

@aryamanvinchhi

I have a content node that is constantly down (it keeps restarting every 30 minutes or so). The logs look mostly fine, but I did notice the message quoted at the end of this post.

Steps to reproduce:
Nothing specific. I created a cluster, ingested documents, and now one node is struggling.

Any ideas on how to debug or proceed here? I also tried replacing the node (no data loss, since the data is persisted on a mount), but the problem persists.

"terminate called after throwing an instance of search::chunkException
terminate called recursively
incremented restart penalty to 14 seconds"
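
To see how the restart penalty grows over time, the log can be scanned for that message. The following is a minimal sketch, assuming the wording quoted above is what appears in the log; the regular expression and the helper name `restart_penalties` are illustrative and not part of Vespa, and the 1800-second sample line is made up from the value mentioned later in this thread.

```python
import re

# Pattern based only on the message quoted in this issue; the real log format may differ.
PENALTY_RE = re.compile(r"incremented restart penalty to (\d+) seconds")


def restart_penalties(log_lines):
    """Return the restart penalty values (in seconds) found in the given log lines."""
    penalties = []
    for line in log_lines:
        match = PENALTY_RE.search(line)
        if match:
            penalties.append(int(match.group(1)))
    return penalties


# Illustrative sample lines, not a real log excerpt.
sample = [
    "terminate called recursively",
    "incremented restart penalty to 14 seconds",
    "incremented restart penalty to 1800 seconds",
]
print(restart_penalties(sample))
```

A steadily increasing sequence suggests the service crashes on every start rather than only occasionally.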

@aryamanvinchhi
Author

Version 8.270.8

@aryamanvinchhi
Author

Quick correction: the pod itself does not restart; it is the vespa-proton indexing service that keeps restarting. From what I understand, the repeated restart attempts are expected behavior rather than an issue in themselves.

I tried stopping and starting services again, but the node continues to show a "Connection reset" error on the cluster controller page. The restart penalty is up to 1800 seconds now.

@bratseth
Member

The document store data is corrupt for some reason (corruption, incomplete write, bug). We would be interested in looking at it, but I think that will be hard for non-technical reasons, and you are also on a quite old version.

Unless you have configured redundancy 1, the data will already have been restored in secondary copies on the other nodes, so you can get out of this situation by deleting the data of this node.
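
Before deleting anything, it is worth confirming the configured redundancy. Below is a minimal sketch that reads it from a `services.xml` string, assuming the content cluster declares a plain `<redundancy>` element. The cluster id, document type, and host aliases in the sample are hypothetical, and `configured_redundancy` is an illustrative helper, not a Vespa tool. Adapt it if your configuration expresses redundancy differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical services.xml fragment; cluster id, document type and host aliases are made up.
SERVICES_XML = """
<services version="1.0">
  <content id="mycluster" version="1.0">
    <redundancy>2</redundancy>
    <documents>
      <document type="doc" mode="index"/>
    </documents>
    <nodes>
      <node hostalias="node0" distribution-key="0"/>
      <node hostalias="node1" distribution-key="1"/>
    </nodes>
  </content>
</services>
"""


def configured_redundancy(services_xml):
    """Map each content cluster id to its <redundancy> value, or None if the element is absent."""
    root = ET.fromstring(services_xml.strip())
    result = {}
    for content in root.iter("content"):
        element = content.find("redundancy")
        result[content.get("id")] = int(element.text) if element is not None else None
    return result


for cluster, redundancy in configured_redundancy(SERVICES_XML).items():
    if redundancy is None or redundancy < 2:
        print(f"{cluster}: redundancy is {redundancy}; do not delete node data without a backup")
    else:
        print(f"{cluster}: redundancy is {redundancy}; other nodes should hold copies")
```

With a redundancy of 1 there are no secondary copies, so deleting the node's data would lose the documents stored only there.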

@geirst
Member

geirst commented Sep 25, 2024

In Vespa 8.413.11 we have extended the chunk exception with more details (#32452) that will be logged if something similar happens again.

Please upgrade to the newest version and report back.

@aryamanvinchhi
Author

Sounds great, thank you!

@kkraune kkraune closed this as completed Oct 16, 2024