
Content node taking too long to come back up after restart #32396

Open

shubh9194 opened this issue Sep 13, 2024 · 11 comments

Comments

@shubh9194

Describe the bug
We have a cluster with 1 container node and 1 content node, with almost 85 schemas. We have observed that when the content node is restarted, either by us or for any other reason, it takes a long time to come back up.
The cluster also stores and reads data from an encrypted path.

Vespa version: 8.388.11

Expected behavior
The content node should be back up within a few minutes at most, not take hours.

Environment:

  • Infrastructure: Podman

Vespa version
8.388.11

Additional context
Will be sharing the logs separately.

@hmusum
Member

hmusum commented Sep 13, 2024

Do you use vespa-stop-services when you restart? Read the doc about vespa-stop-services and the prepareRestart command, which might reduce startup time. If you do that and see that prepareRestart times out, Vespa 8.406.26 and later have a higher timeout for this command, as we have seen that content nodes with a lot of data might need more time for this.
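
For reference, a minimal sketch of that shutdown sequence, assuming the content node runs in a Podman container named `vespa` (the container name is a placeholder): the point is to let vespa-stop-services, which issues prepareRestart to the content node, run to completion before the container itself is stopped.

```python
import subprocess

CONTAINER = "vespa"  # placeholder: the name of the Podman container running Vespa

# Run vespa-stop-services inside the container and wait for it to finish.
# Flushing a large content node can take a long time, so use a generous
# timeout rather than killing the process early.
subprocess.run(
    ["podman", "exec", CONTAINER, "/opt/vespa/bin/vespa-stop-services"],
    check=True,
    timeout=3600,
)

# Only stop the container once services have shut down cleanly, so the next
# start does not have to replay a large transaction log.
subprocess.run(["podman", "stop", CONTAINER], check=True)
```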

@shubh9194
Author

shubh9194 commented Sep 16, 2024

@hmusum
I am updating the Vespa version to the one you suggested.
In the meantime, can you please help with this as well: reindexing is also taking a lot of time, and sometimes it seems to be stuck with no visible progress.
In the content logs I don't see any errors, but I do see a lot of warnings like the one below.

Warning searchnode proton.persistence.util Slow processing of message Put (BucketId(0x300000000000000), id:default:docid, timestamp 1726503690000164, size 19266). processing time: 58.0 s (>=5.0) (Repeated 107 times since 1726503752.512184)

Let me know if you need more logs.
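
A quick way to summarize those warnings from the content node log (a rough sketch; the log path below assumes the default location under /opt/vespa and may differ in your setup):

```python
import re
from collections import Counter

# Placeholder path: the default Vespa log file location.
LOG_PATH = "/opt/vespa/logs/vespa/vespa.log"

# Match warnings like the one quoted above and extract the message type
# ("Put") and the reported processing time in seconds.
pattern = re.compile(r"Slow processing of message (\w+).*?processing time: ([\d.]+) s")

times = []
by_type = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            by_type[m.group(1)] += 1
            times.append(float(m.group(2)))

if times:
    print(f"{len(times)} slow-message warnings, by type: {dict(by_type)}")
    print(f"max {max(times):.1f}s, mean {sum(times) / len(times):.1f}s")
else:
    print("no slow-message warnings found")
```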

@hmusum
Member

hmusum commented Sep 16, 2024

Sounds like you have a setup that cannot handle the feed load.

Hardware, local or remote disk, number of documents, number and type of indexes, paged attributes, tensors/embeddings, and HNSW indexes are some of the things that might affect this. You might want to look into https://docs.vespa.ai/en/performance/sizing-feeding.html to see what, if anything, you can do about this. It is hard to say without knowing more about the application, schema details, hardware, etc.
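
As a rough way to check whether the content node keeps up with the feed, individual puts against the Document v1 API can be timed. This is only a sketch; the endpoint, namespace, document type and field names below are placeholders that must match the actual application:

```python
import json
import time
import urllib.request

# Placeholders: adjust to the real container endpoint and schema.
ENDPOINT = "http://localhost:8080"
NAMESPACE = "default"
DOCTYPE = "mydoc"

def put_document(doc_id: str, fields: dict) -> float:
    """Feed one document via the Document v1 API and return the round-trip latency."""
    url = f"{ENDPOINT}/document/v1/{NAMESPACE}/{DOCTYPE}/docid/{doc_id}"
    body = json.dumps({"fields": fields}).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST",
                                 headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start

# Sustained per-document latencies in the seconds range point at the content
# node, matching the "Slow processing of message Put" warnings in vespa.log.
latencies = [put_document(f"test-{i}", {"title": f"test doc {i}"}) for i in range(20)]
print(f"max {max(latencies):.2f}s, mean {sum(latencies) / len(latencies):.2f}s")
```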

@shubh9194
Author

shubh9194 commented Sep 17, 2024

@hmusum
We have a single container node and a single content node, on different hosts.
The content node host has 62GB RAM and around 400GB disk.
In our cluster we currently have around 90 schemas deployed, of which only around 30 have documents so far.
Total docs are around 1.5M, and one schema holds more than 70% of them.

Will reducing the number of schemas help?

Also, a week back reindexing succeeded once after a node restart, and around that time I found some deadlock-related warnings in the logs:

proton.deadlock.detector warning Thread Modified bucket checker 0x10308644200 has gone 296151 milliseconds without registering a tick. Global slack not expended yet. Warning for now. Attempting to dump stack of thread... Backtrace:
/lib64/libc.so.6(syscall+0x1d) [0x7ff21aa919bd]
/lib64/libstdc++.so.6(std::__atomic_futex_unsigned_base::_M_futex_wait_until(unsigned int*, unsigned int, bool, std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+0xe1) [0x7ff21b4777c1]
/opt/vespa/sbin/vespa-proton-bin(proton::PersistenceEngine::getModifiedBuckets(document::BucketSpace) const+0x454) [0x5ffe24]
/opt/vespa/lib64/libstorage.so(storage::ModifiedBucketChecker::requestModifiedBucketsFromProvider(document::BucketSpace)+0x32) [0x7ff229234932]
/opt/vespa/lib64/libstorage.so(storage::ModifiedBucketChecker::tick()+0x1a3) [0x7ff229234d3]
/opt/vespa/lib64/libstorage.so(storage::ModifiedBucketChecker::run(storage::framework::ThreadHandle&)+0xdb) [0x7ff229234cb]
/opt/vespa/lib64/libstorage.so(storage::framework::defaultimplementation::ThreadImpl::run()+0x20) [0x7ff22937b390]
/lib64/libstdc++.so.6(+0xc2b23) [0x7ff21b479b23]
/opt/vespa/lib64/vespa/malloc/libvespamalloc.so(+0x94de) [0x7ff22b2f64de]
/lib64/libpthread.so.0(+0x81ca)
/lib64/libc.so.6(clone+0x43) [0x7ff21aa91e73]

Also, when the content node is restarted, transaction log replay progress is very slow. We also see high CPU usage, which persists until the content node is up: CPU usage jumps from around 10% to 60% or even 80% at times, along with a high page-out rate.
Let me know if you need any more logs or info.
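
For capturing those CPU and page-out numbers over time during replay, a simple sketch using the standard Linux vmstat tool (nothing Vespa-specific), so the samples can be lined up with the replay progress messages in vespa.log:

```python
import subprocess
import time

def vmstat_sample() -> str:
    # "vmstat 1 2": the second sample reflects activity over the last second.
    out = subprocess.run(["vmstat", "1", "2"], capture_output=True, text=True).stdout
    return out.strip().splitlines()[-1]

# Sample once a minute for two hours. In procps vmstat, "so" is KiB swapped
# out per second and "us"/"sy"/"wa" are user/system/IO-wait CPU percentages.
for _ in range(120):
    print(time.strftime("%H:%M:%S"), vmstat_sample())
    time.sleep(60)
```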

@shubh9194
Author

shubh9194 commented Sep 18, 2024

@hmusum @kkraune
We have reduced the number of schemas to fewer than 50 and updated the Vespa version to 8.406.26.
Previously, to make sure the cluster kept working, we used the reindex API to stop the reindexing of one of the schemas.
After upgrading, I started the reindexing again, but it always gets stuck and we don't see any progress.
During that time, CPU usage and the page-out rate are very high.
And if we stop the Vespa process and the Podman container during this time and try to restart them, the Vespa services take a long time to come back up, as transaction log replay is very slow for the schema that has the most docs.

Reindexing has not completed for that schema in the last 2 weeks, and the only way to keep ingestion working was to stop the reindexing using the reindex API.
We don't see any errors in the logs, only a lot of warnings like:
Warning searchnode proton.persistence.util Slow processing of message Put (BucketId(0x300000000000000), id:default:docid, timestamp 1726503690000164, size 19266). processing time: 58.0 s (>=5.0) (Repeated 107 times since 1726503752.512184)

RAM usage on the host where the content node runs is around 60%, so we do have significant RAM available as well.

How do we find where the actual bottleneck is during reindexing?
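
(For anyone hitting the same issue: one way to see whether reindexing is advancing at all is the config server's reindexing status endpoint. A minimal sketch, assuming a default self-hosted deployment with tenant and application named default and the config server on localhost:19071; the exact path is described in the Vespa reindexing documentation:)

```python
import json
import urllib.request

# Placeholder host; default tenant/application names for a self-hosted deployment.
STATUS_URL = (
    "http://localhost:19071/application/v2/tenant/default/application/default"
    "/environment/prod/region/default/instance/default/reindexing"
)

# The response lists, per content cluster and document type, the reindexing
# state and, while running, a progress value, which shows whether reindexing
# of the large schema is moving at all or genuinely stuck.
with urllib.request.urlopen(STATUS_URL) as resp:
    print(json.dumps(json.load(resp), indent=2))
```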

@bratseth
Member

To avoid tx replay on startup, let vespa-stop-services complete before stopping the container.

How many documents are in this schema, do you use HNSW indexes (how many), and are the vector attributes paged?

@shubh9194
Author

@bratseth we stop the Vespa services before stopping the container.
I will confirm whether vespa-stop-services completes properly.

But can you help with why reindexing is getting stuck? All ingestions are timing out as well, unless we stop the reindexing using the reindex API.
During reindexing we see the logs filled with these warning messages:
Warning searchnode proton.persistence.util Slow processing of message Put (BucketId(0x300000000000000)

@bratseth
Member

Is it reindexing that is the problem, or is it replaying the tx log at startup?

How many documents are in this schema, do you use HNSW indexes (how many), and are the vector attributes paged?

@shubh9194
Author

shubh9194 commented Sep 19, 2024

@bratseth
We observed that this issue started with reindexing, and the tx log startup issue is also seen when the node is restarted during reindexing.
At any other time, the node comes back within a few minutes after a restart.

So we think reindexing is the root cause of all this, and we want to understand what the bottleneck is for reindexing and why it gets stuck for one schema.
That schema has around 1.36M docs.
And yes, this schema has more than 2000 index fields and a good number of paged attributes.

@bratseth
Member

Are the fields that are HNSW indexed paged?

@shubh9194
Author

@bratseth no, the HNSW-indexed fields are not paged.
