address any errors or oddness from scheduled full cluster reboot #301
- ES didn't restart on Ramos.

Ramos came back up without the array filesystem (/srv/data) mounted. I'm not aware that any manual action was taken to restart ES once the array FS reappeared.

I can't help wondering if all the delays were due to filesystem checks; ramos has the largest array, and took the longest to come back (despite being the system with the most CPU/memory).
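A minimal post-reboot sanity check along these lines might look like the sketch below. The /srv/data mount point comes from the notes above; the localhost:9200 port and the idea of running this from cron or a systemd unit after boot are assumptions, not anything we have in place today.

```python
#!/usr/bin/env python3
"""Post-reboot sanity check sketch: confirm the array filesystem is mounted
and the local Elasticsearch node answers before declaring the host healthy.
/srv/data is from this thread; port 9200 is an assumed default."""
import json
import os
import sys
import urllib.request

ARRAY_MOUNT = "/srv/data"                          # array filesystem mentioned above
ES_HEALTH_URL = "http://localhost:9200/_cluster/health"  # assumed local ES port


def main() -> int:
    # If the array FS never mounted (as on ramos), ES has no data to serve.
    if not os.path.ismount(ARRAY_MOUNT):
        print(f"FAIL: {ARRAY_MOUNT} is not mounted; ES data unavailable")
        return 1
    try:
        with urllib.request.urlopen(ES_HEALTH_URL, timeout=10) as resp:
            health = json.load(resp)
    except OSError as exc:
        print(f"FAIL: local Elasticsearch not reachable: {exc}")
        return 1
    print(f"OK: {ARRAY_MOUNT} mounted, ES cluster status={health.get('status')}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```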
I think the actions to consider are:
1. Understand the restart delays, and any actions that might be taken to reduce/eliminate them.
2. Consider running a news-search-api instance on all ES servers (with each NSA instance speaking only to the local ES server): this would have allowed the web search site to become available as soon as any two out of three ES servers were available (see the sketch after this list).
3. Longer term: make it possible to run the pipeline when only two ES servers are available (NOTE! permanent loss of an ES server is a precursor to total failure, and is NOT to be taken lightly). Two approaches:
   a. Run the indexer stack on a docker swarm/cluster consisting of all ES servers.
   b. Run the indexer stack on a "compute" docker server (or cluster) SEPARATE from the ES servers. Consider running an importer on each ES server speaking only to the local ES.
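To make the two-out-of-three availability reasoning in item 2 concrete, here is a rough sketch of the check it implies. Only ramos and bradley are named in this thread, so the third hostname is a placeholder, and the port is an assumed default.

```python
#!/usr/bin/env python3
"""Sketch of the 2-of-3 availability idea in item 2: if each news-search-api
instance talks only to its local ES server, search can be served as soon as
any two of the three ES hosts answer. Third hostname is a placeholder."""
import urllib.request

ES_HOSTS = ["ramos", "bradley", "third-es-host"]  # third name is a placeholder
MIN_AVAILABLE = 2  # search considered serviceable with any two hosts up


def host_is_up(host: str) -> bool:
    """Return True if the ES HTTP endpoint on this host answers at all."""
    try:
        with urllib.request.urlopen(f"http://{host}:9200/", timeout=5):
            return True
    except OSError:
        return False


up = [h for h in ES_HOSTS if host_is_up(h)]
print(f"{len(up)}/{len(ES_HOSTS)} ES servers answering: {', '.join(up) or 'none'}")
print("search serviceable" if len(up) >= MIN_AVAILABLE else "search degraded/unavailable")
```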
More notes on why ramos and bradley took another two hours to become SSH-able:
And a related note on UPS:
@philbudne had suggested getting UMass IT to run a reboot test so we could watch the reboot process. Is this still needed after the above response?
... having gone through the message, my vote would be YES. Every previous outage has uncovered issues with the current setup (misconfigured software, hardware, etc.). I think it's important to test a reboot just to be sure that we're now PROD-proper (as well as to determine how long the system takes to come back up on a good day).
A fan replacement on one of the machines is pending with them; we can bundle these tasks for them when that is scheduled.
We had a scheduled outage Wed June 5th to replace batteries (#285) and try out rebooting. What went wrong or merits investigation? Please add items to this list, and/or add a comment with the relevant explanation (@thepsalmist @philbudne). Or break things off into new issues if more complex. Things I noted: