address any errors or oddness from scheduled full cluster reboot #301
- ES didn't restart on Ramos.

Ramos came back up without the array filesystem (/srv/data) mounted. I'm not aware that any manual action was taken to restart ES once the array FS reappeared.

I can't help wondering if all the delays were due to filesystem checks; ramos has the largest array, and took the longest to come back (despite being the system with the most CPU/memory).
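A minimal post-reboot sanity check along these lines might look like the sketch below. The /srv/data mount point comes from the notes above; the localhost:9200 port and the idea of running this from cron or a systemd unit after boot are assumptions, not anything we have in place today.

```python
#!/usr/bin/env python3
"""Post-reboot sanity check sketch: confirm the array filesystem is mounted
and the local Elasticsearch node answers before declaring the host healthy.
/srv/data is from this thread; port 9200 is an assumed default."""
import json
import os
import sys
import urllib.request

ARRAY_MOUNT = "/srv/data"                          # array filesystem mentioned above
ES_HEALTH_URL = "http://localhost:9200/_cluster/health"  # assumed local ES port


def main() -> int:
    # If the array FS never mounted (as on ramos), ES has no data to serve.
    if not os.path.ismount(ARRAY_MOUNT):
        print(f"FAIL: {ARRAY_MOUNT} is not mounted; ES data unavailable")
        return 1
    try:
        with urllib.request.urlopen(ES_HEALTH_URL, timeout=10) as resp:
            health = json.load(resp)
    except OSError as exc:
        print(f"FAIL: local Elasticsearch not reachable: {exc}")
        return 1
    print(f"OK: {ARRAY_MOUNT} mounted, ES cluster status={health.get('status')}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```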
I think the actions to consider are:
1. Understand the restart delays, and any actions that might be taken to reduce/eliminate them.
2. Consider running a news-search-api instance on all ES servers (with each NSA instance speaking only to the local ES server): this would have allowed the web search site to become available as soon as any two out of three ES servers were available (see the sketch after this list).
3. Longer term: make it possible to run the pipeline when only two ES servers are available (NOTE! permanent loss of an ES server is a precursor to total failure, and is NOT to be taken lightly). Two approaches:
   a. Run the indexer stack on a docker swarm/cluster consisting of all ES servers.
   b. Run the indexer stack on a "compute" docker server (or cluster) SEPARATE from the ES servers. Consider running an importer on each ES server speaking only to the local ES.
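To make the two-out-of-three availability reasoning in item 2 concrete, here is a rough sketch of the check it implies. Only ramos and bradley are named in this thread, so the third hostname is a placeholder, and the port is an assumed default.

```python
#!/usr/bin/env python3
"""Sketch of the 2-of-3 availability idea in item 2: if each news-search-api
instance talks only to its local ES server, search can be served as soon as
any two of the three ES hosts answer. Third hostname is a placeholder."""
import urllib.request

ES_HOSTS = ["ramos", "bradley", "third-es-host"]  # third name is a placeholder
MIN_AVAILABLE = 2  # search considered serviceable with any two hosts up


def host_is_up(host: str) -> bool:
    """Return True if the ES HTTP endpoint on this host answers at all."""
    try:
        with urllib.request.urlopen(f"http://{host}:9200/", timeout=5):
            return True
    except OSError:
        return False


up = [h for h in ES_HOSTS if host_is_up(h)]
print(f"{len(up)}/{len(ES_HOSTS)} ES servers answering: {', '.join(up) or 'none'}")
print("search serviceable" if len(up) >= MIN_AVAILABLE else "search degraded/unavailable")
```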
More notes on why ramos and bradley took another two hours to become SSH-able:
And a related note on UPS:
@philbudne had suggested getting UMass IT to run a reboot test so we could watch the reboot process. Is this still needed after the above response?
... having gone through the message, my vote would be YES. Every previous outage has uncovered issues with the current setup (misconfigured software, hardware, etc.). I think it's important to test a reboot just to be sure that we're now PROD-proper (as well as to determine how long the system takes to come back up on a good day).
A fan replacement on one of the machines is pending with them; we can bundle these tasks for them when that is scheduled.
We had a scheduled outage Wed June 5th to replace batteries (#285) and try out rebooting. What went wrong or merits investigation? Please add items to this list, and/or add a comment with the relevant explanation (@thepsalmist @philbudne). Or break things off into new issues if more complex. Things I noted: