address any errors or oddness from scheduled full cluster reboot #301

Open
rahulbot opened this issue Jun 6, 2024 · 5 comments
Labels: infrastructure, question (Further information is requested)

Comments

@rahulbot
Contributor

rahulbot commented Jun 6, 2024

We had a scheduled outage on Wednesday, June 5th to replace batteries (#285) and try out rebooting. What went wrong or merits investigation? Please add items to this list, and/or add a comment with a relevant explanation (@thepsalmist @philbudne). Or break items off into new issues if they are more complex. Things I noted:

  • we were unable to SSH to ramos, bradley, woodward for a few hours
  • ES didn't restart on ramos
@rahulbot rahulbot added question Further information is requested infrastructure labels Jun 6, 2024
@rahulbot rahulbot added this to the Production Beta 7 milestone Jun 6, 2024
@philbudne
Contributor

philbudne commented Jun 6, 2024 via email

@pgulley
Member

pgulley commented Jul 3, 2024

More notes on why ramos and bradley took another two hours to become SSH-able:

If I remember correctly, the delay was connected to us needing to go through and change the UEFI/BIOS settings on all of the machines after replacing the CMOS batteries. And there were a few of your machines that still required BIOS for some reason, unlike the others.

And a related note on UPS:

Also: if you ever plan on replacing your battery backup (UPS) devices, please check in with me first about your plans. Most of our racks are set up for 220V and your two UPS are 110V. We’d like to have everything standardized to 220V, if possible.

@pgulley
Member

pgulley commented Jul 3, 2024

@philbudne had suggested getting UMass IT to run a reboot test so we can watch the reboot process. Is this still needed after the above response?

@pgulley pgulley moved this to In Progress in Ingest + Index Infrastructure Jul 3, 2024
@pgulley pgulley modified the milestones: June, July Jul 3, 2024
@kilemensi
Contributor

... having gone through the message, my vote would be YES. Every previous outage has uncovered issues with the current setup (misconfigured software, hardware, etc.). I think it's important to test a reboot just to be sure that we're now PROD proper (as well as to determine how long the system takes to come back up on a good day).

@pgulley
Member

pgulley commented Jul 10, 2024

This is pending a fan replacement on one of the machines from them. We can bundle these tasks for them when that is scheduled.

@pgulley pgulley self-assigned this Jul 17, 2024
@pgulley pgulley modified the milestones: 2 - July, 3 - August Jul 31, 2024
@pgulley pgulley modified the milestones: 3 - August, 4 - September Aug 7, 2024
Projects
Status: In Progress

4 participants