
Inception


Broad Overview

[Aug 31, 2023] This is highly speculative; we're exploring other ways to crack this nut, including ansible-config/projects/weekly-reboot/

As a SysAdmin, I require automated reboots with flexible scheduling, because we need to minimize downtime while responding to events that require a reboot, including, but not limited to, the release of a new kernel. We have such a system, but it's about to be rendered useless by the impending transition from our own vCenter cluster to IST's managed cluster.

Review of the current system

  • At the beginning of the week, a script queries vCenter for a list of all the Linux servers. It walks down the list, sorted alphabetically: the 1st server is added to the Tuesday list, the 2nd to the Wednesday list, the 3rd to the Thursday list, and so on round-robin (a minimal sketch appears after this list).
  • On Tuesday/Wed/Thur at the appointed time, a different script wakes up, queries the appropriate list, shuts down all the servers in the list, and then restarts the servers. Limited allowances are made for "servers with special needs", via customized configuration.
  • Pro:
    • Reliable
    • Flexible enough for our needs
    • It performs a full power-cycle reboot, useful for applying (hardware) changes made in the vCenter GUI
  • Con:
    • Reboots way more than actually required for new kernels
    • No GUI; custom programming is required for all changes/exceptions; thus inflexible
    • Requires access to the vCenter API (which we're about to lose)
    • Does not reboot everything: no support for VMs hosted on other hypervisors outside the cluster
    • Not represented in Ansible
    • Not packaged as an RPM / container for deployment
    • Slow: it takes a minimum of 3 days to reboot all the VMs, and that is frequently extended; if a kernel arrives on Wednesday at 2 am, it might be 8 days before all the VMs have been rebooted, and this extended schedule causes conniptions in Nagios throughout the week
    • No control over the order in which servers are rebooted:
      • Dev before Staging / Staging before Prod
      • HAProxy standbys, before primaries
      • DB replicas, before DB primaries
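
A minimal sketch of that day-assignment step, in Python, assuming the vCenter query has already produced the list of hostnames; the function name and the example inventory are illustrative, not the real script:

# Round-robin assignment of servers to reboot days, as the current system does it.
# Sketch only: the real script gets its inventory from the vCenter API.

from itertools import cycle

REBOOT_DAYS = ["tuesday", "wednesday", "thursday"]

def build_schedule(hostnames):
    """Distribute the alphabetically sorted hostnames across the reboot days."""
    schedule = {day: [] for day in REBOOT_DAYS}
    days = cycle(REBOOT_DAYS)
    for host in sorted(hostnames):
        schedule[next(days)].append(host)
    return schedule

if __name__ == "__main__":
    # Example inventory; in reality this comes from vCenter.
    print(build_schedule(["burton", "bury", "haproxy-dev1", "telford", "torquay"]))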

Base Requirements

  • Security is Job #1
  • Simplicity is key to low maintenance, low cost, and Security
  • This is all about flexible, easy control over which servers will reboot, and when!
  • Having a complete inventory of all servers is hard... can we make this easier?
  • We don't want all the servers rebooting at once; reboots should be spread across a random window
  • There may be situations (Christmas) in which we want to disable all reboots
  • As a sysAdmin, I need to be able to broadly visualize which servers will be rebooted "tonight", and perhaps over other time periods
  • As a sysAdmin, I know that an individual server should not reboot if it's already running the new kernel
  • As a sysAdmin, I know that an individual server should not reboot if the new kernel has not been installed (a possible client-side check for these two kernel rules is sketched after this list)
  • I need to be able to disable all steady-state reboots when Christmas or summer vacations appear on our calendar
  • I expect that each server will only reboot once, in response to a new kernel
  • Order:
    • I want the Dev and Staging systems to be rebooted FIRST; Prod systems should wait to be rebooted
    • I want the Database Replicas to be rebooted before the Primaries
    • I want the HAproxy Secondaries to be rebooted after the Primaries
    • I know some applications are Critical, have multiple webservers behind the load-balancer, and should not have both webservers rebooting at the same time.
  • Limiting the Blast Radius:
    • Obviously, we want all the systems to be running the new kernel ASAP...
    • But I need to avoid creating a trainwreck or multi-car-pileup. If servers are commanded to reboot, and they go down, but do not come back up, it's important to halt all further reboot operations until a sysAdmin resolves the situation
  • I know that v5 hosts westvault, and if we reboot the hypervisor, it will automatically reboot the guest OS. We need to allow for this. Metadata for westvault should indicate it "never" gets rebooted, only the hypervisor.
  • I don't want to have to provide a lot of "state" to the system. It should be able to sort Dev-and-Staging systems from Prod automatically, based on their network. But I'll need to represent servers that belong to the same application...
  • As a sysAdmin, if the server receives a query about a client that it's never seen before, maybe it should reply "No", and log the event, so I can add it to the metadata
  • As a sysAdmin, if a new kernel arrives, I want automation to kick in and start arranging the reboot schedule. I don't want to have to take any action unless an exceptional situation arises, like throwing the master automation switch to NO! before the Christmas holidays begin.
  • SysAdmins know that even if the Client doesn't have a new kernel installed, after 45 days, it should be rebooted anyway.
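
One possible client-side check for the two kernel rules above, assuming an RPM-based system. It treats the most recently installed kernel package as the "new" kernel ("rpm -q --last kernel" lists kernel packages newest-install-first), which sidesteps true version comparison; see the note under "Nagios Integration?" about how hard that comparison really is. All names here are illustrative:

# Hypothetical sketch: is the newest installed kernel also the running kernel?
# Assumes an RPM-based system with package names like kernel-5.14.0-...x86_64.

import subprocess

def running_kernel():
    return subprocess.check_output(["uname", "-r"], text=True).strip()

def newest_installed_kernel():
    # "rpm -q --last kernel" lists installed kernel packages by install time, newest first.
    out = subprocess.check_output(["rpm", "-q", "--last", "kernel"], text=True)
    newest = out.splitlines()[0].split()[0]      # e.g. kernel-5.14.0-362.el9_3.x86_64
    return newest[len("kernel-"):]

def reboot_needed():
    # No reboot if the new kernel hasn't been installed, or if it's already running.
    return running_kernel() != newest_installed_kernel()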

Limiting the Scope

  • Sometimes I need a COLD shutdown of a server in VMWare, to make some virtual-hardware change visible to the guest OS. I recognize that this system is not capable of interacting with VMWare, so I cannot use it to accomplish that goal.
  • As sysAdmin, occasionally, I need to schedule an exceptional reboot on one or more servers. I won't use this system to accomplish that; I have the skills to automate in a fashion that won't repeat. Or, maybe this is supported on the Client - touch a file, the system reboots, and removes the file?

Security As the #1 Job

  • If we deploy a service that is insecure, we've lost.

Proposed Solution, viewed from 3000m

  • A "client" side wakes up frequently during the day, queries a server whether it should be rebooted, and performs the reboot
  • The "server" is a web-service running on "repo", where firewall rules automatically allow connection
  • Nagios integration:
    • Nagios is manually configured with the knowledge of what kernel version "should" be running.
    • But we could do better: we could write a new plugin that does a better job of recognizing when a new kernel package appears in a repository, and that both reconfigures Nagios with the new knowledge and alerts "the Server" that the reboot cycle should be triggered
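
A minimal sketch of that web-service, using only the Python standard library; the /should-reboot endpoint, its parameters, and the reply format are assumptions for illustration, not a settled API:

# Hypothetical sketch of "the Server" as a tiny web-service on repo.
# Endpoint name, parameters, and reply text are illustrative only.

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

class RebootHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/should-reboot":
            self.send_error(404)
            return
        host = parse_qs(url.query).get("hostname", ["unknown"])[0]
        # Real logic would consult the metadata and decision tree; see "Server Features".
        answer, reason = "No", f"no metadata for {host}"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(f"{answer}: {reason}\n".encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), RebootHandler).serve_forever()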

Server Features

  • The server is primarily an API which takes the hostname as input (although it could just key off the IP address of the client), and returns Yes (should reboot) or No (reboot not required)... maybe with a reason, for the client's log
  • How does the server know whether a client should be rebooted? Inventory is King!
    • The Server needs a static dependency graph for all Clients, to define the order of reboots:
      • For instance: telford => torquay => burton => bury => haproxy-dev1 => haproxy-dev2
      • Database servers are very similar (Dev first, then Staging, then Prod, but also Replicas first)
      • Application servers - just pick one to go first
      • For instance, this could be a simple JSON file, SQLite DB, etc (a sketch of the metadata and decision logic follows this list)
  • I want a single "master switch" that controls all servers at once (it's either On or Off), e.g. for the Christmas vacation requirement.
  • I know I can refer to Nagios to determine which servers require a reboot. But that only works after I receive a fairly random email from a mailing list I've signed up for, read it, find out that a new kernel has been released, and manually change the Nagios configuration to the new kernel. Or I'm logged into a system and realize that it has a newer kernel ready to install via "yum check-update". Or Nagios reports that the system has already rebooted and isn't running the "correct" kernel (it's running a newer kernel).
  • Looked at through the dependency graph, for HAProxy, there's a string of 6 servers to reboot. Each one takes at least 30 seconds. We could have them all rebooted in the same night, as long as there are not too many down all at the same time.
  • The only reason the current system staggers the reboots across 3 nights is it has no knowledge of the dependency graph, and it takes all the servers in the group down at the same time.
  • Decision tree:
    • No, if a SysAdmin threw the “vacation” switch
    • No, if your dependencies are not satisfied
    • No, if too many servers are down at once
    • Yes, if your uptime is too high, and no new kernel is installed
    • Yes, if you have no dependencies, or dependencies are satisfied
    • No metadata? No reboot! (Log this; an Administrator will add metadata to forgotten servers)
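
A sketch of how the metadata and the decision tree above might fit together. The metadata shape, field names, and thresholds are assumptions (the host names come from the examples above); in practice it could live in the JSON file or SQLite DB mentioned earlier:

# Hypothetical sketch of the Server's decision logic.

# Example metadata, e.g. loaded from a JSON file or SQLite DB:
METADATA = {
    "vacation_switch": False,      # the master "Christmas" switch
    "max_uptime_days": 45,         # reboot anyway after this long
    "max_hosts_down": 2,           # limit the blast radius
    "hosts": {
        "telford":      {"depends_on": ["torquay", "burton", "bury",
                                        "haproxy-dev1", "haproxy-dev2"]},
        "haproxy-dev2": {"depends_on": []},
        "westvault":    {"never_reboot": True},   # only its hypervisor reboots
    },
}

def should_reboot(hostname, uptime_days, running_kernel, latest_kernel,
                  rebooted_hosts, hosts_down, meta=METADATA):
    """Return (answer, reason) following the decision tree above."""
    if meta["vacation_switch"]:
        return "No", "a SysAdmin threw the vacation switch"
    host = meta["hosts"].get(hostname)
    if host is None:
        return "No", "no metadata for this host (logged for an Administrator)"
    if host.get("never_reboot"):
        return "No", "metadata says this host never reboots directly"
    if len(hosts_down) >= meta["max_hosts_down"]:
        return "No", "too many servers are down at once"
    unsatisfied = [d for d in host.get("depends_on", []) if d not in rebooted_hosts]
    if unsatisfied:
        return "No", "dependencies not satisfied: " + ",".join(unsatisfied)
    if running_kernel != latest_kernel:
        return "Yes", "a new kernel is installed but not running"
    if uptime_days > meta["max_uptime_days"]:
        return "Yes", "uptime exceeds %d days" % meta["max_uptime_days"]
    return "No", "already running the latest kernel"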

Nagios Integration?

  • Nagios can reach all the clients, eg via ping
  • All the clients can reach repo
  • Nagios knows what kernel is installed on each VM, via custom plugin
  • Maybe we could extend the custom plugin, to also report whether a newer kernel has been installed, and inform the server about the running kernel, and any installed newer kernels
  • NOTE: experience teaches that writing code to determine which kernel is newer is going to be hard
  • Reporting up/down / Avoiding Multicar Pileups
    • Maybe the client is bigger than a cronjob. Maybe at startup it informs the Server about successful startup
    • But maybe Nagios (event handler?) reports to the Server about transitions from up-to-down and vice versa (a sketch of such an event handler follows this list)
      • Nagios earlier reported to the Server that haproxy-dev2 is running kernel v2.1 and that v2.2 is installed
      • haproxy-dev2 queries the Server -- Server replies "Yes, new kernel"
      • telford queries the Server, which replies "No, dependency burton,bury,haproxy-dev1,haproxy-dev2", after consulting metadata
      • The Server is informed haproxy-dev2 is down by Nagios
      • The Server is informed haproxy-dev2 is up by Nagios
      • The Server is informed haproxy-dev2 is running the latest kernel
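
A sketch of such an event handler. Nagios can run a command on host state changes and pass macros like $HOSTNAME$ and $HOSTSTATE$ as arguments; the Server URL and payload shape below are assumptions:

#!/usr/bin/env python3
# Hypothetical Nagios host event handler: report up/down transitions to the Server.
# Nagios would invoke it with macros, e.g.:
#   report_host_state.py $HOSTNAME$ $HOSTSTATE$ $HOSTSTATETYPE$

import json
import sys
import urllib.request

SERVER_URL = "http://repo.example.org:8080/host-state"   # assumed endpoint

def main():
    hostname, state, state_type = sys.argv[1:4]
    if state_type != "HARD":
        return                    # ignore soft states; only report confirmed transitions
    payload = json.dumps({"hostname": hostname, "state": state}).encode()
    req = urllib.request.Request(SERVER_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

if __name__ == "__main__":
    main()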

Client Features

  • It's a root crontab, on each VM
  • Maybe it's set to wake at 0300, maybe only Tues/Wed/Thu
  • It includes querying the server whether it should reboot, supplying parameters: hostname, uptime, current kernel, latest kernel (a minimal client sketch appears at the end of this section)
    • If it gets a "No" response, write the reason to a log & exit
    • If it gets a "Wait" response, go to sleep & try again for up to ... 30 minutes? Write the reason to a log
    • If it gets a "Yes" response, write the reason to a log, and reboot
  • Maybe, for exceptional situations where only one server needs to be rebooted, it supports a "file override" - if a specific "trigger file" exists, it deletes the file and performs a reboot.
  • Maybe, the client has more intelligence, checks what kernel is running & whether a new kernel has been installed, and the uptime, supplies those as parameters to the Server?
  • Maybe, the client includes a one-time service that runs at boot time, reporting successful reboot to the Server
  • Beau suggests:
[root@otrs-app-prd-1 ~]# dnf needs-restarting -r
Core libraries or services have been updated since boot-up:  
* kernel

Reboot is required to fully utilize these updates.
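
Pulling the client features above together, a minimal sketch of the cron-driven client. The Server URL, trigger-file path, and log location are assumptions, and the "latest installed kernel" parameter is omitted for brevity (the earlier rpm-based check could supply it):

#!/usr/bin/env python3
# Hypothetical client, run from root's crontab (e.g. 0300 Tue/Wed/Thu).
# Asks the Server whether to reboot, honours a local trigger-file override,
# and logs the reason either way.  URLs, paths, and parameters are illustrative.

import os
import subprocess
import time
import urllib.parse
import urllib.request

SERVER_URL = "http://repo.example.org:8080/should-reboot"   # assumed endpoint
TRIGGER_FILE = "/etc/reboot-me"                             # assumed override file
LOG_FILE = "/var/log/auto-reboot.log"

def log(message):
    with open(LOG_FILE, "a") as fh:
        fh.write(time.strftime("%F %T ") + message + "\n")

def ask_server():
    params = urllib.parse.urlencode({
        "hostname": os.uname().nodename,
        "uptime": open("/proc/uptime").read().split()[0],   # seconds since boot
        "kernel": os.uname().release,
    })
    with urllib.request.urlopen(SERVER_URL + "?" + params, timeout=30) as resp:
        return resp.read().decode().strip()     # e.g. "Yes: new kernel installed"

def main():
    if os.path.exists(TRIGGER_FILE):            # one-off manual override
        os.remove(TRIGGER_FILE)
        log("rebooting: trigger file present")
        subprocess.run(["/sbin/shutdown", "-r", "now"])
        return
    deadline = time.time() + 30 * 60            # retry a "Wait" for up to 30 minutes
    while True:
        answer = ask_server()
        log("server said: " + answer)
        if answer.startswith("Yes"):
            subprocess.run(["/sbin/shutdown", "-r", "now"])
            return
        if answer.startswith("Wait") and time.time() < deadline:
            time.sleep(60)
            continue
        return                                  # "No", or out of time

if __name__ == "__main__":
    main()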

OTOH

  • yum clean history - Timing of the arrival of a patch in a repo, timing since the server last reloaded repo state
    • "bad timing" on a Dev server could stop a whole sequence of reboots
  • Is this going to lead to non-deterministic behavior, and a lot of troubleshooting?