RFD 157 Notices to Operators #118

Open
davepacheco opened this issue Oct 24, 2018 · 6 comments

Comments

@davepacheco
Contributor

This issue represents an opportunity for discussion of RFD 157 Notices to Operators while it remains in a pre-published state.

@qdzlug
Contributor

qdzlug commented Oct 24, 2018

One thing we've talked about doing in the past with Manta/Triton (we being the support team) was considering some sort of "taint" flag for deployed images to note that they had been modified from the original image (i.e., they had a patch applied, they had their RAM increased, etc.). Not sure how much that would tie into this effort, but just wanted to raise it as part of this.

@chudley
Contributor

chudley commented Oct 25, 2018

> It's recommended that we call operators' attention to active NTOs by linking to a list of them in the message-of-the-day on all headnodes.

I like this idea, but I can't say for certain that, for example, every CM that we perform will start with logging into a headnode. Is the intention of this MOTD to be a discovery mechanism for NTOs?

The section above the quoted line in the RFD on reviewing NTOs seems appropriate on how the notices could be communicated/discovered.

I wondered about also first-classing the services and/or procedures affected by the NTO into their own fields. For example, NTO-123's affected services would be "postgres/manatee", and perhaps the affected procedure might be "database/sitter restarts". NTO-125 might have "upgrades" or "reprovisions" for its affected procedures.

I think that might make these a little too granular, but the situation I'm considering is where an incident manager or responder has now been tasked with reviewing NTOs based on a suggested/intended action to resolve an incident, and now has to read through potentially tens of detailed NTOs.

Quickly answering a question like "An operator wants to restart a postgres zone. Do we have any NTOs relating to postgres affecting this deployment?" seems valuable, but also possibly easy to get wrong if the full details of the NTOs are overlooked based on just a few fields. I think that situation is probably better served by keeping active NTOs to a minimum and only changing the format if we find that it's a problem, but I wanted to jot it down here in case there's any other thoughts on the subject.
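To make the trade-off concrete, here is a minimal sketch of what structured fields might look like. Everything in it is an assumption for illustration: the `Notice` record, its field names, and the `notices_for` helper are hypothetical and not part of RFD 157, and the tag values are only the examples mentioned above. Untagged notices are always returned, so that a missing tag cannot hide a notice from a reviewer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Notice:
    """Hypothetical structured NTO record; field names are illustrative."""
    nto_id: str
    affected_services: frozenset = field(default_factory=frozenset)
    affected_procedures: frozenset = field(default_factory=frozenset)


def notices_for(notices, service=None, procedure=None):
    """Return notices tagged with the given service and/or procedure.

    Notices that carry no tags at all are always included, so an
    untagged notice is never silently filtered out of a review.
    """
    matches = []
    for n in notices:
        untagged = not n.affected_services and not n.affected_procedures
        service_hit = service is not None and service in n.affected_services
        procedure_hit = procedure is not None and procedure in n.affected_procedures
        if untagged or service_hit or procedure_hit:
            matches.append(n)
    return matches


# Example tags taken from the discussion above; NTO-999 is made up.
NOTICES = [
    Notice("NTO-123",
           affected_services=frozenset({"postgres", "manatee"}),
           affected_procedures=frozenset({"database/sitter restarts"})),
    Notice("NTO-125",
           affected_procedures=frozenset({"upgrades", "reprovisions"})),
    Notice("NTO-999"),  # hypothetical untagged notice
]

postgres_related = [n.nto_id for n in notices_for(NOTICES, service="postgres")]
```

Even with a filter like this, the result is only a shortlist of notices to read in full; the tags cannot stand in for the notice text.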

@rmustacc
Contributor

rmustacc commented Oct 30, 2018 via email

@davepacheco
Contributor Author

Thanks everyone for the thoughtful feedback! Responding to all the comments so far.

@qdzlug:

> One thing we've talked about doing in the past with Manta/Triton (we being the support team) was considering some sort of "taint" flag for deployed images to note that they had been modified from the original image (i.e., they had a patch applied, they had their RAM increased, etc.). Not sure how much that would tie into this effort, but just wanted to raise it as part of this.

I think this is very important, and I think MANTA-3460 is our best suggestion for a solution so far. (There's not been much discussion about prioritizing that work though.)

@chudley:

> > It's recommended that we call operators' attention to active NTOs by linking to a list of them in the message-of-the-day on all headnodes.
>
> I like this idea, but I can't say for certain that, for example, every CM that we perform will start with logging into a headnode. Is the intention of this MOTD to be a discovery mechanism for NTOs?

It's more intended as a reminder than a guaranteed way to see them.

> I wondered about also first-classing the services and/or procedures affected by the NTO into their own fields. For example, NTO-123's affected services would be "postgres/manatee", and perhaps the affected procedure might be "database/sitter restarts". NTO-125 might have "upgrades" or "reprovisions" for its affected procedures.
>
> I think that might make these a little too granular, but the situation I'm considering is where an incident manager or responder has now been tasked with reviewing NTOs based on a suggested/intended action to resolve an incident, and now has to read through potentially tens of detailed NTOs.
>
> Quickly answering a question like "An operator wants to restart a postgres zone. Do we have any NTOs relating to postgres affecting this deployment?" seems valuable, but also possibly easy to get wrong if the full details of the NTOs are overlooked based on just a few fields. I think that situation is probably better served by keeping active NTOs to a minimum and only changing the format if we find that it's a problem, but I wanted to jot it down here in case there's any other thoughts on the subject.

Agreed about all of that. I do think this could be useful, but I'm wary about over-structuring this on the first pass. My hope is that there are few enough of these and they'll change rarely enough that it won't be too burdensome to re-skim them all during CM review. After all, today we rely on people actually remembering all of these conditions!

@rmustacc:

> On 10/25/18 7:22, Richard Bradley wrote:
>
> > > It's recommended that we call operators' attention to active NTOs by linking to a list of them in the message-of-the-day on all headnodes.
> >
> > I like this idea, but I can't say for certain that, for example, every CM that we perform will start with logging into a headnode. Is the intention of this MOTD to be a discovery mechanism for NTOs?
>
> I think the headnodes are an easy place to potentially miss things during an investigation. Also we'd want to think through how that would be persistent, as otherwise a reboot will replace the motd.

Agreed. Putting these in the MOTD is not intended as the primary way people find out about these or discover them, but rather a reminder for people who are likely to be interacting with the system's internals. If we did this, we'd have to figure an appropriate way to implement it.
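As a rough illustration of one piece of that, the sketch below shows an idempotent way to add a reminder to MOTD text, so that re-applying it (for example after a reboot has replaced the file) never duplicates the block. It is only a sketch under assumptions: the marker strings, the `with_nto_reminder` function, and the example URL are all hypothetical, and it says nothing about what mechanism would re-run it at boot on a headnode.

```python
# Hypothetical marker lines used to find a previously added reminder.
BEGIN = "# BEGIN NTO REMINDER"
END = "# END NTO REMINDER"


def with_nto_reminder(motd_text, list_url):
    """Return motd_text with exactly one NTO reminder block at the end.

    Any existing reminder block (between BEGIN and END) is dropped
    first, so applying this function repeatedly gives the same result.
    """
    kept = []
    skipping = False
    for line in motd_text.splitlines():
        if line == BEGIN:
            skipping = True
            continue
        if line == END:
            skipping = False
            continue
        if not skipping:
            kept.append(line)
    block = [BEGIN, "Active Notices to Operators: " + list_url, END]
    return "\n".join(kept + block) + "\n"


# Placeholder URL for illustration only.
once = with_nto_reminder("Welcome to the headnode\n", "https://example.com/ntos")
twice = with_nto_reminder(once, "https://example.com/ntos")
```

Because the function works on text rather than on a file path, it can be exercised without touching a real system; where the MOTD actually lives and how it is regenerated is left open.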

> That said, I agree that this should be a first class thing. I'd actually like them to be visible and filterable in adminui or other CLI tools if possible. Though I agree at a first pass, we shouldn't worry about that.

Agreed.

> Based on the examples, there are at least two important subdivisions:
>
>   1. NTOs that are specific to an individual state. These are ones that are the side effect of choices we've made.
>
>   2. NTOs that are inherent to all operators of Manta. For example, NTO-125 is a case that all deployments of Manta need to worry about, not just ones that we're operating. Arguably, a variant of NTO-123 should be sent out to everyone.
>
> It seems like if we do first-class it, being able to send out NTOs, like a variant of an alert that comes with a system update or some other thing, might be useful. But again, that might be too forward thinking.

This is a good point. At this point, NTOs are essentially a Joyent process to help operate our Manta deployments, not a feature of the system itself. If this proves useful, we may want to bake them (or something like them) into the software.

@davepacheco
Contributor Author

I've drafted a space for this internally at https://wiki.joyent.us/display/MP/Notices+to+Operators.

@feetwins

Sorry to be so late to the game on this. I've been meaning to follow up! Thank you again for initiating this. To further rm's point on NTOs that could affect all Manta deployments, we will want some way to tie in Support so that a public NTO can be posted off help.joyent.com for Manta on-prem operators. I'll think more on how we can bridge that together, but just wanted to mention it here now so I don't forget.

Additionally, I probably need to take on a more active role with helping to develop some of these, particularly for anything that surfaces due to an unexpected event during a change that we believe we could see again. One immediate example that comes to mind is the issue we encountered in ap-southeast during nameservice updates where some binder processes required a restart due to stale cached info (this is believed to be a bug). Right now, a lot of this is getting documented in the CM tickets themselves (or not at all). I've made an attempt to start tracking any caveats or known issues in the QA/Release management wiki (see https://wiki.joyent.us/display/QA/JPC+Release+Deployments for an example), but having official NTOs will be far more useful than that.

I also wonder if we should track these in the same place where we are maintaining change plan templates, since the change plan itself will have to be altered to include any applicable NTOs as a part of execution. Alternatively, maybe an 'NTO field' can be added to the CM JIRA project where we specify applicable NTOs (if any) and then just link to that (probably easier/more practical?).

For now, I think it will be useful to link to https://wiki.joyent.us/display/MP/Notices+to+Operators off the main Change Management JIRA dashboards for both SPC and JPC (at a minimum, to serve as a reminder to review those). Again, just a braindump of thoughts around this. I'm happy to help contribute and help manage these.
