
Proposal for Service Level Objectives in witchcraft #36

Open
wants to merge 4 commits into base: develop

Conversation

@gdearment (Contributor) commented Feb 7, 2019

The goal of this PR is to introduce an easy way of adding service level objectives (SLOs) to witchcraft, and have the SLO status reflected via a health check. I'm looking for feedback on the implementation
of the SLO package and thoughts on the right way of integrating them with the route registration process.



@gdearment (Contributor, Author)

One semi-related thing: I found it awkward to have to add the health check source in the server builder rather than being able to add it during the initialization function. Is there a better way of doing this today?

@nmiyake (Contributor) commented Feb 11, 2019

You can add health/readiness/liveness sources either in the builder OR in the init function -- the Router field of the witchcraft.InitInfo struct passed to the initialization function implements ConfigurableRouter, so you can make the following call in the WithInitFunc function:

info.Router.WithHealth(slos)

This lets you register health, readiness and liveness sources in the init function as opposed to at construction time if you prefer to do so.
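
For concreteness, a rough sketch of what that might look like (this is a fragment, not a verified example: the exact init-function signature and builder chain are written from memory and may differ slightly from the current witchcraft-go-server API, and `slos` is assumed to already satisfy the health check source interface that WithHealth expects):

server := witchcraft.NewServer().
	WithInitFunc(func(ctx context.Context, info witchcraft.InitInfo) (func(), error) {
		// Register the SLO-backed health check source at init time
		// instead of on the server builder.
		info.Router.WithHealth(slos)
		// ... register routes on info.Router here as well ...
		return nil, nil
	})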

@nmiyake (Contributor) commented Feb 11, 2019

Taking a look at the broader PR now -- will focus on high-level feedback rather than exact code content (can review content once we ensure we have agreement on the top-level approach).

@nmiyake (Contributor) left a comment

Took a look and I understand the intent -- definitely on-board with it generally.

I don't have a concrete suggestion for how to do this yet, but having to duplicate the endpoint template (or name) in registering the endpoint with an SLO is a bit of a bummer, as it leads to a lot of the same constants recurring. Here's example registration code when using wresource (which is pretty common):

res := wresource.New("installEndpoints", router)
return res.Get("installNum", "/installNum", slos.Register("installNum", func(rw http.ResponseWriter, req *http.Request) {
	rest.WriteJSONResponse(rw, num, http.StatusOK)
}, time.Millisecond*100, time.Millisecond*200, time.Millisecond*300))

In this code, "installNum" is the endpointName and SLO name, and then we also have /installNum as the path. This makes the process of registering an endpoint pretty verbose, although I don't have a great suggestion for an alternative.


As a bigger-picture question, I wonder if this is the best approach to this -- this approach requires all products to update their code and provide concrete time-based estimates for endpoints on a per-endpoint basis ahead of time. Since we have request logs for endpoints (which use the endpoint template), it seems like this could be done for products as log analysis rather than as a runtime check. Not sure if that's realistic, but I think it's worth at least thinking through.


@gdearment (Contributor, Author)

The problem with the request-log-based approach is that nobody actively does it. While request logs have been around forever, and devs have also had endpoint-specific duration metrics (which you could create p95, p99, and p999 alerts on), it requires someone to go and actively create those monitors. The value of doing this in the health endpoint is that it gets auto-wired into our existing monitoring framework.

Thoughts on a path forward? I can move this to an external repo if you think it would be better there until we find a cleaner way of integrating it. It also probably doesn't hurt to have it in witchcraft-go-server and let people opt in to it.

@nmiyake (Contributor) commented Feb 18, 2019

OK yeah I'm not opposed to exposing this, and don't have major concerns since this doesn't introduce any new dependencies.

The one thing that I still find a bit strange is that this approach requires products to declare their expected runtime performance thresholds as part of the compiled output. Are we going to provide any guidance to developers about how they should properly choose these values for given endpoints, or do we anticipate this as an after-the-fact thing where we monitor the values in production and then have products update their thresholds based on real-world data? I'm aligned on the goals here, but want to make sure we have a story on how we expect developers to actually make use of this in practice.

Once we determine that we're in agreement that we should implement this (and I think I'm close -- would just like some clarification/thoughts on the above), then I'll move to review the contents of the PR and we can merge on approval.

@gdearment (Contributor, Author)

I think there are two ways in which I'd expect a dev to approach using SLOs:

  1. They have an endpoint whose performance they have some initial expectation of. This could come from knowing the IO pattern or the purpose of the function and knowing it should never take longer than X (e.g., all it does is take a payload and throw it into a channel that should be non-blocking), or the expected calling pattern could be such that it needs to be less than X, so they can define the SLO upfront as a way of validating their assumptions (e.g., the endpoint will be on the hot path of every auth-n or auth-z request coming from kubernetes, and therefore needs to be less than 100ms).
  2. They don't understand what is expected and need data before being able to define an SLO. SLOs then become a way of protecting against performance regressions. The dev can use the existing metrics output from witchcraft-go-server to understand the p95/p99 of an endpoint (via server.response.[p95|p99]) and then set thresholds based on that data; see the sketch below.

I also suspect there will be a set of endpoints where it doesn't make sense to define an SLO, as the duration is highly coupled to the nature of the request (anything that has a highly variable request or response body).
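
To make the second flow concrete, here is a rough sketch of how a dev might pick thresholds from observed metrics and register them. It reuses the registration shape from the snippet above; the mapping of the three durations passed to slos.Register onto p95/p99/p999 thresholds is an assumption, and handleInstallNum plus the concrete values are hypothetical:

// Observed in production via server.response.p95 / server.response.p99:
//   p95 ≈ 40ms, p99 ≈ 80ms
// Pick thresholds with some headroom so the health check flags
// regressions rather than normal variance.
res := wresource.New("installEndpoints", router)
return res.Get("installNum", "/installNum", slos.Register("installNum", handleInstallNum,
	50*time.Millisecond,  // assumed p95 threshold
	100*time.Millisecond, // assumed p99 threshold
	150*time.Millisecond, // assumed p999 threshold
))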

@MatthewMaclean

Is this going to be limited to endpoints? In Java you can add it to arbitrary methods, which is very useful for being able to monitor and prevent regressions on some functions. It will definitely not encapsulate dependencies affecting your processing very well (which is a problem generic to SLOs).
