
Proposal for Service Level Objectives in witchcraft #36

Open
wants to merge 4 commits into base: develop

Conversation

@gdearment (Contributor) commented Feb 7, 2019

The goal of this PR is to introduce an easy way of adding service level objectives (SLOs) to witchcraft, and have the SLO status reflected via a health check. I'm looking for feedback on the implementation
of the SLO package and thoughts on the right way of integrating them with the route registration process.



@gdearment (Contributor, Author)

One semi-related thing: I found it awkward to have to add the health check source in the server builder rather than being able to add it during the initialization function. Is there a better way of doing this today?

@nmiyake (Contributor) commented Feb 11, 2019

You can add health/readiness/liveness sources either in the builder OR in the init function -- the Router field of the witchcraft.InitInfo struct passed to the initialization function implements ConfigurableRouter, so you can make the following call in the WithInitFunc function:

info.Router.WithHealth(slos)

This lets you register health, readiness and liveness sources in the init function as opposed to at construction time if you prefer to do so.
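
For concreteness, a rough sketch of what that might look like (this is a fragment, not a verified example: the exact init-function signature and builder chain are written from memory and may differ slightly from the current witchcraft-go-server API, and `slos` is assumed to already satisfy the health check source interface that WithHealth expects):

server := witchcraft.NewServer().
	WithInitFunc(func(ctx context.Context, info witchcraft.InitInfo) (func(), error) {
		// Register the SLO-backed health check source at init time
		// instead of on the server builder.
		info.Router.WithHealth(slos)
		// ... register routes on info.Router here as well ...
		return nil, nil
	})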

@nmiyake (Contributor) commented Feb 11, 2019

Taking a look at the broader PR now -- will focus on high-level feedback rather than exact code content (can review content once we ensure we have agreement on the top-level approach).

@nmiyake (Contributor) left a comment

Took a look and I understand the intent -- definitely on-board with it generally.

I don't have a concrete suggestion for how to do this yet, but having to duplicate the endpoint template (or name) in registering the endpoint with an SLO is a bit of a bummer, as it leads to a lot of the same constants recurring. Here's example registration code when using wresource (which is pretty common):

res := wresource.New("installEndpoints", router)
return res.Get("installNum", "/installNum", slos.Register("installNum", func(rw http.ResponseWriter, req *http.Request) {
	rest.WriteJSONResponse(rw, num, http.StatusOK)
}, time.Millisecond*100, time.Millisecond*200, time.Millisecond*300))

In this code, "installNum" is the endpointName and SLO name, and then we also have /installNum as the path. This makes the process of registering an endpoint pretty verbose, although I don't have a great suggestion for an alternative.


As a bigger-picture question, I wonder if this is the best approach to this -- this approach requires all products to update their code and provide concrete time-based estimates for endpoints on a per-endpoint basis ahead of time. Since we have request logs for endpoints (which use the endpoint template), it seems like this could be done for products as log analysis rather than as a runtime check. Not sure if that's realistic, but I think it's worth at least thinking through.


@gdearment (Contributor, Author)

The problem with the request-log-based approach is that nobody actively does it. While request logs have been around forever, and devs have also had endpoint-specific duration metrics (which you could create p95, p99, and p999 alerts on), it requires someone to go and actively create those monitors. The value of doing this in the health endpoint is that it gets auto-wired into our existing monitoring framework.

Thoughts on a path forward? I can move this to an external repo if you think it would be better there until we find a cleaner way of integrating it. It also probably doesn't hurt to have it in witchcraft-go-server and let people opt in to it.

@nmiyake (Contributor) commented Feb 18, 2019

OK yeah I'm not opposed to exposing this, and don't have major concerns since this doesn't introduce any new dependencies.

The one thing that I still find a bit strange is that this approach requires products to declare their expected runtime performance thresholds as part of the compiled output. Are we going to provide any guidance to developers about how they should properly choose these values for given endpoints, or do we anticipate this as an after-the-fact thing where we monitor the values in production and then have products update their thresholds based on real-world data? I'm aligned on the goals here, but want to make sure we have a story on how we expect developers to actually make use of this in practice.

Once we determine that we're in agreement that we should implement this (and I think I'm close -- would just like some clarification/thoughts on the above), then I'll move to review the contents of the PR and we can merge on approval.

@gdearment (Contributor, Author)

I think there are two ways in which I'd expect a dev to approach using SLOs:

  1. They have an endpoint whose performance they have some initial expectation of. This could come from knowing the IO pattern or the purpose of the function and knowing it should never take longer than X (e.g., all it does is take a payload and throw it into a channel that should be non-blocking), or the expected calling pattern could be such that it needs to be less than X, so they can define the SLO upfront as a way of validating their assumptions (e.g., the endpoint will be on the hot path of every auth-n or auth-z request coming from kubernetes, and therefore needs to be less than 100ms).
  2. They don't understand what is expected and need data before being able to define an SLO. SLOs then become a way of protecting against performance regressions. The dev can use the existing metrics output from witchcraft-go-server to understand the p95/p99 of an endpoint (via server.response.[p95|p99]) and then set thresholds based on that data; see the sketch below.

I also suspect there will be a set of endpoints where it doesn't make sense to define an SLO, as the duration is highly coupled to the nature of the request (anything that has a highly variable request or response body).
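
To make the second flow concrete, here is a rough sketch of how a dev might pick thresholds from observed metrics and register them. It reuses the registration shape from the snippet above; the mapping of the three durations passed to slos.Register onto p95/p99/p999 thresholds is an assumption, and handleInstallNum plus the concrete values are hypothetical:

// Observed in production via server.response.p95 / server.response.p99:
//   p95 ≈ 40ms, p99 ≈ 80ms
// Pick thresholds with some headroom so the health check flags
// regressions rather than normal variance.
res := wresource.New("installEndpoints", router)
return res.Get("installNum", "/installNum", slos.Register("installNum", handleInstallNum,
	50*time.Millisecond,  // assumed p95 threshold
	100*time.Millisecond, // assumed p99 threshold
	150*time.Millisecond, // assumed p999 threshold
))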

@MatthewMaclean

Is this going to be limited to endpoints? In Java you can add it to arbitrary methods, which is very useful for being able to monitor and prevent regressions on some functions. It will definitely not encapsulate dependencies affecting your processing very well (which is a problem generic to SLOs).
