Define and implement scheduling latency SLO #1500

wojtek-t · 2020-09-29T05:48:06Z

@ahg-g - if you can provide more details based on our internal work

vamossagar12 · 2020-10-02T12:14:46Z

@wojtek-t is this something that I can pick up?

ahg-g · 2020-10-02T12:25:52Z

I will provide some details early next week.

vamossagar12 · 2020-10-16T11:25:52Z

Hi, would it be possible to provide the details?

ahg-g · 2020-10-19T13:24:46Z

@wojtek-t I am wondering if we should expose a new metric, similar to the ones proposed in pod resource metrics, that reports pod-level latency instead of relying on the aggregated histogram metrics we currently have. Such a metric should make it a lot easier to implement various eligibility criteria. Let me raise that on the KEP.

wojtek-t · 2020-10-19T15:29:29Z

Hmm - can we afford a metric stream per pod? We can have 150k pods in the cluster...

vamossagar12 · 2020-10-31T06:14:11Z

@ahg-g, just wanted to know: will you be creating a KEP for this, or is it still under discussion?

ahg-g · 2020-11-02T03:27:10Z

Eligibility Criteria

The scheduling latency depends on multiple external dependencies that are not under the scheduler’s control, and this includes:

Customer’s workload characteristics
Storage provisioning
Cluster autoscaler adding new capacity

To eliminate those dependencies, we define the following eligibility criteria:

Pods that are successfully scheduled on the first attempt. This means pods scheduled as a result of preemption are not eligible. This guarantees that those pods don’t block on constraints becoming met (like affinities) or on VMs becoming available, both of which the scheduler can’t control.
The number of pending pods (unschedulable ones and ones yet to be attempted) is under a defined limit.
Pod creation rate is under a defined limit, to guard against low-priority pods getting starved by a constant flow of high-priority pods.

Implementation

The scheduler reports cumulative histogram metrics. The implementation will rely on three metrics:

pod_scheduling_duration_seconds: a histogram of pods' e2e scheduling latencies, labeled by the number of scheduling attempts (the label already exists) and by whether or not the pod triggered volume provisioning (label to be added).
pending_pods: a gauge metric that tracks the number of pending pods in the scheduler queue.
queue_incoming_pods_total: a counter metric that tracks the number of newly added pods in the queue.
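
Because the histogram is cumulative, the latency distribution for a single time window has to be derived from the difference between two snapshots. A minimal sketch of that arithmetic follows; the function names, bucket bounds, and sample counts are hypothetical, and the interpolation only approximates what a Prometheus-style quantile estimate would give.

```python
from typing import List, Optional, Sequence


def window_bucket_counts(start: Sequence[int], end: Sequence[int]) -> List[int]:
    """Cumulative bucket counts for one window: end snapshot minus start snapshot."""
    if len(start) != len(end):
        raise ValueError("snapshots must cover the same buckets")
    return [e - s for s, e in zip(start, end)]


def estimate_quantile(
    q: float, upper_bounds: Sequence[float], cumulative_counts: Sequence[int]
) -> Optional[float]:
    """Estimate quantile q (0 < q <= 1) by linear interpolation inside a bucket.

    upper_bounds are the bucket upper limits in seconds, ending with +inf.
    cumulative_counts[i] is the number of samples <= upper_bounds[i].
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative_counts[-1]
    if total == 0:
        return None  # no pods were scheduled in this window
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(upper_bounds, cumulative_counts):
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate into the open-ended bucket;
                # fall back to the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical snapshots taken at the start and end of one window.
bounds = [0.1, 0.5, 1.0, float("inf")]
at_start = [10, 20, 25, 25]
at_end = [50, 100, 120, 125]

delta = window_bucket_counts(at_start, at_end)
p50 = estimate_quantile(0.5, bounds, delta)
p99 = estimate_quantile(0.99, bounds, delta)
```

Working on per-window deltas rather than the raw cumulative counts is what allows individual windows to be included or excluded later.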

The first two eligibility criteria are simple to enforce: pod_scheduling_duration_seconds{attempts=0}.

To enforce the last two criteria, we take the following approach:

Define a continuously recurring time window of duration W (e.g., 60 seconds). This is basically the windowing period that we will use in the metrics query.
A window Wi is considered “positive” if the following is true: pending_pods is below M and the delta of queue_incoming_pods_total over W is below N.
pod_scheduling_duration_seconds samples in Wi are considered eligible if windows Wi-1 and Wi are both “positive”.
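
The window rule above can be sketched as follows. This is an illustration only: the `Window` type, the function names, and the choice to treat the very first window (which has no predecessor) as ineligible are assumptions, not something the thread specifies.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Window:
    pending_pods: int         # pending_pods observed for the window
    incoming_pods_delta: int  # increase of queue_incoming_pods_total over W


def is_positive(w: Window, max_pending: int, max_incoming: int) -> bool:
    """A window is positive if pending_pods < M and the incoming delta < N."""
    return w.pending_pods < max_pending and w.incoming_pods_delta < max_incoming


def eligible_windows(
    windows: Sequence[Window], max_pending: int, max_incoming: int
) -> List[bool]:
    """Samples in window i are eligible if windows i-1 and i are both positive."""
    positive = [is_positive(w, max_pending, max_incoming) for w in windows]
    # Assumption: window 0 has no predecessor, so it is never eligible here.
    return [i > 0 and positive[i - 1] and positive[i] for i in range(len(windows))]


# Hypothetical data with M = 100 and N = 100.
windows = [
    Window(pending_pods=10, incoming_pods_delta=50),
    Window(pending_pods=20, incoming_pods_delta=60),
    Window(pending_pods=500, incoming_pods_delta=60),  # too many pending pods
    Window(pending_pods=5, incoming_pods_delta=10),
    Window(pending_pods=5, incoming_pods_delta=10),
]
eligibility = eligible_windows(windows, max_pending=100, max_incoming=100)
```

Requiring the previous window to be positive as well means that pods scheduled just after an overload period, which may still be affected by its backlog, are not counted.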

vamossagar12 · 2020-11-02T11:09:40Z

Thanks @ahg-g. I was going through the description that you provided and need a couple of clarifications:
You mentioned making use of 3 metrics that the scheduler reports. All of them are defined here: https://github.com/kubernetes/kubernetes/blob/44cd4fcedccbf35c2b674f9d53faa6fc3230b8fa/pkg/scheduler/metrics/metrics.go
Are these metrics reported by the scheduler and stored in Prometheus?
Also, the other SLOs that I have seen make queries to Prometheus servers to get these metrics. Do you envisage doing the same thing for the scheduler latency measurement?

ahg-g · 2020-11-02T15:10:45Z

These metrics are exported by the scheduler. I am not sure how and where clusterloader scrapes them, though.

vamossagar12 · 2020-11-03T06:34:03Z

I see that it is being invoked already for a measurement here:

https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/metrics_for_e2e.go#L65-L72

It's pulling even the scheduler metrics here. I believe we should be able to use this to implement the logic you described above. WDYT @wojtek-t ?

wojtek-t · 2020-11-04T19:42:29Z

We don't have access to scheduler metrics in every environment. But I'm fine with assuming we do, at least initially, so that this is enforced in our OSS tests.

BTW - the eligibility criteria here are something that we've never fully figured out for the pod startup SLO. We should do the same for that SLO for consistency, as this is effectively exactly what we want there.

@mm4tt - FYI

vamossagar12 · 2020-11-14T09:16:35Z

Thanks @wojtek-t. I started looking at this and am slightly confused as to which method to use for scraping this data.
I see 2 different approaches for this:

1. metrics_for_e2e invokes the metricsGrabber interface, which calls APIs to get the data. So one approach could be to hit this API at a configured interval, read the values of the metrics we care about, and apply the eligibility logic described above across windows to measure the performance. I also see a variant of this for pod_startup_latency, which registers an informer, uses the events, and calculates the transition latencies in the gather phase.
2. The second approach is creating a PrometheusMeasurement and writing Prometheus queries to fetch the metrics, similar to the ones being used in api_responsiveness.

wojtek-t · 2020-11-16T07:02:00Z

We should go with (2).
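
For a sense of what the queries in approach (2) could look like, here is a rough sketch that only builds PromQL strings. It assumes the metrics are exposed with a `scheduler_` prefix, and the helper name and parameters are hypothetical. The thread writes the first-attempt selector as attempts=0; which label value actually marks a first-attempt success should be checked against the scheduler version in use, so it is left as a parameter.

```python
from typing import Dict


def build_queries(attempts: str, window: str = "60s", quantile: float = 0.99) -> Dict[str, str]:
    """Build PromQL strings for the three metrics discussed in this thread.

    Assumption: metric names carry the "scheduler_" prefix.
    """
    selector = '{attempts="' + attempts + '"}'
    return {
        # Latency quantile over the window for pods matching the attempts label.
        "latency": (
            f"histogram_quantile({quantile}, sum by (le) ("
            f"rate(scheduler_pod_scheduling_duration_seconds_bucket{selector}[{window}])))"
        ),
        # Current number of pending pods, summed over all label values.
        "pending": "sum(scheduler_pending_pods)",
        # Number of pods added to the queue during the window.
        "incoming": f"sum(increase(scheduler_queue_incoming_pods_total[{window}]))",
    }


# Using the selector value as written in the thread.
queries = build_queries(attempts="0")
```

The "pending" and "incoming" results would feed the per-window positive check, and the "latency" result would only be counted for windows that pass it.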

fejta-bot · 2021-02-14T07:24:06Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

wojtek-t · 2021-02-15T07:16:31Z

/remove-lifecycle stale
/lifecycle frozen

wojtek-t added area/clusterloader area/slo labels Sep 29, 2020

k8s-ci-robot added the lifecycle/stale label Feb 14, 2021

k8s-ci-robot added lifecycle/frozen and removed lifecycle/stale labels Feb 15, 2021
