Define and implement scheduling latency SLO #1500
@wojtek-t is this something that I can pick up?
I will provide some details early next week.
Hi, would it be possible to provide the details?
@wojtek-t I am wondering if we should expose a new metric, similar to the ones proposed in pod resource metrics, that reports pod-level latency instead of relying on the aggregated histogram metrics we currently have. Such a metric should make it a lot easier to implement various eligibility criteria. Let me raise that on the KEP.
Hmm - can we afford a metric stream per pod? We can have 150k pods in the cluster...
@ahg-g, just wanted to know: will you be creating a KEP for this, or is it still under discussion?
Eligibility Criteria
The scheduling latency depends on multiple external dependencies that are not under the scheduler’s control, and this includes:
To eliminate those dependencies, we define the following eligibility criteria:
Implementation
The scheduler reports cumulative histogram metrics. The implementation will rely on three metrics:
The first two eligibility criteria are simple to enforce: pod_scheduling_duration_seconds{attempts=0}. To enforce the last two criteria, we take the following approach:
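Because the scheduler's histograms are cumulative, a latency percentile for a single test window can be estimated by subtracting the snapshot taken before the window from the one taken after it, then interpolating inside the bucket that contains the target rank. The sketch below illustrates only that generic technique and is not necessarily the approach proposed in this thread; the function names, bucket bounds, and counts are all hypothetical.

```python
import math


def histogram_delta(before, after):
    """Return per-bucket growth between two cumulative snapshots.

    Both arguments map a bucket upper bound (float, math.inf for the
    last bucket) to the cumulative count of observations <= that bound.
    """
    return {le: count - before.get(le, 0) for le, count in after.items()}


def estimate_quantile(buckets, q):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    Uses linear interpolation inside the bucket holding the target rank.
    If the rank falls in the +Inf bucket, the highest finite bound is
    returned, since nothing more precise can be said.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets:
        return None
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total <= 0:
        return None  # no observations in this window
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if math.isinf(bound):
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical snapshots taken before and after a test phase.
before = {0.1: 10, 1.0: 20, math.inf: 20}
after = {0.1: 60, 1.0: 110, math.inf: 120}
window = histogram_delta(before, after)
p75 = estimate_quantile(window, 0.75)
```

The estimate is only as precise as the bucket boundaries, so an SLO threshold is easiest to check when it coincides with a bucket's upper bound.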
Thanks @ahg-g. I was going through the description that you provided and needed a couple of clarifications:
These metrics are exported by the scheduler; I am not sure how and where clusterloader scrapes them, though.
I see that it is already being invoked for a measurement here: It's pulling even the scheduler metrics here. I believe we should be able to use this to implement the logic you described above. WDYT @wojtek-t?
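For illustration, once the scheduler's metrics have been scraped in the Prometheus text exposition format, the cumulative buckets for one label selection can be extracted with a small parser. This is a minimal sketch, not how clusterloader does it: the `parse_buckets` helper is hypothetical, the sample values are made up, and the metric name is written as it appears in this thread, which may differ from the name the scheduler actually exports.

```python
import re

# One sample line: name, optional {labels}, then the value.
_SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair inside the braces: key="value".
_LABEL_PAIR = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_buckets(text, metric, match_labels=None):
    """Collect cumulative bucket counts for `metric` from scraped text.

    Only `<metric>_bucket` samples whose labels include every pair in
    `match_labels` are kept. Series that differ in other labels are
    summed per bucket. Returns {upper_bound: cumulative_count}.
    """
    match_labels = match_labels or {}
    buckets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE_LINE.match(line)
        if not m or m.group("name") != metric + "_bucket":
            continue
        labels = dict(_LABEL_PAIR.findall(m.group("labels") or ""))
        if any(labels.get(k) != v for k, v in match_labels.items()):
            continue
        le = float(labels["le"])  # float("+Inf") parses to infinity
        buckets[le] = buckets.get(le, 0.0) + float(m.group("value"))
    return buckets


# Made-up scrape output, for illustration only.
SAMPLE = """
# TYPE pod_scheduling_duration_seconds histogram
pod_scheduling_duration_seconds_bucket{attempts="0",le="0.1"} 40
pod_scheduling_duration_seconds_bucket{attempts="0",le="1"} 55
pod_scheduling_duration_seconds_bucket{attempts="0",le="+Inf"} 60
pod_scheduling_duration_seconds_bucket{attempts="1",le="0.1"} 1
pod_scheduling_duration_seconds_bucket{attempts="1",le="1"} 3
pod_scheduling_duration_seconds_bucket{attempts="1",le="+Inf"} 3
pod_scheduling_duration_seconds_count{attempts="0"} 60
"""

first_attempt = parse_buckets(
    SAMPLE, "pod_scheduling_duration_seconds", {"attempts": "0"}
)
```

Filtering on a label such as `attempts` at parse time keeps the selection logic in one place, so the same bucket dictionary can be passed on to whatever percentile check the measurement uses.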
We don't have access to scheduler metrics in every environment. But I'm fine with assuming we have access, at least initially, so that this is enforced in our OSS tests. BTW - the eligibility criteria here are something that we've never fully figured out for pod startup SLO. We should do the same for that SLO for consistency, as this effectively is exactly what we want there. @mm4tt - FYI
Thanks @wojtek-t. I started looking at this and am slightly confused as to which method to use for scraping this data.
We should go with (2).
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
@ahg-g - if you can provide more details based on our internal work