Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YUNIKORN-2976] Handle multiple require node allocations per node #1001

Closed
wants to merge 2 commits into from

Conversation

wilfred-s
Copy link
Contributor

@wilfred-s wilfred-s commented Dec 3, 2024

What is this PR for?

If an allocation requires a specific node the scheduler should not consider any other node. We should allow multiple allocations that require the same node to reserve the node at the same time. A required node allocation must be placed on the node before anything else. If other non required node reservations are made on a node remove the existing reservations that do not require that node. Make sure that the releases are tracked correctly in the partition.

After the repeat count removal reservations can be simplified:

  • track reservations using the allocation key
  • removed the composite key setup
  • removed collection listener call on reserve or unreserve of a node

Cleanup testing of the node collection

What type of PR is it?

  • - Bug Fix
  • - Improvement

Todos

  • e2e test in the shim requesting multiple daemon sets at once

What is the Jira issue?

YUNIKORN-2976

How should this be tested?

unit tests are updated new e2e test should be added

If an allocation requires a specific node the scheduler should not
consider any other node. We should allow multiple allocations that
require the same node to reserve the node at the same time. A required
node allocation must be placed on the node before anything else.
If other non required node reservations are made on a node remove the
existing reservations that do not require that node. Make sure that the
releases are tracked correctly in the partition.

After the repeat count removal reservations can be simplified:
- track reservations using the allocation key
- removed the composite key setup
- removed collection listener call on reserve or unreserve of a node

Cleanup testing of the node collection
Copy link

codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 79.54545% with 54 lines in your changes missing coverage. Please review.

Project coverage is 81.99%. Comparing base (0356a3a) to head (35c3cad).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
pkg/scheduler/objects/application.go 75.69% 40 Missing and 4 partials ⚠️
pkg/scheduler/partition.go 47.36% 9 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1001      +/-   ##
==========================================
+ Coverage   81.56%   81.99%   +0.43%     
==========================================
  Files          97       97              
  Lines       15646    15621      -25     
==========================================
+ Hits        12761    12809      +48     
+ Misses       2601     2535      -66     
+ Partials      284      277       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@pbacsko pbacsko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, nice change :)

First, I just have a general observation: there seems to be too many uncovered paths in application.go. I'm sure some of those already existed before, but this change touches critical part of the allocation cycle, so I'd be happy to see those codecov reports go away, like:

#L883 - L887
#L944 - L947
#L1106 - L1120
#L1143
#L1241 (this looks weird, because the flow continues)
#L1317 (looks weird too)
#L1362

Perhaps they're covered indirectly from partition, but it's hard to reason about it.

Copy link
Contributor

@pbacsko pbacsko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments after round #1.

pkg/scheduler/objects/reservation.go Show resolved Hide resolved
pkg/scheduler/objects/reservation.go Show resolved Hide resolved
pkg/scheduler/objects/application.go Outdated Show resolved Hide resolved
@craigcondit
Copy link
Contributor

craigcondit commented Dec 3, 2024

I've installed this patch on a cluster where I've spun up a kwok-based autoscaler. I encountered a scenario where a monitoring pod (daemonset member) wouldn't schedule on a full node by triggering preemption. The logs ping-pong back and forth between these two messages now:

2024-12-03T23:37:20.251Z    INFO    core.scheduler.queue    objects/queue.go:1438    allocation found on queue    {"queueName": "root.monitoring", "appID": "monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74", "resultType": "Reserved", "allocation": "allocationKey ac69ed37-d4d7-4c0f-92bd-8f89a16cec74, applicationID monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74, Resource map[pods:1], Allocated false"}
2024-12-03T23:37:20.251Z    INFO    core.scheduler.partition    scheduler/partition.go:957    Application is already reserved on node    {"appID": "monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74", "nodeID": "kind-worker-6fhtj"}

Events for the pod:

Normal  Scheduling     4m13s  yunikorn  monitoring/kube-prometheus-stack-prometheus-node-exporter-5b6p7 is queued and waiting for allocation
Normal  Informational  4m12s  yunikorn  Unschedulable request 'ac69ed37-d4d7-4c0f-92bd-8f89a16cec74' with required node 'kind-worker-6fhtj', no preemption victim found
Normal  Scheduling     4m5s   yunikorn  monitoring/kube-prometheus-stack-prometheus-node-exporter-5b6p7 is queued and waiting for allocation

I'm using admissionController.filtering.generateUniqueAppId: true to use per-pod App ID generation, if that makes a difference. Restarting YK doesn't seem to help.

Update: Manually killing a pod to make room doesn't make the logs go away, but does schedule the pod. This would seem to indicate that the reservations aren't being cleaned up properly.

@wilfred-s
Copy link
Contributor Author

Sorry for the late response:

The logs ping-pong back and forth between these two messages now:

2024-12-03T23:37:20.251Z    INFO    core.scheduler.queue    objects/queue.go:1438    allocation found on queue    {"queueName": "root.monitoring", "appID": "monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74", "resultType": "Reserved", "allocation": "allocationKey ac69ed37-d4d7-4c0f-92bd-8f89a16cec74, applicationID monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74, Resource map[pods:1], Allocated false"}
2024-12-03T23:37:20.251Z    INFO    core.scheduler.partition    scheduler/partition.go:957    Application is already reserved on node    {"appID": "monitoring-ac69ed37-d4d7-4c0f-92bd-8f89a16cec74", "nodeID": "kind-worker-6fhtj"}

One function was not updated to allow multiple daemon set pods to reserve multiple nodes. A test case was missing for that unit test also.

Added more unit tests
remove function and calls to getAllocationReservation
remove function and calls to IsAllocationReserved
rewrote function IsReservedOnNode to NodeReservedForAsk and make the
caller check the nodeID

fix the logic around a previously reserved allocation in tryRequiredNode
Put a safeguard into the partition to handle the broken case mentioned
in the review
@wilfred-s wilfred-s requested a review from pbacsko December 12, 2024 10:24
@wilfred-s
Copy link
Contributor Author

filed two follow up jiras to add unit tests to tryNodesNoReserve and tryPlaceholderAllocate on the application

Copy link
Contributor

@craigcondit craigcondit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM. I've verified the updated patch on a cluster with a kwok autoscaler, and have been unable to reproduce the original issues. This looks good.

craigcondit pushed a commit that referenced this pull request Dec 13, 2024
)

If an allocation requires a specific node the scheduler should not
consider any other node. We should allow multiple allocations that
require the same node to reserve the node at the same time. A required
node allocation must be placed on the node before anything else.
If other non required node reservations are made on a node remove the
existing reservations that do not require that node. Make sure that the
releases are tracked correctly in the partition.

After the repeat count removal reservations can be simplified:
- track reservations using the allocation key
- removed the composite key setup
- removed collection listener call on reserve or unreserve of a node

Closes: #1001

Signed-off-by: Craig Condit <[email protected]>
rhh777 pushed a commit to rhh777/yunikorn-core that referenced this pull request Dec 25, 2024
…ache#1001)

If an allocation requires a specific node the scheduler should not
consider any other node. We should allow multiple allocations that
require the same node to reserve the node at the same time. A required
node allocation must be placed on the node before anything else.
If other non required node reservations are made on a node remove the
existing reservations that do not require that node. Make sure that the
releases are tracked correctly in the partition.

After the repeat count removal reservations can be simplified:
- track reservations using the allocation key
- removed the composite key setup
- removed collection listener call on reserve or unreserve of a node

Closes: apache#1001

Signed-off-by: Craig Condit <[email protected]>
(cherry picked from commit b02e15e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants