add gpu topology-aware scheduling proposal #1115
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 55d848f to 4ded088.
…tor-sh#1116) Signed-off-by: happy2048 <[email protected]>
Force-pushed from 4ded088 to 23068f5.
Codecov Report
Patch coverage has no change and project coverage change: -0.01%

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1115      +/-   ##
==========================================
- Coverage   66.99%   66.98%   -0.01%
==========================================
  Files         263      263
  Lines       28978    28978
==========================================
- Hits        19413    19412       -1
- Misses       8201     8205       +4
+ Partials     1364     1361       -3
```

Flags with carried forward coverage won't be shown. See 1 file with indirect coverage changes. ☔ View full report in Codecov by Sentry.
As we discussed in the bi-weekly meeting, many issues/questions regarding this proposal are still unresolved. I'll mark this proposal as WIP temporarily; feel free to request a review when you think it's ready.
## Motivation
NVIDIA Collective Communication Library (NCCL) is a Magnum IO library provided by NVIDIA that implements GPU-accelerated collective operations. NCCL is topology-aware (it automatically detects the connection type between GPU cards; no manual configuration is required) and is optimized to deliver high bandwidth and low latency over PCIe, NVLink, Ethernet, and InfiniBand interconnects. In distributed deep learning training jobs, the distributed training framework (PyTorch, MPI) combined with the NCCL library can accelerate training. NCCL can perceive the connections between the GPU cards; different connection types have different bandwidths, and that bandwidth affects the training time of the training job.

The following is a matrix describing the bandwidth between 8 GPU cards on a node, and the unit of value is GB/s:
Strictly speaking, not all 8-card GPU machine types support NVLink. Should we list some example GPU models that do, like the V100 (and note that, for example, the 1080 Ti does not)?
The following is a matrix describing the bandwidth between 8 GPU cards on a node, and the unit of value is GB/s:
```
Bandwidth Matrix:
       gpu_0  gpu_1  gpu_2  gpu_3  gpu_4  gpu_5  gpu_6  gpu_7
```
There's a picture of "nvlink" in the images directory; it may help readers understand the speed between different cards.
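As a rough illustration of how such a bandwidth matrix could be used, here is a minimal Go sketch (not code from this proposal; the type names and the 4-GPU example values are made up) that computes the bottleneck bandwidth, i.e. the minimum pairwise bandwidth, of a candidate GPU combination; the proposal later prefers the combination whose bottleneck bandwidth is largest.

```go
package main

import "fmt"

// BandwidthMatrix holds the pairwise GPU-to-GPU bandwidth of one node in GB/s;
// m[i][j] is the bandwidth between gpu_i and gpu_j.
type BandwidthMatrix [][]float64

// bottleneckBandwidth returns the smallest pairwise bandwidth inside a
// candidate set of GPU indices (-1 if fewer than two GPUs are given).
// A larger bottleneck means the combination is better for NCCL traffic.
func bottleneckBandwidth(m BandwidthMatrix, gpus []int) float64 {
	bottleneck := -1.0
	for i := 0; i < len(gpus); i++ {
		for j := i + 1; j < len(gpus); j++ {
			bw := m[gpus[i]][gpus[j]]
			if bottleneck < 0 || bw < bottleneck {
				bottleneck = bw
			}
		}
	}
	return bottleneck
}

func main() {
	// A made-up 4-GPU matrix just for illustration; real values would come
	// from the node's GPU topology report.
	m := BandwidthMatrix{
		{0, 48, 24, 24},
		{48, 0, 24, 24},
		{24, 24, 0, 48},
		{24, 24, 48, 0},
	}
	fmt.Println(bottleneckBandwidth(m, []int{0, 1})) // 48: high-bandwidth pair
	fmt.Println(bottleneckBandwidth(m, []int{0, 2})) // 24: lower-bandwidth pair
}
```

Among all GPU combinations of the requested size on a node, a plugin could score each combination by this value and pick the highest.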
4. If a node cannot place all the pods of the training job, it will try to place these pods on as few nodes as possible to avoid node resource fragmentation.

### Non-Goals/Future Work

1. In this proposal, it is assumed that a training job can tolerate some pods running on a node first while the remaining pods are pending. If the training job cannot tolerate this situation, the GPU topology plugin needs to be used in conjunction with the gang plugin to implement All Or Nothing scheduling; that is, this solution does not implement the All Or Nothing scheduling logic.
I have a question about this part: why do we need to say that some pods are allowed to run on a node first? Maybe we should instead declare that we need a podgroup to describe a group of pods and find the best scheduling result for them. Whether the pods have to launch together after scheduling is not really related to this topic, in my opinion.
## Proposal
### User stories
#### Story 1
**Single Pod requests GPU cards:** There is only one pod for the training job, and the number of GPU cards requested by the pod exceeds 1. At the same time, the training job uses the NCCL library for communication between GPU cards. The communication bandwidth between GPU cards needs to be considered when allocating GPU cards to pods.
"when allocating GPU cards to pods" -> "when allocating GPU cards to the pod"; this may be a grammar mistake, since Story 1 has only a single pod.
#### Story 1
**Single Pod requests GPU cards:** There is only one pod for the training job, and the number of GPU cards requested by the pod exceeds 1. At the same time, the training job uses the NCCL library for communication between GPU cards. The communication bandwidth between GPU cards needs to be considered when allocating GPU cards to pods.
#### Story 2
**Multiple Pods request GPU cards:** The distributed training job has multiple workers (or multiple pods), the underlying communication framework of the workers uses the NCCL library, and there is data communication between GPU cards. If a node can run these workers, then these workers should be run on a node first to reduce the communication delay between GPUs. If one node cannot run these workers, consider multiple nodes to run these workers; when each node selects GPUs for the workers, which should be run on the node, communication bandwidth between GPU cards should be considered, and GPU combination with the largest bottleneck bandwidth is preferred.
“when each node selects GPUs for the workers, which should be run on the node” — hmm, this phrasing seems a little strange.
#### Story 2
**Multiple Pods request GPU cards:** The distributed training job has multiple workers (or multiple pods), the underlying communication framework of the workers uses the NCCL library, and there is data communication between GPU cards. If a node can run these workers, then these workers should be run on a node first to reduce the communication delay between GPUs. If one node cannot run these workers, consider multiple nodes to run these workers; when each node selects GPUs for the workers, which should be run on the node, communication bandwidth between GPU cards should be considered, and GPU combination with the largest bottleneck bandwidth is preferred.

In this scenario, the following situation may occur: some workers (or pods) of the training job are running, while the remaining pods are pending due to untimely scheduling for some reasons. If the training job can tolerate this situation, no special handling is required; if the training job cannot tolerate this situation, the running pods occupy and waste resources. To avoid this situation, it is necessary to ensure All Or Nothing resource scheduling. In this case, gang scheduling is required.
My comment on this part is the same as the one above.
#### Story 1
**Single Pod requests GPU cards:** There is only one pod for the training job, and the number of GPU cards requested by the pod exceeds 1. At the same time, the training job uses the NCCL library for communication between GPU cards. The communication bandwidth between GPU cards needs to be considered when allocating GPU cards to pods.
#### Story 2
**Multiple Pods request GPU cards:** The distributed training job has multiple workers (or multiple pods), the underlying communication framework of the workers uses the NCCL library, and there is data communication between GPU cards. If a node can run these workers, then these workers should be run on a node first to reduce the communication delay between GPUs. If one node cannot run these workers, consider multiple nodes to run these workers; when each node selects GPUs for the workers, which should be run on the node, communication bandwidth between GPU cards should be considered, and GPU combination with the largest bottleneck bandwidth is preferred.
If the pods must be scheduled across several nodes, for example 2 nodes, then one node's free resources must go to 0, while the other node's free resources may or may not go to 0. For the node whose free resources go to 0, there is no need to consider topology. For the node that still has free resources, should we consider the topology as worst rather than best? Because we may want to reserve the good topology for another pod that can run entirely on one node. On the other hand, once pods cross nodes, the bottleneck is the network speed between the nodes, so trying to get the best intra-node GPU combination on a node that still has free resources seems useless.
#### Main steps
The main steps are described as follows:

- When pod1 starts to be scheduled, the GPU topology plugin uses two specific pod labels (which will be introduced later) to find the pods in the same group that have not been scheduled yet (including pod1 itself) in the PreFilter extension, for example, [pod1, pod2, pod3].
Why do I think this feature is so closely related to coscheduling? Maybe we should make sure pod1/pod2/pod3 are adjacent in the scheduling queue; only that way can your scheduling plan actually be achieved. However, we can only load one Sort plugin in the scheduler framework, and coscheduling also needs the Sort plugin. Actually, we haven't found a case that needs NVLink but does not need coscheduling, so maybe we can reuse the coscheduling plugin to achieve this feature. Coscheduling can also help recognize the relationship between pod1/pod2/pod3.
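For readers unfamiliar with the grouping step, here is a rough, self-contained Go sketch of what the PreFilter-time group lookup could look like. It is not the proposal's implementation: the pod struct is a trimmed stand-in for *v1.Pod, and the label key gpu-topology/group-name is a placeholder, since the proposal introduces its two label names later.

```go
package main

import "fmt"

// pod is a trimmed-down stand-in for *v1.Pod with just the fields this sketch
// needs: labels and the node it has been bound to (empty = unscheduled).
type pod struct {
	Name     string
	Labels   map[string]string
	NodeName string
}

// The proposal mentions two specific pod labels but introduces their names
// later; this single group-name key is a placeholder for illustration only.
const groupNameLabel = "gpu-topology/group-name"

// unscheduledGroupMembers returns the pods that carry the same group label as
// p and have not been assigned to a node yet, including p itself. In a real
// plugin this lookup would run in the PreFilter extension against the
// scheduler's pod lister.
func unscheduledGroupMembers(p pod, allPods []pod) []pod {
	group := p.Labels[groupNameLabel]
	if group == "" {
		return []pod{p} // not part of a topology group; schedule alone
	}
	var members []pod
	for _, other := range allPods {
		if other.Labels[groupNameLabel] == group && other.NodeName == "" {
			members = append(members, other)
		}
	}
	return members
}

func main() {
	pods := []pod{
		{Name: "pod1", Labels: map[string]string{groupNameLabel: "job-a"}},
		{Name: "pod2", Labels: map[string]string{groupNameLabel: "job-a"}},
		{Name: "pod3", Labels: map[string]string{groupNameLabel: "job-a"}},
		{Name: "pod4", Labels: map[string]string{groupNameLabel: "job-b"}, NodeName: "node-1"},
	}
	for _, m := range unscheduledGroupMembers(pods[0], pods) {
		fmt.Println(m.Name) // pod1, pod2, pod3
	}
}
```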
- If one node cannot place [pod1, pod2, pod2], then try to place these three pods with 2 nodes. After allocating GPUs to the pods, the combination with less remaining GPU resources on the node is preferred.
[pod1, pod2, pod2]->[pod1, pod2, pod3]
- If one node cannot place [pod1, pod2, pod2], then try to place these three pods with 2 nodes. After allocating GPUs to the pods, the combination with less remaining GPU resources on the node is preferred.
If a job has 10 pods and a node can only fit 2 of them, will you try every arrangement and combination, like pod1+pod2 or pod1+pod3 and so on (C2/100)? Maybe we can assume that all pods in a GPU job that needs NVLink are isomorphic (they actually are in the real world); this makes the problem much easier. We want the semantics of "node A can place 2 pods, node B can place 1 pod, ...", not "node A can place pod1+pod3, node B can place pod2, ...".
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With pod1/pod2/pod3 and node1/node2/node3, the arrangements and combinations would be [pod1+pod2 on node1, pod1+pod3 on node1, ...], [pod1+pod2 on node2, pod1+pod3 on node2, ...], whose number is determined by the pod count and node count, which is terrible. A better process may be: first calculate each node's maximum assignable pod count, sort the nodes, then place the isomorphic pods onto the nodes from max to min, as in the sketch below.
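A minimal Go sketch of the process suggested above, assuming the job's pods are isomorphic (each requests the same number of GPUs); all names and numbers here are illustrative, not part of the proposal.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeCapacity describes how many GPUs a node still has free.
type nodeCapacity struct {
	Name    string
	FreeGPU int
}

// placeIsomorphicPods assigns podCount identical pods (each needing gpusPerPod
// GPUs) to nodes, filling the node that can hold the most pods first so the
// job spreads over as few nodes as possible. It returns node name -> pod count
// and whether the whole group fits.
func placeIsomorphicPods(nodes []nodeCapacity, podCount, gpusPerPod int) (map[string]int, bool) {
	// Sort nodes by how many pods they can take, largest first.
	sort.Slice(nodes, func(i, j int) bool {
		return nodes[i].FreeGPU/gpusPerPod > nodes[j].FreeGPU/gpusPerPod
	})
	plan := map[string]int{}
	remaining := podCount
	for _, n := range nodes {
		if remaining == 0 {
			break
		}
		fit := n.FreeGPU / gpusPerPod
		if fit == 0 {
			continue
		}
		if fit > remaining {
			fit = remaining
		}
		plan[n.Name] = fit
		remaining -= fit
	}
	return plan, remaining == 0
}

func main() {
	nodes := []nodeCapacity{
		{Name: "node-a", FreeGPU: 4},
		{Name: "node-b", FreeGPU: 8},
		{Name: "node-c", FreeGPU: 2},
	}
	// 3 pods, each requesting 4 GPUs: node-b takes 2, node-a takes 1.
	plan, ok := placeIsomorphicPods(nodes, 3, 4)
	fmt.Println(plan, ok) // map[node-a:1 node-b:2] true
}
```

This only captures the "how many pods fit on each node" semantics; GPU-combination selection inside each chosen node would still use the topology/bottleneck-bandwidth scoring discussed earlier.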
Ⅰ. Describe what this PR does
Add gpu topology-aware scheduling proposal
Ⅱ. Does this pull request fix one issue?
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews
V. Checklist
make test