
Withdraw site resource when VM creation fails #300

Open
wants to merge 26 commits into master

Conversation

kimeunju108
Collaborator

What type of PR is this?

/kind feature

What this PR does / why we need it:
This PR proposes to withdraw the site (cluster) resources allocated to a pod when the pod's VM creation fails.

Which issue(s) this PR fixes:
Fixes #286

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
NONE

How to test
~/go/src/k8s.io/arktos$ ./hack/globalscheduler/globalscheduler-up.sh
Open a new terminal:
$cd ~/go/src/k8s.io/arktos/globalscheduler/test/yaml
$kubectl apply -f sample_6_clusters.yaml
$kubectl apply -f sample_2_schedulers.yaml
$kubectl apply -f sample_2_distributors.yaml
$kubectl apply -f sample_2_dispatchers.yaml
$kubectl apply -f sample_6_pods.yaml
$kubectl get pods

@@ -643,20 +653,28 @@ func (n *SiteCacheInfo) updateSiteFlavor(resourceTypes []string, regionFlavors m
}
}

func (n *SiteCacheInfo) deductFlavor() {
func (n *SiteCacheInfo) updateFlavorCount(deduct bool) {
Collaborator

A newly added function like this needs a unit test.

Collaborator Author

@kimeunju108 kimeunju108 May 27, 2021

Definitely, I will do it.
Update: I am working on the unit test case. It will take some time to finish, so I will submit it in a separate PR.
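For reference, a minimal table-driven skeleton for the requested test could look like the following; the package name, the empty SiteCacheInfo fixture, and the assertion placeholders are assumptions, not the real fields of the type:

package sitecacheinfo // assumption: the test lives in the same package as SiteCacheInfo

import "testing"

func TestUpdateFlavorCount(t *testing.T) {
	tests := []struct {
		name   string
		deduct bool
	}{
		{name: "deduct flavor counts", deduct: true},
		{name: "add flavor counts back", deduct: false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			n := &SiteCacheInfo{} // placeholder: populate a site with known flavor counts
			n.updateFlavorCount(tt.deduct)
			// placeholder: assert the flavor counts moved in the expected direction
		})
	}
}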

@@ -135,3 +136,15 @@ func (s *Snapshot) Get(siteID string) (*schedulersitecacheinfo.SiteCacheInfo, er
func (s *Snapshot) GetFlavors() (map[string]*typed.RegionFlavor, error) {
return s.RegionFlavorMap, nil
}

func (s *Snapshot) GetRegionFlavors(region string) (map[string]*typed.RegionFlavor, error) {
Collaborator

A newly added function like this needs a unit test.

Collaborator Author

Definitely. I will do it.

Collaborator

Was a test added?


// BindResource binds pods to a site using the k8s client.
// It is the same as Bind except that it returns the bound resource info.
func (b DefaultBinder) BindResource(ctx context.Context, state *interfaces.CycleState, stack *types.Stack,
Collaborator

A newly added function like this needs a unit test.

Consider refactoring some segments out into functions for unit testing.

Collaborator Author

Definitely. I will do it.

Collaborator

@pdgetrf pdgetrf left a comment

need unit tests for some of the complex functions

@pdgetrf
Collaborator

pdgetrf commented May 26, 2021

I have some superficial comments.

Please get another approval from @jshaofuturewei and/or @CoderKevinZhang before merging.


// withdrawResource withdraws the resources reserved for a pod and adds them back to the cache for other pods
func (sched *Scheduler) withdrawResource(podName string) error {
resource := sched.PodSiteResourceMap[podName]
Collaborator

The map is not thread safe. When multiple events for one pod are triggered, how can we prevent synchronization issues here?

Collaborator Author

@kimeunju108 kimeunju108 May 27, 2021

The map is not thread safe. When multiple events for one pod are triggered, how can we prevent synchronization issues here?
=> This function only applies to pod creation failure, and pod creation is driven by the scheduling queue, which is synchronized.
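If concurrent events for the same pod ever do become possible, a minimal sketch of serializing the withdrawal with a mutex could look like this; the mutex field, the SiteResource value type, and the cache update are assumptions for illustration, not the PR's actual code:

import "sync"

// Hypothetical fields on the Scheduler, shown only for illustration.
type Scheduler struct {
	resourceMu         sync.Mutex               // assumed mutex guarding PodSiteResourceMap
	PodSiteResourceMap map[string]*SiteResource // SiteResource is a placeholder value type
}

func (sched *Scheduler) withdrawResource(podName string) error {
	sched.resourceMu.Lock()
	defer sched.resourceMu.Unlock()

	resource, ok := sched.PodSiteResourceMap[podName]
	if !ok {
		return nil // nothing reserved for this pod, or already withdrawn
	}
	_ = resource // placeholder: add the reserved CPU/memory back to the site cache here
	delete(sched.PodSiteResourceMap, podName) // drop the entry so it cannot be withdrawn twice
	return nil
}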

Collaborator

@jshaofuturewei jshaofuturewei Jun 1, 2021

My concern is that a scheduled job syncs resources (CPU & memory) from OpenStack. Before the informers get the failed pods from the apiserver, the scheduled job may have already synchronized from OpenStack, which means the resource has already been claimed back. In that case, is it possible that the failed pod adds resources back twice?
=> No, it won't happen, because PodSiteResourceMap only covers the 60-second time gap. By the time resources (CPU & memory) are synced from OpenStack, PodSiteResourceMap is already empty.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them directly.

The assumption is that the informers take fewer than 60 seconds to trigger the events. However, if the resource map is not empty at that time and the scheduled resource synchronization job happens to fetch resource information from OpenStack, the resources might be wrong.

=> The resource map is empty, so the conflict won't happen. If this function creates too many issues, we should reconsider whether to remove it. It would be better to remove this 60-second gap, or the resource collector should collect information in advance instead of waiting 60 seconds. The 60 seconds is mainly the resource collector's issue; resource collection has to keep collecting information to remove this gap.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them directly.

For example, when we get a failed-pod event and want to withdraw resources, we assume the resource collector does not update the OpenStack resource information at that same time. But if it does, we withdraw the resource information twice.

We can happen to get a failed-pod event just after the resource collector updates the most recent OpenStack resource information, right?

Please correct me if I am wrong.
=> Regardless of whether that is right or not, collecting resource information is the resource collector's job. I tried to work around the resource collector's 60-second issue. But if there are so many issues, it would be better for the resource collector to take this over and solve it.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them directly.

The resource collector used to be the only source from which we know the resource information (memory & CPU). If you introduce the current code to update resources, please make them work with each other. Thanks.

=> It would be better for you to check the resource collector's requirements.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them directly.

Besides the resource collector, can you tell us whether this pull request reduces/withdraws resources by updating the OpenStack resources? Thanks.
=> No, this PR is for "withdraw", which resolves the issue caused by the resource collector. It would be better to reconsider whether this function should be implemented in the scheduler at all.

selector := fields.ParseSelectorOrDie(
"status.phase=" + string(v1.PodAssigned) +
",status.assignedScheduler.name=" + schedulerName)
"status.assignedScheduler.name=" + schedulerName)
Collaborator

A pod which has already been scheduled can still be watched by the informer. If there are millions of pods in the system, does the informer here have to watch all of them?

Collaborator Author

Due to a design change, the scheduler should watch pods that are: assigned, failed, bound, ...

Collaborator

Do we need to watch bound pods as well? I suppose that bound pods are passed to dispatchers; they no longer need to be processed in the scheduler.

Collaborator

"status.phase!=" + string(v1.PodAssigned)

Also add a comment here explaining why these pods are being watched.
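For illustration, combining that suggestion with the existing scheduler-name condition would look roughly like this (a sketch, not the PR's final code):

selector := fields.ParseSelectorOrDie(
	"status.phase!=" + string(v1.PodAssigned) +
		",status.assignedScheduler.name=" + schedulerName)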

Collaborator

@pdgetrf pdgetrf Jun 1, 2021

Do we need to watch the bound pods?

=> I think so. There is a function in scheduler.go:
func assignedPod(pod *v1.Pod) bool {
	return pod.Spec.VirtualMachine != nil && pod.Status.Phase == v1.PodBound
}

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them. Otherwise I do not get any notifications.

According to the code history, the HQ team added the condition to watch assigned pods at 1812947#diff-ffa1d9598838131f8ef28441851152f1434f44e5c396195b12ec36c8528162a0R48

You removed "status.phase=" + string(v1.PodAssigned). Did you ask the HQ team?
=> This part was not changed by me. I remember you changed this part.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them. Otherwise I do not get any notifications.

Can you just click the link? The HQ team member Jun Wang made the change.
=> That is why you have to talk to him, if he did.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them. Otherwise I do not get any notifications.

Jun Wang made this change by adding "status.phase=" + string(v1.PodAssigned), and we think it is fine. You removed it, and we are asking why you removed the condition that Jun Wang added. Maybe you can ask Jun Wang to get his permission.
=> Because you want to change this part, it would be better for you to discuss it with him.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them. Otherwise I do not get any notifications.

I did not want to change it. I just added comments based on your change to remove the condition.
=> According to your comment, you need to talk to him.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them. Otherwise I do not get any notifications.

I think you are the pull request owner and made the change. Can you talk to the HQ team, since your changes are based on Jun Wang's change? Thanks.

=> Because this change is your idea, it would be better for you to talk to HQ. I don't want to change it. Anyway, the selector is updated to:
selector := fields.ParseSelectorOrDie(
"status.phase = " + string(v1.PodAssigned) +
",status.phase = " + string(v1.PodFailed) +
",status.assignedScheduler.name = " + schedulerName)

@@ -44,8 +44,9 @@ func (i *podInformer) Lister() corelisters.PodLister {
// NewPodInformer creates a shared index informer that returns only non-terminal pods.
func NewPodInformer(schedulerName string, client clientset.Interface,
resyncPeriod time.Duration) coreinformers.PodInformer {
// This selector avoids receiving all pod events, which improves scheduling performance.
Collaborator

@jshaofuturewei jshaofuturewei Jun 3, 2021

Would you please add the steps to process the different pod status phases here, like failed, assigned, and bound?
=> The statuses were added.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please do not just add the statuses. Please add the STEPS to process the different pod status phases.
=> If you have a specific preference on the comments, it would be better if you write them as you wish. Or please send me a paragraph, and I will add it.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them.

I have already given you examples below:

Failed. I assume their resources are withdrawn.
Assigned. I assume they are waiting for clusters to bind.
Bound. As Peng Du and I mentioned, we don't know how schedulers want to process bound pods.
=> If you want, why don't you change it yourself? I don't think this is important.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

We asked this question since the bound pods do not need to be watched. We want to confirm it with you. You are the pull request owner; can you please answer the question so that we know whether we are right?
=> This PR is for revoking resources, not for "bound". The "Bound" pod check was already there. There should be a clear reason before we remove it, so you have to talk with HQ, who implemented it. And if it is OK to remove, we should redesign this part and then do it.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them.

Can we say the informer does not need to watch bound pods here? Is the reason you did not want to add the condition "status.phase != Bound" that there is an existing bound-pod check?
=> I will check whether we can do that. But if we cannot, we need to redesign, and that takes quite some time to implement. I am not sure this is correct at this moment.
The selector is updated:
selector := fields.ParseSelectorOrDie(
"status.phase = " + string(v1.PodAssigned) +
",status.phase = " + string(v1.PodFailed) +
",status.assignedScheduler.name = " + schedulerName)
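For illustration, a selector like this is usually wired into a filtered pod informer by tweaking its list options. Below is a minimal client-go style sketch, assuming a standard shared informer factory (not the arktos NewPodInformer implementation) and assuming the apiserver supports these custom pod field selectors:

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical helper, not the PR's code: builds a shared informer factory that
// only lists/watches pods matching the field selector discussed above.
func newFilteredPodInformerFactory(client kubernetes.Interface, schedulerName string,
	resync time.Duration) informers.SharedInformerFactory {
	selector := fields.ParseSelectorOrDie(
		"status.assignedScheduler.name=" + schedulerName)
	return informers.NewSharedInformerFactoryWithOptions(client, resync,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = selector.String()
		}))
}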

@@ -44,8 +44,10 @@ func (i *podInformer) Lister() corelisters.PodLister {
// NewPodInformer creates a shared index informer that returns only non-terminal pods.
func NewPodInformer(schedulerName string, client clientset.Interface,
resyncPeriod time.Duration) coreinformers.PodInformer {
//This selector avoids receiving unnecessary pod events (e.g. scheduled) so that it improves scheduling performance.
//This informer receives pod events only when their status is one of failed, assigned, or bound.
Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

I mean how those pods with different statuses are processed:

  1. Failed. I assume their resources are withdrawn.
  2. Assigned. I assume they are waiting for clusters to bind.
  3. Bound. As Peng Du and I mentioned, we don't know how schedulers want to process bound pods.

=> Can you please change this part as you want if you have a specific opinion? Please change this comment as you wish.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them.

As I mentioned before, we only watch pods we want to process. For failed pods and assigned ones, we know why the informers want to watch them, as I updated in the comments above. But you updated the informer watch conditions to also watch bound pods, and we don't know why we need to watch bound pods. Maybe you can tell us?

=> In the last meeting we agreed that the selector would be "pod status != scheduled", and it has been updated according to that discussion. The "Bound" check has been in the HQ source code from the very beginning. That is out of our scope currently.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them.

Since you don't know how to add multiple conditions here, we suggest using "pod status != scheduled" as an alternative to "pod status == assigned AND pod status == bound AND pod status == failed". However, as Peng and I found out, the bound pods are not needed; we want to confirm it with you and make the changes.

=> You have to check whether it is OK to remove the HQ code on Monday and whether there is any issue. Then we can remove it.
func assignedPod(pod *v1.Pod) bool {
	return pod.Spec.VirtualMachine != nil && pod.Status.Phase == v1.PodBound
}

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them.

I did not want to remove any code. I just want to add clear comments for your code changes. Jun Wang added a simple condition to search for pods whose status phase is Assigned, which we understand. If you want to include more status phases, like Failed, Bound, and Assigned, please let us know why.
=> You requested watching "Failed" pods to resolve the resource collector issue.

Development

Successfully merging this pull request may close these issues.

Revoke cluster resources when scheduling failed to create a vm