
Withdraw site resource when VM creation fails #300

Open
wants to merge 26 commits into master

Conversation

kimeunju108
Collaborator

What type of PR is this?

/kind feature

What this PR does / why we need it:
This PR proposes to withdraw the site (cluster) resources allocated to a pod when the pod's VM creation fails.

Which issue(s) this PR fixes:
Fixes #286

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
NONE

How to test
~/go/src/k8s.io/arktos$ ./hack/globalscheduler/globalscheduler-up.sh
Open a new terminal:
$cd ~/go/src/k8s.io/arktos/globalscheduler/test/yaml
$kubectl apply -f sample_6_clusters.yaml
$kubectl apply -f sample_2_schedulers.yaml
$kubectl apply -f sample_2_distributors.yaml
$kubectl apply -f sample_2_dispatchers.yaml
$kubectl apply -f sample_6_pods.yaml
$kubectl get pods

@@ -643,20 +653,28 @@ func (n *SiteCacheInfo) updateSiteFlavor(resourceTypes []string, regionFlavors m
}
}

func (n *SiteCacheInfo) deductFlavor() {
func (n *SiteCacheInfo) updateFlavorCount(deduct bool) {
Collaborator

A newly added function like this needs a unit test.

Collaborator Author

@kimeunju108 kimeunju108 May 27, 2021

Definitely, I will do it.
Update: I am working on the unit test case. It will take some time to finish, so I will submit it in a separate PR.
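For reference, a minimal table-driven skeleton for the requested test could look like the following; the package name, the empty SiteCacheInfo fixture, and the assertion placeholders are assumptions, not the real fields of the type:

package sitecacheinfo // assumption: the test lives in the same package as SiteCacheInfo

import "testing"

func TestUpdateFlavorCount(t *testing.T) {
	tests := []struct {
		name   string
		deduct bool
	}{
		{name: "deduct flavor counts", deduct: true},
		{name: "add flavor counts back", deduct: false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			n := &SiteCacheInfo{} // placeholder: populate a site with known flavor counts
			n.updateFlavorCount(tt.deduct)
			// placeholder: assert the flavor counts moved in the expected direction
		})
	}
}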

@@ -135,3 +136,15 @@ func (s *Snapshot) Get(siteID string) (*schedulersitecacheinfo.SiteCacheInfo, er
func (s *Snapshot) GetFlavors() (map[string]*typed.RegionFlavor, error) {
return s.RegionFlavorMap, nil
}

func (s *Snapshot) GetRegionFlavors(region string) (map[string]*typed.RegionFlavor, error) {
Collaborator

A newly added function like this needs a unit test.

Collaborator Author

Definitely. I will do it.

Collaborator

Was a test added?


// BindResource binds pods to a site using the k8s client.
// It is the same as Bind except that it returns the bound resource info.
func (b DefaultBinder) BindResource(ctx context.Context, state *interfaces.CycleState, stack *types.Stack,
Collaborator

A newly added function like this needs a unit test.

Consider refactoring some segments out into functions for unit testing.

Collaborator Author

Definitely. I will do it.

Collaborator

@pdgetrf pdgetrf left a comment

need unit tests for some of the complex functions

@pdgetrf
Collaborator

pdgetrf commented May 26, 2021

I have some superficial comments.

Please get another approval from @jshaofuturewei and/or @CoderKevinZhang before merging.


// withdrawResource withdraws the resources reserved for a pod and adds them back to the cache for other pods
func (sched *Scheduler) withdrawResource(podName string) error {
resource := sched.PodSiteResourceMap[podName]
Collaborator

The map is not thread safe. When multiple events for one pod are triggered, how can we prevent synchronization issues here?

Collaborator Author

@kimeunju108 kimeunju108 May 27, 2021

The map is not thread safe. When multiple events for one pod are triggered, how can we prevent synchronization issues here?
=> This function only applies to pod creation failure, and pod creation is driven by the scheduling queue, which is synchronized.
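If concurrent events for the same pod ever do become possible, a minimal sketch of serializing the withdrawal with a mutex could look like this; the mutex field, the SiteResource value type, and the cache update are assumptions for illustration, not the PR's actual code:

import "sync"

// Hypothetical fields on the Scheduler, shown only for illustration.
type Scheduler struct {
	resourceMu         sync.Mutex               // assumed mutex guarding PodSiteResourceMap
	PodSiteResourceMap map[string]*SiteResource // SiteResource is a placeholder value type
}

func (sched *Scheduler) withdrawResource(podName string) error {
	sched.resourceMu.Lock()
	defer sched.resourceMu.Unlock()

	resource, ok := sched.PodSiteResourceMap[podName]
	if !ok {
		return nil // nothing reserved for this pod, or already withdrawn
	}
	_ = resource // placeholder: add the reserved CPU/memory back to the site cache here
	delete(sched.PodSiteResourceMap, podName) // drop the entry so it cannot be withdrawn twice
	return nil
}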

Collaborator

@jshaofuturewei jshaofuturewei Jun 1, 2021

My concern is that a scheduled job syncs resources (CPU & memory) from OpenStack. Before the informers get the failed pods from the apiserver, the scheduled job may have already synchronized from OpenStack, which means the resource has already been claimed back. In that case, is it possible that the failed pod adds resources back twice?
=> No, it won't happen, because PodSiteResourceMap only covers the 60-second time gap. By the time resources (CPU & memory) are synced from OpenStack, PodSiteResourceMap is already empty.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them directly.

The assumption is that the informers take fewer than 60 seconds to trigger the events. However, if the resource map is not empty at that time and the scheduled resource synchronization job happens to fetch resource information from OpenStack, the resources might be wrong.

=> The resource map is empty, so the conflict won't happen. If this function creates too many issues, we should reconsider whether to remove it. It would be better to remove this 60-second gap, or the resource collector should collect information in advance instead of waiting 60 seconds. The 60 seconds is mainly the resource collector's issue; resource collection has to keep collecting information to remove this gap.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them directly.

For example, when we get a failed-pod event and want to withdraw resources, we assume the resource collector does not update the OpenStack resource information at that same time. But if it does, we withdraw the resource information twice.

We can happen to get a failed-pod event just after the resource collector updates the most recent OpenStack resource information, right?

Please correct me if I am wrong.
=> Regardless of whether that is right or not, collecting resource information is the resource collector's job. I tried to work around the resource collector's 60-second issue. But if there are so many issues, it would be better for the resource collector to take this over and solve it.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them directly.

The resource collector used to be the only source from which we know the resource information (memory & CPU). If you introduce the current code to update resources, please make them work with each other. Thanks.

=> It would be better for you to check the resource collector's requirements.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them directly.

Besides the resource collector, can you tell us whether this pull request reduces/withdraws resources by updating the OpenStack resources? Thanks.
=> No, this PR is for "withdraw", which resolves the issue caused by the resource collector. It would be better to reconsider whether this function should be implemented in the scheduler at all.

selector := fields.ParseSelectorOrDie(
"status.phase=" + string(v1.PodAssigned) +
",status.assignedScheduler.name=" + schedulerName)
"status.assignedScheduler.name=" + schedulerName)
Collaborator

A pod which has already been scheduled can still be watched by the informer. If there are millions of pods in the system, does the informer here have to watch all of them?

Collaborator Author

Due to a design change, the scheduler should watch pods that are: assigned, failed, bound, ...

Collaborator

Do we need to watch bound pods as well? I suppose that bound pods are passed to dispatchers; they no longer need to be processed in the scheduler.

Collaborator

"status.phase!=" + string(v1.PodAssigned)

Also add a comment here explaining why these pods are being watched.
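For illustration, combining that suggestion with the existing scheduler-name condition would look roughly like this (a sketch, not the PR's final code):

selector := fields.ParseSelectorOrDie(
	"status.phase!=" + string(v1.PodAssigned) +
		",status.assignedScheduler.name=" + schedulerName)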

Collaborator

@pdgetrf pdgetrf Jun 1, 2021

Do we need to watch the bound pods?

=> I think so. There is a function in scheduler.go:
func assignedPod(pod *v1.Pod) bool {
	return pod.Spec.VirtualMachine != nil && pod.Status.Phase == v1.PodBound
}

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them. Otherwise I do not get any notifications.

According to the code history, the HQ team added the condition to watch assigned pods at 1812947#diff-ffa1d9598838131f8ef28441851152f1434f44e5c396195b12ec36c8528162a0R48

You removed "status.phase=" + string(v1.PodAssigned). Did you ask the HQ team?
=> This part was not changed by me. I remember you changed this part.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them. Otherwise I do not get any notifications.

Can you just click the link? The HQ team member Jun Wang made the change.
=> That is why you have to talk to him, if he did.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them. Otherwise I do not get any notifications.

Jun Wang made this change by adding "status.phase=" + string(v1.PodAssigned), and we think it is fine. You removed it, and we are asking why you removed the condition that Jun Wang added. Maybe you can ask Jun Wang to get his permission.
=> Because you want to change this part, it would be better for you to discuss it with him.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them. Otherwise I do not get any notifications.

I did not want to change it. I just added comments based on your change to remove the condition.
=> According to your comment, you need to talk to him.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them. Otherwise I do not get any notifications.

I think you are the pull request owner and made the change. Can you talk to the HQ team, since your changes are based on Jun Wang's change? Thanks.

=> Because this change is your idea, it would be better for you to talk to HQ. I don't want to change it. Anyway, the selector is updated to:
selector := fields.ParseSelectorOrDie(
"status.phase = " + string(v1.PodAssigned) +
",status.phase = " + string(v1.PodFailed) +
",status.assignedScheduler.name = " + schedulerName)

@@ -44,8 +44,9 @@ func (i *podInformer) Lister() corelisters.PodLister {
// NewPodInformer creates a shared index informer that returns only non-terminal pods.
func NewPodInformer(schedulerName string, client clientset.Interface,
resyncPeriod time.Duration) coreinformers.PodInformer {
// This selector avoids receiving all pod events, which improves scheduling performance.
Collaborator

@jshaofuturewei jshaofuturewei Jun 3, 2021

Would you please add the steps to process the different pod status phases here, like failed, assigned, and bound?
=> The statuses were added.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please do not just add the statuses. Please add the STEPS to process the different pod status phases.
=> If you have a specific preference on the comments, it would be better if you write them as you wish. Or please send me a paragraph, and I will add it.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them.

I have already given you examples below:

Failed. I assume their resources are withdrawn.
Assigned. I assume they are waiting for clusters to bind.
Bound. As Peng Du and I mentioned, we don't know how schedulers want to process bound pods.
=> If you want, why don't you change it yourself? I don't think this is important.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

We asked this question since the bound pods do not need to be watched. We want to confirm it with you. You are the pull request owner; can you please answer the question so that we know whether we are right?
=> This PR is for revoking resources, not for "bound". The "Bound" pod check was already there. There should be a clear reason before we remove it, so you have to talk with HQ, who implemented it. And if it is OK to remove, we should redesign this part and then do it.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them.

Can we say the informer does not need to watch bound pods here? Is the reason you did not want to add the condition "status.phase != Bound" that there is an existing bound-pod check?
=> I will check whether we can do that. But if we cannot, we need to redesign, and that takes quite some time to implement. I am not sure this is correct at this moment.
The selector is updated:
selector := fields.ParseSelectorOrDie(
"status.phase = " + string(v1.PodAssigned) +
",status.phase = " + string(v1.PodFailed) +
",status.assignedScheduler.name = " + schedulerName)
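For illustration, a selector like this is usually wired into a filtered pod informer by tweaking its list options. Below is a minimal client-go style sketch, assuming a standard shared informer factory (not the arktos NewPodInformer implementation) and assuming the apiserver supports these custom pod field selectors:

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical helper, not the PR's code: builds a shared informer factory that
// only lists/watches pods matching the field selector discussed above.
func newFilteredPodInformerFactory(client kubernetes.Interface, schedulerName string,
	resync time.Duration) informers.SharedInformerFactory {
	selector := fields.ParseSelectorOrDie(
		"status.assignedScheduler.name=" + schedulerName)
	return informers.NewSharedInformerFactoryWithOptions(client, resync,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = selector.String()
		}))
}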

@@ -44,8 +44,10 @@ func (i *podInformer) Lister() corelisters.PodLister {
// NewPodInformer creates a shared index informer that returns only non-terminal pods.
func NewPodInformer(schedulerName string, client clientset.Interface,
resyncPeriod time.Duration) coreinformers.PodInformer {
//This selector avoids receiving unnecessary pod events (e.g. scheduled) so that it improves scheduling performance.
//This informer receives pod events only when their status is one of failed, assigned, or bound.
Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

I mean how those pods with different statuses are processed:

  1. Failed. I assume their resources are withdrawn.
  2. Assigned. I assume they are waiting for clusters to bind.
  3. Bound. As Peng Du and I mentioned, we don't know how schedulers want to process bound pods.

=> Can you please change this part as you want if you have a specific opinion? Please change this comment as you wish.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them.

As I mentioned before, we only watch pods we want to process. For failed pods and assigned ones, we know why the informers want to watch them, as I updated in the comments above. But you updated the informer watch conditions to also watch bound pods, and we don't know why we need to watch bound pods. Maybe you can tell us?

=> In the last meeting we agreed that the selector would be "pod status != scheduled", and it has been updated according to that discussion. The "Bound" check has been in the HQ source code from the very beginning. That is out of our scope currently.

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them.

Since you don't know how to add multiple conditions here, we suggest using "pod status != scheduled" as an alternative to "pod status == assigned AND pod status == bound AND pod status == failed". However, as Peng and I found out, the bound pods are not needed; we want to confirm it with you and make the changes.

=> You have to check whether it is OK to remove the HQ code on Monday and whether there is any issue. Then we can remove it.
func assignedPod(pod *v1.Pod) bool {
	return pod.Spec.VirtualMachine != nil && pod.Status.Phase == v1.PodBound
}

Collaborator

@jshaofuturewei jshaofuturewei Jun 4, 2021

Please reply to my comments instead of editing them.

I did not want to remove any code. I just want to add clear comments for your code changes. Jun Wang added a simple condition to search for pods whose status phase is Assigned, which we understand. If you want to include more status phases, like Failed, Bound, and Assigned, please let us know why.
=> You requested watching "Failed" pods to resolve the resource collector issue.

Development

Successfully merging this pull request may close these issues.

Revoke cluster resources when scheduling failed to create a vm