Force-pushed from 8e75f25 to 753962f.
I've rebased, so only new commits are present now. I also re-enabled the stress test. It passed in local testing.
VolumeSource: corev1.VolumeSource{
	Secret: &corev1.SecretVolumeSource{
		// TODO: replace name, create secret via cert-manager
This is a bit more complicated. When deploying via OLM, we need OLM to create the certificates. My understanding is that this can be done for "normal" webhooks, just not for conversion webhooks.
For deployment of the operator via YAML files we need to document how to set up those certs.
In both cases, re-configuring the cluster to use the webhooks is something that needs to be done manually.
If the user of the operator doesn't need the scheduler extensions, then it's still okay to start the statefulset unconditionally. The pod simply won't run.
> This is a bit more complicated. When deploying via OLM, we need OLM to create the certificates. My understanding is that this can be done for "normal" webhooks, just not for conversion webhooks.
I doubt that OLM can create certificates for services created by the operator for the CRs, though it does provide service certificates for the operator itself.
@pohly How do this scheduler extension and the Pod mutation work in the context of multiple PmemCSIDeployments? I think we should not/cannot support multiple deployments.
For the mutating pod webhook: it must use a resource name which includes the driver name. I thought it already did this, but currently it doesn't:
pmem-csi/pkg/scheduler/mutate_pod.go
Line 31 in 692818b
Resource = "pmem-csi.intel.com/scheduler"
For the scheduler extender: it must be triggered by the same resource name (driver name to be added) and then check only pods that belong to that instance - already implemented:
pmem-csi/pkg/scheduler/capacity.go
Lines 103 to 106 in 82bcde1
if label.GetName() == "driver_name" &&
	label.GetValue() != c.driverName {
	return 0, wrongPod
}
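As a rough illustration of the first point, a minimal sketch (hypothetical helper; the real constant lives in pkg/scheduler/mutate_pod.go) of deriving the resource name from the driver name:

```go
package scheduler

// schedulerResourceName is a hypothetical helper illustrating the suggestion
// above: derive the extended resource name from the configured driver name
// instead of hard-coding "pmem-csi.intel.com/scheduler", so that multiple
// PmemCSIDeployments with different driver names do not clash.
func schedulerResourceName(driverName string) string {
	return driverName + "/scheduler" // e.g. "pmem-csi.intel.com/scheduler"
}
```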
> I doubt that OLM can create certificates for services created by the operator for the CRs, though it does provide service certificates for the operator itself.
When we discussed it this morning, my conclusion was that the user needs to supply TLS credentials and that the webhooks need to be disabled if not provided.
> In both cases, re-configuring the cluster to use the webhooks is something that needs to be done manually.
Have I mentioned how much I hate configuring a scheduler extension? 😢
The way we currently do it is by creating the certificates first, then setting up the "kubeadm init" configuration so that those certificates can be used by `kube-scheduler`. This no longer works when creating those certificates inside the cluster with a tool like cert-manager.
It is possible to set up the cluster without the extension enabled, then create certificates, and then reconfigure `kube-scheduler`. `kubeadm upgrade` supports that. But this looks complicated. I think I'll stick with the current approach.
- -nodeid=$(KUBE_NODE_NAME)
- -mode=webhooks
- -schedulerListen=:8000
- -drivername=$(PMEM_CSI_DRIVER_NAME)
We still need the external-provisioner in the controller as a fallback for kubernetes-csi/external-provisioner#544
... or a custom controller which removes the selected-node annotation.
... or just set AllowedTopologies in our storage classes for late binding (kubernetes-csi/external-provisioner#544 (comment)).
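As a sketch of that idea in Go (hypothetical object name; the NFD node label is taken from the provisioning error quoted later in this conversation, and later comments explain why the released external-provisioner rejects this approach):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	binding := storagev1.VolumeBindingWaitForFirstConsumer
	// Hypothetical storage class for late binding that restricts
	// provisioning to nodes labeled as having PMEM.
	sc := storagev1.StorageClass{
		ObjectMeta:        metav1.ObjectMeta{Name: "pmem-csi-sc-late-binding"},
		Provisioner:       "pmem-csi.intel.com",
		VolumeBindingMode: &binding,
		AllowedTopologies: []corev1.TopologySelectorTerm{{
			MatchLabelExpressions: []corev1.TopologySelectorLabelRequirement{{
				Key:    "feature.node.kubernetes.io/memory-nv.dax",
				Values: []string{"true"},
			}},
		}},
	}
	fmt.Println(sc.Name, sc.AllowedTopologies)
}
```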
patch: |-
  - op: add
    path: /spec/template/spec/containers/0/command/-
    value: -mode=webhooks
What does this "webhooks" mode do? Isn't it that the valid modes are 'controller' and 'node'?
Okay, after going through the design documentation changes, I understood that the new 'webhooks' mode replaces the existing 'controller' mode. But the driver command-line flags in `pkg/pmem-csi-driver/main.go` are not in sync with this.
Fixed.
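For reference, a minimal standalone sketch (hypothetical code; the real `DriverMode` type and its `Set` validation live in the driver package) of how the new `webhooks` mode can sit next to the existing `node` mode as a validated flag value:

```go
package main

import (
	"flag"
	"fmt"
)

// DriverMode is a sketch of the driver's mode flag. The real type and its
// valid values are defined in the PMEM-CSI driver package; "webhooks" here
// replaces the former "controller" mode as described in this PR.
type DriverMode string

const (
	Node     DriverMode = "node"     // per-node driver with distributed provisioning
	Webhooks DriverMode = "webhooks" // central pod mutation webhook + scheduler extender
)

func (m *DriverMode) String() string { return string(*m) }

// Set validates user input, so later code can assume the mode is valid.
func (m *DriverMode) Set(value string) error {
	switch mode := DriverMode(value); mode {
	case Node, Webhooks:
		*m = mode
		return nil
	default:
		return fmt.Errorf("invalid driver mode %q", value)
	}
}

func main() {
	mode := Node
	flag.Var(&mode, "mode", "driver mode: node or webhooks")
	flag.Parse()
	fmt.Println("running in mode", mode)
}
```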
    path: /spec/template/spec/containers/0/command/-
    value: -kube-api-burst=10
  - op: remove
    path: /spec/template/spec/containers/1
Why can't we make these changes (removing the external-provisioner from the controller) in our base driver YAML instead of in a kustomize layer?
We can, and I thought I had made that change when squashing. I'll check again... it looks like I simply forgot to remove this file. It shouldn't be needed anymore.
Initially it wasn't done because distributed and centralized provisioning were both supported for a while.
Fixed.
@@ -155,88 +155,35 @@ Kata Container support has to be enabled explicitly via a [storage
class parameter and Kata Containers must be set up
appropriately](install.md#kata-containers-support)

## Driver modes
The reference is not removed from the table of contents.
docs/design.md (outdated)
PMEM-CSI solves this problem by deploying `external-provisioner`
alongside each node driver and enabling ["distributed
provisioning"](https://github.com/kubernetes-csi/external-provisioner/issues/487):
Wouldn't it be better to link to the documentation, if there is any, instead of the issue?
Yes, now that it is merged. Will change it.
Fixed.
on which nodes the PMEM-CSI driver is running and how much capacity is
available there. This information is retrieved by dynamically
discovering PMEM-CSI pods and connecting to their [metrics
endpoint](/docs/install.md#metrics-support).
This makes 'enabling metrics support' a hard dependency of the 'scheduler extension'. The code should enforce this by always enabling node metrics when the scheduler extension is enabled.
Does the node part know whether scheduler extensions are enabled? I don't think so, because those parameters are passed to the controller part.
Or did you mean some code other than main.go and pmem-csi-driver.go?
> Does the node part know whether scheduler extensions are enabled? I don't think so, because those parameters are passed to the controller part.
Yes, you are right, it is not possible to enforce this in the code. It would be good to add a note about "enabling metrics while using the scheduler extension" in the documentation.
It's documented that the scheduler extension code uses metrics. I think that is enough. We don't describe how to disable metrics support (it's always on in YAML and operator deployments), so the risk that someone accidentally deploys incorrectly is low.
docs/install.md (outdated)
@@ -8,7 +8,7 @@
- [Install PMEM-CSI driver](#install-pmem-csi-driver)
- [Install using the operator](#install-using-the-operator)
- [Install from source](#install-from-source)
- [Expose persistent and cache volumes to applications](#expose-persistent-and-cache-volumes-to-applications)
- [Expose volumes to applications](#expose-volumes-to-applications)
The links in the table of contents are broken.
Fixed, also in the body.
pkg/pmem-csi-driver/main.go (outdated)
@@ -71,7 +67,7 @@ func Main() int {
	klog.V(3).Info("Version: ", version)

	if config.schedulerListen != "" {
	if config.Mode != Controller {
	if config.Mode != Webhooks {
I feel like offloading the webhook and scheduler extension parts from the CSI driver into a separate binary makes more sense and is cleaner than treating them as a mode of the driver.
What's the benefit? Have you considered the drawbacks?
"Cleaner" is subjective.
Is this perhaps about publishing one image that has multiple purposes vs. two different images with different content and different purposes?
I prefer to stick with a single image until someone has some strong argument against it. I doubt that the size of an image that only runs once in a cluster matters that much and publishing a single image is simpler, so I prefer that.
When we have a single image, then a single binary is likely to be smaller overall.
What that binary is called and how it is invoked hardly matters for end-users - they never invoke the binary. Renaming the binary is possible, but makes this PR more complex for (IMHO) little gain.
My point was not about different images, more about a different binary. As there is now no central controller driver component, IMHO the scheduler code does not fit well inside the driver code.
We discussed this in the standup meeting.
"pmem-csi-driver" has already before served multiple different purposes and we agreed to not change that in this PR because doing so would make our image larger for not much gain (basically cleaner command line arguments of the binary which isn't that important).
We also raised the question whether "pmem-csi-operator" should be a separate binary. Probably not. But again, that is a different issue.
Force-pushed from 82bcde1 to 053de08.
I've pushed a major update: now the operator supports configuration of the webhooks and distributed provisioning. Some E2E tests passed (in particular the ones for the operator), but for full coverage I let the CI do its work overnight. I'll double-check tomorrow, but I think there is only one minor TODO left (the one about reading the node port from the shell script in deploy.go). I've not tried to ensure that all tests pass for all commits, so strictly speaking I would have to squash "distributed provisioning", "deploy: consistent object naming, app.kubernetes.io labels" and "operator: distributed provisioning, scheduler extensions". But I think it's easier to review when they are separate.
That referred to TODOs inside the patch itself. I still need to address review comments.
@@ -104,9 +98,6 @@ type Config struct {
	NodeID string
	//Endpoint exported csi driver endpoint
	Endpoint string
	//TestEndpoint adds the controller service to the server listening on Endpoint.
	//Only needed for testing.
	TestEndpoint bool
	//Mode mode fo the driver
	Mode DriverMode
	//RegistryEndpoint exported registry server endpoint
`RegistryEndpoint` is no longer needed and could be removed.
Removed (but not pushed yet).
Force-pushed from 053de08 to 5518483.
I've pushed an update with various test fixes. Several tests still used the old labels and/or had issues with the optional controller part.
@@ -188,6 +196,16 @@ func (csid *csiDriver) Run() error {
	)
	csid.gatherers = append(csid.gatherers, cmm.GetRegistry())

	var client kubernetes.Interface
	if config.schedulerListen != "" ||
		csid.cfg.Mode == Node {
I think the check is supposed to be `csid.cfg.Mode == Webhooks`; in that case we could remove this check and move the client creation under the webhooks section in the switch case.
Indeed, good catch. The initialization of `cmm` can also be moved.
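A standalone sketch of the suggested restructuring (hypothetical helper; the real code lives in the driver's Run method): create the Kubernetes client inside the mode switch, only for the mode that needs it, instead of behind a separate guard.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// DriverMode mirrors the driver's mode flag (see the earlier sketch).
type DriverMode string

const (
	Node     DriverMode = "node"
	Webhooks DriverMode = "webhooks"
)

// newClientForMode is a hypothetical helper showing the idea: the
// Kubernetes client is created in the webhooks branch instead of behind a
// separate `schedulerListen != ""` check.
func newClientForMode(mode DriverMode) (kubernetes.Interface, error) {
	switch mode {
	case Webhooks:
		cfg, err := rest.InClusterConfig()
		if err != nil {
			return nil, fmt.Errorf("get in-cluster config: %w", err)
		}
		return kubernetes.NewForConfig(cfg)
	default:
		// Other modes set up their own clients as needed.
		return nil, nil
	}
}

func main() {
	client, err := newClientForMode(Webhooks)
	fmt.Println(client, err)
}
```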
kind: ServiceAccount
metadata:
  labels:
    pmem-csi.intel.com/deployment: lvm-testing
Is keeping this label (and also the rest of this change) intentional?
Yes. The usage of "pmem-csi.intel.com/deployment" to mark different kinds of deployments in deploy.go doesn't change.
That secret must contain the following data items:
- `ca.crt`: root CA certificate
- `tls.key`: secret key of the webhook
Wouldn't it be better to mention that the secret/public key is for both the webhook and the scheduler extender?
The scheduler extender is also a webhook. Both run in the controller.
With that in mind, it's better to say above that "The controller needs TLS certificates" - I'll use that instead. The name of the API was already chosen accordingly.
@@ -377,6 +424,7 @@ var subObjectHandlers = map[string]redeployObject{
	},
	"metrics service": {
		objType: reflect.TypeOf(&corev1.Service{}),
		enabled: controllerEnabled,
Can't the node metrics service be enabled regardless of the controller?
Node metrics are always enabled. There's nothing in the operator API to control that.
Force-pushed from 5518483 to 42de002.
# It must match the "provisioner" field above.
- key: pmem-csi.intel.com/driver
  values:
  - started
Looks like this breaks skew testing; I've not looked into it yet, but probably it is related to the older deployment not setting this label.
I wanted to ensure that only nodes where the driver actually runs (or at least ran) are used, vs. nodes where it is supposed to run. The latter is what we would get when using the label that admins need to set for the nodes with PMEM.
Will have to think about this some more...
The problem is that the 0.8.0 deployment doesn't handle the new topology:
pmem-csi-controller-0/[email protected]: I0118 11:38:34.703703 1 tracing.go:19] GRPC call: /csi.v1.Controller/CreateVolume
pmem-csi-controller-0/[email protected]: I0118 11:38:34.703770 1 tracing.go:20] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"pmem-csi.intel.com/driver":"started"}}],"requisite":[{"segments":{"pmem-csi.intel.com/driver":"started"}}]},"capacity_range":{"required_bytes":115343360},"name":"pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0","parameters":{"eraseafter":"true"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
pmem-csi-controller-0/[email protected]: I0118 11:38:34.704225 1 controllerserver-master.go:181] Controller CreateVolume: Name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0 required_bytes:115343360 limit_bytes:0
pmem-csi-controller-0/[email protected]: I0118 11:38:34.704251 1 controllerserver-master.go:86] Controller CreateVolume: Create VolumeID:pvc-26-328dad453a5df7f8348d57f9f7422e6b084a58303f5400af6f92906a based on name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0
pmem-csi-controller-0/[email protected]: W0118 11:38:34.704273 1 controllerserver-master.go:236] failed to connect to : No node registered with id:
I'm a bit surprised that "pmem-csi.intel.com/node" doesn't show up at all; might be related to how we configure the external-provisioner. Anyway, we can't change our topology labels without breaking compatibility, so I'll switch to using the node selector labels that we use for the PMEM-CSI node pods also for the storage classes.
Using our node selector label also doesn't work because the external-provisioner only seems to allow node labels that were set by the CSI driver:
pmem-csi-intel-com-node-q94vv/[email protected]: I0118 14:46:54.540845 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"skew-4425", Name:"direct-testingv4nl9", UID:"fb94e3a9-62e5-44ec-b48b-33471452142c", APIVersion:"v1", ResourceVersion:"1204", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "pmem-pmem-csi-sc-ext4-skew-4425-unused": error generating accessibility requirements: selected node '"pmem-csi-pmem-govm-worker2"' topology 'map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2]' is not in allowed topologies: [map[feature.node.kubernetes.io/memory-nv.dax:true]]
At this point, running a "deprovisioner" in the PMEM-CSI controller looks like the simplest solution.
I've implemented that, albeit with the name "(PVC) rescheduler" because that's what it is.
I decided to keep the commits for the previous two attempts because it is something that we might want to refer to. I could also squash.
I used the existing external-provisioner helper lib. It's not a perfect fit because it needs a PV informer without ever doing anything with PVs, but using it was simpler than writing everything from scratch (which probably would have amounted to a lot of copy-and-paste).
I've pushed after some local testing ("make test", the new late binding test, some skew testing).
@avalluri can you take another look?
I've filed kubernetes-csi/external-provisioner#548 to discuss why we can't use the node label in `AllowedTopology`. With a modified external-provisioner we might be able to restore that solution, which would be better than the current solution alone because then a bad node selection should happen a lot less. Even if we can enable that again, rescheduling as a second line of defense will still be useful (think "node selected based on labels", "labels change", "provisioning no longer possible" -> "must reschedule").
This can be run concurrently to creating volumes and then will show how objects change over time.
The previous error message only listed the last path that had been tried. Listing all the paths that were checked makes it more obvious what might be missing.
User input is already checked in DriverMode.Set, therefore we can assume that all DriverMode strings are valid.
This is useful for debugging outside of a cluster.
By moving the code into the device manager package it becomes reusable, for example in upcoming tests. While at it, the resync period gets moved and increased. The code shouldn't depend on resyncing to catch up with the current state of the world.
Events get recorded in one goroutine and checked in another. We must protect the data structure where we store events.
The newer runtime has a nicer API:
- client.Object combines access to meta data and object data, which in some cases means we can drop one parameter or extra code
- context is passed explicitly to Reconcile()
This is merely a precaution. Right now, the existing initial events are secrets which should be read-only and thus safe to share.
The Go runtime sometimes fails the deployment test with an error about a concurrent write access to a map (copied below, see also intel#852). It's not clear what that map is. It looks like it's related to decoding into the same object, but objects shouldn't be shared between goroutines. As a workaround we can prevent parallel execution of Patch in the test. This is less restrictive than preventing parallel execution altogether. fatal error: concurrent map writes goroutine 1498 [running]: runtime.throw(0x17a34b4, 0x15) /nvme/gopath/go-1.15.2/src/runtime/panic.go:1116 +0x72 fp=0xc0019627a0 sp=0xc001962770 pc=0x439552 runtime.mapassign(0x15d4600, 0xc000206bd0, 0xc00051a7f0, 0xc001513440) /nvme/gopath/go-1.15.2/src/runtime/map.go:585 +0x5c5 fp=0xc001962820 sp=0xc0019627a0 pc=0x411ca5 reflect.mapassign(0x15d4600, 0xc000206bd0, 0xc00051a7f0, 0xc00051a820) /nvme/gopath/go-1.15.2/src/runtime/map.go:1319 +0x3f fp=0xc001962850 sp=0xc001962820 pc=0x46abbf github.com/modern-go/reflect2.(*UnsafeMapType).UnsafeSetIndex(...) /nvme/gopath/pkg/mod/github.com/modern-go/[email protected]/unsafe_map.go:76 github.com/json-iterator/go.(*mapDecoder).Decode(0xc0007ed090, 0xc001b40ba8, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect_map.go:180 +0x1a5 fp=0xc0019628d0 sp=0xc001962850 pc=0x880f65 github.com/json-iterator/go.(*placeholderDecoder).Decode(0xc000998130, 0xc001b40ba8, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect.go:324 +0x47 fp=0xc0019628f8 sp=0xc0019628d0 pc=0x879ea7 github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000ba0100, 0xc001b40b18, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962980 sp=0xc0019628f8 pc=0x890f58 github.com/json-iterator/go.(*generalStructDecoder).decodeOneField(0xc000ba0960, 0xc001b40b18, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect_struct_decoder.go:552 +0x217 fp=0xc0019629f8 sp=0xc001962980 pc=0x88e577 github.com/json-iterator/go.(*generalStructDecoder).Decode(0xc000ba0960, 0xc001b40b18, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect_struct_decoder.go:508 +0x85 fp=0xc001962a68 sp=0xc0019629f8 pc=0x88e065 github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000ba1f20, 0xc001b40a20, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962af0 sp=0xc001962a68 pc=0x890f58 github.com/json-iterator/go.(*twoFieldsStructDecoder).Decode(0xc0007db380, 0xc001b40a20, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect_struct_decoder.go:617 +0xab fp=0xc001962b58 sp=0xc001962af0 pc=0x88ebab github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000e8c6a0, 0xc001b40a18, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962be0 sp=0xc001962b58 pc=0x890f58 github.com/json-iterator/go.(*fiveFieldsStructDecoder).Decode(0xc00058d8c0, 0xc001b40a18, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect_struct_decoder.go:743 +0xcb fp=0xc001962c48 sp=0xc001962be0 pc=0x88f5eb github.com/json-iterator/go.(*structFieldDecoder).Decode(0xc000e8d6e0, 0xc001b40900, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect_struct_decoder.go:1054 +0x78 fp=0xc001962cd0 sp=0xc001962c48 pc=0x890f58 
github.com/json-iterator/go.(*fiveFieldsStructDecoder).Decode(0xc00058daa0, 0xc001b40900, 0xc001513440) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect_struct_decoder.go:741 +0x2f2 fp=0xc001962d38 sp=0xc001962cd0 pc=0x88f812 github.com/json-iterator/go.(*Iterator).ReadVal(0xc001513440, 0x176d840, 0xc001b40900) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/reflect.go:79 +0xc2 fp=0xc001962da8 sp=0xc001962d38 pc=0x877d42 github.com/json-iterator/go.(*frozenConfig).Unmarshal(0xc0001e4fa0, 0xc002213300, 0x11f7, 0x1300, 0x176d840, 0xc001b40900, 0x0, 0x0) /nvme/gopath/pkg/mod/github.com/json-iterator/[email protected]/config.go:348 +0xb7 fp=0xc001962e08 sp=0xc001962da8 pc=0x86da17 k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).Decode(0xc0000e8aa0, 0xc002213300, 0x11f7, 0x1300, 0x0, 0x1948e40, 0xc001b40900, 0x2b, 0x2b, 0xc000044390, ...) /nvme/gopath/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/json/json.go:263 +0x602 fp=0xc001963130 sp=0xc001962e08 pc=0x10ab082 k8s.io/apimachinery/pkg/runtime/serializer/recognizer.(*decoder).Decode(0xc000230240, 0xc002213300, 0x11f7, 0x1300, 0x0, 0x1948e40, 0xc001b40900, 0xc0007270e0, 0x7f510e3e3498, 0xc0007270e0, ...) /nvme/gopath/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/recognizer/recognizer.go:107 +0x4f7 fp=0xc001963218 sp=0xc001963130 pc=0x10a9377 k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).Decode(0xc0007270e0, 0xc002213300, 0x11f7, 0x1300, 0x0, 0x1948e40, 0xc001b40900, 0x3, 0x4, 0x19752a0, ...) /nvme/gopath/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/versioning/versioning.go:136 +0xa9 fp=0xc001963300 sp=0xc001963218 pc=0x10b0bc9 sigs.k8s.io/controller-runtime/pkg/client/fake.(*fakeClient).Patch(0xc00376a060, 0x1978f60, 0xc003ad2fc0, 0x1948e40, 0xc001b40900, 0x194e480, 0xc002cb60a0, 0x0, 0x0, 0x0, ...) /nvme/gopath/pkg/mod/sigs.k8s.io/[email protected]/pkg/client/fake/client.go:378 +0x747 fp=0xc001963610 sp=0xc001963300 pc=0x14bf607 github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.(*testClient).Patch(0xc00376a080, 0x1978f60, 0xc003ad2fc0, 0x1948e40, 0xc001b40900, 0x194e480, 0xc002cb60a0, 0x0, 0x0, 0x0, ...) <autogenerated>:1 +0xb6 fp=0xc001963680 sp=0xc001963610 pc=0x14c9596 github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*objectPatch).apply(0xc0023af848, 0x1978f60, 0xc003ad2fc0, 0x198a040, 0xc00376a080, 0x0, 0x0) /nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:192 +0x6a4 fp=0xc0019637c0 sp=0xc001963680 pc=0x13cbee4 github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*pmemCSIDeployment).redeploy(0xc0036020c0, 0x1978f60, 0xc003ad2fc0, 0xc000a26b40, 0x199d160, 0x176d840, 0x0, 0x183dc88, 0x183dce0, 0x183dd20, ...) 
/nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:329 +0x37d fp=0xc0019638f8 sp=0xc0019637c0 pc=0x13cd49d github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*pmemCSIDeployment).reconcile.func1(0xc003602200, 0x179457c) /nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:233 +0x1d2 fp=0xc001963ab0 sp=0xc0019638f8 pc=0x13debd2 github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*pmemCSIDeployment).reconcile(0xc0036020c0, 0x1978f60, 0xc003ad2fc0, 0xc000a26b40, 0x2, 0x2) /nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/controller_driver.go:242 +0x2e4 fp=0xc001963c00 sp=0xc001963ab0 pc=0x13cc244 github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment.(*ReconcileDeployment).Reconcile(0xc000a26b40, 0x0, 0x0, 0xc003014440, 0x16, 0xc002103e00, 0x0, 0x0, 0x0) /nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller.go:324 +0x774 fp=0xc001963dd8 sp=0xc001963c00 pc=0x13d7354 github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.testReconcile(0xc000834d80, 0x19407a0, 0xc000a26b40, 0xc003014440, 0x16, 0x0) /nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:154 +0x5a fp=0xc001963e48 sp=0xc001963dd8 pc=0x14c181a github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.testReconcilePhase(0xc000834d80, 0x19407a0, 0xc000a26b40, 0x198a040, 0xc00376a080, 0xc003014440, 0x16, 0x0, 0x179520f, 0x7) /nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:164 +0x6f fp=0xc001963e90 sp=0xc001963e48 pc=0x14c1a2f github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.TestDeploymentController.func1.9.1.1(0xc000088300) /nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:575 +0x769 fp=0xc001963f68 sp=0xc001963e90 pc=0x14c5549 github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment_test.TestDeploymentController.func1.9.1.2(0xc0009a9b00) /nvme/gopath/src/github.com/intel/pmem-csi/pkg/pmem-csi-operator/controller/deployment/deployment_controller_test.go:584 +0x2a fp=0xc001963f80 sp=0xc001963f68 pc=0x14c56aa testing.tRunner(0xc0009a9b00, 0xc000d900b0) /nvme/gopath/go-1.15.2/src/testing/testing.go:1127 +0xef fp=0xc001963fd0 sp=0xc001963f80 pc=0x521a6f runtime.goexit() /nvme/gopath/go-1.15.2/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc001963fd8 sp=0xc001963fd0 pc=0x471081 created by testing.(*T).Run /nvme/gopath/go-1.15.2/src/testing/testing.go:1178 +0x386
The code had two problems. First, it stored pointers to the objects passed in by the caller, which is dangerous because those instances might be reused by the caller, in which case the content changes. In practice this doesn't seem to happen, but we can't know that. Second, it relied on receiving events in a certain order. The apiserver and (to some extent) the fake client may combine events, which apparently can also cause them to be delivered out-of-order. This has happened in practice and broke the test (intel#852). The solution is to de-duplicate after sorting. Because the order depends on what we compare against, it is done during the actual comparison together with de-duplication.
This is redundant when the scheduler extensions are active, but when those are disabled, then allowedTopologies prevents scheduling of pods onto nodes where the driver is not running. The central provisioner recovered from that by removing the bad "selected node" annotation. With distributed provisioning such a pod would get stuck forever. Testing of the scenario without scheduler extensions is covered by adding late binding tests to "operator-direct-production". Using an alternative driver name is covered by "operator-lvm-production".
The previous attempt with a fixed label did not work when downgrading to 0.8.0: pmem-csi-controller-0/[email protected]: I0118 11:38:34.703703 1 tracing.go:19] GRPC call: /csi.v1.Controller/CreateVolume pmem-csi-controller-0/[email protected]: I0118 11:38:34.703770 1 tracing.go:20] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"pmem-csi.intel.com/driver":"started"}}],"requisite":[{"segments":{"pmem-csi.intel.com/driver":"started"}}]},"capacity_range":{"required_bytes":115343360},"name":"pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0","parameters":{"eraseafter":"true"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]} pmem-csi-controller-0/[email protected]: I0118 11:38:34.704225 1 controllerserver-master.go:181] Controller CreateVolume: Name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0 required_bytes:115343360 limit_bytes:0 pmem-csi-controller-0/[email protected]: I0118 11:38:34.704251 1 controllerserver-master.go:86] Controller CreateVolume: Create VolumeID:pvc-26-328dad453a5df7f8348d57f9f7422e6b084a58303f5400af6f92906a based on name:pvc-269396e3-8ad6-4b4a-8296-9944cdd3acf0 pmem-csi-controller-0/[email protected]: W0118 11:38:34.704273 1 controllerserver-master.go:236] failed to connect to : No node registered with id: This attempt uses the node selector label for the PMEM-CSI node pod in the allowedTopologies field. Unfortunately, it also doesn't work: pmem-csi-intel-com-node-q94vv/[email protected]: I0118 14:46:54.540845 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"skew-4425", Name:"direct-testingv4nl9", UID:"fb94e3a9-62e5-44ec-b48b-33471452142c", APIVersion:"v1", ResourceVersion:"1204", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "pmem-pmem-csi-sc-ext4-skew-4425-unused": error generating accessibility requirements: selected node '"pmem-csi-pmem-govm-worker2"' topology 'map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2]' is not in allowed topologies: [map[feature.node.kubernetes.io/memory-nv.dax:true]]
Force-pushed from 2dd4288 to 7c4ee86.
Force-pushed from 4aee333 to 3a7d2cf.
I intentionally left #857 out of the current implementation. It's not going to be a problem for a while and can be addressed later.
verbs:
- get
- list
- watch
With these changes, the `ClusterRole` becomes a combination of permissions for the scheduler extensions and for the rescheduler.
I suspect putting "webhooks" into the name of the `ClusterRole` was a poor choice from the beginning; "controller" would have been better. I didn't change it in this commit because I wanted to keep it small, but I can do it in a separate cleanup commit.
@@ -3,12 +3,3 @@ kind: StorageClass
metadata:
  name: pmem-csi-sc
provisioner: pmem-csi.intel.com
# Setting allowedTopology ensures that pods using volumes
# with late binding only get scheduled to pmem
allowedTopologies:
@pohly Why should we remove `allowedTopologies` from the storage class?
Because it didn't work - see the commit message in 7c4ee86
If I am not wrong, that attempt was for using `{"pmem-csi.intel.com/driver": "started"}` as the allowed topology, wasn't it? But it should work for `{storage: "pmem"}` as before. And anyway, you are passing the same label for the controller's `--nodeSelector` here. Then why exactly is this `--nodeSelector` needed when we get the same info from the storage class?
> If I am not wrong, that attempt was for using `{"pmem-csi.intel.com/driver": "started"}` as the allowed topology.

That's the first attempt (a03a700). The second attempt was `{storage: pmem}`, which also didn't work.
> Then why exactly is this `--nodeSelector` needed when we get the same info from the storage class?
We can't use `AllowedTopology` at all at the moment, not with the currently released external-provisioner. Even if we can start using `AllowedTopology` with some future external-provisioner, the label check in kube-scheduler may still go wrong when labels get changed (less likely, but it can happen). The rescheduler then handles that case and thus remains useful.
@@ -48,6 +48,7 @@ func init() {

	/* Controller mode options */
	flag.StringVar(&config.schedulerListen, "schedulerListen", "", "controller: listen address (like :8000) for scheduler extender and mutating webhook, disabled by default")
	flag.Var(&config.nodeSelector, "nodeSelector", "controller: reschedule PVCs with a selected node where PMEM-CSI is not meant to run because the node does not have these labels (represented as JSON map)")
I am a bit confused about where exactly the provided `nodeSelector` is used in the code. I could only see that it is used to check whether the rescheduler should be started by the controller. But I couldn't find anywhere that the contents of the provided `nodeSelector` are used in the code. In that case, why can't it be a simple boolean value? @pohly Pardon me and please correct me if I missed something.
It's passed to the rescheduler here:
pmem-csi/pkg/pmem-csi-driver/pmem-csi-driver.go
Lines 219 to 223 in 3a7d2cf
pcp = newRescheduler(ctx,
	csid.cfg.DriverName,
	client, pvcInformer, scInformer, pvInformer, csiNodeLister,
	csid.cfg.nodeSelector,
	serverVersion.GitVersion)
Yes, but it is not used by the rescheduler?
And it should be used here:
pmem-csi/pkg/pmem-csi-driver/rescheduler.go
Lines 196 to 201 in 3a7d2cf
driverMightRun := true
for key, value := range node.Labels {
	if node.Labels[key] == value {
		driverMightRun = false
	}
}
... but isn't! Duh. That code is of course bogus. Not sure what I was thinking. I'll fix it.
I also need to add unit tests.
Done, see updated commit.
pkg/pmem-csi-driver/rescheduler.go (outdated)
driverMightRun := true
for key, value := range node.Labels {
	if node.Labels[key] == value {
I think this check is wrong. Is this where `pcp.nodeSelector` is supposed to be used?
Yes.
I've fixed the commit.
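A minimal standalone sketch of the corrected check (hypothetical helper; the actual fix is in the updated commit): a node only qualifies if it carries every label from the configured node selector, instead of comparing the node's labels against themselves.

```go
package main

import "fmt"

// driverMightRun is a hypothetical sketch of the corrected check: the
// driver can be expected to run on a node only if the node carries every
// label from the configured node selector (pcp.nodeSelector in the real
// code).
func driverMightRun(nodeLabels, nodeSelector map[string]string) bool {
	for key, value := range nodeSelector {
		if nodeLabels[key] != value {
			return false
		}
	}
	return true
}

func main() {
	nodeLabels := map[string]string{"storage": "pmem"}
	nodeSelector := map[string]string{"storage": "pmem"}
	fmt.Println(driverMightRun(nodeLabels, nodeSelector)) // true
}
```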
	}
}

reschedule := !driverMightRun && !driverIsRunning
Shouldn't we make sure that the selected node has enough PMEM capacity instead of just checking whether the driver is running on this node?
If the driver is running, then it will do that check itself.
Checking capacity here is redundant and racy (we check capacity, fail, and remove the selected node, while sometime later the driver processes the PVC and creates a volume although it is no longer assigned to the node).
In other words, the rescheduler may only handle the case where the driver definitely isn't running.
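Putting the two conditions from this thread together, a small sketch (hypothetical helper names; the real logic is in rescheduler.go):

```go
package main

import "fmt"

// shouldReschedule is a hypothetical sketch of the decision discussed in
// this thread: the rescheduler removes the "selected node" annotation only
// when the driver definitely isn't going to handle the PVC on that node,
// i.e. the node is not labeled for PMEM-CSI and the driver is not
// currently registered there. Capacity is intentionally not checked here;
// a running driver does that check itself.
func shouldReschedule(driverMightRun, driverIsRunning bool) bool {
	return !driverMightRun && !driverIsRunning
}

func main() {
	fmt.Println(shouldReschedule(false, false)) // true: safe to reschedule
	fmt.Println(shouldReschedule(true, false))  // false: driver may still start here
}
```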
When a PVC gets assigned to a node which has no PMEM-CSI running, whether it is because scheduler extensions are not enabled or there was a race while changing where to run the driver, then the new "descheduler" (= a stripped down external provisioner) unsets the "selected node" annotation. For the user this looks like this:

Events:
  Type     Reason                 Age              From                                                                                     Message
  ----     ------                 ----             ----                                                                                     -------
  Normal   ExternalProvisioning   2s (x2 over 2s)  persistentvolume-controller                                                              waiting for a volume to be created, either by external provisioner "pmem-csi.intel.com" or manually created by system administrator
  Normal   Provisioning           2s               pmem-csi.intel.com_pmem-csi-intel-com-controller-0_9e16d74d-c645-4478-9f0d-50db58a962ce  External provisioner is provisioning volume for claim "latebinding-7887/pvc-g7p97"
  Warning  ProvisioningFailed     2s               pmem-csi.intel.com_pmem-csi-intel-com-controller-0_9e16d74d-c645-4478-9f0d-50db58a962ce  failed to provision volume with StorageClass "pmem-pmem-csi-sc-ext4-latebinding-7887": reschedule PVC latebinding-7887/pvc-g7p97 because it is assigned to node pmem-csi-pmem-govm-master which has no PMEM-CSI driver
  Normal   WaitForPodScheduled    2s               persistentvolume-controller                                                              waiting for pod pvc-volume-tester-writer-latebinding-4pkpv to be scheduled
  Normal   Provisioning           2s               pmem-csi.intel.com_pmem-csi-intel-com-node-4gblz_23b46c1d-b5c3-4cfe-a2f9-b22c77e666c6    External provisioner is provisioning volume for claim "latebinding-7887/pvc-g7p97"
  Normal   ProvisioningSucceeded  2s               pmem-csi.intel.com_pmem-csi-intel-com-node-4gblz_23b46c1d-b5c3-4cfe-a2f9-b22c77e666c6    Successfully provisioned volume pvc-66a4a486-5782-4cf2-8b13-f13f66e30e19

For the sake of simplicity, RBAC rules and resource allocation are based on what they would have to be when running both webhook and rescheduler.
Force-pushed from 3a7d2cf to 2b24963.
Looks good to me
There was a test flake (?), filed as #858. I don't think that it was caused by this PR, so I am going to merge it despite that.
This changes the architecture of PMEM-CSI such that the controller part only implements the webhooks and provisioning happens on the nodes.
This PR is based on some pending changes (PR #825, PR #836) and uses the external-provisioner canary image because v2.1.0 has been tagged, but the versioned image is not available yet. Therefore the PR is ready for review and further testing, but merging should wait.
Fixes: #823, #733, #729, #594, #526, #507