
Auto-detect the default BYOH interface for hosts #477

Draft · wants to merge 1 commit into main

Conversation

@VibhorChinda (Contributor)

What this PR does / why we need it:
Makes changes in the codebase to implement auto-detection of the default BYOH network interface for hosts.

Which issue(s) this PR fixes (optional, in `fixes #<issue number>` format; will close the issue(s) when the PR gets merged):
Fixes #466

Additional information
Changed the template files [deleted the vim_interface env variable]
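For context, a common way to auto-detect the default network interface on Linux is to parse /proc/net/route and pick the interface whose destination is 00000000 (the default route). The sketch below shows that idea in Go; it is an assumed approach for illustration, not necessarily the implementation in this PR, and the parsing is done over a literal string so the example is self-contained.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// defaultInterface returns the name of the interface carrying the default
// route, given the contents of /proc/net/route. The default route is the
// row whose Destination column is 00000000.
func defaultInterface(routeTable string) (string, error) {
	scanner := bufio.NewScanner(strings.NewReader(routeTable))
	scanner.Scan() // skip the header line
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Columns: Iface, Destination, Gateway, Flags, ...
		if len(fields) >= 2 && fields[1] == "00000000" {
			return fields[0], nil
		}
	}
	return "", fmt.Errorf("no default route found")
}

func main() {
	// In an agent this would read from /proc/net/route; a literal table
	// keeps the sketch self-contained.
	routes := "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\n" +
		"eth0\t00000000\t0100A8C0\t0003\t0\t0\t100\t00000000\n" +
		"eth0\t0000A8C0\t00000000\t0001\t0\t0\t100\t00FFFF00\n"
	iface, err := defaultInterface(routes)
	fmt.Println(iface, err)
}
```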

Special notes for your reviewer

@vmwclabot

@VibhorChinda, you must sign our contributor license agreement before your changes are merged. Click here to sign the agreement. If you are a VMware employee, read this for further instruction.

@VibhorChinda (Contributor, Author)

@anusha94 @sachinkumarsingh092 have a look :)

@dharmjit (Contributor)

dharmjit commented Apr 6, 2022

Thanks for working on this issue @VibhorChinda, I guess there is some code as well to handle the substitution of the default network interface name which could be removed.

@anusha94 We could also look at whether to keep the TemplateParser in cloudinit package

@sachinkumarsingh092 (Contributor)

We also need to handle the logic for manual detection in agent/registration/host_registrar.go, as discussed in the issue.

@VibhorChinda (Contributor, Author)

VibhorChinda commented Apr 6, 2022

We also need to handle the logic for manual detection in agent/registration/host_registrar.go, as discussed in the issue.

By handling the logic, @sachinkumarsingh092, do you mean I should delete this function, since detection will now be automatic and there is no need for manual detection?

@VibhorChinda (Contributor, Author)

Thanks for working on this issue @VibhorChinda, I guess there is some code as well to handle the substitution of the default network interface name which could be removed.

* [TemplateParser Initialization](https://github.com/vmware-tanzu/cluster-api-provider-bringyourownhost/blob/main/agent/main.go#L109-L122)

@anusha94 We could also look at whether to keep the TemplateParser in cloudinit package

ok @dharmjit will remove this function

@anusha94 (Contributor)

anusha94 commented Apr 6, 2022

We could also look at whether to keep the TemplateParser in cloudinit package

Yup, agreed. We should remove all code that is unused. Bonus - these didn't have unit tests either :)

@anusha94 (Contributor)

anusha94 commented Apr 6, 2022

@VibhorChinda Please rebase from main - we just fixed the golangci-lint check that is failing

@sachinkumarsingh092 (Contributor)

I don't see any other utility of GetNetworkStatus() and UpdateNetwork() apart from updating the network status, which will be done by kube-vip anyway. So I think it should be safe to remove these. @anusha94 @dharmjit wdyt?

@anusha94 (Contributor)

anusha94 commented Apr 6, 2022

@sachinkumarsingh092 I'm all for removing code as long as the pipeline remains green!

@VibhorChinda (Contributor, Author)

I read all the comments and tried to carry out the required changes, but I guess it won't be a simple process, because the functions that have to be removed are being called elsewhere in the codebase :|| For example:

  1. @sachinkumarsingh092 Register() -> UpdateNetwork() -> GetNetworkStatus()
    The above is the call hierarchy, so removing UpdateNetwork() -> GetNetworkStatus() would cause errors in the Register() function.

  2. @dharmjit setupTemplateParser() is being called here to set up the value for TemplateParser. Should I pass nil here if I remove the setupTemplateParser function?

  3. @dharmjit The ParseTemplate() function in template_parser.go is being called here,
    and it has more call references too :((
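The dependency in point 1 can be sketched like this. The types and signatures below are hypothetical stand-ins for illustration, not the real host_registrar.go code:

```go
package main

import "fmt"

// HostRegistrar is a hypothetical stand-in for the agent's registrar type.
type HostRegistrar struct{}

// GetNetworkStatus is a stand-in for the real network-status lookup.
func (r *HostRegistrar) GetNetworkStatus() []string { return []string{"eth0"} }

// UpdateNetwork calls GetNetworkStatus, mirroring the described hierarchy.
func (r *HostRegistrar) UpdateNetwork() []string { return r.GetNetworkStatus() }

// Register calls UpdateNetwork, so deleting UpdateNetwork (or
// GetNetworkStatus) without also editing Register breaks compilation here.
func (r *HostRegistrar) Register() []string { return r.UpdateNetwork() }

func main() {
	r := &HostRegistrar{}
	fmt.Println(r.Register())
}
```

This is why removing a callee always means updating every caller up the chain, rather than passing nil.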

@anusha94 (Contributor)

anusha94 commented Apr 6, 2022

@VibhorChinda if / when you remove a function, it is better to remove the references too instead of passing nil.

@VibhorChinda force-pushed the autoDetect_BYOH_Host branch 2 times, most recently from ad59020 to 927235d on April 7, 2022 08:23
@VibhorChinda (Contributor, Author)

@anusha94 @sachinkumarsingh092 @dharmjit is there any way to run the tests locally?

For CAPI, I used to run `make verify-modules` and `make test` to reproduce the tests locally, and I could then make the changes needed to get them to pass :))

@sachinkumarsingh092 (Contributor)

We also follow CAPI style make targets, albeit not that extensively yet. So if you see it in the Makefile, we have a target for test, lint, test-e2e etc. which are used in the GitHub workflow and you can use them to test locally.
You can find all the make targets used during the workflow in .github/workflows/*.

@VibhorChinda (Contributor, Author)

We also follow CAPI style make targets, albeit not that extensively yet. So if you see it in the Makefile, we have a target for test

Thanks, will do from now on :))
Btw, can you trigger the tests here for the changes I made now?

@anusha94 (Contributor)

anusha94 commented Apr 7, 2022

@VibhorChinda

Word of caution. At the moment, our tests (unit / integration / e2e) run only on Linux - more specifically Ubuntu 20.04

@dharmjit dharmjit marked this pull request as draft April 8, 2022 03:10
@VibhorChinda (Contributor, Author)

Hey everyone, I was working on this yesterday, but I pretty much can't figure out why my tests are failing. Some help would be good :))
@dharmjit @anusha94 @sachinkumarsingh092

@sachinkumarsingh092 (Contributor)

@VibhorChinda as you change the functionality, you should also change the corresponding tests.

Your CI fails because you haven't removed the test spec for the network status here after you removed the functionality, though I'm a bit skeptical about removing such a test. But for now, just try this.

@VibhorChinda (Contributor, Author)

@VibhorChinda as you change the functionality, you should also change the corresponding tests.

Your CI fails because you haven't removed the test spec for the network status here after you removed the functionality, though I'm a bit skeptical about removing such a test. But for now, just try this.

@sachinkumarsingh092 what you said totally makes sense. My CI failed because of this.

But I guess the e2e tests should have passed (they should create a workload cluster with a single BYOH host regardless of the change, right?)

@codecov-commenter

codecov-commenter commented Apr 8, 2022

Codecov Report

Merging #477 (3092621) into main (783badf) will increase coverage by 0.07%.
The diff coverage is n/a.

❗ Current head 3092621 differs from pull request most recent head ef48d7f. Consider uploading reports for the commit ef48d7f to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #477      +/-   ##
==========================================
+ Coverage   67.11%   67.19%   +0.07%     
==========================================
  Files          22       22              
  Lines        1721     1719       -2     
==========================================
  Hits         1155     1155              
+ Misses        494      492       -2     
  Partials       72       72              
Impacted Files Coverage Δ
agent/main.go 16.10% <0.00%> (+0.21%) ⬆️

@sachinkumarsingh092 (Contributor)

All right! We're getting close. I can narrow down the e2e failures to something related to the control plane. Can you please go through the particular tests that are failing in /test/e2e/e2e_test.go and try to find the exact cause of this? We can then try for a solution.

@VibhorChinda (Contributor, Author)

Hey @sachinkumarsingh092, I was trying to debug this, but the problem is I am not able to replicate the tests locally in my environment :||

And I am not getting anything substantial here to move ahead with; we know which test is failing, but why is still a question :((

@VibhorChinda (Contributor, Author)

At last, I was able to run the tests locally :))))

Found the following error message:

Failed to generate the manifest for "infrastructure-byoh" / "v0.1.0"
Unexpected error:
<*errors.withStack | 0xc00058c330>: {
error: <*errors.withMessage | 0xc000b76160>{
cause: <*errors.withStack | 0xc00058c2b8>{
error: <*exec.Error | 0xc000b76100>{
Name: "kustomize",
Err: <*errors.errorString | 0xc00007c3d0>{
s: "executable file not found in $PATH",
},
},
stack: [0x15da487, 0x1af30fa, 0x1af22c8, 0x1b31eab, 0x1b35ed0, 0x4d1e25, 0x4d13a5, 0x9b0cfa, 0x9aec3a, 0x9ae605, 0x9afefe, 0x9afb32, 0x9bf1e7, 0x9bed93, 0x9c1052, 0x9cc645, 0x9cc46a, 0x1b31a65, 0x515602, 0x46b181],
},
msg: "failed to execute kustomize: ",
},
stack: [0x1af3191, 0x1af22c8, 0x1b31eab, 0x1b35ed0, 0x4d1e25, 0x4d13a5, 0x9b0cfa, 0x9aec3a, 0x9ae605, 0x9afefe, 0x9afb32, 0x9bf1e7, 0x9bed93, 0x9c1052, 0x9cc645, 0x9cc46a, 0x1b31a65, 0x515602, 0x46b181],
}
failed to execute kustomize: : exec: "kustomize": executable file not found in $PATH
occurred

Can we get anything from this, @sachinkumarsingh092?

@sachinkumarsingh092 (Contributor)

sachinkumarsingh092 commented Apr 14, 2022

It seems that either you don't have kustomize installed or haven't included its path in the PATH variable. Check this for installation.
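The failure above can be reproduced in isolation: clusterctl shells out to kustomize, and Go's exec.LookPath is what produces the "executable file not found in $PATH" error seen in the stack trace. A minimal sketch of that check, using only the standard library:

```go
package main

import (
	"fmt"
	"os/exec"
)

// lookup reports whether a binary is resolvable via $PATH; when it is not,
// the returned error carries the same "executable file not found in $PATH"
// message that appeared in the e2e failure above.
func lookup(name string) error {
	_, err := exec.LookPath(name)
	return err
}

func main() {
	if err := lookup("kustomize"); err != nil {
		fmt.Println("kustomize missing:", err)
	} else {
		fmt.Println("kustomize found in $PATH")
	}
}
```

Running this before the e2e suite is a quick way to confirm the tool is installed and visible to the test process.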

@VibhorChinda (Contributor, Author)

Yes @sachinkumarsingh092, you were right, kustomize was not installed on my setup.

The e2e tests run now.

All three tests are failing due to a timeout error.

After running `journalctl -xeu kubelet` on the "test-kpd8d6-control-plane" node, I found this in the logs:

Apr 18 13:13:44 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:44.941812 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: W0418 13:13:45.030312 252 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.CSIDriver: Get "https://test-kpd8d6-control-plane:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp [fc00:f853:ccd:e793::2]:6443: connect: connection refused
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.030349 252 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: Get "https://test-kpd8d6-control-plane:6443/apis/storage.k8s.io/v1/csidrivers?limit=500&resourceVersion=0": dial tcp [fc00:f853:ccd:e793::2]:6443: connect: connection refused
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: I0418 13:13:45.037951 252 kubelet_node_status.go:70] "Attempting to register node" node="test-kpd8d6-control-plane"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.042033 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.113234 252 kubelet_node_status.go:92] "Unable to register node with API server" err="Post "https://test-kpd8d6-control-plane:6443/api/v1/nodes\": dial tcp [fc00:f853:ccd:e793::2]:6443: connect: connection refused" node="test-kpd8d6-control-plane"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.143043 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.243539 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.344310 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.444753 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.545208 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.645976 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.747025 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.847683 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:45 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:45.948505 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:46.049135 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:46.149754 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:46.250126 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:46.350683 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:46.451079 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:46.551981 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:46.652801 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: I0418 13:13:46.714089 252 kubelet_node_status.go:70] "Attempting to register node" node="test-kpd8d6-control-plane"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:46.753850 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:46.854369 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:46 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:46.954845 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:47 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:47.054910 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:47 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:47.155551 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:47 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:47.256032 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:47 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:47.323903 252 nodelease.go:49] "Failed to get node when trying to set owner ref to the node lease" err="nodes "test-kpd8d6-control-plane" not found" node="test-kpd8d6-control-plane"
Apr 18 13:13:47 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:47.356466 252 kubelet.go:2422] "Error getting node" err="node "test-kpd8d6-control-plane" not found"
Apr 18 13:13:47 test-kpd8d6-control-plane kubelet[252]: I0418 13:13:47.421280 252 kubelet_node_status.go:73] "Successfully registered node" node="test-kpd8d6-control-plane"
Apr 18 13:13:47 test-kpd8d6-control-plane kubelet[252]: I0418 13:13:47.513246 252 apiserver.go:52] "Watching apiserver"
Apr 18 13:13:47 test-kpd8d6-control-plane kubelet[252]: I0418 13:13:47.760265 252 reconciler.go:157] "Reconciler: start to sync state"
Apr 18 13:13:48 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:48.608560 252 kubelet.go:2347] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

I can see that the network plugin is not being installed on the control plane node, which makes it unreachable, and all the tests end in a timeout error.

What do you think?

@VibhorChinda (Contributor, Author)

PS: Sorry for the delayed replies, it takes me some time to make headway. Still a beginner :))

@sachinkumarsingh092 (Contributor)

sachinkumarsingh092 commented Apr 18, 2022

@VibhorChinda a small tip: whenever giving a code block or some logs, put them inside a code block like below. It makes it easier to read and understand the logs :)

.....
Apr 18 13:13:47 test-kpd8d6-control-plane kubelet[252]: I0418 13:13:47.513246 252 apiserver.go:52] "Watching apiserver"
Apr 18 13:13:47 test-kpd8d6-control-plane kubelet[252]: I0418 13:13:47.760265 252 reconciler.go:157] "Reconciler: start to sync state"
Apr 18 13:13:48 test-kpd8d6-control-plane kubelet[252]: E0418 13:13:48.608560 252 kubelet.go:2347] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

The last line seems to suggest that you don't have any CNI plugins installed. The kindnet CNI plugins are added during the e2e test in the Makefile. So this should've worked. Can you confirm that the directory /etc/cni/net.d exists in test-kpd8d6-control-plane and has the config files for the CNI?

Alternatively, can you try installing a CNI and then confirm that it's working. That'd mean it's a problem while installing the CNI. You can use the following to install it:

kubectl apply -f test/e2e/data/cni/kindnet/kindnet.yaml


PS: Sorry for the delayed replies, it takes me some time to make headway. Still a beginner :))

Don't worry about it. Take your time. We're all learning here :)

@VibhorChinda (Contributor, Author)

@VibhorChinda a small tip: whenever giving a code block or some logs, put them inside a code block like below. It makes it easier to read and understand the logs :)

Makes sense, will take care of this from now on ✌️

@VibhorChinda (Contributor, Author)

The last line seems to suggest that you don't have any CNI plugins installed. The kindnet CNI plugins are added during the e2e test in the Makefile. So this should've worked. Can you confirm that the directory /etc/cni/net.d exists in test-kpd8d6-control-plane and has the config files for the CNI?

Yep @sachinkumarsingh092, that directory exists.

@VibhorChinda (Contributor, Author)

I tried installing the CNI using `kubectl apply -f test/e2e/data/cni/kindnet/kindnet.yaml` on the control plane node, but it didn't work, as there was no such path on the control plane node.

On the other hand, I tried running the command on the whole project, not specifically inside the control plane node, as I can see the same path in the codebase. But that also failed due to some port error :))

@vmwclabot

@VibhorChinda, VMware has approved your signed contributor license agreement.

@sachinkumarsingh092 (Contributor)

Sorry for the late reply @VibhorChinda. Got caught up in a few things.

But It didn't work, as there was no such path on the control plane node.

Yes, that's because the control plane doesn't import the repository.

But that also failed due to some port error

Can you share the logs?

@VibhorChinda (Contributor, Author)

Hey @sachinkumarsingh092, sharing the logs may take some time; my Ubuntu workstation seems to have some issues.
But I will do it soon.

@anusha94 (Contributor)

anusha94 commented Jun 9, 2022

@VibhorChinda Any more updates on this PR?

@VibhorChinda (Contributor, Author)

Hey @anusha94 no updates as such.

PS: Sorry for blocking the issue/PR for so long.
