Fix context issue during cleanup of kind clusters #6771

Open
wants to merge 1 commit into base: main
23 changes: 16 additions & 7 deletions ci/kind/kind-setup.sh
@@ -528,19 +528,28 @@ function destroy_external_servers {

function clean_kind {
echo "=== Cleaning up stale kind clusters ==="
read -a all_kind_clusters <<< $(kind get clusters)
for kind_cluster_name in "${all_kind_clusters[@]}"; do
creationTimestamp=$(kubectl get nodes --context kind-$kind_cluster_name -o json -l node-role.kubernetes.io/control-plane | \
jq -r '.items[0].metadata.creationTimestamp')
LOCK_FILE="$THIS_DIR/.kube/config.lock"
(
flock -x 200
Comment on lines +531 to +533
Contributor (@antoninbas):

@jainpulkit22 I'm a bit confused. At first I thought this would work because .kube/config.lock was also used by kind / kubectl. See https://github.com/kubernetes-sigs/kind/blob/1c5a56b30145d98a61c0538975ca0894ad65e0d0/pkg/cluster/internal/kubeconfig/internal/kubeconfig/lock.go#L27

But then I ran some tests, and it seems to me that flock and the Go code used to acquire the lock (os.OpenFile(lockName(filename), os.O_CREATE|os.O_EXCL, 0)) are not "compatible". For example, a bash script using flock to acquire an exclusive lock on ~/.kube/config.lock can execute concurrently with a kind command that acquires an exclusive lock on the same file using the code I linked to above.

This is the code I used for testing:

test.go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// these are from
// https://github.com/kubernetes/client-go/blob/611184f7c43ae2d520727f01d49620c7ed33412d/tools/clientcmd/loader.go#L439-L440

func lockFile(filename string) error {
	// Make sure the dir exists before we try to create a lock file.
	dir := filepath.Dir(filename)
	if _, err := os.Stat(dir); os.IsNotExist(err) {
		if err = os.MkdirAll(dir, 0755); err != nil {
			return err
		}
	}
	f, err := os.OpenFile(lockName(filename), os.O_CREATE|os.O_EXCL, 0)
	if err != nil {
		return err
	}
	f.Close()
	return nil
}

func unlockFile(filename string) error {
	return os.Remove(lockName(filename))
}

func lockName(filename string) string {
	return filename + ".lock"
}

func main() {
	if err := lockFile("/tmp/.clusters"); err != nil {
		panic("Failed to lock")
	}
	fmt.Println("Sleeping...")
	time.Sleep(10 * time.Second)
	fmt.Println("Done")
	if err := unlockFile("/tmp/.clusters"); err != nil {
		panic("Failed to unlock")
	}
}
test.sh
(
    flock -x 200
    echo "SLEEPING..."
    sleep 10
    echo "DONE"
) 200>>/tmp/.clusters.lock

Could you confirm that you tested this approach successfully? Maybe there is an issue in my test programs.

If the 2 locking schemes are not compatible, I don't think this patch will have the intended effect, as we can have a concurrent call to kind create cluster which will mutate the Kubeconfig file?

It also seems that because flock will create the file and not delete it, kind will fail to acquire the lock later (file already exists). I tried and got this error from kind:

ERROR: failed to create cluster: failed to lock config file: open /Users/abas/.kube/config.lock: file exists

Finally, even if the locking schemes were compatible, I am not sure that the current version would be sufficient, as kind create cluster will only acquire the lock while it's writing the config: therefore, maybe we could still have a context in the config for which the control plane is not ready yet?

Contributor Author (@jainpulkit22):

I have explicitly added the command to remove the lock file. Other than that, it seems compatible to me, because when I ran both scripts together locally, the create script waited for the lock to be released by the delete script.

Regarding your second question, that we could still have a context in the config for which the control plane is not ready yet: that won't be a problem for us, because the creationTimestamp command is ORed with true (|| true), so the script won't fail and will run as usual.

Contributor (@antoninbas):

I have explicitly added the command to remove the lock file. Other than that, it seems compatible to me, because when I ran both scripts together locally, the create script waited for the lock to be released by the delete script.

Please share your steps; this is not what I observe in my case. test.sh and test.go can execute concurrently (test.sh is meant to represent the cleanup step in kind-setup.sh here, using the same flock command).

Contributor Author (@jainpulkit22):

I didn't use test.go for my local testing. I did the following:
I created a test.sh script similar to the one above, executed the kind create cluster command in one shell and the test.sh script in another shell, and didn't get any error.
I did this because in the actual test scenario we are also just making calls to the kind create cluster command.

Contributor (@antoninbas):

I created a test.sh script similar to the one above, executed the kind create cluster command in one shell and the test.sh script in another shell, and didn't get any error.

That doesn't show anything.
You would perhaps need the following test:

  1. test.sh with a long sleep while the exclusive lock is held (a few minutes maybe)
  2. try running kind create cluster in a second terminal

If kind create cluster can succeed before test.sh releases the lock, then you know it's not working.

Contributor Author (@jainpulkit22):

I tested the same with a long sleep. The kind create cluster command didn't succeed before test.sh, but the problem was that it didn't even wait for the lock to be released; it just exited.
I agree with you that flock does not seem to be compatible with the Go locking mechanism.

One solution which I can see working here is to use flock before invoking any kind create or delete command; flock would then wait for the lock to be released, and we could then run the kind commands successfully.

What's your suggestion on this, @antoninbas?

Contributor (@antoninbas):

Yes, you would get the ERROR: failed to create cluster: failed to lock config file: open /Users/abas/.kube/config.lock: file exists error. It's a different mechanism from flock. In Kind, the presence of the lock file means that someone else owns the config file, and it triggers an immediate failure. Acquiring the lock simply means creating the file, and releasing the lock means deleting the file.

flock also does not remove the file when the lock is released, so Kind will never succeed after flock has been used once.

My suggestion is still to have our own lock file as I initially proposed (I also still think it may be a good idea to have our own .clusters file), and wrap all Kind commands with flock. This seems to be what you are describing as well. Trying to use .kube/config.lock with flock is guaranteed to fail by preventing Kind from ever succeeding.
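
A minimal sketch of that wrapper idea, assuming a dedicated lock file that every CI job agrees to use (the path $THIS_DIR/.clusters.lock below is a hypothetical name, distinct from ~/.kube/config.lock):

# Sketch only: serialize all kind invocations on our own lock file,
# not on ~/.kube/config.lock. The lock file path is an assumption.
CLUSTERS_LOCK_FILE="$THIS_DIR/.clusters.lock"

function run_kind_locked {
  (
    # Block until the exclusive lock on fd 200 can be acquired.
    flock -x 200
    kind "$@"
  ) 200>>"$CLUSTERS_LOCK_FILE"
  # The lock file is left in place on purpose: flock only needs the file
  # to exist, and deleting it here could race with another lock holder.
}

# Example usage:
# run_kind_locked create cluster --name "$cluster_name"
# run_kind_locked delete cluster --name "$cluster_name"

With this pattern, a concurrent cleanup job and a cluster-creation job both block on the same flock, rather than relying on Kind's own O_EXCL lock file.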


for context in $(kubectl config get-contexts -o name | grep 'kind-'); do
cluster_name=$(echo $context | sed 's/^kind-//')
creationTimestamp=$(kubectl get nodes --context $context -o json -l node-role.kubernetes.io/control-plane | \
jq -r '.items[0].metadata.creationTimestamp' || true)
if [[ -z $creationTimestamp ]]; then
continue
fi
creation=$(printUnixTimestamp "$creationTimestamp")
now=$(date -u '+%s')
diff=$((now-creation))
timeout=$(($UNTIL_TIME_IN_MINS*60))
if [[ $diff -gt $timeout ]]; then
echo "=== kind ${kind_cluster_name} present from more than $UNTIL_TIME_IN_MINS minutes ==="
kind delete cluster --name $kind_cluster_name
echo "=== Cluster ${cluster_name} is present for more than $UNTIL_TIME_IN_MINS minutes ==="
kind delete cluster --name $cluster_name
fi
done
done
)200>>"$LOCK_FILE"
rm -rf $LOCK_FILE
Contributor (@antoninbas):

This is not a good idea IMO. It feels like there can be a race condition where we delete the file while another job is holding the lock?

Contributor Author (@jainpulkit22), Jan 30, 2025:

But until we delete config.lock, the other process will not be able to acquire the lock, and it may panic.
Also, since flock is not compatible with the Go-based locking mechanism, what's your opinion: should we use the alternative approach of writing the cluster names to a file and acquiring a lock over that, or should we move ahead with acquiring a lock over the kubeconfig and use the approach below:

Before invoking the kind create cluster command, we can use flock to wait for the other process to release the lock, and then trigger the command. In that case flock will not interfere with the Go-based locking mechanism; it's just that we will be introducing a couple of extra flock calls in the code.

@antoninbas what's your opinion on this?

Contributor (@antoninbas):

We should not use ~/.kube/config.lock, I think that's a given. We can pretend it doesn't exist.
So we need our own lock file. After that, we have 2 options, and you can choose which one you want to use:

  1. have our own state file to keep track of cluster names and creation timestamps (which I was originally proposing)
  2. only rely on kubectl / kind, and do not introduce our own state file. With this option we use flock to acquire a lock before calling kubectl / kind, as appropriate

I think both approaches will work.
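
A minimal sketch of option 1, assuming a dedicated state file that records one "name timestamp" line per cluster; the file names ($THIS_DIR/.clusters and $THIS_DIR/.clusters.lock) are made up for illustration:

# Sketch only: option 1, our own state file plus our own lock file.
CLUSTERS_FILE="$THIS_DIR/.clusters"
CLUSTERS_LOCK_FILE="$THIS_DIR/.clusters.lock"

function register_cluster {
  # Record "<cluster-name> <creation-unix-timestamp>" under an exclusive lock.
  (
    flock -x 200
    echo "$1 $(date -u '+%s')" >> "$CLUSTERS_FILE"
  ) 200>>"$CLUSTERS_LOCK_FILE"
}

function clean_stale_clusters {
  local timeout=$((UNTIL_TIME_IN_MINS*60))
  (
    flock -x 200
    [[ -f "$CLUSTERS_FILE" ]] || exit 0
    now=$(date -u '+%s')
    remaining=""
    while read -r name creation; do
      if [[ $((now-creation)) -gt $timeout ]]; then
        # Stale cluster: delete it and drop its entry from the state file.
        kind delete cluster --name "$name"
      else
        remaining+="$name $creation"$'\n'
      fi
    done < "$CLUSTERS_FILE"
    printf '%s' "$remaining" > "$CLUSTERS_FILE"
  ) 200>>"$CLUSTERS_LOCK_FILE"
}

This avoids reading cluster state out of the kubeconfig entirely, so ~/.kube/config.lock never comes into play and Kind's own locking keeps working unchanged.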

}

if ! command -v kind &> /dev/null