Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite detached actor test with go #2722

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

owenowenisme
Copy link
Contributor

@owenowenisme owenowenisme commented Jan 9, 2025

Why are these changes needed?

Gradually moving test from python to go for better k8s support.

Related issue number

Closes #2510

Checks

make test-e2e
test -s /home/owenowenisme/kuberay/ray-operator/bin/controller-gen || GOBIN=/home/owenowenisme/kuberay/ray-operator/bin go install sigs.k8s.io/controller-tools/cmd/[email protected]
/home/owenowenisme/kuberay/ray-operator/bin/controller-gen "crd:maxDescLen=0,generateEmbeddedObjectMeta=true,allowDangerousTypes=true" rbac:roleName=kuberay-operator webhook paths="./..." output:crd:artifacts:config=config/crd/bases
go fmt ./...
go vet ./...
go test -timeout 30m -v ./test/e2e
=== RUN   TestRayClusterGCSFaultTolerence
    raycluster_gcs_ft_test.go:22: Creating Cluster for GCSFaultTolerence testing.
    raycluster_gcs_ft_test.go:25: Successfully applied testdata/ray-cluster.ray-ft.yaml to namespace test-ns-cv79v
    raycluster_gcs_ft_test.go:31: Waiting for RayCluster test-ns-cv79v/raycluster-external-redis to become ready
=== RUN   TestRayClusterGCSFaultTolerence/Test_Detached_Actor
=== NAME  TestRayClusterGCSFaultTolerence
    raycluster_gcs_ft_test.go:39: HeadPod Name: raycluster-external-redis-head-p82r5
    raycluster_gcs_ft_test.go:42: Ray namespace: testing-ray-namespace
    core.go:88: Executing command: [python samples/test_detached_actor_1.py testing-ray-namespace]
    core.go:101: Command stdout: val1: 1, val2: 2
    core.go:102: Command stderr: 2025-01-08 20:39:43,145        INFO worker.py:1405 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
        2025-01-08 20:39:43,145 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.244.0.202:6379...
        2025-01-08 20:39:43,167 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://10.244.0.202:8265 
    core.go:88: Executing command: [pkill gcs_server]
    core.go:101: Command stdout: 
    core.go:102: Command stderr: 
    core.go:88: Executing command: [python samples/test_detached_actor_2.py testing-ray-namespace 3]
    core.go:101: Command stdout: Try to connect to Ray cluster.
        retry iter: 0
        Get TestCounter actor.
        retry iter: 0
        Try to call remote function 'increment'.
        retry iter: 0
        val: 3
    core.go:102: Command stderr: SIGTERM handler is not set because current thread is not the main thread.
    core.go:88: Executing command: [python samples/test_detached_actor_2.py testing-ray-namespace 4]
    core.go:101: Command stdout: Try to connect to Ray cluster.
        retry iter: 0
        Get TestCounter actor.
        retry iter: 0
        Try to call remote function 'increment'.
        retry iter: 0
        val: 4
    core.go:102: Command stderr: SIGTERM handler is not set because current thread is not the main thread.
    test.go:110: Retrieving Pod Container test-ns-cv79v/raycluster-external-redis-head-p82r5/ray-head logs
    test.go:98: Creating ephemeral output directory as KUBERAY_TEST_OUTPUT_DIR env variable is unset
    test.go:101: Output directory has been created at: /tmp/TestRayClusterGCSFaultTolerence3488354341/001
    test.go:110: Retrieving Pod Container test-ns-cv79v/raycluster-external-redis-small-group-worker-2md8t/ray-worker logs
    test.go:110: Retrieving Pod Container test-ns-cv79v/redis-c676b45dc-ccxcr/redis logs
--- PASS: TestRayClusterGCSFaultTolerence (97.16s)
    --- PASS: TestRayClusterGCSFaultTolerence/Test_Detached_Actor (56.39s)
PASS
ok      github.com/ray-project/kuberay/ray-operator/test/e2e    97.218s
  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421
Copy link
Member

@rueian would you mind taking the first pass as you are also working on GCS FT API?

@kevin85421 kevin85421 self-assigned this Jan 9, 2025
@rueian
Copy link
Contributor

rueian commented Jan 9, 2025

Hi @owenowenisme,

This generally looks good to me, but unfortunately, we use applyconfiguration in the test/e2e folder.

Would you mind migrating the current approach to the applyconfiguration approach? There is an example that just got merged:

func TestRayClusterGCSFT(t *testing.T) {
test := With(t)
g := NewWithT(t)
// Create a namespace
namespace := test.NewTestNamespace()
_, err := test.Client().Core().AppsV1().Deployments(namespace.Name).Apply(
test.Ctx(),
appsv1ac.Deployment("redis", namespace.Name).
WithSpec(appsv1ac.DeploymentSpec().
WithReplicas(1).
WithSelector(metav1ac.LabelSelector().WithMatchLabels(map[string]string{"app": "redis"})).
WithTemplate(corev1ac.PodTemplateSpec().
WithLabels(map[string]string{"app": "redis"}).
WithSpec(corev1ac.PodSpec().
WithContainers(corev1ac.Container().
WithName("redis").
WithImage("redis:7.4").
WithPorts(corev1ac.ContainerPort().WithContainerPort(6379)),
),
),
),
),
TestApplyOptions,
)

And the configmap for scripts can be created with the newConfigMap helper, for example: newConfigMap(namespace.Name, files(test, "counter.py", "fail.py")).

@owenowenisme owenowenisme force-pushed the rewrite-detached-actor-test-with-go branch from 5451939 to 4a952cb Compare January 9, 2025 12:44
@@ -0,0 +1,267 @@
apiVersion: ray.io/v1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ray-cluster.ray-ft.yaml can be deleted.

@owenowenisme
Copy link
Contributor Author

owenowenisme commented Jan 10, 2025

Hi @owenowenisme,

This generally looks good to me, but unfortunately, we use applyconfiguration in the test/e2e folder.

Would you mind migrating the current approach to the applyconfiguration approach? There is an example that just got merged:

func TestRayClusterGCSFT(t *testing.T) {
test := With(t)
g := NewWithT(t)
// Create a namespace
namespace := test.NewTestNamespace()
_, err := test.Client().Core().AppsV1().Deployments(namespace.Name).Apply(
test.Ctx(),
appsv1ac.Deployment("redis", namespace.Name).
WithSpec(appsv1ac.DeploymentSpec().
WithReplicas(1).
WithSelector(metav1ac.LabelSelector().WithMatchLabels(map[string]string{"app": "redis"})).
WithTemplate(corev1ac.PodTemplateSpec().
WithLabels(map[string]string{"app": "redis"}).
WithSpec(corev1ac.PodSpec().
WithContainers(corev1ac.Container().
WithName("redis").
WithImage("redis:7.4").
WithPorts(corev1ac.ContainerPort().WithContainerPort(6379)),
),
),
),
),
TestApplyOptions,
)

And the configmap for scripts can be created with the newConfigMap helper, for example: newConfigMap(namespace.Name, files(test, "counter.py", "fail.py")).

@rueian
May I ask why does newConfigmap have to use jobs as its name?

func newConfigMap(namespace string, options ...option[corev1ac.ConfigMapApplyConfiguration]) *corev1ac.ConfigMapApplyConfiguration {
cmAC := corev1ac.ConfigMap("jobs", namespace).
WithBinaryData(map[string][]byte{}).
WithImmutable(true)
return configMapWith(cmAC, options...)
}

@owenowenisme owenowenisme requested a review from rueian January 10, 2025 17:04
@rueian
Copy link
Contributor

rueian commented Jan 10, 2025

May I ask why does newConfigmap have to use jobs as its name?

I think there is no special reason for that. It just a name.

Comment on lines 37 to 46
redisCM := corev1ac.ConfigMap("redis-config", namespace.Name).WithLabels(map[string]string{"app": "redis"}).
WithData(map[string]string{
"redis.conf": `dir /data
port 6379
bind 0.0.0.0
appendonly yes
protected-mode no
requirepass 5241590000000000
pidfile /data/redis-6379.pid`,
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentions make the configuration hard to read, but I think we should have better started the Redis without this configuration map. It is quite irrelevant to the test, but it takes too many lines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing we need is requirepass 5241590000000000. You may take this as an example to start a redis without the configmap:
image

@owenowenisme owenowenisme requested a review from rueian January 11, 2025 01:43
).
WithWorkerGroupSpecs(rayv1ac.WorkerGroupSpec().
WithRayStartParams(map[string]string{
"redis-password": "5241590000000000",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worker group doesn't need to set redis-password.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but it seems that somewhere else doesn't accept nil RayStartParams?

 Message: "RayCluster.ray.io \"raycluster-gcsft\" is invalid: [spec.workerGroupSpecs[1].rayStartParams: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]",
                    Reason: "Invalid",
                    Details: {
                        Name: "raycluster-gcsft",
                        Group: "ray.io",
                        Kind: "RayCluster",
                        UID: "",
                        Causes: [
                            {
                                Type: "FieldValueRequired",
                                Message: "Required value",
                                Field: "spec.workerGroupSpecs[1].rayStartParams",
                            },

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use "num-cpus": "1" instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

WithRayVersion(rayVersion).
WithHeadGroupSpec(rayv1ac.HeadGroupSpec().
WithRayStartParams(map[string]string{
"dashboard-host": "0.0.0.0",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"dashboard-host": "0.0.0.0",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

testScriptCM, err := test.Client().Core().CoreV1().ConfigMaps(namespace.Name).Apply(test.Ctx(), testScriptAC, TestApplyOptions)
g.Expect(err).NotTo(HaveOccurred())

redisDM := appsv1ac.Deployment("redis", namespace.Name).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks similar to

appsv1ac.Deployment("redis", namespace.Name).

Can you add a util for it instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

_, err = test.Client().Core().AppsV1().Deployments(namespace.Name).Apply(test.Ctx(), redisDM, TestApplyOptions)
g.Expect(err).NotTo(HaveOccurred())

_, err = test.Client().Core().CoreV1().Services(namespace.Name).Apply(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add a util function

corev1ac.Service("redis", namespace.Name).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

WithContainers(corev1ac.Container().
WithName("ray-head").
WithImage(rayImage).
WithEnv(corev1ac.EnvVar().WithName("RAY_REDIS_ADDRESS").WithValue("redis:6379")).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have already set GcsFaultToleranceOptions, so this environment variable is not required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

WithReplicas(1).
WithMinReplicas(1).
WithMaxReplicas(2).
WithTemplate(corev1ac.PodTemplateSpec().
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use workerPodTemplateApplyConfiguration instead?

WithPorts(corev1ac.ContainerPort().WithContainerPort(6379).WithName("redis")).
WithPorts(corev1ac.ContainerPort().WithContainerPort(8265).WithName("dashboard")).
WithPorts(corev1ac.ContainerPort().WithContainerPort(10001).WithName("client")).
WithVolumeMounts(corev1ac.VolumeMount().
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use mountConfigMap?

g.Expect(err).NotTo(HaveOccurred())
test.T().Logf("Created RayCluster %s/%s successfully", rayCluster.Namespace, rayCluster.Name)

// Make sure the RAY_REDIS_ADDRESS env is set on the Head Pod.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to check this. It is the responsibility of the tests for GcsFaultToleranceOptions to verify this behavior.

	g.Eventually(func(g Gomega) bool {
		rayCluster, err := test.Client().Ray().RayV1().RayClusters(namespace.Name).Apply(test.Ctx(), rayClusterAC, TestApplyOptions)
		g.Expect(err).NotTo(HaveOccurred())
		if rayCluster.Status.Head.PodName != "" {
			headPod, err := test.Client().Core().CoreV1().Pods(namespace.Name).Get(test.Ctx(), rayCluster.Status.Head.PodName, metav1.GetOptions{})
			g.Expect(err).NotTo(HaveOccurred())
			return utils.EnvVarExists(utils.RAY_REDIS_ADDRESS, headPod.Spec.Containers[utils.RayContainerIndex].Env)
		}
		return false
	}, TestTimeoutMedium).Should(BeTrue())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

),
),
)
rayCluster, err := test.Client().Ray().RayV1().RayClusters(namespace.Name).Apply(test.Ctx(), rayClusterAC, TestApplyOptions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create the RayCluster inside "Test Detached Actor".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, only the rayCluster needs to be create inside or all the resources?
Cause maybe there will be some resources reused by multiple subtest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only move RayCluster inside "Test Detached Actor".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@kevin85421
Copy link
Member

By the way, it would be better not to resolve comments if you haven’t pushed the related changes or are not entirely sure about the reviewers’ requests.

@owenowenisme owenowenisme force-pushed the rewrite-detached-actor-test-with-go branch 2 times, most recently from 925349d to e81f179 Compare January 12, 2025 10:26
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind rebasing with the master branch?

  1. Reading https://github.com/ray-project/kuberay/blob/master/ray-operator/test/e2e/raycluster_gcsft_test.go may be helpful.
  2. Keep the code clean and reuse functions if possible.
  3. Use deployRedis function to deploy Redis.

Then, ping me for another review.

Signed-off-by: owenowenisme <[email protected]>
@owenowenisme owenowenisme force-pushed the rewrite-detached-actor-test-with-go branch from e81f179 to 0708c68 Compare January 23, 2025 02:25
@owenowenisme owenowenisme force-pushed the rewrite-detached-actor-test-with-go branch from 572270a to d07b95a Compare January 23, 2025 07:46
@owenowenisme
Copy link
Contributor Author

@kevin85421
BTW Should the filename raycluster_gcs_ft_test.go be change to something else cause there is raycluster_gcsft_test.go.
It's quite confusing actually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rewrite compatibility-test.py with golang
3 participants