rbd: added rbd info to validateRBDImageCount func #4938

OdedViner · 2024-10-31T14:15:04Z

Describe what this PR does

Adds the output of rbd info and rbd status to the failure message of validateRBDImageCount.

bash-5.1$ rbd info --pool=replicapool csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7
rbd image 'csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7':
	size 1 GiB in 256 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 1102e70da18f
	block_name_prefix: rbd_data.1102e70da18f
	format: 2
	features: layering
	op_features: 
	flags: 
	create_timestamp: Thu Oct 31 13:26:32 2024
	access_timestamp: Thu Oct 31 13:26:32 2024
	modify_timestamp: Thu Oct 31 13:26:32 2024

bash-5.1$ rbd status --pool=replicapool csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7
Watchers: none

Related issues

#4547

nixpanic · 2024-11-01T09:09:00Z

e2e/rbd.go

 		framework.Failf(
 			"backend images not matching kubernetes resource count,image count %d kubernetes resource count %d"+
-				"\nbackend image Info:\n %v",
+				"\nbackend image Info:\n %v\n images information and status%v",


status%v could use a space before the %v

nixpanic · 2024-11-01T09:12:51Z

e2e/rbd_helper.go

+	}
+	err = json.Unmarshal([]byte(stdOut), &imgStatus)
+	if err != nil {
+		return imgStatus, fmt.Errorf("unmarshal failed: %w. raw buffer response: %s",


don't return the buffer in the error message, as that may get printed many more times when callers log the error. Instead, use framework.Logf() to report details like this.

e2e/rbd_helper.go:1125:3: k8s.io/kubernetes/test/e2e/framework.Logf does not support error-wrapping directive %w

Failed to run container: https://url.corp.redhat.com/cc13cdb

framework.Logf() does not support %w for logging errors. Use %v when logging, and keep %w for wrapping errors with fmt.Errorf().
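
A minimal sketch of the pattern suggested above, assuming a hypothetical imageStatus struct and a local logf helper that stands in for framework.Logf() so the example runs without the Kubernetes e2e framework. The raw buffer is logged once with %v, and the returned error wraps the cause with %w but leaves the buffer out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"log"
)

// imageStatus is a hypothetical, trimmed-down stand-in for the struct
// that the rbd status JSON is decoded into.
type imageStatus struct {
	Watchers []json.RawMessage `json:"watchers"`
}

// logf stands in for framework.Logf(). Like Logf, it is only ever
// given %v for errors, never %w.
func logf(format string, args ...any) {
	log.Printf(format, args...)
}

// parseImageStatus decodes the command output. On failure the raw
// buffer is logged once here and kept out of the returned error, so
// callers that log the error again do not repeat the buffer.
func parseImageStatus(stdOut string) (imageStatus, error) {
	var st imageStatus
	if err := json.Unmarshal([]byte(stdOut), &st); err != nil {
		logf("unmarshal failed: %v. raw buffer response: %s", err, stdOut)

		return st, fmt.Errorf("unmarshal of rbd status output failed: %w", err)
	}

	return st, nil
}

func main() {
	_, err := parseImageStatus("not json")

	// %w keeps the original error reachable through errors.As.
	var syntaxErr *json.SyntaxError
	fmt.Println(errors.As(err, &syntaxErr))
}
```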

nixpanic · 2024-11-01T09:14:51Z

e2e/rbd_helper.go

+		fmt.Sprintf("rbd status %s %s --format json", rbdOptions(poolName), imageName),
+		rookNamespace)
+	if err != nil {
+		return imgStatus, fmt.Errorf("failed to get rbd status: %w", err)


It helps to give this err != nil and the below stdErr != "" errors a slightly different message. When something goes wrong, it can be identified more precisely if the error message is unique.
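
As a rough illustration of this suggestion, the sketch below gives the two failure paths different messages. The runner type is a hypothetical stand-in for the e2e helper that executes a command in the toolbox pod; the message wording is only an example.

```go
package main

import (
	"errors"
	"fmt"
)

// runner is a hypothetical stand-in for the helper that runs a command
// and returns stdout, stderr and an execution error.
type runner func(cmd string) (stdOut, stdErr string, err error)

// rbdStatus returns the raw rbd status output. Each failure path has
// its own message, so a log line points at exactly one branch.
func rbdStatus(run runner, pool, image string) (string, error) {
	cmd := fmt.Sprintf("rbd status --pool=%s %s --format json", pool, image)
	stdOut, stdErr, err := run(cmd)
	if err != nil {
		return "", fmt.Errorf("failed to execute rbd status: %w", err)
	}
	if stdErr != "" {
		return "", fmt.Errorf("rbd status wrote to stderr: %s", stdErr)
	}

	return stdOut, nil
}

func main() {
	execFails := func(string) (string, string, error) {
		return "", "", errors.New("pod not found")
	}
	stderrSet := func(string) (string, string, error) {
		return "", "rbd: error opening image", nil
	}

	_, err1 := rbdStatus(execFails, "replicapool", "img")
	_, err2 := rbdStatus(stderrSet, "replicapool", "img")
	fmt.Println(err1)
	fmt.Println(err2)
}
```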

nixpanic · 2024-11-01T09:15:50Z

/test ci/centos/mini-e2e-helm/k8s-1.30

nixpanic · 2024-11-01T10:03:51Z

e2e/rbd.go

+				return
+			}
+			// Collecting image details for printing
+			imageDetails = append(imageDetails, fmt.Sprintf("Pool Name: %s, Image Name: %s, Img Info: %v, Img Status: %v", pool, image, imgInfo, imgStatus))


golangci-lint complains that this line is too long: https://github.com/ceph/ceph-csi/actions/runs/11613610012/job/32379913964?pr=4938#step:3:820

Madhu-1 · 2024-11-04T07:34:24Z

e2e/rbd.go

+		for _, image := range imageList {
+			imgInfo, err := getImageInfo(f, image, pool)
+			if err != nil {
+				return


we need to log the error and continue, as this is for debugging

Madhu-1 · 2024-11-04T07:34:30Z

e2e/rbd.go

+			}
+			imgStatus, err := getImageStatus(f, image, pool)
+			if err != nil {
+				return


we need to log error and continue as this is for debugging
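
A small sketch of the log-and-continue approach requested in these two comments. The describe callback is a hypothetical stand-in for getImageInfo and getImageStatus, and errGone is an illustrative sentinel error; neither is taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errGone is an illustrative sentinel used by the example in main.
var errGone = errors.New("image not found")

// collectImageDetails gathers one line per image for the failure
// message. A failing lookup is logged and skipped, because this data
// is only collected to help debugging and must not abort the loop.
func collectImageDetails(images []string, describe func(string) (string, error)) []string {
	details := make([]string, 0, len(images))
	for _, image := range images {
		info, err := describe(image)
		if err != nil {
			log.Printf("failed to get details of image %s: %v", image, err)

			continue
		}
		details = append(details, fmt.Sprintf("Image: %s, Info: %s", image, info))
	}

	return details
}

func main() {
	describe := func(image string) (string, error) {
		if image == "b" {
			return "", errGone
		}

		return "ok", nil
	}
	for _, line := range collectImageDetails([]string{"a", "b", "c"}, describe) {
		fmt.Println(line)
	}
}
```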

Madhu-1 · 2024-11-04T10:31:08Z

/test ci/centos/mini-e2e-helm/k8s-1.30

Madhu-1 · 2024-11-04T13:11:52Z

/test ci/centos/mini-e2e-helm/k8s-1.30

OdedViner · 2024-11-04T14:04:50Z

Output from failure job: https://url.corp.redhat.com/e9aa53d

[FAILED] backend images not matching kubernetes resource count,image count 1 kubernetes resource count 11
  backend image Info:
   [csi-vol-d2bb1f89-26f7-4c37-a22d-d72f9a7aa30b]
   images information and status Pool: replicapool, Image: csi-vol-d2bb1f89-26f7-4c37-a22d-d72f9a7aa30b, Info: {csi-vol-d2bb1f89-26f7-4c37-a22d-d72f9a7aa30b 0 0 4194304}, Status: {  0 false 0 }
  In [It] at: /go/src/github.com/ceph/ceph-csi/e2e/rbd.go:206 @ 11/04/24 13:29:22.803

@Madhu-1 @nixpanic Is this the desired output? When I ran it manually I got the following results:

bash-5.1$ rbd info --pool=replicapool csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7
rbd image 'csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7':
	size 1 GiB in 256 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 1102e70da18f
	block_name_prefix: rbd_data.1102e70da18f
	format: 2
	features: layering
	op_features: 
	flags: 
	create_timestamp: Thu Oct 31 13:26:32 2024
	access_timestamp: Thu Oct 31 13:26:32 2024
	modify_timestamp: Thu Oct 31 13:26:32 2024
bash-5.1$ rbd status --pool=replicapool csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7
Watchers: none

Madhu-1 · 2024-11-04T14:09:29Z

How about we just return the stdout and log it? I think that will be more useful for debugging than the unmarshalled output. It looks like the logs above are not very helpful. @nixpanic WDYT?

nixpanic · 2024-11-06T09:42:32Z

Logging stdout is sufficient for me too. It may also be easier to find the right timestamp/location of errors when the logs contain pretty unique words like block_name_prefix.

OdedViner · 2024-11-06T18:40:20Z

I return (imageInfo, string, error) because the validateStripe function uses the imageInfo struct.

Madhu-1 · 2024-11-06T18:49:56Z

Info: {csi-vol-d2bb1f89-26f7-4c37-a22d-d72f9a7aa30b 0 0 4194304}, Status: { 0 false 0 }

This should log both the keys and the values of the struct; without the keys it is difficult to understand and debug.

It would be good if we just return the stdout and log it, which helps debugging. The struct itself is missing a few details.
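
To illustrate the point about keys and values: in Go, the %v verb prints only the field values of a struct, while %+v also prints the field names. The imageInfo struct below is hypothetical, and its field names are assumptions made for this sketch.

```go
package main

import "fmt"

// imageInfo is a hypothetical struct; the field names are assumptions
// and are not taken from the ceph-csi e2e code.
type imageInfo struct {
	Name        string
	StripeUnit  int
	StripeCount int
	ObjSize     int
}

// formatInfo renders the same struct with %v (values only) and with
// %+v (field names and values).
func formatInfo(info imageInfo) (values, keyed string) {
	return fmt.Sprintf("%v", info), fmt.Sprintf("%+v", info)
}

func main() {
	values, keyed := formatInfo(imageInfo{Name: "csi-vol-example", ObjSize: 4194304})
	fmt.Println(values) // {csi-vol-example 0 0 4194304}
	fmt.Println(keyed)  // {Name:csi-vol-example StripeUnit:0 StripeCount:0 ObjSize:4194304}
}
```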

OdedViner · 2024-11-07T06:59:59Z

And what should be done with this function? It relies on the imgInfo struct.

ceph-csi/e2e/rbd_helper.go, lines 1132 to 1133 in cea8bf8:

	imgInfo, err := getImageInfo(f, imageData.imageName, defaultRBDPool)
	if err != nil {

OdedViner · 2024-11-11T08:47:22Z

@Madhu-1 @nixpanic Which method would be better: getImageStatus, which returns a string output, or getImageInfo, which returns a struct with all parameters?

Madhu-1 · 2024-11-11T12:26:01Z

We could have a base function that returns the string. getImageStatus can use the base function and then unmarshal the result, and the validation can call the base function directly to get the string data and print it.
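
One way this layering might look, as a sketch only. The names imageStatusRaw and imageStatusParsed, the runner type, and the imageStatus struct are all hypothetical and do not reflect the code that was merged.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runner is a hypothetical stand-in for the helper that executes a
// command in the toolbox pod and returns its stdout.
type runner func(cmd string) (string, error)

// imageStatus is a hypothetical, trimmed-down decoding target.
type imageStatus struct {
	Watchers []json.RawMessage `json:"watchers"`
}

// imageStatusRaw is the base function: it returns the command output
// as a string, ready to be logged as-is.
func imageStatusRaw(run runner, pool, image string) (string, error) {
	out, err := run(fmt.Sprintf("rbd status --pool=%s %s --format json", pool, image))
	if err != nil {
		return "", fmt.Errorf("failed to run rbd status: %w", err)
	}

	return out, nil
}

// imageStatusParsed builds on the base function and unmarshals the
// output for callers that need structured data.
func imageStatusParsed(run runner, pool, image string) (imageStatus, error) {
	var st imageStatus
	raw, err := imageStatusRaw(run, pool, image)
	if err != nil {
		return st, err
	}
	if err := json.Unmarshal([]byte(raw), &st); err != nil {
		return st, fmt.Errorf("failed to unmarshal rbd status output: %w", err)
	}

	return st, nil
}

func main() {
	run := func(string) (string, error) { return `{"watchers":[]}`, nil }

	raw, _ := imageStatusRaw(run, "replicapool", "img")
	fmt.Println("raw:", raw)

	st, err := imageStatusParsed(run, "replicapool", "img")
	fmt.Println(len(st.Watchers), err)
}
```

With this split, the validation path only logs the raw string, while code that relies on struct fields keeps using the parsed variant.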

Madhu-1

@OdedViner changes LGTM, please squash the commits into one.

ceph-csi-bot · 2024-11-22T08:02:29Z

/test ci/centos/mini-e2e/k8s-1.31

black-dragon74

LGTM, thank you!

nixpanic · 2024-11-22T12:42:51Z

@Mergifyio rebase

Signed-off-by: Oded Viner <[email protected]>

mergify · 2024-11-22T12:42:58Z

rebase

✅ Branch has been successfully rebased

nixpanic · 2024-11-22T12:43:31Z

@Mergifyio queue

mergify · 2024-11-22T12:43:39Z

queue

🛑 The pull request has been removed from the queue `default`

The merge conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

ceph-csi-bot · 2024-11-22T12:43:52Z

/test ci/centos/k8s-e2e-external-storage/1.30

ceph-csi-bot · 2024-11-22T12:43:53Z

/test ci/centos/mini-e2e-helm/k8s-1.30

ceph-csi-bot · 2024-11-22T12:43:53Z

/test ci/centos/k8s-e2e-external-storage/1.29

ceph-csi-bot · 2024-11-22T12:43:53Z

/test ci/centos/mini-e2e/k8s-1.30

ceph-csi-bot · 2024-11-22T12:43:53Z

/test ci/centos/mini-e2e-helm/k8s-1.29

ceph-csi-bot · 2024-11-22T12:43:54Z

/test ci/centos/k8s-e2e-external-storage/1.31

ceph-csi-bot · 2024-11-22T12:43:54Z

/test ci/centos/mini-e2e/k8s-1.29

ceph-csi-bot · 2024-11-22T12:43:55Z

/test ci/centos/mini-e2e-helm/k8s-1.31

ceph-csi-bot · 2024-11-22T12:43:55Z

/test ci/centos/mini-e2e/k8s-1.31

ceph-csi-bot · 2024-11-22T12:44:41Z

/test ci/centos/upgrade-tests-cephfs

ceph-csi-bot · 2024-11-22T12:44:42Z

/test ci/centos/upgrade-tests-rbd

nixpanic · 2024-11-22T13:48:39Z

/retest ci/centos/mini-e2e/k8s-1.29

nixpanic · 2024-11-22T13:53:36Z

@OdedViner, a CI job failed and resulted in the following output:

[FAILED] backend images not matching kubernetes resource count,image count 1 kubernetes resource count 0
  backend image Info:
   [csi-vol-76e28f0b-4ae2-43f7-b9e6-c8e28d891c25]
   images information and status Pool: replicapool, Image: csi-vol-76e28f0b-4ae2-43f7-b9e6-c8e28d891c25, Info: , Status:

It looks like Info and Status are both empty for some reason, maybe you want to double check what it is doing?

OdedViner · 2024-12-01T11:48:54Z

@nixpanic @Madhu-1 Can we re-run the job? Alternatively, I can create a PR that calls validateRBDImageCount(f, 111, defaultRBDPool). That test is expected to fail, so we can review the failure output.

Madhu-1 · 2024-12-02T08:45:21Z

I see this happening as well. It happens when volume deletion is delayed: the volume is present in the first get and is deleted by the time we query its details (due to async deletion).
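
The race described here can be reproduced with a toy model. In the sketch below, fakePool is a hypothetical in-memory stand-in for an RBD pool and errImageNotFound is an illustrative sentinel; the example only shows how an image can appear in the listing and still yield no details, and it is not a proposed fix.

```go
package main

import (
	"errors"
	"fmt"
)

// errImageNotFound is an illustrative sentinel for a missing image.
var errImageNotFound = errors.New("image not found")

// fakePool is a hypothetical in-memory stand-in for an RBD pool that
// maps image names to their info text.
type fakePool map[string]string

// info returns the details of one image, or errImageNotFound if the
// image has been deleted in the meantime.
func (p fakePool) info(name string) (string, error) {
	v, ok := p[name]
	if !ok {
		return "", fmt.Errorf("rbd info %s: %w", name, errImageNotFound)
	}

	return v, nil
}

// describe looks up every image from an earlier listing and marks the
// ones that disappeared between the listing and the detail query.
func describe(p fakePool, listed []string) []string {
	lines := make([]string, 0, len(listed))
	for _, name := range listed {
		info, err := p.info(name)
		switch {
		case errors.Is(err, errImageNotFound):
			lines = append(lines, name+": deleted after listing")
		case err != nil:
			lines = append(lines, name+": "+err.Error())
		default:
			lines = append(lines, name+": "+info)
		}
	}

	return lines
}

func main() {
	p := fakePool{"a": "size 1 GiB", "b": "size 1 GiB"}
	listed := []string{"a", "b"} // first get: both images exist

	delete(p, "b") // async deletion completes before the detail query

	for _, line := range describe(p, listed) {
		fmt.Println(line)
	}
}
```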

OdedViner · 2024-12-02T08:56:21Z

@Madhu-1 is there a way to fix it, or can we only add a sleep before the first get?

Madhu-1 · 2024-12-02T08:58:42Z

This is very much a corner case. There is no 100% solution, but we need more than a sleep: we need to take tasks into consideration when we do the get call for the count.

mergify bot added the component/rbd Issues related to RBD label Oct 31, 2024

nixpanic requested changes Nov 1, 2024

nixpanic reviewed Nov 1, 2024

OdedViner force-pushed the add_rbd_info branch from 71d09c0 to 00ee8b1 Compare November 3, 2024 10:04

OdedViner requested a review from nixpanic November 3, 2024 10:05

Madhu-1 reviewed Nov 4, 2024

OdedViner force-pushed the add_rbd_info branch 2 times, most recently from 90b18cb to ffae514 Compare November 4, 2024 09:27

iPraveenParihar reviewed Nov 4, 2024

iPraveenParihar reviewed Nov 4, 2024

OdedViner force-pushed the add_rbd_info branch from ffae514 to 0e15981 Compare November 4, 2024 12:56

OdedViner force-pushed the add_rbd_info branch from 0e15981 to da3d8b6 Compare November 4, 2024 13:51

OdedViner force-pushed the add_rbd_info branch from da3d8b6 to e19572a Compare November 6, 2024 18:38

OdedViner force-pushed the add_rbd_info branch from 0c69545 to 4fc6960 Compare November 11, 2024 18:17

Madhu-1 reviewed Nov 18, 2024

OdedViner force-pushed the add_rbd_info branch from 4fc6960 to c9c7e80 Compare November 18, 2024 10:07

Madhu-1 approved these changes Nov 19, 2024

Madhu-1 requested changes Nov 19, 2024

OdedViner force-pushed the add_rbd_info branch from c9c7e80 to 383ff1d Compare November 19, 2024 13:25

ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Nov 22, 2024

nixpanic requested a review from a team November 22, 2024 08:24

black-dragon74 approved these changes Nov 22, 2024

rbd: added rbd info to validateRBDImageCount func

2e85719

Signed-off-by: Oded Viner <[email protected]>

nixpanic force-pushed the add_rbd_info branch from 80a7d36 to 2e85719 Compare November 22, 2024 12:43

mergify bot added the ok-to-test Label to trigger E2E tests label Nov 22, 2024

ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Nov 22, 2024

mergify bot merged commit dd1c302 into ceph:devel Nov 22, 2024
38 checks passed
