rbd: added rbd info to validateRBDImageCount func #4938

Merged
merged 1 commit into ceph:devel from add_rbd_info on Nov 22, 2024

Conversation

@OdedViner (Contributor) commented Oct 31, 2024

Describe what this PR does

Added rbd info and rbd status output to the failure message

bash-5.1$ rbd info --pool=replicapool csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7
rbd image 'csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7':
	size 1 GiB in 256 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 1102e70da18f
	block_name_prefix: rbd_data.1102e70da18f
	format: 2
	features: layering
	op_features: 
	flags: 
	create_timestamp: Thu Oct 31 13:26:32 2024
	access_timestamp: Thu Oct 31 13:26:32 2024
	modify_timestamp: Thu Oct 31 13:26:32 2024
bash-5.1$ rbd status --pool=replicapool csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7
Watchers: none

Is there anything that requires special attention

Do you have any questions?

Is the change backward compatible?

Are there concerns around backward compatibility?

Provide any external context for the change, if any.

For example:

  • Kubernetes links that explain why the change is required
  • CSI spec related changes/catch-up that necessitates this patch
  • golang related practices that necessitate this change

Related issues

Mention any GitHub issues relevant to this PR. Adding the below line
will help to auto-close the issue once the PR is merged.

Fixes: #4547

Future concerns

List items that are not part of the PR and do not impact its
functionality, but are work items that can be taken up subsequently.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next major release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)

@mergify mergify bot added the component/rbd Issues related to RBD label Oct 31, 2024
e2e/rbd.go Outdated
framework.Failf(
	"backend images not matching kubernetes resource count,image count %d kubernetes resource count %d"+
-		"\nbackend image Info:\n %v",
+		"\nbackend image Info:\n %v\n images information and status%v",
Member

status%v could use a space before the %v

Contributor Author

done
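
For reference, a minimal sketch of the adjusted format string with the extra space (the argument names imageCount, resourceCount and imageDetails are assumed for illustration, not taken from the diff):

framework.Failf(
	"backend images not matching kubernetes resource count,image count %d kubernetes resource count %d"+
		"\nbackend image Info:\n %v\n images information and status %v",
	imageCount, resourceCount, imageList, imageDetails)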

}
err = json.Unmarshal([]byte(stdOut), &imgStatus)
if err != nil {
return imgStatus, fmt.Errorf("unmarshal failed: %w. raw buffer response: %s",
Member

don't return the buffer in the error message, as that may get printed many more times when callers log the error. Instead, use framework.Logf() to report details like this.

Contributor Author

done

Contributor Author

e2e/rbd_helper.go:1125:3: k8s.io/kubernetes/test/e2e/framework.Logf does not support error-wrapping directive %w

Failed to run container: https://url.corp.redhat.com/cc13cdb

Member

framework.Logf() does not support %w for logging errors. Use %v while logging, but use %w with fmt.Errorf() or errors.New().
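
A minimal sketch of that split, assuming the variable names from the snippet above:

// %v is for logging with framework.Logf(); %w is only meaningful when wrapping with fmt.Errorf().
framework.Logf("unmarshal failed: %v, raw buffer response: %s", err, stdOut)
return imgStatus, fmt.Errorf("unmarshal of rbd status output failed: %w", err)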

fmt.Sprintf("rbd status %s %s --format json", rbdOptions(poolName), imageName),
rookNamespace)
if err != nil {
return imgStatus, fmt.Errorf("failed to get rbd status: %w", err)
Member

It helps to give this err != nil and the below stdErr != "" errors a slightly different message. When something goes wrong, it can be identified more precisely if the error message is unique.

Contributor Author

done
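
A sketch of what distinct messages could look like (execCommandInToolBoxPod and the surrounding variable names are assumed from the snippets in this thread):

stdOut, stdErr, err := execCommandInToolBoxPod(f,
	fmt.Sprintf("rbd status %s %s --format json", rbdOptions(poolName), imageName),
	rookNamespace)
if err != nil {
	return imgStatus, fmt.Errorf("failed to execute rbd status command: %w", err)
}
if stdErr != "" {
	return imgStatus, fmt.Errorf("rbd status command reported an error: %s", stdErr)
}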

@nixpanic (Member) commented Nov 1, 2024

/test ci/centos/mini-e2e-helm/k8s-1.30

e2e/rbd.go Outdated
return
}
// Collecting image details for printing
imageDetails = append(imageDetails, fmt.Sprintf("Pool Name: %s, Image Name: %s, Img Info: %v, Img Status: %v", pool, image, imgInfo, imgStatus))
Member

Contributor Author

done

e2e/rbd.go Outdated
for _, image := range imageList {
imgInfo, err := getImageInfo(f, image, pool)
if err != nil {
return
Collaborator

we need to log the error and continue, as this is for debugging

Contributor Author

done

e2e/rbd.go Outdated
}
imgStatus, err := getImageStatus(f, image, pool)
if err != nil {
return
Collaborator

we need to log the error and continue, as this is for debugging

Contributor Author

done
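
Putting both review comments together, the loop could look roughly like this (a sketch only; the final code in the PR may differ):

for _, image := range imageList {
	imgInfo, err := getImageInfo(f, image, pool)
	if err != nil {
		framework.Logf("failed to get image info for %s/%s: %v", pool, image, err)
		continue
	}
	imgStatus, err := getImageStatus(f, image, pool)
	if err != nil {
		framework.Logf("failed to get image status for %s/%s: %v", pool, image, err)
		continue
	}
	// Collecting image details for printing
	imageDetails = append(imageDetails,
		fmt.Sprintf("Pool Name: %s, Image Name: %s, Img Info: %v, Img Status: %v",
			pool, image, imgInfo, imgStatus))
}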

@OdedViner OdedViner force-pushed the add_rbd_info branch 2 times, most recently from 90b18cb to ffae514 Compare November 4, 2024 09:27
e2e/rbd.go Outdated (resolved)
e2e/rbd_helper.go Outdated (resolved)
@Madhu-1 (Collaborator) commented Nov 4, 2024

/test ci/centos/mini-e2e-helm/k8s-1.30

@Madhu-1 (Collaborator) commented Nov 4, 2024

/test ci/centos/mini-e2e-helm/k8s-1.30

@OdedViner (Contributor Author)

Output from failure job: https://url.corp.redhat.com/e9aa53d

[FAILED] backend images not matching kubernetes resource count,image count 1 kubernetes resource count 11
  backend image Info:
   [csi-vol-d2bb1f89-26f7-4c37-a22d-d72f9a7aa30b]
   images information and status Pool: replicapool, Image: csi-vol-d2bb1f89-26f7-4c37-a22d-d72f9a7aa30b, Info: {csi-vol-d2bb1f89-26f7-4c37-a22d-d72f9a7aa30b 0 0 4194304}, Status: {  0 false 0 }
  In [It] at: /go/src/github.com/ceph/ceph-csi/e2e/rbd.go:206 @ 11/04/24 13:29:22.803

@Madhu-1 @nixpanic Is this the desired output? When I ran it manually I got the following results:

bash-5.1$ rbd info --pool=replicapool csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7
rbd image 'csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7':
	size 1 GiB in 256 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 1102e70da18f
	block_name_prefix: rbd_data.1102e70da18f
	format: 2
	features: layering
	op_features: 
	flags: 
	create_timestamp: Thu Oct 31 13:26:32 2024
	access_timestamp: Thu Oct 31 13:26:32 2024
	modify_timestamp: Thu Oct 31 13:26:32 2024
bash-5.1$ rbd status --pool=replicapool csi-vol-2b6574ee-9c6f-4a3b-9a96-09060763f7a7
Watchers: none

@Madhu-1 (Collaborator) commented Nov 4, 2024

How about if we just return the stdout and log it? I think it will be more useful for debugging than the unmarshalled output. It looks like the above logs are not much help, @nixpanic WDYT?

@nixpanic (Member) commented Nov 6, 2024

Logging stdout is sufficient for me too. It may also be easier to find the right timestamp/location of errors when the logs contain pretty unique words like block_name_prefix.

@OdedViner (Contributor Author)

I return imageInfo, a string, and an error because the validateStripe function still relies on the imgInfo struct.

@Madhu-1 (Collaborator) commented Nov 6, 2024

Info: {csi-vol-d2bb1f89-26f7-4c37-a22d-d72f9a7aa30b 0 0 4194304}, Status: { 0 false 0 }

This should log both the keys and the values from the struct; without them it is difficult to understand and debug.

It would be good if we just returned the stdout and logged it; that helps for debugging, and the actual struct is missing a few details.
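
For context on the key/value point: Go's %v verb prints a struct's values without field names, while %+v includes them. A tiny illustration with a made-up struct (not the actual e2e type):

type imgStatus struct {
	Name     string
	Size     int64
	Watchers int
}

st := imgStatus{Name: "csi-vol-d2bb1f89", Size: 4194304}
fmt.Printf("%v\n", st)  // {csi-vol-d2bb1f89 4194304 0}            – values only
fmt.Printf("%+v\n", st) // {Name:csi-vol-d2bb1f89 Size:4194304 Watchers:0} – keys and values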

@OdedViner (Contributor Author)

And what should be done with this function? It relies on the imgInfo struct.

ceph-csi/e2e/rbd_helper.go

Lines 1132 to 1133 in cea8bf8

imgInfo, err := getImageInfo(f, imageData.imageName, defaultRBDPool)
if err != nil {

@OdedViner (Contributor Author)

@Madhu-1 @nixpanic Which method would be better: getImageStatus, which returns a string output, or getImageInfo, which returns a struct with all parameters?

@Madhu-1 (Collaborator) commented Nov 11, 2024

we could have a base function that returns the string; getImageStatus can call the base function and then unmarshal, and the validation can call the base function directly to get the string data and print it.
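
A rough sketch of that split (the function and type names are illustrative; execCommandInToolBoxPod, rbdOptions, rookNamespace and the imageStatus type are assumed from the snippets above, not the exact helpers in e2e/rbd_helper.go):

// getImageStatusRaw returns the raw JSON from `rbd status`, which the
// validation can log directly for debugging.
func getImageStatusRaw(f *framework.Framework, imageName, poolName string) (string, error) {
	stdOut, stdErr, err := execCommandInToolBoxPod(f,
		fmt.Sprintf("rbd status %s %s --format json", rbdOptions(poolName), imageName),
		rookNamespace)
	if err != nil {
		return "", fmt.Errorf("failed to execute rbd status command: %w", err)
	}
	if stdErr != "" {
		return "", fmt.Errorf("rbd status command reported an error: %s", stdErr)
	}
	return stdOut, nil
}

// getImageStatus keeps returning a structured result by unmarshalling the raw output.
func getImageStatus(f *framework.Framework, imageName, poolName string) (imageStatus, error) {
	var imgStatus imageStatus
	stdOut, err := getImageStatusRaw(f, imageName, poolName)
	if err != nil {
		return imgStatus, err
	}
	if err := json.Unmarshal([]byte(stdOut), &imgStatus); err != nil {
		framework.Logf("unmarshal failed: %v, raw buffer response: %s", err, stdOut)
		return imgStatus, fmt.Errorf("unmarshal of rbd status output failed: %w", err)
	}
	return imgStatus, nil
}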

e2e/rbd_helper.go Outdated (resolved)
e2e/rbd.go Outdated (resolved)
@Madhu-1 (Collaborator) left a comment

@OdedViner changes LGTM, please squash the commits into 1.

@ceph-csi-bot (Collaborator)

/test ci/centos/mini-e2e/k8s-1.31

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Nov 22, 2024
@nixpanic nixpanic requested a review from a team November 22, 2024 08:24
@black-dragon74 (Member) left a comment

LGTM, thank you!

@nixpanic (Member)

@Mergifyio rebase

mergify bot (Contributor) commented Nov 22, 2024

rebase

✅ Branch has been successfully rebased

@nixpanic (Member)

@Mergifyio queue

mergify bot (Contributor) commented Nov 22, 2024

queue

🛑 The pull request has been removed from the queue default

The merge conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@mergify mergify bot added the ok-to-test Label to trigger E2E tests label Nov 22, 2024
@ceph-csi-bot (Collaborator)

/test ci/centos/k8s-e2e-external-storage/1.30

@ceph-csi-bot (Collaborator)

/test ci/centos/mini-e2e-helm/k8s-1.30

@ceph-csi-bot (Collaborator)

/test ci/centos/k8s-e2e-external-storage/1.29

@ceph-csi-bot (Collaborator)

/test ci/centos/mini-e2e/k8s-1.30

@ceph-csi-bot (Collaborator)

/test ci/centos/mini-e2e-helm/k8s-1.29

@ceph-csi-bot (Collaborator)

/test ci/centos/k8s-e2e-external-storage/1.31

@ceph-csi-bot (Collaborator)

/test ci/centos/mini-e2e/k8s-1.29

@ceph-csi-bot (Collaborator)

/test ci/centos/mini-e2e-helm/k8s-1.31

@ceph-csi-bot (Collaborator)

/test ci/centos/mini-e2e/k8s-1.31

@ceph-csi-bot (Collaborator)

/test ci/centos/upgrade-tests-cephfs

@ceph-csi-bot (Collaborator)

/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Nov 22, 2024
@nixpanic (Member)

/retest ci/centos/mini-e2e/k8s-1.29

@nixpanic (Member)

@OdedViner, a CI job failed and resulted in the following output:

[FAILED] backend images not matching kubernetes resource count,image count 1 kubernetes resource count 0
  backend image Info:
   [csi-vol-76e28f0b-4ae2-43f7-b9e6-c8e28d891c25]
   images information and status Pool: replicapool, Image: csi-vol-76e28f0b-4ae2-43f7-b9e6-c8e28d891c25, Info: , Status: 

It looks like Info and Status are both empty for some reason, maybe you want to double check what it is doing?

@mergify mergify bot merged commit dd1c302 into ceph:devel Nov 22, 2024
38 checks passed
@OdedViner (Contributor Author) commented Dec 1, 2024

@nixpanic @Madhu-1 Can we re-run the job? Alternatively, I can create a PR with validateRBDImageCount(f, 111, defaultRBDPool). The test is expected to fail, so we can review the failure output.

@Madhu-1 (Collaborator) commented Dec 2, 2024

I see this happening as well. It happens when there is a delay in volume deletion: the volume is present in the first get, but gets deleted by the time we query its details (due to async deletion).

@OdedViner (Contributor Author) commented Dec 2, 2024

@Madhu-1 is there a way to fix it, or can we only add a sleep before the first get?

@Madhu-1 (Collaborator) commented Dec 2, 2024

This is a real corner case; there is no 100% solution to fix it, but we need more than a sleep. We need to take pending rbd tasks into consideration when we do the get call for the count.
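
Not part of this PR, but one possible direction as a sketch: retry the count check for a bounded time so in-flight asynchronous deletions can complete. listRBDImages, the interval and the timeout are assumed for illustration, and checking pending rbd tasks, as mentioned above, would still be needed on top of this:

func waitForRBDImageCount(f *framework.Framework, pool string, expected int) error {
	const (
		interval = 5 * time.Second
		timeout  = 2 * time.Minute
	)
	lastCount := -1
	for start := time.Now(); time.Since(start) < timeout; time.Sleep(interval) {
		images, err := listRBDImages(f, pool)
		if err != nil {
			framework.Logf("failed to list rbd images: %v", err)
			continue
		}
		lastCount = len(images)
		if lastCount == expected {
			return nil
		}
		framework.Logf("waiting for image count %d, currently %d", expected, lastCount)
	}
	return fmt.Errorf("image count %d did not reach expected count %d within %v", lastCount, expected, timeout)
}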

Labels
component/rbd Issues related to RBD
6 participants