DAOS-16982 csum: recalculate checksum on retrying #15786
base: jvolivie/disable_target
Conversation
Ticket title is 'We should not report checksum errors against the nvme device for key verification'.
I have already tested it by manually injecting failures, and I'm working on turning that into a unit test.
Force-pushed from 7f74db4 to bb23b17
@@ -5140,6 +5141,11 @@ obj_csum_update(struct dc_object *obj, daos_obj_update_t *args, struct obj_auxi_
 	if (!obj_csum_dedup_candidate(&obj->cob_co->dc_props, args->iods, args->nr))
 		return 0;

+	if (obj_auxi->csum_retry) {
+		/* Release old checksum result and prepare for new calculation */
+		daos_csummer_free_ic(obj->cob_co->dc_csummer, &obj_auxi->rw_args.iod_csums);
I think we probably want to do this after a couple of retries
It's easy to add, but I wonder whether that is really necessary, since a checksum error is a rare event by itself.
How about revising it to:
if (obj_auxi->csum_retry && obj_auxi->csum_retry_cnt > 2) { ... }
Would that work for you?
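In isolation, the suggested gate could be sketched as below. This is a hypothetical, trimmed-down model for illustration only: the struct loosely mirrors a couple of fields of the real struct obj_auxi_args, the helper name is invented, and the threshold of 2 is the one proposed in the comment above.

```c
#include <assert.h>
#include <stdbool.h>

/* Trimmed-down, illustrative stand-in for two fields of the real
 * struct obj_auxi_args in the DAOS object client. */
struct obj_auxi_args {
	bool     csum_retry;      /* set when the RPC failed with -DER_CSUM */
	unsigned csum_retry_cnt;  /* csum retries attempted so far */
};

/* Hypothetical helper: only discard the cached checksum (forcing a
 * recomputation) after more than two csum retries, per the suggestion. */
static bool should_recalc_csum(const struct obj_auxi_args *aux)
{
	return aux->csum_retry && aux->csum_retry_cnt > 2;
}
```

With this shape, the first couple of retries reuse the cached checksum, and only a persistent checksum failure pays the recomputation cost.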
		/* Release old checksum result and prepare for new calculation */
		daos_csummer_free_ic(obj->cob_co->dc_csummer, &obj_auxi->rw_args.iod_csums);
	}

	return dc_obj_csum_update(obj->cob_co->dc_csummer, obj->cob_co->dc_props,
				  obj->cob_md.omd_id, args->dkey, args->iods, args->sgls, args->nr,
				  obj_auxi->reasb_req.orr_singv_los, &obj_auxi->rw_args.dkey_csum,
In the actual issue we saw, it was the dkey_csum that needed to be recalculated. Is that happening here?
Yes, if I read the code correctly, because we release the previous calculation above.
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/3/execution/node/344/log
Force-pushed from c63ecc9 to e119758
Force-pushed from e119758 to f0a07e6
This PR fixes the retry logic by actually recalculating the checksum; it also removes the code that incorrectly records an nvme error.
Run-GHA: true
Change-Id: Ib0287851fea4d125eecda48c5ccb3c73ed85b8f8
Signed-off-by: Jinshan Xiong <[email protected]>
Force-pushed from f0a07e6 to eb6a7d1
Functional on EL 8.8 Test Results: 131 tests, 127 passed, 1h 30m 53s. Results for commit eb6a7d1.
@wangdi1 @liuxuezhao, can you please take a look?
src/object/srv_obj.c (outdated)
			DP_C_UOID_DKEY(orw->orw_oid, &orw->orw_dkey),
			DP_RC(rc));
	if (rc == -DER_CSUM)
		obj_log_csum_err();
Perhaps we should fix this in a separate patch?
Signed-off-by: Jeff Olivier <[email protected]>
 	/* Retry fetch on alternative shard */
-	if (obj_auxi->opc == DAOS_OBJ_RPC_FETCH) {
-		if (task->dt_result == -DER_CSUM)
+	if ((obj_auxi->opc == DAOS_OBJ_RPC_FETCH ||
Now, if a FETCH gets DER_CSUM:
- For a replicated object it will try another replica, and fail once all replicas have been retried; see obj_retry_next_shard(). So the "csum_retry_cnt < MAX_CSUM_RETRY" check for fetch does not look very reasonable: if #replicas < MAX_CSUM_RETRY the check is useless (always true), and if #replicas > MAX_CSUM_RETRY the fetch actually still has a chance to succeed, but the code will fail it.
- For an EC object it will mark the shard as failed (obj_auxi_add_failed_tgt()) and do an EC degraded fetch, so the "csum_retry_cnt < MAX_CSUM_RETRY" check is also not really useful: if the number of retries exceeds the number of parity shards, it will fail anyway.

The retry for UPDATE is simpler; it just retries that many times, so the check is valid there.
So it looks to me that the change here makes the code somewhat inconsistent with the real behavior (for FETCH). Do you think that is fine? I'll leave a -1 for now, thanks.
csum_retry_cnt < MAX_CSUM_RETRY will be applied to both read and write. For read, even though there may be more than 10 replicas (MAX_CSUM_RETRY is set to 10 at this point), if the same error comes back 10 times in a row, something may have gone terribly wrong. Either way, I think it is reasonable to limit the rw RPC to a bounded number of retries. If there were bugs in our code, this would prevent unlimited retrying, which we have seen in production.
I actually have a comment below explaining where I did this. Otherwise, please suggest a fix.
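The bounded-retry argument above can be modelled minimally as follows. MAX_CSUM_RETRY = 10 matches the value mentioned in the discussion; the helper function itself is hypothetical, not the real client code.

```c
#include <assert.h>
#include <stdbool.h>

/* MAX_CSUM_RETRY is 10 in the tree at this point, per the discussion. */
#define MAX_CSUM_RETRY 10

/* Hypothetical helper modelling the cap argued for above: allow a csum
 * retry only while fewer than MAX_CSUM_RETRY attempts have been made,
 * so a bug that keeps returning the same error cannot loop forever. */
static bool csum_retry_allowed(unsigned csum_retry_cnt)
{
	return csum_retry_cnt < MAX_CSUM_RETRY;
}
```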
I had thought that FETCH need not consider csum_retry_cnt, since it has its own, different control.
On second thought, as long as MAX_CSUM_RETRY is never made smaller, it is probably fine.
@@ -5140,6 +5140,12 @@ obj_csum_update(struct dc_object *obj, daos_obj_update_t *args, struct obj_auxi_
 	if (!obj_csum_dedup_candidate(&obj->cob_co->dc_props, args->iods, args->nr))
 		return 0;

+	if (obj_auxi->csum_retry) {
Just to confirm: would the original code cause a memory leak (since it did not call daos_csummer_free_ci() before)?
And why do we only call daos_csummer_free_ci() for csum_retry, and not for other retries like nvme_io_err/tx_uncertain?
This is not about a memory leak. The checksum computation code dc_obj_csum_update() uses obj_auxi->rw_args.dkey_csum and obj_auxi->rw_args.iod_csums to check whether the checksum has already been computed.
By freeing the previous result, we cause the checksum to be recomputed, which is what we want for this fix. The reason we only do this for a checksum retry is that the buffer may have been updated after the checksum was computed, and retrying without recomputation won't help because it will certainly hit the same error again.
We're actually considering changing the NVMe checksum error to nvme_io_err, so that csum_err is dedicated to network checksum errors. The discussion started here: https://daos-stack.slack.com/archives/C4SM0RZ54/p1738030213108609
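The free-to-force-recompute pattern described above can be modelled with a tiny self-contained sketch. The struct, counter, and function names here are illustrative stand-ins, not the real DAOS API: the only point is that "compute" is skipped while a cached result exists, so freeing the cache is what triggers recomputation on retry.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for obj_auxi->rw_args; NULL means "not computed". */
struct rw_args {
	int *iod_csums;
};

static int compute_count;  /* counts actual recomputations */

/* Mirrors how dc_obj_csum_update() skips work when a cached result exists. */
static void csum_update(struct rw_args *rw)
{
	if (rw->iod_csums != NULL)
		return;                        /* reuse the cached checksum */
	rw->iod_csums = malloc(sizeof(int));
	*rw->iod_csums = ++compute_count;      /* "compute" a fresh checksum */
}

/* Releasing the cached result is what makes the next call recompute. */
static void csum_free(struct rw_args *rw)
{
	free(rw->iod_csums);
	rw->iod_csums = NULL;
}
```

Calling csum_update() twice in a row computes only once; calling csum_free() in between makes the second call compute again, which is exactly the behavior the patch wants on a csum retry.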
OK, thanks for the explanation.
Regarding "changing NVMe checksum error to nvme_io_err" and "For DER_NVME_IO error, it should try the next replica if it has one; otherwise, it should just send the RPC to the same replica again and hopefully the network error will disappear": that would require changing quite a bit of code and make the logic more complicated.
It depends on how likely a checksum error caused by the network transfer is. Is it frequent or very rare, and when it happens, can it be corrected by retrying the transfer to the same target?
It looks to me that a pure network-caused csum error that is fixed by re-transferring should be very rare. On the other side, the RPC layer (such as mercury) can internally verify the checksum during the transfer and fail the underlying RPC.
So I'm wondering whether it's worth introducing so much complexity for an assumption that doesn't look true.
(I also replied to the Slack thread; we may try to simplify the retry logic and avoid complexity where possible.)
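For reference, the retry policy quoted in this thread could be sketched like the hypothetical selector below. The function name and the replica bookkeeping are invented for illustration; the real DAOS shard selection is considerably more involved.

```c
#include <assert.h>

/* Hypothetical selector for the quoted DER_NVME_IO policy: move to the
 * next replica while an untried one remains; once every replica has been
 * tried, resend to the current one and hope the transient error clears.
 *
 * cur        - index of the replica that just failed
 * tried      - number of distinct replicas tried so far (incl. current)
 * n_replicas - total replica count
 */
static unsigned nvme_io_retry_target(unsigned cur, unsigned tried,
				     unsigned n_replicas)
{
	if (tried < n_replicas)
		return (cur + 1) % n_replicas;  /* an untried replica remains */
	return cur;                             /* all tried: same replica */
}
```

As the comment above argues, whether this extra machinery is worth it depends on how often a retry to a different target actually fixes a transfer-caused checksum error.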
This PR fixes the retry logic by actually recalculating the checksum; it also removes the code that incorrectly records an nvme error.
This is a quick fix before we make an ultimate fix discussed here: https://daos-stack.slack.com/archives/C4SM0RZ54/p1738030213108609
Change-Id: Ib0287851fea4d125eecda48c5ccb3c73ed85b8f8
Signed-off-by: Jinshan Xiong [email protected]
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: