runner: fix the return value when reopening the device fails in cmd STPG #537

Open
lmh10144360

If close() succeeds but open() then fails in
tcmu_acquire_dev_lock, the device is never reopened
after the STPG command fails. Change the return value
from TCMU_STS_FENCED to TCMU_STS_TIMEOUT so that the
device is put on the recovery list.

Signed-off-by: 李明辉10144360 [email protected]

@mikechristie
Collaborator

I didn't understand the problem description. Did this patch:

commit 08e3a0e
Author: Mike Christie [email protected]
Date: Tue Sep 11 18:48:13 2018 -0500

runner: don't drop iscsi connection on lock fence errors

cause a bug for you?

We can't go into the if block that calls tcmu_notify_conn_lost, because that causes path bouncing, which can end with the command retries being used up and paths not being added.

In your case is the following happening:

  1. TCMU_STS_FENCED is returned to tcmu_explicit_transition and that gets translated to a SCSI BUSY status.
  2. The initiator gets BUSY and retries.
  3. tcmu-runner gets the STPG and calls tcmu_acquire_dev_lock again. It sees that TCMUR_DEV_FLAG_IS_OPEN is not set, so it calls tcmu_reopen_dev.

In your case, is the initiator not retrying the BUSY status? Is that because the initiator does not retry that SCSI status code, or because the command has been retried so many times that it has run out of retries? If either of those, what is the initiator OS?
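
For readers following along, the retry flow described in the three steps above can be sketched roughly as below. This is only a toy model, not tcmu-runner code: every `sketch_*` name, the `opens_to_fail` counter, and the enum values are hypothetical stand-ins, and the only facts taken from this thread are that TCMU_STS_FENCED is reported to the initiator as SCSI BUSY and that a closed device is reopened on the next lock attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes; names echo the thread, values are made up. */
enum sketch_sts { STS_OK, STS_FENCED, STS_TIMEOUT };

/* SCSI status bytes (GOOD and BUSY). */
enum { SCSI_GOOD = 0x00, SCSI_BUSY = 0x08 };

struct sketch_dev {
    bool is_open;        /* stands in for TCMUR_DEV_FLAG_IS_OPEN */
    int  opens_to_fail;  /* how many more reopen attempts will fail */
    int  reopen_calls;   /* bookkeeping for the example */
};

/* Stand-in for tcmu_reopen_dev: fails while opens_to_fail > 0. */
static bool sketch_reopen(struct sketch_dev *dev)
{
    assert(dev != NULL);
    dev->reopen_calls++;
    if (dev->opens_to_fail > 0) {
        dev->opens_to_fail--;
        return false;
    }
    dev->is_open = true;
    return true;
}

/* Stand-in for tcmu_acquire_dev_lock: reopen first if the device is closed. */
static enum sketch_sts sketch_acquire_lock(struct sketch_dev *dev)
{
    if (!dev->is_open && !sketch_reopen(dev))
        return STS_FENCED;
    return STS_OK;
}

/* Target side, one STPG: FENCED is reported to the initiator as BUSY. */
static int sketch_handle_stpg(struct sketch_dev *dev)
{
    return sketch_acquire_lock(dev) == STS_FENCED ? SCSI_BUSY : SCSI_GOOD;
}

/* Initiator side: resend the STPG while it gets BUSY, up to max_retries. */
static int sketch_initiator_stpg(struct sketch_dev *dev, int max_retries)
{
    int status = sketch_handle_stpg(dev);
    for (int i = 0; i < max_retries && status == SCSI_BUSY; i++)
        status = sketch_handle_stpg(dev);
    return status;
}
```

In this model a device whose open() fails twice recovers on the third STPG, while a device whose open() keeps failing leaves the initiator with BUSY once its retries are spent.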

@lmh10144360
Author

Thanks for the reply!
Yes, it is being retried, but once the command timeout is reached at the SCSI level,
the command is completed as failed.
With the initiator OS CentOS 7.4, the timeout is set to 360 s. If the STPG does not
return OK within 360 s, the command fails and is not retried any more.
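
To make the timing argument concrete, here is a small sketch of retries bounded by a command timeout. It is a simplified model under stated assumptions, not initiator code: the function name, the fixed retry interval, and the `ready_at_s` parameter (the moment the target stops answering BUSY) are all hypothetical; only the 360 s timeout comes from the comment above.

```c
#include <assert.h>

/* Outcome of one STPG as seen by the initiator in this sketch. */
enum stpg_result { STPG_OK, STPG_FAILED_TIMEOUT };

/*
 * Hypothetical model: the target answers BUSY until ready_at_s seconds
 * have passed, and the initiator resends every retry_interval_s seconds.
 * Once timeout_s has elapsed, the command is failed and not retried.
 * The number of attempts made is stored in *attempts.
 */
static enum stpg_result stpg_with_deadline(int ready_at_s,
                                           int retry_interval_s,
                                           int timeout_s,
                                           int *attempts)
{
    assert(retry_interval_s > 0);
    assert(attempts != 0);
    *attempts = 0;
    for (int now = 0; now < timeout_s; now += retry_interval_s) {
        (*attempts)++;
        if (now >= ready_at_s)
            return STPG_OK;          /* target finally accepted the STPG */
    }
    return STPG_FAILED_TIMEOUT;      /* deadline hit, no further retries */
}
```

With an assumed 10 s retry interval and the 360 s timeout, a device that becomes ready after 30 s succeeds on the fourth attempt, while one that never becomes ready fails after 36 attempts.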

@mikechristie
Collaborator

Hey, sorry for the late reply.

It looks like you have the same problem with TCMU_STS_TIMEOUT. Won't you have limited retries for that case too?

What is your pg_init_retries set to on the initiator side in /etc/multipath.conf?

@lxbsz lxbsz changed the base branch from master to main August 10, 2022 00:22