Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ZED tunable for disk label wait during auto-replace #15327

Closed
wants to merge 1 commit into from

Conversation

don-brady
Copy link
Contributor

@don-brady don-brady commented Sep 28, 2023

Motivation and Context

During an auto-replace of a VDEV device, the device is labeled asynchronously in ZED and it waits until the partition link appears before proceeding with the zpool_vdev_attach().

We have seen cases where ZED doesn't wait long enough for the partition link to appear and so the auto-replace fails. Below is the ZED log message observed:

zed[3752]: zfs_mod: expected replacement disk /dev/disk/by-id/scsi-35000c500a5713fa4-part1 is missing

Note that during a zpool create or zpool add it waits up to 30 seconds when calling zpool_label_disk_wait().

Description

Added a new ZED configuration tunable, ZED_DISK_LABEL_WAIT_TIME which can override the previously hard-coded value of 3000 ms.

Also re-worded the log message when the timeout is reached before the link appears:

zfs_mod: after labeling replacement disk, the expected disk partition link '/dev/disk/by-id/scsi-35000c500a5713fa4-part1' is missing after waiting 3000 ms

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.

How Has This Been Tested?

Tested manually and with ZTS functional/fault/auto_replace_001_pos

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

# to show up before giving up. The time is in milliseconds and
# defaults to 3000 if not specified here.
#
#ZED_DISK_LABEL_WAIT_TIME=3000
Copy link
Contributor

@allanjude allanjude Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default is now 30,000 ms right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the default for ZED is still 3000 ms. And the default for zpool create, i.e., DISK_LABEL_WAIT, remains 30,000 ms. This PR currently is only making it tunable.

In the ZED case, it is waiting after consuming a disk add event for the partition (after the partition was made and sdh1 exists, but other links like under /dev/disk/by-id/ might be inflight) , whereas in the zpool create case it is waiting just after issuing the partition request (before partitioning may have even started) so it makes sense that it is a bit longer.

But you bring up a good point, the wait time is the maximum time it will wait so why not make the default longer for an auto-replace?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem like we are here fixing this because the wait is too short, so yeah, changing the default while here seems like a good idea

During an auto-replace of a VDEV device, the device is labeled
asynchronously in ZED and it waits until the partition link
appears before proceeding with the zpool_vdev_attach().

Added a new ZED configuration tunable, ZED_DISK_LABEL_WAIT_TIME
which can override the previously hard-coded value of 3000 ms.

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.

Signed-off-by: Don Brady <[email protected]>

static int
zed_lookup_tunable(const char *zedlet_dir, const char *tuneable,
unsigned int *wait_time)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is meant to be generic, maybe rename wait_time to tunable_value or something
Maybe also name the function for the type, zed_lookup_uint_tunable(), incase there are other types in the future

#define CONFIG_LINE_MAX 100

static int
zed_lookup_tunable(const char *zedlet_dir, const char *tuneable,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zed_lookup_tunable() is basically a lightweight shell parser. It's going to get tripped up if you do things like ZED_DISK_LABEL_WAIT = 1 or ZED_DISK_LABEL_WAIT=1 or:

if dmidecode | grep "system-that-needs-more-spin-up-time" ; then
	ZED_DISK_LABEL_WAIT = 6000
else
	ZED_DISK_LABEL_WAIT = 3000
fi

I was thinking you could just "execute" zed.rc and then print out all its variables. Or more specifically, have zed call libzfs_run_process_get_stdout_nopath() with bash -c 'set -o allexport && source zed.rc && env' as the command. That function will return an array of lines of output which you can extract the tuneable from. You can use vdev_run_cmd() as a template for how to use libzfs_run_process_get_stdout_nopath() since it's a little complicated.

I am happy that with this PR we would now have a true zed-readable config file (not just for zedlets)! That's something we've wanted for a long time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sscanf does correctly process whitespace on either side of the '='. I tested with many variations, including tab characters.

The format of zed.rc is only implied (name=value) but not otherwise defined AFAIKT. If it is intended to be a bash script then I'm more inclined to used a different config file for the non-zedlet tunables. I'd like to keep it simple and useable for other future non-zedlet tunables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's good that it's interpreting white-space correctly.

The format of zed.rc is only implied (name=value) but not otherwise defined AFAIKT.

The rc (run command) extension typically implies the file is executed as configuration commands. Since we're already executing zed.rc at the beginning of our zedlets, I think it's fair for users to assume it's a script and that they could put other shell commands in there.

I think you're right that a separate zed daemon config file might be the way to go.

@don-brady
Copy link
Contributor Author

Further investigations revealed that the issue causing auto-replace to fail was caused by a regression in ZED and not caused by a short waiting period. For VDEV paths that are under by-id, it was waiting for the old path to show up, which will never happen since the new disk will have a different name under the by-id directory. The by-id paths are typically unique to the device (i.e., based on a serial number).

I'll open a different PR to address the regression and close this one.

@don-brady don-brady closed this Oct 4, 2023
@don-brady don-brady deleted the zed-auto-replace-wait branch October 4, 2023 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants