Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atomic option locking #967

Merged
merged 8 commits into from
Jul 11, 2023
Merged

Atomic option locking #967

merged 8 commits into from
Jul 11, 2023

Conversation

barryhughes
Copy link
Member

@barryhughes barryhughes commented Jun 29, 2023

Updates the Option Lock implementation to reduce the potential for race conditions.

Previously, and with reference to ActionScheduler_QueueRunner::maybe_dispatch_async_request(), we had this pattern:

if ( is_admin() && ! ActionScheduler::lock()->is_locked( 'async-request-runner' ) ) {
	// Only start an async queue at most once every 60 seconds
	ActionScheduler::lock()->set( 'async-request-runner' );
	$this->async_request->maybe_dispatch();
}

The problem is that another process may obtain the lock immediately after we enter the if {} block, but before we set the lock. As noted in the linked issue, this isn't just theoretical: some cases of this happening have been detected in the wild. This change attempts to address the problem in the following ways:

  • When calling $lock->set(), it will only successfully claim the lock and return true if another process did not get there first. Therefore, we now only try to dispatch the async queue runner request if calling $lock->set() returns true.
  • The public interface remains unchanged, since there may be alternative (custom) locking implementations in use.
  • Related to the previous point, within the Queue Runner we still call $lock->is_locked() because, even though this should now be superfluous when the updated Option Lock implementation is in use, we don't actually know that that is the active implementation.
  • Lastly, though we are still using the options table, we are no longer using the options API. That's a trade-off, but it should mean we gain better protection against race conditions.

Closes #793.


Testing instructions

Code review and some free-style testing (such as single-stepping through the locking process with XDebug) may be the best way to test and examine the changes to the locking system, but I will anyway outline a fairly manual method of testing here. These testing instructions do not try to simulate the sort of highly concurrent conditions that would lead to a race condition, they simply ensure we haven't obviously broken the way async queue running works.

First, we should disable any methods of triggering the queue runner except for the async queue runner:

  • Disable WP Cron by adding define( 'DISABLE_WP_CRON', true ); to your wp-config.php file.
  • If you use system crontab (or equivalent), remove or disable any rules that trigger the queue runner via WP CLI, or that trigger WP Cron.

You may also wish to manually generate some scheduled actions using this one-liner:

wp eval "for ( \$i = 0; \$i <= 50; \$i++ ) as_enqueue_async_action( 'foo-' . rand( 1000, 9999 ) );"

That should give us 50 scheduled actions that are 'past due'.

Before we go any further I want to highlight that, by default, the locking system is used in such a way that a new queue runner is spawned no more frequently than once every 60 seconds.

Visit Tools ▸ Scheduled Actions ... you should see at least 50 actions are past-due (there may be more, depending on how many were already present in your test site ... and, if you do not see any, it's possible a queue runner just finished its work and so you may need to re-run the above snippet a second time). Also, there should be an indication as to when the next queue runner will start:

Screenshot 2023-06-29 at 16 14 31
  • If you refresh the page before the next queue runner is initiated, then the past-due count should not change.
  • If you wait and then refresh after the queue runner should have started , then the past-due count should reduce significantly.

This should be enough to assure us that:

  • The lock is still respected.
  • Async queue runners are still being spawned.

Changelog

Fix - Improve locking mechanism to protect against race conditions.

Note that we still check is_locked() because, although the option lock implementation has been updated in such a way that we can rely on just calling set(), we do not know that that is the active implementation.
@barryhughes barryhughes requested review from a team and coreymckrill and removed request for a team June 29, 2023 23:26
@barryhughes barryhughes added this to the 3.6.2 milestone Jun 29, 2023
@coreymckrill
Copy link
Contributor

@barryhughes I may be doing something wrong, but the queue runner doesn't seem to be doing anything, either in trunk or on this branch. I've got DISABLE_WP_CRON set to true, and I start with a completely empty queue. Then I generate the 50 actions, and get the notice that the next queue runner will start in X seconds. However, after it runs, I still have the same 50 actions listed as both pending and past due...

@barryhughes
Copy link
Member Author

Hmm 🤔

I'm not immediately sure, but I wonder if it could be a local issue of some kind (local loopback prohibited?) ... here's something I tried:

  • Checked this branch out, and created the zip via npm run build.
  • Created a new sandbox site (JN).
  • Disabled default plugins that were pre-installed.
  • Uploaded and activated the Action Scheduler build.
  • Generated the test actions via WP CLI.
  • Tested from there, everything seemed to work.

I'll share the creds I used for the sandbox site and, if you login, you will see 51 failed actions from my test (they failed because of course there is no callback function, but that's completely expected) and you can re-test there if you like (or do so in a fresh sandbox environment)?

Copy link
Contributor

@coreymckrill coreymckrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, got it working on JN. Must be something to do with the wp-env setup.

It looks good to me, but I'd be curious if @brandonpayton has any thoughts as well, since he opened the issue.

Copy link

@brandonpayton brandonpayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great to see. Thank you very much for working on it, @barryhughes! 🙇

✅ The main thing is that the UPDATE depends upon the previous known value to avoid the race.

I left one comment because it seems like the changes could be simpler, but it's possible I'm missing something.

Thanks again!

Comment on lines 29 to 30
$existing_lock_value = $this->get_or_create_lock( $lock_type );
$expiration_timestamp = $this->get_expiration_from( $existing_lock_value );

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we could not just get expiration and, depending on the result, INSERT or UPDATE?

It seems unnecessary to potentially INSERT a lock option value in order to conditionally UPDATE against that value later on. Currently, this requires three queries:

  1. INSERT lock if none exists
  2. SELECT lock value
  3. UPDATE if old value is expected value

If we conditionally INSERT or UPDATE, it would require two queries:

  1. SELECT lock value if any
  2. INSERT or UPDATE depending on whether there is an existing and expired lock value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely ... a little unsure why I did things this way. In any case, mind taking a further peek (especially at 895e8d0)?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely ... a little unsure why I did things this way.

I've had similar experiences :)

In any case, mind taking a further peek (especially at 895e8d0)?

Sure thing!

Copy link

@brandonpayton brandonpayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested, but the latest INSERT/UPDATE logic looks good to me.

Having a race in a scheduler undermines its purpose for being. It's great it will be fixed. Thanks again for fixing it!

At least as far back as WP 5.2 (the oldest version we support), the option_name column is defined as unique: therefore we can rely on MySQL to avoid duplicates. Of course, if the schema has been modified and the unique constraint removed, that would be problematic -- but it seems reasonable for us to assume that isn't the case.
@barryhughes
Copy link
Member Author

@brandonpayton thanks for the reviews so far ... my apologies but could I get one more check? Specifically on b215f72.

Being mindful of recent conversations about the inability to rely on the dual table (ie, if HyperDB is being used), I felt it best to update the INSERT statement. I've basically simplified it to a regular $wpdb->insert and felt this was safe, since the option_name column has a unique constraint (therefore, it should not be possible to create two entries with the same option name, and MySQL will take care of enforcing this).

I realize this might be problematic if that constraint has been removed, but I'm assuming that generally is not the case. Checking in, though, in case you have any thoughts on this—particularly in relation to the platform(s) you work on.

Copy link

@brandonpayton brandonpayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @barryhughes, it's funny you ended up making that change because, in my previous review, I almost asked why we needed the WHERE clause given that MySQL would enforce option name uniqueness. But one thing the previous query did was to avoid the INSERT attempt resulting in a MySQL error, and I decided not to say anything because that seemed fine too. :)

Either way works for me.

IMO it is totally reasonable to rely upon options having unique names in the wp_options table. If someone removes that constraint, they have change the underlying and widely assumed truth of the WP DB, and WP would probably have bigger problems than broken action scheduler locking.

Thanks!

@barryhughes
Copy link
Member Author

But one thing the previous query did was to avoid the INSERT attempt resulting in a MySQL error

Yep, that came to mind here—but I think it's a reasonable trade-off. Thanks again for checking on this one 👍

@barryhughes barryhughes merged commit c355dbf into trunk Jul 11, 2023
38 checks passed
@barryhughes barryhughes deleted the fix/793-atomic-option-lock branch July 11, 2023 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race conditions affecting ActionScheduler_QueueRunner::maybe_dispatch_async_request()
3 participants