Atomic option locking #967

barryhughes · 2023-06-29T23:23:53Z

Updates the Option Lock implementation to reduce the potential for race conditions.

Previously, and with reference to ActionScheduler_QueueRunner::maybe_dispatch_async_request(), we had this pattern:

if ( is_admin() && ! ActionScheduler::lock()->is_locked( 'async-request-runner' ) ) {
	// Only start an async queue at most once every 60 seconds
	ActionScheduler::lock()->set( 'async-request-runner' );
	$this->async_request->maybe_dispatch();
}

The problem is that another process may obtain the lock immediately after we enter the if {} block, but before we set the lock. As noted in the linked issue, this isn't just theoretical: some cases of this happening have been detected in the wild. This change attempts to address the problem in the following ways:

When calling $lock->set(), it will only successfully claim the lock and return true if another process did not get there first. Therefore, we now only try to dispatch the async queue runner request if calling $lock->set() returns true.
The public interface remains unchanged, since there may be alternative (custom) locking implementations in use.
Related to the previous point, within the Queue Runner we still call $lock->is_locked() because, even though this should now be superfluous when the updated Option Lock implementation is in use, we don't actually know that that is the active implementation.
Lastly, though we are still using the options table, we are no longer using the options API. That's a trade-off, but it should mean we gain better protection against race conditions.

Closes #793.

Testing instructions

Code review and some free-style testing (such as single-stepping through the locking process with XDebug) may be the best way to test and examine the changes to the locking system, but I will anyway outline a fairly manual method of testing here. These testing instructions do not try to simulate the sort of highly concurrent conditions that would lead to a race condition, they simply ensure we haven't obviously broken the way async queue running works.

First, we should disable any methods of triggering the queue runner except for the async queue runner:

Disable WP Cron by adding define( 'DISABLE_WP_CRON', true ); to your wp-config.php file.
If you use system crontab (or equivalent), remove or disable any rules that trigger the queue runner via WP CLI, or that trigger WP Cron.

You may also wish to manually generate some scheduled actions using this one-liner:

wp eval "for ( \$i = 0; \$i <= 50; \$i++ ) as_enqueue_async_action( 'foo-' . rand( 1000, 9999 ) );"

That should give us 50 scheduled actions that are 'past due'.

Before we go any further I want to highlight that, by default, the locking system is used in such a way that a new queue runner is spawned no more frequently than once every 60 seconds.

Visit Tools ▸ Scheduled Actions ... you should see at least 50 actions are past-due (there may be more, depending on how many were already present in your test site ... and, if you do not see any, it's possible a queue runner just finished its work and so you may need to re-run the above snippet a second time). Also, there should be an indication as to when the next queue runner will start:

If you refresh the page before the next queue runner is initiated, then the past-due count should not change.
If you wait and then refresh after the queue runner should have started , then the past-due count should reduce significantly.

This should be enough to assure us that:

The lock is still respected.
Async queue runners are still being spawned.

Changelog

Fix - Improve locking mechanism to protect against race conditions.

…ions.

Note that we still check is_locked() because, although the option lock implementation has been updated in such a way that we can rely on just calling set(), we do not know that that is the active implementation.

coreymckrill · 2023-06-30T18:49:14Z

@barryhughes I may be doing something wrong, but the queue runner doesn't seem to be doing anything, either in trunk or on this branch. I've got DISABLE_WP_CRON set to true, and I start with a completely empty queue. Then I generate the 50 actions, and get the notice that the next queue runner will start in X seconds. However, after it runs, I still have the same 50 actions listed as both pending and past due...

barryhughes · 2023-06-30T19:27:55Z

Hmm 🤔

I'm not immediately sure, but I wonder if it could be a local issue of some kind (local loopback prohibited?) ... here's something I tried:

Checked this branch out, and created the zip via npm run build.
Created a new sandbox site (JN).
Disabled default plugins that were pre-installed.
Uploaded and activated the Action Scheduler build.
Generated the test actions via WP CLI.
Tested from there, everything seemed to work.

I'll share the creds I used for the sandbox site and, if you login, you will see 51 failed actions from my test (they failed because of course there is no callback function, but that's completely expected) and you can re-test there if you like (or do so in a fresh sandbox environment)?

coreymckrill

Okay, got it working on JN. Must be something to do with the wp-env setup.

It looks good to me, but I'd be curious if @brandonpayton has any thoughts as well, since he opened the issue.

brandonpayton

This is great to see. Thank you very much for working on it, @barryhughes! 🙇

✅ The main thing is that the UPDATE depends upon the previous known value to avoid the race.

I left one comment because it seems like the changes could be simpler, but it's possible I'm missing something.

Thanks again!

brandonpayton · 2023-07-01T03:27:27Z

classes/ActionScheduler_OptionLock.php

+		$existing_lock_value  = $this->get_or_create_lock( $lock_type );
+		$expiration_timestamp = $this->get_expiration_from( $existing_lock_value );


Is there a reason we could not just get expiration and, depending on the result, INSERT or UPDATE?

It seems unnecessary to potentially INSERT a lock option value in order to conditionally UPDATE against that value later on. Currently, this requires three queries:

INSERT lock if none exists

SELECT lock value

UPDATE if old value is expected value

If we conditionally INSERT or UPDATE, it would require two queries:

SELECT lock value if any

INSERT or UPDATE depending on whether there is an existing and expired lock value

Definitely ... a little unsure why I did things this way. In any case, mind taking a further peek (especially at 895e8d0)?

Definitely ... a little unsure why I did things this way.

I've had similar experiences :)

In any case, mind taking a further peek (especially at 895e8d0)?

Sure thing!

brandonpayton

I haven't tested, but the latest INSERT/UPDATE logic looks good to me.

Having a race in a scheduler undermines its purpose for being. It's great it will be fixed. Thanks again for fixing it!

At least as far back as WP 5.2 (the oldest version we support), the option_name column is defined as unique: therefore we can rely on MySQL to avoid duplicates. Of course, if the schema has been modified and the unique constraint removed, that would be problematic -- but it seems reasonable for us to assume that isn't the case.

barryhughes · 2023-07-07T14:48:50Z

@brandonpayton thanks for the reviews so far ... my apologies but could I get one more check? Specifically on b215f72.

Being mindful of recent conversations about the inability to rely on the dual table (ie, if HyperDB is being used), I felt it best to update the INSERT statement. I've basically simplified it to a regular $wpdb->insert and felt this was safe, since the option_name column has a unique constraint (therefore, it should not be possible to create two entries with the same option name, and MySQL will take care of enforcing this).

I realize this might be problematic if that constraint has been removed, but I'm assuming that generally is not the case. Checking in, though, in case you have any thoughts on this—particularly in relation to the platform(s) you work on.

brandonpayton

Hi @barryhughes, it's funny you ended up making that change because, in my previous review, I almost asked why we needed the WHERE clause given that MySQL would enforce option name uniqueness. But one thing the previous query did was to avoid the INSERT attempt resulting in a MySQL error, and I decided not to say anything because that seemed fine too. :)

Either way works for me.

IMO it is totally reasonable to rely upon options having unique names in the wp_options table. If someone removes that constraint, they have change the underlying and widely assumed truth of the WP DB, and WP would probably have bigger problems than broken action scheduler locking.

Thanks!

barryhughes · 2023-07-11T17:06:05Z

But one thing the previous query did was to avoid the INSERT attempt resulting in a MySQL error

Yep, that came to mind here—but I think it's a reasonable trade-off. Thanks again for checking on this one 👍

barryhughes added 4 commits June 29, 2023 15:10

Describe how option locks should handle possible concurrent claims.

b5c21ec

Update Option Lock implementation to reduce potential for race condit…

636e298

…ions.

Check that lock was set, before proceeding to dispatch async request.

d1a5342

Note that we still check is_locked() because, although the option lock implementation has been updated in such a way that we can rely on just calling set(), we do not know that that is the active implementation.

Add guidance for concrete lock::set implementations.

530baa1

barryhughes requested review from a team and coreymckrill and removed request for a team June 29, 2023 23:26

barryhughes added this to the 3.6.2 milestone Jun 29, 2023

coreymckrill approved these changes Jun 30, 2023

View reviewed changes

coreymckrill requested a review from brandonpayton June 30, 2023 20:58

brandonpayton reviewed Jul 1, 2023

View reviewed changes

Be efficient! Be lazy in terms of creating the lock option.

895e8d0

barryhughes requested a review from brandonpayton July 5, 2023 14:54

Unquote column names.

62f0d33

brandonpayton approved these changes Jul 7, 2023

View reviewed changes

barryhughes added 2 commits July 7, 2023 07:39

Cleanup.

bd2d81a

barryhughes requested a review from brandonpayton July 7, 2023 14:48

brandonpayton approved these changes Jul 11, 2023

View reviewed changes

barryhughes merged commit c355dbf into trunk Jul 11, 2023
38 checks passed

barryhughes deleted the fix/793-atomic-option-lock branch July 11, 2023 17:06

andrewklimek mentioned this pull request Jul 21, 2024

Race conditions affecting ActionScheduler_QueueRunner::maybe_dispatch_async_request() #793

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Atomic option locking #967

Atomic option locking #967

barryhughes commented Jun 29, 2023 •

edited

Loading

coreymckrill commented Jun 30, 2023

barryhughes commented Jun 30, 2023

coreymckrill left a comment

brandonpayton left a comment

brandonpayton Jul 1, 2023

barryhughes Jul 5, 2023

brandonpayton Jul 7, 2023

brandonpayton left a comment

barryhughes commented Jul 7, 2023

brandonpayton left a comment

barryhughes commented Jul 11, 2023

		$existing_lock_value = $this->get_or_create_lock( $lock_type );
		$expiration_timestamp = $this->get_expiration_from( $existing_lock_value );

Atomic option locking #967

Atomic option locking #967

Conversation

barryhughes commented Jun 29, 2023 • edited Loading

Testing instructions

Changelog

coreymckrill commented Jun 30, 2023

barryhughes commented Jun 30, 2023

coreymckrill left a comment

Choose a reason for hiding this comment

brandonpayton left a comment

Choose a reason for hiding this comment

brandonpayton Jul 1, 2023

Choose a reason for hiding this comment

barryhughes Jul 5, 2023

Choose a reason for hiding this comment

brandonpayton Jul 7, 2023

Choose a reason for hiding this comment

brandonpayton left a comment

Choose a reason for hiding this comment

barryhughes commented Jul 7, 2023

brandonpayton left a comment

Choose a reason for hiding this comment

barryhughes commented Jul 11, 2023

barryhughes commented Jun 29, 2023 •

edited

Loading