Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-115999: Specialize LOAD_GLOBAL in free-threaded builds #126607

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

mpage
Copy link
Contributor

@mpage mpage commented Nov 9, 2024

Enable specialization of LOAD_GLOBAL in free-threaded builds.

Thread-safety of specialization in free-threaded builds is provided by the following:

  • A critical section is held on both the globals and builtins objects during specialization. This ensures we get an atomic view of both builtins and globals during specialization.
  • Generation of new keys versions is made atomic in free-threaded builds.
  • Existing helpers are used to atomically modify the opcode.

Thread-safety of specialized instructions in free-threaded builds is provided by the following:

  • Relaxed atomics are used when loading and storing dict keys versions. This avoids potential data races as the dict keys versions are read without holding the dictionary's per-object lock in version guards.
  • Dicts keys objects are passed from keys version guards to the downstream uops. This ensures that we are loading from the correct offset in the keys object. Once a unicode key has been stored in a keys object for a combined dictionary in free-threaded builds, the offset that it is stored in will never be reused for a different key. Once the version guard passes, we know that we are reading from the correct offset.
  • The dictionary read fast-path is used to read values from the dictionary once we know the correct offset.

Performance-wise, this looks like a 3-4% improvement on free-threaded builds on pyperformance.

Thread-safety of specialization in free-threaded builds:

- A critical section is held on both the globals and builtins objects during
  specialization. This ensures we get an atomic view of both builtins and globals
  during specialization.
- Generation of new keys versions is made atomic in free-threaded builds.
- We use helpers safely modify the bytecode.

Thread-safety of specialized instructions in free-threaded builds:

- Dicts keys objects are passed from keys version guards to the downstream uops.
  This ensures that we are loading from the correct offset in the keys object.
  Once a unicode key has been stored in a keys object for a combined dictionary
  in free-threaded builds, the offset that it is stored in will never be reaused
  for a different key.
- The dictionary read fast-path is used to read values from the dictionary.
- Relaxed atomics are used when loading and storing dict keys versions. This
  avoids potential data races as the dict keys versions may be read without
  holding the dictionary's per-object lock while the instructions are executing.
This handles the case where another thread is instrumenting the bytecode.
@mpage mpage marked this pull request as ready for review November 9, 2024 02:00
@mpage mpage requested review from colesbury and Yhg1s November 9, 2024 02:00
Copy link
Member

@Fidget-Spinner Fidget-Spinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was less complicated than I initially anticipated. Nice!

LGTM overall, but I'll withold from approval to let others review it first.

Copy link
Contributor

@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! A few comments below.

@@ -4498,7 +4498,7 @@ dict_popitem_impl(PyDictObject *self)
return NULL;
}
}
self->ma_keys->dk_version = 0;
FT_ATOMIC_STORE_UINT32_RELAXED(self->ma_keys->dk_version, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one more assignment to dk_version in insert_split_key that should use this pattern.

void
_Py_Specialize_LoadGlobal(
static void
specialize_load_global_lock_held(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should mark the dicts as "shared" (_PyObject_GC_SET_SHARED / SET_DICT_SHARED) when we specialize for load global. That ensures that the keys object will not be immediately freed when the dictionary is resized, which is important if another thread is trying to read the keys without holding a lock.

The dict code enforces this with ensure_shared_on_read(), but I think we should have the version check subsume this: if the dict has a valid version, then it's marked as shared (and we can attempt the non-locking read without checking the owning thread).

So maybe ensure the dict is shared in _PyDictKeys_GetVersionForCurrentState.

Comment on lines +1637 to +1640
if (globals_keys != _Py_atomic_load_ptr_acquire(&dict->ma_keys)) {
Py_DECREF(res_o);
DEOPT_IF(true);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these checks that the keys haven't changed are not necessary. We have them in dictobject.c too, and I think they're not necessary there either.

I think the checks were inherited from the nogil forks, which used a different scheme for dictionary keys.

(Also the use of global_keys here seems to contradict the DEAD(globals_keys) above.)

DEAD(globals_keys);
SYNC_SP();
DEOPT_IF(res_o == NULL);
#if Py_GIL_DISABLED
int increfed = _Py_TryIncrefCompare(&entries[index].me_value, res_o);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to support deferred reference counting here. Loading functions to call them will use LOAD_GLOBAL_MODULE and LOAD_GLOBAL_BUILTINS.

I don't think we have the equivalent of _Py_TryIncrefCompare for _PyStackRef yet, it needs to be written.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants