Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What does it mean for an atomic_ref to have work_item scope? #665

Open
Seanst98 opened this issue Nov 27, 2024 · 8 comments
Open

What does it mean for an atomic_ref to have work_item scope? #665

Seanst98 opened this issue Nov 27, 2024 · 8 comments

Comments

@Seanst98
Copy link

What is the expected behaviour of an atomic_ref with work_item scope? Does it make sense to apply a per-thread atomic?

The memory ordering enabled by the atomic_ref may still provide some benefit to individual threads, however, couldn't the same be achieved with atomic_fences instead of using an atomic_ref?

Are there any cases where work_item scoped atomics are meaningful, other than the aforementioned case?

How should an implementation handle work_item scoped atomics?

@Pennycook
Copy link
Contributor

This is an area of the consistency model that we should probably clarify. OpenCL defines some of these things a little more precisely, but I don't think we should adopt the OpenCL wording.

What is the expected behaviour of an atomic_ref with work_item scope? Does it make sense to apply a per-thread atomic?

I think the behavior should be: 1) the operation is performed atomically; 2) there are no ordering constraints across work-items.

The memory ordering enabled by the atomic_ref may still provide some benefit to individual threads, however, couldn't the same be achieved with atomic_fences instead of using an atomic_ref?

Probably, but sequential consistency across threads can be achieved using either an atomic_ref using seq_cst or an atomic_ref using relaxed followed by atomic_fences using seq_cst. I don't think the fact that there are multiple ways to achieve the same effect is a bug in the consistency model, and you could view atomic_ref with work_item scope as a short-hand for generating the appropriate fences.

Are there any cases where work_item scoped atomics are meaningful, other than the aforementioned case?

In OpenCL, work_item scoped fences are required to provide read-write coherence to image objects. I don't know enough about images to know whether we need similar wording here, so I defer to @AerialMantis. One concern I have here is that C++ would say sampling an image and accessing memory is a race condition.

Regardless, I don't think we would want to adopt exactly the same wording. OpenCL says you can't use work_item scope on an atomic operation, which seems unnecessarily restrictive. The fact that a work_item scope fence only affects images also seems very narrow to me, and we may want to consider generalizing to cover other asynchronous operations initiated by a work-item.

How should an implementation handle work_item scoped atomics?

I think this may be implementation- and device-specific, so it's hard to answer in general. But if an implementation can guarantee atomicity and ordering within a work-item without use of special instructions, work_item-scoped atomics could just lower directly to non-atomic instructions with no fences. If an implementation requires work_item-scoped fences to ensure coherence in some corner-cases, it should generate whatever instructions and fences are required.

@gmlueck
Copy link
Contributor

gmlueck commented Dec 13, 2024

@Pennycook: We talked about this a bit in the Friday meeting after you left. It seems like the key question is whether the scope parameter applies only to the memory ordering or to both the memory ordering and the scope of atomicity. This question is relevant for all values of scope, not just memory_scope::work_item. For example, is an atomic operation with memory_scope::work_group guaranteed to be atomic even with work-items in a different work-group?

Your response above:

I think the behavior should be: 1) the operation is performed atomically; 2) there are no ordering constraints across work-items.

seems to imply that you think the scope parameter applies only to the memory ordering and not to the scope of atomicity. Do you think that atomic_ref operations are required to be atomic with respect to all other work-items, regardless of the value of scope? It seems like this might affect performance on some machines.

If we decide that scope affects both the memory ordering and also the scope of atomicity, then I think memory_scope::work_item means that the operation is not guaranteed to be atomic w.r.t. any other work-item.

We think there is probably a similar issue with OpenCL, but we haven't checked to see what OpenCL says here. Regardless, this is something we should clarify in the SYCL spec. Does scope affect only the memory ordering or both the memory ordering and also the scope of atomicity?

@Pennycook
Copy link
Contributor

It's really difficult to talk about any of this stuff precisely, because the memory consistency model is underspecified. We still have this note in the specification:

The addition of memory scopes to the C++ memory model modifies the definition of some concepts from the C++ core language. For example: data races, the synchronizes-with relationship and sequential consistency must be defined in a way that accounts for atomic operations with differing (but compatible) scopes, in a manner similar to the OpenCL 2.0 specification. Efforts to formalize the memory model of SYCL are ongoing, and a formal memory model will be included in a future version of the SYCL specification.

I agree we should try and improve things here, but I think any discussions we have about this will involve a lot of hand-waving until we can find somebody with both the time and necessary expertise to help us formalize the model.

But, to try and answer your question... I don't think we want to get in a situation where only some operations performed via an atomic_ref are atomic. ISO C++ says:

An atomic_ref object applies atomic operations ([atomics.general]) to the object referenced by *ptr such that, for the lifetime ([basic.life]) of the atomic_ref object, the object referenced by *ptr is an atomic object ([intro.races]).

When I said that the accesses were atomic, above, I didn't mean to imply that they were (necessarily) visible to other work-items. I really just meant that any accesses via atomic_ref should be considered "atomic" when reasoning about memory consistency at work-item scope.

For example, if we say that image operations initiated by a work-item are also "atomic", then the purpose of a work-item scoped atomic operation would be to ensure that an application sees either the results of updating memory via some image operation or the results of updating it via an atomic_ref, but never some weird interleaving of the two. If we say that image operations are not "atomic", then atomic_ref with work-item scope would probably be equivalent to a non-atomic operation. (Even with the latter interpretation, I would argue there's no downside to allowing memory_scope::work_item to be used as a parameter, and implementations should support it for completeness.)

Rather than trying to incorporate scope into the meaning of "atomic", I think what we want to do here is modify the definition of terms like "data race" and "synchronizes-with", to say something like:

  • Atomic operations only prevent data races if all potentially concurrent modifications of a given memory location use atomic operations with compatible scopes;
  • An atomic operation can only synchronize-with another atomic operation if they have compatible scopes.

I'm hand-waving my way through "compatible scopes" because I'm not sure what compatibility means. Things are very easy to define when the scopes are equal, but there are some cases that should still be well-defined even if the scopes are different (e.g., I'd expect a release operation with work-group scope to synchronize-with an acquire operation with device-scope, if the work-items executing the operations were in the same work-group).

It looks like CUDA took a similar approach (see here) but I'm not sure about their wording.

@gmlueck
Copy link
Contributor

gmlueck commented Dec 16, 2024

Lets consider a concrete example:

int *pint = sycl::malloc_shared<int>(1, q);
*pint = 0;
q.parallel_for(ndr, [=](sycl::nd_item ndi) {
  sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::work_item> ar{*pint};
  int i = ar.fetch_add(1);
});

Note that the memory order is relaxed, so this is not a synchronization operation. In C++, std::atomic_ref with relaxed synchronization is still guaranteed to perform an atomic operation on the referenced value. In fact, incrementing a counter is a typical use case for relaxed atomics.

Also note that the memory scope is work_item. There is no analog to this in C++, so we can't look there for guidance.

Does the code snippet above guarantee that each work-item gets a unique value for i despite the fact that the memory scope of the atomic operation is only work-item?

I think @bashbaug and I are arguing that the answer should be "no". Are you saying "yes"?

@Pennycook
Copy link
Contributor

Does the code snippet above guarantee that each work-item gets a unique value for i despite the fact that the memory scope of the atomic operation is only work-item?

I think Ben and I are arguing that the answer should be "no". Are you saying "yes"?

I'm also saying "No", but I think we might disagree on how to describe the behavior.

I think we should describe your snippet as performing atomic operations -- via an atomic reference of the memory location pointed to by pint -- but clarify that the usage of incompatible scopes on concurrent atomic operations leads to undefined behavior.

I think you and Ben are suggesting that we say something like "atomics with work_item scope are not guaranteed to perform an atomic operation", which seems confusing to me.

@gmlueck
Copy link
Contributor

gmlueck commented Dec 16, 2024

I think you and Ben are suggesting that we say something like "atomics with work_item scope are not guaranteed to perform an atomic operation", which seems confusing to me.

I'm not suggesting that we write that into the spec. I think the spec should say that the scope parameter tells the set of work-items across which the operation is guaranteed to be atomic. It seems like we are in agreement here.

I think an outcome of this wording is that work_item scope provides no guarantees of atomicity at all because atomicity is only relevant between two different work-items.

@Pennycook
Copy link
Contributor

I think we're in agreement about which behaviors are invalid, but I'm still not sure that atomicity is necessarily only relevant between different work-items -- I don't understand why OpenCL says that there's no guarantee of read-write coherence within a work-item when images are involved, and I think we need to understand that before deciding work_item scope is useless.

I want to continue discussing the wording later, but I know what you mean and agree with the intent.

@gmlueck
Copy link
Contributor

gmlueck commented Dec 16, 2024

I don't understand why OpenCL says that there's no guarantee of read-write coherence within a work-item when images are involved, and I think we need to understand that before deciding work_item scope is useless.

I looked into this once. OpenCL has special rules around the APIs that read and write images. The API that writes an image is not guaranteed to be coherent with subsequent APIs that read the same image, even if the write and read operations are in the same work-item. If you want the effect of a write operation to be visible to a subsequent read, you need to make a special call to atomic_work_item_fence. This is explained here:

https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#reading-and-writing-to-the-same-image-in-a-kernel

I don't know the history of this limitation. I'm guessing that some older hardware didn't provide coherence between writes and reads to image memory.

I'm not sure if SYCL needs anything similar here. It could be that SYCL is intended to run on newer, more capable hardware.

I think this is a separate topic, though. The issue in OpenCL with image writes/reads isn't really related to atomic operations as far a I can tell. The fact that it uses an API with the name "atomic" in it seems like a misnomer to me. If we do need some operation like this in SYCL, I think we should find a better name -- maybe something like flush_image_writes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants