Draft: Desul ordered atomic policies + litmus tests #1616
base: develop
Conversation
Fiddling around with some parameters for the litmus test driver:
- It seems that having only a subset of the running blocks participate in the Message Passing litmus test increases the rate at which weak memory behaviors are observed.
- Pre-stressing memory doesn't seem to help on NVIDIA V100s.
Store buffering is an observable behavior in which an earlier store may be reordered after a later load. This test exercises MemoryOrderSeqCst.
- Use a forall device kernel to check results
- Interleave the order of operations between testing threads
- Only warn on a lack of observed relaxed behaviors
Correctly use the stress testing formulation from the paper, "Foundations of Empirical Memory Consistency Testing" (OOPSLA 2020). Instead of having all stressing blocks scatter their accesses across the "stressing" array, select a small-ish subset of 64-word lines and stripe them across the stressing blocks. This increases the stress on the contention hardware in a GPU. Synchronize testing blocks and stressing blocks together on each iteration.
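A sketch of the striping described above; the identifiers and sizes here are illustrative assumptions, not the PR's actual code:

// Illustrative striping of stress accesses (names and sizes assumed).
// Instead of scattering accesses over the whole stressing array, each
// stressing block repeatedly hits one 64-word line from a small set,
// concentrating pressure on the GPU's contention hardware.
constexpr int LINE_WORDS = 64;  // words per contended line
constexpr int NUM_LINES = 16;   // small-ish subset of lines (assumed)

__device__ void stress_memory(volatile int *stress_array, int block_id,
                              int iters)
{
  // Stripe: stressing block b always hammers line b % NUM_LINES.
  volatile int *line = stress_array + (block_id % NUM_LINES) * LINE_WORDS;
  for (int i = 0; i < iters; ++i) {
    int word = (i * 7) % LINE_WORDS;  // walk within the line
    line[word] = line[word] + 1;      // plain read-modify-write
  }
}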
One comment based on yours, @publixsubfan: all previously existing RAJA atomics were relaxed. It's the stronger ones, and really the scopes, that are most interesting, because they mean we can pass data without having to do all loads and stores with atomics. Block-scope atomics, with device-scope fences only when necessary, are also likely to help us greatly on El Cap. The atomics themselves won't be faster, but there will be substantially less expensive cache invalidation. The code looks good to me on a cursory read. I'm not sure what we want to do with these interfaces longer term, but this looks good as a place to explore.
Can you add something about how you get back non-atomic memory operations using this interface?
RAJA_HOST_DEVICE RAJA_INLINE T atomicAdd(AtomicPolicy, T volatile *acc, T value)
{
  using desul_order =
      typename detail::DesulAtomicPolicy<AtomicPolicy>::memory_order;
Does this mean that AtomicPolicy has to be detail_atomic_t<...> instead of DesulAtomicPolicy<...>?
When designing desul, we realized that the sequential option behaved more like a scope than a memory order. The scope is …
Summary
This PR adds ordered atomic policies of the form RAJA::atomic_{mem_policy}_{scope}, where:
- mem_policy is one of relaxed, acquire, release, acq_rel, or seq_cst
- scope is either empty (device-scope), system for a system-wide atomic, or block for a block-wide atomic
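As a minimal usage sketch: the policy objects below assume the naming scheme above maps directly to tag types (the exact spellings, e.g. RAJA::atomic_seq_cst and RAJA::atomic_relaxed_block, are my reading of the scheme, not copied from the diff).

// Hypothetical usage of the new policies; type names assumed from the
// RAJA::atomic_{mem_policy}_{scope} naming scheme above.
#include "RAJA/RAJA.hpp"

RAJA_HOST_DEVICE void bump(int volatile *counter)
{
  // Device-scope, sequentially consistent increment.
  RAJA::atomicAdd(RAJA::atomic_seq_cst{}, counter, 1);

  // Block-scope, relaxed increment: cheaper when only threads in the
  // same block need to observe the update.
  RAJA::atomicAdd(RAJA::atomic_relaxed_block{}, counter, 1);
}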
Motivation
On architectures that adopt relaxed memory models (ARM, PowerPC, most GPU architectures), the order in which memory modifications made by one thread are observed by another thread may differ from the "program order" of the memory operations. This can lead to unexpected results if, for example, an atomic variable is used as a mutex: writes done in the critical section may not be visible to another thread because the memory subsystem reordered them.
x86 implements a much stronger, though not fully sequentially consistent, memory model (x86-TSO). The only reordering observable between threads is Store->Load reordering, where an earlier (in program order) store can be reordered after a later load. The "Store Buffer" litmus test demonstrates this behavior: without fencing, it can appear as if the store instruction in each thread happened after its corresponding load instruction.
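The store-buffering pattern can be reproduced on the host with std::atomic; this is a standalone illustration of the outcome the litmus test looks for, not part of this PR:

// Store-buffering (SB) litmus test on the host, for illustration only.
// With memory_order_relaxed, the outcome r0 == 0 && r1 == 0 is permitted
// (and observable on x86 via the store buffer); switching both stores and
// loads to memory_order_seq_cst forbids it.
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0 = 1, r1 = 1;

int main()
{
  std::thread t0([] {
    x.store(1, std::memory_order_relaxed);
    r0 = y.load(std::memory_order_relaxed);
  });
  std::thread t1([] {
    y.store(1, std::memory_order_relaxed);
    r1 = x.load(std::memory_order_relaxed);
  });
  t0.join();
  t1.join();
  // r0 == 0 && r1 == 0 here witnesses the weak (non-SC) behavior.
  return 0;
}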
Desul supports specifying a memory order policy, which can restore a consistent view of memory operations between threads by using a stronger, paired memory ordering on the two threads involved.
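For instance, a release store paired with an acquire load makes the message-passing pattern well-defined. A sketch using desul's tag types follows; the call shapes reflect desul's atomic_store/atomic_load interface as I understand it, so treat them as an assumption to verify against the desul headers:

// Message passing repaired with a release/acquire pair via desul.
// Tag type names (MemoryOrderRelease, MemoryScopeDevice, ...) follow
// desul's conventions.
#include <desul/atomics.hpp>

int payload = 0;
int flag = 0;

void produce()  // producer thread
{
  payload = 42;  // plain store
  // Release: all prior writes become visible before the flag is set.
  desul::atomic_store(&flag, 1, desul::MemoryOrderRelease(),
                      desul::MemoryScopeDevice());
}

int consume()  // consumer thread
{
  // Acquire: once the flag is seen, the payload write is visible too.
  while (desul::atomic_load(&flag, desul::MemoryOrderAcquire(),
                            desul::MemoryScopeDevice()) == 0) {
  }
  return payload;  // reads 42
}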
Litmus testing
The added GPU litmus tests are based on the work at https://gpuharbor.ucsc.edu/webgpu-mem-testing/ and the paper "Foundations of Empirical Memory Consistency Testing", Kirkham et al. (OOPSLA 2020).
Litmus testing lets us probe for relaxed memory behaviors on GPU platforms. We implement a family of 2-thread tests in which each thread attempts to write data to, or read data from, a thread residing in a different block; a sketch of the test shape follows.
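The shape of such a test, using message passing as the example. This is an illustrative CUDA sketch, not the PR's actual driver code (which runs the checks through a RAJA forall kernel):

// Illustrative 2-thread message-passing litmus kernel (not the PR's code).
// Thread 0 writes the data and then the flag; thread 1, in a different
// block, reads the flag and then the data. Observing flag == 1 with
// data == 0 witnesses a weak (relaxed) memory behavior.
__global__ void mp_litmus(volatile int *data, volatile int *flag,
                          int *weak_count)
{
  if (blockIdx.x == 0 && threadIdx.x == 0) {  // testing thread 0
    *data = 1;
    __threadfence();  // drop this fence to probe for the weak outcome
    *flag = 1;
  } else if (blockIdx.x == 1 && threadIdx.x == 0) {  // testing thread 1
    int f = *flag;
    int d = *data;
    if (f == 1 && d == 0) atomicAdd(weak_count, 1);
  }
}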
More references
"A Tutorial Introduction to the ARM and POWER Relaxed Memory Models"