
Atomic operations on arrays #10949

Closed
wants to merge 1 commit

Conversation

@OlivierNicole (Contributor) commented Jan 25, 2022

Proposed changes

Atomic operations on arrays, namely:

  • atomic read, write, exchange
  • compare and set
  • fetch and add

These operations are in a new Atomic.Array module. These functions currently take a regular 'a array, but the interface is up for debate; it sounds reasonable to have an opaque type 'a Atomic.Array.t instead.
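For concreteness, a plausible shape for such an interface, modelled on the existing Atomic module (a sketch only; the exact names and signatures in the PR may differ):

module Array : sig
  val get : 'a array -> int -> 'a                            (* atomic read *)
  val set : 'a array -> int -> 'a -> unit                    (* atomic write *)
  val exchange : 'a array -> int -> 'a -> 'a                 (* atomic swap *)
  val compare_and_set : 'a array -> int -> 'a -> 'a -> bool  (* CAS at an index *)
  val fetch_and_add : int array -> int -> int -> int         (* add, return old value *)
end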

All modifying operations are translated as C calls. This is not optimal, but it is in line with the existing atomic primitives. Generating more efficient code could be left as a follow-up improvement to this PR.

The operations have been tested for correctness in non-parallel code (https://github.com/OlivierNicole/atomic_array_tests) and for sequential consistency with multiple domains, using @jmid's multicoretests library (https://github.com/OlivierNicole/multicoretests/blob/atomicarrays/src/lin_tests.ml). As expected, atomic accesses are sequentially consistent while non-atomic array accesses are not.

Notable points

This PR requires atomic doubles, and requires that they have the same memory layout as plain doubles, due to flat float arrays. This can be an issue on some 32-bit architectures [ocaml-multicore/ocaml-multicore#822 (comment)]. An alternative would be to decide that arrays of atomic floats are not flat.

The diff contains a number of comments marked TODO where I ask the reviewers for a few confirmations.

@OlivierNicole (Contributor, Author) commented Jan 25, 2022

I realize that a sizeable part of the diff is about introducing new ~is_atomic flags in Cmm_helpers. This is because this PR was developed concurrently with #10943 and is related to it. These parts will be updated based on what is decided there.

Edit: Done.

@jhjourdan (Contributor)

We already discussed this with Glen Mével. This should be useful for implementing a concurrent non-blocking queue.

FTR, I think that using a specific type for atomic arrays would be beneficial, because we probably want the layout to be different. Indeed, it is often good for performance that the various atomic cells are not in the same cache line, so that operations on contiguous atomic cells do not result in hardware contention. So it would probably be reasonable to add padding to an atomic array so that each cell uses a different cache line.

@OlivierNicole (Contributor, Author)

Thanks for pointing this out. Is it possible to implement such padding in a portable way?

@gmevel (Contributor) commented Jan 26, 2022

I indeed believe that arrays with atomic cells are a valuable feature. I’ll mention that, if it were included in the language, I’d also expect support for records with (some) atomic fields (such as: type r = { mutable atomic a : int ; mutable b : int ; c : int }). That’s more ambitious, as it implies extending the surface language, and also because Atomic.t could then be refactored using that feature.
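For comparison, the closest thing expressible today boxes the atomic field behind an indirection (an illustration, not code from the PR):

(* Today the atomic field must be a separate, boxed Atomic.t value,
   reached through one extra indirection. *)
type r = { a : int Atomic.t; mutable b : int; c : int }

let incr_a (r : r) : int = Atomic.fetch_and_add r.a 1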

@gasche (Member) commented Jan 26, 2022

I wonder if we could support an unboxing optimization for record fields of type Atomic.t (or ref, for that matter), where the atomic block header and the argument are inlined into the record representation. This is less compact than built-in mutable fields (or built-in atomic fields), but it is also more expressive, as one can refer to the field individually and pass it to Atomic.t-expecting (or ref-expecting) consumers. It is also possibly an instance of a more general unboxing transformation.

(Note: we need to include the block header as an extra word because OCaml does not support interior pointers. One could presumably unbox the first field without an extra header, but this makes the runtime data representation more complex for no obvious reason.)

@jhjourdan (Contributor)

@gasche: it seems to me that implementing this by unboxing Atomic.t is a can of worms. As you just explained, this would need infix pointers, which we know create a lot of bugs in the case of closures.

Why not simply add an atomic qualifier for record fields, and a few language primitives for dealing with them?

@xavierleroy (Contributor) commented Jan 26, 2022

I agree with @jhjourdan: infix pointers are a pain, we even considered getting rid of them entirely at some point in the past, so let's not invent new uses for them.

Also, they reduce the interest of atomic arrays:

  • array of (pointers to) Atomic.t values: 1 + 3N words (N = size of the array; each element costs one pointer word in the array plus a two-word Atomic.t block, header and value)
  • flat atomic array: 1 + N words
  • flat array of (atomic header, atomic value) pairs: 1 + 2N words

As an aside: we need hard evidence that normal arrays of Atomic.t values, or normal records with Atomic.t fields, are not time- and space-efficient enough before considering other mechanisms.

@OlivierNicole (Contributor, Author)

I have performed two simple preliminary benchmarks to assess the cost of the indirection layer in arrays of Atomic.t values.

The first benchmark consists of a single thread performing a lot of reads at consecutive indices in an array, either an int Atomic.t array or direct atomic reads in an int array. The second benchmark is also a single thread performing a lot of reads, but at random indices. The results are shown below on a linear scale:
[Figure: results-array-accesses-linear]
and a logarithmic scale:
[Figure: results-array-accesses-log]

The difference is modest for sequential accesses, but for random accesses it can reach 2x (and up to 3x) for large arrays. The likely reason is that CPU caches greatly mitigate the cost of the extra indirection when locality is high, which seems to be the case both with flat arrays and with arrays of int Atomic.t values in this simple benchmark, but not when locality is low, as with random accesses. In outline, the two access patterns compared are sketched below.
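A minimal sketch of the two variants (it assumes the Atomic.Array.get primitive proposed in this PR; the real benchmark code lives in the repository linked above):

(* Boxed variant: each cell is a pointer to a separate Atomic.t block. *)
let read_boxed (a : int Atomic.t array) (indices : int array) =
  Array.iter (fun i -> ignore (Atomic.get a.(i))) indices

(* Flat variant: direct atomic reads in a plain int array. *)
let read_flat (a : int array) (indices : int array) =
  Array.iter (fun i -> ignore (Atomic.Array.get a i)) indices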

Another advantage of unboxed arrays is that they give more control over the layout of data, which makes it easy to ensure that two values do not sit on the same cache line (a possible source of contention). It is hard to do the same reliably with arrays of Atomic.ts. Finally, the space advantage (N + 1 words rather than 3N + 1) remains.

Given the evidence presented, it would be useful to know whether to continue work on this PR towards a merge, or whether this feature is premature and I should close the PR for now. Would additional experimental evidence be needed to make this choice? If so, what would it be? If the general feeling is that we can move towards a merge given additional evidence, I can try to reduce the diff to minimise memory-model-related maintenance. Please share your thoughts.

@gasche (Member) commented Feb 18, 2022

Thanks for the benchmark results!

I thought about this PR again, here are some comments.

  • I'm uncomfortable with the amount of redundancy in the implementation. The patch currently adds 999 lines of code, and most of it looks like duplication of existing non-atomic code. This makes me a bit nervous from a maintenance perspective; it doesn't make the code more complex, but if we now have twice as much of it, every refactoring-style lateral change is going to pay the cost. See this part of translprim.ml for example, which makes me uneasy. We already had some duplication with the s and u variants of each accessor; now we are adding a new dimension, so we have four times as much code. Could we refactor the array primitives to reduce duplication?

  • Possibly a simple approach: what about implementing these as C calls as early as possible, instead of propagating them through the compiler representations? This could be slower, but could we (again) measure this?

  • Do we really want to reuse the 'a array type? My gut tells me to have a separate type of atomic arrays.

  • What is the kind of code that needs atomic arrays? Do we have examples somewhere?
    (Note: I don't think we can easily implement domain-indexed arrays in OCaml right now, given that, if I understand correctly, we don't have access to compact/sequential identifiers for domains, only to "unique identifiers", but on the other hand we have thread-local storage that covers some of this need as well.)

  • What is this business of cache lines? Is the proposal really to have 56 bytes of padding between every 8-byte element in the array (assuming 64-byte cache lines)? Or are we expecting programmers to transform indices themselves so as to use only one element out of each cache line? Are we talking about one data structure, or two different data structures with different requirements?

@OlivierNicole (Contributor, Author)

Thank you for your comments.

  • Yes, I hadn't thought of it, but it does seem relevant to refactor the array primitives to avoid this code duplication.

  • Implementing them as C calls is possible (mutating primitives are already translated to C calls), but I think it could be substantially slower on x86, where atomic array reads compile to plain loads. Still, it is a possibility.

  • Having a separate type for atomic arrays is not excluded at all; one of the reasons I opened this as a draft PR was to let people discuss what the best interface would be.

  • I am currently trying out variations of @gmevel and @jhjourdan's concurrent queue to determine whether its performance can be improved using atomic arrays. So far the gain seems modest, but I have had difficulty conducting the benchmark without performance swings caused by unrelated factors such as code alignment, so this conclusion is not definitive.
    There is also this lock-free map proposal, which uses a node Atomic.t array, and I intend to assess the impact of turning it into an atomic array.
    As concurrent data structures are not my primary background, I don't know of other specific use cases where atomic arrays are needed, but maybe someone with more experience will.

  • The proposal is to use the same memory layout as regular arrays. Regarding:

    Or are we expecting programmers to transform indices themselves to only use one element out of each cacheline?

    This is an option that atomic arrays offer, akin to how C programmers sometimes use alignment directives to ensure that two fields mutated by different threads are on different cache lines. But atomic arrays can also be used like regular arrays.

@xavierleroy (Contributor)

I'm uncomfortable with the amount of redundancy in the implementation. The patch currently adds 999 lines of code, and most of it looks like duplication of existing non-atomic code.

Ouch. I'm still not convinced the performance gains are worth so much code.

Concerning putting atomics in different cache lines: a simple solution suggested during one of our meetings is to make values of type Atomic.t as big as a cache line; this way, the atomic word cannot share a cache line with any other block. That's better (in my opinion) than expecting users of atomic arrays to stride their accesses just right to avoid cache line sharing.

(Left as an exercise to the reader: determine the cache line size in a portable and efficient manner.)

@bluddy (Contributor) commented Feb 20, 2022

AFAIK, atomic arrays are not a common data structure; usually one only needs a handful of atomic values. I would recommend tabling this PR until we have more experience with multicore programming. In the meantime, this data structure can be implemented in an external library in order to demonstrate its usefulness.

@OlivierNicole (Contributor, Author)

Thank you for your input. As this PR is unlikely to progress further, I am closing it with a summary of findings and identified requirements.

Findings

  • Atomic arrays seem uncommon in parallel programming. Much more common are regular arrays combined with a handful of atomic values telling the threads which indices to access.
  • Simple, preliminary benchmarks suggest that atomic arrays can improve data locality and avoid cache misses when accesses are not at consecutive indices (see Atomic operations on arrays #10949 (comment)), but the performance gain is significant only with large arrays.

Requirements identified

  • For now, not enough evidence of utility has been found to outweigh the added maintenance cost.
  • Such a feature would need to be carefully coded to avoid code duplication and additional memory-model-related maintenance burden.

@gmevel (Contributor) commented Feb 21, 2022

Thank you @OlivierNicole for taking the time to benchmark this. I took too long to write my comments; I leave them anyway in case the idea gets resurrected one way or another.

Some comments on the benchmarks

The first benchmark consists in a single thread performing a lot of reads at subsequent indices in an array […]. The second benchmark is also a thread performing a lot of reads, but at random indices.

This is a valuable insight already. Note that with only reads, you never invalidate other threads’ caches, so the ordered traversal benefits from locality just as usual. One of the things we want to measure is precisely the cache behavior. For that we’d also need a benchmark which does some writes.

Also, I notice that both your benchmarks include Gc.minor () in the measured time and (perhaps more importantly) the random-access benchmark also accounts for the creation of a random array and N non-atomic accesses to it. I believe that might constitute a big bias against the random-access benchmark.

And of course we’d need benchmarking with several threads.

On padding

Another advantage of unboxed arrays is that they give more control over the layout of data, which makes it easy to ensure that two values do not sit on the same cache line (a possible source of contention). It is hard to do the same reliably with arrays of Atomic.ts.

Good point. When allocating atomic references in a row, as in Array.init n (fun i -> Atomic.make i), you cannot make sure that the references fall on separate cache lines; in fact, it’s likely that they fall next to one another. That’s addressed by @xavierleroy’s suggestion of making _ Atomic.t span a whole cache line but, in the event that this suggestion is implemented, the atomic array proposal would remain relevant IMHO. Indeed, there would be some interest in flat and compact atomic arrays, as a much more compact representation of _ Atomic.t array, aimed at users who

  • either accept the possible slowdown of cache invalidations,
  • or are willing to implement a custom indexing strategy for mitigating it.

One possible strategy is to just add padding (so the only benefit is the unboxing of atomic references). Another strategy is, for instance (assuming that a cache line is 8 words):

  1. choose P such that 8P ≥ N (the physical size 8P is thus a multiple of 8);
  2. allocate an array of length 8P (or just 8P−1, which works when 8P > N);
  3. map index i to physical offset 8(i mod P) + (i / P).

That way, we don’t waste space with padding, yet P adjacent indices always fall on P separate cache lines.

++---+---+----+---+-----++----+-----+------+---+-------++       ++------+------+---+-------++
|| 0 | P | 2P | … | 7P  ||  1 | P+1 | 2P+1 | … | 7P+1  ||  ...  ||  P−1 | 2P−1 | … | 8P−1  ||
++---+---+----+---+-----++----+-----+------+---+-------++       ++------+------+---+-------++
 <----- cache line ----->

We probably don’t want such a strategy to be built-in.
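Still, a user-level sketch of the remapping, assuming 8-word cache lines (hypothetical helper names; nothing here needs runtime support):

(* For a logical length n, take p = ceil(n / 8); the physical array
   then has length 8 * p. *)
let lines_needed n = (n + 7) / 8

(* Map logical index i to a physical offset so that adjacent logical
   indices land on distinct cache lines. *)
let remap ~p i = 8 * (i mod p) + i / p

let () =
  let p = lines_needed 20 in   (* p = 3, physical length 24 *)
  assert (remap ~p 0 = 0);     (* cache line 0 *)
  assert (remap ~p 1 = 8);     (* cache line 1 *)
  assert (remap ~p 2 = 16)     (* cache line 2 *)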

On types

On the type of atomic arrays, I concur with others: it seems clear to me that we’d want a separate type _ AtomicArray.t, as opposed to reusing _ Array.t. The currently described memory model assumes that we never mix several access modes on the same memory location, and I believe we don’t want to extend the model to support mixed access modes (this is one of the things that make the C11 model a nightmare).

@OlivierNicole (Contributor, Author) commented Feb 23, 2022

Also, I notice that both your benchmarks include Gc.minor () in the measure and (perhaps more importantly) the random-access benchmark also accounts for the creation of a random array and N non-atomic accesses to it. I believe that might constitute a big bias against the random access benchmark.

We've taken this into account by performing 10,000 passes over the array in each case. The run time should be dominated by the array accesses, and profiling shows that this is the case.

Edit: I realize that I probably misunderstood your comment and that you were talking about a bias against random access when compared to consecutive accesses. That's true; however, it does not invalidate the comparison between atomic arrays and arrays of atomics within the random-access benchmark, which is what I wanted to evaluate.

This is a valuable insight already. Note that with only reads, you never invalidate other threads’ caches, so the ordered traversal benefits from locality just as usual. One of the things we want to measure is precisely the cache behavior. For that we’d also need a benchmark which does some writes.

That's true. For the sake of completeness, I performed such a benchmark. Many different benchmarks could be performed; I chose to evaluate the cost of false sharing, i.e. the cache-invalidation costs incurred when different threads modify separate values that happen to sit on the same cache line.
The program spawns 4 threads, each of which performs 10,000 fetch-and-adds in an atomic array (resp. a simple array) of integers (resp. of int Atomic.ts). As before, I benchmarked random accesses and accesses to consecutive elements.
Each thread accesses a disjoint subset of the array: thread no. i accesses the elements whose index is congruent to i modulo 4.

To demonstrate that atomic arrays can avoid false sharing, two additional benchmarks use a padded array and an array with indices remapped using the method suggested by @gmevel. In this particular case, the index remapping ensures that different threads never modify the same cache line. In outline, the benchmark looks something like the sketch below.
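A hedged sketch of the flat-array variant (it assumes the Atomic.Array.fetch_and_add primitive proposed in this PR and OCaml 5's Domain API; the actual benchmark code may differ):

(* Each of 4 domains repeatedly fetch-and-adds its own residue class of
   indices (id, id + 4, id + 8, ...), so in a flat unpadded array cells
   touched by different domains share cache lines. *)
let worker a ~id ~len ~rounds =
  for _ = 1 to rounds do
    let i = ref id in
    while !i < len do
      ignore (Atomic.Array.fetch_and_add a !i 1);
      i := !i + 4
    done
  done

let run () =
  let len = 1024 in
  let a = Array.make len 0 in
  List.init 4 (fun id ->
      Domain.spawn (fun () -> worker a ~id ~len ~rounds:10_000))
  |> List.iter Domain.join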

With random accesses, atomic arrays perform almost uniformly worse, except for very small arrays (where false sharing occurs in spite of the randomness) and for a very large 5Mi-element array (perhaps related to the L2 cache). After some quick profiling, I attribute the worse performance of atomic arrays to the fact that the C primitive for fetch-and-add on atomic arrays performs more stack operations than the fetch-and-add on Atomic.ts.

[Figure: results-array-randomwrites-log]

With consecutive accesses, padded atomic arrays and arrays with remapping are much faster, with 3x to 4x speedups except for very large arrays.

[Figure: results-array-subseqwrites-log]
