Atomic operations on arrays #10949
Conversation
Edit: Done.
We already discussed this with Glen Mével. This should be useful for implementing a concurrent non-blocking queue. FTR, I think that using a specific type for atomic arrays would be beneficial, because we probably want the layout to be different. Indeed, it is often good for performance that the various atomic cells are not in the same cache line, so that operations on contiguous atomic cells do not result in contention on the hardware. So it would probably be reasonable to add padding to atomic arrays so that each cell uses a different cache line.
Thanks for pointing this out. Is it possible to implement such padding in a portable way?
I indeed believe that arrays with atomic cells are a valuable feature. I’ll mention that, if it were included in the language, I’d also expect support for records with (some) atomic fields, such as:
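(A guess at what such a declaration could look like, using a hypothetical per-field attribute; nothing like this exists in the language at the time of this discussion, so the syntax below is purely illustrative.)

```ocaml
(* Hypothetical syntax, for illustration only: mark some mutable record
   fields as atomic.  The [@atomic] attribute is made up here and would
   simply be ignored by today's compiler. *)
type counter = {
  name : string;                     (* ordinary immutable field *)
  mutable hits : int [@atomic];      (* would be an atomic field *)
  mutable misses : int [@atomic];    (* would be an atomic field *)
}
```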
I wonder if we could support an unboxing optimization for record fields with type Atomic.t. (Note: we need to include the block header as an extra word because OCaml does not support interior pointers. One could presumably unbox the first field without an extra header, but this makes the runtime data representation more complex for no obvious reason.)
@gasche: it seems to me that implementing this by unboxing Atomic.t is a can of worms. As you just explained, this would need infix pointers, which we know create a lot of bugs in the case of closures. Why not simply add an
I agree with @jhjourdan: infix pointers are a pain; we even considered getting rid of them entirely at some point in the past, so let's not invent new uses for them. Also, they reduce the interest of atomic arrays:
As an aside: we need hard evidence that normal arrays of Atomic.t values, or normal records with Atomic.t fields, are not time- and space-efficient enough before considering other mechanisms.
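For concreteness, a "normal array of Atomic.t values" can be sketched as follows (a minimal sketch, not code from this thread; names are illustrative):

```ocaml
(* Baseline under discussion: a plain array whose cells are individually
   heap-allocated Atomic.t values, so every access pays one extra
   indirection. *)
let make_boxed n : int Atomic.t array =
  Array.init n (fun _ -> Atomic.make 0)

let read a i = Atomic.get a.(i)
let bump a i = ignore (Atomic.fetch_and_add a.(i) 1)
```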
I have performed two simple preliminary benchmarks to assess the cost of the indirection layer in arrays of Atomic.t. The first benchmark consists in a single thread performing a lot of reads at subsequent indices in an array, either a flat atomic array or an array of Atomic.t; the second does the same at random indices.

The difference is modest for sequential accesses, but for random accesses it can be 2x (and up to 3x) for large arrays. The reason is probably that CPU caches greatly mitigate the cost of the extra indirection when locality is high, which seems to be the case both with flat arrays and with arrays of Atomic.t.

Another advantage of unboxed arrays is that they give more control over the layout of data, which makes it easy to ensure that two values do not sit on the same cache line, a possible source of contention. It is hard to do the same reliably with arrays of Atomic.t.

With the presented evidence, it would be useful to know whether I should continue work on this PR so that we can move towards a merge, or whether this feature is too early and I should close this PR for now. Would additional experimental evidence be needed to make this choice? If so, what would it be? If the general feeling is that we can move towards a merge with additional evidence, I can try to reduce the diff to minimise memory-model-related maintenance. Please share your thoughts.
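A rough sketch of the sequential-read variant over the boxed baseline (not the actual benchmark code; array size and pass count are made up, and the flat-array variant would use the primitives from this PR instead of Atomic.get):

```ocaml
(* Sketch of the sequential-read loop over an array of Atomic.t.  The
   atomic-array variant would replace [Atomic.get a.(i)] with the
   corresponding primitive added by this PR. *)
let sequential_reads (a : int Atomic.t array) ~passes =
  let sum = ref 0 in
  for _ = 1 to passes do
    for i = 0 to Array.length a - 1 do
      sum := !sum + Atomic.get a.(i)
    done
  done;
  !sum

let () =
  let a = Array.init 100_000 (fun _ -> Atomic.make 1) in
  Printf.printf "%d\n" (sequential_reads a ~passes:1_000)
```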
Thanks for the benchmark results! I thought about this PR again; here are some comments.
Thank you for your comments.
Ouch. I'm still not convinced the performance gains are worth so much code.

Concerning putting atomics in different cache lines: a simple solution suggested during one of our meetings is to make values of type Atomic.t span a whole cache line. (Left as an exercise to the reader: determine the cache line size in a portable and efficient manner.)
AFAIK, atomic arrays are not a common data structure; usually one only needs a handful of atomic values. I would recommend tabling this PR until we have more experience with multicore programming. In the meantime, this data structure can be implemented by an external library in order to demonstrate its usefulness.
Thank you for your input. As it is unlikely to progress further, I am closing this PR with a summary of findings and identified requirements.

Findings
Requirements identified
Thank you @OlivierNicole for taking the time to benchmark this. I took too long to write my comments; I leave them anyway in case the idea gets resurrected one way or another.

Some comments on the benchmarks
This is a valuable insight already. Note that with only reads, you never invalidate other threads’ caches, so the ordered traversal benefits from locality just as usual. One of the things we want to measure is precisely the cache behavior. For that we’d also need a benchmark which does some writes.
And of course we’d need benchmarking with several threads.

On padding
Good point. When allocating atomic references in a row, as in
One possible strategy is to just add padding (so the only benefit is the unboxing of atomic references). Another strategy is, for instance (assuming that a cache line is 8 words):
That way, we don’t waste space with padding, but P adjacent indexes always fall on P separate cache lines.
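A sketch of that remapping strategy, under the stated assumption of 8-word cache lines and an array length that is a multiple of 8 (the helper names are made up):

```ocaml
(* Logical index i is sent to physical index (i mod lines) * 8 + i / lines,
   where lines = n / 8.  The mapping is a bijection on [0, n), wastes no
   space, and any run of up to [lines] consecutive logical indices lands
   on that many distinct cache lines. *)
let remap ~n i =
  let lines = n / 8 in
  (i mod lines) * 8 + i / lines

(* Using a plain array to stand in for the atomic array: *)
let get a i = a.(remap ~n:(Array.length a) i)
let set a i v = a.(remap ~n:(Array.length a) i) <- v
```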
We probably don’t want such a strategy to be built-in.

On types

On the type of atomic arrays, I concur with others: it seems clear to me that we’d want a separate type.
We've taken this into account by performing 10,000 passes over the array in each case. The run time should be dominated by the array accesses, and profiling shows that this is the case.

Edit: I realize that I probably misunderstood your comment and that you were talking about a bias against random access when compared to subsequent accesses. That's true; however, it does not invalidate the comparison between atomic arrays and arrays of atomics within the random-access benchmark, which is what I wanted to evaluate.
That's true. For the sake of completeness, I performed such a benchmark. Many different benchmarks could be performed; I chose to evaluate the cost of false sharing, i.e. the cache invalidation costs incurred when different threads modify separate values that happen to sit on the same cache line. To demonstrate that atomic arrays can avoid false sharing, two additional benchmarks use a padded array and an array with remapped indices, using the method suggested by @gmevel. In this particular case, this index remapping implies that different threads never modify the same cache line.

With random accesses, atomic arrays perform pretty much uniformly worse, except for very small arrays (in which false sharing occurs in spite of the randomness) and for a very large 5Mi-element array (perhaps related to the L2 cache). After some quick profiling, I attribute the worse performance of atomic arrays to the fact that the C primitive for fetch-and-add on atomic arrays performs more stack operations than the fetch-and-add on Atomic.t.

With subsequent accesses, padded atomic arrays and arrays with remapping are much faster, with 3x or 4x speedups except for very large arrays.
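For reference, the false-sharing workload on the boxed baseline can be sketched roughly as follows (not the actual benchmark code; domain count, array size and iteration count are made up, and the padded and remapped variants only change which physical cell each domain touches):

```ocaml
(* Each domain repeatedly performs fetch-and-add on its own cell of a
   small array of Atomic.t; the boxed atomics allocated back-to-back may
   share a cache line, which is the false sharing being measured. *)
let worker (a : int Atomic.t array) ~cell ~iters =
  for _ = 1 to iters do
    ignore (Atomic.fetch_and_add a.(cell) 1)
  done

let () =
  let a = Array.init 8 (fun _ -> Atomic.make 0) in
  let domains =
    List.init 4 (fun d ->
      Domain.spawn (fun () -> worker a ~cell:d ~iters:1_000_000))
  in
  List.iter Domain.join domains
```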
Proposed changes
Atomic operations on arrays, namely:
These operations are in a new Atomic.Array module. These functions currently take a regular 'a array, but the interface is up for debate; it sounds reasonable to have an opaque type 'a Atomic.Array.t instead.

All modifying operations are translated as C calls. This is not optimal, but in line with the existing atomic primitives. Generation of more efficient code could maybe be seen as an improvement over this PR.
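As a purely indicative sketch (not this PR's actual interface; the function names and the exact set of operations are guesses), an interface over plain 'a array might look like:

```ocaml
(* Indicative signature only; the real PR may differ in names and in the
   exact set of operations provided. *)
module type ATOMIC_ARRAY = sig
  val get : 'a array -> int -> 'a
  val set : 'a array -> int -> 'a -> unit
  val exchange : 'a array -> int -> 'a -> 'a
  val compare_and_set : 'a array -> int -> 'a -> 'a -> bool
  val fetch_and_add : int array -> int -> int -> int
end
```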
The operations have been tested for correctness in non-parallel code (https://github.com/OlivierNicole/atomic_array_tests) and for sequential consistency with multiple domains, using @jmid's multicoretests library (https://github.com/OlivierNicole/multicoretests/blob/atomicarrays/src/lin_tests.ml). As expected, atomic accesses are sequentially consistent while non-atomic array accesses are not.
Notable points
This PR requires atomic doubles, and requires that they have the same memory layout as plain doubles, due to flat float arrays. This can be an issue on some 32-bit architectures [ocaml-multicore/ocaml-multicore#822 (comment)]. An alternative would be to decide that arrays of atomic floats are not flat.
The diff contains a number of comments marked TODO where I kindly ask the reviewers for a few confirmations.