-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCS: Introduce lightweight rwlock #10355
base: master
Are you sure you want to change the base?
UCS: Introduce lightweight rwlock #10355
Conversation
Artemy-Mellanox
commented
Dec 5, 2024
•
edited by yosefe
Loading
edited by yosefe
24a457b
to
65f025c
Compare
65f025c
to
89415df
Compare
src/ucs/type/rwlock.h
Outdated
ucs_cpu_relax(); | ||
} | ||
|
||
x = __atomic_fetch_add(&lock->l, UCS_RWLOCK_READ, __ATOMIC_ACQUIRE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can use the atomic operations defined in atomic.h?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe the we replace deprecated __sync*
fuctions from atomic.h
with new __atomic
variants?
test/gtest/ucs/test_type.cc
Outdated
int m = std::thread::hardware_concurrency(); | ||
std::vector<int> threads = {1, 2, 4, m}; | ||
std::vector<int> writers_per_256 = {1, 25, 128, 250}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- i think we also want to measure overhead of read lock+unlock regardless of concurrency since it is the reason we added lightweight rwlock
- can you pls post example output in the PR description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- you mean write percent 0?
- didn't i do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- yes ( isee you added it)
- yes, pls update with percent 0
test/gtest/ucs/test_type.cc
Outdated
std::this_thread::sleep_for(std::chrono::milliseconds(1)); | ||
} | ||
|
||
void measure_one(int num, int writers, const std::function<void()> &r, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe introduce a typedef with using
:
using run_func_t = std::function<void()>;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also propose more explicit names:
r
-> reader
w
-> writer
num
-> thread_count
test/gtest/ucs/test_type.cc
Outdated
void measure_one(int num, int writers, const std::function<void()> &r, | ||
const std::function<void()> &w, const std::string &name) | ||
{ | ||
std::vector<std::thread> tt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: maybe name it threads
test/gtest/ucs/test_type.cc
Outdated
{ | ||
int m = std::thread::hardware_concurrency(); | ||
std::vector<int> threads = {1, 2, 4, m}; | ||
std::vector<int> writers_per_256 = {1, 25, 128, 250}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use percent to set the number of writers? It will be easier to understand
std::vector<int> writers_percent = {1, 25, 50, 75, 90};
src/ucs/type/rwlock.h
Outdated
ucs_cpu_relax(); | ||
} | ||
|
||
x = __atomic_fetch_add(&lock->l, UCS_RWLOCK_READ, __ATOMIC_ACQUIRE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe the we replace deprecated __sync*
fuctions from atomic.h
with new __atomic
variants?
test/gtest/ucs/test_type.cc
Outdated
|
||
UCS_TEST_F(test_rwlock, perf) { | ||
ucs_rwlock_t lock = UCS_RWLOCK_STATIC_INITIALIZER; | ||
measure( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good performance test, but it must change some state to guarantee the lock correctness. This is the whole purpose of litmus test. For instance, these functions may perform some simple math calculations and we check the invariant:
// Invariant: counter2 is 2 times bigger than counter1
int counter1 = 1;
int counter2 = 2;
measure(
[&]() {
ucs_rwlock_read_lock(&lock);
UCS_ASSERT_EQ(counter1 * 2, counter2);
ucs_rwlock_read_unlock(&lock);
},
[&]() {
ucs_rwlock_write_lock(&lock);
counter1 += counter1;
counter2 += counter2;
if (counter2 > 100000) {
counter1 = 1;
counter2 = 2;
}
ucs_rwlock_write_unlock(&lock);
},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a good idea. This test doesn't guarantee lock correctness, it will very easily give a false negative result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, testing correctness of MT algorithms is hard topic. You're right about false negatives results, that's ok. That's actually a nature of litmus tests: when they pass, it does not guarantee that your algorithm is 100% correct, but when they fail - it's obviously broken. Personally I catch tons of MT issues with the help of litmus tests.
If you propose some other way of testing correctness - let's discuss that. What I'm proposing is well established industry practise, and I'm sure it's better to have it than no testing at all.
Moreover, the example that I provided is just scratching the surface. We should also consider adding tests with nested locks, try-locks, etc
test/gtest/ucs/test_type.cc
Outdated
UCS_TEST_F(test_rwlock, pthread) { | ||
pthread_rwlock_t plock; | ||
pthread_rwlock_init(&plock, NULL); | ||
measure( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, update/verify some state
src/ucs/type/rwlock.h
Outdated
int x; | ||
|
||
while (1) { | ||
while (lock->l & UCS_RWLOCK_MASK) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we read atomic without mem order guarantees, normally it should be something
__atomic_load_n(&lock->l, __ATOMIC_RELAXED)
Same in other read cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? AFAIU relaxed mem order means - no mem order guarantees
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, my point here is to use the uniform API to intercept loads/stores of this variable, so it does not need to be volatile. And then we can specify an appropriate mem order, whether it's relaxed or acquire.
Btw I also see performance improvement after replacing volatile with __atomic_load
s
src/ucs/type/rwlock.h
Outdated
|
||
static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock) | ||
{ | ||
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it is usually __ATOMIC_RELEASE
to be used on all release paths to contain and be sure that what happened under lock is visible, any reason for not doing so in this PR?
} | ||
}; | ||
|
||
UCS_TEST_F(test_rwlock, lock) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a way to run this/some tests with maximal optimizations (maybe inline compiler pragma, ..) ? I suspect we end-up running it without optimisations which might affect correctness-related tests?
src/ucs/type/rwlock.h
Outdated
|
||
|
||
/** | ||
* Read-write lock. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spinlock
src/ucs/type/rwlock.h
Outdated
#include <errno.h> | ||
|
||
/** | ||
* The ucs_rwlock_t type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: ucs_rw_spinlock_t
and for all apis?
} | ||
|
||
while (lock->l > UCS_RWLOCK_WAIT) { | ||
ucs_cpu_relax(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would we benefit from either using sched_yeld()
, or X times ucs_cpu_relax()
before retrying to read lock->l
?
src/ucs/type/rwlock.h
Outdated
} | ||
|
||
|
||
static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add assertions/checks to detect underflow using returned value, only on debug builds (and also overflow on the other path)
src/ucs/type/rwlock.h
Outdated
|
||
static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock) | ||
{ | ||
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
detect underflow on debug builds?
src/ucs/type/rwlock.h
Outdated
* Read-write lock. | ||
*/ | ||
typedef struct { | ||
volatile int l; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use unsigned int
to have defined behavior on overflow/underflow (and detect it)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't expect overflow by design so signed int is more suitable i this case IMO
src/ucs/type/rwlock.h
Outdated
} | ||
|
||
|
||
static inline void ucs_rwlock_write_lock(ucs_rwlock_t *lock) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe annotate for coverity with something like /* coverity[lock] */
, to help it with sanity checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for lock paths
5609ae7
to
d41b883
Compare
d41b883
to
2c3a0e8
Compare
static UCS_F_ALWAYS_INLINE void ucs_cpu_relax() | ||
{ | ||
#ifdef __SSE2__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: do something else if SSE2 not defined, like "":::"memory" ?
if (flags & UCS_ATOMIC_FENCE_LOCK) { | ||
return __ATOMIC_ACQUIRE; | ||
} | ||
|
||
if (flags & UCS_ATOMIC_FENCE_UNLOCK) { | ||
return __ATOMIC_RELEASE; | ||
} | ||
|
||
return __ATOMIC_RELAXED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are UCS_ATOMIC_xx defined as flags if there is no meaning to combine more than one of them?
also, how does it relate to UCS_DEFINE_ATOMIC_xx macros? should we remove/extend them?
static UCS_F_ALWAYS_INLINE int ucs_atomic_get(int *ptr, unsigned flags) | ||
{ | ||
return __atomic_load_n(ptr, ucs_atomic_memorder(flags)); | ||
} | ||
|
||
|
||
static UCS_F_ALWAYS_INLINE int | ||
ucs_atomic_fadd(int *ptr, int val, unsigned flags) | ||
{ | ||
return __atomic_fetch_add(ptr, val, ucs_atomic_memorder(flags)); | ||
} | ||
|
||
|
||
static UCS_F_ALWAYS_INLINE void | ||
ucs_atomic_sub(int *ptr, int val, unsigned flags) | ||
{ | ||
__atomic_fetch_sub(ptr, val, ucs_atomic_memorder(flags)); | ||
} | ||
|
||
|
||
static UCS_F_ALWAYS_INLINE void ucs_atomic_or(int *ptr, int val, unsigned flags) | ||
{ | ||
__atomic_fetch_or(ptr, val, ucs_atomic_memorder(flags)); | ||
} | ||
|
||
|
||
static UCS_F_ALWAYS_INLINE int | ||
ucs_atomic_cswap(int *ptr, int old_val, int new_val, unsigned flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we define it as macros to allow any type (not just int)?
flags & UCS_ATOMIC_WEAK, | ||
ucs_atomic_memorder(flags), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why in case of success we want weak mem order?
*/ | ||
typedef struct { | ||
volatile int l; | ||
} ucs_rwlock_t; | ||
int state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe uint32_t ? since it's flags value and not numeric value
(__atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1, | ||
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))) { | ||
return 0; | ||
ucs_atomic_cswap(&lock->state, x, x + UCS_RWLOCK_WRITE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use
"!(x & UCS_RWLOCK_WRITE)" instead of "x < UCS_RWLOCK_WRITE"
"x | UCS_RWLOCK_WRITE" instead of "x + UCS_RWLOCK_WRITE"
since this is bit value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe keep wait bit included in the check too
|
||
static UCS_F_ALWAYS_INLINE void ucs_rw_spinlock_cleanup(ucs_rw_spinlock_t *lock) | ||
{ | ||
ucs_assert(lock->state == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertv
|
||
int write_taken = 0; | ||
bool write_taken = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use true/false for bool instead of 1/0
test/gtest/ucs/test_type.cc
Outdated
int m = std::thread::hardware_concurrency(); | ||
std::vector<int> threads = {1, 2, 4, m}; | ||
std::vector<int> writers_per_256 = {1, 25, 128, 250}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- yes ( isee you added it)
- yes, pls update with percent 0
@@ -138,4 +139,59 @@ UCS_DEFINE_ATOMIC_BOOL_CSWAP(16, w); | |||
UCS_DEFINE_ATOMIC_BOOL_CSWAP(32, l); | |||
UCS_DEFINE_ATOMIC_BOOL_CSWAP(64, q); | |||
|
|||
|
|||
#define UCS_ATOMIC_WEAK UCS_BIT(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to encapsulate these values, then maybe it should be an enum?
And also we miss few other values:
__ATOMIC_CONSUME - probably we don't need this
Data dependency only for both barrier and synchronization with another thread.
__ATOMIC_ACQ_REL - I'm not sure about this one
Full barrier in both directions and synchronizes with acquire loads and release stores in another thread.
__ATOMIC_SEQ_CST - the strongest mode, we definitely need it
Full barrier in both directions and synchronizes with acquire loads and release stores in all threads.
#define UCS_ATOMIC_FENCE_UNLOCK UCS_BIT(2) | ||
|
||
|
||
static UCS_F_ALWAYS_INLINE int ucs_atomic_memorder(unsigned flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can simply use the actual flag (__ATOMIC_ACQUIRE
etc) directly
static UCS_F_ALWAYS_INLINE void | ||
ucs_rw_spinlock_read_unlock(ucs_rw_spinlock_t *lock) | ||
{ | ||
ucs_assert(lock->state >= UCS_RWLOCK_READ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertv with details?
@@ -0,0 +1,137 @@ | |||
/* | |||
* Copyright (c) NVIDIA CORPORATION & AFFILIATES, 2024. ALL RIGHTS RESERVED. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copyrights would need 2025
* | ||
* Readers increment the counter by UCS_RWLOCK_READ (4) | ||
* Writers set the UCS_RWLOCK_WRITE bit when lock is held | ||
* and set the UCS_RWLOCK_WAIT bit while waiting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe detail something in the line of "the wait bit is meant for all subsequent read lock to let any write lock go first"
(__atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1, | ||
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))) { | ||
return 0; | ||
ucs_atomic_cswap(&lock->state, x, x + UCS_RWLOCK_WRITE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe keep wait bit included in the check too