Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCS: Introduce lightweight rwlock #10355

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Artemy-Mellanox
Copy link
Contributor

@Artemy-Mellanox Artemy-Mellanox commented Dec 5, 2024

[ RUN      ] test_rwlock.perf <> <>
[     INFO ] 8.22035 ms builtin 1 threads 1 writers per 256 
[     INFO ] 9.91718 ms builtin 1 threads 25 writers per 256 
[     INFO ] 15.6226 ms builtin 1 threads 128 writers per 256 
[     INFO ] 8.53102 ms builtin 1 threads 250 writers per 256 
[     INFO ] 14.8544 ms builtin 2 threads 1 writers per 256 
[     INFO ] 20.3778 ms builtin 2 threads 25 writers per 256 
[     INFO ] 25.2576 ms builtin 2 threads 128 writers per 256 
[     INFO ] 18.0413 ms builtin 2 threads 250 writers per 256 
[     INFO ] 15.9894 ms builtin 4 threads 1 writers per 256 
[     INFO ] 25.9471 ms builtin 4 threads 25 writers per 256 
[     INFO ] 30.6886 ms builtin 4 threads 128 writers per 256 
[     INFO ] 23.7099 ms builtin 4 threads 250 writers per 256 
[     INFO ] 109.585 ms builtin 64 threads 1 writers per 256 
[     INFO ] 498.465 ms builtin 64 threads 25 writers per 256 
[     INFO ] 981.667 ms builtin 64 threads 128 writers per 256 
[     INFO ] 891.406 ms builtin 64 threads 250 writers per 256 
[       OK ] test_rwlock.perf (2699 ms)
[ RUN      ] test_rwlock.pthread <> <>
[     INFO ] 15.7748 ms pthread 1 threads 1 writers per 256 
[     INFO ] 17.1273 ms pthread 1 threads 25 writers per 256 
[     INFO ] 26.9475 ms pthread 1 threads 128 writers per 256 
[     INFO ] 15.2211 ms pthread 1 threads 250 writers per 256 
[     INFO ] 29.4559 ms pthread 2 threads 1 writers per 256 
[     INFO ] 215.037 ms pthread 2 threads 25 writers per 256 
[     INFO ] 104.325 ms pthread 2 threads 128 writers per 256 
[     INFO ] 41.5187 ms pthread 2 threads 250 writers per 256 
[     INFO ] 35.4409 ms pthread 4 threads 1 writers per 256 
[     INFO ] 196.073 ms pthread 4 threads 25 writers per 256 
[     INFO ] 122.957 ms pthread 4 threads 128 writers per 256 
[     INFO ] 62.3138 ms pthread 4 threads 250 writers per 256 
[     INFO ] 73.8066 ms pthread 64 threads 1 writers per 256 
[     INFO ] 198.412 ms pthread 64 threads 25 writers per 256 
[     INFO ] 165.476 ms pthread 64 threads 128 writers per 256 
[     INFO ] 118.79 ms pthread 64 threads 250 writers per 256 
[       OK ] test_rwlock.pthread (1439 ms)

src/ucs/arch/cpu.h Outdated Show resolved Hide resolved
src/ucs/type/rwlock.h Outdated Show resolved Hide resolved
src/ucs/type/rwlock.h Outdated Show resolved Hide resolved
src/ucs/type/rwlock.h Outdated Show resolved Hide resolved
ucs_cpu_relax();
}

x = __atomic_fetch_add(&lock->l, UCS_RWLOCK_READ, __ATOMIC_ACQUIRE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can use the atomic operations defined in atomic.h?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the we replace deprecated __sync* fuctions from atomic.h with new __atomic variants?

src/ucs/type/rwlock.h Outdated Show resolved Hide resolved
test/gtest/ucs/test_type.cc Outdated Show resolved Hide resolved
Comment on lines 187 to 190
int m = std::thread::hardware_concurrency();
std::vector<int> threads = {1, 2, 4, m};
std::vector<int> writers_per_256 = {1, 25, 128, 250};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. i think we also want to measure overhead of read lock+unlock regardless of concurrency since it is the reason we added lightweight rwlock
  2. can you pls post example output in the PR description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. you mean write percent 0?
  2. didn't i do that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. yes ( isee you added it)
  2. yes, pls update with percent 0

test/gtest/ucs/test_type.cc Outdated Show resolved Hide resolved
test/gtest/ucs/test_type.cc Outdated Show resolved Hide resolved
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

void measure_one(int num, int writers, const std::function<void()> &r,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe introduce a typedef with using:
using run_func_t = std::function<void()>;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also propose more explicit names:
r -> reader
w -> writer
num -> thread_count

void measure_one(int num, int writers, const std::function<void()> &r,
const std::function<void()> &w, const std::string &name)
{
std::vector<std::thread> tt;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: maybe name it threads

{
int m = std::thread::hardware_concurrency();
std::vector<int> threads = {1, 2, 4, m};
std::vector<int> writers_per_256 = {1, 25, 128, 250};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use percent to set the number of writers? It will be easier to understand

std::vector<int> writers_percent = {1, 25, 50, 75, 90};

ucs_cpu_relax();
}

x = __atomic_fetch_add(&lock->l, UCS_RWLOCK_READ, __ATOMIC_ACQUIRE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the we replace deprecated __sync* fuctions from atomic.h with new __atomic variants?


UCS_TEST_F(test_rwlock, perf) {
ucs_rwlock_t lock = UCS_RWLOCK_STATIC_INITIALIZER;
measure(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good performance test, but it must change some state to guarantee the lock correctness. This is the whole purpose of litmus test. For instance, these functions may perform some simple math calculations and we check the invariant:

    // Invariant: counter2 is 2 times bigger than counter1
    int counter1 = 1;
    int counter2 = 2;
    measure(
            [&]() {
                ucs_rwlock_read_lock(&lock);
                UCS_ASSERT_EQ(counter1 * 2, counter2);
                ucs_rwlock_read_unlock(&lock);
            },
            [&]() {
                ucs_rwlock_write_lock(&lock);
                counter1 += counter1;
                counter2 += counter2;
                if (counter2 > 100000) {
                    counter1 = 1;
                    counter2 = 2;
                }
                ucs_rwlock_write_unlock(&lock);
            },

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a good idea. This test doesn't guarantee lock correctness, it will very easily give a false negative result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, testing correctness of MT algorithms is hard topic. You're right about false negatives results, that's ok. That's actually a nature of litmus tests: when they pass, it does not guarantee that your algorithm is 100% correct, but when they fail - it's obviously broken. Personally I catch tons of MT issues with the help of litmus tests.
If you propose some other way of testing correctness - let's discuss that. What I'm proposing is well established industry practise, and I'm sure it's better to have it than no testing at all.

Moreover, the example that I provided is just scratching the surface. We should also consider adding tests with nested locks, try-locks, etc

UCS_TEST_F(test_rwlock, pthread) {
pthread_rwlock_t plock;
pthread_rwlock_init(&plock, NULL);
measure(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, update/verify some state

src/ucs/type/rwlock.h Outdated Show resolved Hide resolved
int x;

while (1) {
while (lock->l & UCS_RWLOCK_MASK) {
Copy link
Contributor

@iyastreb iyastreb Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we read atomic without mem order guarantees, normally it should be something

__atomic_load_n(&lock->l, __ATOMIC_RELAXED)

Same in other read cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? AFAIU relaxed mem order means - no mem order guarantees

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, my point here is to use the uniform API to intercept loads/stores of this variable, so it does not need to be volatile. And then we can specify an appropriate mem order, whether it's relaxed or acquire.
Btw I also see performance improvement after replacing volatile with __atomic_loads

src/ucs/type/rwlock.h Outdated Show resolved Hide resolved

static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it is usually __ATOMIC_RELEASE to be used on all release paths to contain and be sure that what happened under lock is visible, any reason for not doing so in this PR?

}
};

UCS_TEST_F(test_rwlock, lock) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a way to run this/some tests with maximal optimizations (maybe inline compiler pragma, ..) ? I suspect we end-up running it without optimisations which might affect correctness-related tests?



/**
* Read-write lock.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spinlock

#include <errno.h>

/**
* The ucs_rwlock_t type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: ucs_rw_spinlock_t and for all apis?

}

while (lock->l > UCS_RWLOCK_WAIT) {
ucs_cpu_relax();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we benefit from either using sched_yeld(), or X times ucs_cpu_relax() before retrying to read lock->l?

}


static inline void ucs_rwlock_read_unlock(ucs_rwlock_t *lock)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add assertions/checks to detect underflow using returned value, only on debug builds (and also overflow on the other path)


static inline void ucs_rwlock_write_unlock(ucs_rwlock_t *lock)
{
__atomic_fetch_sub(&lock->l, UCS_RWLOCK_WRITE, __ATOMIC_RELAXED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

detect underflow on debug builds?

* Read-write lock.
*/
typedef struct {
volatile int l;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use unsigned int to have defined behavior on overflow/underflow (and detect it)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't expect overflow by design so signed int is more suitable i this case IMO

}


static inline void ucs_rwlock_write_lock(ucs_rwlock_t *lock)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe annotate for coverity with something like /* coverity[lock] */, to help it with sanity checks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for lock paths

static UCS_F_ALWAYS_INLINE void ucs_cpu_relax()
{
#ifdef __SSE2__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: do something else if SSE2 not defined, like "":::"memory" ?

Comment on lines +150 to +158
if (flags & UCS_ATOMIC_FENCE_LOCK) {
return __ATOMIC_ACQUIRE;
}

if (flags & UCS_ATOMIC_FENCE_UNLOCK) {
return __ATOMIC_RELEASE;
}

return __ATOMIC_RELAXED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are UCS_ATOMIC_xx defined as flags if there is no meaning to combine more than one of them?
also, how does it relate to UCS_DEFINE_ATOMIC_xx macros? should we remove/extend them?

Comment on lines +162 to +189
static UCS_F_ALWAYS_INLINE int ucs_atomic_get(int *ptr, unsigned flags)
{
return __atomic_load_n(ptr, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE int
ucs_atomic_fadd(int *ptr, int val, unsigned flags)
{
return __atomic_fetch_add(ptr, val, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE void
ucs_atomic_sub(int *ptr, int val, unsigned flags)
{
__atomic_fetch_sub(ptr, val, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE void ucs_atomic_or(int *ptr, int val, unsigned flags)
{
__atomic_fetch_or(ptr, val, ucs_atomic_memorder(flags));
}


static UCS_F_ALWAYS_INLINE int
ucs_atomic_cswap(int *ptr, int old_val, int new_val, unsigned flags)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we define it as macros to allow any type (not just int)?

Comment on lines +192 to +193
flags & UCS_ATOMIC_WEAK,
ucs_atomic_memorder(flags),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why in case of success we want weak mem order?

*/
typedef struct {
volatile int l;
} ucs_rwlock_t;
int state;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe uint32_t ? since it's flags value and not numeric value

(__atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))) {
return 0;
ucs_atomic_cswap(&lock->state, x, x + UCS_RWLOCK_WRITE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use
"!(x & UCS_RWLOCK_WRITE)" instead of "x < UCS_RWLOCK_WRITE"
"x | UCS_RWLOCK_WRITE" instead of "x + UCS_RWLOCK_WRITE"
since this is bit value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe keep wait bit included in the check too


static UCS_F_ALWAYS_INLINE void ucs_rw_spinlock_cleanup(ucs_rw_spinlock_t *lock)
{
ucs_assert(lock->state == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertv


int write_taken = 0;
bool write_taken = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use true/false for bool instead of 1/0

Comment on lines 187 to 190
int m = std::thread::hardware_concurrency();
std::vector<int> threads = {1, 2, 4, m};
std::vector<int> writers_per_256 = {1, 25, 128, 250};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. yes ( isee you added it)
  2. yes, pls update with percent 0

@@ -138,4 +139,59 @@ UCS_DEFINE_ATOMIC_BOOL_CSWAP(16, w);
UCS_DEFINE_ATOMIC_BOOL_CSWAP(32, l);
UCS_DEFINE_ATOMIC_BOOL_CSWAP(64, q);


#define UCS_ATOMIC_WEAK UCS_BIT(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to encapsulate these values, then maybe it should be an enum?
And also we miss few other values:
__ATOMIC_CONSUME - probably we don't need this
Data dependency only for both barrier and synchronization with another thread.
__ATOMIC_ACQ_REL - I'm not sure about this one
Full barrier in both directions and synchronizes with acquire loads and release stores in another thread.
__ATOMIC_SEQ_CST - the strongest mode, we definitely need it
Full barrier in both directions and synchronizes with acquire loads and release stores in all threads.

#define UCS_ATOMIC_FENCE_UNLOCK UCS_BIT(2)


static UCS_F_ALWAYS_INLINE int ucs_atomic_memorder(unsigned flags)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can simply use the actual flag (__ATOMIC_ACQUIRE etc) directly

static UCS_F_ALWAYS_INLINE void
ucs_rw_spinlock_read_unlock(ucs_rw_spinlock_t *lock)
{
ucs_assert(lock->state >= UCS_RWLOCK_READ);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertv with details?

@@ -0,0 +1,137 @@
/*
* Copyright (c) NVIDIA CORPORATION & AFFILIATES, 2024. ALL RIGHTS RESERVED.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyrights would need 2025

*
* Readers increment the counter by UCS_RWLOCK_READ (4)
* Writers set the UCS_RWLOCK_WRITE bit when lock is held
* and set the UCS_RWLOCK_WAIT bit while waiting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe detail something in the line of "the wait bit is meant for all subsequent read lock to let any write lock go first"

(__atomic_compare_exchange_n(&lock->l, &x, x + UCS_RWLOCK_WRITE, 1,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))) {
return 0;
ucs_atomic_cswap(&lock->state, x, x + UCS_RWLOCK_WRITE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe keep wait bit included in the check too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants