Fix races around pthread exit and join #409
Conversation
There are other locations, such as here, which alias `__tl_{lock,unlock}` to a dummy function (and a few others as well). Should those be removed?

Another location I see is this one, which, by avoiding `__tl_sync`, is not properly synchronizing with `__pthread_exit`: as soon as this line is executed, the thread is eligible for freeing, which would be a use-after-free with later `self->foo` writes (or stack writes).

Additionally, I'll still point out that despite the example in #405 not using detached threads, I don't think that `free` is sufficient to deallocate a thread's stack. There's no guarantee that the stack isn't in use at the end of `free`, or at the end of `__pthread_exit` when everything is cleaned up, meaning it could still be a use-after-free.

It seems like wasi-libc was originally patched to not use threads, but when re-adding threads support it seems like it might be worthwhile to start from scratch or otherwise comprehensively review all threading-related patches, although I'm not sure if that's possible.
```asm
# Unlock thread list. (as CLONE_CHILD_CLEARTID would do for Linux)
#
# Note: once we unlock the thread list, our "map_base" can be freed
# by a joining thread. It's safe as we are in ASM and no longer use
# our C stack or pthread_t.
i32.const __thread_list_lock
i32.const 0
i32.atomic.store 0
# As an optimization, we can check tl_lock_waiters here.
# But for now, simply wake up unconditionally as
# CLONE_CHILD_CLEARTID does.
i32.const __thread_list_lock
i32.const 1
memory.atomic.notify 0
drop
```
Reading over the definition of `__tl_unlock`, it additionally decrements `tl_lock_count`. It might be good to assert that it is zero in C before returning here to perform this. Additionally, it might be good to leave a comment in `__tl_unlock` that this needs updating if the implementation changes.
Why not just call `__tl_unlock`?
> Reading over the definition of `__tl_unlock`, it additionally decrements `tl_lock_count`. It might be good to assert that it is zero in C before returning here to perform this. Additionally, it might be good to leave a comment in `__tl_unlock` that this needs updating if the implementation changes.
i feel it belongs in upstream musl, not here.
> Why not just call `__tl_unlock`?

because `__tl_unlock` is a C function, which potentially requires a C stack.
I'm not sure what you mean by it belongs in upstream musl? Wasi-libc is already patching musl a fair bit, so what I'm saying is that, as part of the patching, the cases where `__pthread_exit` returns should all assert that `tl_lock_count` is zero.
> I'm not sure what you mean by it belongs in upstream musl? Wasi-libc is already patching musl a fair bit, so what I'm saying is that, as part of the patching, the cases where `__pthread_exit` returns should all assert that `tl_lock_count` is zero.

i meant that "`tl_lock_count` should be zero when a thread exits" is not wasi-specific. if you feel strongly, i can add such an assertion.
my understanding is that these weak symbols are resolved to the real `__tl_lock`/`__tl_unlock`.
i agree it's broken. i have a few ideas to fix it. i intentionally didn't include a fix in this PR because it's a separate issue.
```c
 * See also set_tid_address(2)
 *
 * In WebAssembly, we leave it to wasi_thread_start instead.
 */
```
Could we instead call `__tl_unlock` right before `a_store(&self->detach_state, DT_EXITED);`, like we do right before `free(self->map_base);`?

In the first case we are unlocking right before notifying the joiner that they can call free, and in the latter case we are unlocking right before we call `free` ourselves, so it seems symmetrical.

(I'm suggesting this so that we can avoid re-implementing `__tl_unlock` in asm if we can.)
Actually, what was wrong with the single `__tl_unlock` on line 172 that we had previously?
> Could we instead call `__tl_unlock` right before `a_store(&self->detach_state, DT_EXITED);`, like we do right before `free(self->map_base);`? In the first case we are unlocking right before notifying the joiner that they can call free, and in the latter case we are unlocking right before we call `free` ourselves, so it seems symmetrical.

- the detached case is broken, as alex said.
- detached threads never have joiners.

> (I'm suggesting this so that we can avoid re-implementing `__tl_unlock` in asm if we can.)

this PR doesn't re-implement `__tl_unlock`. it emulates `CLONE_CHILD_CLEARTID`, which the `__tl_lock`/`__tl_unlock`/`__tl_sync` protocol relies on.
> Actually, what was wrong with the single `__tl_unlock` on line 172 that we had previously?

- double unlock.
- it unlocks too early and allows the joiner to free our stack before we are finished with it.
I agree the double unlock thing looks like a bug.

But the joiner is not waiting on the `tl_lock`, is it? The joiner seems to be waiting on `t->detach_state`. In fact, I don't see the `tl_lock` referenced at all in `pthread_join.c`. Maybe I'm missing something?
the joiner uses `__tl_sync` to sync with the exit.
I wonder if we should instead come up with a different solution that doesn't involve a thread freeing its own stack. How about this:
Aha you are correct, I was misunderstanding what a weak alias did. Disregard these comments of mine then!
Is the "ready for collection" flag necessary in that scheme, given that the linked list is protected by `__thread_list_lock`?

It seems a bit of a bummer though in that if you have a bunch of threads all exit at once and you never create another thread, then that's a bunch of memory stuck which can't be reused. Another possible alternative though would be to open-code the mutex around
The weird part though is that
for the detached case, i'm thinking: a. implement

right now i'm inclined towards b, because
Neat idea. Sounds like that could work.
does anyone still have any concerns?

ping
```asm
i32.const __thread_list_lock
i32.const 1
memory.atomic.notify 0
drop
```
How about putting all this new code in a new local function called `__tl_unlock_asm`? And perhaps mention in the comment why we need the asm re-implementation here.

Otherwise this lgtm now.
i added a comment.
I spent some time today staring at this, trying to convince myself that it is correct. I'm not too familiar with the details of `CLONE_CHILD_CLEARTID` and the "`__tl_lock`/`__tl_unlock`/`__tl_sync` protocol" referred to, so it could be that I'm too concerned about something that is actually fine here. What is concerning, though, is that it is not easy to figure out what the change is doing, and I suspect that will make troubleshooting difficult in the future.

Tell me if this is a correct reading: we need to control concurrent access to `__thread_list_lock`, which previously was done via `__tl_lock`, `__tl_unlock` and `__tl_sync`. Because the current thread's stack might be deallocated, we can't call `__tl_unlock` directly and instead do something "equivalent" in assembly code, by just atomically setting `__thread_list_lock` to zero and waking the next waiter. This means that there are two places munging with `__thread_list_lock`: the assembly code and `__tl_unlock`. (Why isn't the function removed then, e.g., to avoid future errors? Presumably it is used somewhere else when the stack is valid?) But the "equivalent" assembly code is also not completely equivalent: it doesn't decrement `tl_lock_count` as `__tl_unlock` does. So it appears as if there are cases where the assembly code could result in different behavior than the original `__tl_unlock`, namely when `__tl_lock` has been called multiple times previously. It is unclear why this different behavior is OK here.

None of this is explained in the code comments or commit message. It feels like some kind of bug is already present (re: `tl_lock_count`); if it is not, it still "appears" that way because there is no explanation as to why we can disregard `tl_lock_count`. And, since there is little explanation, it seems likely that if there isn't a bug already present, there could be one soon if someone again inserts a `__tl_unlock` in the wrong place (since they should instead rely on the assembly code).
the asm code mimics `CLONE_CHILD_CLEARTID`, not `__tl_unlock`. `CLONE_CHILD_CLEARTID` is Linux kernel functionality that the original musl code relies on.

i feel it's actually clearly explained in the comment.
I think this is the part that took me a while to understand. At this point I think the comment explains it, but early on I was confused about this point (and `CLONE_CHILD_CLEARTID` in general).
https://www.man7.org/linux/man-pages/man2/clone.2.html seems to have a good enough explanation to read this patch:
I want to include WebAssembly/wasi-libc#409, which is rather easy to hit when people are experimenting with wasi-threads.