
Could there be a use-after-free bug on the atomic shared_ptr implementation? #4

Open
QuangTung97 opened this issue Dec 16, 2024 · 13 comments

Comments

@QuangTung97
Copy link

I think the implementation here (and possibly many other implementations) has this bug:

For example, assume we have 2 threads:

  • Thread 1 does a get() on the AtomicSharedPtr ptr
  • Thread 2 does a store() on ptr.

With the following interleaving of actions:

  1. Thread 1 increases the local refcount and loads the pointer to the control block from ptr
  2. Thread 1 increases the global refcount in the control block. (global refcount now = 2)
  3. Thread 2 comes along and updates the 48-bit pointer to the new control block
  4. BUT before thread 2 can increase the global refcount, it is temporarily paused by the system scheduler.
  5. Thread 1 then tries to decrease the local refcount and finds out that the 48-bit pointer has changed
  6. Thread 1 decreases the global refcount instead. (global refcount now = 1)
  7. Thread 1 then starts using the object
  8. After it finishes using the object, Thread 1 decreases the global refcount. (global refcount now = 0)
  9. Thread 1 then deletes the control block.
  10. Finally thread 2 is resumed, and it tries to increase the global refcount. But the control block has already been freed => use-after-free bug

Did I get something wrong somewhere?
Or is the algorithm itself incorrect?
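
To make the interleaving concrete, this is roughly the split-refcount scheme I have in mind. It is only a sketch with made-up names (the ControlBlock, get(), naiveStore() and the bit layout are mine), not the actual code from this repo:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout: high 16 bits of `packed` = local refcount,
// low 48 bits = pointer to the control block.
struct ControlBlock {
    std::atomic<uint64_t> globalRefCount{1};
    // the owned object and its deletion when globalRefCount hits 0 are omitted
};

constexpr uint64_t kLocalStep = 1ULL << 48;
constexpr uint64_t kPtrMask = kLocalStep - 1;

std::atomic<uint64_t> packed;   // the AtomicSharedPtr's single word

ControlBlock* get() {
    // steps 1-2: bump the local refcount and read the pointer in one atomic op,
    // then bump the global refcount
    uint64_t snapshot = packed.fetch_add(kLocalStep) + kLocalStep;
    auto* block = reinterpret_cast<ControlBlock*>(snapshot & kPtrMask);
    block->globalRefCount.fetch_add(1);

    // steps 5-6: give the local ref back; if the pointer changed under us,
    // compensate on the global refcount instead
    while (true) {
        uint64_t cur = packed.load();
        if ((cur & kPtrMask) != (snapshot & kPtrMask)) {
            // here I assume the storing thread already pumped my local ref into
            // the global refcount; if it has not done so yet (step 4 above),
            // this decrement lets the count reach zero too early
            block->globalRefCount.fetch_sub(1);
            break;
        }
        if (packed.compare_exchange_weak(cur, cur - kLocalStep))
            break;
    }
    return block;
}

void naiveStore(ControlBlock* newBlock) {
    // step 3: swap in the new pointer...
    uint64_t old = packed.exchange(reinterpret_cast<uint64_t>(newBlock));
    auto* oldBlock = reinterpret_cast<ControlBlock*>(old & kPtrMask);
    // step 4: ...and only afterwards pump the old local refcount to global.
    // If this thread is paused right here, the interleaving above frees
    // oldBlock before the next line runs: use-after-free.
    oldBlock->globalRefCount.fetch_add(old >> 48);
    // releasing the AtomicSharedPtr's own reference to oldBlock is omitted
}
```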

@vtyulb
Owner

vtyulb commented Dec 16, 2024

The trick is hidden at step 3. You cannot simply update the control block to a new location; you must take a holder (basically a guard) first. The holder is created here: https://github.com/vtyulb/AtomicSharedPtr/blob/master/src/atomic_shared_ptr.h#L309

Now we have at least one alive pointer instance. However, that says nothing about the global refcount (moreover, FastSharedPtr never increases the global refcount, contrary to SharedPtr).

That's why we never update our control block while it has a non-zero local refcount. I call it the "pumping phase": https://github.com/vtyulb/AtomicSharedPtr/blob/master/src/atomic_shared_ptr.h#L316
If we have a non-zero local refcount, we pump it into the global refcount.

When the local refcount is pumped out, we can finally exchange our control block here: https://github.com/vtyulb/AtomicSharedPtr/blob/master/src/atomic_shared_ptr.h#L325

So, steps 3 and 4 actually occur in reverse order. And there is no direct store() method; it's always compareExchange() in this implementation.

@vtyulb
Owner

vtyulb commented Dec 16, 2024

Thanks for the question, I just had a lot of fun sorting out the algorithm

@QuangTung97
Author

QuangTung97 commented Dec 16, 2024

Could something interfere in the middle, between the "pumping phase" and the exchange-control-block step?
It's a CAS loop, so could this increase the global refcount multiple times?
Also, I still have a hard time convincing myself that nothing goes wrong when multiple store() calls on the atomic shared ptr happen concurrently.

Could you describe it in some kind of pseudocode (without the FastSharedPtr)?
Thanks a lot

@QuangTung97
Author

I also created my own way to resolve this bug.
But I'm not totally sure it's correct.
So I really want to compare it with others first.

@vtyulb
Owner

vtyulb commented Dec 16, 2024

Could something interfere in the middle, between the "pumping phase" and the exchange-control-block step? It's a CAS loop, so could this increase the global refcount multiple times?

No, if we didn't succeed then we subtract our counter back: https://github.com/vtyulb/AtomicSharedPtr/blob/master/src/atomic_shared_ptr.h#L320

We try to do the actual control block exchange when we know that it should have a zero local refcount now: https://github.com/vtyulb/AtomicSharedPtr/blob/master/src/atomic_shared_ptr.h#L325

read:

  1. fetch_add on the control block word: increase the local refcount and extract the control block at the same moment
  2. (if applicable) try doing the pumping phase, so the local refcount won't overflow

compareExchange:

  1. read(), so we have 1 pointer instance: https://github.com/vtyulb/AtomicSharedPtr/blob/master/src/atomic_shared_ptr.h#L309
  2. if the pointer doesn't match, surrender (line 311, line 315)
  3. pump out the local refcount if it's not zero (lines 316-323):
    • increase the global refcount
    • try to decrease the local refcount with compare_exchange
    • if that didn't succeed, decrease the global refcount back
    • return to step 2 anyway
  4. try to exchange the zero-local-refcount control block for the new one; if that didn't succeed, return to step 2 (lines 325-331)
  5. decrease the old control block once (line 23, line 283). The AtomicSharedPtr counts as 1 instance itself, and it simply no longer contains it

store:

  • call read(), try compareExchange(), retry the cycle (a rough C++ sketch of this flow follows after the invariants below)

The invariants are:

  • We never do dangerous operations if the local refcount is not zero
  • We can always pump out the local refcount to global non-destructively, if we have a SharedPtr / FastSharedPtr
  • The global refcount can never reach zero while any SharedPtr / FastSharedPtr exists
  • SharedPtr increases the local refcount and tries to pump it out to global (line 233). If it didn't succeed, someone else already did the job for it
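
Here is a rough C++ sketch of that compareExchange flow, with made-up names rather than the real classes from atomic_shared_ptr.h, and with the read() holder from step 1 and FastSharedPtr left out (assume the caller keeps `expected` alive while this runs):

```cpp
#include <atomic>
#include <cstdint>

struct ControlBlock {
    std::atomic<uint64_t> globalRefCount{1};
    // deletion of the object when globalRefCount reaches 0 is omitted
};

constexpr uint64_t kLocalStep = 1ULL << 48;
constexpr uint64_t kPtrMask = kLocalStep - 1;

std::atomic<uint64_t> packed;   // high 16 bits: local refcount, low 48 bits: ControlBlock*

bool compareExchange(ControlBlock* expected, ControlBlock* desired) {
    // the caller transfers one reference of `desired` into the AtomicSharedPtr
    while (true) {
        uint64_t cur = packed.load();
        auto* block = reinterpret_cast<ControlBlock*>(cur & kPtrMask);
        if (block != expected)
            return false;                                    // step 2: surrender

        uint64_t localCount = cur >> 48;
        if (localCount != 0) {
            // step 3: pumping phase, move the local refcount into the global one
            block->globalRefCount.fetch_add(localCount);
            if (!packed.compare_exchange_strong(cur, cur & kPtrMask))
                block->globalRefCount.fetch_sub(localCount); // interfered with: undo
            continue;                                        // back to step 2 either way
        }

        // step 4: local refcount is zero, try to install the new control block
        if (packed.compare_exchange_strong(cur, reinterpret_cast<uint64_t>(desired))) {
            // step 5: the AtomicSharedPtr counted as one instance of the old
            // block and no longer contains it, so give that reference back
            block->globalRefCount.fetch_sub(1);
            return true;
        }
        // exchange failed, back to step 2
    }
}
```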

I also created my own way to resolve this bug.
But I'm not totally sure it's correct.
So I really want to compare it with others first.

You should actually try it. This project has stress tests; I had a hard time debugging an ABA problem in lock-free structures that use my AtomicSharedPtr. Build commands are here: https://github.com/vtyulb/AtomicSharedPtr/tree/master#build

You can increase the cycle length and enable/disable all the sanitizers (but they are quite slow). The project also has native logging. Standard logging won't do, because as it turns out any broken code starts to work once you add a mutex / syscall / another lock-free queue.

@QuangTung97
Author

Because you increase the global refcount and then try to decrease the local refcount, is that why this condition adds the check for a zero local refcount?
https://github.com/vtyulb/AtomicSharedPtr/blob/master/src/atomic_shared_ptr.h#L241

@QuangTung97
Author

QuangTung97 commented Dec 17, 2024

My idea is different:

  • To avoid the problem I mentioned at the start, we should conditionally decrease the global refcount ONLY after the store() thread has successfully pumped the local counter into the global refcount.
  • The global refcount word now has 3 parts instead: a 47-bit refcount, a 16-bit ignored_counter, and a 1-bit flag denoting that it has been PUMPED.
  • So the load() thread, when it finds out that the 48-bit address has changed, decreases the global refcount by one only when the PUMPED flag has already been set.
  • Otherwise, if the PUMPED flag has NOT been set, it increases the ignored_counter by one.
    The ignored_counter signals the store() thread that it should pump LESS into the global refcount, because this load() thread will not try to decrease the global refcount again.
    These two cases are done atomically with CAS.

On the store() thread, the steps are:

  • first, compare-exchange to swap the old (address, old_local_refcount) for (new_address, 0)
  • then change the global refcount word atomically: set the PUMPED bit to one and increase the 47-bit global ref counter by (old_local_refcount - ignored_counter).

The store() thread and the load() thread exchange information depending on who comes to change the packed global refcount first.

  1. If the store() thread comes first => it sets the PUMPED flag => the load() thread must then decrease the 47-bit global refcount by one.
  2. If the load() thread comes first => it adds one to the ignored_counter => the store() thread will then pump less (or nothing) into the 47-bit global refcount.

I'm pretty sure this approach works (rough sketch below).
But can you help me try this? Thanks
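
A rough sketch of the two racing updates on the packed global word, with a hypothetical layout (bits 0-46: refcount, bits 47-62: ignored_counter, bit 63: PUMPED) and names of my own:

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t kPumpedBit   = 1ULL << 63;   // PUMPED flag
constexpr uint64_t kIgnoredUnit = 1ULL << 47;   // one unit of ignored_counter

struct ControlBlock {
    std::atomic<uint64_t> packedGlobal{1};      // refcount = 1, ignored = 0, not PUMPED
};

// load() side: it has already done its unconditional +1 on the refcount part and
// has just noticed that the 48-bit address changed under it.
void loadSawAddressChanged(ControlBlock* block) {
    uint64_t cur = block->packedGlobal.load();
    while (true) {
        uint64_t next = (cur & kPumpedBit)
            ? cur - 1               // already PUMPED: give back the duplicate ref
            : cur + kIgnoredUnit;   // not pumped yet: ask store() to pump one less
        if (block->packedGlobal.compare_exchange_weak(cur, next))
            return;
    }
}

// store() side: it has already swapped (old_address, old_local_refcount) for
// (new_address, 0) and now pumps the local count it swapped out.
void storePumps(ControlBlock* oldBlock, uint64_t oldLocalRefCount) {
    uint64_t cur = oldBlock->packedGlobal.load();
    while (true) {
        uint64_t ignored = (cur >> 47) & 0xFFFF;
        // ignored <= oldLocalRefCount, because every load() that bumped the
        // ignored_counter had already bumped the local count that was swapped out
        uint64_t next = (cur | kPumpedBit) + (oldLocalRefCount - ignored);
        if (oldBlock->packedGlobal.compare_exchange_weak(cur, next))
            return;
    }
}
```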

@QuangTung97
Author

I did create a TLA+ spec of this algorithm.
The full spec is here: https://github.com/QuangTung97/tla-plus/blob/master/AtomicPtrV2/AtomicPtrV2.tla
And checked with many invariants like:

  • No use-after-free
  • No double free
  • No missing free
  • No deadlock

I think I will also try to model your approach with TLA+.
Is the approach you use the same as in the folly lib?

@vtyulb
Owner

vtyulb commented Dec 17, 2024

Because you increase the global refcount and then try to decrease the local refcount, is that why this condition adds the check for a zero local refcount? https://github.com/vtyulb/AtomicSharedPtr/blob/master/src/atomic_shared_ptr.h#L241

No, that specific condition is needed for ABA handling. Someone could change the control block twice. They must have also pumped the local refcount to global, so we can safely decrease the global refcount back at line 243.

@vtyulb
Owner

vtyulb commented Dec 17, 2024

My idea is different:

  • To avoid the problem I mentioned at the start, we should conditionally decrease the global refcount ONLY after the store() thread has successfully pumped the local counter into the global refcount.
  • The global refcount word now has 3 parts instead: a 47-bit refcount, a 16-bit ignored_counter, and a 1-bit flag denoting that it has been PUMPED.
  • So the load() thread, when it finds out that the 48-bit address has changed, decreases the global refcount by one only when the PUMPED flag has already been set.
  • Otherwise, if the PUMPED flag has NOT been set, it increases the ignored_counter by one.
    The ignored_counter signals the store() thread that it should pump LESS into the global refcount, because this load() thread will not try to decrease the global refcount again.
    These two cases are done atomically with CAS.

On the store() thread, the steps are:

  • first, compare-exchange to swap the old (address, old_local_refcount) for (new_address, 0)
  • then change the global refcount word atomically: set the PUMPED bit to one and increase the 47-bit global ref counter by (old_local_refcount - ignored_counter).

The store() thread and the load() thread exchange information depending on who comes to change the packed global refcount first.

  1. If the store() thread comes first => it sets the PUMPED flag => the load() thread must then decrease the 47-bit global refcount by one.
  2. If the load() thread comes first => it adds one to the ignored_counter => the store() thread will then pump less (or nothing) into the 47-bit global refcount.

I'm pretty sure this approach works. But can you help me try this? Thanks

I think I agree with the approach; it should work. Did I understand correctly: the PUMPED flag is not set at the start, can be set only once, and is never reset again?

@vtyulb
Owner

vtyulb commented Dec 17, 2024

I did create a TLA+ spec of this algorithm. The full spec is here: https://github.com/QuangTung97/tla-plus/blob/master/AtomicPtrV2/AtomicPtrV2.tla And checked with many invariants like:

  • No use-after-free
  • No double free
  • No missing free
  • No deadlock

I think I will also try to model your approach with TLA+. Is the approach you use the same as in the folly lib?

That's a great idea! I found out about TLA+ only a year ago, and it would probably have saved me some time; I lost a lot of it debugging a random ABA problem specific to the current algorithm.

Folly is both similar and different. I dived into folly with @timuraudio 3 years ago, so my info may be a little outdated. At that point we had the same packing idea: local and global refcounts. Their global refcount is compatible with std::shared_ptr, but they used some dirty hacks to do that, so they lost compatibility with macOS. Folly is production code, and AtomicSharedPtr is still just a proof of concept. Timur described the differences at CppCon and some other conferences, I think.

Now that you mention folly, it's funny that I finally understand the problem you described in your first message. We have 3 different solutions to it:

  1. I use some strange invariants where the pumping phase can happen everywhere: at destruction, compareExchange and read. If for some reason it's not possible, I'm sure that someone else did the job. Hard to prove formally; maybe TLA+ will be the cleanest way
  2. You suggest an extra flag to help with the exchange. However, you should be very cautious about ABA: the PUMPED flag can suddenly be reset after 2 exchanges. But I can't provide a broken example, so it should work
  3. Folly does another thing. They create AtomicSharedPtr with an immediate reserve on the global refcount (1024 or something), so they are not afraid of doing extra decreases for free; in the end the math magically comes together. That's also faster than my version, because they don't need to do all these fetch_add(1) / fetch_add(-1) calls; they do it in chunks.

@QuangTung97
Author

QuangTung97 commented Dec 18, 2024

I did create a TLA+ spec of this algorithm. The full spec is here: https://github.com/QuangTung97/tla-plus/blob/master/AtomicPtrV2/AtomicPtrV2.tla And checked with many invariants like:

  • No use-after-free
  • No double free
  • No missing free
  • No deadlock

I think I will also try to model your approach with TLA+. Is the approach you use the same as in the folly lib?

That's a great idea! I found out about TLA+ only a year ago, and it would probably have saved me some time; I lost a lot of it debugging a random ABA problem specific to the current algorithm.

Folly is both similar and different. I dived into folly with @timuraudio 3 years ago, so my info may be a little outdated. At that point we had the same packing idea: local and global refcounts. Their global refcount is compatible with std::shared_ptr, but they used some dirty hacks to do that, so they lost compatibility with macOS. Folly is production code, and AtomicSharedPtr is still just a proof of concept. Timur described the differences at CppCon and some other conferences, I think.

Now that you mention folly, it's funny that I finally understand the problem you described in your first message. We have 3 different solutions to it:

  1. I use some strange invariants where the pumping phase can happen everywhere: at destruction, compareExchange and read. If for some reason it's not possible, I'm sure that someone else did the job. Hard to prove formally; maybe TLA+ will be the cleanest way
  2. You suggest an extra flag to help with the exchange. However, you should be very cautious about ABA: the PUMPED flag can suddenly be reset after 2 exchanges. But I can't provide a broken example, so it should work
  3. Folly does another thing. They create AtomicSharedPtr with an immediate reserve on the global refcount (1024 or something), so they are not afraid of doing extra decreases for free; in the end the math magically comes together. That's also faster than my version, because they don't need to do all these fetch_add(1) / fetch_add(-1) calls; they do it in chunks.

I added this to my TLA+ spec.
The idea is that if a control block has already been destroyed by free(), it can be reused.
https://github.com/QuangTung97/tla-plus/blob/master/AtomicPtrV2/AtomicPtrV2.tla#L62

So I can check whether the ABA problem can cause trouble.
Currently it does not.
Hopefully the real implementation is also correct.

After you make your own TLA+ model of your idea (or folly's idea), could you send it to me to study / check? Thank you in advance.

As for whether there's an ABA problem when setting the PUMPED flag: I think there is not.
Only one store() thread can set PUMPED; there is no concurrent set here.
Other load() threads can race with the store() thread to increase the ignored_counter,
but no one other than the store() thread will turn on that flag.

The question here is: could this control block be reused by another AtomicSharedPtr before it has been destroyed? Is there a use case for this? If yes, my algorithm may not be sufficient.
The only point at which I think the PUMPED flag can be cleared back to zero is when the global refcount = 0.
For a lock-free stack over a singly linked list I think this limitation is not a problem.
It's not the case in general though.

For the folly lib: how can folly magically decrease the global refcount to zero after all of that? Genuinely curious.

@vtyulb
Owner

vtyulb commented Dec 19, 2024

I added this to my TLA+ spec. The idea is that if a control block has already been destroyed by free(), it can be reused. https://github.com/QuangTung97/tla-plus/blob/master/AtomicPtrV2/AtomicPtrV2.tla#L62

So I can check whether the ABA problem can cause trouble. Currently it does not. Hopefully the real implementation is also correct.

After you make your own TLA+ model of your idea (or folly's idea), could you send it to me to study / check? Thank you in advance.

I will have a New Year holiday soon, but I'm not sure if I'll be able to use it for this. No promises here.

As for whether there's an ABA problem when setting the PUMPED flag: I think there is not. Only one store() thread can set PUMPED; there is no concurrent set here. Other load() threads can race with the store() thread to increase the ignored_counter, but no one other than the store() thread will turn on that flag.

The question here is: could this control block be reused by another AtomicSharedPtr before it has been destroyed? Is there a use case for this? If yes, my algorithm may not be sufficient. The only point at which I think the PUMPED flag can be cleared back to zero is when the global refcount = 0. For a lock-free stack over a singly linked list I think this limitation is not a problem. It's not the case in general though.

You never know. Someone can do a dequeue and use the control block in 2 pointers. Someone can reuse control blocks all the time, because allocating memory is too slow. Memory allocation is never lock-free, so you should expect anything.

For the folly lib: how can folly magically decrease the global refcount to zero after all of that? Genuinely curious.

When they create an AtomicSharedPtr, they reserve 8192 on the global refcount: https://github.com/facebook/folly/blob/main/folly/concurrency/AtomicSharedPtr.h#L246
When they release, they release only the needed diff: https://github.com/facebook/folly/blob/main/folly/concurrency/AtomicSharedPtr.h#L257
And finally, when they feel danger (the local refcount has used up almost the full reserve on the global refcount), they do the exchange in batches: https://github.com/facebook/folly/blob/main/folly/concurrency/AtomicSharedPtr.h#L343
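
The general idea looks roughly like this toy sketch; it is only the technique as I understand it, with made-up names (install/load/exchange, kReserve), not folly's actual code:

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t kReserve   = 8192;        // references pre-paid on the global refcount
constexpr uint64_t kLocalStep = 1ULL << 48;  // local counter packed above the 48-bit pointer
constexpr uint64_t kPtrMask   = kLocalStep - 1;

struct ControlBlock {
    std::atomic<uint64_t> globalRefCount{0};
    // deletion when globalRefCount reaches 0 is omitted
};

std::atomic<uint64_t> packed;

void install(ControlBlock* block) {
    block->globalRefCount.fetch_add(kReserve);   // pre-pay the whole reserve up front
    packed.store(reinterpret_cast<uint64_t>(block));
}

ControlBlock* load() {
    // each load() just consumes one unit of the reserve; it never touches globalRefCount
    uint64_t cur = packed.fetch_add(kLocalStep) + kLocalStep;
    return reinterpret_cast<ControlBlock*>(cur & kPtrMask);
}

void exchange(ControlBlock* newBlock) {
    newBlock->globalRefCount.fetch_add(kReserve);
    uint64_t old = packed.exchange(reinterpret_cast<uint64_t>(newBlock));
    auto* oldBlock = reinterpret_cast<ControlBlock*>(old & kPtrMask);
    uint64_t consumed = old >> 48;               // units already handed out to loaders
    // release only the unused part of the reserve; each loader releases its own
    // unit later, so the math comes together in the end (refreshing the reserve
    // in batches when `consumed` gets close to kReserve is left out here)
    oldBlock->globalRefCount.fetch_sub(kReserve - consumed);
}
```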

I can't verify their code, but I trust it works. I believe mine works too 😄
