Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Context concurrency and parallelism issues #1766

Open
arielvalentin opened this issue Nov 21, 2024 · 9 comments
Open

Context concurrency and parallelism issues #1766

arielvalentin opened this issue Nov 21, 2024 · 9 comments
Labels

Comments

@arielvalentin
Copy link
Contributor

The OpenTelemetry::Context uses a fiber local Ruby Array to keep track of the active Context instance.

There are at least 2 instances in gems that are breaking encapsulation and copying fiber local variables to child threads:

This leads to concurrency and parallel processing issues because multi-threads are pushing and popping entries on the same Stack, resulting in OpenTelemetry::Context::DetachError as well as non-deterministic behavior where the OpenTelemetry::Context.current method returns the wrong instance of the context object.

Helpers that rely on the implicitly active Context, like OpenTelemetry::Trace.current_span, will end up receiving the wrong Span object instance.

There is a proposed fixed here that would end up using Immutable Arrays, however this will increase the number of object allocations per Context stack mutation:

#1760

Another option I have considered is to propose using Thread Local variables instead of Fiber local variables, however that would result in concurrency issues when using Fibers where multiple fibers would have access to the same Array instance. We would need a mechanism to seed a spawned thread with a snapshot of the stack so that there could be some consistent context propagation for the gems listed above.

I would like to get some feedback from @open-telemetry/ruby-maintainers on whether or not there are addition alternatives we can consider.

@ioquatix
Copy link

ioquatix commented Nov 22, 2024

@arielvalentin thanks for your great summary of an issue, including the links to real source code which exhibits the problem. I'm from the Ruby core team and work on concurrency and issues like this.

Copying around thread locals like this is like copying around private instance variables between objects, it's a horrible idea and should never be done.

There is now a feature in Ruby, which does this and is expected to do this: Fiber.storage

irb(main):001> Fiber[:foo] = :bar
=> :bar
irb(main):002* Thread.new do
irb(main):003*   pp Fiber[:foo]
irb(main):004> end
=> #<Thread:0x000074fcdef31970 (irb):2 run>
:bar

The fiber storage is inherited by copy so mutations in child keys isn't propagated upwards, but you can use mutable values (which must be thread safe) for this.

Using thread locals for "per request state" is an extremely bad idea.

@arielvalentin
Copy link
Contributor Author

arielvalentin commented Nov 26, 2024

@ioquatix Thanks for sharing your perspective here.

I suspect that library maintainers are doing this out of convenience without knowing the negative impact that it has on their users.

In our case I would say we generally want implicit immutable state staring within the same thread, but likely explicit sharing when spawning fibers or threads.

As you pointed out if we did use fiber storage, we are still in need an efficient fiber local Immutable Stack or Array data structure to avoid non-deterministic behavior.

There are also other use cases that our users tend to rely on for resetting some state process forking. I personally rely on ActiveSupport::ForkTracker to address some of these, but in the case of the SDK the code checks for when process id changes to reset state in the child that was captured by the parent process.

I know it is quite risky to implement (e.g. fork bombing) but has there been any discussion to add lifecycle hooks to register custom code when processes, threads, or fibers are spawned?

@arielvalentin
Copy link
Contributor Author

arielvalentin commented Nov 26, 2024

@ioquatix A related problem we have is finding a solution for re-entrant locks on our thread safe structures.

The SDK currently use mutexes to ensure thread safety when mutating a Span. We have a use case where we need to allow the thread to re-enter a mutex described here. I haven't done much in the way of exploring any additional options but your input would be helpful here:

#1740

@ioquatix
Copy link

ioquatix commented Nov 27, 2024

We should strongly encourage library authors to avoid this kind of code, and graphql-ruby are going to fix their implementation: rmosolgo/graphql-ruby#5173 (comment)

Fiber storage supports immutable updates, e.g.

Fiber[:thing] = Fiber[:thing].with(...)

(assuming #with returns a new instance with the updated state).

I know it is quite risky to implement (e.g. fork bombing) but has there been any discussion to add lifecycle hooks to register custom code when processes, threads, or fibers are spawned?

Yes, see Process._fork. As for threads and fibers, it's generally a bad idea.

For thread and fiber locals, you may prefer to use attributes, these are not going to be copied by the code previously discussed, e.g.

Thread.attr_accessor :mylibrary_mything

Thread.current.mylibrary_mything = ... # proper thread local.

@fbogsany
Copy link
Contributor

For thread and fiber locals, you may prefer to use attributes, these are not going to be copied by the code previously discussed

I benchmarked that approach and several others, result summary is #1760 (comment). For the recursive benchmark, a Fiber attribute is 20% slower than Thread.current fiber-local storage. Concurrent Ruby's FiberLocalVar is faster than a Fiber attribute, but relies on Fiber.storage for its implementation.

@ioquatix
Copy link

I would try to avoid having a dependency on concurrent-ruby it is a lot to pull in.

@ioquatix
Copy link

You should try Thread.attr ... or Fiber.attr ... it gets optimised because of object shapes. It will be the fastest IIRC.

Copy link
Contributor

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

@ivoanjo
Copy link

ivoanjo commented Jan 2, 2025

On a tiny note, while the current storage mechanism is quite flexible internally, it does make it quite hard external tools such as profilers to get this data to include as correlation (e.g. DataDog/dd-trace-rb#3984 is quite a maze). #842 is an earlier discussion of this too.

With opentelemetry also now having support for profiling, this use case may become more common (e.g. beyond just Datadog's Ruby profiler).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants