[NO-TICKET] Fix breaking application boot due to concurrency issue in tracing #4303
+25
−3
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What does this PR do?
This PR fixes #2851 . As explained in the original bug report, there's a concurrency issue that can be triggered by the environment logger running concurrently with activating an instrumentation.
Motivation:
Fix a long-standing concurrency issue.
Change log entry
Yes. Fix breaking application boot due to concurrency issue in tracing
Additional Notes:
To fix this issue, I've made two changes:
Introduced a mutex to protect the
@instrumented_integrations
, thus making sure no two threads can be touching it at the same timeTook advantage of the fact that
#instrumented_integrations
was marked as private, and only being used by the environment logger and telemetry (e.g. read-only usage) to return a copy of the data.This way, we can safely iterate on the data while reconfiguration is happening concurrently.
I believe the overhead of this lock is negligible, since we don't need to read this information very often.
How to test the change?
I wrote a reproducer to be able to see this issue easily.
The reproducer is in two parts --
concurrent_bug_repo.rb
:and the following change to the environment logger:
That is, I specifically injected a
sleep
in the environment logger, and made it run on a background thread, triggering the exact same issue that had been plaguing us for years.Running the above change without my PR will trigger the issue, and with my PR it no longer does -- since the environment logger now gets its own copy of the instrumented integrations.