-
Notifications
You must be signed in to change notification settings - Fork 780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[prometheus] Fixed a race condition happening in simultaneous OpenMetrics and PlainText requests #5517
Conversation
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5517 +/- ##
==========================================
+ Coverage 83.38% 85.49% +2.11%
==========================================
Files 297 289 -8
Lines 12531 12515 -16
==========================================
+ Hits 10449 10700 +251
+ Misses 2082 1815 -267
Flags with carried forward coverage won't be shown. Click here to find out more.
|
src/OpenTelemetry.Exporter.Prometheus.HttpListener/Internal/PrometheusCollectionManager.cs
Outdated
Show resolved
Hide resolved
9c65172
to
224d1bd
Compare
test/OpenTelemetry.Exporter.Prometheus.AspNetCore.Tests/PrometheusExporterMiddlewareTests.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: Vishwesh Bankwar <[email protected]>
The race condition is still present and need more work. I'm converting it to draft for now. |
Thank you for looking into this @hez2010! I think we need some architecture change here. The original design caches the response with the assumption that the same response payload can be used for all scrapers. With the ability to support both Prometheus ( |
Now it's ready for review. |
test/OpenTelemetry.Exporter.Prometheus.AspNetCore.Tests/PrometheusExporterMiddlewareTests.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some minor issues regarding markdownlint and line-ending, LGTM once those are fixed.
I pushed a new commit to simplify the code, and resolved another race condition due to the bad exchange of |
@hez2010 - Could you elaborate what was the race condition? |
We are neither using You can observe the issue by running the test I added (and better to increase the request times). Without the change, the test sometimes never finishes. |
This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here |
Seems that I accidentally broke the cache. Working on a fix. |
It turns out to be an invalid assert in the test. |
Yes. My newly added test sometimes never finishes against the existing code as well. |
All things should work expected now. PTAL. |
Thanks for confirming @hez2010. Looks like this PR is now trying to solve two issues.
Here is what I think we should do.
|
I can separate this PR into two parts as you mentioned, but the tests will gonna fail without the fix for the concurrency issue (I think you may have noticed that some previous CI run timeouts because they failed to complete the test due to this) so that the PR will not be a good state that can be merged IMO. |
For the test, you can change it to sequential execution for now. The test should validate that the exporter can serve both the formats. When working on concurrency issue, can add the parallel test back. |
This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day. |
@hez2010 - Checking to see if you are planning to do the changes based on the discussion above? |
Thanks for pinging! Sorry for the delay. I have been busy recently so may not be able to apply the changes right now. |
This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day. |
I can probably pick this up tonight if others are busy. |
I've raised a PR for the first issue, which is splitting the buffers here: #5623 It also optimizes the generation to only the type requested, instead of generating both open metrics and plain. |
closing in favor of #5623 |
Fixes #5515
Changes
Fixed a race condition which can cause returning an incorrect response while OpenMetrics requests and PlainText requests are being processed simultaneously.
Merge requirement checklist