
Decrement in_use_connections when a connection is not reusable #267

Closed
pcorey wants to merge 1 commit

Conversation

pcorey commented Feb 24, 2024

Fixes #257

sneako (Owner) commented Mar 13, 2024

Thanks for looking at this @pcorey ! @oliveigah do you mind taking a quick look at this? I'm surprised this seems to fix the issue. To me it looks like this could cause the values to become negative, but maybe I'm not thinking this through enough 🤔

oliveigah (Contributor) commented:

> Thanks for looking at this @pcorey ! @oliveigah do you mind taking a quick look at this? I'm surprised this seems to fix the issue. To me it looks like this could cause the values to become negative, but maybe I'm not thinking this through enough 🤔

Sure! Gonna take a look at this today after work. When I have more info I will post it here.

sneako (Owner) commented Mar 13, 2024

Thank you so much! No rush at all 🙏

oliveigah (Contributor) commented Mar 13, 2024

After gathering some context, I agree with @sneako! It's not clear to me that this change solves the problem; in fact, I did a test covering idle connection termination and this change breaks it. The test is in this gist: https://gist.github.com/oliveigah/cbe7fa1ecb711a6006af5dd05c6803e7

The assert that breaks is the third call to get_pool_status:

assert {:ok,
        [
          %PoolMetrics{
            pool_index: 1,
            pool_size: 10,
            available_connections: 2,
            in_use_connections: 8
          }
        ]} = Finch.get_pool_status(finch_name, shp)

The result with the proposed changes is:

{:ok, [%Finch.HTTP1.PoolMetrics{available_connections: 10, in_use_connections: 0, pool_index: 1, pool_size: 10}]}

This happens because, when the connection is idle, every user checkout makes two calls to PoolMetrics.maybe_add/2, one with -1 and one with +1. The metric is therefore temporarily negative, and becomes permanently negative after the checkin:

{:ok, [%Finch.HTTP1.PoolMetrics{available_connections: 18, in_use_connections: -8, pool_index: 1, pool_size: 10}]}
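
(A minimal sketch of that accounting, using a bare :atomics counter in place of the metric; this is not Finch's PoolMetrics module, just the same add/subtract arithmetic.)

ref = :atomics.new(1, signed: true)

# Checkout of an idle connection under the proposed change: the extra -1 for
# the non-reusable idle connection plus the usual +1 for the checkout itself.
:atomics.add(ref, 1, -1)
:atomics.add(ref, 1, +1)

# Checkin applies the usual -1, leaving the counter one below where it started.
:atomics.add(ref, 1, -1)

:atomics.get(ref, 1)
#=> -1 (eight such cycles would produce the -8 shown above)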

In summary, I think this change does not address the real problem we might have here.

About the root cause of the problem, my thoughts are:

I'm going to try to emulate some other scenarios to see if I can find any other possible root cause. @pcorey, did this pool have a high throughput? I'm also suspecting some weird interaction between metrics and pool termination; can you check if you have pool termination events in this pool?

pcorey (Author) commented Mar 18, 2024

Thanks for looking into this @oliveigah! To add to the confusion, my "fix", as you guessed, didn't actually work. I was conflating metrics from three different tenants. Two of those tenants are behaving normally; available_connections stays around 50 and in_use_connections stays around 0.

The third tenant is the one having issues. Here's a plot of its available_connections (blue) vs in_use_connections (orange) in the 24 hours after a fresh deployment with mainline Finch (0.18.0):

[Screenshot: available_connections (blue) vs in_use_connections (orange) over the 24 hours after deployment]

This tenant's Finch pool does experience higher load than the other two. The current request pattern is ~1 hour of 30-50 requests/minute followed by ~20 minutes of 1-5 requests/minute:

[Screenshot: request rate over time, showing the alternating high/low traffic pattern]

> all of them are errors and might be noisy on your logs

I'm not seeing any strange or unexpected errors in my logs. The only thing that may possibly be related is that all three of my tenants get connection closed errors from the upstream host (%Mint.TransportError{reason: :closed}).

> can you check if you have pool termination events in this pool?

I haven't had any pool termination events ([:finch, :pool_max_idle_time_exceeded]) in the time range we're looking at.
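
(For reference, a quick way to confirm whether that termination event fires is to attach a debug telemetry handler for the event name mentioned above; the handler id string below is arbitrary.)

:telemetry.attach(
  "debug-finch-pool-max-idle-time",
  [:finch, :pool_max_idle_time_exceeded],
  fn _event, measurements, metadata, _config ->
    IO.inspect({measurements, metadata}, label: "finch pool_max_idle_time_exceeded")
  end,
  nil
)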

oliveigah (Contributor) commented Mar 19, 2024

Thanks for the details @pcorey!

The only problem I can see is some unexpected error in the Mint.HTTP.request/5 call on Finch.HTTP1.Conn:

case Mint.HTTP.request(
       conn.mint,
       req.method,
       full_path,
       req.headers,
       stream_or_body(req.body)
     ) do
  {:ok, mint, ref} ->
    # ...

I've done a test that artificially raises when calling it, and I got the same behaviour you described! The test is here: https://gist.github.com/oliveigah/a7cf1c7931440ba6c647866fd3e99d74

The best way to solve this is to add a new callback to NimblePool for when the request is cancelled: https://github.com/dashbitco/nimble_pool/blob/deae8ffdb52ce5265a68299eea3e5074f1e8d5a7/lib/nimble_pool.ex#L443-L444

I've implemented a naive version of this new callback locally and it solves the problem; a rough sketch of the idea is below. I'm going to open an issue on nimble_pool in order to gather some thoughts about this new callback.
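
(For illustration only, not oliveigah's actual implementation: a sketch of what such a callback could look like in Finch's HTTP/1 pool. The :metric_ref field name and the keyword-list form of PoolMetrics.maybe_add/2 are assumptions.)

# Hypothetical sketch inside Finch.HTTP1.Pool, which already implements the
# NimblePool behaviour. Assumes the pool state carries the metrics ref under
# :metric_ref and that PoolMetrics.maybe_add/2 takes a keyword of deltas.
@impl NimblePool
def handle_cancelled(:checked_out, %{metric_ref: metric_ref}) do
  # The caller crashed or raised after checkout, so the normal checkin path
  # never runs; undo the in_use_connections increment here instead.
  PoolMetrics.maybe_add(metric_ref, in_use_connections: -1)
  :ok
end

def handle_cancelled(:queued, _pool_state), do: :ok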

Then we can use it here and see if it solves your problem! Is that ok with you, @sneako?

oliveigah (Contributor) commented:

The PR was merged on nimble_pool: dashbitco/nimble_pool#44

Since the maintainer asked us to try this fix before releasing a new version of nimble_pool, I've implemented it on my fork, https://github.com/oliveigah/finch, using nimble_pool from master instead of Hex.

All of my previously broken tests are fixed with these changes, but if you could test it on your system and report the results, that would be awesome @pcorey! Let me know if I can help with anything.

pcorey (Author) commented Mar 23, 2024

@oliveigah I'm deploying with your fork now. I'll let you know how things look first thing Monday morning. Thanks for doing all of this!

pcorey (Author) commented Mar 25, 2024

@oliveigah Everything is looking great in my metrics. It looks like your fix worked! Should I close this PR so you can submit your own?
[Screenshot: pool metrics after deploying the fork]

oliveigah (Contributor) commented:

Awesome @pcorey, thanks for the feedback!

Let's wait for a new nimble_pool release and then I can submit the PR with the handle_cancelled/2 implementation and the added tests.

oliveigah (Contributor) commented:

I've opened the new PR with the implementation of the new callback and added tests. I think we can continue this conversation over there: #268

pcorey closed this Mar 26, 2024
Successfully merging this pull request may close these issues.

Reported in_use_connections grows beyond pool_size bounds