
Decrement in_use_connections when a connection is not reusable #267

Closed
pcorey wants to merge 1 commit

Conversation

pcorey commented Feb 24, 2024

Fixes #257

sneako (Owner) commented Mar 13, 2024

Thanks for looking at this @pcorey ! @oliveigah do you mind taking a quick look at this? I'm surprised this seems to fix the issue. To me it looks like this could cause the values to become negative, but maybe I'm not thinking this through enough 🤔

oliveigah (Contributor) commented:

> Thanks for looking at this @pcorey ! @oliveigah do you mind taking a quick look at this? I'm surprised this seems to fix the issue. To me it looks like this could cause the values to become negative, but maybe I'm not thinking this through enough 🤔

Sure! Gonna take a look at this today after work. When I have more info I will post it here.

sneako (Owner) commented Mar 13, 2024

Thank you so much! No rush at all 🙏

oliveigah (Contributor) commented Mar 13, 2024

After gathering some context, I agree with @sneako! It's not clear to me that this change solves the problem; in fact, I did a test covering idle connection termination and this change breaks it. The test is in this gist: https://gist.github.com/oliveigah/cbe7fa1ecb711a6006af5dd05c6803e7

The assert that breaks is the third call to get_pool_status:

assert {:ok,
        [
          %PoolMetrics{
            pool_index: 1,
            pool_size: 10,
            available_connections: 2,
            in_use_connections: 8
          }
        ]} = Finch.get_pool_status(finch_name, shp)

The result with the proposed changes is:

{:ok, [%Finch.HTTP1.PoolMetrics{available_connections: 10, in_use_connections: 0, pool_index: 1, pool_size: 10}]}

This happens because, when the connection is idle, every user checkout makes two calls to PoolMetrics.maybe_add/2, one with -1 and one with +1. The metric is therefore temporarily negative, and becomes permanently negative after the checkin:

{:ok, [%Finch.HTTP1.PoolMetrics{available_connections: 18, in_use_connections: -8, pool_index: 1, pool_size: 10}]}
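
(A minimal sketch of that accounting, using a bare :atomics counter in place of the metric; this is not Finch's PoolMetrics module, just the same add/subtract arithmetic.)

ref = :atomics.new(1, signed: true)

# Checkout of an idle connection under the proposed change: the extra -1 for
# the non-reusable idle connection plus the usual +1 for the checkout itself.
:atomics.add(ref, 1, -1)
:atomics.add(ref, 1, +1)

# Checkin applies the usual -1, leaving the counter one below where it started.
:atomics.add(ref, 1, -1)

:atomics.get(ref, 1)
#=> -1 (eight such cycles would produce the -8 shown above)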

In summary, I think this change does not address the real problem we might have here.

About the root cause of the problem, my thoughts are:

I'm going to try to emulate some other scenarios to see if I can find any other possible root cause. @pcorey, did this pool have a high throughput? I'm also suspecting some weird interaction between metrics and pool termination; can you check if you have pool termination events in this pool?

pcorey (Author) commented Mar 18, 2024

Thanks for looking into this @oliveigah! To add to the confusion, my "fix", as you guessed, didn't actually work. I was conflating metrics from three different tenants. Two of those tenants are behaving normally; available_connections stays around 50 and in_use_connections stays around 0.

The third tenant is the one having issues. Here's a plot of its available_connections (blue) vs in_use_connections (orange) in the 24 hours after a fresh deployment with mainline Finch (0.18.0):

[Screenshot: available_connections (blue) vs in_use_connections (orange) over the 24 hours after deployment]

This tenant's Finch pool does experience higher load than the other two. The current request pattern is ~1 hour of 30-50 requests/minute followed by ~20 minutes of 1-5 requests/minute:

[Screenshot: request rate over time, showing the alternating high/low traffic pattern]

> all of them are errors and might be noisy on your logs

I'm not seeing any strange or unexpected errors in my logs. The only thing that may possibly be related is that all three of my tenants get connection closed errors from the upstream host (%Mint.TransportError{reason: :closed}).

> can you check if you have pool termination events in this pool?

I haven't had any pool termination events ([:finch, :pool_max_idle_time_exceeded]) in the time range we're looking at.
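
(For reference, a quick way to confirm whether that termination event fires is to attach a debug telemetry handler for the event name mentioned above; the handler id string below is arbitrary.)

:telemetry.attach(
  "debug-finch-pool-max-idle-time",
  [:finch, :pool_max_idle_time_exceeded],
  fn _event, measurements, metadata, _config ->
    IO.inspect({measurements, metadata}, label: "finch pool_max_idle_time_exceeded")
  end,
  nil
)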

oliveigah (Contributor) commented Mar 19, 2024

Thanks for the details @pcorey!

The only problem I can see is some unexpected error in the Mint.HTTP.request/5 call on Finch.HTTP1.Conn:

case Mint.HTTP.request(
       conn.mint,
       req.method,
       full_path,
       req.headers,
       stream_or_body(req.body)
     ) do
  {:ok, mint, ref} ->
    # ...

I've done a test that artificially raises when calling it, and I got the same behaviour you described! The test is here: https://gist.github.com/oliveigah/a7cf1c7931440ba6c647866fd3e99d74

The best way to solve this is to add a new callback to NimblePool for when the request is cancelled: https://github.com/dashbitco/nimble_pool/blob/deae8ffdb52ce5265a68299eea3e5074f1e8d5a7/lib/nimble_pool.ex#L443-L444

I've implemented a naive version of this new callback locally and it solves the problem; a rough sketch of the idea is below. I'm going to open an issue on nimble_pool in order to gather some thoughts about this new callback.
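
(For illustration only, not oliveigah's actual implementation: a sketch of what such a callback could look like in Finch's HTTP/1 pool. The :metric_ref field name and the keyword-list form of PoolMetrics.maybe_add/2 are assumptions.)

# Hypothetical sketch inside Finch.HTTP1.Pool, which already implements the
# NimblePool behaviour. Assumes the pool state carries the metrics ref under
# :metric_ref and that PoolMetrics.maybe_add/2 takes a keyword of deltas.
@impl NimblePool
def handle_cancelled(:checked_out, %{metric_ref: metric_ref}) do
  # The caller crashed or raised after checkout, so the normal checkin path
  # never runs; undo the in_use_connections increment here instead.
  PoolMetrics.maybe_add(metric_ref, in_use_connections: -1)
  :ok
end

def handle_cancelled(:queued, _pool_state), do: :ok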

Then we can use it here and see if it solves your problem! Is that ok with you, @sneako?

oliveigah (Contributor) commented:

The PR was merged on nimble_pool: dashbitco/nimble_pool#44

Since the maintainer asked us to try this fix before releasing a new version of nimble_pool, I've implemented it on my fork, https://github.com/oliveigah/finch, using nimble_pool from master instead of Hex.

All of my previously broken tests are fixed with these changes, but if you could test it on your system and report the results, that would be awesome @pcorey! Let me know if I can help with anything.

pcorey (Author) commented Mar 23, 2024

@oliveigah I'm deploying with your fork now. I'll let you know how things look first thing Monday morning. Thanks for doing all of this!

pcorey (Author) commented Mar 25, 2024

@oliveigah Everything is looking great in my metrics. It looks like your fix worked! Should I close this PR so you can submit your own?
[Screenshot: pool metrics after deploying the fork]

oliveigah (Contributor) commented:

Awesome @pcorey, thanks for the feedback!

Let's wait for a new nimble_pool release and then I can submit the PR with the handle_cancelled/2 implementation and the added tests.

oliveigah (Contributor) commented:

I've opened the new PR with the implementation of the new callback and added tests. I think we can continue this conversation over there: #268

pcorey closed this Mar 26, 2024
Successfully merging this pull request may close these issues.

Reported in_use_connections grows beyond pool_size bounds