PoolError after release to Heroku #1541
I've removed IRQ from this. This isn't immediately threatening production right now. I have added it to Infrastructure, however.
Ok. I have no news from Heroku so far. I am not sure what to expect since we do not have the logs. I've given them only the traceback from Sentry.
I have access to the Heroku ticket now, btw. They can manually cc: other users.
Looks like this isn't the first time we've noticed this: #632. We basically ignored the issue there.
We should track our pool size in Librato. Reticketed as #1580.
Hooray, I'm helping! For what it's worth, you can see the number of current connections you have open on Heroku; you can also query for it directly.
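For anyone without the toolbelt handy, the same number can also be read straight out of Postgres. A minimal sketch using psycopg2 and pg_stat_activity (the DSN is a placeholder; on Heroku the real one comes from the DATABASE_URL config var):

```python
import psycopg2

# Placeholder DSN; in production this would come from DATABASE_URL.
conn = psycopg2.connect("dbname=gittip")
cur = conn.cursor()

# One row per open backend; our plan caps this at 100.
cur.execute("SELECT count(*) FROM pg_stat_activity")
print "open connections:", cur.fetchone()[0]
```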
Thanks @catsby. :-)
Dropping from Infrastructure. Not much to do until it recurs. @zwn I would even consider closing this until it recurs.
@whit537 Agreed.
Interesting that we spiked to 15 an hour ago (I actually noticed it but didn't think anything of it).
A couple thoughts:
Here are the log messages for the growth from 4 to 30, 96, and 100 db connections:
So the question is, what happened between 4 and 30?
We have 4 db connections at 09:19:43 and 30 at 09:20:59. I see a handful of H12s ending at 00:23:37; they start again at 09:20:43 and then we get "a lot" of them. Full docs on H12 are in Heroku's error codes article.
I kicked off #1596 from here.
Current hypothesis is that VACUUMing due to the homepage updating is slowing down the db. Slow database queries cause requests to pile up behind each other, further slowing down the database, leading to congestion collapse. (IRC)
How do we test our hypothesis?
Can you see what your database is up to when these connections spike? Use pg:ps, pg:locks or pg:blocking for starters. The latter two are in pg-extras: https://github.com/heroku/heroku-pg-extras
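Roughly speaking, pg:locks and pg:blocking surface pg_locks joined against pg_stat_activity, so the same check can be done by hand. A hedged sketch (placeholder DSN; column names assume Postgres 9.2 or later):

```python
import psycopg2

conn = psycopg2.connect("dbname=gittip")   # placeholder DSN
cur = conn.cursor()

# Backends that are waiting on a lock someone else holds.
cur.execute("""
    SELECT a.pid, a.waiting, l.mode, a.query
      FROM pg_stat_activity a
      JOIN pg_locks l ON l.pid = a.pid
     WHERE NOT l.granted
""")
for pid, waiting, mode, query in cur.fetchall():
    print pid, waiting, mode, query
```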
We're struggling again. Comments above are
Note that payday continues to run (#1597).
So setting the timeout didn't help :(. The problem is most likely not slow db queries, then. Could we be leaking connections? How do we find out what all the threads are doing? The 1m load going to 5 is not good either :(. It means that for a minute we had, on average, 5 tasks/threads not blocked on I/O but ready to run. That is very strange.
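One way to answer "what are all the threads doing" from inside the running process is to dump every thread's stack. A minimal sketch; dump_threads is a hypothetical helper we'd have to wire up to a signal handler or a debug page ourselves:

```python
import sys
import threading
import traceback

def dump_threads():
    # Map thread idents to names so the output is readable.
    names = dict((t.ident, t.name) for t in threading.enumerate())
    for ident, frame in sys._current_frames().items():
        print "--- thread %s (%s) ---" % (ident, names.get(ident, "?"))
        print "".join(traceback.format_stack(frame))
```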
I've tested the behavior locally and it is not what we want. TRUNCATE locks the table. Before we fill it up with new data, all SELECTs on the table block until the transaction with the TRUNCATE commits or rolls back. That would mean the homepage would be out of order for the duration (several seconds).
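A minimal sketch of reproducing that blocking with two psycopg2 sessions; the DSN and the table name are placeholders, and statement_timeout is set on the reader only so the demo fails fast instead of hanging:

```python
import psycopg2
from psycopg2.extensions import QueryCanceledError

# Two separate sessions against the same database (placeholder DSN and table).
writer = psycopg2.connect("dbname=gittip")
reader = psycopg2.connect("dbname=gittip")

wcur = writer.cursor()
wcur.execute("TRUNCATE homepage_grid")   # takes an ACCESS EXCLUSIVE lock
# ... repopulating the table would happen here; the transaction stays open ...

rcur = reader.cursor()
rcur.execute("SET statement_timeout = '2s'")   # so the demo fails fast
reader.commit()
try:
    rcur.execute("SELECT count(*) FROM homepage_grid")
except QueryCanceledError:
    print "SELECT blocked behind the open TRUNCATE until it timed out"
writer.rollback()
```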
I have scaled back to running 80 threads instead of 200. I remember there was something special about how threads behave in Python and that it is not a good idea to have a lot of them (something to do with the GIL, the Global Interpreter Lock). Anyway, now we should not run out of connections to the db because we have fewer threads (80) than connections (100). Also, statement_timeout is back on. It had been off while payday was running and we might not have restarted after turning it back on (it applies only to new connections). That could have been the reason for the last blackout. The strange thing is that there is nothing in the logs regarding slow db queries (all queries over 50ms are logged). That makes me think the db might not be the reason. My current suspect is threading in Python.
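To double-check the "applies only to new connections" part, the value a given session is actually running with can be read back; a small sketch (placeholder DSN):

```python
import psycopg2

conn = psycopg2.connect("dbname=gittip")   # placeholder DSN
cur = conn.cursor()

# statement_timeout is a per-session setting: sessions opened before the
# config change keep their old value until they reconnect, hence the restart.
cur.execute("SHOW statement_timeout")
print "this session's statement_timeout:", cur.fetchone()[0]
```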
We do not have to stay with threads. Multiple processes seem to be available in a Heroku dyno:

```python
from multiprocessing import Process

def f(name):
    print 'hello', name

p = Process(target=f, args=('bob',))
p.start()
p.join()    # the child prints: hello bob
```

We would need to make sure we do not cache volatile things in memory. Something can be learned from the Sentry deployment about taking advantage of multiple processes: http://justcramer.com/2013/06/27/serving-python-web-applications/. I believe this could be done within one Heroku dyno (limited just by the available RAM).
As for the number of workers, here is the recommendation from gunicorn.
A Heroku dyno reports 4 CPUs.
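For reference, gunicorn's design docs suggest starting from (2 × num_cores) + 1 workers, which with the 4 CPUs the dyno reports works out to 9; a throwaway sketch of the arithmetic:

```python
import multiprocessing

# gunicorn's rule of thumb for process-based workers: (2 x cores) + 1.
cores = multiprocessing.cpu_count()   # a dyno reports 4 here
print "suggested workers:", 2 * cores + 1   # -> 9
```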
8h ago gittip.com served 38k RPM and didn't die (38 connections to the db, load approaching 2).
Cool. Good move.
Don't get your hopes up. We've seen request spikes before w/o a corresponding blackout, see above at #1541 (comment).
There's a lot of room between 200 threads and 1 thread. Let's live with 80 for a while and see how it behaves, imo. :-)
We're definitely doing this right now: #715.
Oops, sorry! :-)
I've just lowered the number of threads to 40.
Have we solved this? We didn't have any trouble on Friday the 25th with the Salon traffic, but that was less than half the traffic that took us down on the 17th. I think the problem is as simple as we were allowing more request workers than database connections. If that's true we should be able to reproduce the crash in a test environment.
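If workers-vs-connections really is the whole story, the crash should fall straight out of psycopg2's pool; a minimal sketch with made-up small numbers (placeholder DSN) standing in for 200 threads against a 100-connection limit:

```python
import threading
import time
from psycopg2.pool import ThreadedConnectionPool, PoolError

# 10 worker threads sharing a pool of 5 connections (placeholder DSN).
pool = ThreadedConnectionPool(1, 5, "dbname=gittip")

def worker():
    try:
        conn = pool.getconn()   # raises PoolError once all 5 are checked out
        time.sleep(5)           # a slow request holding on to its connection
        pool.putconn(conn)
    except PoolError, e:
        print "PoolError:", e   # "connection pool exhausted"

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```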
Closing per IRC.
https://postgres.heroku.com/blog/past/2013/11/22/connection_limit_guidance/
Only applies to the new "Heroku Postgres 2.0" plans, Standard Yanari, etc.: https://www.heroku.com/pricing#postgres. Old Crane plans and such are considered "legacy" but they're still provisionable and will be around for a while.
https://help.heroku.com/tickets/100017
When I deployed version 10.1.27 the application started to raise PoolError for all requests. I had to restart the dyno to get the application back up.