This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

PoolError after release to Heroku #1541

Closed
zbynekwinkler opened this issue Oct 4, 2013 · 56 comments

Comments

@zbynekwinkler
Contributor

https://help.heroku.com/tickets/100017

When I deployed version 10.1.27, the application started raising PoolError for all requests. I had to restart the dyno to get the application back up.

@chadwhitacre
Contributor

I've removed IRQ from this; it isn't an immediate threat to production.

I have added it to Infrastructure, however.

@zbynekwinkler
Contributor Author

Ok. I have no news from Heroku so far. I am not sure what to expect since we do not have the logs. I've given them only the traceback from Sentry.

@chadwhitacre
Contributor

I have access to the Heroku ticket now, btw. They can manually cc: other users.

@chadwhitacre
Contributor

Looks like this isn't the first time we've noticed this: #632. We basically ignored the issue there.

@chadwhitacre
Contributor

Hello –

File "psycopg2/pool.py", line 89, in _getconn
raise PoolError("connection pool exhausted")

It looks like your application's connection pool was exhausted. I'm not well versed in Python and not familiar with how you're doing pool management, but it does not appear to be an issue with the database itself.

I can tell that your application normally has 4-6 database connections, but at one point on October 4th it jumped up to 50. The technical limitation of the Crane plan is 500 and I'm not seeing anything that suggests a database issue, so I suspect your pool limit in-app is 50.

Restarting the application obviously clears the connection pool, so it sounds like a connection leak.

Regards,
+Clint
Heroku Postgres

@chadwhitacre
Contributor

Thanks Clint. The standard Postgres driver for Python is called psycopg2. It provides connection pooling, which we're using via a wrapper lib called postgres.py. In our application we read DATABASE_MAXCONN from the environment and use that to configure postgres.py/psycopg2. Currently we have that set to 100 but iirc it was 50 at some point in the past (not seeing any changes to that envvar in the app's activity log).

Anyway, the connection leak explanation sounds plausible, so we'll explore that. Thanks for taking a look! :-)
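
For reference, here's roughly how that wiring looks; treat the names and defaults below as an illustrative sketch rather than the exact gittip code:

import os
from postgres import Postgres  # thin wrapper over psycopg2's ThreadedConnectionPool

# Sketch only: read the pool ceiling from the environment, as described above.
maxconn = int(os.environ.get('DATABASE_MAXCONN', '100'))
db = Postgres(os.environ['DATABASE_URL'], maxconn=maxconn)

# When all maxconn connections are checked out and another request asks for one,
# psycopg2's pool raises PoolError("connection pool exhausted") -- the traceback above.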

@chadwhitacre
Contributor

We should track our pool size in Librato. Reticketed as #1580.

@catsby

catsby commented Oct 10, 2013

Hooray I'm helping! For what it's worth you can see the number of current connections you have open on Heroku with pg:info. You can inspect them (some) with pg:ps, and if you install the pg-extras plugin you can see more with pg:locks.

You can also query pg_stat_activity directly.
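
For example, a quick sketch of querying it from Python (column names here are the pre-9.2 ones, procpid / current_query, which match the output later in this thread; the connection-string handling is an assumption):

import os
import psycopg2

# Sketch only: assumes DATABASE_URL is a connection string psycopg2/libpq can parse.
conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
cur.execute("""
    SELECT procpid, waiting, current_query, now() - query_start AS running_for
      FROM pg_stat_activity
     ORDER BY query_start
""")
for row in cur.fetchall():
    print(row)
conn.close()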

@chadwhitacre
Contributor

Thanks @catsby. :-)

@chadwhitacre
Contributor

Turns out Librato gives us db connection count for free, so we've added it to the dashboard (#1580):

[screenshot of the Librato db-connections chart, 2013-10-11 5:25:51 pm]

It looks like this isn't a steady leak, at least. I guess we'll have to wait for this to recur and see if we see anything on that chart.

@chadwhitacre
Contributor

Dropping from Infrastructure. Not much to do until it recurs.

@zwn I would even consider closing this until it recurs.

@zbynekwinkler
Contributor Author

@whit537 Agreed.

@chadwhitacre
Contributor

Okay, just saw it again:

[screenshot, 2013-10-16 12:29:53 pm]

I did a heroku restart and we got back on track.

@chadwhitacre
Contributor

Interesting that we spiked to 15 an hour ago (I actually noticed it but didn't think anything of it).

@chadwhitacre
Contributor

Looking back three hours I'm seeing a few more tiny spikes:

[screenshot, 2013-10-16 12:34:53 pm]

@chadwhitacre
Contributor

A couple thoughts:

  • Most of our traffic is serving widget.html and public.json.
  • However, we're not seeing an abnormal spike in requests, so it doesn't seem that we're collapsing in the face of overwhelming traffic.
  • Number of database connections and response time are probably in a feedback loop. The longer a db connection is held, the longer a request takes, and the more likely we are to open a new db connection.

@chadwhitacre
Contributor

So the question is, what happened between 4 and 30?

@chadwhitacre
Contributor

We have 4 db connections at 09:19:43 and 30 at 09:20:59.

I see a handful of H12s ending at 00:23:37; they start again at 09:20:43, and then we get "a lot" of them.

Full docs on H12:

https://devcenter.heroku.com/articles/request-timeout

@chadwhitacre
Contributor

Heroku recommends setting a timeout within your application and keeping the value well under 30 seconds, such as 10 or 15 seconds.
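
One way to enforce that on the database side is a per-connection statement_timeout; a minimal sketch (the 10-second value and the wiring are illustrative, not what we've actually deployed):

import os
import psycopg2

conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
# Abort any statement on this connection that runs longer than 10 seconds,
# keeping queries well under Heroku's 30-second router timeout.
cur.execute("SET statement_timeout = '10s'")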

@chadwhitacre
Contributor

We should get some visibility into request queue times.

@chadwhitacre
Contributor

IRC

@chadwhitacre
Contributor

I kicked off #1596 from here.

@chadwhitacre
Contributor

Current hypothesis is that VACUUMing due to homepage updating is slowing down the db. Slow database queries cause requests to pile up behind each other, further slowing down the database, leading to congestion collapse. IRC

@chadwhitacre
Contributor

How do we test our hypothesis?
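
One hedged way to start: watch the vacuum counters on the table the homepage updater rewrites and see whether they line up with the connection spikes. A sketch (the table name is a stand-in, not our real schema):

import os
import psycopg2

conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
cur.execute("""
    SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
      FROM pg_stat_user_tables
     WHERE relname = 'homepage'   -- stand-in for the real homepage table
""")
print(cur.fetchone())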

@catsby

catsby commented Oct 16, 2013

Can you see what your database is up to when these connections spike? Use pg:ps, pg:locks or pg:blocking for starters. The latter two are in pg-extras: https://github.com/heroku/heroku-pg-extras

@chadwhitacre
Contributor

 procpid | relname | transactionid | granted |     query_snippet     |       age       
---------+---------+---------------+---------+-----------------------+-----------------
   31134 |         |               | t       | <IDLE> in transaction | 00:00:00.013118
   31252 |         |               | t       |                      +| 00:00:00
         |         |               |         |      SELECT          +| 
         |         |               |         |        pg_stat_ac     | 
(2 rows)

@chadwhitacre
Contributor

-------------+--------------------+-------------------+--------------+-------------------+------------------
(0 rows)

@chadwhitacre
Contributor

 procpid | source | running_for | waiting | query 
---------+--------+-------------+---------+-------
(0 rows)

@chadwhitacre
Contributor

We're struggling again:

[screenshot, 2013-10-17 10:58:18 am]

The comments above are pg:locks, pg:blocking, and pg:ps, respectively. I'm also looking at the pg_stat_activity table; all but one or two connections show current_query as <IDLE>. I restarted twice (these show up as spikes to 100% in Aspen Utilization) and was surprised that db connections didn't immediately drop to zero.

@chadwhitacre
Contributor

Ten minutes later, the system seems to have stabilized:

[screenshot, 2013-10-17 11:06:44 am]

@chadwhitacre
Contributor

Note that payday continues to run (#1597).

@zbynekwinkler
Contributor Author

So setting the timeout didn't help :(. That means the problem is most likely not slow db queries.

Could we be leaking connections?
http://stackoverflow.com/questions/13236160/is-there-a-timeout-for-postgresql-connections

How do we find out what all the threads are doing? The 1m load going to 5 is not good either :(. That means that for a minute we had on average 5 tasks/threads not blocked on I/O but ready to run. That is very strange.
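
For the "what are all the threads doing" question, one option is to dump every thread's current stack from inside the process; a rough sketch (hooking it up to a signal or an admin page is left open):

import sys
import threading
import traceback

def dump_threads():
    """Print the current stack of every thread in this process."""
    names = dict((t.ident, t.name) for t in threading.enumerate())
    for ident, frame in sys._current_frames().items():
        print("--- thread %s (%s) ---" % (names.get(ident, '?'), ident))
        traceback.print_stack(frame)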

@chadwhitacre
Contributor

@zwn Note that over on #1597 I removed the timeout during the payday run.

@zbynekwinkler
Contributor Author

Let's implement TRUNCATE for the homepage updater

I've tested the behavior locally and it is not what we want. TRUNCATE locks the table: from the TRUNCATE until the transaction commits or rolls back (i.e., the whole time we are refilling it with new data), all SELECTs on the table block. That would mean the homepage would be broken for that entire time (several seconds).
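
Roughly the kind of test I ran (a local sketch against a scratch database, not our real schema): one connection TRUNCATEs and holds the transaction open while "refilling", and a second connection's SELECT blocks until the first commits.

import threading
import time
import psycopg2

DSN = "dbname=gittip_test"  # hypothetical local database with a 'homepage' table

def updater():
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("TRUNCATE homepage")  # takes an ACCESS EXCLUSIVE lock
    time.sleep(5)                     # stand-in for repopulating the table
    conn.commit()                     # readers unblock only here

def reader():
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    started = time.time()
    cur.execute("SELECT count(*) FROM homepage")  # blocks until updater commits
    print("SELECT waited %.1fs" % (time.time() - started))

t = threading.Thread(target=updater)
t.start()
time.sleep(0.5)  # give the updater time to grab the lock
reader()
t.join()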

@chadwhitacre
Contributor

IRC

@zbynekwinkler
Contributor Author

I have scaled back to running 80 threads instead of 200. I remember there was something special about how threads in Python behave and that it is not a good idea to have a lot of them (something to do with the GIL, the Global Interpreter Lock). Anyway, now we should not run out of connections to the db because we have fewer threads (80) than connections (100).

Also, statement_timeout is back on. It was off while payday was running, and we might not have restarted after turning it back on (it applies only to new connections). That could have been the reason for the last blackout. The strange thing is that there is nothing in the logs about long-running db queries (all queries over 50ms are logged). That makes me think the db might not be the reason. My current suspect is threading in Python.

@zbynekwinkler
Contributor Author

We do not have to stay with threads. Multiple processes seem to be available in a Heroku dyno.

>>> import multiprocessing
>>> from multiprocessing import Process
>>> def f(name):
...     print 'hello', name
... 
>>> p = Process(target=f, args=('bob',))
>>> p.start()
>>> hello bob
p.join()
>>> 

We would need to make sure we do not cache volatile things in memory. Something can be learned from the Sentry deployment about taking advantage of multiple processes: http://justcramer.com/2013/06/27/serving-python-web-applications/. I believe this could be done within one Heroku dyno (limited just by available RAM).

@zbynekwinkler
Contributor Author

As for the number of workers - here is the recommendation from gunicorn
http://docs.gunicorn.org/en/latest/design.html

DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.

Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with.

A Heroku dyno reports 4 CPUs.
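
So by that rule of thumb (purely arithmetic, not a decision):

import multiprocessing

# gunicorn's (2 x $num_cores) + 1 guideline applied to the 4 CPUs a dyno reports
workers = 2 * multiprocessing.cpu_count() + 1   # 2 * 4 + 1 = 9
print(workers)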

@zbynekwinkler
Contributor Author

Always remember, there is such a thing as too many workers. After a point your worker processes will start thrashing system resources decreasing the throughput of the entire system.

@zbynekwinkler
Contributor Author

8h ago gittip.com served 38k RPM and didn't die (38 connections to the db, load approaching 2)

@chadwhitacre
Contributor

Anyway, now we should not run out of connections to the db because we have fewer threads (80) than connections (100).

Cool. Good move.

8h ago gittip.com served 38k RPM and didn't die (38 connections to the db, load approaching 2)

Don't get your hopes up. We've seen request spikes before w/o a corresponding blackout, see above at #1541 (comment).

We do not have to stay with threads.

There's a lot of room between 200 threads and 1 thread. Let's live with 80 for a while and see how it behaves, imo. :-)

We would need to make sure we do not cache volatile things in memory.

We're definitely doing this right now: #715.

@chadwhitacre
Contributor

Let's live with 80 for a while ...

I mean, even 80 is high. I just noticed a spike to 48 (60%). We should keep an eye on utilization and tune accordingly.

[screenshot, 2013-10-21 10:55:57 am]

@chadwhitacre
Contributor

Oops, sorry! :-)

@zbynekwinkler
Contributor Author

I've just lowered the number of threads to 40.

@chadwhitacre
Contributor

Have we solved this? We didn't have any trouble on Friday the 25th with the Salon traffic, but that was less than half the traffic that took us down on the 17th.

[screenshot, 2013-10-28 10:57:05 am]

I think the problem is as simple as this: we were allowing more request workers than database connections. If that's true, we should be able to reproduce the crash in a test environment.
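
A toy reproduction of that hypothesis (a sketch against a scratch database, not our app): spin up more worker threads than the pool allows and watch getconn() blow up the same way.

import threading
import time
from psycopg2.pool import ThreadedConnectionPool, PoolError

pool = ThreadedConnectionPool(1, 5, "dbname=gittip_test")  # hypothetical scratch db

def handle_request():
    try:
        conn = pool.getconn()   # raises PoolError once all 5 connections are out
        time.sleep(1)           # simulate a slow request holding its connection
        pool.putconn(conn)
    except PoolError as exc:
        print("request failed: %s" % exc)

threads = [threading.Thread(target=handle_request) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()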

@chadwhitacre
Contributor

Closing per IRC.

@zbynekwinkler
Contributor Author

https://postgres.heroku.com/blog/past/2013/11/22/connection_limit_guidance/
The new connection limit for an entry-level Postgres production db is 60. I don't know if that applies to the current plan too, but it will apply anyway as soon as we upgrade to 9.3 (#1158). So we should keep that in mind and keep the total number of threads under that number.

@catsby

catsby commented Nov 22, 2013

I don't know if that applies to the current plan too

Only applies to new "Heroku Postgres 2.0" plans, Standard Yanari, etc: https://www.heroku.com/pricing#postgres

Old Crane plans and such are considered "legacy", but they're still provisionable and will be around for a while.
