This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

PoolError after release to Heroku #1541

Closed
zbynekwinkler opened this issue Oct 4, 2013 · 56 comments

Comments

@zbynekwinkler
Contributor

https://help.heroku.com/tickets/100017

When I deployed version 10.1.27, the application started raising PoolError for all requests. I had to restart the dyno to get the application back up.

@chadwhitacre
Contributor

I've removed IRQ from this; it isn't an immediate threat to production.

I have added it to Infrastructure, however.

@zbynekwinkler
Contributor Author

Ok. I have no news from Heroku so far. I am not sure what to expect since we do not have the logs. I've given them only the traceback from Sentry.

@chadwhitacre
Contributor

I have access to the Heroku ticket now, btw. They can manually cc: other users.

@chadwhitacre
Contributor

Looks like this isn't the first time we've noticed this: #632. We basically ignored the issue there.

@chadwhitacre
Contributor

Hello –

File "psycopg2/pool.py", line 89, in _getconn
raise PoolError("connection pool exhausted")

It looks like your application's connection pool was exhausted. I'm not well versed in Python and not familiar with how you're doing pool management, but it does not appear to be an issue with the database itself.

I can tell that your application normally has 4-6 database connections, but at one point on October 4th it jumped up to 50. The technical limitation of the Crane plan is 500 and I'm not seeing anything that suggests a database issue, so I suspect your pool limit in-app is 50.

Restarting the application obviously clears the connection pool, so it sounds like a connection leak.

Regards,
+Clint
Heroku Postgres

@chadwhitacre
Contributor

Thanks Clint. The standard Postgres driver for Python is called psycopg2. It provides connection pooling, which we're using via a wrapper lib called postgres.py. In our application we read DATABASE_MAXCONN from the environment and use that to configure postgres.py/psycopg2. Currently we have that set to 100 but iirc it was 50 at some point in the past (not seeing any changes to that envvar in the app's activity log).

Anyway, the connection leak explanation sounds plausible, so we'll explore that. Thanks for taking a look! :-)
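
For reference, here's roughly how that wiring looks; treat the names and defaults below as an illustrative sketch rather than the exact gittip code:

import os
from postgres import Postgres  # thin wrapper over psycopg2's ThreadedConnectionPool

# Sketch only: read the pool ceiling from the environment, as described above.
maxconn = int(os.environ.get('DATABASE_MAXCONN', '100'))
db = Postgres(os.environ['DATABASE_URL'], maxconn=maxconn)

# When all maxconn connections are checked out and another request asks for one,
# psycopg2's pool raises PoolError("connection pool exhausted") -- the traceback above.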

@chadwhitacre
Contributor

We should track our pool size in Librato. Reticketed as #1580.

@catsby

catsby commented Oct 10, 2013

Hooray I'm helping! For what it's worth you can see the number of current connections you have open on Heroku with pg:info. You can inspect them (some) with pg:ps, and if you install the pg-extras plugin you can see more with pg:locks.

You can also query pg_stat_activity directly.
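
For example, a quick sketch of querying it from Python (column names here are the pre-9.2 ones, procpid / current_query, which match the output later in this thread; the connection-string handling is an assumption):

import os
import psycopg2

# Sketch only: assumes DATABASE_URL is a connection string psycopg2/libpq can parse.
conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
cur.execute("""
    SELECT procpid, waiting, current_query, now() - query_start AS running_for
      FROM pg_stat_activity
     ORDER BY query_start
""")
for row in cur.fetchall():
    print(row)
conn.close()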

@chadwhitacre
Contributor

Thanks @catsby. :-)

@chadwhitacre
Contributor

Turns out Librato gives us db connection count for free, so we've added it to the dashboard (#1580):

[screenshot of the Librato db-connections chart, 2013-10-11 5:25:51 pm]

It looks like this isn't a steady leak, at least. I guess we'll have to wait for this to recur and see if we see anything on that chart.

@chadwhitacre
Contributor

Dropping from Infrastructure. Not much to do until it recurs.

@zwn I would even consider closing this until it recurs.

@zbynekwinkler
Contributor Author

@whit537 Agreed.

@chadwhitacre
Contributor

Okay, just saw it again:

[screenshot, 2013-10-16 12:29:53 pm]

I did a heroku restart and we got back on track.

@chadwhitacre
Contributor

Interesting that we spiked to 15 an hour ago (I actually noticed it but didn't think anything of it).

@chadwhitacre
Contributor

Looking back three hours I'm seeing a few more tiny spikes:

[screenshot, 2013-10-16 12:34:53 pm]

@chadwhitacre
Contributor

A couple thoughts:

  • Most of our traffic is serving widget.html and public.json.
  • However, we're not seeing an abnormal spike in requests, so it doesn't seem that we're collapsing in the face of overwhelming traffic.
  • Number of database connections and response time are probably in a feedback loop. The longer a db connection is held, the longer a request takes, and the more likely we are to open a new db connection.

@chadwhitacre
Contributor

So the question is, what happened between 4 and 30?

@chadwhitacre
Contributor

We have 4 db connections at 09:19:43 and 30 at 09:20:59.

I see a handful of H12s ending at 00:23:37; they start again at 09:20:43, and then we get "a lot" of them.

Full docs on H12:

https://devcenter.heroku.com/articles/request-timeout

@chadwhitacre
Contributor

Heroku recommends setting a timeout within your application and keeping the value well under 30 seconds, such as 10 or 15 seconds.
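
One way to enforce that on the database side is a per-connection statement_timeout; a minimal sketch (the 10-second value and the wiring are illustrative, not what we've actually deployed):

import os
import psycopg2

conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
# Abort any statement on this connection that runs longer than 10 seconds,
# keeping queries well under Heroku's 30-second router timeout.
cur.execute("SET statement_timeout = '10s'")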

@chadwhitacre
Contributor

We should get some visibility into request queue times.

@chadwhitacre
Contributor

IRC

@chadwhitacre
Contributor

I kicked off #1596 from here.

@chadwhitacre
Contributor

Current hypothesis is that VACUUMing due to homepage updating is slowing down the db. Slow database queries cause requests to pile up behind each other, further slowing down the database, leading to congestion collapse. IRC

@chadwhitacre
Contributor

How do we test our hypothesis?
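
One hedged way to start: watch the vacuum counters on the table the homepage updater rewrites and see whether they line up with the connection spikes. A sketch (the table name is a stand-in, not our real schema):

import os
import psycopg2

conn = psycopg2.connect(os.environ['DATABASE_URL'])
cur = conn.cursor()
cur.execute("""
    SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
      FROM pg_stat_user_tables
     WHERE relname = 'homepage'   -- stand-in for the real homepage table
""")
print(cur.fetchone())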

@catsby

catsby commented Oct 16, 2013

Can you see what your database is up to when these connections spike? Use pg:ps, pg:locks or pg:blocking for starters. The latter two are in pg-extras: https://github.com/heroku/heroku-pg-extras

@chadwhitacre
Contributor

 procpid | relname | transactionid | granted |     query_snippet     |       age       
---------+---------+---------------+---------+-----------------------+-----------------
   31134 |         |               | t       | <IDLE> in transaction | 00:00:00.013118
   31252 |         |               | t       |                      +| 00:00:00
         |         |               |         |      SELECT          +| 
         |         |               |         |        pg_stat_ac     | 
(2 rows)

@chadwhitacre
Contributor

-------------+--------------------+-------------------+--------------+-------------------+------------------
(0 rows)

@chadwhitacre
Contributor

 procpid | source | running_for | waiting | query 
---------+--------+-------------+---------+-------
(0 rows)

@chadwhitacre
Contributor

We're struggling again:

[screenshot, 2013-10-17 10:58:18 am]

The comments above are pg:locks, pg:blocking, and pg:ps, respectively. I'm also looking at the pg_stat_activity table; all but one or two connections show current_query as <IDLE>. I restarted twice (these show up as spikes to 100% in Aspen Utilization) and was surprised that db connections didn't immediately drop to zero.

@chadwhitacre
Contributor

Ten minutes later, the system seems to have stabilized:

[screenshot, 2013-10-17 11:06:44 am]

@chadwhitacre
Contributor

Note that payday continues to run (#1597).

@zbynekwinkler
Contributor Author

So setting the timeout didn't help :(. That means the problem is most likely not slow db queries.

Could we be leaking connections?
http://stackoverflow.com/questions/13236160/is-there-a-timeout-for-postgresql-connections

How do we find out what all the threads are doing? The 1m load going to 5 is not good either :(. That means that for a minute we had on average 5 tasks/threads not blocked on I/O but ready to run. That is very strange.
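
For the "what are all the threads doing" question, one option is to dump every thread's current stack from inside the process; a rough sketch (hooking it up to a signal or an admin page is left open):

import sys
import threading
import traceback

def dump_threads():
    """Print the current stack of every thread in this process."""
    names = dict((t.ident, t.name) for t in threading.enumerate())
    for ident, frame in sys._current_frames().items():
        print("--- thread %s (%s) ---" % (names.get(ident, '?'), ident))
        traceback.print_stack(frame)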

@chadwhitacre
Contributor

@zwn Note that over on #1597 I removed the timeout during the payday run.

@zbynekwinkler
Contributor Author

Let's implement TRUNCATE for the homepage updater

I've tested the behavior locally and it is not what we want. TRUNCATE locks the table: from the TRUNCATE until the transaction commits or rolls back (i.e., the whole time we are refilling it with new data), all SELECTs on the table block. That would mean the homepage would be broken for that entire time (several seconds).
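
Roughly the kind of test I ran (a local sketch against a scratch database, not our real schema): one connection TRUNCATEs and holds the transaction open while "refilling", and a second connection's SELECT blocks until the first commits.

import threading
import time
import psycopg2

DSN = "dbname=gittip_test"  # hypothetical local database with a 'homepage' table

def updater():
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("TRUNCATE homepage")  # takes an ACCESS EXCLUSIVE lock
    time.sleep(5)                     # stand-in for repopulating the table
    conn.commit()                     # readers unblock only here

def reader():
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    started = time.time()
    cur.execute("SELECT count(*) FROM homepage")  # blocks until updater commits
    print("SELECT waited %.1fs" % (time.time() - started))

t = threading.Thread(target=updater)
t.start()
time.sleep(0.5)  # give the updater time to grab the lock
reader()
t.join()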

@chadwhitacre
Contributor

IRC

@zbynekwinkler
Contributor Author

I have scaled back to running 80 threads instead of 200. I remember there was something special about how threads in Python behave and that it is not a good idea to have a lot of them (something to do with the GIL, the Global Interpreter Lock). Anyway, now we should not run out of connections to the db because we have fewer threads (80) than connections (100).

Also, statement_timeout is back on. It was off while payday was running, and we might not have restarted after turning it back on (it applies only to new connections). That could have been the reason for the last blackout. The strange thing is that there is nothing in the logs about long-running db queries (all queries over 50ms are logged). That makes me think the db might not be the reason. My current suspect is threading in Python.

@zbynekwinkler
Contributor Author

We do not have to stay with threads. Multiple processes seem to be available in a Heroku dyno.

>>> import multiprocessing
>>> from multiprocessing import Process
>>> def f(name):
...     print 'hello', name
... 
>>> p = Process(target=f, args=('bob',))
>>> p.start()
>>> hello bob
p.join()
>>> 

We would need to make sure we do not cache volatile things in memory. Something can be learned from the Sentry deployment about taking advantage of multiple processes: http://justcramer.com/2013/06/27/serving-python-web-applications/. I believe this could be done within one Heroku dyno (limited just by available RAM).

@zbynekwinkler
Contributor Author

As for the number of workers - here is the recommendation from gunicorn
http://docs.gunicorn.org/en/latest/design.html

DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.

Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with.

A Heroku dyno reports 4 CPUs.
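
So by that rule of thumb (purely arithmetic, not a decision):

import multiprocessing

# gunicorn's (2 x $num_cores) + 1 guideline applied to the 4 CPUs a dyno reports
workers = 2 * multiprocessing.cpu_count() + 1   # 2 * 4 + 1 = 9
print(workers)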

@zbynekwinkler
Contributor Author

Always remember, there is such a thing as too many workers. After a point your worker processes will start thrashing system resources decreasing the throughput of the entire system.

@zbynekwinkler
Contributor Author

8h ago gittip.com served 38k RPM and didn't die (38 connections to the db, load approaching 2)

@chadwhitacre
Contributor

Anyway, now we should not run out of connections to the db because we have fewer threads (80) than connections (100).

Cool. Good move.

8h ago gittip.com served 38k RPM and didn't die (38 connections to the db, load approaching 2)

Don't get your hopes up. We've seen request spikes before w/o a corresponding blackout, see above at #1541 (comment).

We do not have to stay with threads.

There's a lot of room between 200 threads and 1 thread. Let's live with 80 for a while and see how it behaves, imo. :-)

We would need to make sure we do not cache volatile things in memory.

We're definitely doing this right now: #715.

@chadwhitacre
Contributor

Let's live with 80 for a while ...

I mean, even 80 is high. I just noticed a spike to 48 (60%). We should keep an eye on utilization and tune accordingly.

[screenshot, 2013-10-21 10:55:57 am]

@chadwhitacre
Contributor

Oops, sorry! :-)

@zbynekwinkler
Contributor Author

I've just lowered the number of threads to 40.

@chadwhitacre
Contributor

Have we solved this? We didn't have any trouble on Friday the 25th with the Salon traffic, but that was less than half the traffic that took us down on the 17th.

[screenshot, 2013-10-28 10:57:05 am]

I think the problem is as simple as this: we were allowing more request workers than database connections. If that's true, we should be able to reproduce the crash in a test environment.
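
A toy reproduction of that hypothesis (a sketch against a scratch database, not our app): spin up more worker threads than the pool allows and watch getconn() blow up the same way.

import threading
import time
from psycopg2.pool import ThreadedConnectionPool, PoolError

pool = ThreadedConnectionPool(1, 5, "dbname=gittip_test")  # hypothetical scratch db

def handle_request():
    try:
        conn = pool.getconn()   # raises PoolError once all 5 connections are out
        time.sleep(1)           # simulate a slow request holding its connection
        pool.putconn(conn)
    except PoolError as exc:
        print("request failed: %s" % exc)

threads = [threading.Thread(target=handle_request) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()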

@chadwhitacre
Contributor

Closing per IRC.

@zbynekwinkler
Contributor Author

https://postgres.heroku.com/blog/past/2013/11/22/connection_limit_guidance/
The new connection limit for an entry-level Postgres production db is 60. I don't know if that applies to the current plan too, but it will apply anyway as soon as we upgrade to 9.3 (#1158). So we should keep that in mind and keep the total number of threads under that number.

@catsby

catsby commented Nov 22, 2013

I don't know if that applies to the current plan too

Only applies to new "Heroku Postgres 2.0" plans, Standard Yanari, etc: https://www.heroku.com/pricing#postgres

Old Crane plans and such are considered "legacy", but they're still provisionable and will be around for a while.
