When working with large proxy pools (e.g., Apify RESIDENTIAL), I observe significant RAM usage growth. Memory consumption increases by more than a gigabyte within a few minutes during active scraping.
AFAIK this works just fine in the JS version; it feels like a memory leak somewhere in the Python version to me. We should investigate why this happens. I don't see why 1000 sessions should be a problem.
Either way, we need a complete reproduction for this, unless you plan to deal with it on your own.
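For what it's worth, the suspected growth pattern can be approximated without crawlee at all. The sketch below is illustrative only: the proxy URLs are fake placeholders, and it simply mimics the per-proxy client caching described further down, printing traced memory as clients accumulate (assuming a recent httpx is installed).

```python
import asyncio
import tracemalloc

import httpx

# Hypothetical stand-in for a rotating residential pool: each session gets
# a distinct proxy URL, so each session ends up with its own client.
PROXY_URLS = [
    f"http://user:pass@proxy.example.com:8000/?session={i}" for i in range(1000)
]


async def main() -> None:
    tracemalloc.start()
    clients: dict[str, httpx.AsyncClient] = {}

    for i, proxy_url in enumerate(PROXY_URLS):
        # Mimics the cache-per-proxy pattern: a new AsyncClient per proxy
        # URL, and none of them is ever closed.
        clients[proxy_url] = httpx.AsyncClient(proxy=proxy_url)
        if (i + 1) % 100 == 0:
            current, _peak = tracemalloc.get_traced_memory()
            print(f"{i + 1} clients: {current / 1024 / 1024:.1f} MiB traced")

    # Close everything so the sketch itself does not leak.
    await asyncio.gather(*(client.aclose() for client in clients.values()))


asyncio.run(main())
```

Note that tracemalloc only sees Python-level allocations; each client also carries its own connection pool and SSL context, so the actual RSS growth is larger than the traced figure.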
> creating an HTTP session for each proxy
There should be a session per proxy. Or, better said, a proxy per session; it's the other way around: we create sessions, and each of them is represented by a proxy.
My hypothesis:
The issue appears to be caused by the creation of a separate HTTP session for each proxy (https://github.com/apify/crawlee-python/blob/master/src/crawlee/http_clients/_httpx.py#L132), combined with the high default maximum SessionPool size of 1000 sessions. As a result, the crawler creates a new HTTP session for almost every request during its initial run.
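Paraphrasing the linked code (simplified from memory, so check the actual source at the URL above), the caching behaves roughly like this:

```python
import httpx


class ClientCache:
    """Rough paraphrase of the per-proxy caching behind the linked
    _get_client; names and constructor arguments are simplified."""

    def __init__(self) -> None:
        self._client_by_proxy_url: dict[str | None, httpx.AsyncClient] = {}

    def get_client(self, proxy_url: str | None) -> httpx.AsyncClient:
        # One client per distinct proxy URL, cached with no eviction.
        # With the default SessionPool size of 1000 and a rotating
        # residential pool, nearly every new session brings a new proxy
        # URL, so nearly every session creates a new client here.
        if proxy_url not in self._client_by_proxy_url:
            self._client_by_proxy_url[proxy_url] = httpx.AsyncClient(proxy=proxy_url)
        return self._client_by_proxy_url[proxy_url]
```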
The absence of cleanup logic for the created HTTP sessions likely makes things worse when the proxy pool contains a large number of "bad" proxies.
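A possible direction for that cleanup (a hypothetical sketch, not the library's actual API): bound the cache and close clients on eviction, LRU-style.

```python
import asyncio
from collections import OrderedDict

import httpx


class BoundedClientCache:
    """Hypothetical LRU variant of the cache above: caps the number of
    live clients and closes the least recently used one on eviction."""

    def __init__(self, max_clients: int = 100) -> None:
        self._max_clients = max_clients
        self._clients: OrderedDict[str | None, httpx.AsyncClient] = OrderedDict()

    async def get_client(self, proxy_url: str | None) -> httpx.AsyncClient:
        if proxy_url in self._clients:
            self._clients.move_to_end(proxy_url)  # mark as recently used
            return self._clients[proxy_url]

        if len(self._clients) >= self._max_clients:
            # Evict the oldest client and actually close it, releasing
            # its connection pool and SSL context instead of leaking them.
            _, evicted = self._clients.popitem(last=False)
            await evicted.aclose()

        client = httpx.AsyncClient(proxy=proxy_url)
        self._clients[proxy_url] = client
        return client
```

This would also cap the damage from "bad" proxies, since their clients would get evicted and closed instead of accumulating.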