
High RAM consumption by any HttpCrawlers when working with large proxy pools #895

Status: Open
Mantisus opened this issue on Jan 10, 2025 · 1 comment
Labels: bug (Something isn't working), t-tooling (Issues with this label are in the ownership of the tooling team)

Mantisus (Collaborator) commented on Jan 10, 2025

When working with large proxy pools (e.g., Apify RESIDENTIAL), I observe significant RAM usage growth. Memory consumption increases by more than a gigabyte within a few minutes during active scraping.

My hypothesis:
The issue appears to be related to the creation of a separate HTTP session for each proxy (https://github.com/apify/crawlee-python/blob/master/src/crawlee/http_clients/_httpx.py#L132), combined with the high default maximum SessionPool size of 1000 sessions. As a result, during its initial run the crawler creates a new HTTP session for almost every request.
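
To illustrate the pattern I mean (a simplified sketch only, not the actual crawlee code): a separate `httpx.AsyncClient` is created and cached for every distinct proxy, and nothing is ever evicted, so with a large rotating pool the cache just keeps growing.

```python
# Illustration only, not the actual crawlee implementation.
from __future__ import annotations

import httpx


class PerProxyClientCache:
    """Caches one httpx.AsyncClient per proxy URL, with no eviction."""

    def __init__(self) -> None:
        self._clients: dict[str, httpx.AsyncClient] = {}

    def get_client(self, proxy_url: str | None) -> httpx.AsyncClient:
        key = proxy_url or 'no-proxy'
        if key not in self._clients:
            # Every new proxy gets its own client, each with its own
            # connection pool, TLS contexts, buffers, etc.
            self._clients[key] = httpx.AsyncClient(proxy=proxy_url)
        return self._clients[key]
```

With the default SessionPool size of 1000 and a residential pool that hands out a different proxy for almost every session, that can mean on the order of a thousand live clients, each holding its own connection pool.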

The absence of any cleanup logic for the created HTTP sessions will likely make this worse when the proxy pool contains a large number of "bad" proxies, since sessions bound to proxies that are no longer used are never closed.
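
By cleanup logic I mean something roughly like the following (an illustrative sketch only, not a proposal for the actual API; the class name and the limit are made up): bound the number of cached clients and close the ones that get evicted.

```python
# Illustrative only: a bounded cache that closes evicted clients.
from __future__ import annotations

from collections import OrderedDict

import httpx


class BoundedClientCache:
    """Keeps at most `max_clients` httpx clients, closing the least recently used."""

    def __init__(self, max_clients: int = 100) -> None:
        self._max_clients = max_clients
        self._clients: OrderedDict[str, httpx.AsyncClient] = OrderedDict()

    async def get_client(self, proxy_url: str | None) -> httpx.AsyncClient:
        key = proxy_url or 'no-proxy'
        if key in self._clients:
            self._clients.move_to_end(key)  # mark as recently used
            return self._clients[key]
        client = httpx.AsyncClient(proxy=proxy_url)
        self._clients[key] = client
        if len(self._clients) > self._max_clients:
            # Evict the least recently used client and release its connections.
            _, oldest = self._clients.popitem(last=False)
            await oldest.aclose()
        return client
```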

The github-actions bot added the t-tooling label on Jan 10, 2025.
B4nan (Member) commented on Jan 10, 2025

AFAIK this works just fine in the JS version; to me it feels like a memory leak somewhere in the Python version. We should investigate why this happens, as I don't see why 1000 sessions should be a problem.

Either way, a complete reproduction for this is needed unless you plan to deal with this on your own.

> creating an HTTP session for each proxy

There should be a session per proxy. Or, better said, a proxy per session; it's the other way around: we create sessions, and each session is represented by a proxy.
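
A self-contained script along these lines is roughly what we'd need (a sketch only; the import paths, parameter names, proxy URLs, and target URLs below are assumptions/placeholders and may differ between crawlee versions):

```python
# Rough reproduction sketch; adjust imports and parameters to the actual setup.
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # A large rotating proxy pool (placeholder URLs).
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[f'http://user:pass@proxy-{i}.example.com:8000' for i in range(1000)],
    )

    crawler = HttpCrawler(proxy_configuration=proxy_configuration)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Fetched {context.request.url}')

    # Enqueue enough requests to cycle through many sessions/proxies,
    # then watch the process RSS while the crawl runs.
    await crawler.run([f'https://example.com/?page={i}' for i in range(5000)])


if __name__ == '__main__':
    asyncio.run(main())
```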

vdusek added the bug label on Jan 10, 2025.