Suggesting new way to schedule requests #119
Comments
Hi, thank you for your input. A while ago I was thinking about something similar but couldn't pursue the implementation. It would be great if you could go ahead with the PR.
Hey! Quick question: was this finally implemented? Thank you!
Hey! I wrote a temporary patch for it.
Hi.
This approach, adding new requests when the spider goes idle, works well, but I think we can improve it. Here is my idea:
Imagine that we configured our spider to handle high load, for example:
(I know it is not good practice to make so many requests, but there are cases where we can do that.)
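The original example settings weren't captured in the thread; for illustration, a high-load configuration in `settings.py` might look like this (all values are made-up examples, not recommendations):

```python
# Illustrative high-load Scrapy settings (arbitrary example values):
CONCURRENT_REQUESTS = 500            # total parallel requests
CONCURRENT_REQUESTS_PER_DOMAIN = 100 # parallel requests per domain
DOWNLOAD_DELAY = 0                   # no artificial throttling
REACTOR_THREADPOOL_MAXSIZE = 20      # more threads for DNS etc.
```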
Now, according to the docs https://doc.scrapy.org/en/latest/topics/signals.html#spider-idle, a spider is idle when it has no further requests waiting to be downloaded, no requests scheduled, and no items being processed in the item pipeline.
Why do we need to wait until items in the pipeline finish being processed? There may be DB inserts and other things that slow the pipeline down, but we don't need to wait for them; we can process new requests in the meantime. Currently, though, the spider waits for everything and only then adds a new batch of requests. My solution is a task that runs every x seconds, checks the scheduler queue size, and adds new requests even if some are already queued. Example (prototype code):
This way we always keep some requests in the queue so that the spider does not go idle (we can still use the idle case as a fallback). The spider stays busy and finishes sooner, while we keep a reasonable number of requests queued and fetch new batches from the DB as needed.
Let me know what you think about this approach. I can contribute with a PR.
Thx.