Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: stopping the crawlers gracefully with BasicCrawler.stop() #2792

Merged
merged 12 commits into from
Jan 20, 2025

Conversation

barjin
Copy link
Contributor

@barjin barjin commented Jan 6, 2025

Allows users to call crawler.stop() to gracefully stop the crawler.

Currently, the crawlers are "stateless", i.e. calling:

async requestHandler({ crawler, enqueueLinks }) {
   await enqueueLinks();
   crawler.stop();
}

...

await crawler.run(['example.com']);
await crawler.run();

Will only crawl example.com once, then stop and purge the RQ / dataset, so the second crawler.run() call will yield no results.

I suppose this is expected, but we could easily add a crawler.pause() method, which would keep the inner state for resuming with crawler.run().

Closes #2777

@barjin barjin requested review from janbuchar and B4nan January 6, 2025 15:51
@barjin barjin self-assigned this Jan 6, 2025
@github-actions github-actions bot added this to the 105th sprint - Tooling team milestone Jan 6, 2025
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 6, 2025
@B4nan
Copy link
Member

B4nan commented Jan 9, 2025

I suppose this is expected

yes

but we could easily add a crawler.pause() method, which would keep the inner state for resuming with crawler.run().

makes sense

B4nan

This comment was marked as resolved.

@B4nan

This comment was marked as resolved.

@janbuchar janbuchar requested a review from Pijukatel January 9, 2025 09:48
@janbuchar

This comment was marked as resolved.

@B4nan
Copy link
Member

B4nan commented Jan 9, 2025

I suppose this is expected, but we could easily add a crawler.pause() method, which would keep the inner state for resuming with crawler.run().

Btw the purging in run is configurable, you can opt out of that via crawler.run([], { purgeRequestQueue: false }), I guess that should work already even with the stop() method?

Copy link

@Pijukatel Pijukatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a question.

In Python PR
stop was implemented by modifying __is_finished_function, so it avoids calling abort explicitly. Here it explicitly calls abort. I guess we should decide on one way and align both?

@barjin
Copy link
Contributor Author

barjin commented Jan 9, 2025

More of a question.

In Python PR stop was implemented by modifying __is_finished_function, so it avoids calling abort explicitly. Here it explicitly calls abort. I guess we should decide on one way and align both?

// Gracefully starve the this.autoscaledPool, so it doesn't start new tasks. Resolves once the pool is cleared.
this.autoscaledPool
?.pause()
// Resolves the `autoscaledPool.run()` promise in the `BasicCrawler.run()` method. Since the pool is already paused, it resolves immediately and doesn't kill any tasks.
.then(async () => this.autoscaledPool?.abort())

See the comments - abort is only called after a pause() call, which only resolves once there are no tasks running (basically emulates setting the finished flag you have in Python). The subsequent abort call will therefore just clean up after the Pool (and most importantly, it will cause the autoscaledPool.run() call to resolve, which unblocks the crawler and causes it to finish normally).

TLDR: I think we both arrived at the same result, this version uses the AutoscaledPool methods directly (not sure whether there is full parity in Python, might be not possible).

barjin and others added 2 commits January 9, 2025 13:22
Co-authored-by: Martin Adámek <[email protected]>
Co-authored-by: Martin Adámek <[email protected]>
@barjin

This comment was marked as off-topic.

@barjin

This comment was marked as resolved.

@barjin barjin force-pushed the feat/crawler-stop branch from ba466a7 to e7388de Compare January 10, 2025 09:46
Copy link
Contributor

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'm fine with merging this as is, but please wait for an OK from @B4nan I guess 🙂

@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Jan 18, 2025
@barjin barjin force-pushed the feat/crawler-stop branch from 40674ee to 8e9ea84 Compare January 20, 2025 07:28
@barjin
Copy link
Contributor Author

barjin commented Jan 20, 2025

E2E tests are passing 🎉

merging now

@barjin barjin merged commit af2966f into master Jan 20, 2025
12 checks passed
@barjin barjin deleted the feat/crawler-stop branch January 20, 2025 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: implement a way to stop crawler from the user function
4 participants