
feat: implement a way to stop crawler from the user function #2777

Open
barjin opened this issue Dec 18, 2024 · 5 comments · May be fixed by #2792
Labels
t-tooling Issues with this label are in the ownership of the tooling team.
@barjin (Contributor) commented Dec 18, 2024

This is a parity-tracking issue for this PR in Crawlee for Python: apify/crawlee-python#651

Currently, the only way to stop a crawler instance is to call the BasicCrawler.teardown() method, which is both undocumented (it carries the @ignore TypeDoc decorator) and poorly named for this purpose.

The crawler.stop() implementation in Crawlee for Python forces the AutoscaledPool to take no more tasks, but to gracefully finish the ones currently in progress. This is different from the AutoscaledPool.abort method (called by crawler.teardown()), which, according to its docstring, abandons the running tasks on the spot ("all running tasks will be left in their current state").
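For illustration, a minimal sketch of how such an API might look in Crawlee for JavaScript. Note that crawler.stop() is hypothetical here until a PR such as #2792 lands; the semantics mirror the Python implementation described above:

```ts
import { CheerioCrawler } from 'crawlee';

let pagesCrawled = 0;

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        pagesCrawled += 1;
        log.info(`Processing ${request.url}`);

        // Hypothetical API: stop() would tell the AutoscaledPool to take no
        // new tasks while letting in-flight requests finish gracefully,
        // unlike teardown(), which leaves running tasks in their current state.
        if (pagesCrawled >= 100) {
            crawler.stop();
        }
    },
});

await crawler.run(['https://example.com']);
```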

More context / discussion at https://apify.slack.com/archives/CD0SF6KD4/p1734526549266519

github-actions bot added the t-tooling label on Dec 18, 2024
@danielcrabtree
In addition, there is no way to stop the periodicLogger. It continues to fire after a CriticalError, even if teardown() is used.

@barjin (Contributor, Author) commented Jan 6, 2025

Interesting, @danielcrabtree - can you please provide a minimal reproducible scenario? From looking into the code, it seems that the periodicLogger should definitely stop once the BasicCrawler.run promise has resolved.

@danielcrabtree
@barjin I'm using PlaywrightCrawler, and if you simply throw new CriticalError("Issue"); within requestHandler, the onrejected side of the run promise is invoked. If you run with DEBUG-level logs, you'll continue to see Crawled 0/1 pages, 0 failed requests, desired concurrency 2. printed repeatedly by the periodicLogger. Since the stop method for the periodicLogger is held within a closure, there is no way to stop it externally.
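Roughly, a minimal reproduction (the URL is a placeholder; the behaviour is as described above):

```ts
import { PlaywrightCrawler, CriticalError } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler() {
        // Rejects the run promise with a CriticalError...
        throw new CriticalError('Issue');
    },
});

try {
    await crawler.run(['https://example.com']);
} catch (err) {
    // ...yet with DEBUG-level logs the periodicLogger keeps printing
    // "Crawled 0/1 pages, ..." because its stop() call was never reached.
    console.error(err);
}
```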

If you look at the BasicCrawler code, you'll see there is only one call to periodicLogger.stop();, and it happens outside any finally block, so it is never executed when an exception occurs. I think the crawler should ensure it is properly tidied up and everything is stopped when an exception is going to bubble up through the run promise.
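Something like this pattern would cover both paths (a sketch of the general shape, not the actual BasicCrawler source; runWithCleanup and stopPeriodicLogger are illustrative names):

```ts
// Sketch: ensure cleanup runs whether the crawl resolves or rejects.
async function runWithCleanup(
    crawl: () => Promise<void>,
    stopPeriodicLogger: () => void,
): Promise<void> {
    try {
        await crawl();
    } finally {
        // Executes even when crawl() throws (e.g. a CriticalError), so the
        // periodic logger is always stopped before the error bubbles up.
        stopPeriodicLogger();
    }
}
```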

@barjin (Contributor, Author) commented Jan 10, 2025

I see, thank you for your detailed description! I'll handle this in the currently open PR, as it seems (at least tangentially) related. Cheers!

@barjin (Contributor, Author) commented Jan 10, 2025

On second thought, the BasicCrawler.run method deserves a more thorough refactor to ensure we clean up after the crawler in all possible scenarios. I submitted a separate issue (#2807) for this - feel free to add more details there if anything comes to mind. Thank you for bringing this up!
