
feat: implement a way to stop crawler from the user function #2777

Open
barjin opened this issue Dec 18, 2024 · 5 comments · May be fixed by #2792
Labels
t-tooling Issues with this label are in the ownership of the tooling team.
@barjin (Contributor) commented Dec 18, 2024

This is a parity-tracking issue for this PR in Crawlee for Python: apify/crawlee-python#651

Currently, the only way to stop a crawler instance is to call the BasicCrawler.teardown() method, which is both undocumented (it carries the @ignore TypeDoc decorator) and poorly named for this purpose.

The crawler.stop() implementation in Crawlee for Python forces the AutoscaledPool to take no more tasks, but to gracefully finish the ones currently in progress. This is different from the AutoscaledPool.abort method (called by crawler.teardown()), which, according to its docstring, abandons the running tasks on the spot ("all running tasks will be left in their current state").
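For illustration, a minimal sketch of how such an API might look in Crawlee for JavaScript. Note that crawler.stop() is hypothetical here until a PR such as #2792 lands; the semantics mirror the Python implementation described above:

```ts
import { CheerioCrawler } from 'crawlee';

let pagesCrawled = 0;

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        pagesCrawled += 1;
        log.info(`Processing ${request.url}`);

        // Hypothetical API: stop() would tell the AutoscaledPool to take no
        // new tasks while letting in-flight requests finish gracefully,
        // unlike teardown(), which leaves running tasks in their current state.
        if (pagesCrawled >= 100) {
            crawler.stop();
        }
    },
});

await crawler.run(['https://example.com']);
```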

More context / discussion at https://apify.slack.com/archives/CD0SF6KD4/p1734526549266519

github-actions bot added the t-tooling label on Dec 18, 2024
@danielcrabtree
In addition, there is no way to stop the periodicLogger. It continues to fire after a CriticalError, even if teardown() is used.

@barjin (Contributor, Author) commented Jan 6, 2025

Interesting, @danielcrabtree - can you please provide a minimal reproducible scenario? From looking into the code, it seems that the periodicLogger should definitely stop once the BasicCrawler.run promise has resolved.

@danielcrabtree
@barjin I'm using PlaywrightCrawler, and if you simply throw new CriticalError("Issue"); within requestHandler, the onrejected side of the run promise is invoked. If you run with DEBUG-level logs, you'll continue to see Crawled 0/1 pages, 0 failed requests, desired concurrency 2. printed repeatedly by the periodicLogger. Since the stop method for the periodicLogger is held within a closure, there is no way to stop it externally.
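Roughly, a minimal reproduction (the URL is a placeholder; the behaviour is as described above):

```ts
import { PlaywrightCrawler, CriticalError } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler() {
        // Rejects the run promise with a CriticalError...
        throw new CriticalError('Issue');
    },
});

try {
    await crawler.run(['https://example.com']);
} catch (err) {
    // ...yet with DEBUG-level logs the periodicLogger keeps printing
    // "Crawled 0/1 pages, ..." because its stop() call was never reached.
    console.error(err);
}
```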

If you look at the BasicCrawler code, you'll see there is only one call to periodicLogger.stop();, and it happens outside any finally block, so it is never executed when an exception occurs. I think the crawler should ensure it is properly tidied up and everything is stopped when an exception is going to bubble up through the run promise.
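Something like this pattern would cover both paths (a sketch of the general shape, not the actual BasicCrawler source; runWithCleanup and stopPeriodicLogger are illustrative names):

```ts
// Sketch: ensure cleanup runs whether the crawl resolves or rejects.
async function runWithCleanup(
    crawl: () => Promise<void>,
    stopPeriodicLogger: () => void,
): Promise<void> {
    try {
        await crawl();
    } finally {
        // Executes even when crawl() throws (e.g. a CriticalError), so the
        // periodic logger is always stopped before the error bubbles up.
        stopPeriodicLogger();
    }
}
```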

@barjin (Contributor, Author) commented Jan 10, 2025

I see, thank you for your detailed description! I'll handle this in the currently open PR, as it seems (at least tangentially) related. Cheers!

@barjin (Contributor, Author) commented Jan 10, 2025

On second thought, the BasicCrawler.run method deserves a more thorough refactor to ensure we clean up after the crawler in all possible scenarios. I submitted a separate issue (#2807) for this - feel free to add more details there if anything comes to mind. Thank you for bringing this up!
