Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: stopping the crawlers gracefully with BasicCrawler.stop() #2792

Merged
merged 12 commits into from
Jan 20, 2025
17 changes: 17 additions & 0 deletions packages/basic-crawler/src/internals/basic-crawler.ts
Original file line number Diff line number Diff line change
Expand Up @@ -974,6 +974,23 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
return stats;
}

/**
* Gracefully stops the current run of the crawler.
*
* All the tasks active at the time of calling this method will be allowed to finish.
*/
stop(message: string = 'This crawler has been gracefully stopped.') {
barjin marked this conversation as resolved.
Show resolved Hide resolved
// Gracefully starve the this.autoscaledPool, so it doesn't start new tasks. Resolves once the pool is cleared.
this.autoscaledPool
?.pause()
// Resolves the `autoscaledPool.run()` promise in the `BasicCrawler.run()` method. Since the pool is already paused, it resolves immediately and doesn't kill any tasks.
.then(async () => this.autoscaledPool?.abort())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the motivation for not making the method async and ignoring the promise instead? feels handy to be able to await it (and users can always decide themselves to ignore the promise, unlike with the current solution)

also I don't think you need the async in here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I planned it like this in the beginning, see 284274b

The reasons for moving to this was:
a. parity w/ Python version
b.

async requestHandler({ crawler }) {
    if (iFoundWhatISearchedFor) {
       await requestHandler.stop();
    }
}

^ this looks like a completely harmless piece of code, but will cause deadlock. requestHandler is waiting for autoscaledPool.pause, which is waiting for all tasks to finish (including the aforementioned requestHandler).

also I don't think you need the async in here

You and me both, but @typescript-eslint/promise-function-async thinks otherwise.

Copy link
Contributor Author

@barjin barjin Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I'm definitely for making this async, but we need to address these somehow. Also, note that even with the implementation from 284274b , crawler.stop resolves once autoscaledPool.abort() has resolved, but that (afaik) doesn't mean the crawler.run() has resolved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm honestly not sure that there is a good reason for an async stop() instead of non-async one. If you are expected to call it from a request handler, waiting for the crawler to stop there leads to a deadlock and the reasons are quite intuitive.

Should someone need to wait for the crawler to finish outside of the request handler, they can always wait for the promise from crawler.run() to resolve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm always a bit wary about floating promises, but as long as you guys are fine with it (and there is the catch clause to stop it from killing the whole process), let's see how it goes.

I'll still add the promised tests from #2803 , just to stay on the safer side.

.then(() => this.log.info(message))
.catch((err) => {
this.log.error('Error stopping the crawler:', err);
barjin marked this conversation as resolved.
Show resolved Hide resolved
});
}

async getRequestQueue() {
if (!this.requestQueue && this.requestList) {
this.log.warningOnce(
Expand Down
Loading