
Adjust Timeout for Cancelling Dagster Run to Support Large Sitemap Crawls #55

Open
webb-ben opened this issue Dec 1, 2024 · 1 comment



webb-ben commented Dec 1, 2024

Currently, when deploying the scheduler for the live Geoconnex sitemap, we encounter issues while crawling sitemaps that contain around 50,000 entries. This is due to the automatic cancellation of any Dagster run where a step exceeds 180 seconds. As a result, the process is prematurely interrupted, making it difficult to reliably crawl large sitemaps.

Expected Outcome:

We need to raise the timeout to a value that allows large sitemaps (up to 50,000 entries) to be crawled end to end, so that Dagster's monitoring and cancellation system no longer interrupts these long-running crawls.
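
If the 180-second cancellation comes from Dagster's run monitoring daemon, the limit can likely be raised in the instance's `dagster.yaml`. Below is a minimal sketch, assuming the deployment uses Dagster's `run_monitoring` settings; the specific values are illustrative assumptions, not the scheduler's actual configuration:

```yaml
# dagster.yaml (instance config). Values here are illustrative assumptions,
# not the scheduler's current settings.
run_monitoring:
  enabled: true
  # start_timeout_seconds and cancel_timeout_seconds both default to 180;
  # raising them keeps slow runs from being failed or cancelled too early.
  start_timeout_seconds: 600
  cancel_timeout_seconds: 600
  # Hard cap on total run time before the monitoring daemon terminates a run.
  # The crawls reported in the comment below run up to ~11.5 h, so 24 h
  # leaves headroom.
  max_runtime_seconds: 86400
```

If only the crawl jobs need the longer budget, Dagster also supports a per-job `dagster/max_runtime` tag, which overrides the instance-wide limit for runs of that job.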


webb-ben commented Dec 13, 2024

For reference, observed crawl durations for the reference-feature sitemaps:

| Sitemap | Duration (h:mm:ss) | Items |
| --- | --- | --- |
| ref_princi_aq_princi_aq__0 | 0:16:47 | unknown |
| ref_gages_gages__0 | 11:16:16 | 50,000 |
| ref_mainstems_mainstems__0 | 8:41:53 | 34,082 |
| ref_gages_gages__1 | 10:02:30 | 50,000 |
| ref_gages_gages__2 | 7:43:41 | 50,000 |
| ref_gages_gages__3 | 9:17:44 | 37,570 |

Each reported duration is the combined time of the Gleaner crawl and the Nabu prov release.
