
NOAA harvest job stuck because of large file size #4965

Open · rshewitt opened this issue on Nov 4, 2024 · 3 comments

Labels: bug (Software defect or bug), component/catalog (Related to catalog component), playbooks/roles, O&M (Operations and maintenance tasks for the Data.gov platform)

Comments

rshewitt (Contributor) commented on Nov 4, 2024

noaa-nesdis-ncei-accessions has some datasets that cause an out-of-memory error in catalog-fetch (i.e. the log message is "Killed"). Related to 1487. Here is a dataset that managed to be created after increasing catalog-fetch memory, but because of its size the server responds with a 500 in the UI.

How to reproduce

  1. Harvest the source.

Expected behavior

The job completes without timing out.

Actual behavior

The job gets stuck and times out after the 72-hour limit.

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

rshewitt added the bug label on Nov 4, 2024
FuhuXia (Member) commented on Nov 4, 2024

For this particular case, the stuck job is directly related to the enormous number of tags (keywords) in some XML records, 34,866 to be exact for the sampled one. The large file size is also a result of all those tags. If we set a maximum number of allowed tags, we can reject these nonsense records instead of letting the job get stuck.

Rejecting records based on file size alone may be too broad.

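A minimal sketch of what such a check could look like, assuming the harvester can inspect the raw ISO 19115 XML before creating the package; the MAX_KEYWORDS value and the function names are illustrative assumptions, not part of the current codebase:

```python
# Sketch only: reject a harvested ISO record whose keyword count is absurd,
# instead of letting catalog-fetch run out of memory on it.
import xml.etree.ElementTree as ET

MAX_KEYWORDS = 3000  # illustrative "ridiculously high" ceiling

ISO_NS = {"gmd": "http://www.isotc211.org/2005/gmd"}


def keyword_count(xml_bytes: bytes) -> int:
    """Count gmd:keyword elements anywhere in an ISO 19115 record."""
    root = ET.fromstring(xml_bytes)
    return len(root.findall(".//gmd:keyword", ISO_NS))


def should_reject(xml_bytes: bytes) -> bool:
    """True when the record exceeds the keyword limit and should be skipped
    with a harvest error rather than passed on to catalog-fetch."""
    return keyword_count(xml_bytes) > MAX_KEYWORDS
```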

btylerburton (Contributor) commented

To Fuhu's point, we should set a reasonably high limit for each field, publicize it somewhere, and then hard-fail datasets that exceed that limit. In H2.0 we can even throw custom errors to highlight this.

FuhuXia (Member) commented on Nov 5, 2024

If we set the limit ridiculously high, say 3,000, maybe we can get away without publicizing it, because it will be really rare for any record to reach that limit. And when one does, people will know why the dataset fails to harvest, because it is ridiculous. Who would create a dataset with 1,500 resources or 3,000 keywords?
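A sketch of how per-field limits with a custom error might look, using the 3,000-keyword and 1,500-resource figures from this thread as placeholders; the field names, limits, and exception class are assumptions for illustration, not the project's actual implementation:

```python
# Sketch only: hard-fail a dataset dict when any list-type field exceeds a
# published limit, so the harvest report shows a clear custom error.
FIELD_LIMITS = {
    "tags": 3000,       # keywords
    "resources": 1500,  # distributions
}


class FieldLimitExceeded(Exception):
    """Raised so the failure surfaces as a readable harvest error."""


def validate_field_limits(dataset_dict: dict) -> None:
    for field, limit in FIELD_LIMITS.items():
        count = len(dataset_dict.get(field) or [])
        if count > limit:
            raise FieldLimitExceeded(
                f"{field}: {count} items exceeds the maximum of {limit}"
            )
```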

hkdctol added the component/catalog, playbooks/roles, and O&M labels on Nov 7, 2024
hkdctol moved this to 📥 Queue on the data.gov team board on Nov 7, 2024