[21pt] Segmentation Fault #1376

Open
RobHanna-NOAA opened this issue Dec 11, 2024 · 2 comments · Fixed by #1382
Comments

RobHanna-NOAA (Contributor) commented Dec 11, 2024

Note: This is now part of the #1377 EPIC: FIM Sys Admin Tasks (and a few related FIM tasks).

A number of machines have shown segmentation faults against branches when running fim_pipeline. This causes branches to fail, especially large branches, and slows machines to a crawl for long periods.

A lot of research and testing has already gone into this, and it is actively being worked on; the total effort may end up being substantial. We have already run many tests to narrow down what the problem is, where it occurs, and how to fix it. It is a very elusive problem.

Segmentation faults are almost always related to memory issues and memory leaks, and historically we have seen them randomly inside Docker containers where a task was aborted and another kicked off. A hard stop of a task in a Docker container (CTRL-C), especially in code running multi-processing, leaves a large amount of orphaned data and workers in memory. To fix this, we generally just stop the Docker container, restart it, and it is fine. However, that is not the problem in this latest segmentation fault issue.
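For reference, a rough sketch of that kind of cleanup; the container name fim_dev is purely a placeholder, and the orphaned-worker check is just an illustration:

```bash
# Placeholder container name; substitute the actual dev container.
CONTAINER=fim_dev

# Illustrative check: orphaned Python workers still holding memory after a CTRL-C
docker exec "$CONTAINER" ps -eo pid,ppid,rss,cmd | grep -i python

# The usual fix: restart the container so the orphaned memory is released
docker restart "$CONTAINER"
```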

This problem is making most fim_pipeline runs on EC2s fail on some branches, but not all and not consistently. We are also seeing various oddities in post-processing and eval_plots; it is unknown whether they are related.

Various notes (and Rob has a ton more):

  • We first saw it when creating a Docker update for dep-bot, but we now feel very confident that it is unrelated, as we can reproduce the fault on current dev containers.

  • We have seen that dropping the number of jobs (i.e., setting -jh to a lower value) seems to lessen the likelihood of the error, but not consistently (see the usage sketch after this list).

  • While unconfirmed, the error may have been less likely on a special EC2 called "xxsx_Rob_clean", which is a fresh new image created again from base 6 but with no updates applied.

  • Some machines have been showing more and more errors at the Ubuntu OS level, and some of the security upgrades might be related to the problem. I have seen fault errors from updates to various installed software; I don't know which ones and did not record them, but we can review the update logs to see if they give any clues. Examples of possibly related software are QGIS patches, VSCode, LibreOffice, Notepad++, etc., but it could also just be one of the Ubuntu OS-level tools. A lot of research might be required here.

  • A test has not yet been done in AWS Step Functions to see if the fault shows up there.

  • It tends to fail on larger HUCs. Many of our tests were run against 07080103, 07080208, 12040101, 17050114, 12040103, 12090301, and 19020302, with the 17x and 19x HUCs failing. I don't think we saw other HUCs fail, but more tests are needed.

  • When it does fail, it almost always fails somewhere inside one of the .py files called from delineate_hydros_and_produce_HAND.sh, but which .py file shows the actual failure is inconsistent. This suggests that, regardless of the root cause, some objects somewhere in the Python files of that code chunk have memory leaks. That should be reviewed either way, and Matt is already on this part.

  • We are seeing a large number of warnings around those HUCs that might offer some clues

  • The errors are not always obvious, but they do show up in the final post-processing error log rollup file. From there they can be traced back to the underlying segmentation fault (see the log-tracing sketch after this list).

  • There are a ton of other notes on this topic, but above are some of the key ones so far

  • One possibility is that Docker uses some disk space as part of its normal operation, and even as part of its memory overflow. Our security updates have been updating Docker, and it may no longer be allocating enough space.

  • It could be any number of things, from the Docker engine to an Ubuntu OS issue to who knows what. It does not seem to be a code issue specifically, as we see the error (or at least similar errors) outside Docker containers. But there is, independently, a fix that should be reviewed in delineate_hydros_and_produce_HAND.sh (mentioned above).

  • We have a number of cards that may play into this story. It is not outside the realm of possibility that something in the Ubuntu OS has been compromised. The main card for that is probably #1363, where we are looking at upgrading, rebuilding, or replacing our EC2s.
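For the -jh note above, a hedged usage sketch; the HUC, run name, and job counts are illustrative, and flag names other than -jh are from memory and may differ by fim_pipeline.sh version:

```bash
# Illustrative only: lower the parallel job limits to reduce memory pressure.
# Only -jh is confirmed above; the other flags and values are assumptions.
./fim_pipeline.sh -u 12090301 -n seg_fault_test -jh 2 -jb 10
```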
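For tracing the faults back from the rollup logs, a rough sketch; the paths are placeholders for the actual run output directory:

```bash
# Placeholder path; substitute the actual fim_pipeline output/run directory.
RUN_DIR=/outputs/my_fim_run

# Which branch/HUC logs actually mention the fault
grep -ril "segmentation fault" "$RUN_DIR/logs" | sort

# On the EC2 host, the kernel log records which binary faulted and at what address
sudo dmesg -T | grep -i "segfault" | tail -n 20
```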

RobHanna-NOAA (Contributor, Author) commented:

I am chasing down possible problems related to limits built into the Docker engine settings. That might also explain why, historically, I have not seen Docker make very good use of memory: most usage would be code objects, but I would expect occasional memory spikes and rarely see them.
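A rough sketch of the kinds of engine-level checks involved, not a confirmed diagnosis; fim_dev is a placeholder container name:

```bash
# Engine-level view: total memory, cgroup driver, storage driver
docker info | grep -iE "memory|cgroup|storage driver"

# Live memory/CPU usage per container while fim_pipeline is running
docker stats --no-stream

# Shared memory available to the container; Python multiprocessing uses /dev/shm,
# and Docker's small default can be a bottleneck for large jobs
docker inspect --format '{{.HostConfig.ShmSize}}' fim_dev

# Disk space Docker itself consumes (images, containers, volumes, build cache)
docker system df
```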

RobHanna-NOAA (Contributor, Author) commented:

A partial solution was merged today, but the overall issue is still a WIP.

RobHanna-NOAA reopened this Jan 4, 2025