[21pt] Segmentation Fault #1376

Open
RobHanna-NOAA opened this issue Dec 11, 2024 · 2 comments · Fixed by #1382
Comments

RobHanna-NOAA (Contributor) commented Dec 11, 2024

Note: This is now part of the #1377 EPIC: FIM Sys Admin Tasks (and a few related FIM tasks).

A number of machines have shown segmentation faults against branches when running fim_pipeline. This causes branches to fail, especially large branches, and slows machines to a crawl for long periods.

A lot of research and testing has already gone into this, and it is actively being worked on; the total effort may end up being substantial. We have already run many tests to narrow down what the problem is, where it occurs, and how to fix it. It is a very elusive problem.

Segmentation faults are almost always related to memory issues and memory leaks, and historically we have seen them randomly inside Docker containers where a task was aborted and another kicked off. A hard stop of a task in a Docker container (CTRL-C), especially in code running multi-processing, leaves a large amount of orphaned data and workers in memory. To fix this, we generally just stop the Docker container, restart it, and it is fine. However, that is not the problem in this latest segmentation fault issue.
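For reference, a rough sketch of that kind of cleanup; the container name fim_dev is purely a placeholder, and the orphaned-worker check is just an illustration:

```bash
# Placeholder container name; substitute the actual dev container.
CONTAINER=fim_dev

# Illustrative check: orphaned Python workers still holding memory after a CTRL-C
docker exec "$CONTAINER" ps -eo pid,ppid,rss,cmd | grep -i python

# The usual fix: restart the container so the orphaned memory is released
docker restart "$CONTAINER"
```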

This problem is making most fim_pipeline runs on EC2s fail on some branches, but not all and not consistently. We are also seeing various oddities in post-processing and eval_plots; it is unknown whether they are related.

Various notes (and Rob has a ton more):

  • We first saw it when creating a Docker update for dep-bot, but we now feel very confident that it is unrelated, as we can reproduce the fault on current dev containers.

  • We have seen that dropping the number of jobs (i.e., setting -jh to a lower value) seems to lessen the likelihood of the error, but not consistently (see the usage sketch after this list).

  • While unconfirmed, the error may have been less likely on a special EC2 called "xxsx_Rob_clean", which is a fresh new image created again from base 6 but with no updates applied.

  • Some machines have been showing more and more errors at the Ubuntu OS level, and some of the security upgrades might be related to the problem. I have seen fault errors from updates to various installed software; I don't know which ones and did not record them, but we can review the update logs to see if they give any clues. Examples of possibly related software are QGIS patches, VSCode, LibreOffice, Notepad++, etc., but it could also just be one of the Ubuntu OS-level tools. A lot of research might be required here.

  • A test has not yet been done in AWS Step Functions to see if the fault shows up there.

  • It tends to fail on larger HUCs. Many of our tests were run against 07080103, 07080208, 12040101, 17050114, 12040103, 12090301, and 19020302, with the 17x and 19x HUCs failing. I don't think we saw other HUCs fail, but more tests are needed.

  • When it does fail, it almost always fails somewhere inside one of the .py files called from delineate_hydros_and_produce_HAND.sh, but which .py file shows the actual failure is inconsistent. This suggests that, regardless of the root cause, some objects somewhere in the Python files of that code chunk have memory leaks. That should be reviewed either way, and Matt is already on this part.

  • We are seeing a large number of warnings around those HUCs that might offer some clues

  • The errors are not always obvious, but they do show up in the final post-processing error log rollup file. From there they can be traced back to the underlying segmentation fault (see the log-tracing sketch after this list).

  • There are a ton of other notes on this topic, but above are some of the key ones so far

  • One possibility is that Docker uses some disk space as part of its normal operation, and even as part of its memory overflow. Our security updates have been updating Docker, and it may no longer be allocating enough space.

  • It could be any number of things, from the Docker engine to an Ubuntu OS issue to who knows what. It does not seem to be a code issue specifically, as we see the error (or at least similar errors) outside Docker containers. But there is, independently, a fix that should be reviewed in delineate_hydros_and_produce_HAND.sh (mentioned above).

  • We have a number of cards that may play into this story. It is not outside the realm of possibility that something in the Ubuntu OS has been compromised. The main card for that is probably #1363, where we are looking at upgrading, rebuilding, or replacing our EC2s.
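For the -jh note above, a hedged usage sketch; the HUC, run name, and job counts are illustrative, and flag names other than -jh are from memory and may differ by fim_pipeline.sh version:

```bash
# Illustrative only: lower the parallel job limits to reduce memory pressure.
# Only -jh is confirmed above; the other flags and values are assumptions.
./fim_pipeline.sh -u 12090301 -n seg_fault_test -jh 2 -jb 10
```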
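For tracing the faults back from the rollup logs, a rough sketch; the paths are placeholders for the actual run output directory:

```bash
# Placeholder path; substitute the actual fim_pipeline output/run directory.
RUN_DIR=/outputs/my_fim_run

# Which branch/HUC logs actually mention the fault
grep -ril "segmentation fault" "$RUN_DIR/logs" | sort

# On the EC2 host, the kernel log records which binary faulted and at what address
sudo dmesg -T | grep -i "segfault" | tail -n 20
```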

RobHanna-NOAA (Contributor, Author) commented:

I am chasing down possible problems related to limits built into the Docker engine settings. That might also explain why, historically, I have not seen Docker make very good use of memory: most usage would be code objects, but I would expect occasional memory spikes and rarely see them.
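A rough sketch of the kinds of engine-level checks involved, not a confirmed diagnosis; fim_dev is a placeholder container name:

```bash
# Engine-level view: total memory, cgroup driver, storage driver
docker info | grep -iE "memory|cgroup|storage driver"

# Live memory/CPU usage per container while fim_pipeline is running
docker stats --no-stream

# Shared memory available to the container; Python multiprocessing uses /dev/shm,
# and Docker's small default can be a bottleneck for large jobs
docker inspect --format '{{.HostConfig.ShmSize}}' fim_dev

# Disk space Docker itself consumes (images, containers, volumes, build cache)
docker system df
```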

RobHanna-NOAA (Contributor, Author) commented:

A partial solution was merged today, but the overall issue is still a WIP.

RobHanna-NOAA reopened this Jan 4, 2025