Sporadic error during training #8
Just wanted to note that the second error happens just because we have the prediction command (…).
Hello Philippe and Ricardo, thanks for opening this issue! The error you describe sounds like something is wrong with reading and/or writing the model input and target subtomos as mrc files. Unfortunately, I have no idea as to what that might be, and the fact that it happens at random makes it hard to debug, but I will try my best!

Is the error message you shared the complete one? If not, could you please share the full message? That might help me find a place to start.

Best,
Hi Simon, Thanks for looking into it nevertheless! The entire error is this:
and actually I just saw another one, which seems to originate from the same problem:
Cheers,
Hi Philippe,

thank you for sharing the full error messages. The problem is indeed related to reading and writing the headers of the subtomogram mrc files. For that DDW uses the […].

In the meantime, if you are still using DDW and encounter another mrc-header-related error, it would be great if you could find out which subtomogram is causing the error and send me the corrupted file. This would be helpful for debugging, since I don't know what exactly is causing the header error. To find the corrupted subtomogram after the error has occurred, you could for example try to open all subtomograms in your DDW fitting data directory (e.g. […]).

Best,
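For illustration, a minimal sketch of such a scan, assuming the subtomograms are plain `.mrc` files somewhere under a fitting data directory (the directory path and glob pattern are placeholders, not DDW's actual layout):

```python
import glob

import mrcfile  # standard Python package for reading MRC files

# Hypothetical location of the fitting subtomograms; adjust to your setup.
subtomo_dir = "path/to/ddw/fitting_data"

for path in sorted(glob.glob(f"{subtomo_dir}/**/*.mrc", recursive=True)):
    try:
        # permissive=True still opens files with slightly malformed headers,
        # so we also access the data to trigger any remaining errors.
        with mrcfile.open(path, permissive=True) as mrc:
            _ = mrc.data.shape
    except Exception as exc:
        print(f"Possibly corrupted subtomogram: {path} ({exc})")
```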
Hi Philippe and Ricardo,

just a quick heads up: after thinking about this issue for a while, I think that the cleanest solution would be to save the subtomograms as torch tensors (`.pt` files) instead of `.mrc` files, which avoids the mrc header handling altogether.

Best,
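As a rough sketch of the idea (file name and tensor shape are just placeholders), saving and loading a subtomogram as a torch tensor sidesteps the MRC header entirely:

```python
import torch

# Saving a subtomogram as a raw torch tensor: no MRC header needs to be
# computed or validated, only the tensor data itself is serialized.
subtomo = torch.randn(64, 64, 64)  # stand-in for a real subtomogram
torch.save(subtomo, "subtomo_0000.pt")

# Loading it back for model fitting.
subtomo_loaded = torch.load("subtomo_0000.pt")
assert torch.equal(subtomo, subtomo_loaded)
```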
Dear Simon,

Thanks a lot! I think the `.pt` approach sounds like a good idea.

Kind regards,
Dear Philippe,

I have just pushed the version with the `.pt` subtomos to a new branch. If this fix has resolved your issue, I will merge the changes into the main branch. Please let me know if you have any questions or problems!

Best,
Thanks a lot, Simon! We will test it. If I may give some extra feedback on this: as far as we know the issue happens at random, which I guess is related to the very intensive I/O when dealing with many subtomogram files. I'm not sure if just changing the file format will prevent this from happening again. My suspicion is that some thermal fluctuation on the GPUs generates weird numbers (like NaN or None) in the subtomograms during fitting, which ultimately causes the MRC header calculation to fail.
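For anyone who wants to test that hypothesis, a minimal sketch of a NaN/Inf check that could be run on a subtomogram before it is written to disk (the function and variable names are hypothetical, not part of DDW):

```python
import torch

def check_finite(subtomo: torch.Tensor, name: str) -> None:
    """Warn if a subtomogram contains NaN or Inf values before it is saved."""
    if not torch.isfinite(subtomo).all():
        n_bad = (~torch.isfinite(subtomo)).sum().item()
        print(f"WARNING: {name} contains {n_bad} non-finite values")

# Hypothetical usage inside a saving routine:
# check_finite(subtomo, "subtomo_0042")
```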
That's an interesting point! If we indeed get random NaN values in the subtomograms, […]

I am very curious about the outcome of your tests! Let me know if you have any problems.
I have also had this error with fit-model in the last week. My scripts overwrote the log file, but when I ran on the same tomogram several times it happened at various different epochs from 250-850. Most of the errors happened closer to epoch 850 than to 250, which I think suggests that it's a random event that has a chance of happening every epoch (maybe more likely as time goes on). I'm interested to hear if the new branch fixed it.

I previously ran without any training issues on a different dataset, but with refine-tomogram I kept getting an "Errno 28: no space left on device" error. I managed to fix that by setting TMPDIR (in our slurm system) to be in my filespace on our gpfs storage system instead of on the node.
Thanks for sharing @RHennellJames! The fact that you get the same error suggests that the issue does not only occur for a certain hardware setup, which is good to know.

@Phaips @rdrighetto did switching to the new `.pt` subtomo branch resolve the error for you?
Hi @SimWdm! We finally got to test the new branch with the `.pt` subtomos.

To summarize, the new code works, but it seems to be slower and we cannot really say whether it solves the issue, given the random nature of the problem. We will be on lab retreat for the next couple of days but I'd be happy to share more details once we're back. In any case, thanks a lot for looking into this!
Thanks for the update @rdrighetto. I will ask our Scientific Computing manager to install it here and see if it solves the problem for me as well.

Best,
Thanks for testing the fix @rdrighetto!
The longer runtime could be due to less efficient reading/writing of the `.pt` subtomos.
That is another interesting observation! As the seed was the same in both runs, I would have expected identical results as well! Do the training and validation curves look different as well? Have you tried to exactly reproduce results with the original `.mrc` subtomos?
Have a good time at the lab retreat! I am sure we'll eventually sort out all these subtomo-related problems! 🙂 🙏🏼
It seems that the torch branch also fixed the problem for me, at least on the dataset where I had the issue before. Thanks very much for sorting this!
That's great news @RHennellJames! Was the model fitting slower for you as well?
Hi @SimWdm, the per-epoch time looked to be the same. I'm not sure if the total time was longer, as it never got to completion for this dataset with the old code.
Hi @SimWdm,
They look very similar but not identical. Are there any random factors playing a role in the fitting process that are outside the scope of the enforced random seed?
I am carrying out this test right now, both for the original `.mrc` subtomos as well as for the `torch_subtomos`, and will report back soon 😉
Thanks for the update @rdrighetto!
Initially, I thought that everything should be seeded, but your question made me check again and I noticed that the random rotations of the sub-tomograms during model fitting are indeed not seeded (see […]). This should be easy to fix. For clarity, I will open another issue on the reproducibility and link this one. 🤓

Thank you for paying close attention to your experiments, which helps uncover these nasty little flaws! 🙂

Edit: The new issue is here #9 (comment)
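As a general illustration of the seeding idea (this is not DDW's actual code), a common way to seed the relevant random number generators in a PyTorch training script and hand a seeded generator to the augmentation that draws the rotations:

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> torch.Generator:
    """Seed Python, NumPy and PyTorch RNGs and return a seeded generator
    that augmentation code (e.g. the random rotations) can draw from."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and CUDA RNGs
    return torch.Generator().manual_seed(seed)


# Hypothetical usage: the augmentation draws rotation indices from the
# seeded generator instead of the global (unseeded) RNG.
generator = seed_everything(42)
rotation_idx = torch.randint(0, 24, (1,), generator=generator)
```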
I was running each test (MRC vs. torch subtomos) identically one more time, and in this second run the MRC subtomos ran into the sporadic error while the torch subtomos ran without problems.

Thanks again @SimWdm for the prompt responses here!
Let me summarise: as observed by @RHennellJames and @rdrighetto (thanks again to both of you!), the torch subtomo solution does not seem to cause any slowdowns and has not caused any crashes during data loading so far. I would therefore like to merge the torch subtomo branch into the main branch. Any objections? 🙂
Sounds good to me!
Hi, first of all, thanks so much for this discussion - it really helped me make sense of the errors I was getting!

Secondly, I wanted to share my experience: I tried both the MRC and Torch subtomo approaches, and both worked well in general. However, I found that the Torch subtomo solution succeeded for a tomogram where the MRC subtomo approach had previously failed due to the header error you described. I think I agree with @rdrighetto's assessment - it may be an I/O issue causing the MRC error.

That said, I ran into similar issues with a different dataset while using the `.pt` subtomos. The process ran for 779 epochs before failing, and I was able to restart from the last checkpoint and push it to completion, but it shows that a similar error is still possible with the Torch subtomo approach. Here's the relevant part of the error message, which to me feels like it may be related to I/O issues:

[…]
Apologies @SimWdm - I just saw that you merged the `.pt` subtomos approach to master as I was writing this. I am still inclined to agree that it probably makes sense - but worth keeping in mind the issue may not be completely gone :)
Thanks for sharing your experience @TomasPascoa! Any help is highly appreciated! 😅
@TomasPascoa confirms my fears have come true 😅
Hi @SimWdm, thanks for your help! Unfortunately I continue to experience problems similar to @TomasPascoa, so I just wanted to give you another data point. I have been trying to debug this with limited success. I printed out the path to the subtomo which causes this issue and can't seem to find any problems with it. It is the correct size and I can load and print out values if I load the subtomo by hand in a python shell. I will let you know if I manage to solve the issue, but it does seem to be quite random.
Thank you for joining the discussion and for looking into this @henrynjones!

I have just created a new branch with a `safe_load` hotfix that catches the loading error and simply retries loading the subtomo.
While this is definitely a bit hacky, I think it might work because, as @henrynjones reports, the subtomo that causes the error during model fitting can be safely loaded later on.

@henrynjones, @rdrighetto, @Phaips, @TomasPascoa please feel free to try this hotfix on the new branch. Let me know if you run into any issues!

Best,
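Since the actual `safe_load` code is not reproduced in this thread, here is a minimal sketch of the retry idea as described above (the retry count and wait time are assumptions, not DDW's actual values):

```python
import time

import torch


def safe_load(path: str, max_retries: int = 5, wait_seconds: float = 1.0):
    """Load a .pt subtomogram, retrying if a sporadic RuntimeError occurs."""
    for attempt in range(max_retries):
        try:
            return torch.load(path)
        except RuntimeError as exc:
            print(f"safe_load: attempt {attempt + 1} failed for {path}: {exc}")
            time.sleep(wait_seconds)  # give the filesystem a moment before retrying
    # Final attempt outside the loop so the original error propagates if it persists.
    return torch.load(path)
```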
@SimWdm I have started testing the new branch and am running into some issues, but it is hard for me to disentangle what might be a problem with my configuration (I am using the NERSC HPC system). I do usually seem to get this error within the `update_subtomo_missing_wedge` call (for debugging I made this happen after every epoch), but here I don't get a traceback at all, which is strange. It also seems strange that it only failed once and then seemed to continue until something found a problem. I also changed the print statements in […]
Thanks for testing the fix so quickly, @henrynjones! From the output, it looks like `safe_load` successfully prevented the RuntimeError we saw before, but another error seems to occur further down, without showing a clear error message. Is that correct?

Unfortunately, I don't have experience with NERSC HPC. It looks like the system uses SLURM, though. Would it be possible for you to test the fix directly on a server, without using SLURM?

One other thing I noticed in your output is that the progress bar doesn't seem to appear in chronological order and includes some strange characters. Did you see the same behavior with the code in the main branch, or is this new?
Thanks for implementing this @SimWdm! I have just tested the fix in two replicate runs on the same tomogram. In the first run, the RuntimeError came up once but was successfully prevented by your `safe_load` fix, as @henrynjones also reported. So that's great! And the job ran to completion. However, in my second run, I instead got an UnpicklingError:

[…]
So, it seems to me that perhaps the sporadic error was sometimes a RuntimeError and sometimes an UnpicklingError? And now the `safe_load` fix is only accounting for the RuntimeError? I'm tempted to modify two lines of the `safe_load` function (see below) and re-run the job until I see that it also successfully prevents the UnpicklingError.
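The suggested snippet was not preserved in this thread, but relative to the retry sketch shown earlier, the modification would presumably amount to widening the caught exception types, roughly like this:

```python
import pickle
import time

import torch


def safe_load(path: str, max_retries: int = 5, wait_seconds: float = 1.0):
    """Retry loading a .pt subtomogram on RuntimeError *or* UnpicklingError."""
    for attempt in range(max_retries):
        try:
            return torch.load(path)
        except (RuntimeError, pickle.UnpicklingError) as exc:  # widened exception tuple
            print(f"safe_load: attempt {attempt + 1} failed for {path}: {exc}")
            time.sleep(wait_seconds)
    return torch.load(path)
```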
What do you think?
Thanks for trying the fix @TomasPascoa! Glad to see that it worked at least once. I have just updated the branch so that `safe_load` now also catches the `UnpicklingError`.
Hi @henrynjones and @TomasPascoa, have you had the chance to try the new fix yet?
Hi @SimWdm, sorry for my delay. I managed to get the fix to work for me with a few retries during training, but I was still experiencing very slow training compared to what I observed on an interactive node on the cluster I use. I realized that I was observing the loading issue because I was explicitly initializing training with multiple tasks, which I think may have led to simultaneous file reading. So for me it was more of a beginner multi-threading/cluster user issue, and I had no issues and faster training by simply letting your code and pytorch handle things. Thanks!
Thanks for the update @henrynjones! 😄
Does that mean that after you applied the latest fix, your training runs complete without the error?
Hi @SimWdm, I am terribly sorry - somehow I missed the notification of your message! Yes, your fix seems to have solved it for me. I now routinely queue up ddw for a batch of tomos and it always runs nicely to completion. The issue seems fixed to me. No failed runs since then!
That is great news! It seems like we can finally close this issue for good! I will merge the branch with the fix into main. Huge thanks to @Phaips, @rdrighetto, @RHennellJames, @henrynjones and @TomasPascoa for your help and your patience! It's people like you who make open source so awesome! 🙏🏼 😊
Hi,
I often get this error during training:

[…]

which is then followed by this error:

[…]

The jobs will continue the training by setting

`resume_from_checkpoint: ".."`

to the last val_loss checkpoint. The weird thing is that not all jobs fail like this, only sometimes: sporadically very early in the process, or later, or never. Re-running it once (or multiple times), they eventually finish (and produce amazing tomograms! :D)

Thanks for the help :)
Cheers,
Philippe