
Sporadic error during training #8

Closed
Phaips opened this issue Jul 31, 2024 · 37 comments · Fixed by #12 or #24
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@Phaips

Phaips commented Jul 31, 2024

Hi,
I often get this error during training:

"/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_py
thon3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpret
er.py", line 192, in _read_header
    raise ValueError("Couldn't read enough bytes for MRC header")
ValueError: Couldn't read enough bytes for MRC header

which is then followed by this error:

FileNotFoundError: [Errno 2] No such file or directory: 'home/recons/ddw/8/logs/version_0/checkpoints/epoch/epoch=999.ckpt'

The jobs can continue training if I set resume_from_checkpoint: ".." to the last val_loss checkpoint. The weird thing is that not all jobs fail like this; it happens only sporadically, sometimes very early in the process, sometimes later, sometimes never. After re-running once (or multiple times), they eventually finish (and produce amazing tomograms! :D)

Thanks for the help :)
Cheers,
Philippe

@rdrighetto

Just wanted to note that the second error happens only because we have the prediction command (ddw refine-tomogram) in our script: if training fails, the checkpoint expected for prediction is not there, and prediction fails too.
The real error is the first one and indeed it seems to happen at random 🙁

@SimWdm
Member

SimWdm commented Jul 31, 2024

Hello Philippe and Ricardo,

thanks for opening this issue!

The error you describe sounds like something is wrong with reading and/or writing the model input and target subtomos as .mrc files. Unfortunately, I have no idea what that might be, and the fact that it happens at random makes it hard to debug, but I will try my best!

Is the error message you shared the complete one? If not, could you please share the full message? That might help me find a place to start.

Best,
Simon

@Phaips
Author

Phaips commented Jul 31, 2024

Hi Simon,

Thanks for looking into it nevertheless! The entire error is this:

ValueError: Caught ValueError in DataLoader worker process 9.
Original Traceback (most recent call last):
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = load_mrc_data(subtomo0_file)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/mrctools.py", line 12, in load_mrc_data
    with mrcfile.open(mrc_file, permissive=True) as mrc:
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
    return NewMrc(name, mode=mode, permissive=permissive,
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 115, in __init__
    self._read(header_only)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 131, in _read
    super(MrcFile, self)._read(header_only)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpreter.py", line 170, in _read
    self._read_header()
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpreter.py", line 192, in _read_header
    raise ValueError("Couldn't read enough bytes for MRC header")
ValueError: Couldn't read enough bytes for MRC header

and actually I just saw another one but which seems to originate from the same problem:

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = load_mrc_data(subtomo0_file)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/mrctools.py", line 14, in load_mrc_data
    data = torch.tensor(mrc.data)
RuntimeError: Could not infer dtype of NoneType

Cheers,
Philippe

@SimWdm
Member

SimWdm commented Aug 1, 2024

Hi Philippe,

thank you for sharing the full error messages. The problem is indeed related to reading and writing the headers of the subtomogram .mrc files. For that, DDW uses the mrcfile package, and from some research online it seems that other people have already had similar-looking issues with mrcfile and headers. I will look further into this today and tomorrow and will let you know as soon as I have made any progress.

In the meantime, if you are still using DDW and encounter another mrc-header-related error, it would be great if you could find out which subtomogram is causing the error and send me the corrupted file. This would be helpful for debugging, since I don't know what exactly is causing the header error.

To find the corrupted subtomogram after the error has occurred, you could, for example, try to open all subtomograms in your DDW fitting data directory (e.g. PATH_TO_DDW_PROJECT/subtomos/fitting_subtomos/) in Python using the mrcfile package; at least one of them should give you a header error. Let me know if you need help with this.
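In case it helps, here is a minimal sketch of such a check (the directory path is just a placeholder for your project, and the snippet assumes the subtomos are plain .mrc files):

import glob
import mrcfile

# Scan every subtomo .mrc file and report the ones whose header or data
# cannot be read. The directory below is a placeholder for your project.
subtomo_dir = "PATH_TO_DDW_PROJECT/subtomos/fitting_subtomos"
for path in sorted(glob.glob(f"{subtomo_dir}/**/*.mrc", recursive=True)):
    try:
        with mrcfile.open(path, permissive=True) as mrc:
            if mrc.data is None:
                print(f"{path}: header read, but no usable data block")
    except Exception as exc:
        print(f"{path}: {exc}")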

Best,
Simon

@SimWdm
Member

SimWdm commented Aug 5, 2024

Hi Philippe and Ricardo,

just a quick heads up: After thinking about this issue for a while, I think that the cleanest solution would be to save the subtomograms as .pt files using PyTorch, which gets rid of .mrc headers altogether.
I have already implemented this new strategy in a separate branch and am currently running a sanity check model fitting to make sure that the changes have not broken anything. Once the check is successful, you can test the changes on your data and if your problem is solved, I will merge the changes into main.
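For illustration, the idea is roughly the following (a sketch only, not the actual DDW code; the file name is made up):

import torch

# Store each subtomogram as a raw tensor in a .pt file, so no MRC header
# has to be written or parsed at all.
subtomo = torch.randn(64, 64, 64)      # stand-in for a real subtomogram volume
torch.save(subtomo, "subtomo0_0.pt")   # made-up file name
loaded = torch.load("subtomo0_0.pt")   # returns the tensor directly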

Best,
Simon

@Phaips
Author

Phaips commented Aug 5, 2024

Dear Simon,

Thanks a lot! I think the .pt files seem like a good fix! Happy to test it as soon as it passes the checks.
I tried to read all the fitting_subtomos as well as the val_subtomos from a job that had encountered the error above. However, the error did not pop up, and I was able to read all the .mrc files in Python without an error...

Kind regards,
Philippe

@SimWdm
Member

SimWdm commented Aug 6, 2024

Dear Philippe,

I have just pushed the version with the .pt subtomos to a new branch called torch_subtomos. You can test this version by switching to the torch_subtomos branch and reinstalling DDW (re-running pip install . inside the DeepDeWedge/ directory should be sufficient). If your installation was successful, all subtomos should be saved with a .pt extension.

If this fix has resolved your issue, I will merge the changes into the main branch.

Please let me know if you have any questions or problems!

Best,
Simon

@rdrighetto

Thanks a lot Simon! We will test it.

If I may give some extra feedback on this: as far as we know the issue happens at random, which I guess is related to the very intensive I/O when dealing with many subtomogram files. I'm not sure if just changing the file format will prevent this from happening again. My suspicion is some thermal fluctuation on the GPUs generates weird numbers (like NaN or None) in the subtomograms during fitting, which ultimately causes the MRC header calculation to fail.
Another possibility is related to our filesystem. The script we normally use to run ddw copies the entire "project" to a local SSD on the computing node to optimize performance. It could be some random failure in this process. Now I'm curious to see whether the problem also happens when not using the SSD but only the cluster's parallel filesystem. The drawback is that it will certainly be slower.
Anyway, happy to test your solution!

@SimWdm
Member

SimWdm commented Aug 6, 2024

My suspicion is some thermal fluctuation on the GPUs generates weird numbers (like NaN or None) in the subtomograms during fitting, which ultimately causes the MRC header calculation to fail.

That's an interesting point! If we indeed get random NaN or None values, I would probably have to implement some kind of checking routine that tries to load sub-tomograms and re-saves them if loading fails, or something along those lines.
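Something in this direction, for example (just a sketch of the idea, not DDW code; save_with_check is a made-up name):

import torch

def save_with_check(tensor, path, max_retries=3):
    # Sketch of a possible checking routine: write the subtomo, immediately
    # try to read it back, and re-save it if reading fails.
    for _ in range(max_retries):
        torch.save(tensor, path)
        try:
            torch.load(path)
            return
        except Exception as exc:
            print(f"Re-saving {path} after failed read-back: {exc}")
    raise RuntimeError(f"Could not write a readable subtomo to {path}")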

Anyway, happy to test your solution!

I am very curious about the outcome of your tests! Let me know if you have any problems.

@RHennellJames

I have also had this error with fit-model in the last week. My scripts overwrote the log file, but when I ran on the same tomogram several times it happened at various different epochs from 250-850. Most of the errors happened closer to epoch 850 than 250, which I think suggests that it's a random event that has a chance of happening every epoch (maybe more likely as time goes on). I'm interested to hear if the new branch fixed it.

I previously ran without any training issues on a different dataset, but with refine-tomogram I kept getting an "Errno 28: no space left on device" error. I managed to fix that by setting TMPDIR (in our SLURM system) to be in my filespace on our GPFS storage system instead of on the node.

@SimWdm
Member

SimWdm commented Aug 29, 2024

Thanks for sharing @RHennellJames! The fact that you get the same error suggests that the issue does not only occur for a certain hardware setup, which is good to know.

@Phaips @rdrighetto did switching to the torch subtomos resolve the issue in your experiments?

@rdrighetto

Hi @SimWdm!

We finally got to test the torch_subtomos branch. The good news is that it works. However, in a comparison with the vanilla code based on MRC subtomograms with identical settings, including the same random seed and running on the same hardware, the original code was noticeably faster (461min vs 521min when fitting for 1000 epochs). Also, while both results look good (predicting using the model from the last epoch), they are not identical, which puzzles me. They should look exactly the same, no? Am I missing something?

To summarize, the new code works, but it seems to be slower and we cannot really say whether it solves the issue, given the random nature of the problem. We will be on lab retreat for the next couple days but I'd be happy to share more details once we're back. In any case, thanks a lot for looking into this!

@RHennellJames

Thanks for the update @rdrighetto. I will ask our Scientific Computing manager to install it here and see if it solves the problem for me as well.

Best,
Rory

@SimWdm
Member

SimWdm commented Sep 2, 2024

Thanks for testing the fix @rdrighetto!

However, in a comparison with the vanilla code based on MRC subtomograms with identical settings, including the same random seed and running on the same hardware, the original code was noticeably faster (461min vs 521min when fitting for 1000 epochs).

The longer runtime could be due to less efficient reading/writing of the torch subtomos. I have to test this in an isolated environment to be on the safe side, and I will get back to you once I know more!

Also, while both results look good (predicting using the model from the last epoch), they are not identical, which puzzles me. They should look exactly the same, no? Am I missing something?

That is another interesting observation! As the seed was the same in both runs, I would have expected identical results as well! Do training and validation curves look different as well? Have you tried to exactly reproduce results with the .mrc subtomos before? If so, was it successful?

We will be on lab retreat for the next couple days but I'd be happy to share more details once we're back. In any case, thanks a lot for looking into this!

Have a good time at the lab retreat! I am sure we'll eventually sort out all these subtomo-related problems! 🙂 🙏🏼

@RHennellJames

It seems that the torch branch also fixed the problem for me, at least on the dateset where I had the issue before.

Thanks very much for sorting this!

@SimWdm SimWdm self-assigned this Sep 4, 2024
@SimWdm SimWdm added the "bug" label Sep 4, 2024
@SimWdm
Member

SimWdm commented Sep 4, 2024

That's great news @RHennellJames! Was the model fitting slower for you as well?

@RHennellJames

Hi @SimWdm, the per-epoch time looked to be the same. I'm not sure if the total time was longer, as it never got to completion for this dataset with the old code.

@rdrighetto

Hi @SimWdm,

Do training and validation curves look different as well?

They look very similar but not identical:

[attached plots: training/validation loss curves for the vanilla run with MRC subtomos ("test_vanilla") and for the torch_subtomos run ("test_torch_subtomos")]

Are there any random factors playing a role in the fitting process that are outside the scope of the enforced random seed?

Have you tried to exactly reproduce results with the .mrc subtomos before? If so, was it successful?

I am carrying out this test right now, both for the original .mrc subtomos as well as the torch_subtomos, will report back soon 😉

@SimWdm
Member

SimWdm commented Sep 7, 2024

Thanks for the update @rdrighetto!

Are there any random factors playing a role in the fitting process that are outside the scope of the enforced random seed?

Initially, I thought that everything should be seeded, but your question made me check again, and I noticed that the random rotation of the sub-tomograms during model fitting is indeed not seeded (see ddw.utils.subtomo_dataset.SubtomoDataset).
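For illustration, one way to make per-subtomo augmentations deterministic would be something like this (a sketch only, not the actual SubtomoDataset code; the names are made up):

import torch

def seeded_rotation_index(base_seed, item_index):
    # Derive a per-item generator from a base seed and the item index, so the
    # "random" rotation picked for each subtomo is reproducible across runs.
    gen = torch.Generator().manual_seed(base_seed * 100003 + item_index)
    # e.g. pick one of the 24 orientation-preserving rotations of a cube
    return torch.randint(0, 24, (1,), generator=gen).item()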

This should be easy to fix. For clarity, I will open another issue on reproducibility and link it to this one. 🤓

Thank you for paying close attention to your experiments, which helps uncover these nasty little flaws! 🙂


Edit: The new issue is here #9 (comment)

@rdrighetto

I was running each test (MRC vs. torch subtomos) identically one more time, and in this second run, the MRC subtomos ran into the sporadic error while the torch_subtomos branch ran to completion. So now I'm more confident that the torch subtomos are more stable than the MRC files (but to be honest, not yet 100% sure). Regarding performance, the torch subtomos now gave the best performance overall: 513 min, so I don't think there's any fundamental bottleneck in comparison to the old code for MRC subtomos -- probably just random fluctuations of our hardware :-)
Regarding the reproducibility, I do have an interesting observation but will continue that discussion in #9.

Thanks again @SimWdm for the prompt responses here!

@SimWdm
Member

SimWdm commented Sep 10, 2024

Let me summarise: As observed by @RHennellJames and @rdrighetto (thanks again to both of you!), the torch subtomo solution does not seem to cause any slowdowns and has not caused any crashes during data loading so far.
So I think it should be safe to merge the subtomo branch into main.

Any objections? 🙂

@rdrighetto

Sounds good to me!

@TomasPascoa

Hi, first of all, thanks so much for this discussion - it really helped me make sense of the errors I was getting!

Secondly, I wanted to share my experience: I tried both the MRC and Torch subtomo approaches, and both worked well in general. However, I found that the Torch subtomo solution succeeded for a tomogram where the MRC subtomo approach had previously failed due to the header error you described. I think I agree with @rdrighetto's assessment - it may be an I/O issue causing the MRC error.

That said, I ran into similar issues with a different dataset while using the .pt subtomos. The process ran for 779 epochs before failing, and I was able to restart from the last checkpoint and push it to completion, but it shows that a similar error is still possible with the Torch subtomo approach. Here's the relevant part of the error message, which to me feels like it may be related to I/O issues:

RuntimeError: Caught RuntimeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/ddw/utils/subtomo_dataset.py", line 60, in __getitem__
    subtomo0 = torch.load(subtomo0_file)
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 1072, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 480, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Epoch 779: 100%|██████████| 116/116 [01:51<00:00,  1.04it/s, loss=30, v_num=0, fitting_loss=29.10, val_loss=29.00]
real    951m23.637s
user    8650m1.955s
sys     198m50.308s

@SimWdm SimWdm reopened this Sep 12, 2024
@TomasPascoa

Apologies @SimWdm - I just saw that you merged the .pt subtomos approach to master as I was writing this. I am still inclined to agree that it probably makes sense - but worth keeping in mind the issue may not be completely gone :)

@SimWdm
Member

SimWdm commented Sep 12, 2024

Thanks for sharing your experience @TomasPascoa!
No reason to apologize: if the error persists (albeit in a different form), the issue should stay open, so I have re-opened it.

Any help is highly appreciated! 😅

@rdrighetto

@TomasPascoa confirms my fears have come true 😅

@SimWdm SimWdm added the "help wanted" label Sep 13, 2024
@SimWdm SimWdm pinned this issue Sep 13, 2024
@henrynjones

henrynjones commented Oct 24, 2024

Hi @SimWdm, thanks for your help! Unfortunately, I continue to experience problems similar to @TomasPascoa, so I just wanted to give you another data point. I have been trying to debug this with limited success. I printed out the path to the subtomo that causes this issue and can't seem to find any problems with it. It is the correct size, and I can load it and print out values by hand in a Python shell. I will let you know if I manage to solve the issue, but it does seem to be quite random.

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 335, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = torch.load(subtomo0_file)
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
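For reference, the by-hand check in a Python shell is essentially just this (the path is a placeholder for the suspect subtomo file):

import torch

# Load one suspect subtomo directly and inspect it; the path is a placeholder.
subtomo = torch.load("PATH_TO_SUSPECT_SUBTOMO.pt")
print(subtomo.shape, subtomo.dtype, subtomo.isnan().any())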

@SimWdm
Member

SimWdm commented Oct 25, 2024

Thank you for joining the discussion and for looking into this @henrynjones!

I have just created a new branch 8-sporadic-error-during-training in which I replaced the standard torch.load in the dataloader with a safe_load function that tries to load the subtomo and, if a RuntimeError occurs, waits for one second and then tries again:

import time
import torch

def safe_load(file_path, max_retries=3, delay=1):
    for attempt in range(max_retries):
        try:
            return torch.load(file_path)
        except RuntimeError as e:
            print(f"Error loading {file_path}")
            if attempt == max_retries - 1:
                raise e  # re-raise if it's the last attempt
            print(f"Retrying in {delay} seconds")
            time.sleep(delay)  # wait before retrying

While this is definitely a bit hacky, I think it might work because, as @henrynjones reports, the subtomo that causes the error during model fitting can be safely loaded later on.

@henrynjones, @rdrighetto, @Phaips, @TomasPascoa Please feel free to try this hotfix on the new 8-sporadic-error-during-training branch.
I suggest writing all command line output to a log file (e.g. ddw fit-model [YOUR ARGS HERE] > log.txt) so that we can see whether safe_load caught any RuntimeError.

Let me know if you run into any issues with the new branch!

Best,
Simon 😊

@henrynjones

henrynjones commented Oct 25, 2024

@SimWdm I have started testing the new branch and am running into some issues, but it is hard for me to disentangle what might be a problem with my configuration (I am using the NERSC HPC system). I do usually seem to get this error within the update_subtomo_missing_wedge call (for debugging, I made this happen after every epoch), but here I don't get a traceback at all, which is strange. It also seems strange that it only failed once and then seemed to continue until something found a problem. I also changed the print statements in safe_load to have flush=True to make sure I see things in time. I again checked that the problem file can be loaded without issues in a shell. Note that this is after the second epoch (epoch 1), so I had fully completed an earlier update_subtomo_missing_wedge call after epoch 0.

Epoch 1, global step 216: 'fitting_loss' reached 392410.87500 (best 392410.87500), saving model to 'data_prep/logs/version_2/checkpoints/fitting_loss/epoch=1-fitting_loss=392410.87500.ckpt' as top 5
Epoch 1, global step 216: 'val_loss' reached 389776.81250 (best 389776.81250), saving model to 'data_prep/logs/version_2/checkpoints/val_loss/epoch=1-val_loss=389776.81250-v1.ckpt' as top 3




Updating subtomo missing wedges:   0%|          | 0/135 [00:00<?, ?it/s]
Epoch 1: 100%|██████████| 135/135 [14:23<00:00,  6.40s/it, loss=3.9e+05, v_num=2, val_loss=3.9e+5, fitting_loss=3.92e+5]
Error loading ./data_prep/subtomos/fitting_subtomos/subtomo0/7.pt
Retrying in 1 seconds
[interleaved "Updating subtomo missing wedges" progress bars from the parallel srun tasks, advancing to 17/135 and 12/135]

Aborted.

Aborted.
Epoch 1: 100%|██████████| 135/135 [17:09<00:00,  7.62s/it, loss=3.9e+05, v_num=2, val_loss=3.9e+5, fitting_loss=3.92e+5]
srun: error: nid001469: task 2: Exited with exit code 1
srun: Terminating StepId=32196346.0
Updating subtomo missing wedges:  13%|█▎        | 18/135 [02:05<08:54,  4.57s/it]
srun: error: nid001469: task 0: Exited with exit code 1
slurmstepd: error: *** STEP 32196346.0 ON nid001469 CANCELLED AT 2024-10-25T21:15:51 ***

@SimWdm
Member

SimWdm commented Oct 26, 2024

Thanks for testing the fix so quickly, @henrynjones!

From the output, it looks like safe_load successfully prevented the RuntimeError we saw before, but another error seems to occur further down, without showing a clear error message. Is that correct?

Unfortunately, I don’t have experience with NERSC HPC. It looks like the system uses SLURM, though. Would it be possible for you to test the fix directly on a server, without using SLURM?

One other thing I noticed in your output is that the progress bar doesn’t seem to appear in chronological order and includes some strange characters. Did you see the same behavior with the code in the main branch, or is this new?

@TomasPascoa

TomasPascoa commented Oct 27, 2024

Thanks for implementing this @SimWdm!

I have just tested the fix in two replicate runs on the same tomogram. In the first run, the RuntimeError came up once but was successfully prevented by your safe_load fix, as @henrynjones also reported. So that's great! And the job ran to completion.

However, in my second run, I instead got an UnpicklingError:

UnpicklingError: Caught UnpicklingError in DataLoader worker process 6.
Original Traceback (most recent call last):
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 74, in __getitem__
    subtomo0 = safe_load(subtomo0_file)
  File "/home/dacostac/.conda/envs/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 19, in safe_load
    return torch.load(file_path)
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 1114, in load
    return _legacy_load(
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 1338, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified.

So, it seems to me that perhaps the sporadic error was sometimes a RuntimeError and sometimes an UnpicklingError? And the safe_load fix currently only accounts for RuntimeError?

I'm tempted to modify two lines of the safe_load function (see below) and re-run the job until I see that it also successfully prevents the UnpicklingError (this would also need a from pickle import UnpicklingError at the top of the file):

    except (RuntimeError, UnpicklingError) as e:
        print(f"Error loading {file_path}: {e}")

What do you think?

@SimWdm
Member

SimWdm commented Oct 28, 2024

Thanks for trying the fix @TomasPascoa! Glad to see that worked at least once.

I have just updated the branch, and safe_load should now catch any exception (except Exception as e:). Hopefully this handles the UnpicklingError as well 🙏🏼.
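For reference, the updated function now looks roughly like this (a sketch of the change, keeping the same retry structure as before):

import time
import torch

def safe_load(file_path, max_retries=3, delay=1):
    for attempt in range(max_retries):
        try:
            return torch.load(file_path)
        except Exception as e:  # now catches RuntimeError, UnpicklingError, etc.
            print(f"Error loading {file_path}: {e}")
            if attempt == max_retries - 1:
                raise e  # re-raise on the last attempt
            print(f"Retrying in {delay} seconds")
            time.sleep(delay)  # wait before retrying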

@SimWdm
Member

SimWdm commented Nov 28, 2024

Hi, @henrynjones and @TomasPascoa have you had the chance to try the new fix yet?
No worries if not—I'm just checking to see if the issue has been finally resolved for good. 😅 🙏🏼

@henrynjones

Hi, @henrynjones and @TomasPascoa have you had the chance to try the new fix yet? No worries if not—I'm just checking to see if the issue has been finally resolved for good. 😅 🙏🏼

Hi @SimWdm, sorry for my delay. I managed to get the fix to work for me with a few retries during training, but I was still experiencing very slow training compared to what I had observed on an interactive node on the cluster I use. I realized that I was seeing the loading issue because I was explicitly initializing training with multiple tasks, which I think may have led to simultaneous file reading. So for me it was more of a beginner multi-threading / cluster-user issue, and I had no issues and faster training by simply letting your code and PyTorch handle things. Thanks!

@SimWdm
Copy link
Member

SimWdm commented Nov 29, 2024

Thanks for the update @henrynjones! 😄

I managed to get the fix to work for me with a few retries during training

Does that mean that after you applied the latest try: ... except: fix, the code still crashed a few times and you had to restart it? If so, did you manage to find out if the crash was related to the loading bug or if it was caused by something else that is specifically related to your system, like SLURM?

@TomasPascoa

Hi @SimWdm

I am terribly sorry - somehow I missed the notification of your message!

Yes, your fix seems to have solved it for me. I now routinely queue up ddw for a batch of tomos and it always runs nicely to completion. The issue seems fixed to me. No failed runs since then!

@SimWdm
Member

SimWdm commented Jan 14, 2025

This is great news! It seems like we can finally close this issue for good! I will merge the branch with the fix into main.

Huge thanks to @Phaips, @rdrighetto, @RHennellJames, @henrynjones and @TomasPascoa for your help and your patience! It's people like you who make open source so awesome! 🙏🏼 😊

@SimWdm SimWdm linked a pull request Jan 14, 2025 that will close this issue
@SimWdm SimWdm unpinned this issue Jan 14, 2025