
Sporadic error during training #8

Closed
Phaips opened this issue Jul 31, 2024 · 37 comments · Fixed by #12 or #24
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@Phaips

Phaips commented Jul 31, 2024

Hi,
I often get this error during training:

"/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_py
thon3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpret
er.py", line 192, in _read_header
    raise ValueError("Couldn't read enough bytes for MRC header")
ValueError: Couldn't read enough bytes for MRC header

which is then followed by this error:

FileNotFoundError: [Errno 2] No such file or directory: 'home/recons/ddw/8/logs/version_0/checkpoints/epoch/epoch=999.ckpt'

The jobs can continue training if I set resume_from_checkpoint: ".." to the last val_loss checkpoint. The weird thing is that not all jobs fail like this; it happens only sporadically, sometimes very early in the process, sometimes later, sometimes never. After re-running once (or multiple times), they eventually finish (and produce amazing tomograms! :D)

Thanks for the help :)
Cheers,
Philippe

@rdrighetto

Just wanted to note that the second error happens only because we have the prediction command (ddw refine-tomogram) in our script: if training fails, the checkpoint expected for prediction is not there, and prediction fails too.
The real error is the first one and indeed it seems to happen at random 🙁

@SimWdm
Member

SimWdm commented Jul 31, 2024

Hello Philippe and Ricardo,

thanks for opening this issue!

The error you describe sounds like something is wrong with reading and/or writing the model input and target subtomos as .mrc files. Unfortunately, I have no idea what that might be, and the fact that it happens at random makes it hard to debug, but I will try my best!

Is the error message you shared the complete one? If not, could you please share the full message? That might help me find a place to start.

Best,
Simon

@Phaips
Author

Phaips commented Jul 31, 2024

Hi Simon,

Thanks for looking into it nevertheless! The entire error is this:

ValueError: Caught ValueError in DataLoader worker process 9.
Original Traceback (most recent call last):
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = load_mrc_data(subtomo0_file)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/mrctools.py", line 12, in load_mrc_data
    with mrcfile.open(mrc_file, permissive=True) as mrc:
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
    return NewMrc(name, mode=mode, permissive=permissive,
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 115, in __init__
    self._read(header_only)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 131, in _read
    super(MrcFile, self)._read(header_only)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpreter.py", line 170, in _read
    self._read_header()
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpreter.py", line 192, in _read_header
    raise ValueError("Couldn't read enough bytes for MRC header")
ValueError: Couldn't read enough bytes for MRC header

and actually I just saw another one but which seems to originate from the same problem:

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = load_mrc_data(subtomo0_file)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/mrctools.py", line 14, in load_mrc_data
    data = torch.tensor(mrc.data)
RuntimeError: Could not infer dtype of NoneType

Cheers,
Philippe

@SimWdm
Member

SimWdm commented Aug 1, 2024

Hi Philippe,

thank you for sharing the full error messages. The problem is indeed related to reading and writing the headers of the subtomogram .mrc files. For that, DDW uses the mrcfile package, and from some research online it seems that other people have already had similar-looking issues with mrcfile and headers. I will look further into this today and tomorrow and will let you know as soon as I have made any progress.

In the meantime, if you are still using DDW and encounter another mrc-header-related error, it would be great if you could find out which subtomogram is causing the error and send me the corrupted file. This would be helpful for debugging, since I don't know what exactly is causing the header error.

To find the corrupted subtomogram after the error has occurred, you could, for example, try to open all subtomograms in your DDW fitting data directory (e.g. PATH_TO_DDW_PROJECT/subtomos/fitting_subtomos/) in Python using the mrcfile package; at least one of them should give you a header error. Let me know if you need help with this.
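In case it helps, here is a minimal sketch of such a check (the directory path is just a placeholder for your project, and the snippet assumes the subtomos are plain .mrc files):

import glob
import mrcfile

# Scan every subtomo .mrc file and report the ones whose header or data
# cannot be read. The directory below is a placeholder for your project.
subtomo_dir = "PATH_TO_DDW_PROJECT/subtomos/fitting_subtomos"
for path in sorted(glob.glob(f"{subtomo_dir}/**/*.mrc", recursive=True)):
    try:
        with mrcfile.open(path, permissive=True) as mrc:
            if mrc.data is None:
                print(f"{path}: header read, but no usable data block")
    except Exception as exc:
        print(f"{path}: {exc}")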

Best,
Simon

@SimWdm
Member

SimWdm commented Aug 5, 2024

Hi Philippe and Ricardo,

just a quick heads up: After thinking about this issue for a while, I think that the cleanest solution would be to save the subtomograms as .pt files using PyTorch, which gets rid of .mrc headers altogether.
I have already implemented this new strategy in a separate branch and am currently running a sanity check model fitting to make sure that the changes have not broken anything. Once the check is successful, you can test the changes on your data and if your problem is solved, I will merge the changes into main.
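For illustration, the idea is roughly the following (a sketch only, not the actual DDW code; the file name is made up):

import torch

# Store each subtomogram as a raw tensor in a .pt file, so no MRC header
# has to be written or parsed at all.
subtomo = torch.randn(64, 64, 64)      # stand-in for a real subtomogram volume
torch.save(subtomo, "subtomo0_0.pt")   # made-up file name
loaded = torch.load("subtomo0_0.pt")   # returns the tensor directly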

Best,
Simon

@Phaips
Author

Phaips commented Aug 5, 2024

Dear Simon,

Thanks a lot! I think the .pt files seem like a good fix! Happy to test it as soon as it passes the checks.
I tried to read all the fitting_subtomos as well as the val_subtomos from a job that had encountered the error above. However, the error did not pop up, and I was able to read all the .mrc files in Python without an error...

Kind regards,
Philippe

@SimWdm
Member

SimWdm commented Aug 6, 2024

Dear Philippe,

I have just pushed the version with the .pt subtomos to a new branch called torch_subtomos. You can test this version by switching to the torch_subtomos branch and reinstalling DDW (re-running pip install . inside the DeepDeWedge/ directory should be sufficient). If your installation was successful, all subtomos should be saved with a .pt extension.

If this fix has resolved your issue, I will merge the changes into the main branch.

Please let me know if you have any questions or problems!

Best,
Simon

@rdrighetto

Thanks a lot Simon! We will test it.

If I may give some extra feedback on this: as far as we know the issue happens at random, which I guess is related to the very intensive I/O when dealing with many subtomogram files. I'm not sure if just changing the file format will prevent this from happening again. My suspicion is some thermal fluctuation on the GPUs generates weird numbers (like NaN or None) in the subtomograms during fitting, which ultimately causes the MRC header calculation to fail.
Another possibility is related to our filesystem. The script we normally use to run ddw copies the entire "project" to a local SSD on the computing node to optimize performance. It could be some random failure in this process. Now I'm curious to see whether the problem also happens when not using the SSD but only the cluster's parallel filesystem. The drawback is that it will certainly be slower.
Anyway, happy to test your solution!

@SimWdm
Member

SimWdm commented Aug 6, 2024

My suspicion is some thermal fluctuation on the GPUs generates weird numbers (like NaN or None) in the subtomograms during fitting, which ultimately causes the MRC header calculation to fail.

That's an interesting point! If we indeed get random NaN or None values, I would probably have to implement some kind of checking routine that tries to load sub-tomograms and re-saves them if loading fails, or something along those lines.
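Something in this direction, for example (just a sketch of the idea, not DDW code; save_with_check is a made-up name):

import torch

def save_with_check(tensor, path, max_retries=3):
    # Sketch of a possible checking routine: write the subtomo, immediately
    # try to read it back, and re-save it if reading fails.
    for _ in range(max_retries):
        torch.save(tensor, path)
        try:
            torch.load(path)
            return
        except Exception as exc:
            print(f"Re-saving {path} after failed read-back: {exc}")
    raise RuntimeError(f"Could not write a readable subtomo to {path}")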

Anyway, happy to test your solution!

I am very curious about the outcome of your tests! Let me know if you have any problems.

@RHennellJames

I have also had this error with fit-model in the last week. My scripts overwrote the log file, but when I ran on the same tomogram several times it happened at various different epochs from 250-850. Most of the errors happened closer to epoch 850 than 250, which I think suggests that it's a random event that has a chance of happening every epoch (maybe more likely as time goes on). I'm interested to hear if the new branch fixed it.

I previously ran without any training issues on a different dataset, but with refine-tomogram I kept getting an "Errno 28: no space left on device" error. I managed to fix that by setting TMPDIR (in our SLURM system) to be in my filespace on our GPFS storage system instead of on the node.

@SimWdm
Member

SimWdm commented Aug 29, 2024

Thanks for sharing @RHennellJames! The fact that you get the same error suggests that the issue does not only occur for a certain hardware setup, which is good to know.

@Phaips @rdrighetto did switching to the torch subtomos resolve the issue in your experiments?

@rdrighetto

Hi @SimWdm!

We finally got to test the torch_subtomos branch. The good news is that it works. However, in a comparison with the vanilla code based on MRC subtomograms with identical settings, including the same random seed and running on the same hardware, the original code was noticeably faster (461min vs 521min when fitting for 1000 epochs). Also, while both results look good (predicting using the model from the last epoch), they are not identical, which puzzles me. They should look exactly the same, no? Am I missing something?

To summarize, the new code works, but it seems to be slower and we cannot really say whether it solves the issue, given the random nature of the problem. We will be on lab retreat for the next couple days but I'd be happy to share more details once we're back. In any case, thanks a lot for looking into this!

@RHennellJames

Thanks for the update @rdrighetto. I will ask our Scientific Computing manager to install it here and see if it solves the problem for me as well.

Best,
Rory

@SimWdm
Member

SimWdm commented Sep 2, 2024

Thanks for testing the fix @rdrighetto!

However, in a comparison with the vanilla code based on MRC subtomograms with identical settings, including the same random seed and running on the same hardware, the original code was noticeably faster (461min vs 521min when fitting for 1000 epochs).

The longer runtime could be due to less efficient reading/writing of the torch subtomos. I have to test this in an isolated environment to be on the safe side, and I will get back to you once I know more!

Also, while both results look good (predicting using the model from the last epoch), they are not identical, which puzzles me. They should look exactly the same, no? Am I missing something?

That is another interesting observation! As the seed was the same in both runs, I would have expected identical results as well! Do training and validation curves look different as well? Have you tried to exactly reproduce results with the .mrc subtomos before? If so, was it successful?

We will be on lab retreat for the next couple days but I'd be happy to share more details once we're back. In any case, thanks a lot for looking into this!

Have a good time at the lab retreat! I am sure we'll eventually sort out all these subtomo-related problems! 🙂 🙏🏼

@RHennellJames

It seems that the torch branch also fixed the problem for me, at least on the dateset where I had the issue before.

Thanks very much for sorting this!

@SimWdm SimWdm self-assigned this Sep 4, 2024
@SimWdm SimWdm added the "bug" label Sep 4, 2024
@SimWdm
Member

SimWdm commented Sep 4, 2024

That's great news @RHennellJames! Was the model fitting slower for you as well?

@RHennellJames

Hi @SimWdm, the per-epoch time looked to be the same. I'm not sure if the total time was longer, as it never got to completion for this dataset with the old code.

@rdrighetto

Hi @SimWdm,

Do training and validation curves look different as well?

They look very similar but not identical:

[attached plots: training/validation loss curves for the vanilla run with MRC subtomos ("test_vanilla") and for the torch_subtomos run ("test_torch_subtomos")]

Are there any random factors playing a role in the fitting process that are outside the scope of the enforced random seed?

Have you tried to exactly reproduce results with the .mrc subtomos before? If so, was it successful?

I am carrying out this test right now, both for the original .mrc subtomos as well as the torch_subtomos, will report back soon 😉

@SimWdm
Member

SimWdm commented Sep 7, 2024

Thanks for the update @rdrighetto!

Are there any random factors playing a role in the fitting process that are outside the scope of the enforced random seed?

Initially, I thought that everything should be seeded, but your question made me check again, and I noticed that the random rotation of the sub-tomograms during model fitting is indeed not seeded (see ddw.utils.subtomo_dataset.SubtomoDataset).
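For illustration, one way to make per-subtomo augmentations deterministic would be something like this (a sketch only, not the actual SubtomoDataset code; the names are made up):

import torch

def seeded_rotation_index(base_seed, item_index):
    # Derive a per-item generator from a base seed and the item index, so the
    # "random" rotation picked for each subtomo is reproducible across runs.
    gen = torch.Generator().manual_seed(base_seed * 100003 + item_index)
    # e.g. pick one of the 24 orientation-preserving rotations of a cube
    return torch.randint(0, 24, (1,), generator=gen).item()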

This should be easy to fix. For clarity, I will open another issue on reproducibility and link it to this one. 🤓

Thank you for paying close attention to your experiments, which helps uncover these nasty little flaws! 🙂


Edit: The new issue is here #9 (comment)

@rdrighetto

I was running each test (MRC vs. torch subtomos) identically one more time, and in this second run, the MRC subtomos ran into the sporadic error while the torch_subtomos branch ran to completion. So now I'm more confident that the torch subtomos are more stable than the MRC files (but to be honest, not yet 100% sure). Regarding performance, the torch subtomos now gave the best performance overall: 513 min, so I don't think there's any fundamental bottleneck in comparison to the old code for MRC subtomos -- probably just random fluctuations of our hardware :-)
Regarding the reproducibility, I do have an interesting observation but will continue that discussion in #9.

Thanks again @SimWdm for the prompt responses here!

@SimWdm
Member

SimWdm commented Sep 10, 2024

Let me summarise: As observed by @RHennellJames and @rdrighetto (thanks again to both of you!), the torch subtomo solution does not seem to cause any slowdowns and has not caused any crashes during data loading so far.
So I think it should be safe to merge the subtomo branch into main.

Any objections? 🙂

@rdrighetto

Sounds good to me!

@TomasPascoa

Hi, first of all, thanks so much for this discussion - it really helped me make sense of the errors I was getting!

Secondly, I wanted to share my experience: I tried both the MRC and Torch subtomo approaches, and both worked well in general. However, I found that the Torch subtomo solution succeeded for a tomogram where the MRC subtomo approach had previously failed due to the header error you described. I think I agree with @rdrighetto's assessment - it may be an I/O issue causing the MRC error.

That said, I ran into similar issues with a different dataset while using the .pt subtomos. The process ran for 779 epochs before failing, and I was able to restart from the last checkpoint and push it to completion, but it shows that a similar error is still possible with the Torch subtomo approach. Here's the relevant part of the error message, which to me feels like it may be related to I/O issues:

RuntimeError: Caught RuntimeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/ddw/utils/subtomo_dataset.py", line 60, in __getitem__
    subtomo0 = torch.load(subtomo0_file)
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 1072, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 480, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Epoch 779: 100%|██████████| 116/116 [01:51<00:00,  1.04it/s, loss=30, v_num=0, fitting_loss=29.10, val_loss=29.00]
real    951m23.637s
user    8650m1.955s
sys     198m50.308s

@SimWdm SimWdm reopened this Sep 12, 2024
@TomasPascoa

Apologies @SimWdm - I just saw that you merged the .pt subtomos approach to master as I was writing this. I am still inclined to agree that it probably makes sense - but worth keeping in mind the issue may not be completely gone :)

@SimWdm
Member

SimWdm commented Sep 12, 2024

Thanks for sharing your experience @TomasPascoa!
No reason to apologize: if the error persists (albeit in a different form), the issue should stay open, so I have re-opened it.

Any help is highly appreciated! 😅

@rdrighetto

@TomasPascoa confirms my fears have come true 😅

@SimWdm SimWdm added the "help wanted" label Sep 13, 2024
@SimWdm SimWdm pinned this issue Sep 13, 2024
@henrynjones

henrynjones commented Oct 24, 2024

Hi @SimWdm, thanks for your help! Unfortunately, I continue to experience problems similar to @TomasPascoa, so I just wanted to give you another data point. I have been trying to debug this with limited success. I printed out the path to the subtomo that causes this issue and can't seem to find any problems with it. It is the correct size, and I can load it and print out values by hand in a Python shell. I will let you know if I manage to solve the issue, but it does seem to be quite random.

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 335, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = torch.load(subtomo0_file)
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
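For reference, the by-hand check in a Python shell is essentially just this (the path is a placeholder for the suspect subtomo file):

import torch

# Load one suspect subtomo directly and inspect it; the path is a placeholder.
subtomo = torch.load("PATH_TO_SUSPECT_SUBTOMO.pt")
print(subtomo.shape, subtomo.dtype, subtomo.isnan().any())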

@SimWdm
Member

SimWdm commented Oct 25, 2024

Thank you for joining the discussion and for looking into this @henrynjones!

I have just created a new branch 8-sporadic-error-during-training in which I replaced the standard torch.load in the dataloader with a safe_load function that tries to load the subtomo and, if a RuntimeError occurs, waits for one second and then tries again:

import time
import torch

def safe_load(file_path, max_retries=3, delay=1):
    for attempt in range(max_retries):
        try:
            return torch.load(file_path)
        except RuntimeError as e:
            print(f"Error loading {file_path}")
            if attempt == max_retries - 1:
                raise e  # re-raise if it's the last attempt
            print(f"Retrying in {delay} seconds")
            time.sleep(delay)  # wait before retrying

While this is definitely a bit hacky, I think it might work because, as @henrynjones reports, the subtomo that causes the error during model fitting can be safely loaded later on.

@henrynjones, @rdrighetto, @Phaips, @TomasPascoa Please feel free to try this hotfix on the new 8-sporadic-error-during-training branch.
I suggest writing all command line output to a log file (e.g. ddw fit-model [YOUR ARGS HERE] > log.txt) so that we can see whether safe_load caught any RuntimeError.

Let me know if you run into any issues with the new branch!

Best,
Simon 😊

@henrynjones

henrynjones commented Oct 25, 2024

@SimWdm I have started testing the new branch and am running into some issues, but it is hard for me to disentangle what might be a problem with my configuration (I am using the NERSC HPC system). I do usually seem to get this error within the update_subtomo_missing_wedge call (for debugging, I made this happen after every epoch), but here I don't get a traceback at all, which is strange. It also seems strange that it only failed once and then seemed to continue until something found a problem. I also changed the print statements in safe_load to have flush=True to make sure I see things in time. I again checked that the problem file can be loaded without issues in a shell. Note that this is after the second epoch (epoch 1), so I had fully completed an earlier update_subtomo_missing_wedge call after epoch 0.

Epoch 1, global step 216: 'fitting_loss' reached 392410.87500 (best 392410.87500), saving model to 'data_prep/logs/version_2/checkpoints/fitting_loss/epoch=1-fitting_loss=392410.87500.ckpt' as top 5
Epoch 1, global step 216: 'val_loss' reached 389776.81250 (best 389776.81250), saving model to 'data_prep/logs/version_2/checkpoints/val_loss/epoch=1-val_loss=389776.81250-v1.ckpt' as top 3




Updating subtomo missing wedges:   0%|          | 0/135 [00:00<?, ?it/s]
Epoch 1: 100%|██████████| 135/135 [14:23<00:00,  6.40s/it, loss=3.9e+05, v_num=2, val_loss=3.9e+5, fitting_loss=3.92e+5]
Error loading ./data_prep/subtomos/fitting_subtomos/subtomo0/7.pt
Retrying in 1 seconds
[interleaved "Updating subtomo missing wedges" progress bars from the parallel srun tasks, advancing to 17/135 and 12/135]

Aborted.

Aborted.
Epoch 1: 100%|██████████| 135/135 [17:09<00:00,  7.62s/it, loss=3.9e+05, v_num=2, val_loss=3.9e+5, fitting_loss=3.92e+5]
srun: error: nid001469: task 2: Exited with exit code 1
srun: Terminating StepId=32196346.0
Updating subtomo missing wedges:  13%|█▎        | 18/135 [02:05<08:54,  4.57s/it]
srun: error: nid001469: task 0: Exited with exit code 1
slurmstepd: error: *** STEP 32196346.0 ON nid001469 CANCELLED AT 2024-10-25T21:15:51 ***

@SimWdm
Member

SimWdm commented Oct 26, 2024

Thanks for testing the fix so quickly, @henrynjones!

From the output, it looks like safe_load successfully prevented the RuntimeError we saw before, but another error seems to occur further down, without showing a clear error message. Is that correct?

Unfortunately, I don’t have experience with NERSC HPC. It looks like the system uses SLURM, though. Would it be possible for you to test the fix directly on a server, without using SLURM?

One other thing I noticed in your output is that the progress bar doesn’t seem to appear in chronological order and includes some strange characters. Did you see the same behavior with the code in the main branch, or is this new?

@TomasPascoa

TomasPascoa commented Oct 27, 2024

Thanks for implementing this @SimWdm!

I have just tested the fix in two replicate runs on the same tomogram. In the first run, the RuntimeError came up once but was successfully prevented by your safe_load fix, as @henrynjones also reported. So that's great! And the job ran to completion.

However, in my second run, I instead got an UnpicklingError:

UnpicklingError: Caught UnpicklingError in DataLoader worker process 6.
Original Traceback (most recent call last):
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 74, in __getitem__
    subtomo0 = safe_load(subtomo0_file)
  File "/home/dacostac/.conda/envs/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 19, in safe_load
    return torch.load(file_path)
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 1114, in load
    return _legacy_load(
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 1338, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified.

So, it seems to me that perhaps the sporadic error was sometimes a RuntimeError and sometimes an UnpicklingError? And the safe_load fix currently only accounts for RuntimeError?

I'm tempted to modify two lines of the safe_load function (see below) and re-run the job until I see that it also successfully prevents the UnpicklingError (this would also need a from pickle import UnpicklingError at the top of the file):

    except (RuntimeError, UnpicklingError) as e:
        print(f"Error loading {file_path}: {e}")

What do you think?

@SimWdm
Member

SimWdm commented Oct 28, 2024

Thanks for trying the fix @TomasPascoa! Glad to see that worked at least once.

I have just updated the branch, and safe_load should now catch any exception (except Exception as e:). Hopefully this handles the UnpicklingError as well 🙏🏼.
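For reference, the updated function now looks roughly like this (a sketch of the change, keeping the same retry structure as before):

import time
import torch

def safe_load(file_path, max_retries=3, delay=1):
    for attempt in range(max_retries):
        try:
            return torch.load(file_path)
        except Exception as e:  # now catches RuntimeError, UnpicklingError, etc.
            print(f"Error loading {file_path}: {e}")
            if attempt == max_retries - 1:
                raise e  # re-raise on the last attempt
            print(f"Retrying in {delay} seconds")
            time.sleep(delay)  # wait before retrying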

@SimWdm
Member

SimWdm commented Nov 28, 2024

Hi, @henrynjones and @TomasPascoa have you had the chance to try the new fix yet?
No worries if not—I'm just checking to see if the issue has been finally resolved for good. 😅 🙏🏼

@henrynjones

Hi, @henrynjones and @TomasPascoa have you had the chance to try the new fix yet? No worries if not—I'm just checking to see if the issue has been finally resolved for good. 😅 🙏🏼

Hi @SimWdm, sorry for my delay. I managed to get the fix to work for me with a few retries during training, but I was still experiencing very slow training compared to what I had observed on an interactive node on the cluster I use. I realized that I was seeing the loading issue because I was explicitly initializing training with multiple tasks, which I think may have led to simultaneous file reading. So for me it was more of a beginner multi-threading / cluster-user issue, and I had no issues and faster training by simply letting your code and PyTorch handle things. Thanks!

@SimWdm
Copy link
Member

SimWdm commented Nov 29, 2024

Thanks for the update @henrynjones! 😄

I managed to get the fix to work for me with a few retries during training

Does that mean that after you applied the latest try: ... except: fix, the code still crashed a few times and you had to restart it? If so, did you manage to find out if the crash was related to the loading bug or if it was caused by something else that is specifically related to your system, like SLURM?

@TomasPascoa

Hi @SimWdm

I am terribly sorry - somehow I missed the notification of your message!

Yes, your fix seems to have solved it for me. I now routinely queue up ddw for a batch of tomos and it always runs nicely to completion. The issue seems fixed to me. No failed runs since then!

@SimWdm
Member

SimWdm commented Jan 14, 2025

This is great news! It seems like we can finally close this issue for good! I will merge the branch with the fix into main.

Huge thanks to @Phaips, @rdrighetto, @RHennellJames, @henrynjones and @TomasPascoa for your help and your patience! It's people like you who make open source so awesome! 🙏🏼 😊

@SimWdm SimWdm linked a pull request Jan 14, 2025 that will close this issue
@SimWdm SimWdm unpinned this issue Jan 14, 2025