-
Hi, can you please try loading the WSI images with the TiffFile backend instead of the cuCIM backend used in the tutorial? Replace the backend with TiffFile, and make sure that you import WSIReader from monai.data.image_reader (in both the train and validation loader transforms).
Also, if you don't have the tifffile package installed, please install it first. Let me know if it works.
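A minimal sketch of that change, assuming an older MONAI release in which `WSIReader` is importable from `monai.data.image_reader` and accepts `backend="TiffFile"`. The `level` and `image_only` values and both helper names are illustrative assumptions, not the tutorial's exact code. MONAI is imported lazily so the settings can be inspected without it installed.

```python
import importlib.util


def wsi_load_kwargs(backend="TiffFile"):
    """Keyword arguments for LoadImaged (hypothetical helper, assumed values)."""
    return {
        "keys": ["image"],
        "backend": backend,  # replaces the cuCIM backend; forwarded to WSIReader
        "level": 1,          # assumed pyramid level
        "image_only": True,
    }


def build_wsi_loader(backend="TiffFile"):
    """Build the load transform; needs monai and tifffile to be installed."""
    if importlib.util.find_spec("tifffile") is None:
        raise ImportError("tifffile is not installed; run `pip install tifffile`")
    # Imported here so this module loads without MONAI. The import path is an
    # assumption and depends on the MONAI version in use.
    from monai.data.image_reader import WSIReader
    from monai.transforms import LoadImaged

    return LoadImaged(reader=WSIReader, **wsi_load_kwargs(backend))
```

The same transform would go into both the train and the validation pipeline.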
-
I made the requested changes, but the issue persists. I did have to change the import from wsi to image. I am going to try reducing the batch and loader size. I noticed that the process was using a great deal of memory; perhaps there is some thrashing going on. Cody
-
I disabled distributed training and limited the number of workers, which seemed to reduce memory usage. I noticed that, for a single worker on a single GPU, there were still spikes of 48 threads (the number of physical cores) using lots of memory every second or so. Unfortunately, even with these steps, exiting the application would still trigger the soft lockup errors. The good news is that setting pin_memory=False on the DataLoader seems to have taken care of the issue. There must be something going on between the DataLoader and the backend loaders that is causing it. Memory usage is under control, training works, and the application exits as expected.
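For anyone trying the same workaround, here is a rough sketch of the loader settings described above. The helper names, batch size, and worker count are assumptions for illustration; only `pin_memory=False` and the reduced worker count come from this thread. `torch` is imported lazily so the settings can be checked on their own.

```python
def loader_kwargs(batch_size=1, num_workers=1):
    """DataLoader settings for the workaround (hypothetical helper, assumed values)."""
    return {
        "batch_size": batch_size,
        "num_workers": num_workers,  # limited worker count
        "pin_memory": False,         # the setting that stopped the lockups here
        "shuffle": True,
    }


def build_loader(dataset, **overrides):
    """Create a DataLoader from the settings above; requires PyTorch."""
    from torch.utils.data import DataLoader

    settings = {**loader_kwargs(), **overrides}
    return DataLoader(dataset, **settings)
```

Whether page-locked memory is the real cause is not established here; this only records the configuration that worked.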
-
I don't know if this is a MONAI issue or not, but I wanted to see if anyone else was experiencing this. When running the MONAI Pathology MIL tutorial (https://github.com/Project-MONAI/tutorials/tree/main/pathology/multiple_instance_learning) on my data, things seem to work just fine at first. Then a few errors [1] are thrown, but training (output) seems to continue. At some point in the training (a different point each time) the process seems to hang, but the GPUs are still active. I see no more output on the console, even after waiting for hours. When I exit the process, and sometimes before, I see a second error [2].
I am running on Ubuntu 20.04 with 4 x A100s. All firmware, drivers, and MONAI code have been updated to the latest versions. I have not experienced this outside of this MONAI tutorial.
If anyone has any thoughts, they would be appreciated.
[1] Looks to be the same as an OpenSlide error: openslide/openslide#225
ERROR in line 189 while reading JPEG header:
Not a JPEG file: starts with 0xff 0x11
[2]
kernel:[ 1568.434829] watchdog: BUG: soft lockup - CPU#36 stuck for 22s! [python:18224]
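For context on error [1]: a JPEG stream begins with the two bytes 0xff 0xd8 (the Start Of Image marker), so "starts with 0xff 0x11" means the decoder was handed bytes that are not the start of a JPEG stream. A small sketch of that check follows. The helper names are hypothetical, and it only inspects the leading bytes; it does not parse the slide container or locate tiles.

```python
JPEG_SOI = b"\xff\xd8"  # Start Of Image marker that opens every JPEG stream


def starts_like_jpeg(data: bytes) -> bool:
    """Return True if the buffer begins with the JPEG SOI marker."""
    return data[:2] == JPEG_SOI


def describe_header(data: bytes) -> str:
    """Summarize the first two bytes in the style of the decoder message."""
    if starts_like_jpeg(data):
        return "looks like JPEG"
    head = " ".join(f"0x{b:02x}" for b in data[:2])
    return f"not a JPEG: starts with {head}"
```

Running `describe_header` on the raw bytes of a suspect tile would show whether the buffer handed to the decoder is a JPEG stream at all.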