GPU out of memory problem during detector.fit() #93
Hello, thank you for reporting this problem. I am currently not sure if I understand what is happening.

I completely agree with this sentiment. To my understanding, this is how it is implemented at the moment. Maybe there is a bug somewhere that I cannot see right now. Which detector are you using exactly? If your code still causes an OOM even with a small batch size, could you provide me with a minimal example that reproduces the excessive VRAM usage? Could it be possible that the OOM does not occur in the detector itself?
Hello, thank you for your attention and response. I examined the problem again and realized that the OOM does not occur everywhere. Specifically, I have an OOM problem when fitting some of the detectors using GPU. Related code:

```python
z, y = z.to(device), y.to(device)
...
self.mu = torch.zeros(size=(n_classes, z.shape[-1]), device=device)
self.cov = torch.zeros(size=(z.shape[-1], z.shape[-1]), device=device)
```

(from line 92)

Fitting the above-mentioned OOD detectors using GPU would probably cause an OOM problem; it depends on the VRAM size in practice. To circumvent this problem, I can fit with CPU instead, but here I got a device error when running:

```python
model.to('cpu')
detector.fit(data_loader, device='cpu')
model.to('cuda')
```

To avoid such errors coming up, could you also consider ensuring the model's device, for example by adding `model.to(device)` inside the detector?

Another small issue concerns:

```python
def fit_features(self: Self, *args, **kwargs) -> Self:
    """
    Not required
    """
    return self

def fit(self: Self, *args, **kwargs) -> Self:
    """
    Not required
    """
    return self
```
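As an illustration of the safeguard suggested above, a minimal sketch might look like the following. The class, method, and attribute names here are assumptions for illustration only, not the actual pytorch-ood API:

```python
import torch
from torch import nn


class DetectorSketch:
    """Hypothetical detector sketch; not the actual pytorch-ood implementation."""

    def __init__(self, model: nn.Module):
        self.model = model

    @torch.no_grad()
    def fit(self, loader, device: str = "cpu"):
        # Ensure the backbone sits on the requested device before feature
        # extraction, so fit(loader, device="cpu") works even if the model
        # was previously moved to the GPU.
        self.model.to(device)
        self.model.eval()
        zs, ys = [], []
        for x, y in loader:
            zs.append(self.model(x.to(device)).cpu())  # keep features on CPU
            ys.append(y.cpu())
        return self.fit_features(torch.cat(zs), torch.cat(ys), device=device)

    def fit_features(self, z, y, device: str = "cpu"):
        # Placeholder for the detector-specific estimation step.
        self.z, self.y = z, y
        return self
```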
Okay, so the problem for KLMatching and SHE seems to be that after the feature extraction, all of the data is moved to the device at once.

For some of the detectors, implementing something like batch processing would probably be possible. For example, KLMatching could fit on a per-class basis, reducing VRAM usage by a factor of 1000 in your use case. Similar for SHE. But this would slow things down. On the other hand, if you have something like 1,000,000 instances (say, the ImageNet training set) with 1024 features each, that will only be about 4 GB in 32-bit floats.
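A rough sketch of the per-class idea (illustrative only; the function below and the way it aggregates statistics are assumptions, not pytorch-ood code):

```python
import torch


def fit_per_class(z: torch.Tensor, y: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Move only one class' features to `device` at a time instead of the
    whole (n_samples, n_features) matrix. Sketch only."""
    n_classes = int(y.max().item()) + 1
    mu = torch.zeros(n_classes, z.shape[-1])
    for c in range(n_classes):
        z_c = z[y == c].to(device)       # only this class' features occupy VRAM
        mu[c] = z_c.mean(dim=0).cpu()    # bring the per-class statistic back to CPU
        del z_c                          # release VRAM before the next class
    return mu


# e.g. 1,000,000 CPU-resident embeddings with 1024 features and 1000 classes:
# z = torch.randn(1_000_000, 1024); y = torch.randint(0, 1000, (1_000_000,))
# mu = fit_per_class(z, y)
```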
This is probably a good idea. It might be confusing to some that calling the detector has the side effect of moving the model to a different device, but for now it is probably the most easy-to-use variant.
Good point. I will add catch-all `fit()`/`fit_features()` implementations.
I addressed some of the issues. Could you install the branch containing the fixes with `pip install git+https://github.com/kkirchheim/pytorch-ood.git@93-gpu-out-of-memory-problem-during-detectorfit` and test if you still run OOM?
I found the reason. It was not a bug in the detectors. It was actually caused by the feature maps of the ViT and Swin models that I used (I used the features before average pooling). Those features are 50,000 × 49 × 1024 values in total, ≈10 GB. I have switched to the 50,000 × 1024 features after average pooling. With this number of features, I don't have any OOM problems; now all the detectors fit well on the GPU! The remaining problem is the device inconsistency error.
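For reference, the sizes above can be verified with a quick back-of-the-envelope calculation (assuming 32-bit floats, 4 bytes per value):

```python
# 50,000 images, 49 tokens per image, 1024 features per token, float32
pre_pool = 50_000 * 49 * 1024 * 4 / 1024**3   # ≈ 9.3 GiB (≈ 10 GB)
post_pool = 50_000 * 1024 * 4 / 1024**3       # ≈ 0.19 GiB after average pooling
print(f"before pooling: {pre_pool:.1f} GiB, after pooling: {post_pool:.2f} GiB")
```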
Yes, it's true. To avoid this side effect, why not move the model (backbone) back to its original device after the calculations inside the detectors, so that users won't see any change in `model.device` from the outside?

However, I came across an error when running this, and there are still some errors with the Mahalanobis detector when I try it. Besides, there is an issue in SHE.py, line 99.
There might be some problems in cases where models are distributed across different devices. Anyway, I think it makes sense to just accept the fact that the underlying model can be moved to a different device for now. As long as this is documented, I think it should be fine.
That is true. I will add a typecheck before that call.
Hello, first of all, thanks a lot for your dedication and contributions to developing this library of OoD detectors. It helps a lot with our research. It’s greatly appreciated.
I came across a small issue: when I try to run `detector.fit()` on the ImageNet-1k dataset and fit the detectors using GPU, the GPUs run out of memory. The reason is that the 50,000 embeddings and their labels are all stored in GPU memory, which is quite demanding even on a server.

I think a more practical approach, considering the limitation of GPU memory, is to store the embeddings and labels on the CPU and, when processing them in the subsequent `detector.fit_features()` step, iterate through them and transfer each batch of (embeddings, labels) to the GPU. The related code is in `pytorch_ood.utils`.
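To make the suggestion concrete, here is a minimal sketch (assumed helper names, not the actual `pytorch_ood.utils` implementation) that stores the extracted embeddings on the CPU and streams them to the GPU chunk by chunk during fitting:

```python
import torch


@torch.no_grad()
def extract_to_cpu(model, loader, device="cuda"):
    """Run the backbone on `device`, but accumulate embeddings and labels on the CPU."""
    zs, ys = [], []
    for x, y in loader:
        zs.append(model(x.to(device)).cpu())
        ys.append(y.cpu())
    return torch.cat(zs), torch.cat(ys)


def fit_in_chunks(z_cpu, y_cpu, n_classes, device="cuda", chunk_size=4096):
    """Transfer (embeddings, labels) to the GPU chunk by chunk; per-class means
    stand in for whatever statistics a detector would actually estimate."""
    sums = torch.zeros(n_classes, z_cpu.shape[-1], device=device)
    counts = torch.zeros(n_classes, device=device)
    for start in range(0, len(z_cpu), chunk_size):
        z = z_cpu[start:start + chunk_size].to(device)   # only one chunk in VRAM
        y = y_cpu[start:start + chunk_size].to(device)
        sums.index_add_(0, y, z)                          # accumulate per-class sums
        counts += torch.bincount(y, minlength=n_classes).float()
    return sums / counts.clamp(min=1).unsqueeze(1)        # per-class mean embeddings
```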