RTX 3000 series broken compatibility #32
Also having issues with an RTX 3090. I'm running StyleGAN2-ADA on Windows with tensorflow-gpu 1.14. I haven't attempted training yet, but I'm seeing strange behaviour when attempting inference. I'm not even sure what the problem could be specifically, since it's not giving any errors: compilation seems to succeed and it's clearly capable of some inference. Perhaps weird fallback behaviour?
As far as I know, you will need TensorFlow 1.x built against CUDA 11.1. The TensorFlow build needs to enable compute architecture sm_86. This was discussed for Linux containers here: #10
@nurpax Thank you very much for your help! It looks like I was missing the latest Dockerfile you pushed there.
I'm glad you got it working @JulianPinzaru! Feedback on how it goes after your overnight training is welcome; it's good to get some confirmation that this is working as intended.
@nurpax
It might be some sort of bug... I disabled it with --metrics=none and then started over, and it apparently was computing fine until the moment I realized the computation time was increasing with every tick. I asked other people in my Slack channel; someone used the same Docker image and said they experienced a much longer training time per tick with 2 GPUs (RTX 3090) than with 1 GPU (RTX 2080). I was also expecting it to run much quicker due to the improvements mentioned for this stylegan2-ada repo; in fact I had 2 RTX Titans before and it ran really quickly on the previous stylegan2 repo (512-resolution images at 9-10 minutes per tick), while here I'm already over 48 minutes per tick and don't really know what to expect. Attaching the details of the 2 GPUs the other person was trying to run on:
UPDATED
Just got a 3090 and the same thing is happening here. Stable ticks on a 1070 (30 mins), but after switching to the Docker image it gets slower each tick.
UPDATED Looks like people are mentioning that with augmentation enabled it takes them up to 14 minutes on Google Colab (premium) to train a single tick... There might be something wrong with it. I noticed that with augmentation disabled, one tick takes up to 5 minutes 30 seconds, which looks reasonable and is what I would expect. But there seems to be a huge unexplained overhead when enabling augmentation. I attached screenshots for all three cases I tried:
I thought the fixed augmentation wouldn't cause any time issues, but it turned out to be ~55 minutes per tick as well.
Changed my aug pipeline to just filter and yes, back down to 5 mins a tick. Will have to leave it running longer to see if that time increases. I'd guess it's b, g or c causing the issues?
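For context, the --augpipe presets in train.py are defined roughly like this (a sketch from memory, not a verbatim copy; the exact keys and values may differ in your checkout), which is why switching to --augpipe=filter avoids the blit/geom paths:
# Rough shape of the augmentation presets ('b' = blit, 'g' = geom, 'c' = color, ...).
# Values are from memory and only illustrative.
augpipe_specs = {
    'blit':   dict(xflip=1, rotate90=1, xint=1),
    'geom':   dict(scale=1, rotate=1, aniso=1, xfrac=1),
    'color':  dict(brightness=1, contrast=1, lumaflip=1, hue=1, saturation=1),
    'filter': dict(imgfilter=1),
    'noise':  dict(noise=1),
    'cutout': dict(cutout=1),
    'bgc':    dict(xflip=1, rotate90=1, xint=1, scale=1, rotate=1, aniso=1, xfrac=1,
                   brightness=1, contrast=1, lumaflip=1, hue=1, saturation=1),
}
# Passing --augpipe=filter enables only the image-space filtering ops and skips
# the blit/geom transforms.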
Thanks for the comments @JulianPinzaru and @nerdyrodent! All our training was done on TF 1.14 (with a TensorFlow that I built from source) using 8xGPU DGX-1 machines, and we have not experienced this problem in our training. But quite clearly there is something wrong, as the above comments indicate. In the interest of clarity, can you summarize what is known to work and what is known to not work? (Ideally with unmodified stylegan2-ada code, if possible.) There are many variables: augmentation or not, 1 or 2 GPUs, RTX 3090 or Volta, container url/tag, etc. I'm not making any promises for a quick fix, but if we can clearly pin down the working and broken configurations, it'll be much easier for us to look into this. I can tabulate working/broken configs in this comment and edit as I get feedback from you.
Known to work
Known broken
Ubuntu 20.04
Works as expected:
Doesn't work as expected (gets slower). Only augpipe changed:
(My config uses map=8 rather than map=2 from auto)
Update:
OS: Ubuntu 18.04
During testing I noticed that VIRT memory goes up to 61g; maybe that's one of the effects of the heavy augmentation load. I made more experiments with augmentation OFF vs augmentation ON, and used Python's time library to record the critical operations that take longer than usual. I used docker command 1 and docker command 2 mentioned above to do the tests. The screenshots, though, are made for docker command 2 (using nvidia-docker with the suggested params to start the container). I started with recordings for each iteration of the training loop, specifically the "run training ops" for loop. Posting the screenshot below to explain what I did. Further on, I used the following command to train a NON-AUGMENTED model:
For augmentation I used fixed aug to emphasize the time it takes. (It takes approximately the same time as with dynamic aug strength growth; I posted it in my previous comment with screenshots of the ticks.)
Then, using the same approach, I measured the time for the following lines of code (individually). Screens for the aforementioned lines respectively. Also tested the timing for those. @nurpax please let me know if this was helpful or if you would like me to benchmark the higher-level augmentation options that rely on these lower-level ones.
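For anyone who wants to reproduce the measurement, this is roughly how I timed things with Python's time module (a simplified sketch, not the exact code I used; the op names in the usage comment are illustrative rather than the repo's exact identifiers):
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print the wall-clock time spent inside the with-block.
    t0 = time.time()
    yield
    print('%-28s %.3f s' % (label, time.time() - t0))

# Usage inside the training loop, e.g. around the "run training ops" calls
# (G_train_op / D_train_op are illustrative names):
# with timed('run G training op'):
#     tflib.run(G_train_op, feed_dict)
# with timed('run D training op'):
#     tflib.run(D_train_op, feed_dict)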
UPDATED
Timing for "
In reality, when the aug strength is not fixed, it grows to 0.4 and pushes those numbers to even higher values (in seconds).
Did some quick and dirty testing with the augpipeline_specs. I'd been using filter, noise and cutout without issue, so I used those as a base. I've also observed the GPU mem ctrl% drop below 10% when things are going slowly. Additionally, the card is typically audible when working as expected. These appeared to have the largest performance impact, based only on GPU memory usage graphs:
Example aug pipeline specs tests:
This test:
@nerdyrodent I also have a feeling that it is somehow overusing disk storage (SSD) instead of RAM or VRAM; there's some bottleneck there and it's not clear where. Maybe due to how it uses the CUDA_CACHE directory, not sure.
I've removed Docker from the equation, and I'm still seeing the same behaviour, using NVIDIA tensorflow r1.15.4+nv20.11 (via pip) + cuda_11.1.1_455.32.00. Another thing I've noticed is that one CPU core will sit at 100% for a while, which it doesn't do when not using blit or geom. Now I'm really confused.
@nurpax |
@nurpax
@JulianPinzaru Hi! I'm following some of the posts (incl. this one) but alas, we don't have any new updates on this one. This problem does not actively impact our ongoing research projects and we're a small team of researchers with limited time. I will try to update if we find something that may apply here -- but I also prefer not to post updates unless we have fairly high confidence that the ideas/fixes actually help. Can you check the exact version of libcudnn that TensorFlow is using? Reading through the comments: so there is a massive slowdown and a memory leak when using any of the --augpipe b* variants? Almost looks like some operations are falling back to CPU when we'd expect them to run on the GPU.
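One way to check which libcudnn is actually in use from inside the container (a generic sketch, not repo code; the library filename and the TF build-info attribute can vary between setups):
import ctypes

# Query the cuDNN runtime directly; cudnnGetVersion() encodes 8.0.4 as 8004, 8.0.5 as 8005.
libcudnn = ctypes.CDLL('libcudnn.so.8')  # or 'libcudnn.so', depending on what is installed
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
print('cuDNN runtime version:', libcudnn.cudnnGetVersion())

# Some TF 1.x builds also record the cuDNN version they were compiled against:
try:
    from tensorflow.python.platform import build_info
    print('cuDNN at TF build time:', getattr(build_info, 'cudnn_version_number', 'unknown'))
except ImportError:
    pass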
I'm facing the same issue too, so I can only use noaug on my RTX 3090. I wish this could be fixed. For now the only way is to use my 2080 Ti for ADA instead.
A heads up on TensorFlow 1.x, RTX 3090 support and StyleGAN2-ADA. Our research group is in the process of switching to PyTorch and StyleGAN2-ADA will be our last project written in TensorFlow. We have ported StyleGAN2-ADA to PyTorch and plan on releasing this new codebase as the official StyleGAN2-ADA PyTorch implementation. We hope to release the PyTorch port sometime in January 2021. We expect the problems discussed in this GitHub issue to disappear as we transition to CUDA 11, cuDNN 8.0.x and the latest PyTorch release.
Thanks for the heads up. If there was an alpha branch - unsupported - it would be super. In the meantime I have to piece things together using this repo: https://github.com/rosinality/stylegan2-pytorch
UPDATE - PyTorch has yet to release CUDA 11.1 binaries; might have to swap in an old GPU to get some work done.
UPDATE2 - the TensorFlow Docker container just slows down irrespective of the neural net stuff.
Salivating to see that one on PyTorch! :) I bet it is a future-proof decision, as TF 1.15 is not maintained anymore and there is a big PyTorch community willing to get their hands on the 3000 series as well.
I do not expect the PyTorch CUDA 11 / cuDNN 8.0.x version to resolve what seems to be an RTX 3000 series driver issue. I'm using CUDA 11 / cuDNN 8.0.x (the most recent TF1/CUDA 11 Docker image from NVIDIA, with the latest cuDNN release installed). The problem is in the use of 2 tf functions in augment.py: disabling these 4 filtered up/downscale invocations not only removes the dramatic performance impact but also removes an approx. 1 GB/h memory leak (at res=512) seemingly triggered by the code behind these operations. As said, I noticed the same behaviour in a PyTorch implementation (lucidrains/stylegan2-pytorch, although I did not track it down to the offending operation), so it seems to point towards a cuDNN and/or driver issue. Seemingly, other peeps are seeing the same, also with PyTorch: https://github.com/pytorch/pytorch/issues/47039 For those in the mood for a short-term workaround: replace the relevant calls with a less fancy (unfiltered?) scaling not using depthwise convolutions. Do note that in general, depthwise convolutions don't scale well (Gholami et al., https://arxiv.org/pdf/1803.10615.pdf), so don't expect miracles, but the current performance penalty and memory leak seem a bit excessive. 2080 Ti-level performance should be possible on a 3090.
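For illustration, a minimal sketch of what such an unfiltered replacement could look like for NCHW tensors (this is not the repo's code; the real augment.py builds low-pass filters and applies them with depthwise convolutions, and you would have to wire replacements like these in by hand at the relevant call sites):
import tensorflow.compat.v1 as tf  # plain `import tensorflow as tf` also works on TF 1.x

def upsample2x_unfiltered(x):
    # Nearest-neighbour 2x upsample for NCHW tensors; assumes static H and W
    # (true for StyleGAN2 feature maps), no depthwise conv, no filtering.
    _n, c, h, w = x.shape.as_list()
    x = tf.reshape(x, [-1, c, h, 1, w, 1])
    x = tf.tile(x, [1, 1, 1, 2, 1, 2])
    return tf.reshape(x, [-1, c, h * 2, w * 2])

def downsample2x_unfiltered(x):
    # 2x box downsample for NCHW tensors via average pooling (no depthwise conv).
    return tf.nn.avg_pool(x, ksize=[1, 1, 2, 2], strides=[1, 1, 2, 2],
                          padding='VALID', data_format='NCHW')
Image quality of the augmentations will suffer a bit compared to proper filtered resampling, but it sidesteps the slow depthwise-convolution path entirely.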
Hi Bart, feel free to throw up whatever code you have as a gist (might help other people troubleshooting) - https://gist.github.com/BartWMK - then we can switch in your sample and see things more clearly.
UPDATE - supposedly it's possible to get the 3090 working without Docker + Ubuntu. (Hit a wall with zsh - it doesn't correctly find the TensorFlow packages / just use bash.) I recommend using Timeshift to snapshot/backup your working system before doing any brain surgery, plus Pop!_OS to get the NVIDIA drivers up and running out of the box. From @dbkinghorn. But I get the same error - Value 'sm_86' is not defined for option 'gpu-architecture' / can anyone get this working locally without Docker?
Side note - found this StyleGAN2 PyTorch repo by @GreenLimeSia (seems pretty polished / it's a function-by-function port with documenting code / already handles all the TensorFlow 1 pkl migrations).
I got this to work using PyTorch CUDA 11.0 (even though 11.1 is not released yet). TensorFlow 2 + StyleGAN2: a lot of the code gets around compatibility problems by running in compatibility mode. NVIDIA - this compatibility-mode code to get to TensorFlow 2 would help out a lot more than a PyTorch port - there are so many libraries hanging off this stylegan2-ada repo.
import tensorflow.compat.v1 as tensorflow
tf = tensorflow
tf.disable_v2_behavior()
UPDATE - TensorFlow 2 still gets:
Seems like NVIDIA have bumped the CUDA toolkit to 11.2 on December 17th.
Allowed values for this option: 'compute_35','compute_37','compute_50',
@johndpope You mentioned me, and this is my key commit for tf2 compatibility: k-l-lambda/stylegan-web@6be1a4f Hope it helps.
Success! Got it working without Docker. NVIDIA-SMI 460.27.04, Driver Version: 460.27.04, CUDA Version: 11.2. Threw these into my ~/.zshrc file:
New terminal window sanity check:
nvcc --version
This has the TensorFlow 2 fixes (compatibility mode). (The machine is running a bit slow and Chrome is unusually crashing - so beware.) Thanks again @k-l-lambda
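A quick Python-side sanity check that the compat-mode TensorFlow actually sees the GPU (a generic sketch, not specific to this repo):
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
from tensorflow.python.client import device_lib

# A working setup should list a /device:GPU:0 entry alongside the CPU.
print([d.name for d in device_lib.list_local_devices()])
print('GPU available:', tf.test.is_gpu_available(cuda_only=True))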
Upgrading the BASE_IMAGE in the Dockerfile (#51) fixed the issue for me. The new 20.12 Docker image contains cuDNN 8.0.5, which according to the release notes contains significant performance improvements for the RTX 3090. (The current image, 20.10, uses cuDNN 8.0.4.)
I'm using Windows, and the disjointed nature of this issue thread is a bit hard to follow. From what I'm seeing, someone here has fixed the issues with the 3090, but I'm unsure what exactly was done. Does someone have a definitive fix for using the 30 series with this? I bought a 3090 specifically for custom StyleGANs, only to find I can't train them because of this compatibility issue.
Use the latest NVIDIA driver 460 + CUDA 11.2 toolkit on the host - https://developer.nvidia.com/cuda-downloads. However, adding a few lines to enable compatibility mode for TensorFlow 2 does get it working.
# this line
import tensorflow as tf
# becomes these lines
import tensorflow.compat.v1 as tensorflow
tf = tensorflow
tf.disable_v2_behavior()
More elaborate fork here.
Hi @nurpax! Is there any chance of seeing sg-ada on PyTorch any time soon? Thanks.
@JulianPinzaru YES! We just published the repo, find your bits at: https://github.com/NVlabs/stylegan2-ada-pytorch I haven't tested the code on RTX 3090 myself. Pretty sure it will require CUDA 11.1 to run and might break on CUDA 11.0. I will be looking into RTX 3090 support this week.
Yes. Again, it's not really a 'conversion', just running in compatibility mode. There is a way to convert the code, but I didn't go down this route.
Sorry @Thunder003, I won't be able to be much more help here. TensorFlow is kinda dead to me now.
I also have an odd issue with StyleGAN2 (the official TensorFlow implementation) on an RTX 3090 in Windows at the very first stage, running the Generator for a test. Is there any solution or fix for this issue?!
I don't think you should use the TensorFlow implementation. Just go for the NVlabs PyTorch StyleGAN2 (or 3). It works fine on the 3000 series. It's also somewhat compatible with older TF-trained network pkls (if I am not mistaken).
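If you want to reuse an older TF-trained .pkl with the PyTorch repo, the loading path looks roughly like this (a sketch based on the stylegan2-ada-pytorch examples; check that repo's generate.py and legacy.py for the exact API, and note the path below is hypothetical):
import dnnlib
import legacy  # module from the stylegan2-ada-pytorch repo

network_pkl = 'network-snapshot.pkl'  # hypothetical path to a TF-era pickle
with dnnlib.util.open_url(network_pkl) as f:
    G = legacy.load_network_pkl(f)['G_ema']  # converted generator, ready for PyTorch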
Hi! Is there any update or solution for this issue?
The recommended fix is to switch to either https://github.com/NVlabs/stylegan3 or https://github.com/NVlabs/stylegan2-ada-pytorch, both of which are known to work on new hardware and recent versions of PyTorch.
I tried to install the NVIDIA driver (455) by myself on my Ubuntu 18.04 machine with Python 3.7 and TensorFlow 1.14 (also tried 1.15).
It always said it couldn't find a GPU when trying to start training (or gave other errors, like failing to import cublas.10 libraries while I had CUDA 11 installed instead). I have an RTX 3090 Founders Edition GPU.
I tried different approaches, reinstalling things, and wasted more than 10 hours; it never worked for me. It was working on my Titan RTX though, on a few different computer rigs.
Finally, since the maintainers claimed it is working on their end for RTX 3000, I thought I could try their Docker container.
It didn't work initially; then I realized I had a few more steps to do, so I installed nvidia-docker2 (nvidia-container-toolkit), thinking that it should certainly work. Unfortunately, it gives errors again:
By googling I found that similar errors (sm_75) occur when there are code / CUDA / driver compatibility issues. At least that's what people say.
Please help with a decent working container version at least.