pytorch-triton-rocm from pytorch-nightly install does not run any more #396
I noticed that the triton build installed along with the pytorch nightly build was recently updated to 2.1.0+e8a35b3968 and it no longer runs. It seems the way triton accesses the hardware has changed. Is there a guide that explains how to enable the hip backend for triton?
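For anyone hitting the same thing, here is a minimal sketch of how one might confirm which triton distribution is active and that the ROCm/HIP build of PyTorch is being used (assumes a nightly ROCm PyTorch environment; not an official guide):

```bash
# Which triton wheel(s) are installed? The nightly pulls in pytorch-triton-rocm.
pip3 list | grep -i triton

# Confirm the PyTorch build is the ROCm/HIP one and that the GPU is visible.
python3 -c "import torch; print(torch.version.hip); print(torch.cuda.is_available())"

# Confirm the triton version that the tutorials / torch.compile will import.
python3 -c "import triton; print(triton.__version__)"
```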
@jataylo Do you have any comments on this?
@briansp2020 Thanks for reporting this; I will attempt to replicate with the latest nightly. In the meantime I recommend installing from source from the https://github.com/ROCmSoftwarePlatform/triton/tree/pytorch_nightly_11-03-2023 branch. This is the same commit used in PyTorch, but this may exclusively be a wheel issue. Or you could try building with our latest triton from the triton-mlir branch.
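A minimal sketch of that source build, assuming the branch named above and the install command confirmed later in this thread (python setup.py develop from the python/ subdirectory); adapt as needed:

```bash
# Sketch: build triton from the branch pinned by PyTorch nightly at the time.
git clone https://github.com/ROCmSoftwarePlatform/triton.git
cd triton
git checkout pytorch_nightly_11-03-2023

# Avoid having both the wheel and the source build installed at once.
pip3 uninstall -y pytorch-triton-rocm

# The build is driven from the python/ subdirectory.
cd python
python setup.py develop
```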
Hey @briansp2020 I am unable to replicate this failure by following the nightly install instructions here:
And the softmax test is also passing for me on our latest nightly image.
Could you help me get the details of your environment setup, e.g. which hardware and whether it is a docker or bare-metal environment?
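One possible way to collect those details (a sketch; assumes the ROCm command-line tools are on the PATH):

```bash
python3 -m torch.utils.collect_env      # PyTorch's own environment report
pip3 list | grep -iE "torch|triton"     # installed torch / triton packages
rocminfo | grep -i gfx                  # GPU ISA, e.g. gfx1100 for Navi31
rocm-smi --showproductname              # marketing name of the GPU
head -2 /etc/os-release                 # base OS inside docker or on bare metal
```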
Just installed again today and I still see the problem.
I just created a new venv and ran the pip command. It's a docker image I built using https://gist.github.com/briansp2020/fd1579b3d7fe4643409593e229fbd26f. The hardware is a Ryzen 9 7900X, Radeon 7900XTX, ASUS Strix B650E-F, and 64GB RAM; the base OS is Ubuntu 22.04 Server. I also just pulled rocm/pytorch-nightly:latest and it shows the same error. I'll try installing from https://github.com/ROCmSoftwarePlatform/triton/tree/pytorch_nightly_11-03-2023. In the past, before pytorch-triton-rocm 2.1.0+e8a35b3968, when I installed from source after installing pytorch-nightly, pip list would show both pytorch-triton-rocm and triton. Is that the expected behavior?
@briansp2020 Thanks for the detailed information. I have reproduced your issue on Navi31 at PyTorch's triton commit. I have also confirmed that this workload passes with the latest commit of the triton-mlir branch. If you want to get around this for now, I recommend building the triton-mlir branch from source, as it will take us a bit of time to get the pytorch commit back in sync.
This is expected: pytorch-triton-rocm is the name of the triton wheel that is installed as a dependency of pytorch, but if triton is built from source it is simply named triton. I recommend having only one of these installed at a time.
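A short sketch of checking for and removing a duplicate triton installation, based on the package names mentioned above:

```bash
# See which triton distributions are currently installed.
pip3 list | grep -i triton

# Keep only one: drop the wheel if you are using a source build...
pip3 uninstall -y pytorch-triton-rocm
# ...or drop the source build if you are using the wheel shipped with the nightly.
# pip3 uninstall -y triton
```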
@jataylo
@briansp2020 I believe you are trying to build from the wrong directory; can you try this out?
Ok. That seems to have installed it; the problem was that I was running "pip3 install -e ." as instructed on the main page instead of "python setup.py develop". Triton still does not seem to work on the 7900XTX though, as I got a kernel page fault message. :(
@briansp2020 Was this using the softmax example again? I did run this yesterday on a 7900XTX (I hit a tolerance error, but it did at least execute). Could you try clearing the triton cache if you have reinstalled in a pre-existing environment?
Please let me know if this still fails, along with any additional repro information.
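A sketch of one way to clear the cache and re-run the example (assumes the default cache location under ~/.triton and a checkout of the triton repo; TRITON_CACHE_DIR, if set, overrides the path):

```bash
# Remove triton's JIT compilation cache so kernels are rebuilt from scratch.
rm -rf ~/.triton/cache

# Re-run the failing tutorial from the triton checkout.
python python/tutorials/02-fused-softmax.py
```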
The kernel panic with the 7900XTX was a known issue (ROCm/pytorch#1284, ROCm/AMDMIGraphX#2174 (comment), #223 (comment), ...). I rebooted the system and started a fresh docker container and still got the page fault; the dmesg output shows the fault.
Currently, I have the 6.0 dkms driver installed on the host and am running a ROCm 5.7.1 runtime environment in a docker container, so it could be caused by a mismatch between user-land code and the kernel module. At the moment, I'm a bit reluctant to go back to the 5.7.1 kernel module, since my attempts at downgrading the kernel module have failed a few times in the past. I hope I have provided enough information so that you can track it down.
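A sketch of how one might confirm the suspected host-driver / container-runtime mismatch (package layouts vary by install method, so treat these as examples rather than an exact recipe):

```bash
# On the host: which amdgpu DKMS module is installed.
dkms status | grep amdgpu

# In the container: which ROCm userspace version is present.
cat /opt/rocm/.info/version

# On the host: look for the GPU page fault reported by the kernel.
sudo dmesg | grep -iE "page fault|amdgpu"
```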
This seems likely. I was able to get this working with a 5.7.1 host, if you can give that a try. In the meantime I'm working on getting pytorch triton back in sync with triton-mlir to resolve this issue with the pytorch triton wheel. Will keep you posted.
I reinstalled Ubuntu Server 22.04 and then installed and built a new docker image with ROCm 5.7.1. The kernel error message still happens; the problem is very consistent and reproducible for me. If you don't mind, could you tell me what hardware you are using for 7900XTX testing?
@jataylo
Small bump in rocm triton commit pin to resolve reported issue on 7900XTX:
> RuntimeError: Triton Error [HIP]: Code: 719, Messsage: unspecified launch failure (ROCm/triton#396)
Pull Request resolved: #114348
Approved by: https://github.com/jeffdaily
Hi @briansp2020, the kernel error was still present with the triton pinned in torch; this should be resolved when the next nightlies are updated later today. For reference, we are now pinning to this branch in pytorch nightly: https://github.com/ROCmSoftwarePlatform/triton/tree/pytorch_nightly/23_11_2023. I was just checking over your log from earlier.
I also see this AssertionError on N31 on my side, so it seems we are at the same point; it likely is a tolerance issue. cc: @zhanglx13 P.S. I'm also using Linux 5.15.0-89-generic on an Ubuntu server environment.
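Once the updated nightlies land, something like the following should pull in the new pin (a sketch; the rocm5.7 suffix in the index URL is an assumption and depends on which nightly channel you originally installed from):

```bash
# Upgrade to the latest PyTorch nightly ROCm wheels, which carry the updated
# pytorch-triton-rocm pin.
pip3 install --pre --upgrade torch --index-url https://download.pytorch.org/whl/nightly/rocm5.7
```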
@jataylo Since you can reproduce the assertion error, I'll wait until it gets resolved. I also want to bring to your attention that 03-matrix-multiplication is showing very poor performance. I saw the same issue with MI100 as well, so I don't know whether it's a NAVI31-specific issue or triton ROCm support in general.
I bought an MI100 thinking that it would be in better shape, but triton support does not seem much better on MI100. I wish I had access to the MI200 series so I could test the software. I hope MI200 & MI300 are better supported.
Hi @briansp2020, MI100 has much better performance than what you got above; see the matmul-performance numbers we get internally from running python 03-matrix-multiplication.py.
@scxiao
Thanks for the reply. We are aware of the performance gap and agree it can be better (compared to both rocblas as you mentioned here and Triton MI200 numbers we have). We have an internal issue to track it and are working on it. Will let you know if there is any update. |
@scxiao Thank you!
Here are the numbers on 1 GCD of MI200 (one row of the matmul-performance table):
0 1024.0 1024.0 1024.0 60.732002 49.710268 |
I just tried the ROCm 6.0 release with a nightly build of pytorch and a locally built triton, and I still see the page fault error message when running 02-fused-softmax.py.
I just tried the latest pytorch nightly build and ROCm 6.1 and it seems to work much better. I'll close this now.